import dask.dataframe as dd
import re
17 Data I/O
17.1 Read CSV
The dd.read_csv API mirrors that of pandas.read_csv.
iris = dd.read_csv('~/icloud/Data/iris.csv')
iris
Dask DataFrame Structure:
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
npartitions=1 | | | | | |
| | float64 | float64 | float64 | float64 | object |
| | ... | ... | ... | ... | ... |
Dask Name: read-csv, 1 graph layer
One main difference is that Dask evaluates lazily: you need to call .compute() to see results.
iris.compute()
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
17.2 Types
iris = dd.read_csv('~/icloud/Data/iris.csv',
                   dtype = {'Species': 'category'}).compute()
iris.dtypes
Sepal.Length float64
Sepal.Width float64
Petal.Length float64
Petal.Width float64
Species category
dtype: object
17.3 Drop duplicates
iris = iris.drop_duplicates()
iris.shape
(149, 5)
iris = dd.read_csv('~/icloud/Data/iris.csv').drop_duplicates().compute()
iris
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
149 rows × 5 columns
17.4 Clean column names
Renaming columns works just as in pandas.
iris.columns = [re.sub(r"\.", "_", col) for col in list(iris.columns)]
iris.columns
Index(['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width',
'Species'],
dtype='object')
17.5 Write CSV
iris.to_csv('~/icloud/data/irisp.csv')
17.6 Write parquet
iris.to_parquet('~/icloud/Data/irisp.parquet')