17  Data I/O

import dask.dataframe as dd
import re

17.1 Read CSV

dd.read_csv mirrors pandas.read_csv and accepts most of the same keyword arguments.

iris = dd.read_csv('~/icloud/Data/iris.csv')
iris
Dask DataFrame Structure:
               Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
npartitions=1
                    float64      float64       float64      float64   object
                        ...          ...           ...          ...      ...
Dask Name: read-csv, 1 graph layer

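The main difference is that Dask is lazy: the object above is only a task graph, so you need to call .compute() to run it and get the result back as a pandas DataFrame.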

iris.compute()
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
...           ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

150 rows × 5 columns

17.2 Types

iris = dd.read_csv('~/icloud/Data/iris.csv',
    dtype = {'Species': 'category'}).compute()
iris.dtypes
Sepal.Length     float64
Sepal.Width      float64
Petal.Length     float64
Petal.Width      float64
Species         category
dtype: object

17.3 Drop duplicates

iris = iris.drop_duplicates()
iris.shape
(149, 5)
iris = dd.read_csv('~/icloud/Data/iris.csv').drop_duplicates().compute()
iris
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
3             4.6          3.1           1.5          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
...           ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

149 rows × 5 columns

17.4 Clean column names

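At this point iris is a pandas DataFrame (the result of .compute()), so the column names are cleaned exactly as in pandas. The pattern is a raw string so that the escaped dot is not treated as an invalid string escape.

iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]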
iris.columns
Index(['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width',
       'Species'],
      dtype='object')

17.5 Write CSV

iris.to_csv('~/icloud/Data/irisp.csv')

17.6 Write parquet

iris.to_parquet('~/icloud/Data/irisp.parquet')

17.7 Resources