20 Aggregate

import dask.dataframe as dd
import pandas as pd
import numpy as np
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))
df = dd.from_pandas(df, npartitions=1)
df
Dask DataFrame Structure:
                 class   order max_speed
npartitions=1
falcon          object  object   float64
parrot             ...     ...       ...
Dask Name: from_pandas, 1 graph layer

20.0.1 groupby(): group by a categorical column

grouped = df.groupby('class')
grouped2 = df.groupby(['class', 'order'])
grouped.size().compute()
class
bird      2
mammal    3
dtype: int64
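# 'order' holds strings and cannot be averaged, so select the numeric column explicitly
grouped[['max_speed']].mean().compute()
        max_speed
class
bird        206.5
mammal       69.1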
grouped2.mean().compute()
                       max_speed
class  order
bird   Falconiformes       389.0
       Psittaciformes       24.0
mammal Carnivora            69.1
       Primates              NaN

Or, chaining the grouping and the aggregation in a single step:

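# 'order' holds strings and cannot be averaged, so select the numeric column explicitly
df.groupby('class')[['max_speed']].mean().compute()
        max_speed
class
bird        206.5
mammal       69.1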

20.1 Resources