22  Data I/O

import polars as pl
import re

22.1 Read CSV

iris = pl.read_csv("~/icloud/Data/iris.csv")
iris
shape: (150, 5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
f64 f64 f64 f64 str
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"

22.2 Lazy Read CSV

iris = pl.scan_csv("~/icloud/Data/iris.csv")
iris

NAIVE QUERY PLAN
run LazyFrame.show_graph() to see the optimized version

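Materialize the lazy-read LazyFrame as a DataFrame with collect(). (fetch() also returns a DataFrame, but it is meant for debugging and reads only a limited number of rows.)

iris = iris.collect()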
iris
shape: (150, 5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
f64 f64 f64 f64 str
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"

22.3 Column names

Get column names:

iris.columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

Set column names:

iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]
iris.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

22.4 Apply function to column names at read time

This requires pl.scan_csv(), whose with_column_names argument accepts a function that operates on the column names. Here, we convert all symbols to underscores using rtemis.strng.clean_names().

from rtemis.strng import clean_names
iris = pl.scan_csv(
    "~/icloud/Data/iris.csv",
    with_column_names = clean_names).collect()
▄▄▄  ▄▄▄▄▄▄▄▄ .• ▌ ▄ ·. ▪  .▄▄ ·
▀▄ █·•██  ▀▄.▀··██ ▐███▪██ ▐█ ▀.
▐▀▀▄  ▐█.▪▐▀▀▪▄▐█ ▌▐▌▐█·▐█·▄▀▀▀█▄
▐█•█▌ ▐█▌·▐█▄▄▌██ ██▌▐█▌▐█▌▐█▄▪▐█
.▀  ▀ ▀▀▀  ▀▀▀ ▀▀  █▪▀▀▀▀▀▀ ▀▀▀▀ py
.:rtemis 🏝 macOS-12.6-arm64-arm-64bit
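
The source of rtemis's clean_names() is not shown here. As a rough stand-in, the sketch below assumes its job is to replace every run of non-alphanumeric characters with a single underscore; the name clean_names_sketch is hypothetical. The function passed to with_column_names receives the list of column names and must return a list of the same length.

```python
import re


def clean_names_sketch(names: list[str]) -> list[str]:
    """Replace each run of non-alphanumeric characters with one underscore."""
    return [re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_") for name in names]


cleaned = clean_names_sketch(["Sepal.Length", "Petal.Width", "Species"])
print(cleaned)
```

A function like this could be passed in place of clean_names, e.g. pl.scan_csv(path, with_column_names=clean_names_sketch).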

22.5 Unique rows

iris = iris.unique()
iris.shape
(149, 5)

22.6 Types

22.6.1 Convert column to Categorical

iris = iris.with_columns(
    pl.col("Species").cast(pl.Categorical)
)
list(zip(iris.columns, iris.dtypes))
[('Sepal_Length', polars.datatypes.Float64),
 ('Sepal_Width', polars.datatypes.Float64),
 ('Petal_Length', polars.datatypes.Float64),
 ('Petal_Width', polars.datatypes.Float64),
 ('Species', polars.datatypes.Categorical)]

22.6.2 Specify data types at read time

iris = pl.read_csv("~/icloud/Data/iris.csv",
    dtypes = {"Species": pl.Categorical})
iris
shape: (150, 5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
f64 f64 f64 f64 cat
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"

All dtypes can also be defined at once, either as a dictionary or as a list in column order. The list form should work but does not seem to here, so the following code is not evaluated:

iris = pl.read_csv("~/icloud/Data/iris.csv",
    dtypes = [pl.Float64]*4 + [pl.Categorical])
iris

One way to assign dtypes manually is to get the column names and build the dtypes dictionary from them. (A list of dtypes without column names does not seem to work for Categorical columns.)

dtypes = {column: dtype for column, dtype in zip(iris.columns, iris.dtypes)}
pl.read_csv(
    "/Users/egenn/icloud/Data/iris.csv",
    dtypes=dtypes
)
shape: (150, 5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
f64 f64 f64 f64 cat
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"

To do this without reading the full file first, use scan_csv() to get the column names, then build the dtypes dictionary:

fpath = "/Users/egenn/icloud/Data/iris.csv"
# Get column names without reading the data
columns = pl.scan_csv(fpath).columns
# Manually assign dtypes, in column order
dtypes = [pl.Float64]*4 + [pl.Categorical]
# Convert to Dict, keyed by column name
dtypes = {column: dtype for column, dtype in zip(columns, dtypes)}
pl.read_csv(
    fpath,
    dtypes = dtypes)
shape: (150, 5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
f64 f64 f64 f64 cat
5.1 3.5 1.4 0.2 "setosa"
4.9 3.0 1.4 0.2 "setosa"
4.7 3.2 1.3 0.2 "setosa"
4.6 3.1 1.5 0.2 "setosa"
5.0 3.6 1.4 0.2 "setosa"
5.4 3.9 1.7 0.4 "setosa"
4.6 3.4 1.4 0.3 "setosa"
5.0 3.4 1.5 0.2 "setosa"
4.4 2.9 1.4 0.2 "setosa"
4.9 3.1 1.5 0.1 "setosa"
5.4 3.7 1.5 0.2 "setosa"
4.8 3.4 1.6 0.2 "setosa"
... ... ... ... ...
6.0 3.0 4.8 1.8 "virginica"
6.9 3.1 5.4 2.1 "virginica"
6.7 3.1 5.6 2.4 "virginica"
6.9 3.1 5.1 2.3 "virginica"
5.8 2.7 5.1 1.9 "virginica"
6.8 3.2 5.9 2.3 "virginica"
6.7 3.3 5.7 2.5 "virginica"
6.7 3.0 5.2 2.3 "virginica"
6.3 2.5 5.0 1.9 "virginica"
6.5 3.0 5.2 2.0 "virginica"
6.2 3.4 5.4 2.3 "virginica"
5.9 3.0 5.1 1.8 "virginica"

22.6.3 Get all columns of type

Select Float64 columns:

iris.select(pl.col(pl.Float64))
shape: (150, 4)
Sepal.Length Sepal.Width Petal.Length Petal.Width
f64 f64 f64 f64
5.1 3.5 1.4 0.2
4.9 3.0 1.4 0.2
4.7 3.2 1.3 0.2
4.6 3.1 1.5 0.2
5.0 3.6 1.4 0.2
5.4 3.9 1.7 0.4
4.6 3.4 1.4 0.3
5.0 3.4 1.5 0.2
4.4 2.9 1.4 0.2
4.9 3.1 1.5 0.1
5.4 3.7 1.5 0.2
4.8 3.4 1.6 0.2
... ... ... ...
6.0 3.0 4.8 1.8
6.9 3.1 5.4 2.1
6.7 3.1 5.6 2.4
6.9 3.1 5.1 2.3
5.8 2.7 5.1 1.9
6.8 3.2 5.9 2.3
6.7 3.3 5.7 2.5
6.7 3.0 5.2 2.3
6.3 2.5 5.0 1.9
6.5 3.0 5.2 2.0
6.2 3.4 5.4 2.3
5.9 3.0 5.1 1.8

Get names of all Float64 columns:

iris.select(pl.col(pl.Float64)).columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']

Select Categorical columns:

iris.select(pl.col(pl.Categorical))
shape: (150, 1)
Species
cat
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
...
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"

22.7 Write CSV

iris.write_csv("~/icloud/Data/iris_p.csv")

22.8 Write Arrow parquet

A polars DataFrame can be saved as a Parquet file with write_parquet():

iris.write_parquet("~/icloud/Data/iris.parquet")

22.9 Resources