import polars as pl
import re
22 Data I/O
22.1 Read CSV
iris = pl.read_csv("~/icloud/Data/iris.csv")
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
4.6 | 3.4 | 1.4 | 0.3 | "setosa" |
5.0 | 3.4 | 1.5 | 0.2 | "setosa" |
4.4 | 2.9 | 1.4 | 0.2 | "setosa" |
4.9 | 3.1 | 1.5 | 0.1 | "setosa" |
5.4 | 3.7 | 1.5 | 0.2 | "setosa" |
4.8 | 3.4 | 1.6 | 0.2 | "setosa" |
... | ... | ... | ... | ... |
6.0 | 3.0 | 4.8 | 1.8 | "virginica" |
6.9 | 3.1 | 5.4 | 2.1 | "virginica" |
6.7 | 3.1 | 5.6 | 2.4 | "virginica" |
6.9 | 3.1 | 5.1 | 2.3 | "virginica" |
5.8 | 2.7 | 5.1 | 1.9 | "virginica" |
6.8 | 3.2 | 5.9 | 2.3 | "virginica" |
6.7 | 3.3 | 5.7 | 2.5 | "virginica" |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.2 Lazy Read CSV
iris = pl.scan_csv("~/icloud/Data/iris.csv")
iris
NAIVE QUERY PLAN
run LazyFrame.show_graph() to see the optimized version
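Fetch the lazy-read DataFrame. Note that fetch() runs the query on a limited number of rows (500 by default), which covers all of iris; use collect() to materialize the full result of a larger file:
iris = iris.fetch()
iris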
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
4.6 | 3.4 | 1.4 | 0.3 | "setosa" |
5.0 | 3.4 | 1.5 | 0.2 | "setosa" |
4.4 | 2.9 | 1.4 | 0.2 | "setosa" |
4.9 | 3.1 | 1.5 | 0.1 | "setosa" |
5.4 | 3.7 | 1.5 | 0.2 | "setosa" |
4.8 | 3.4 | 1.6 | 0.2 | "setosa" |
... | ... | ... | ... | ... |
6.0 | 3.0 | 4.8 | 1.8 | "virginica" |
6.9 | 3.1 | 5.4 | 2.1 | "virginica" |
6.7 | 3.1 | 5.6 | 2.4 | "virginica" |
6.9 | 3.1 | 5.1 | 2.3 | "virginica" |
5.8 | 2.7 | 5.1 | 1.9 | "virginica" |
6.8 | 3.2 | 5.9 | 2.3 | "virginica" |
6.7 | 3.3 | 5.7 | 2.5 | "virginica" |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.3 Column names
Get column names:
iris.columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
Set column names:
iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]
iris.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
22.4 Apply function to column names at read time
This requires pl.scan_csv(), whose with_column_names argument accepts a function that operates on the column names. Here, we convert all symbols to underscores using rtemis.strng.clean_names().
from rtemis.strng import clean_names
iris = pl.scan_csv(
    "~/icloud/Data/iris.csv",
    with_column_names=clean_names).collect()
▄▄▄ ▄▄▄▄▄▄▄▄ .• ▌ ▄ ·. ▪ .▄▄ ·
▀▄ █·•██ ▀▄.▀··██ ▐███▪██ ▐█ ▀.
▐▀▀▄ ▐█.▪▐▀▀▪▄▐█ ▌▐▌▐█·▐█·▄▀▀▀█▄
▐█•█▌ ▐█▌·▐█▄▄▌██ ██▌▐█▌▐█▌▐█▄▪▐█
.▀ ▀ ▀▀▀ ▀▀▀ ▀▀ █▪▀▀▀▀▀▀ ▀▀▀▀ py
.:rtemis 🏝 macOS-12.6-arm64-arm-64bit
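The source of clean_names() is not shown here. As a rough idea of the kind of function with_column_names expects (one that takes a list of names and returns a list of names), here is a minimal stand-in. The name `clean_names_sketch` and its exact cleaning rules are assumptions, not the rtemis implementation.

```python
import re


def clean_names_sketch(names):
    """Hypothetical stand-in for rtemis.strng.clean_names (rules are assumed)."""
    cleaned = []
    for name in names:
        # Replace each run of non-alphanumeric characters with one underscore
        name = re.sub(r"[^0-9A-Za-z]+", "_", name)
        # Drop underscores left at either end
        cleaned.append(name.strip("_"))
    return cleaned


print(clean_names_sketch(["Sepal.Length", "Petal Width (cm)", "Species"]))
```

A function of this shape could be passed as with_column_names=clean_names_sketch in the pl.scan_csv() call above.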
22.5 Unique rows
iris = iris.unique()
iris.shape
(149, 5)
22.6 Types
22.6.1 Convert column to Categorical
iris = iris.with_columns(
    pl.col("Species").cast(pl.Categorical)
)
list(zip(iris.columns, iris.dtypes))
[('Sepal_Length', polars.datatypes.Float64),
('Sepal_Width', polars.datatypes.Float64),
('Petal_Length', polars.datatypes.Float64),
('Petal_Width', polars.datatypes.Float64),
('Species', polars.datatypes.Categorical)]
22.6.2 Specify data types at read time
iris = pl.read_csv("~/icloud/Data/iris.csv",
                   dtypes={"Species": pl.Categorical})
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | cat |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
4.6 | 3.4 | 1.4 | 0.3 | "setosa" |
5.0 | 3.4 | 1.5 | 0.2 | "setosa" |
4.4 | 2.9 | 1.4 | 0.2 | "setosa" |
4.9 | 3.1 | 1.5 | 0.1 | "setosa" |
5.4 | 3.7 | 1.5 | 0.2 | "setosa" |
4.8 | 3.4 | 1.6 | 0.2 | "setosa" |
... | ... | ... | ... | ... |
6.0 | 3.0 | 4.8 | 1.8 | "virginica" |
6.9 | 3.1 | 5.4 | 2.1 | "virginica" |
6.7 | 3.1 | 5.6 | 2.4 | "virginica" |
6.9 | 3.1 | 5.1 | 2.3 | "virginica" |
5.8 | 2.7 | 5.1 | 1.9 | "virginica" |
6.8 | 3.2 | 5.9 | 2.3 | "virginica" |
6.7 | 3.3 | 5.7 | 2.5 | "virginica" |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
We can define the dtypes of all columns, either as a dictionary or as a list in column order. The list form below should work but does not seem to, so this block is not evaluated (eval: false):
iris = pl.read_csv("~/icloud/Data/iris.csv",
                   dtypes=[pl.Float64]*4 + [pl.Categorical])
iris
One way to manually assign dtypes is to get the column names and build the dtypes Dict from them. (A list of dtypes without column names does not seem to work for Categorical columns.)
dtypes = {column: dtype for column, dtype in zip(iris.columns, iris.dtypes)}
pl.read_csv(file="/Users/egenn/icloud/Data/iris.csv",
            dtypes=dtypes
)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | cat |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
4.6 | 3.4 | 1.4 | 0.3 | "setosa" |
5.0 | 3.4 | 1.5 | 0.2 | "setosa" |
4.4 | 2.9 | 1.4 | 0.2 | "setosa" |
4.9 | 3.1 | 1.5 | 0.1 | "setosa" |
5.4 | 3.7 | 1.5 | 0.2 | "setosa" |
4.8 | 3.4 | 1.6 | 0.2 | "setosa" |
... | ... | ... | ... | ... |
6.0 | 3.0 | 4.8 | 1.8 | "virginica" |
6.9 | 3.1 | 5.4 | 2.1 | "virginica" |
6.7 | 3.1 | 5.6 | 2.4 | "virginica" |
6.9 | 3.1 | 5.1 | 2.3 | "virginica" |
5.8 | 2.7 | 5.1 | 1.9 | "virginica" |
6.8 | 3.2 | 5.9 | 2.3 | "virginica" |
6.7 | 3.3 | 5.7 | 2.5 | "virginica" |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
To do this without reading the full file first, we can use scan_csv to get the column names, then make the dtypes Dict:
fpath = "/Users/egenn/icloud/Data/iris.csv"
# Get column names
columns = pl.scan_csv(fpath).columns
# Manually assign dtypes
dtypes = [pl.Float64]*4 + [pl.Categorical]
# Convert to Dict, using the scanned column names
dtypes = {column: dtype for column, dtype in zip(columns, dtypes)}
pl.read_csv(file=fpath,
            dtypes=dtypes)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | cat |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
5.4 | 3.9 | 1.7 | 0.4 | "setosa" |
4.6 | 3.4 | 1.4 | 0.3 | "setosa" |
5.0 | 3.4 | 1.5 | 0.2 | "setosa" |
4.4 | 2.9 | 1.4 | 0.2 | "setosa" |
4.9 | 3.1 | 1.5 | 0.1 | "setosa" |
5.4 | 3.7 | 1.5 | 0.2 | "setosa" |
4.8 | 3.4 | 1.6 | 0.2 | "setosa" |
... | ... | ... | ... | ... |
6.0 | 3.0 | 4.8 | 1.8 | "virginica" |
6.9 | 3.1 | 5.4 | 2.1 | "virginica" |
6.7 | 3.1 | 5.6 | 2.4 | "virginica" |
6.9 | 3.1 | 5.1 | 2.3 | "virginica" |
5.8 | 2.7 | 5.1 | 1.9 | "virginica" |
6.8 | 3.2 | 5.9 | 2.3 | "virginica" |
6.7 | 3.3 | 5.7 | 2.5 | "virginica" |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.6.3 Get all columns of type
Select Float64 columns:
iris.select(pl.col(pl.Float64))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|
f64 | f64 | f64 | f64 |
5.1 | 3.5 | 1.4 | 0.2 |
4.9 | 3.0 | 1.4 | 0.2 |
4.7 | 3.2 | 1.3 | 0.2 |
4.6 | 3.1 | 1.5 | 0.2 |
5.0 | 3.6 | 1.4 | 0.2 |
5.4 | 3.9 | 1.7 | 0.4 |
4.6 | 3.4 | 1.4 | 0.3 |
5.0 | 3.4 | 1.5 | 0.2 |
4.4 | 2.9 | 1.4 | 0.2 |
4.9 | 3.1 | 1.5 | 0.1 |
5.4 | 3.7 | 1.5 | 0.2 |
4.8 | 3.4 | 1.6 | 0.2 |
... | ... | ... | ... |
6.0 | 3.0 | 4.8 | 1.8 |
6.9 | 3.1 | 5.4 | 2.1 |
6.7 | 3.1 | 5.6 | 2.4 |
6.9 | 3.1 | 5.1 | 2.3 |
5.8 | 2.7 | 5.1 | 1.9 |
6.8 | 3.2 | 5.9 | 2.3 |
6.7 | 3.3 | 5.7 | 2.5 |
6.7 | 3.0 | 5.2 | 2.3 |
6.3 | 2.5 | 5.0 | 1.9 |
6.5 | 3.0 | 5.2 | 2.0 |
6.2 | 3.4 | 5.4 | 2.3 |
5.9 | 3.0 | 5.1 | 1.8 |
Get names of all Float64 columns:
iris.select(pl.col(pl.Float64)).columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
Select Categorical columns:
iris.select(pl.col(pl.Categorical))
Species |
---|
cat |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
... |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
22.7 Write CSV
iris.write_csv("~/icloud/Data/iris_p.csv")
22.8 Write Arrow parquet
You can easily save a polars DataFrame as a parquet file:
iris.write_parquet("~/icloud/Data/iris.parquet")