14 Basics
We start by loading both the DataFrames and CategoricalArrays packages. CategoricalArrays adds support for categorical variables (similar to factors in R).
using DataFrames, CategoricalArrays
14.1 Create a DataFrame
Some synthetic arrays:
a = [1, 2, 3, 4, 5];
b = collect(25:-1:21);
c = categorical(["a", "b", "c", "c", "a"]);
Combine into a DataFrame. Note that the second argument, :auto, is required to automatically generate column names:
dat = DataFrame([a, b, c], :auto)
5×3 DataFrame
Row | x1 | x2 | x3 |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 1 | 25 | a |
2 | 2 | 24 | b |
3 | 3 | 23 | c |
4 | 4 | 22 | c |
5 | 5 | 21 | a |
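The :auto argument applies in the same way when the input is a matrix rather than a vector of vectors. A minimal sketch, assuming DataFrames.jl 1.x; the names `m` and `dm` are hypothetical and not part of the example above:

```julia
using DataFrames

# Hypothetical 3×2 matrix; each matrix column becomes a DataFrame column
m = [1 10; 2 20; 3 30]

# :auto generates the names x1, x2, ... in column order
dm = DataFrame(m, :auto)

names(dm)  # column names generated by :auto
```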
14.2 Get column names: names()
names(dat)
3-element Vector{String}:
"x1"
"x2"
"x3"
14.3 Set column names
To specify column names when creating a new DataFrame you can pass a vector of Symbols:
dat = DataFrame([a, b, c], [:alpha, :beta, :gamma])
5×3 DataFrame
Row | alpha | beta | gamma |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 1 | 25 | a |
2 | 2 | 24 | b |
3 | 3 | 23 | c |
4 | 4 | 22 | c |
5 | 5 | 21 | a |
...or pass keyword arguments:
dat = DataFrame(ey = a, bee = b, cee = c)
5×3 DataFrame
Row | ey | bee | cee |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 1 | 25 | a |
2 | 2 | 24 | b |
3 | 3 | 23 | c |
4 | 4 | 22 | c |
5 | 5 | 21 | a |
14.4 Rename columns of a DataFrame: rename!()
The ! signifies that the change happens in-place.
rename!(dat, [:alpha, :beta, :gamma])
5×3 DataFrame
Row | alpha | beta | gamma |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 1 | 25 | a |
2 | 2 | 24 | b |
3 | 3 | 23 | c |
4 | 4 | 22 | c |
5 | 5 | 21 | a |
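Columns can also be renamed one at a time with `old => new` pairs, and the non-mutating rename() returns a new DataFrame instead of changing the original. A small sketch, assuming DataFrames.jl 1.x; `df` and `df2` are hypothetical names used only for this illustration:

```julia
using DataFrames

# Hypothetical DataFrame used only for this illustration
df = DataFrame(ey = 1:3, bee = 4:6)

# rename() returns a new DataFrame and leaves df untouched
df2 = rename(df, :ey => :alpha)

# rename!() with a Pair changes only the named column, in-place
rename!(df, :bee => :beta)

names(df2), names(df)
```

The pair form is handy when only a few columns need new names, since the remaining columns do not have to be listed.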
14.5 Get dimensions: size(), nrow(), ncol()
size(dat)
(5, 3)
nrow(dat)
5
ncol(dat)
3
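size() also accepts a dimension argument, which gives the same values as nrow() and ncol(). A brief sketch with a hypothetical DataFrame `df`:

```julia
using DataFrames

df = DataFrame(alpha = 1:5, beta = 25:-1:21)  # hypothetical example data

n_rows = size(df, 1)   # same value as nrow(df)
n_cols = size(df, 2)   # same value as ncol(df)

# size() returns a tuple, so it can be destructured directly
r, c = size(df)
```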
14.6 DataFrame summary: describe()
describe(dat)
3×7 DataFrame
Row | variable | mean | min | median | max | nmissing | eltype |
---|---|---|---|---|---|---|---|
| Symbol | Union… | Any | Union… | Any | Int64 | DataType |
1 | alpha | 3.0 | 1 | 3.0 | 5 | 0 | Int64 |
2 | beta | 23.0 | 21 | 23.0 | 25 | 0 | Int64 |
3 | gamma | | a | | c | 0 | CategoricalValue{String, UInt32} |
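describe() can be limited to specific statistics by passing their names as Symbols, and to specific columns with the `cols` keyword. The following is a sketch assuming DataFrames.jl 1.x, using a hypothetical numeric-only DataFrame `df`:

```julia
using DataFrames

df = DataFrame(alpha = 1:5, beta = 25:-1:21)  # hypothetical example data

# Request only the mean and maximum of each column
s = describe(df, :mean, :max)

# Restrict the summary to a subset of columns with the cols keyword
s_alpha = describe(df, :min, :max; cols = [:alpha])
```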
14.7 Sort: sort() & sort!()
Return a sorted copy of the DataFrame, leaving the original unaltered, with sort():
sort(dat, :beta)
5×3 DataFrame
Row | alpha | beta | gamma |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 5 | 21 | a |
2 | 4 | 22 | c |
3 | 3 | 23 | c |
4 | 2 | 24 | b |
5 | 1 | 25 | a |
dat has not changed:
dat
5×3 DataFrame
Row | alpha | beta | gamma |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 1 | 25 | a |
2 | 2 | 24 | b |
3 | 3 | 23 | c |
4 | 4 | 22 | c |
5 | 5 | 21 | a |
Change the order of the DataFrame rows in-place with sort!():
sort!(dat, :beta)
5×3 DataFrame
Row | alpha | beta | gamma |
---|---|---|---|
| Int64 | Int64 | Cat… |
1 | 5 | 21 | a |
2 | 4 | 22 | c |
3 | 3 | 23 | c |
4 | 2 | 24 | b |
5 | 1 | 25 | a |
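Sorting can also be done in descending order or on several columns at once. A sketch assuming DataFrames.jl 1.x; the DataFrame `df` is hypothetical and uses plain strings rather than a categorical column to keep the example small:

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "c", "c", "a"], beta = [25, 24, 23, 22, 21])

# Descending order on a single column
by_beta_desc = sort(df, :beta, rev = true)

# Sort by g first, then break ties by beta
by_g_then_beta = sort(df, [:g, :beta])

# order() sets the direction per column: g ascending, beta descending
mixed = sort(df, [:g, order(:beta, rev = true)])
```

All three calls use sort(), so `df` itself is left unchanged; the same arguments work with sort!().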
14.8 Indexing
Indexing with a single column selector returns a vector:
dat[:, 2]
5-element Vector{Int64}:
21
22
23
24
25
dat[:, :alpha]
5-element Vector{Int64}:
5
4
3
2
1
dat[!, :alpha]
5-element Vector{Int64}:
5
4
3
2
1
Indexing with a vector of column selectors returns a DataFrame:
dat[:, [2]]
5×1 DataFrame
Row | beta |
---|---|
| Int64 |
1 | 21 |
2 | 22 |
3 | 23 |
4 | 24 |
5 | 25 |
dat[:, [:alpha]]
5×1 DataFrame
Row | alpha |
---|---|
| Int64 |
1 | 5 |
2 | 4 |
3 | 3 |
4 | 2 |
5 | 1 |
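The two row selectors used above are not interchangeable: `:` returns a copy of the column, while `!` returns the column vector stored in the DataFrame without copying. A minimal sketch with a hypothetical DataFrame `df`:

```julia
using DataFrames

df = DataFrame(alpha = [5, 4, 3, 2, 1])  # hypothetical example data

no_copy = df[!, :alpha]   # the column vector stored in df
copied  = df[:, :alpha]   # a freshly allocated copy

same_object = no_copy === df.alpha   # df.alpha also does not copy
is_copy     = copied !== df.alpha

# Mutating the copy leaves df untouched; mutating no_copy would change df
copied[1] = 100
df.alpha[1]
```

Using `!` avoids an allocation, but any in-place change to the returned vector is visible in the DataFrame.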
14.9 Access columns by a name stored in a variable
var = "gamma"
dat[!, Symbol(var)]
5-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
"a"
"c"
"c"
"b"
"a"
dat[!, [Symbol(var)]]
5×1 DataFrame
Row | gamma |
---|---|
| Cat… |
1 | a |
2 | c |
3 | c |
4 | b |
5 | a |
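Recent versions of DataFrames.jl also accept strings directly as column selectors, so the conversion with Symbol() is optional. A sketch assuming DataFrames.jl 1.x; `df` and `colname` are hypothetical names:

```julia
using DataFrames

df = DataFrame(alpha = 1:3, gamma = ["a", "c", "b"])
colname = "gamma"   # hypothetical variable holding the column name

v1 = df[!, colname]          # String selector, no Symbol() needed
v2 = df[!, Symbol(colname)]  # equivalent Symbol selector
sub = df[!, [colname]]       # a vector of Strings returns a DataFrame
```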
14.10 Add column to DataFrame
dat.asq = dat.alpha .^ 2
5-element Vector{Int64}:
25
16
9
4
1
dat
5×4 DataFrame
Row | alpha | beta | gamma | asq |
---|---|---|---|---|
| Int64 | Int64 | Cat… | Int64 |
1 | 5 | 21 | a | 25 |
2 | 4 | 22 | c | 16 |
3 | 3 | 23 | c | 9 |
4 | 2 | 24 | b | 4 |
5 | 1 | 25 | a | 1 |
You can do the same with transform!(), which is useful when columns are created programmatically:
transform!(dat, :alpha => (x -> x .^ 2) => :asqtoo)
5×5 DataFrame
Row | alpha | beta | gamma | asq | asqtoo |
---|---|---|---|---|---|
| Int64 | Int64 | Cat… | Int64 | Int64 |
1 | 5 | 21 | a | 25 | 25 |
2 | 4 | 22 | c | 16 | 16 |
3 | 3 | 23 | c | 9 | 9 |
4 | 2 | 24 | b | 4 | 4 |
5 | 1 | 25 | a | 1 | 1 |
dat
5×5 DataFrame
Row | alpha | beta | gamma | asq | asqtoo |
---|---|---|---|---|---|
| Int64 | Int64 | Cat… | Int64 | Int64 |
1 | 5 | 21 | a | 25 | 25 |
2 | 4 | 22 | c | 16 | 16 |
3 | 3 | 23 | c | 9 | 9 |
4 | 2 | 24 | b | 4 | 4 |
5 | 1 | 25 | a | 1 | 1 |
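One way transform!() helps in programmatic use is that the `source => function => target` pairs can be built by broadcasting over vectors of column names. A sketch assuming DataFrames.jl 1.x; the DataFrame `df` and the `_sq` naming scheme are hypothetical:

```julia
using DataFrames

df = DataFrame(alpha = [5, 4, 3], beta = [21, 22, 23])

cols = [:alpha, :beta]                        # columns to transform
newnames = Symbol.(string.(cols) .* "_sq")    # hypothetical naming scheme

# Broadcasting => builds one source => function => target triple per column
transform!(df, cols .=> (x -> x .^ 2) .=> newnames)

# ByRow() wraps a scalar function so it is applied element by element
transform!(df, :alpha => ByRow(x -> x^2) => :alpha_sq_byrow)
```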
14.10.1 Insert column(s) at location
x = DataFrame(v = 1:5, w = 16:20)
5×2 DataFrame
Row | v | w |
---|---|---|
| Int64 | Int64 |
1 | 1 | 16 |
2 | 2 | 17 |
3 | 3 | 18 |
4 | 4 | 19 |
5 | 5 | 20 |
insertcols!(x, 2, :v² => x.v .^ 2)
5×3 DataFrame
Row | v | v² | w |
---|---|---|---|
| Int64 | Int64 | Int64 |
1 | 1 | 1 | 16 |
2 | 2 | 4 | 17 |
3 | 3 | 9 | 18 |
4 | 4 | 16 | 19 |
5 | 5 | 25 | 20 |
wt = DataFrame(w² = x.w .^ 2, w³ = x.w .^ 3)
5×2 DataFrame
Row | w² | w³ |
---|---|---|
| Int64 | Int64 |
1 | 256 | 4096 |
2 | 289 | 4913 |
3 | 324 | 5832 |
4 | 361 | 6859 |
5 | 400 | 8000 |
insertcols!(x, ([:w², :w³] .=> eachcol(wt))...)
5×5 DataFrame
Row | v | v² | w | w² | w³ |
---|---|---|---|---|---|
| Int64 | Int64 | Int64 | Int64 | Int64 |
1 | 1 | 1 | 16 | 256 | 4096 |
2 | 2 | 4 | 17 | 289 | 4913 |
3 | 3 | 9 | 18 | 324 | 5832 |
4 | 4 | 16 | 19 | 361 | 6859 |
5 | 5 | 25 | 20 | 400 | 8000 |
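The insertion position can also be given as a column name instead of an integer, in which case the new column is placed before the named column. A sketch assuming DataFrames.jl 1.x; `df` and the new column names are hypothetical:

```julia
using DataFrames

df = DataFrame(v = 1:3, w = 16:18)  # hypothetical example data

# Insert a new column before the column named :w
insertcols!(df, :w, :half_v => df.v ./ 2)

# Without a position argument, the new column is appended at the end
insertcols!(df, :v_plus_w => df.v .+ df.w)

names(df)
```

Referring to a column by name keeps the call valid even if the number of columns to its left changes later.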