14  Basics

We start by loading both the DataFrames and CategoricalArrays packages.

CategoricalArrays adds support for categorical variables (similar to factors in R).

using DataFrames, CategoricalArrays
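
To make the comparison with R factors concrete, here is a minimal sketch of what a categorical vector stores. The variable name grp is only illustrative; levels() and levelcode() are assumed to come from CategoricalArrays as in its 0.10 and later releases.

```julia
using CategoricalArrays

# A categorical vector keeps each distinct value once (its "levels")
# plus an integer code per element, much like a factor in R.
grp = categorical(["a", "b", "c", "c", "a"])

lv = levels(grp)         # distinct values, sorted by default
codes = levelcode.(grp)  # position of each element's value within the levels
```

The integer codes are what make grouping and sorting on categorical columns cheap compared with comparing strings.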

14.1 Create a DataFrame

Some synthetic arrays:

a = [1, 2, 3, 4, 5];
b = collect(25:-1:21);
c = categorical(["a", "b", "c", "c", "a"]);

Combine into a DataFrame. Note that the second argument, :auto, is required to automatically generate column names:

dat = DataFrame([a, b, c], :auto)
5×3 DataFrame
Row x1 x2 x3
Int64 Int64 Cat…
1 1 25 a
2 2 24 b
3 3 23 c
4 4 22 c
5 5 21 a
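
The same :auto argument applies when the input is a matrix rather than a vector of vectors. A small sketch, with m and dm as illustrative names:

```julia
using DataFrames

# Each matrix column becomes a DataFrame column;
# :auto generates the names x1, x2, ...
m = [1 25; 2 24; 3 23]
dm = DataFrame(m, :auto)

auto_names = names(dm)
```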

14.2 Get column names: names()

names(dat)
3-element Vector{String}:
 "x1"
 "x2"
 "x3"

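names() returns Strings. As a sketch of two related calls, assuming DataFrames 1.x: propertynames() returns the same names as Symbols, and names() accepts a second argument (an element type or a regular expression here) to select a subset. The DataFrame df below is made up for illustration.

```julia
using DataFrames

df = DataFrame(x1 = [1, 2], x2 = [3.0, 4.0], label = ["u", "v"])

sym_names = propertynames(df)  # column names as Symbols
int_names = names(df, Int)     # only columns whose element type is Int
x_names = names(df, r"^x")     # only columns whose name starts with "x"
```
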
14.3 Set column names

To specify column names when creating a new DataFrame you can pass a vector of Symbols:

dat = DataFrame([a, b, c], [:alpha, :beta, :gamma])
5×3 DataFrame
Row alpha beta gamma
Int64 Int64 Cat…
1 1 25 a
2 2 24 b
3 3 23 c
4 4 22 c
5 5 21 a

...or pass the columns as keyword arguments:

dat = DataFrame(ey = a, bee = b, cee = c)
5×3 DataFrame
Row ey bee cee
Int64 Int64 Cat…
1 1 25 a
2 2 24 b
3 3 23 c
4 4 22 c
5 5 21 a
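
A third option, sketched here with illustrative names, is to pass name => vector pairs. This form is convenient when the column names are only known at run time, since they can be Strings held in a variable.

```julia
using DataFrames

# Pairs given directly
d1 = DataFrame("ey" => [1, 2, 3], "bee" => [25, 24, 23])

# Pairs built from a vector of names and a vector of columns
colnames = ["ey", "bee"]
colvals = [[1, 2, 3], [25, 24, 23]]
d2 = DataFrame(colnames .=> colvals)
```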

14.4 Rename columns of a DataFrame: rename!()

By Julia convention, the trailing ! signifies that the function modifies its argument in-place.

rename!(dat, [:alpha, :beta, :gamma])
5×3 DataFrame
Row alpha beta gamma
Int64 Int64 Cat…
1 1 25 a
2 2 24 b
3 3 23 c
4 4 22 c
5 5 21 a
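
Passing a full vector renames every column. To rename only some columns, old => new pairs can be used instead, and rename() (without the !) returns a renamed copy. A short sketch on a made-up DataFrame:

```julia
using DataFrames

df = DataFrame(alpha = 1:3, beta = 4:6)

renamed = rename(df, :alpha => :a)  # new DataFrame; df is untouched
rename!(df, "beta" => "b")          # in-place; Strings work as well as Symbols
upper = rename(uppercase, df)       # apply a function to every column name
```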

14.5 Get dimensions: size(), nrow(), ncol()

size(dat)
(5, 3)
nrow(dat)
5
ncol(dat)
3
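
As with ordinary arrays, size() also accepts a dimension index, so the row and column counts can be obtained from it directly. A minimal sketch with an illustrative DataFrame:

```julia
using DataFrames

df = DataFrame(u = 1:5, v = 6:10, w = 11:15)

nr, nc = size(df)   # destructure the (rows, columns) tuple
rows = size(df, 1)  # same value as nrow(df)
cols = size(df, 2)  # same value as ncol(df)
```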

14.6 DataFrame summary: describe()

describe(dat)
3×7 DataFrame
Row  variable  mean    min  median  max  nmissing  eltype
     Symbol    Union…  Any  Union…  Any  Int64     DataType
1    alpha     3.0     1    3.0     5    0         Int64
2    beta      23.0    21   23.0    25   0         Int64
3    gamma             a            c    0         CategoricalValue{String, UInt32}
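
describe() can also be limited to chosen statistics by passing their names as Symbols. A sketch on a numeric-only DataFrame (the data mirror the alpha and beta columns above):

```julia
using DataFrames

df = DataFrame(alpha = [1, 2, 3, 4, 5], beta = [25, 24, 23, 22, 21])

# Request only the mean and the standard deviation of each column
stats = describe(df, :mean, :std)
```

The result is itself a DataFrame with one row per column of df, so it can be indexed and sorted like any other.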

14.7 Sort: sort() & sort!()

sort() returns a sorted copy and leaves the original DataFrame unchanged:

sort(dat, :beta)
5×3 DataFrame
Row alpha beta gamma
Int64 Int64 Cat…
1 5 21 a
2 4 22 c
3 3 23 c
4 2 24 b
5 1 25 a

dat has not changed:

dat
5×3 DataFrame
Row alpha beta gamma
Int64 Int64 Cat…
1 1 25 a
2 2 24 b
3 3 23 c
4 4 22 c
5 5 21 a

Change the order of the DataFrame rows in-place with sort!():

sort!(dat, :beta)
5×3 DataFrame
Row alpha beta gamma
Int64 Int64 Cat…
1 5 21 a
2 4 22 c
3 3 23 c
4 2 24 b
5 1 25 a
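
Sorting can also be descending or use several columns. The sketch below uses a made-up DataFrame; rev and order() are part of the DataFrames sorting interface.

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "c", "c", "a"], v = [25, 24, 23, 22, 21])

# Descending sort on one column
desc = sort(df, :v, rev = true)

# Sort by g ascending, then by v descending within each value of g
multi = sort(df, [:g, order(:v, rev = true)])
```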

14.8 Indexing

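Indexing a single column returns a vector. Using : as the row index returns a copy of the column, while ! returns the column itself without copying: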

dat[:, 2]
5-element Vector{Int64}:
 21
 22
 23
 24
 25
dat[:, :alpha]
5-element Vector{Int64}:
 5
 4
 3
 2
 1
dat[!, :alpha]
5-element Vector{Int64}:
 5
 4
 3
 2
 1

Indexing with a vector of columns returns a DataFrame:

dat[:, [2]]
5×1 DataFrame
Row beta
Int64
1 21
2 22
3 23
4 24
5 25
dat[:, [:alpha]]
5×1 DataFrame
Row alpha
Int64
1 5
2 4
3 3
4 2
5 1
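
dat[:, :alpha] and dat[!, :alpha] print the same values, but they differ in whether the result shares memory with the DataFrame. A minimal sketch of the difference, using an illustrative single-column DataFrame:

```julia
using DataFrames

df = DataFrame(alpha = [5, 4, 3, 2, 1])

col_copy = df[:, :alpha]  # ":" copies the column
col_ref = df[!, :alpha]   # "!" returns the stored column itself

col_copy[1] = 100         # does not affect df
col_ref[2] = -1           # changes df as well

first_two = df.alpha[1:2]
```

Use ! when you want to avoid the copy or modify the column through the returned vector, and : when the result must be independent of the DataFrame.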

14.9 Access columns by name saved in variable

var = "gamma"
dat[!, Symbol(var)]
5-element CategoricalArrays.CategoricalArray{String,1,UInt32}:
 "a"
 "c"
 "c"
 "b"
 "a"
dat[!, [Symbol(var)]]
5×1 DataFrame
Row gamma
Cat…
1 a
2 c
3 c
4 b
5 a
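
In DataFrames 1.x, column names can also be given as Strings, so the conversion with Symbol() is optional. A sketch with an illustrative DataFrame and variable name:

```julia
using DataFrames

df = DataFrame(alpha = 1:3, gamma = ["a", "c", "b"])
colname = "gamma"

v1 = df[!, colname]            # vector, indexed by String
sub = df[!, [colname]]         # one-column DataFrame
v2 = getproperty(df, colname)  # same as df.gamma, with the name in a variable
```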

14.10 Add column to DataFrame

dat.asq = dat.alpha .^ 2
5-element Vector{Int64}:
 25
 16
  9
  4
  1
dat
5×4 DataFrame
Row alpha beta gamma asq
Int64 Int64 Cat… Int64
1 5 21 a 25
2 4 22 c 16
3 3 23 c 9
4 2 24 b 4
5 1 25 a 1

You can do the same with transform!(), which is convenient when the source and destination column names are built programmatically:

transform!(dat, :alpha => (x -> x .^ 2) => :asqtoo)
5×5 DataFrame
Row alpha beta gamma asq asqtoo
Int64 Int64 Cat… Int64 Int64
1 5 21 a 25 25
2 4 22 c 16 16
3 3 23 c 9 9
4 2 24 b 4 4
5 1 25 a 1 1
dat
5×5 DataFrame
Row alpha beta gamma asq asqtoo
Int64 Int64 Cat… Int64 Int64
1 5 21 a 25 25
2 4 22 c 16 16
3 3 23 c 9 9
4 2 24 b 4 4
5 1 25 a 1 1
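
Two variations on the call above, sketched on a small illustrative DataFrame: transform() without the ! returns a new DataFrame and leaves the original untouched, and ByRow() wraps a scalar function so that it is applied to each element, which avoids writing the broadcast by hand.

```julia
using DataFrames

df = DataFrame(alpha = [5, 4, 3])

# Non-mutating version; the function x -> x^2 is applied row by row
out = transform(df, :alpha => ByRow(x -> x^2) => :asq)
```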

14.10.1 Insert column(s) at location

x = DataFrame(v = 1:5, w = 16:20)
5×2 DataFrame
Row v w
Int64 Int64
1 1 16
2 2 17
3 3 18
4 4 19
5 5 20
insertcols!(x, 2, :v² => x.v .^ 2)
5×3 DataFrame
Row v v² w
Int64 Int64 Int64
1 1 1 16
2 2 4 17
3 3 9 18
4 4 16 19
5 5 25 20
wt = DataFrame(w² = x.w .^ 2, w³ = x.w .^ 3)
5×2 DataFrame
Row w² w³
Int64 Int64
1 256 4096
2 289 4913
3 324 5832
4 361 6859
5 400 8000
insertcols!(x, ([:w², :w³] .=> eachcol(wt))...)
5×5 DataFrame
Row v v² w w² w³
Int64 Int64 Int64 Int64 Int64
1 1 1 16 256 4096
2 2 4 17 289 4913
3 3 9 18 324 5832
4 4 16 19 361 6859
5 5 25 20 400 8000
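
The insertion point can also be given as a column name instead of an integer position, in which case the new column is placed before that column. A minimal sketch, with illustrative column names (vsq, id):

```julia
using DataFrames

df = DataFrame(v = 1:3, w = 16:18)

# Insert before the column named :w
insertcols!(df, :w, :vsq => df.v .^ 2)

# Insert at position 1, i.e. as the first column
insertcols!(df, 1, :id => ["r1", "r2", "r3"])

final_names = names(df)
```

Referring to a column by name keeps the call correct if columns are later added or reordered upstream.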

14.11 Resources