34 data.table basics
34.1 data.table
extends the functionality of data.frame
Some of the ways in which a data.table
differs from a data.frame
:
- Many operations can be performed within a
data.table
’s “frame” (dt[i, j, by]
): filter cases, select columns & operate on columns, group-by operations - Access column names directly without quoting
- Many operations can be performed “in-place” (i.e. with no assignment)
- Working on large datasets (e.g. millions of rows) can be orders of magnitude faster with a
data.table
than adata.frame
.
data.table
operations remain as close as possible to data.frame
operations, trying to extend rather than replace data.frame
functionality.
data.table includes thorough and helpful error messages that often point to a solution. This includes common mistakes new users may make when trying commands that would work on a data.frame
but not on a data.table
.
34.1.1 Load the data.table
package
34.2 Create a data.table
34.2.1 By assignment: data.table()
Let’s create a data.frame
and a data.table
to explore side by side.
df <- data.frame(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = c("a", "b", "b", "a", "a"))
class(df)
[1] "data.frame"
df
A B C
1 1 1.2 a
2 2 4.3 b
3 3 9.7 b
4 4 5.6 a
5 5 8.1 a
data.table()
syntax is similar to data.frame()
(differs in some arguments)
dt <- data.table(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = c("a", "b", "b", "a", "a"))
class(dt)
[1] "data.table" "data.frame"
dt
A B C
<int> <num> <char>
1: 1 1.2 a
2: 2 4.3 b
3: 3 9.7 b
4: 4 5.6 a
5: 5 8.1 a
Notice how a data.table
object also inherits from data.frame
. This means that if a method does not exist for data.table
, the method for data.frame
will be used (See classes and generic functions).
As part of improving efficiency, data.tables do away with row names. Instead of using row names, you should use a dedicated column or column with a row identifier/s (e.g. “ID”). this is advisable when working with data.frames as well.
A rather convenient option is to have data.tables print each column’s class below the column name. You can pass the argument class = TRUE
to print()
or set the global option datatable.print.class
using options()
options(datatable.print.class = TRUE)
dt
A B C
<int> <num> <char>
1: 1 1.2 a
2: 2 4.3 b
3: 3 9.7 b
4: 4 5.6 a
5: 5 8.1 a
Same as with a data.frame, to automatically convert strings to factors, you can use the stringsAsFactors
argument:
dt2 <- data.table(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = c("a", "b", "b", "a", "a"),
stringsAsFactors = TRUE)
dt2
A B C
<int> <num> <fctr>
1: 1 1.2 a
2: 2 4.3 b
3: 3 9.7 b
4: 4 5.6 a
5: 5 8.1 a
34.2.2 By coercion: as.data.table()
dat <- data.frame(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = c("a", "b", "b", "a", "a"),
stringsAsFactors = TRUE)
dat
A B C
1 1 1.2 a
2 2 4.3 b
3 3 9.7 b
4 4 5.6 a
5 5 8.1 a
dat2 <- as.data.table(dat)
dat2
A B C
<int> <num> <fctr>
1: 1 1.2 a
2: 2 4.3 b
3: 3 9.7 b
4: 4 5.6 a
5: 5 8.1 a
34.2.3 By coercion in-place: setDT()
setDT()
converts a list or data.frame into a data.table
in-place. This means the object passed to setDT()
is changed and you do not need to assign the output to a new object.
dat <- data.frame(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = c("a", "b", "b", "a", "a"))
class(dat)
[1] "data.frame"
You can similarly convert a data.table
to a data.frame
, in-place:
34.3 Display data.table
structure with str()
str()
works the same (and you should keep using it!)
str(df)
'data.frame': 5 obs. of 3 variables:
$ A: int 1 2 3 4 5
$ B: num 1.2 4.3 9.7 5.6 8.1
$ C: chr "a" "b" "b" "a" ...
str(dt)
Classes 'data.table' and 'data.frame': 5 obs. of 3 variables:
$ A: int 1 2 3 4 5
$ B: num 1.2 4.3 9.7 5.6 8.1
$ C: chr "a" "b" "b" "a" ...
- attr(*, ".internal.selfref")=<externalptr>
34.4 Combine data.tables
cbind()
and rbind()
work on data.table
s the same as on data.frame
s:
dt1 <- data.table(a = 1:5)
dt2 <- data.table(b = 11:15)
cbind(dt1, dt2)
a b
<int> <int>
1: 1 11
2: 2 12
3: 3 13
4: 4 14
5: 5 15
rbind(dt1, dt1)
a
<int>
1: 1
2: 2
3: 3
4: 4
5: 5
6: 1
7: 2
8: 3
9: 4
10: 5
34.5 Set column names in-place
dta <- data.table(
ID = sample(8000:9000, size = 10),
A = rnorm(10, mean = 47, sd = 8),
W = rnorm(10, mean = 87, sd = 7)
)
dta
ID A W
<int> <num> <num>
1: 8912 40.16238 92.27844
2: 8677 53.40798 93.30358
3: 8079 58.66035 88.90666
4: 8306 55.99422 79.04154
5: 8263 53.18652 83.83766
6: 8452 41.67441 81.07083
7: 8073 48.48667 89.69182
8: 8139 43.49363 90.74941
9: 8627 44.59427 87.92574
10: 8244 41.51665 81.57333
Use the syntax:
setnames(dt, old, new)
to change the column names of a data.table in-place.
Changes all column names:
Patient_ID Age Weight
<int> <num> <num>
1: 8912 40.16238 92.27844
2: 8677 53.40798 93.30358
3: 8079 58.66035 88.90666
4: 8306 55.99422 79.04154
5: 8263 53.18652 83.83766
6: 8452 41.67441 81.07083
7: 8073 48.48667 89.69182
8: 8139 43.49363 90.74941
9: 8627 44.59427 87.92574
10: 8244 41.51665 81.57333
Change subset of names:
Patient_ID Age_at_Admission Weight_at_Admission
<int> <num> <num>
1: 8912 40.16238 92.27844
2: 8677 53.40798 93.30358
3: 8079 58.66035 88.90666
4: 8306 55.99422 79.04154
5: 8263 53.18652 83.83766
6: 8452 41.67441 81.07083
7: 8073 48.48667 89.69182
8: 8139 43.49363 90.74941
9: 8627 44.59427 87.92574
10: 8244 41.51665 81.57333
old
argument can also be integer index of column(s).
For example, change the name of the first column:
setnames(dta, 1, "Hospital_ID")
dta
Hospital_ID Age_at_Admission Weight_at_Admission
<int> <num> <num>
1: 8912 40.16238 92.27844
2: 8677 53.40798 93.30358
3: 8079 58.66035 88.90666
4: 8306 55.99422 79.04154
5: 8263 53.18652 83.83766
6: 8452 41.67441 81.07083
7: 8073 48.48667 89.69182
8: 8139 43.49363 90.74941
9: 8627 44.59427 87.92574
10: 8244 41.51665 81.57333
34.6 Filter rows
There are many similarities and some notable differences in how indexing works in a data.table
vs. a data.frame
.
Filtering rows with an integer or logical index is largely the same in a data.frame
and a data.table
, but in a data.table
you can omit the comma to select all columns:
df[c(1, 3, 5), ]
A B C
1 1 1.2 a
3 3 9.7 b
5 5 8.1 a
dt[c(1, 3, 5), ]
A B C
<int> <num> <char>
1: 1 1.2 a
2: 3 9.7 b
3: 5 8.1 a
dt[c(1, 3, 5)]
A B C
<int> <num> <char>
1: 1 1.2 a
2: 3 9.7 b
3: 5 8.1 a
Using a variable that holds a row index, whether integer or logical:
rowid <- c(1, 3, 5)
df[rowid, ]
A B C
1 1 1.2 a
3 3 9.7 b
5 5 8.1 a
dt[rowid, ]
A B C
<int> <num> <char>
1: 1 1.2 a
2: 3 9.7 b
3: 5 8.1 a
dt[rowid]
A B C
<int> <num> <char>
1: 1 1.2 a
2: 3 9.7 b
3: 5 8.1 a
rowbn <- c(T, F, T, F, T)
df[rowbn, ]
A B C
1 1 1.2 a
3 3 9.7 b
5 5 8.1 a
dt[rowbn, ]
A B C
<int> <num> <char>
1: 1 1.2 a
2: 3 9.7 b
3: 5 8.1 a
dt[rowbn]
A B C
<int> <num> <char>
1: 1 1.2 a
2: 3 9.7 b
3: 5 8.1 a
34.6.1 Conditional filtering
As a reminder, there are a few ways to conditionally filter cases in a data.frame:
A B C
5 5 8.1 a
A B C
5 5 8.1 a
A B C
5 5 8.1 a
data.table
allows you to refer to column names directly and unquoted, which makes writing filter conditions easier/more compact:
The data.table
package also includes an S3 method for subset()
that works the same way as with a data.frame
:
As another example, exclude cases based on missingness in a specific column:
adf <- as.data.frame(sapply(1:5, function(i) rnorm(10)))
adf |> head()
V1 V2 V3 V4 V5
1 -0.6974634 1.3614559 0.7019747 0.49300261 0.8172266
2 -0.5646908 0.9028258 1.1200609 -0.06263677 1.6638757
3 -0.2138363 -0.7494856 -2.8516044 -0.71938009 -1.3365015
4 -0.9833590 0.4919551 1.9010847 -0.28031395 0.3194036
5 0.2487369 3.0569109 -0.1765964 -0.10343995 -0.1788702
6 1.3591861 -0.4989326 0.2304201 0.53382002 -0.1463504
adf[1, 3] <- adf[3, 4] <- adf[5, 3] <- adf[7, 3] <- NA
adt <- as.data.table(adf)
adf[!is.na(adf$V3), ]
V1 V2 V3 V4 V5
2 -0.56469083 0.9028258 1.1200609 -0.06263677 1.66387568
3 -0.21383631 -0.7494856 -2.8516044 NA -1.33650152
4 -0.98335900 0.4919551 1.9010847 -0.28031395 0.31940360
6 1.35918611 -0.4989326 0.2304201 0.53382002 -0.14635038
8 0.01868719 1.8381113 -2.0441653 -0.22916360 -0.03021756
9 -0.25226871 1.6206832 0.8369118 1.26759617 -0.99596141
10 1.40595781 1.2314515 -0.8814674 0.34815599 -0.57199391
adt[!is.na(V3)]
V1 V2 V3 V4 V5
<num> <num> <num> <num> <num>
1: -0.56469083 0.9028258 1.1200609 -0.06263677 1.66387568
2: -0.21383631 -0.7494856 -2.8516044 NA -1.33650152
3: -0.98335900 0.4919551 1.9010847 -0.28031395 0.31940360
4: 1.35918611 -0.4989326 0.2304201 0.53382002 -0.14635038
5: 0.01868719 1.8381113 -2.0441653 -0.22916360 -0.03021756
6: -0.25226871 1.6206832 0.8369118 1.26759617 -0.99596141
7: 1.40595781 1.2314515 -0.8814674 0.34815599 -0.57199391
34.7 Select columns
34.7.1 By position(s)
Selecting a single column in data.table
does not drop to a vector, similar to using drop = FALSE
in a data.frame:
df[, 1]
[1] 1 2 3 4 5
df[, 1, drop = FALSE]
A
1 1
2 2
3 3
4 4
5 5
dt[, 1]
A
<int>
1: 1
2: 2
3: 3
4: 4
5: 5
Double bracket indexing of a single column works the same on a data.frame
and a data.table
, returning a vector:
df[[2]]
[1] 1.2 4.3 9.7 5.6 8.1
dt[[2]]
[1] 1.2 4.3 9.7 5.6 8.1
A vector of column positions returns a smaller data.table
, similar to how it returns a smaller data.frame
:
34.7.2 By name(s)
In data.table
, you access column names directly without quoting or using the $
notation:
df[, "B"]
[1] 1.2 4.3 9.7 5.6 8.1
df$B
[1] 1.2 4.3 9.7 5.6 8.1
dt[, B]
[1] 1.2 4.3 9.7 5.6 8.1
Because of this, data.table
requires a slightly different syntax to use a variable as a column index which can contain integer positions, logical index, or column names. While on a data.frame you can do pass an index vector directly:
A B
1 1 1.2
2 2 4.3
3 3 9.7
4 4 5.6
5 5 8.1
df[, colbn]
B C
1 1.2 a
2 4.3 b
3 9.7 b
4 5.6 a
5 8.1 a
df[, colnm]
A C
1 1 a
2 2 b
3 3 b
4 4 a
5 5 a
To do the same in a data.table
, you must prefix the index vector with two dots:
dt[, ..colid]
A B
<int> <num>
1: 1 1.2
2: 2 4.3
3: 3 9.7
4: 4 5.6
5: 5 8.1
dt[, ..colbn]
B C
<num> <char>
1: 1.2 a
2: 4.3 b
3: 9.7 b
4: 5.6 a
5: 8.1 a
dt[, ..colnm]
A C
<int> <char>
1: 1 a
2: 2 b
3: 3 b
4: 4 a
5: 5 a
Think of working inside the data.table
frame (i.e. within the “[…]”) like an environment: you have direct access to the variables, i.e. columns within it. If you want to refer to variables outside the data.table
, you must prefix their names with ..
(similar to how you access the directory above your current working directory in the system shell).
Always read error messages carefully, no matter what function or package you are using. In the case of data.table
, the error messages are very informative and often point to the solution.
See what happens if you try to use the data.frame syntax by accident:
dt[, colid]
Error in `[.data.table`(dt, , colid): j (the 2nd argument inside [...]) is a single symbol but column name 'colid' is not found. If you intended to select columns using a variable in calling scope, please try DT[, ..colid]. The .. prefix conveys one-level-up similar to a file system path.
Selecting a single column by name returns a vector:
dt[, A]
[1] 1 2 3 4 5
Selecting one or more columns by name enclosed in list()
or .()
(which, in this case, is short for list()
), always returns a data.table
:
dt[, .(A)]
A
<int>
1: 1
2: 2
3: 3
4: 4
5: 5
dt[, .(A, B)]
A B
<int> <num>
1: 1 1.2
2: 2 4.3
3: 3 9.7
4: 4 5.6
5: 5 8.1
34.7.3 .SD
& .SDcols
.SDcols
is a special symbol that can be used to select columns of a data.table as an alternative to j
. It can accept a vector of integer positions or column names. .SD
is another special symbol that can be used in j
and refers to either the entire data.table, or the subset defined by .SDcols
, if present. The following can be used to select columns:
dt[, .SD, .SDcols = colid]
A B
<int> <num>
1: 1 1.2
2: 2 4.3
3: 3 9.7
4: 4 5.6
5: 5 8.1
One of the main uses of .SD is shown below in combination with lapply()
.
34.8 Add new column in-place
Use :=
assignment to add a new column in the existing data.table
.
dt[, AplusB := A + B]
dt
A B C AplusB
<int> <num> <char> <num>
1: 1 1.2 a 2.2
2: 2 4.3 b 6.3
3: 3 9.7 b 12.7
4: 4 5.6 a 9.6
5: 5 8.1 a 13.1
Note how dt
was modified even though we did not run dt <- dt[, AplusB := A + B]
34.9 Add multiple columns in-place
You can define multiple new column names using a character vector of new column names on the left of :=
and a list on the right.
We can use lapply()
since it always returns a list:
A B C AplusB AtimesB AoverB logA logB
<int> <num> <char> <num> <num> <num> <num> <num>
1: 1 1.2 a 2.2 1.2 0.8333333 0.0000000 0.1823216
2: 2 4.3 b 6.3 8.6 0.4651163 0.6931472 1.4586150
3: 3 9.7 b 12.7 29.1 0.3092784 1.0986123 2.2721259
4: 4 5.6 a 9.6 22.4 0.7142857 1.3862944 1.7227666
5: 5 8.1 a 13.1 40.5 0.6172840 1.6094379 2.0918641
You can also use :=
in a little more awkward syntax:
dt[, `:=`(AminusB = A - B, AoverC = A / B)]
dt
A B C AplusB AtimesB AoverB logA logB AminusB
<int> <num> <char> <num> <num> <num> <num> <num> <num>
1: 1 1.2 a 2.2 1.2 0.8333333 0.0000000 0.1823216 -0.2
2: 2 4.3 b 6.3 8.6 0.4651163 0.6931472 1.4586150 -2.3
3: 3 9.7 b 12.7 29.1 0.3092784 1.0986123 2.2721259 -6.7
4: 4 5.6 a 9.6 22.4 0.7142857 1.3862944 1.7227666 -1.6
5: 5 8.1 a 13.1 40.5 0.6172840 1.6094379 2.0918641 -3.1
AoverC
<num>
1: 0.8333333
2: 0.4651163
3: 0.3092784
4: 0.7142857
5: 0.6172840
34.10 Convert column type
34.10.1 Assignment by reference with :=
Use any base R coercion function (as.*
) to convert a column in-place using the :=
notation
dt[, A := as.numeric(A)]
dt
A B C AplusB AtimesB AoverB logA logB AminusB
<num> <num> <char> <num> <num> <num> <num> <num> <num>
1: 1 1.2 a 2.2 1.2 0.8333333 0.0000000 0.1823216 -0.2
2: 2 4.3 b 6.3 8.6 0.4651163 0.6931472 1.4586150 -2.3
3: 3 9.7 b 12.7 29.1 0.3092784 1.0986123 2.2721259 -6.7
4: 4 5.6 a 9.6 22.4 0.7142857 1.3862944 1.7227666 -1.6
5: 5 8.1 a 13.1 40.5 0.6172840 1.6094379 2.0918641 -3.1
AoverC
<num>
1: 0.8333333
2: 0.4651163
3: 0.3092784
4: 0.7142857
5: 0.6172840
34.10.2 Delete columns in-place with :=
To delete a column, use :=
to set it to NULL:
dt[, AoverB := NULL]
dt
A B C AplusB AtimesB logA logB AminusB AoverC
<num> <num> <char> <num> <num> <num> <num> <num> <num>
1: 1 1.2 a 2.2 1.2 0.0000000 0.1823216 -0.2 0.8333333
2: 2 4.3 b 6.3 8.6 0.6931472 1.4586150 -2.3 0.4651163
3: 3 9.7 b 12.7 29.1 1.0986123 2.2721259 -6.7 0.3092784
4: 4 5.6 a 9.6 22.4 1.3862944 1.7227666 -1.6 0.7142857
5: 5 8.1 a 13.1 40.5 1.6094379 2.0918641 -3.1 0.6172840
Delete multiple columns
dt[, c("logA", "logB") := NULL]
Or:
dt[, `:=`(AplusB = NULL, AminusB = NULL)]
dt
A B C AtimesB AoverC
<num> <num> <char> <num> <num>
1: 1 1.2 a 1.2 0.8333333
2: 2 4.3 b 8.6 0.4651163
3: 3 9.7 b 29.1 0.3092784
4: 4 5.6 a 22.4 0.7142857
5: 5 8.1 a 40.5 0.6172840
34.10.3 Fast loop-able assignment with set()
data.table
’s set()
is a loop-able version of the :=
operator. Use it in a for loop to operate on multiple columns.
Syntax: set(dt, i, j, value)
-
dt
the data.table to operate on -
i
optionally define which rows to operate on.i = NULL
to operate on all rows -
j
column names or index to be assignedvalue
-
value
values to be assigned toj
by reference
As a simple example, transform the first two columns in-place by squaring:
for (i in 1:2) {
set(dt, i = NULL, j = i, value = dt[[i]]^2)
}
34.11 Summarize
You can apply one or multiple summary functions on one or multiple columns. Surround the operations in list()
or .()
to output a new data.table holding the outputs of the operations, i.e. the input data.table remains unchanged.
A_max A_min A_sd
<num> <num> <num>
1: 25 1 9.66954
Example: Get the standard deviation of all numeric columns:
A B AtimesB AoverC
<num> <num> <num> <num>
1: 9.66954 37.35521 15.74462 0.2060219
If your function returns more than one value, the output will have multiple rows:
dt_range <- dt[, lapply(.SD, range), .SDcols = numid]
dt_range
A B AtimesB AoverC
<num> <num> <num> <num>
1: 1 1.44 1.2 0.3092784
2: 25 94.09 40.5 0.8333333
34.12 Set order
You can set the row order of a data.table in-place based on one or multiple columns’ values using setorder()
dt <- data.table(PatientID = sample(1001:9999, size = 10),
Height = rnorm(10, mean = 175, sd = 14),
Weight = rnorm(10, mean = 78, sd = 10),
Group = factor(sample(c("A", "B"), size = 10, replace = TRUE)))
dt
PatientID Height Weight Group
<int> <num> <num> <fctr>
1: 2854 184.1146 69.62671 A
2: 5802 167.9553 90.20479 B
3: 6372 189.1118 84.64837 A
4: 4453 199.0681 97.57381 A
5: 1683 209.3120 64.42896 B
6: 6039 175.5606 61.83276 A
7: 3259 157.5410 83.45360 A
8: 4429 169.0540 88.55968 B
9: 8993 186.6740 70.34618 A
10: 7339 155.7481 78.86928 A
Let’s set the order by PatientID:
setorder(dt, PatientID)
dt
PatientID Height Weight Group
<int> <num> <num> <fctr>
1: 1683 209.3120 64.42896 B
2: 2854 184.1146 69.62671 A
3: 3259 157.5410 83.45360 A
4: 4429 169.0540 88.55968 B
5: 4453 199.0681 97.57381 A
6: 5802 167.9553 90.20479 B
7: 6039 175.5606 61.83276 A
8: 6372 189.1118 84.64837 A
9: 7339 155.7481 78.86928 A
10: 8993 186.6740 70.34618 A
Let’s re-order, always in-place, by group and then by height:
setorder(dt, Group, Height)
dt
PatientID Height Weight Group
<int> <num> <num> <fctr>
1: 7339 155.7481 78.86928 A
2: 3259 157.5410 83.45360 A
3: 6039 175.5606 61.83276 A
4: 2854 184.1146 69.62671 A
5: 8993 186.6740 70.34618 A
6: 6372 189.1118 84.64837 A
7: 4453 199.0681 97.57381 A
8: 5802 167.9553 90.20479 B
9: 4429 169.0540 88.55968 B
10: 1683 209.3120 64.42896 B
34.13 Group-by operations
Up to now, we have learned how to use the data.table
frame dat[i, j]
to filter cases in i
or add/remove/transform columns in-place in j
. dat[i, j, by]
allows to perform operations separately on groups of cases.
dt <- data.table(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = rnorm(5),
Group = c("a", "b", "b", "a", "a"))
dt
A B C Group
<int> <num> <num> <char>
1: 1 1.2 -0.32495656 a
2: 2 4.3 2.15479797 b
3: 3 9.7 1.01111238 b
4: 4 5.6 -1.01614533 a
5: 5 8.1 0.03401566 a
34.13.1 Group-by summary
As we’ve seen, using .()
or list()
in j
, returns a new data.table
:
34.13.2 Group-by operation and assignment
Making an assignment with :=
in j
, adds a column in-place. If you combine such an assignment with a group-by operation, the same value will be assigned to all cases of the group:
dt[, mean_A_by_Group := mean(A), by = Group]
dt
A B C Group mean_A_by_Group
<int> <num> <num> <char> <num>
1: 1 1.2 -0.32495656 a 3.333333
2: 2 4.3 2.15479797 b 2.500000
3: 3 9.7 1.01111238 b 2.500000
4: 4 5.6 -1.01614533 a 3.333333
5: 5 8.1 0.03401566 a 3.333333
34.14 Apply functions to columns
Any function that returns a list can be used in j
to return a new data.table - therefore lapply is perfect for getting summary on multiple columns. This is another example where you have to use the .SD
notation:
dt1 <- as.data.table(sapply(1:3, \(i) rnorm(10)))
dt1
V1 V2 V3
<num> <num> <num>
1: 0.21051232 -0.7952196 -0.1114999
2: 0.40835419 0.3399457 -1.0079365
3: 0.30058197 0.6037877 -1.7683228
4: 0.48821143 -0.4297953 -0.8320273
5: -0.65502092 1.6373637 0.6777377
6: 0.31486633 -0.5948143 -0.5559450
7: 0.03931736 0.4013012 -0.7522870
8: -0.04110166 1.4621264 1.3029453
9: -0.07188748 0.2717599 0.8335130
10: -1.83133346 -0.2057708 1.4685168
Alpha Beta Gamma
<num> <num> <num>
1: -0.08374999 0.2690685 -0.07453057
You can specify which columns to operate on using the .SDcols
argument:
dt2 <- data.table(A = 1:5,
B = c(1.2, 4.3, 9.7, 5.6, 8.1),
C = rnorm(5),
Group = c("a", "b", "b", "a", "a"))
dt2
A B C Group
<int> <num> <num> <char>
1: 1 1.2 -1.1787199 a
2: 2 4.3 -0.3847940 b
3: 3 9.7 0.1066941 b
4: 4 5.6 0.1838466 a
5: 5 8.1 1.2703974 a
dt2[, lapply(.SD, mean), .SDcols = 1:2]
A B
<num> <num>
1: 3 5.78
A B
<num> <num>
1: 3 5.78
A B
<num> <num>
1: 3 5.78
You can combine .SDcols
and by
:
Group B C
<char> <num> <num>
1: a 5.6 0.1838466
2: b 7.0 -0.1390499
Create multiple new columns from transformation of existing and store with custom prefix:
dt1
Alpha Beta Gamma
<num> <num> <num>
1: 0.21051232 -0.7952196 -0.1114999
2: 0.40835419 0.3399457 -1.0079365
3: 0.30058197 0.6037877 -1.7683228
4: 0.48821143 -0.4297953 -0.8320273
5: -0.65502092 1.6373637 0.6777377
6: 0.31486633 -0.5948143 -0.5559450
7: 0.03931736 0.4013012 -0.7522870
8: -0.04110166 1.4621264 1.3029453
9: -0.07188748 0.2717599 0.8335130
10: -1.83133346 -0.2057708 1.4685168
Alpha Beta Gamma Alpha_abs Beta_abs Gamma_abs
<num> <num> <num> <num> <num> <num>
1: 0.21051232 -0.7952196 -0.1114999 0.21051232 0.7952196 0.1114999
2: 0.40835419 0.3399457 -1.0079365 0.40835419 0.3399457 1.0079365
3: 0.30058197 0.6037877 -1.7683228 0.30058197 0.6037877 1.7683228
4: 0.48821143 -0.4297953 -0.8320273 0.48821143 0.4297953 0.8320273
5: -0.65502092 1.6373637 0.6777377 0.65502092 1.6373637 0.6777377
6: 0.31486633 -0.5948143 -0.5559450 0.31486633 0.5948143 0.5559450
7: 0.03931736 0.4013012 -0.7522870 0.03931736 0.4013012 0.7522870
8: -0.04110166 1.4621264 1.3029453 0.04110166 1.4621264 1.3029453
9: -0.07188748 0.2717599 0.8335130 0.07188748 0.2717599 0.8335130
10: -1.83133346 -0.2057708 1.4685168 1.83133346 0.2057708 1.4685168
dt2
A B C Group
<int> <num> <num> <char>
1: 1 1.2 -1.1787199 a
2: 2 4.3 -0.3847940 b
3: 3 9.7 0.1066941 b
4: 4 5.6 0.1838466 a
5: 5 8.1 1.2703974 a
cols <- c("A", "C")
dt2[, paste0(cols, "_groupMean") := lapply(.SD, mean), .SDcols = cols, by = Group]
dt2
A B C Group A_groupMean C_groupMean
<int> <num> <num> <char> <num> <num>
1: 1 1.2 -1.1787199 a 3.333333 0.09184136
2: 2 4.3 -0.3847940 b 2.500000 -0.13904993
3: 3 9.7 0.1066941 b 2.500000 -0.13904993
4: 4 5.6 0.1838466 a 3.333333 0.09184136
5: 5 8.1 1.2703974 a 3.333333 0.09184136
34.15 Row-wise operations
dt <- data.table(a = 1:5, b = 11:15, c = 21:25,
d = 31:35, e = 41:45)
dt
a b c d e
<int> <int> <int> <int> <int>
1: 1 11 21 31 41
2: 2 12 22 32 42
3: 3 13 23 33 43
4: 4 14 24 34 44
5: 5 15 25 35 45
To operate row-wise, we can use by = 1:nrow(dt)
. For example, to add a column, in-place, with row-wise sums of variables b through d:
34.16 Watch out for data.table
error messages
For example