library(data.table)
options(datatable.print.class = TRUE)
36 Reference Semantics
This chapter introduces the concept of reference semantics, which is used by the data.table package.
When you create an object in R, it is stored at some location in memory. The address() function in the data.table package returns the memory address of an object.
36.1 Add column to data.frame
Let’s create a simple data.frame, df1:
df1 <- data.frame(
ID = c(8001, 8002),
GCS = c(15, 13)
)
…and print its address:
address(df1)
[1] "0x10a8b3f08"
Right now, we don’t care what the actual address is, but we want to keep track of when it changes.
Let’s add a new column to df1:
df1$HR <- c(80, 90)
…and print its address:
address(df1)
[1] "0x140649e08"
The address has changed, even though we’re still working on the “same” df1 object.
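Reading hexadecimal addresses by eye is error-prone, so the same check can be done programmatically. The following is a minimal sketch, not part of the original example: the name df_demo is hypothetical, and the actual addresses will differ from run to run.

```r
library(data.table)  # provides address()

# Hypothetical example object, mirroring df1 above
df_demo <- data.frame(ID = c(8001, 8002), GCS = c(15, 13))
addr_before <- address(df_demo)

# Base R assignment builds a modified object and rebinds the name to it
df_demo$HR <- c(80, 90)
addr_after <- address(df_demo)

# Compare the stored addresses; FALSE means a new object was allocated
addr_before == addr_after
```

Storing the address as a character string before the change is what makes the comparison possible, since the old object is no longer reachable afterwards.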
36.2 Add column to data.table
Let’s create a simple data.table, dt1:
dt1 <- data.table(
ID = c(8001, 8002),
GCS = c(15, 13)
)
…and print its address:
address(dt1)
[1] "0x10a967200"
Let’s add a new column to dt1 in-place:
dt1[, HR := c(80, 90)]
…and print its address:
address(dt1)
[1] "0x10a967200"
The address remains the same.
What if we had used the data.frame syntax (which still works on a data.table) instead?
dt1$HR <- c(80, 90)
address(dt1)
[1] "0x107b99600"
The address indeed changes, just like with data.frames.
Making copies of large objects can be time-consuming and memory-intensive. So far, we have seen that modifying a data.table by reference changes the object in place and does not create a new copy.
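As a compact recap of in-place modification, the sketch below checks the address after each step. It also uses set(), data.table's functional counterpart to :=, which is not used elsewhere in this chapter. The object dt_demo and the SBP column are hypothetical and chosen only for illustration.

```r
library(data.table)

# Hypothetical example object, mirroring dt1 above
dt_demo <- data.table(ID = c(8001, 8002), GCS = c(15, 13))
addr_start <- address(dt_demo)

# := adds the column in place
dt_demo[, HR := c(80, 90)]
addr_walrus <- address(dt_demo)

# set() also adds a column by reference (SBP is an illustrative column)
set(dt_demo, j = "SBP", value = c(120, 110))
addr_set <- address(dt_demo)

# Both comparisons check that the object was never reallocated
identical(addr_start, addr_walrus)
identical(addr_start, addr_set)
```

This relies on data.table reserving spare column slots when the table is created, which is why a small number of added columns does not force a reallocation.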
36.3 Caution with reference semantics
So far so good: we are starting to understand one reason why data.table is efficient. One very important thing to keep in mind is that when you do want to make a copy of a data.table, e.g. to create a different version of it, you must use the copy() function.
Let’s see why.
36.3.1 Copying a data.frame
Let’s remind ourselves of the contents and address of df1:
df1
ID GCS HR
1 8001 15 80
2 8002 13 90
address(df1)
[1] "0x140649e08"
To make a copy of df1, we can simply assign it to a new object:
df2 <- df1
df2
ID GCS HR
1 8001 15 80
2 8002 13 90
address(df2)
[1] "0x140649e08"
The address of df2 is the same as that of df1, which means they are pointing to the same object in memory.
As we’ve already seen, if we edit df2, its address will change:
df2[1, 3] <- 75
df2
ID GCS HR
1 8001 15 75
2 8002 13 90
address(df2)
[1] "0x1286d6b28"
The contents and address of df2 have changed, but df1 remains the same, as you might expect:
df1
ID GCS HR
1 8001 15 80
2 8002 13 90
address(df1)
[1] "0x140649e08"
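The behavior above can be condensed into a few logical checks. This is a minimal sketch using hypothetical objects df_a and df_b, under the assumption that nothing else modifies them in between:

```r
library(data.table)  # provides address()

df_a <- data.frame(ID = c(8001, 8002), HR = c(80, 90))

# Plain assignment does not copy: both names share one object
df_b <- df_a
shared_before <- address(df_a) == address(df_b)

# Editing df_b makes R create a separate object for it
df_b[1, "HR"] <- 75
shared_after <- address(df_a) == address(df_b)

shared_before  # TRUE: same object before the edit
shared_after   # FALSE: df_b now lives elsewhere
df_a$HR        # the original values are untouched
```

The copy is deferred until the first modification, which is why a plain assignment is enough to get an independent data.frame.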
36.3.2 Copying a data.table
Let’s remind ourselves of the contents and address of dt1:
dt1
ID GCS HR
<num> <num> <num>
1: 8001 15 80
2: 8002 13 90
address(dt1)
[1] "0x107b99600"
Let’s see what happens if we assign dt1 to a new object:
dt2 <- dt1
dt2
ID GCS HR
<num> <num> <num>
1: 8001 15 80
2: 8002 13 90
address(dt2)
[1] "0x107b99600"
So far it’s the same as with data.frames.
Let’s see what happens if we edit dt2 by reference:
dt2[1, HR := 75]
dt2
ID GCS HR
<num> <num> <num>
1: 8001 15 75
2: 8002 13 90
address(dt2)
[1] "0x107b99600"
Now let’s recheck dt1:
dt1
ID GCS HR
<num> <num> <num>
1: 8001 15 75
2: 8002 13 90
dt1 has changed as well, because dt1 and dt2 are still pointing to the same object in memory!
This is crucial to remember to avoid errors and confusion.
When you want to make a copy of a data.table, you must use the copy() function.
Let’s see what happens if we use copy():
dt3 <- copy(dt1)
dt3
ID GCS HR
<num> <num> <num>
1: 8001 15 75
2: 8002 13 90
address(dt1)
[1] "0x107b99600"
address(dt3)
[1] "0x10b550200"
dt3 and dt1 are pointing to different objects in memory, so editing one does not affect the other.
dt3[1, HR := 100]
dt3
ID GCS HR
<num> <num> <num>
1: 8001 15 100
2: 8002 13 90
dt1
ID GCS HR
<num> <num> <num>
1: 8001 15 75
2: 8002 13 90
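The same aliasing applies when a data.table is passed to a function: using := inside the function modifies the caller's object. The sketch below contrasts two hypothetical helpers, flag_low_gcs_inplace() and flag_low_gcs_copy(); the helper names, the low_gcs column, and the threshold of 14 are illustrative assumptions and not part of data.table or of the examples above.

```r
library(data.table)

# Hypothetical helper: modifies its argument by reference
flag_low_gcs_inplace <- function(dt) {
  dt[, low_gcs := GCS < 14]
  invisible(dt)
}

# Hypothetical helper: works on a copy, leaving the input untouched
flag_low_gcs_copy <- function(dt) {
  out <- copy(dt)
  out[, low_gcs := GCS < 14]
  out
}

dt_demo <- data.table(ID = c(8001, 8002), GCS = c(15, 13))

# The copying helper returns a new table; dt_demo keeps its two columns
dt_flagged <- flag_low_gcs_copy(dt_demo)
untouched_after_copy <- !("low_gcs" %in% names(dt_demo))

# The in-place helper adds the column to dt_demo itself
flag_low_gcs_inplace(dt_demo)
changed_after_inplace <- "low_gcs" %in% names(dt_demo)

untouched_after_copy   # TRUE
changed_after_inplace  # TRUE
```

Which style to prefer depends on the situation: the in-place version avoids duplicating a large table, while the copying version is safer when the caller still needs the original.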