9  Data Structures

9.1 Overview

There are 5 main data structures in R:

Data Structure | Dimensionality    | Contents      | Notes
Vector         | 1D                | homogeneous   | the “base” object
Matrix         | 2D                | homogeneous   | a vector with 2 dimensions
Array          | ND                | homogeneous   | a vector with N dimensions
List           | 1D; can be nested | heterogeneous | a collection of any R objects, each of any length
Data frame     | 2D                | heterogeneous | a special kind of list: a collection of (column) vectors of any type, all of the same length

Vectors are homogeneous data structures, which means all of their elements must be of the same type (see Chapter 8), e.g. integer, double, character, logical.

Matrices and arrays are vectors with more dimensions, and as such, are also homogeneous.

Lists are the most flexible. Their elements can be any R objects, including lists, and therefore can be nested.

Data frames are a special kind of list. Their elements are one or more vectors, which can be of any type, and form columns. Therefore a data.frame is a two-dimensional data structure where rows typically correspond to cases (e.g. individuals) and columns represent variables. As such, data.frames are the most common data structure for statistical analysis.

Figure 9.1: R Data Structure summary - Best to read through this chapter first and then refer back to this figure
Tip

Check object class with class().

Check object class and contents’ types with str().

Caution

Many errors in R occur because a variable is, or gets coerced to, the wrong type or class by accident. That’s why it is essential to be able to:

  • check the type of a variable using typeof() or class()

  • convert (coerce) between types or classes using as.* functions
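
As a minimal sketch of the two steps above, the following checks the type of a variable and then coerces it. The variable names and values are made up for illustration:

```r
# Numbers stored as character, e.g. after reading them from a text file
age <- c("34", "27", "51")
typeof(age)   # "character"
class(age)    # "character"

# Coerce with an as.* function, then check again
age_int <- as.integer(age)
typeof(age_int)   # "integer"

# Arithmetic now works as expected
mean(age_int)
```

Calling mean() on the original character vector would return NA with a warning, which is the kind of accidental-type error the caution above describes.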

9.2 Vectors

A vector is the most basic and fundamental data structure in R. Other data structures are made up of one or more vectors.

x <- c(1, 3, 5, 7)
x
[1] 1 3 5 7
class(x)
[1] "numeric"
typeof(x)
[1] "double"

A vector has length() but no dim(), e.g.

length(x)
[1] 4
dim(x)
NULL

9.2.1 Initializing a vector

See Initializing vectors

9.3 Matrices

A matrix is a vector with 2 dimensions.

To create a matrix, you pass a vector to the matrix() function and specify the number of rows using nrow and/or the number of columns using ncol:

x <- matrix(21:50,
            nrow = 10, ncol = 3)
x
      [,1] [,2] [,3]
 [1,]   21   31   41
 [2,]   22   32   42
 [3,]   23   33   43
 [4,]   24   34   44
 [5,]   25   35   45
 [6,]   26   36   46
 [7,]   27   37   47
 [8,]   28   38   48
 [9,]   29   39   49
[10,]   30   40   50
class(x)
[1] "matrix" "array" 

A matrix has length (length(x)) equal to the number of all (i, j) elements or nrow * ncol (if i is the row index and j is the column index) and dimensions (dim(x)) as expected:

length(x)
[1] 30
dim(x)
[1] 10  3
nrow(x)
[1] 10
ncol(x)
[1] 3

9.3.1 Construct by row vs. by column

By default, matrices are filled by column (byrow = FALSE), e.g.

x <- matrix(1:20, nrow = 10, ncol = 2, byrow = FALSE)
x
      [,1] [,2]
 [1,]    1   11
 [2,]    2   12
 [3,]    3   13
 [4,]    4   14
 [5,]    5   15
 [6,]    6   16
 [7,]    7   17
 [8,]    8   18
 [9,]    9   19
[10,]   10   20

You can set the byrow argument to TRUE to fill the matrix by row instead:

x <- matrix(1:20, nrow = 10, ncol = 2, byrow = TRUE)
x
      [,1] [,2]
 [1,]    1    2
 [2,]    3    4
 [3,]    5    6
 [4,]    7    8
 [5,]    9   10
 [6,]   11   12
 [7,]   13   14
 [8,]   15   16
 [9,]   17   18
[10,]   19   20
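
One way to relate the two fill orders, offered here as an illustrative sketch rather than something stated in the text: filling by row gives the same values as filling a matrix with swapped dimensions by column and then transposing it with t().

```r
# Fill a 10 x 2 matrix by row
by_row <- matrix(1:20, nrow = 10, ncol = 2, byrow = TRUE)

# Fill a 2 x 10 matrix by column (the default), then transpose it
by_col_t <- t(matrix(1:20, nrow = 2, ncol = 10))

dim(by_col_t)             # 10 2
all(by_row == by_col_t)   # TRUE
```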

9.3.2 Initialize a matrix

You can initialize a matrix with some constant value, e.g. 0:

x <- matrix(0, nrow = 6, ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]    0    0    0    0
[2,]    0    0    0    0
[3,]    0    0    0    0
[4,]    0    0    0    0
[5,]    0    0    0    0
[6,]    0    0    0    0
Note

To initialize a matrix with NA values, it is most efficient to use NA of the appropriate type, e.g. NA_real_ for a numeric matrix, NA_character_ for a character matrix, etc. See NA types.

For example, to initialize a numeric matrix with NA values:

x <- matrix(NA_real_, nrow = 6, ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
[4,]   NA   NA   NA   NA
[5,]   NA   NA   NA   NA
[6,]   NA   NA   NA   NA
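
The short sketch below shows why the NA type matters: the plain NA constant is logical, so a matrix initialized with it is a logical matrix until a value of another type is assigned into it.

```r
typeof(NA)              # "logical"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

m_lgl <- matrix(NA, nrow = 2, ncol = 2)
m_dbl <- matrix(NA_real_, nrow = 2, ncol = 2)
typeof(m_lgl)           # "logical"
typeof(m_dbl)           # "double"

# Assigning a number into the logical matrix coerces the whole matrix
m_lgl[1, 1] <- 3.2
typeof(m_lgl)           # "double"
```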

9.3.3 Bind vectors by column or by row

Use cbind (“column-bind”) to convert a set of input vectors to columns of a matrix. The vectors must be of the same length:

x <- cbind(1:10, 11:20, 41:50)
x
      [,1] [,2] [,3]
 [1,]    1   11   41
 [2,]    2   12   42
 [3,]    3   13   43
 [4,]    4   14   44
 [5,]    5   15   45
 [6,]    6   16   46
 [7,]    7   17   47
 [8,]    8   18   48
 [9,]    9   19   49
[10,]   10   20   50
class(x)
[1] "matrix" "array" 

Similarly, you can use rbind (“row-bind”) to convert a set of input vectors to rows of a matrix. The vectors again must be of the same length:

x <- rbind(1:10, 11:20, 41:50)
x
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    1    2    3    4    5    6    7    8    9    10
[2,]   11   12   13   14   15   16   17   18   19    20
[3,]   41   42   43   44   45   46   47   48   49    50
class(x)
[1] "matrix" "array" 

9.3.4 Combine matrices

cbind() and rbind() can be used to combine two or more matrices, or vectors and matrices:

cbind(matrix(1, nrow = 5, ncol = 2), matrix(2, nrow = 5, ncol = 4))
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    1    2    2    2    2
[2,]    1    1    2    2    2    2
[3,]    1    1    2    2    2    2
[4,]    1    1    2    2    2    2
[5,]    1    1    2    2    2    2
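
The example above combines two matrices. As a small sketch of the vector-and-matrix case, with made-up values, a vector can be bound to a matrix as an extra column or row, provided its length matches the relevant dimension:

```r
m <- matrix(1:6, nrow = 3, ncol = 2)
v <- c(10, 20, 30)

# Add v as a third column; its length must equal nrow(m)
mv <- cbind(m, v)
dim(mv)   # 3 3

# Add a vector as a fourth row; its length must equal ncol(m)
mr <- rbind(m, c(0, 0))
dim(mr)   # 4 2
```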

9.4 Arrays

Arrays are vectors with dimensions.
You can have 1D, 2D or any number of dimensions, i.e. ND arrays.

9.4.1 One-dimensional (“1D”) array

A 1D array is just like a vector but of class array and with dim(x) equal to length(x). Remember, vectors have only length(x) and undefined dim(x).

x <- 1:10
xa <- array(1:10, dim = 10)
class(x)
[1] "integer"
is.vector(x)
[1] TRUE
length(x)
[1] 10
dim(x)
NULL
class(xa)
[1] "array"
is.vector(xa)
[1] FALSE
length(xa)
[1] 10
dim(xa)
[1] 10

It is rather unlikely you will need to use a 1D array instead of a vector.

9.4.2 Two-dimensional (“2D”) array

A 2D array is a matrix:

x <- array(1:40, dim = c(10, 4))
class(x)
[1] "matrix" "array" 
dim(x)
[1] 10  4

9.4.3 Multi-dimensional (“ND”) array

You can build an N-dimensional array:

x <- array(1:60, dim = c(5, 4, 3))
x
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

, , 2

     [,1] [,2] [,3] [,4]
[1,]   21   26   31   36
[2,]   22   27   32   37
[3,]   23   28   33   38
[4,]   24   29   34   39
[5,]   25   30   35   40

, , 3

     [,1] [,2] [,3] [,4]
[1,]   41   46   51   56
[2,]   42   47   52   57
[3,]   43   48   53   58
[4,]   44   49   54   59
[5,]   45   50   55   60
class(x)
[1] "array"

You can provide names for each dimension using the dimnames argument. It accepts a list where each element is a character vector of length equal to the length of the corresponding dimension. Using the same example as above, we pass three character vectors of length 5, 4, and 3 to match the lengths of the dimensions:

x <- array(1:60,
            dim = c(5, 4, 3),
            dimnames = list(letters[1:5],
                            c("alpha", "beta", "gamma", "delta"),
                            c("x", "y", "z")))
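
Once dimension names are set, they can be used for indexing in place of integer positions. The sketch below rebuilds the same array so that it runs on its own:

```r
x <- array(1:60,
           dim = c(5, 4, 3),
           dimnames = list(letters[1:5],
                           c("alpha", "beta", "gamma", "delta"),
                           c("x", "y", "z")))

# Names of the second dimension
dimnames(x)[[2]]

# A single element: row "b", column "gamma", slice "y"
x["b", "gamma", "y"]   # 32

# A whole slice, returned as a 5 x 4 matrix
dim(x[, , "z"])        # 5 4
```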

3D arrays can be used to represent color images. Here, just for fun, we use rasterImage() to show how you would visualize such an image:

x <- array(sample(0:255, size = 12 * 12 * 3, replace = TRUE), dim = c(12, 12, 3))
par("pty")
[1] "m"
par(pty = "s")
plot(NULL, NULL,
     xlim = c(0, 100), ylim = c(0, 100),
     axes = FALSE, ann = FALSE, pty = "s")
rasterImage(x / 255, xleft = 0, ybottom = 0, xright = 100, ytop = 100)

9.5 Lists

To define a list, pass any number of objects to list().
If these objects are passed as named arguments, the names will be used as element names:

x <- list(one = 1:4,
          two = sample(seq(0, 100, by = 0.1), size = 10),
          three = c("mango", "banana", "tangerine"),
          four = median)
class(x)
[1] "list"
str(x)
List of 4
 $ one  : int [1:4] 1 2 3 4
 $ two  : num [1:10] 64.9 37.6 67.1 7.2 77.5 10.9 29 15.9 65 34
 $ three: chr [1:3] "mango" "banana" "tangerine"
 $ four :function (x, na.rm = FALSE, ...)  
length(x)
[1] 4

9.5.1 Nested lists

Since each element can be any object, we can build nested lists:

x <- list(alpha = letters[sample(26, size = 4)],
          beta = sample(12),
          gamma = list(i = rnorm(10),
                       j = runif(10),
                       k = seq(0, 1000, length.out = 10)))
x
$alpha
[1] "o" "t" "c" "z"

$beta
 [1]  3 10  5  4  1 11  7  9  6  8  2 12

$gamma
$gamma$i
 [1]  0.71406941  0.93094109  0.69767571 -0.92266489 -0.36619439  0.31486522
 [7]  0.94298669 -0.04036322  0.14757294  1.05095575

$gamma$j
 [1] 0.73779608 0.09909672 0.13083053 0.74411395 0.15990955 0.83336543
 [7] 0.19055086 0.46871206 0.81962626 0.59315281

$gamma$k
 [1]    0.0000  111.1111  222.2222  333.3333  444.4444  555.5556  666.6667
 [8]  777.7778  888.8889 1000.0000

In the example above, alpha, beta, and gamma are x’s elements. Notice how the length of the list refers to the number of these top-level elements:

length(x)
[1] 3

9.5.2 Initialize a list

When setting up experiments, it can be very convenient to set up an empty list, where results will be stored (e.g. using a for-loop):

x <- vector("list", length = 4)
x
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL
length(x)
[1] 4
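
A minimal sketch of the for-loop pattern mentioned above follows. The names n and results, and the squaring step that stands in for a real computation, are placeholders for illustration:

```r
n <- 4
results <- vector("list", length = n)

for (i in seq_len(n)) {
  # Placeholder computation; in practice this could be a model fit, a summary, etc.
  results[[i]] <- i^2
}

length(results)   # 4
results[[3]]      # 9
```

Creating the list at its final length up front avoids growing it one element at a time inside the loop.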

9.5.3 Add element to a list

You can add a new element to a list by assigning directly to an element that doesn’t yet exist, which will cause it to be created:

x <- list(a = 1:10, b = rnorm(10))
x
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
 [1] -0.23630780 -0.79507438  0.26842609 -0.69612991 -0.14193811 -0.12717634
 [7]  1.04491005  0.73335184  0.64191453  0.02478336
x$c <- 30:21
x
$a
 [1]  1  2  3  4  5  6  7  8  9 10

$b
 [1] -0.23630780 -0.79507438  0.26842609 -0.69612991 -0.14193811 -0.12717634
 [7]  1.04491005  0.73335184  0.64191453  0.02478336

$c
 [1] 30 29 28 27 26 25 24 23 22 21

9.5.4 Combine lists

You can combine lists with c(), just like vectors:

l1 <- list(q = 11:14, r = letters[11:14])
l2 <- list(s = LETTERS[21:24], t = 100:97)
x <- c(l1, l2)
x
$q
[1] 11 12 13 14

$r
[1] "k" "l" "m" "n"

$s
[1] "U" "V" "W" "X"

$t
[1] 100  99  98  97
length(x)
[1] 4

9.6 Combining different types with c()

It’s best to use c() to either combine elements of the same type into a vector, or to combine lists.

As we’ve seen, if all arguments passed to c() are of a single type, you get a vector of that type:

x <- c(12.9, 94.67, 23.74, 46.901)
x
[1] 12.900 94.670 23.740 46.901
class(x)
[1] "numeric"

If arguments passed to c() are a mix of numeric and character, they all get coerced to character.

(x <- c(23.54, "mango", "banana", 75))
[1] "23.54"  "mango"  "banana" "75"    
class(x)
[1] "character"
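
The same kind of coercion applies to the other atomic types. As a brief sketch, c() converts its arguments to the most general type present, in the order logical, integer, double, character:

```r
typeof(c(TRUE, 2L))     # "integer"
typeof(c(2L, 3.5))      # "double"
typeof(c(TRUE, "a"))    # "character"

# With all four types present, everything becomes character
mixed <- c(TRUE, 2L, 3.5, "a")
mixed                   # "TRUE" "2" "3.5" "a"
```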

If you pass objects that cannot be coerced to an atomic type (such as functions), you get a list, since it is the only structure that can hold all of them together:

(x <- c(42, mean, "potatoes"))
[[1]]
[1] 42

[[2]]
function (x, ...) 
UseMethod("mean")
<bytecode: 0x1210e9220>
<environment: namespace:base>

[[3]]
[1] "potatoes"
class(x)
[1] "list"

9.7 Data frames

Note

A data frame is a special type of list where each element has the same length and forms a column, resulting in a 2D structure. Unlike matrices, each column can contain a different data type.

data.frames are usually created with named elements:

x <- data.frame(Feat_1 = 1:5,
                Feat_2 = rnorm(5),
                Feat_3 = paste0("rnd_", sample(seq(100), size = 5)))
x
  Feat_1     Feat_2 Feat_3
1      1  0.9665632 rnd_87
2      2  1.6218976 rnd_86
3      3  0.2895373 rnd_11
4      4 -0.3127645  rnd_6
5      5  2.8406137 rnd_22
class(x)
[1] "data.frame"
str(x)
'data.frame':   5 obs. of  3 variables:
 $ Feat_1: int  1 2 3 4 5
 $ Feat_2: num  0.967 1.622 0.29 -0.313 2.841
 $ Feat_3: chr  "rnd_87" "rnd_86" "rnd_11" "rnd_6" ...
class(x$Feat_1)
[1] "integer"
Note

Unlike a matrix, the elements of a data.frame are its columns, not the individual values in each position. Therefore the length of a data.frame is equal to the number of columns.

mat <- matrix(1:100, nrow = 10)
length(mat)
[1] 100
df <- as.data.frame(mat)
length(df)
[1] 10
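
To make the list nature of a data.frame concrete, here is a short sketch using a small made-up data frame. It shows that length() agrees with ncol(), while nrow() and dim() report the 2D shape:

```r
df <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

is.list(df)    # TRUE
length(df)     # 2, the number of columns
ncol(df)       # 2
nrow(df)       # 3
dim(df)        # 3 2
```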

Just like with lists, you can add new columns to a data.frame using assignment to a new element, i.e. column:

x <- data.frame(PIDN = sample(8001:9000, size = 10, replace = TRUE),
                Age = rnorm(10, mean = 48, sd = 2.9))
x
   PIDN      Age
1  8345 43.44100
2  8809 52.42539
3  8527 51.18651
4  8034 47.79848
5  8232 39.43173
6  8186 51.71419
7  8078 50.96995
8  8909 49.30835
9  8852 50.87468
10 8202 43.44940
x$Weight <- rnorm(10, mean = 84, sd = 1.5)
x
   PIDN      Age   Weight
1  8345 43.44100 82.57095
2  8809 52.42539 84.65434
3  8527 51.18651 85.43061
4  8034 47.79848 85.86807
5  8232 39.43173 83.91720
6  8186 51.71419 81.32832
7  8078 50.96995 84.63175
8  8909 49.30835 87.90053
9  8852 50.87468 83.98714
10 8202 43.44940 84.56448

9.8 Generating sequences

Other than assigning individual elements explicitly with c(), there are multiple ways to create numeric sequences.

Colon notation allows generating a simple integer sequence:

x <- 1:5
x
[1] 1 2 3 4 5
class(x)
[1] "integer"

seq(from, to, by)

seq(1, 10, by = 0.5)
 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0  7.5  8.0
[16]  8.5  9.0  9.5 10.0

seq(from, to, length.out = n)

seq(-5, 12, length.out = 11)
 [1] -5.0 -3.3 -1.6  0.1  1.8  3.5  5.2  6.9  8.6 10.3 12.0

seq(object) generates a sequence of length equal to length(object)

seq(iris)
[1] 1 2 3 4 5

seq_along(object) is the optimized version of seq(object):

seq_along(iris)
[1] 1 2 3 4 5

seq(n) is equivalent to 1:n

seq(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
# same output as
1:12
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

seq_len(n) is an optimized version of seq(n):

seq_len(12)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
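
Apart from speed, one practical reason to prefer seq_len() and seq_along() in loops is how they handle a length of zero. The sketch below uses a hypothetical counter to show the difference:

```r
n <- 0

1:n          # 1 0 (the colon operator counts down)
seq_len(n)   # integer(0)

x <- character(0)
seq_along(x) # integer(0)

# A loop over seq_len(0) runs zero times
count <- 0
for (i in seq_len(n)) count <- count + 1
count        # 0
```

A loop written as for (i in 1:n) would instead run twice when n is 0, with i equal to 1 and then 0.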

9.9 Naming object elements

All objects’ elements can be named.

9.9.1 Vectors

You can create a vector with named elements:

SBP <- c(before = 179, after = 118)
SBP
before  after 
   179    118 

Use names() to get a vector’s elements’ names:

names(SBP)
[1] "before" "after" 

You can add names to an existing, unnamed, vector:

N <- c(112, 120)
names(N)
NULL
names(N) <- c("Cases", "Controls")
N
   Cases Controls 
     112      120 

Matrices and data frames can have column names (colnames) and row names (rownames):

xm <- matrix(1:15, nrow = 5)
xdf <- as.data.frame(xm)
colnames(xm)
NULL
colnames(xdf)
[1] "V1" "V2" "V3"
rownames(xm)
NULL
colnames(xm) <- colnames(xdf) <- paste0("Feature", seq(3))
rownames(xm) <- rownames(xdf) <- paste0("Case", seq(5))
xm
      Feature1 Feature2 Feature3
Case1        1        6       11
Case2        2        7       12
Case3        3        8       13
Case4        4        9       14
Case5        5       10       15
xdf
      Feature1 Feature2 Feature3
Case1        1        6       11
Case2        2        7       12
Case3        3        8       13
Case4        4        9       14
Case5        5       10       15

Lists are vectors, so they have names. These can be defined when a list is created, using name-value pairs, or added/changed at any time.

x <- list(HospitalName = "CaliforniaGeneral",
          ParticipatingDepartments = c("Neurology", "Psychiatry", "Neurosurgery"),
          PatientIDs = 1001:1018)
names(x)
[1] "HospitalName"             "ParticipatingDepartments"
[3] "PatientIDs"              

Add/Change names:

names(x) <- c("Hospital", "Departments", "PIDs")
x
$Hospital
[1] "CaliforniaGeneral"

$Departments
[1] "Neurology"    "Psychiatry"   "Neurosurgery"

$PIDs
 [1] 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 1011 1012 1013 1014 1015
[16] 1016 1017 1018

Remember that a data frame is a special type of list. Therefore in data frames colnames and names are equivalent:

colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     
Tip

Note: As we saw, matrices have colnames() and rownames(). Using names() on a matrix will assign names to individual elements, as if it were a long vector.
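
A small sketch of this behavior, using a made-up 2 x 2 matrix:

```r
m <- matrix(1:4, nrow = 2)

# names() labels the four individual elements, in column-major order
names(m) <- c("a", "b", "c", "d")
names(m)      # "a" "b" "c" "d"

# The matrix keeps its dimensions, and its column names remain unset
dim(m)        # 2 2
colnames(m)   # NULL
```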

9.10 Initialize - coerce - test data structures

The following table lists the functions to initialize, coerce (=convert), and test the core data structures, which are shown in more detail in the following paragraphs:

Initialize                        | Coerce           | Test
matrix(NA, nrow = x, ncol = y)    | as.matrix(x)     | is.matrix(x)
array(NA, dim = c(x, y, z))       | as.array(x)      | is.array(x)
vector(mode = "list", length = x) | as.list(x)       | is.list(x)
data.frame(matrix(NA, x, y))      | as.data.frame(x) | is.data.frame(x)
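
As an illustrative sketch of how these functions fit together, the following initializes a matrix, coerces it through the other structures, and tests the result at each step. The object names are arbitrary:

```r
# Initialize and test a matrix
m <- matrix(NA, nrow = 2, ncol = 3)
is.matrix(m)        # TRUE

# Coerce to a data.frame
df <- as.data.frame(m)
is.data.frame(df)   # TRUE
is.matrix(df)       # FALSE

# Coerce to a plain list: one element per column
l <- as.list(df)
is.list(l)          # TRUE
length(l)           # 3

# Coerce back to a matrix
m2 <- as.matrix(df)
is.matrix(m2)       # TRUE
```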

9.11 Attributes

R objects may have some built-in attributes, but you can also add arbitrary attributes to any R object. These are used to store additional information, sometimes called metadata.

9.11.2 Get or set specific attributes

You can assign new attributes using attr:

(x <- c(1:10))
 [1]  1  2  3  4  5  6  7  8  9 10
attr(x, "name") <- "Very special vector"

Printing the vector after adding a new attribute prints the attribute name and value underneath the vector itself:

x
 [1]  1  2  3  4  5  6  7  8  9 10
attr(,"name")
[1] "Very special vector"

Our trusty str function will print attributes as well:

str(x)
 int [1:10] 1 2 3 4 5 6 7 8 9 10
 - attr(*, "name")= chr "Very special vector"

9.11.2.1 A matrix is a vector - a closer look

Let’s see how a matrix is literally just a vector with assigned dimensions.
Start with a vector of length 20:

x <- 1:20
x
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

The vector has no attributes - yet:

attributes(x)
NULL

To convert to a matrix, we would normally pass our vector to the matrix() function and define number of rows and/or columns:

xm <- matrix(x, nrow = 5)
xm
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
attributes(xm)
$dim
[1] 5 4

Just for demonstration, let’s instead directly add a dimension attribute to our vector:

attr(x, "dim") <- c(5, 4)
x
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
class(x)
[1] "matrix" "array" 

Just like that, we have created a matrix.
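
The reverse also holds. As a short sketch, removing the dim attribute turns the matrix back into a plain vector; the dim<- replacement function is used here as a shorthand for attr(x, "dim") <-:

```r
x <- 1:20

# Add dimensions: x is now a matrix
dim(x) <- c(5, 4)
is.matrix(x)         # TRUE

# Remove dimensions: x is a plain vector again
dim(x) <- NULL
is.matrix(x)         # FALSE
attributes(x)        # NULL
identical(x, 1:20)   # TRUE
```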