8  Data Types & Vectors

8.1 Basic concepts: Data types

Most programming languages, including all languages used for data analysis, like R, Python, and Julia, have a set of data types for holding different kinds of data, like numbers or text.

Any time you are working with data, you have to ensure your variables are represented using the correct data type.

Figure 8.1: Common data types in R.

8.2 Base types

The simplest and most fundamental object in R is the atomic vector: a one-dimensional collection of elements that all share the same data type, e.g. all numbers or all characters.

For example, a numeric vector may consist of elements 12, 14, 20, and a character vector may consist of elements "x", "y", "apple", "banana".

Vectors can exist as stand-alone objects, or they can exist within other data structures, e.g. data.frames, lists, etc.

This chapter covers different atomic vectors, and the next covers data structures (Chapter 9).

R includes a number of built-in data types. These are defined by R; users cannot define their own data types.

Users can, however, define their own classes (Chapter 33).

The main/most common data types in R are:

  • numeric, including integer and double
  • character
  • logical (i.e. TRUE, FALSE, or NA, a.k.a. Boolean)

Other data types include environments and closures i.e. functions (Chapter 21).

8.3 Assignment

Use <- for all assignments:

x <- 3
# You can add comments within code blocks using the usual "#" prefix
Note

In RStudio, the keyboard shortcut for the assignment operator <- is Option - (macOS) or Alt - (Windows).

Typing the name of an object, e.g.

x
[1] 3

is equivalent to printing it, e.g.

print(x)
[1] 3

You can also place any assignment in parentheses and this will perform the assignment and print the object:

(x <- 3)
[1] 3
Note

You can use either <- or = for assignment. However, many R style guides advise using <- for assignment and = for passing arguments to functions.

You can assign the same value to multiple objects - this can be useful when initializing variables.

x <- z <- init <- 0
x
[1] 0
z
[1] 0
init
[1] 0

Excitingly, R allows assignment in the opposite direction as well:

10 -> x
x
[1] 10

We shall see later that the -> assignment can be convenient at the end of a pipe.

You can even do the following, which is fun, if not particularly useful:

x <- 7 -> z
x
[1] 7
z
[1] 7
Note

It’s good practice to use clear and descriptive names for all objects you create.

For multi-word names, snake case is a good option:

admission_date, age_at_onset, etc.

Caution

Avoid naming new objects using names of built-in commands. For example, avoid assigning your data to an object named data, since that could conflict with the built-in function data().

8.4 Create vectors with c()

Use c() to combine multiple values into a vector:

x <- c(-12, 3.5, 104)
x
[1] -12.0   3.5 104.0

8.5 Get the type of a vector using typeof()

typeof(x)
[1] "double"
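typeof() is not limited to vectors created with c(). As a small sketch using only base R (the variable name types is purely illustrative), it can be called on single values, functions, and environments:

```r
# typeof() reports the internal type of any R object
types <- c(
  typeof(3.5),                  # "double"
  typeof(3L),                   # "integer"
  typeof("apple"),              # "character"
  typeof(TRUE),                 # "logical"
  typeof(function(x) x + 1),    # "closure"
  typeof(globalenv())           # "environment"
)
types
```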

8.6 Common vector types

Let’s create some example vectors of the most common data types:

8.6.1 Integer

Numeric vectors default to double:

v <- c(12, 14, 23)
v
[1] 12 14 23
typeof(v)
[1] "double"

To create an integer vector, you can follow each number with an “L”:

vi <- c(12L, 14L, 23L)
vi
[1] 12 14 23
typeof(vi)
[1] "integer"

Alternatively, you can coerce a double to integer using as.integer():

vi <- as.integer(c(12, 14, 23))
vi
[1] 12 14 23
typeof(vi)
[1] "integer"
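The difference between integer and double is easy to overlook because both print the same way. The following sketch (variable names a and b are illustrative) shows how the two types compare, and that as.integer() truncates toward zero rather than rounding:

```r
# Whole numbers typed without "L" are stored as double
a <- 5
b <- 5L

typeof(a)                  # "double"
typeof(b)                  # "integer"

# == compares values, so the two are considered equal ...
a == b                     # TRUE

# ... but identical() also compares types
identical(a, b)            # FALSE

# as.integer() drops the fractional part (truncates toward zero)
as.integer(c(2.9, -2.9))   # 2 -2
```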

8.6.2 Double

vd <- c(1.3, 2.8, 3.6)
vd
[1] 1.3 2.8 3.6
typeof(vd)
[1] "double"

8.6.3 Character

A character vector consists of one or more elements, each of which is a string of zero or more characters, i.e. it is not a vector of single characters. (The length of a character vector is the number of elements, and is not related to the number of characters in each element.)

vc <- c("a", "d", "s")
vc
[1] "a" "d" "s"
typeof(vc)
[1] "character"
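To make the distinction between vector length and string length concrete, here is a short sketch using base R's length() and nchar() (the vector fruit is a made-up example):

```r
fruit <- c("apple", "kiwi", "")

# Number of elements in the vector
length(fruit)   # 3

# Number of characters in each element
nchar(fruit)    # 5 4 0
```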

8.6.4 Logical

Logical vectors typically consist of TRUE and FALSE values, but may also consist of NA (missing value). One important use of logical vectors is in indexing (Chapter 10).

When you are writing code, use TRUE and FALSE.
During interactive work, you can abbreviate to T and F.

vl <- c(TRUE, FALSE, FALSE)
vl
[1]  TRUE FALSE FALSE
typeof(vl)
[1] "logical"
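One reason to spell out TRUE and FALSE in code is that T and F are ordinary variables that can be reassigned, while TRUE and FALSE are reserved words. The sketch below uses a hypothetical function, check_flag(), to keep the reassignment local:

```r
# T is reassigned inside the function only, so the global T is untouched
check_flag <- function() {
  T <- 0        # hypothetical accidental reassignment
  isTRUE(T)     # T no longer holds TRUE here
}

check_flag()    # FALSE
isTRUE(TRUE)    # TRUE
```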

8.7 Initialize vectors

Initializing a vector or other data structure means creating an object of a certain size with some initial values, e.g. all zeros or all NA, which are later replaced with other values.

This is usually computationally more efficient than starting with a small object and appending to it multiple times.

You can create and initialize vectors of a specific type either with the vector command, specifying a mode, or by calling the relevant type-specific function directly:

(xl <- vector(mode = "logical", length = 10))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(xd <- vector(mode = "double", length = 10))
 [1] 0 0 0 0 0 0 0 0 0 0
(xn <- vector(mode = "numeric", length = 10)) # same as "double"
 [1] 0 0 0 0 0 0 0 0 0 0
(xi <- vector(mode = "integer", length = 10))
 [1] 0 0 0 0 0 0 0 0 0 0
(xc <- vector(mode = "character", length = 10))
 [1] "" "" "" "" "" "" "" "" "" ""

The following functions are shorthand for the vector calls above (print their source code to see for yourself):

(xl <- logical(10))
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(xd <- double(10))
 [1] 0 0 0 0 0 0 0 0 0 0
(xn <- numeric(10)) # same as double
 [1] 0 0 0 0 0 0 0 0 0 0
(xi <- integer(10))
 [1] 0 0 0 0 0 0 0 0 0 0
(xc <- character(10))
 [1] "" "" "" "" "" "" "" "" "" ""
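A typical use of an initialized vector is to fill it inside a loop. The following is a minimal sketch of that pattern; the names n and squares are illustrative:

```r
n <- 5

# Preallocate a double vector of the final size
squares <- double(n)

# Replace the initial zeros one element at a time
for (i in seq_len(n)) {
  squares[i] <- i^2
}

squares   # 1 4 9 16 25
```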

8.8 Explicit coercion

We can explicitly convert a vector of one type to a different type using as.* functions:

x <- c(1.2, 2.3, 3.4)
as.logical(x)
[1] TRUE TRUE TRUE
as.double(x)
[1] 1.2 2.3 3.4
as.numeric(x)
[1] 1.2 2.3 3.4
as.integer(x)
[1] 1 2 3
as.character(x)
[1] "1.2" "2.3" "3.4"

Logical vectors are converted to 1s and 0s as expected, where TRUE becomes 1 and FALSE becomes 0, e.g.

x <- c(TRUE, TRUE, FALSE)
as.numeric(x)
[1] 1 1 0

Note that when converting from numeric to logical, anything other than zero is TRUE:

x <- seq(-2, 2, by = 0.5)
x
[1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0
as.logical(x)
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

Not all conversions are possible.
There is no meaningful way to convert arbitrary text to numbers; only strings that represent numbers, such as "3.5", can be converted.
The following returns NA values and prints a helpful warning message.

x <- c("mango", "banana", "tangerine")
as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
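When a character vector mixes numeric strings with other text, the NA values produced by coercion can be used to locate the entries that failed to convert. A sketch with made-up data, using suppressWarnings() to silence the expected warning:

```r
x <- c("1.5", "20", "mango")

# Strings that represent numbers convert; others become NA
x_num <- suppressWarnings(as.numeric(x))
x_num              # 1.5 20.0 NA

# Which original entries could not be converted?
x[is.na(x_num)]    # "mango"
```

Note that this simple check would also flag entries that were already NA before conversion.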

8.9 Implicit coercion

Remember, the language generally tries to make life easier. Sometimes this means it will automatically coerce one class to another to allow requested operations.

For example, you can get the sum of a logical vector.
It will automatically be converted to numeric as we saw earlier.

x <- c(TRUE, TRUE, FALSE)
sum(x)
[1] 2

On the other hand, you cannot sum a factor, for example.
You get an error with an explanation:

x <- factor(c("mango", "banana", "mango"))
sum(x)
Error in Summary.factor(structure(c(2L, 1L, 2L), levels = c("banana", : 'sum' not meaningful for factors
Caution

Many errors in R occur because a variable is, or gets coerced to, the wrong type or class (see Chapter 9) by accident. That’s why it is essential to be able to:

  • check the type of a variable using typeof() or class()
  • convert (coerce) between types or classes using as.* functions
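Implicit coercion also happens when values of different types are combined with c(). Because an atomic vector holds a single type, the result takes the most flexible type present, in the order logical, integer, double, character. A brief sketch in base R:

```r
typeof(c(TRUE, 2L))      # "integer"
typeof(c(2L, 3.5))       # "double"
typeof(c(3.5, "a"))      # "character"

# Everything becomes character if any element is character
c(TRUE, 2L, 3.5, "a")    # "TRUE" "2" "3.5" "a"
```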

8.10 NA: Missing value

Missing values in any data type - logical, integer, double, or character - are coded using NA.
To check for the presence of NA values, use is.na():

x <- c(1.2, 5.3, 4.8, NA, 9.6)
x
[1] 1.2 5.3 4.8  NA 9.6
is.na(x)
[1] FALSE FALSE FALSE  TRUE FALSE
x <- c("mango", "banana", NA, "sugar", "ackee")
x
[1] "mango"  "banana" NA       "sugar"  "ackee" 
is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, NA)
x
[1]  TRUE  TRUE FALSE  TRUE FALSE FALSE    NA
is.na(x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

is.na() works similarly on matrices:

x <- matrix(1:20, nrow = 5)
x[4, 3] <- NA
is.na(x)
      [,1]  [,2]  [,3]  [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE  TRUE FALSE
[5,] FALSE FALSE FALSE FALSE
Note

Note that is.na() returns a response for each element (i.e. is vectorized) in contrast to is.numeric(), is.logical(), etc. The latter are checking the type of an object, while the former is checking individual elements.

anyNA() is a very useful function to check if there are one or more missing values in an object, e.g.

anyNA(x)
[1] TRUE
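Because is.na() returns a logical vector, it can be combined with other base functions to count, locate, or drop missing values. A short sketch with made-up data:

```r
x <- c(1.2, 5.3, NA, 4.8, NA)

sum(is.na(x))      # number of missing values: 2
which(is.na(x))    # positions of missing values: 3 5
mean(is.na(x))     # proportion missing: 0.4
x[!is.na(x)]       # non-missing values: 1.2 5.3 4.8
```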
Note

Operations on NA values result in NA.

x <- c(1.2, 5.3, 4.8, NA, 9.6)
x*2
[1]  2.4 10.6  9.6   NA 19.2

Many functions that accept as input an object with multiple values (a vector, a matrix, a data.frame, etc.) will return NA if any element is NA:

mean(x)
[1] NA
median(x)
[1] NA
sd(x)
[1] NA
min(x)
[1] NA
max(x)
[1] NA
range(x)
[1] NA NA

First, make sure NA values represent legitimate missing data and not some error. Then, decide how you want to handle it.

In all of the above commands you can pass na.rm = TRUE to ignore NA values:

mean(x, na.rm = TRUE)
[1] 5.225
median(x, na.rm = TRUE)
[1] 5.05
sd(x, na.rm = TRUE)
[1] 3.441293
min(x, na.rm = TRUE)
[1] 1.2
max(x, na.rm = TRUE)
[1] 9.6
range(x, na.rm = TRUE)
[1] 1.2 9.6

More generally, you can use na.exclude() to exclude NA values from R objects. This can be very useful for functions that do not include an na.rm or similar argument to handle NA values.

x <- c(1, 2, NA, 4)
na.exclude(x)
[1] 1 2 4
attr(,"na.action")
[1] 3
attr(,"class")
[1] "exclude"

On a data.frame, na.exclude() excludes rows with any NAs:

df <- data.frame(a = c(1, 2, NA, 4),
                 b = c(11, NA, 13, 14))
na.exclude(df)
  a  b
1 1 11
4 4 14
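Before dropping rows, it can help to see which rows are affected and how many values are missing per column. The sketch below uses the base functions complete.cases() and colSums() on the same example data; it is one possible approach, not the only one:

```r
df <- data.frame(a = c(1, 2, NA, 4),
                 b = c(11, NA, 13, 14))

# TRUE for rows without any NA
complete.cases(df)          # TRUE FALSE FALSE TRUE

# Keep only the complete rows (rows 1 and 4)
df[complete.cases(df), ]

# Number of missing values per column (1 in a, 1 in b)
colSums(is.na(df))
```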

Chapter 29 describes some approaches to handling missing data in the context of statistics or machine learning.

8.11 NA types

In the above examples, NA was used in vectors of different types. In reality, NA is a logical constant of length 1 that gets coerced to the type of the vector it is placed in. To specify NA of a specific type, use the appropriate NA_* constant:

  • NA_integer_
  • NA_real_
  • NA_complex_
  • NA_character_

See ?NA for more details. These can be useful when you want to initialize a vector/matrix/array of a specific type with NA values (for example, see (#initmatrix)).
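The types of these constants can be checked with typeof(). The following sketch also shows an NA-filled double vector created with rep(); the name x is illustrative:

```r
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

# A plain NA is coerced to the type of the vector it is placed in
typeof(c(1.5, NA))      # "double"

# Initialize a double vector of five missing values
x <- rep(NA_real_, 5)
typeof(x)               # "double"
```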

8.12 NaN: Not a number

NaN is a special case of NA and can be the result of undefined mathematical operations:

a <- log(-4)
Warning in log(-4): NaNs produced

Note that class() returns “numeric”:

class(a)
[1] "numeric"

To test for NaNs, use is.nan():

is.nan(a)
[1] TRUE

NaNs are also NA:

is.na(a)
[1] TRUE

But the opposite is not true:

is.nan(NA)
[1] FALSE
Note

NaN can be considered a subtype of NA: is.na(NaN) is TRUE, but is.nan(NA) is FALSE.
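The same relationship holds element-wise in a vector that contains both kinds of values. A short sketch with made-up data:

```r
x <- c(1, NA, NaN, 0/0)   # 0/0 is another way to produce NaN

is.na(x)    # FALSE  TRUE  TRUE  TRUE
is.nan(x)   # FALSE FALSE  TRUE  TRUE
```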

8.13 NULL: The empty object

The NULL object represents an empty object.

Note

NULL means empty, not missing, and is therefore entirely different from NA.
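A few base R calls illustrate the difference: NULL has length 0 and disappears when combined with other values, whereas NA occupies a position.

```r
length(NULL)     # 0
length(NA)       # 1

is.null(NULL)    # TRUE
is.null(NA)      # FALSE

c(1, NULL, 3)    # 1 3     (NULL is dropped)
c(1, NA, 3)      # 1 NA 3  (NA is kept)
```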

NULL shows up, for example, when initializing a list:

a <- vector("list", length = 4)
a
[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

and it can be replaced normally:

a[[1]] <- 3
a
[[1]]
[1] 3

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

8.13.1 Replacing with NULL

You cannot replace one or more elements of a vector/matrix/array with NULL, because NULL has length 0 and a replacement value must have a length of at least 1:

a <- 11:15
a
[1] 11 12 13 14 15
a[1] <- NULL
Error in a[1] <- NULL: replacement has length zero

However, in lists, and therefore also data frames (see Chapter 25), replacing an element with NULL removes that element:

al <- list(alpha = 11:15,
           beta = rnorm(10),
           gamma = c("mango", "banana", "tangerine"))
al
$alpha
[1] 11 12 13 14 15

$beta
 [1] -0.7723829 -0.3740601  1.0123050  0.5800966 -0.8030313 -1.0591240
 [7] -0.6634399  0.2035013 -0.5068714 -1.1689913

$gamma
[1] "mango"     "banana"    "tangerine"
al[[2]] <- NULL
al
$alpha
[1] 11 12 13 14 15

$gamma
[1] "mango"     "banana"    "tangerine"

Finally, NULL is often used as the default value in a function’s argument. The function definition must then determine what the default behavior/value should be.
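As a minimal sketch of this pattern, the hypothetical function describe() below takes an optional label argument that defaults to NULL and uses is.null() to decide whether to build a default:

```r
# describe() and its arguments are illustrative, not part of base R
describe <- function(x, label = NULL) {
  # If no label was supplied, construct one from the type of x
  if (is.null(label)) {
    label <- paste0("vector of type ", typeof(x))
  }
  paste0(label, ", length ", length(x))
}

describe(1:3)                  # "vector of type integer, length 3"
describe(1:3, label = "ids")   # "ids, length 3"
```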

8.14 Named vectors

While not very common, you can name the elements of a vector. You can do so either when creating the vector or after the fact.

First, create a vector without names, in this example, a character vector:

v <- c("UCSF", "Stanford", "Penn")
v
[1] "UCSF"     "Stanford" "Penn"    

Now, create a named vector and notice how the element names are displayed above the element values when the vector is printed:

site <- c(SiteA = "UCSF", SiteB = "Stanford", SiteC = "Penn")
site
     SiteA      SiteB      SiteC 
    "UCSF" "Stanford"     "Penn" 

Note that v has no names, therefore the following returns NULL:

names(v)
NULL

while site has names:

names(site)
[1] "SiteA" "SiteB" "SiteC"

If we want to add names to the elements of v, we can do so like this:

names(v) <- c("SiteA", "SiteB", "SiteC")
v
     SiteA      SiteB      SiteC 
    "UCSF" "Stanford"     "Penn" 

Similarly, if we want to change or replace the names of site, we can do so like this:

names(site) <- c("Site_1", "Site_2", "Site_3")
site
    Site_1     Site_2     Site_3 
    "UCSF" "Stanford"     "Penn" 

Lastly, if we want to remove the names of site, we can replace them with NULL:

names(site) <- NULL
site
[1] "UCSF"     "Stanford" "Penn"    
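One practical use of names is to select elements by name instead of by position. The sketch below recreates the named vector from above and uses base R's [, [[, and unname():

```r
site <- c(SiteA = "UCSF", SiteB = "Stanford", SiteC = "Penn")

# Single brackets keep the name of the selected element
site["SiteB"]

# Double brackets return the bare value
site[["SiteB"]]    # "Stanford"

# unname() returns a copy without names, leaving site unchanged
unname(site)
```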

8.15 Initialize - coerce - test vectors

The following summary table lists the functions to initialize, coerce (=convert), and test the main different vector types:

Initialize     Coerce            Test
logical(n)     as.logical(x)     is.logical(x)
integer(n)     as.integer(x)     is.integer(x)
double(n)      as.double(x)      is.double(x)
character(n)   as.character(x)   is.character(x)
Caution

The usage of the terms double and numeric across functions is unfortunately inconsistent in R.

Therefore, to promote clarity, prefer using double() instead of numeric() and as.double() instead of as.numeric().
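The inconsistency can be seen by comparing the test functions on an integer and a double vector. In this sketch (xi and xd are illustrative names), is.numeric() is TRUE for both, while is.double() distinguishes them:

```r
xi <- 1:3           # integer
xd <- c(1, 2, 3)    # double

is.numeric(xi)      # TRUE
is.numeric(xd)      # TRUE

is.double(xi)       # FALSE
is.double(xd)       # TRUE

# class() reports "numeric" for a double vector, typeof() reports "double"
class(xd)           # "numeric"
typeof(xd)          # "double"
class(xi)           # "integer"
```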