8 Data Types & Vectors
8.1 Basic concepts: Data types
Most programming languages, including all languages used for data analysis, like R, Python, and Julia, have a set of data types for holding different kinds of data, like numbers or text.
Any time you are working with data, you have to ensure your variables are represented using the correct data type.
8.2 Base types
The simplest and most fundamental object in R is the vector: a one-dimensional collection of elements of the same data type, e.g. numbers, characters, etc. (known as an “atomic” vector).
For example, a numeric vector may consist of elements 12, 14, 20, and a character vector may consist of elements "x", "y", "apple", "banana".
Vectors can exist as stand-alone objects, or they can exist within other data structures, e.g. data.frames, lists, etc.
This chapter covers different atomic vectors, and the next covers data structures (Chapter 9).
R includes a number of built-in data types. These are defined by R - users cannot define their own data types.
Users can, however, define their own classes (Chapter 33).
The main/most common data types in R are integer, double, character, and logical (see Section 8.6).
Other data types include environments and closures i.e. functions (Chapter 21).
8.3 Assignment
Use <- for all assignments:
x <- 3
# You can add comments within code blocks using the usual "#" prefix
In RStudio, the keyboard shortcut for the assignment operator <- is Option + - (macOS) or Alt + - (Windows).
Typing the name of an object, e.g.
x
[1] 3
is equivalent to printing it, e.g.
print(x)
[1] 3
You can also place any assignment in parentheses and this will perform the assignment and print the object:
(x <- 3)
[1] 3
You can use either <- or = for assignment. However, many R style guides advise using <- for assignment and = for passing arguments to functions.
You can assign the same value to multiple objects - this can be useful when initializing variables.
x <- z <- init <- 0
x
[1] 0
z
[1] 0
init
[1] 0
Excitingly, R allows assignment in the opposite direction as well:
10 -> x
x
[1] 10
We shall see later that the -> assignment can be convenient at the end of a pipe.
You can even do the following, which is fun, if not particularly useful:
x <- 7 -> z
x
[1] 7
z
[1] 7
It’s good practice to use clear and descriptive names for all objects you create.
For multi-word names, snake case is a good option:
admission_date, age_at_onset, etc.
Avoid naming new objects using names of built-in commands. For example, avoid assigning your data to an object named data, since that could conflict with the built-in function data().
8.4 Create vectors with c()
Use c() to combine multiple values into a vector:
x <- c(-12, 3.5, 104)
x
[1] -12.0 3.5 104.0
8.5 Get the type of a vector using typeof()
typeof(x)
[1] "double"
8.6 Common vector types
Let’s create some example vectors of the most common data types:
8.6.1 Integer
Numeric vectors default to double.
To create an integer vector, follow each number with an “L”.
Alternatively, you can coerce a double to integer using as.integer(), which drops the fractional part.
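As a brief illustration of the points above, the following sketch creates the same values as double and as integer. The object names xd, xi, and xc are placeholders chosen for this example.

```r
# Numbers typed without a suffix are stored as double
xd <- c(3, 5, 7)
typeof(xd)   # "double"

# Appending "L" to each number creates an integer vector
xi <- c(3L, 5L, 7L)
typeof(xi)   # "integer"

# Coercing a double vector to integer
xc <- as.integer(xd)
typeof(xc)   # "integer"
```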
8.6.2 Double
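A minimal sketch of a double vector, assuming nothing beyond base R (the name dbl is a placeholder):

```r
# Numbers with or without a decimal point are stored as double by default
dbl <- c(1.2, 3.4, 10)
typeof(dbl)      # "double"
is.double(dbl)   # TRUE
```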
8.6.3 Character
A character vector consists of one or more elements, each of which is a string of zero or more characters, i.e. it is not a vector of single characters. The length of a character vector is the number of elements and is unrelated to the number of characters in each element.
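The difference between the number of elements and the number of characters per element can be seen in a short sketch (the vector name fruit is a placeholder):

```r
fruit <- c("mango", "banana", "kiwi")

length(fruit)   # 3: the number of elements
nchar(fruit)    # 5 6 4: the number of characters in each element
```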
8.6.4 Logical
Logical vectors typically consist of TRUE and FALSE values, but may also consist of NA (missing value). One important use of logical vectors is in indexing (Chapter 10).
When you are writing code, use TRUE and FALSE.
During interactive work, you can abbreviate to T and F.
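The sketch below shows a logical vector produced by a comparison, and one likely reason for preferring TRUE and FALSE in code: T and F are ordinary variables that can be overwritten, whereas TRUE and FALSE are reserved words. The names age, over_45, and t_is_true are placeholders for this example.

```r
# A comparison returns a logical vector; NA stays NA
age <- c(34, 61, 47, NA)
over_45 <- age > 45
over_45           # FALSE TRUE TRUE NA
typeof(over_45)   # "logical"

# T is a regular variable, so it can be reassigned (here inside local(),
# so that the global T is left untouched)
t_is_true <- local({
  T <- 0
  isTRUE(T)
})
t_is_true         # FALSE
```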
8.7 Initialize vectors
Initializing a vector or other data structure is the process by which you create an object of a certain size with some initial values, e.g. all zeros or all NA, in order to replace them with other values later.
This is usually computationally more efficient than starting with a small object and appending to it multiple times.
You can create/initialize a vector of a specific type using the vector() command and specifying a mode, or directly by calling the relevant type-specific function:
(xl <- vector(mode = "logical", length = 10))
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(xd <- vector(mode = "double", length = 10))
[1] 0 0 0 0 0 0 0 0 0 0
(xn <- vector(mode = "numeric", length = 10)) # same as "double"
[1] 0 0 0 0 0 0 0 0 0 0
(xi <- vector(mode = "integer", length = 10))
[1] 0 0 0 0 0 0 0 0 0 0
(xc <- vector(mode = "character", length = 10))
[1] "" "" "" "" "" "" "" "" "" ""
The type-specific functions logical(), double(), numeric(), integer(), and character() are aliases of the vector() command above (print their source code to see for yourself).
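A minimal sketch of the type-specific initializers, followed by the initialize-then-fill pattern described at the start of this section. The names n and squares are placeholders for this example.

```r
# Type-specific initializers
logical(3)     # FALSE FALSE FALSE
double(3)      # 0 0 0
integer(3)     # 0 0 0
character(3)   # "" "" ""

# Initialize a vector of the final size, then fill it in place
n <- 5
squares <- double(n)
for (i in seq_len(n)) {
  squares[i] <- i^2
}
squares        # 1 4 9 16 25
```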
8.8 Explicit coercion
We can explicitly convert a vector of one type to a different type using the as.* functions:
x <- c(1.2, 2.3, 3.4)
as.logical(x)
[1] TRUE TRUE TRUE
as.double(x)
[1] 1.2 2.3 3.4
as.numeric(x)
[1] 1.2 2.3 3.4
as.integer(x)
[1] 1 2 3
as.character(x)
[1] "1.2" "2.3" "3.4"
Logical vectors are converted to 1s and 0s as expected, where TRUE becomes 1 and FALSE becomes 0, e.g.
x <- c(TRUE, TRUE, FALSE)
as.numeric(x)
[1] 1 1 0
Note that when converting from numeric to logical, anything other than zero is TRUE:
x <- seq(-2, 2, by = 0.5)
x
[1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
as.logical(x)
[1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
Not all conversions are possible.
There is no meaningful way to convert arbitrary text to numeric: only strings that represent numbers (e.g. "3.5") can be converted.
The following outputs NA values and prints a helpful warning message.
x <- c("mango", "banana", "tangerine")
as.numeric(x)
Warning: NAs introduced by coercion
[1] NA NA NA
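By contrast, strings that represent numbers are converted, and only the elements that cannot be parsed become NA. The sketch below uses suppressWarnings() to silence the coercion warning; the names x_chr and x_num are placeholders.

```r
x_chr <- c("3.5", "42", "mango")

# Only "mango" cannot be parsed as a number
x_num <- suppressWarnings(as.numeric(x_chr))
x_num          # 3.5 42.0 NA
is.na(x_num)   # FALSE FALSE TRUE
```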
8.9 Implicit coercion
Remember, the language generally tries to make life easier. Sometimes this means it will automatically coerce one class to another to allow requested operations.
For example, you can get the sum of a logical vector.
It will automatically be converted to numeric as we saw earlier.
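A short sketch of this behavior (the name lgl is a placeholder). Combining different types with c() is another case where values are coerced automatically.

```r
lgl <- c(TRUE, FALSE, TRUE, TRUE)
sum(lgl)    # 3: the number of TRUE values
mean(lgl)   # 0.75: the proportion of TRUE values

# c() coerces its inputs to a common type
typeof(c(TRUE, 2.5))        # "double"
typeof(c(TRUE, 2.5, "a"))   # "character"
```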
On the other hand, you cannot sum a factor, for example.
You get an error with an explanation:
Error in Summary.factor(structure(c(2L, 1L, 2L), levels = c("banana", : 'sum' not meaningful for factors
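An error like the one above can be reproduced with a small factor. In this sketch, the name f and the use of tryCatch() to capture the error message are choices made for illustration.

```r
f <- factor(c("mango", "banana", "mango"))

# tryCatch() captures the error message instead of stopping execution
msg <- tryCatch(sum(f), error = function(e) conditionMessage(e))
msg

# Counting elements per level is often what is wanted instead
counts <- table(f)
counts
```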
8.10 NA: Missing value
Missing values in any data type - logical, integer, double, or character - are coded using NA.
To check for the presence of NA values, use is.na():
x <- c("mango", "banana", NA, "sugar", "ackee")
x
[1] "mango" "banana" NA "sugar" "ackee"
is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
x <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, NA)
x
[1] TRUE TRUE FALSE TRUE FALSE FALSE NA
is.na(x)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
is.na() works similarly on matrices:
[,1] [,2] [,3] [,4]
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE TRUE FALSE
[5,] FALSE FALSE FALSE FALSE
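The input that produced the output above is not shown. A matrix like the following, with values chosen purely for illustration, gives the same pattern of a single NA in row 4, column 3:

```r
# 5 x 4 integer matrix with one missing value
m <- matrix(1:20, nrow = 5)
m[4, 3] <- NA

is.na(m)

# Row and column position of each NA
which(is.na(m), arr.ind = TRUE)
```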
Note that is.na() returns a response for each element (i.e. is vectorized) in contrast to is.numeric(), is.logical(), etc. The latter check the type of an object, while the former checks individual elements.
anyNA() is a very useful function to check if there are one or more missing values in an object, e.g.
anyNA(x)
[1] TRUE
Operations on NA values result in NA.
x <- c(1.2, 5.3, 4.8, NA, 9.6)
x*2
[1] 2.4 10.6 9.6 NA 19.2
Many functions that accept as input an object with multiple values (a vector, a matrix, a data.frame, etc.) will return NA if any element is NA, e.g. mean(), median(), sd(), min(), max(), and range().
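For instance, each of the following summary functions returns NA for the vector x defined above (repeated here so that the sketch is self-contained):

```r
x <- c(1.2, 5.3, 4.8, NA, 9.6)

mean(x)     # NA
median(x)   # NA
sd(x)       # NA
min(x)      # NA
max(x)      # NA
range(x)    # NA NA
```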
First, make sure NA values represent legitimate missing data and not some error. Then, decide how you want to handle them.
In all of the above commands you can pass na.rm = TRUE to ignore NA values:
mean(x, na.rm = TRUE)
[1] 5.225
median(x, na.rm = TRUE)
[1] 5.05
sd(x, na.rm = TRUE)
[1] 3.441293
min(x, na.rm = TRUE)
[1] 1.2
max(x, na.rm = TRUE)
[1] 9.6
range(x, na.rm = TRUE)
[1] 1.2 9.6
More generally, you can use na.exclude() to exclude NA values from R objects. This can be very useful for functions that do not include an na.rm or similar argument to handle NA values.
x <- c(1, 2, NA, 4)
na.exclude(x)
[1] 1 2 4
attr(,"na.action")
[1] 3
attr(,"class")
[1] "exclude"
On a data.frame, na.exclude() excludes rows with any NAs:
df <- data.frame(a = c(1, 2, NA, 4),
b = c(11, NA, 13, 14))
na.exclude(df)
a b
1 1 11
4 4 14
Chapter 29 describes some approaches to handling missing data in the context of statistics or machine learning.
8.11 NA types
In the above examples, NA was used in vectors of different types. In reality, NA is a logical constant of length 1 that gets coerced to the type of the vector it is placed in. To specify NA of a specific type, use the appropriate NA_* constant:
NA_integer_
NA_real_
NA_complex_
NA_character_
See ?NA for more details. These can be useful when you want to initialize a vector/matrix/array of a specific type with NA values.
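The types of the NA constants, and their use for initialization, can be checked with a short sketch (the name x_na is a placeholder):

```r
typeof(NA)              # "logical"
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"

# A plain NA is coerced to the type of the vector it is placed in
typeof(c(1.5, NA))      # "double"

# Initialize a double vector of length 5 filled with NA
x_na <- rep(NA_real_, 5)
typeof(x_na)            # "double"
```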
8.12 NaN: Not a number
NaN is a special case of NA and can be the result of undefined mathematical operations:
a <- log(-4)
Warning in log(-4): NaNs produced
Note that class() returns “numeric”:
class(a)
[1] "numeric"
To test for NaNs, use:
is.nan(a)
[1] TRUE
NaNs are also NA:
is.na(a)
[1] TRUE
But the opposite is not true:
is.nan(NA)
[1] FALSE
NaN can be considered a subtype of NA: is.na(NaN) is TRUE, but is.nan(NA) is FALSE.
8.13 NULL: The empty object
The NULL object represents an empty object.
NULL means empty, not missing, and is therefore entirely different from NA.
NULL shows up, for example, when initializing a list:
a <- vector("list", length = 4)
a
[[1]]
NULL
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
and it can be replaced normally:
a[[1]] <- 3
a
[[1]]
[1] 3
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
8.13.1 Replacing with NULL
You cannot replace one or more elements of a vector/matrix/array with NULL because NULL has length 0 and replacement requires a value of length 1 or more:
a <- 11:15
a
[1] 11 12 13 14 15
a[1] <- NULL
Error in a[1] <- NULL: replacement has length zero
However, in lists, and therefore also data frames (see Chapter 25), replacing an element with NULL removes that element. For example, given a list al with elements alpha, beta, and gamma:
$alpha
[1] 11 12 13 14 15
$beta
[1] -0.7723829 -0.3740601 1.0123050 0.5800966 -0.8030313 -1.0591240
[7] -0.6634399 0.2035013 -0.5068714 -1.1689913
$gamma
[1] "mango" "banana" "tangerine"
al[[2]] <- NULL
al
$alpha
[1] 11 12 13 14 15
$gamma
[1] "mango" "banana" "tangerine"
Finally, NULL is often used as the default value in a function’s argument. The function definition must then determine what the default behavior/value should be.
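A minimal sketch of this pattern follows. The function center_values() and its argument center are hypothetical, written only for this example: when center is left as NULL, the function falls back to the mean of x.

```r
# Hypothetical function: `center` defaults to NULL
center_values <- function(x, center = NULL) {
  # If the caller did not supply a value, decide on the default here
  if (is.null(center)) {
    center <- mean(x, na.rm = TRUE)
  }
  x - center
}

center_values(c(1, 2, 3))               # -1 0 1
center_values(c(1, 2, 3), center = 1)   # 0 1 2
```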
8.14 Named vectors
While not very common, you can name the elements of a vector. You can do so either when creating the vector or after the fact:
First, create a vector without names, in this example, a character vector:
v <- c("UCSF", "Stanford", "Penn")
v
[1] "UCSF" "Stanford" "Penn"
Now, create a named vector and notice how the element names are displayed above the element values when the vector is printed:
site <- c(SiteA = "UCSF", SiteB = "Stanford", SiteC = "Penn")
site
SiteA SiteB SiteC
"UCSF" "Stanford" "Penn"
Note that v has no names, therefore the following returns NULL:
names(v)
NULL
while site has names:
names(site)
[1] "SiteA" "SiteB" "SiteC"
If we want to add names to the elements of v, or to change/replace the names of site, we can assign a character vector of names using names(), e.g. names(v) <- c("SiteA", "SiteB", "SiteC").
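Both operations can be sketched as follows. The vectors are re-created so that the example is self-contained, and the replacement names Site1, Site2, and Site3 are made up for illustration.

```r
# Add names to an unnamed vector
v <- c("UCSF", "Stanford", "Penn")
names(v) <- c("SiteA", "SiteB", "SiteC")
v

# Replace the existing names of a named vector
site <- c(SiteA = "UCSF", SiteB = "Stanford", SiteC = "Penn")
names(site) <- c("Site1", "Site2", "Site3")
names(site)
```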
Lastly, if we want to remove the names of site, we can replace them with NULL:
names(site) <- NULL
site
[1] "UCSF" "Stanford" "Penn"
8.15 Initialize - coerce - test vectors
The following summary table lists the functions to initialize, coerce (=convert), and test the main different vector types:
| Initialize | Coerce | Test |
|---|---|---|
| logical(n) | as.logical(x) | is.logical(x) |
| integer(n) | as.integer(x) | is.integer(x) |
| double(n) | as.double(x) | is.double(x) |
| character(n) | as.character(x) | is.character(x) |
The usage of the terms double and numeric across functions is unfortunately inconsistent in R.
- double() is the same as numeric(): They both initialize a vector of type double.
- as.double() is the same as as.numeric(): They both coerce to type double.

BUT

- is.double() is NOT the same as is.numeric():
  - is.numeric() is TRUE for both integer and double types. It is useful when you want to check if a variable is a number, regardless of whether it is an integer or double.
  - is.double() is TRUE only for type double.
  - is.integer() is TRUE only for type integer.
Therefore, to promote clarity, prefer using double() instead of numeric() and as.double() instead of as.numeric().
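The inconsistency described above can be verified with a short sketch (the names xi and xd are placeholders):

```r
xi <- c(1L, 2L, 3L)      # integer
xd <- c(1.5, 2.5, 3.5)   # double

is.numeric(xi)   # TRUE
is.numeric(xd)   # TRUE
is.double(xi)    # FALSE
is.double(xd)    # TRUE
is.integer(xi)   # TRUE
is.integer(xd)   # FALSE

# numeric() and as.numeric() both produce type double
typeof(numeric(2))       # "double"
typeof(as.numeric(xi))   # "double"
```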