“Strings”, in programming, refer to collections of characters, i.e. text data. In R, text is usually stored as objects of class “character” (See Chapter 8). In statistics/machine learning/data science, we come across text in a few different ways. All the following are character vectors:
Some important string operations include:
Reminder: To initialize, coerce, or test character vectors, use:
character(): Initialize empty character vector
as.character(): Coerce vector to character vector
is.character(): Test if object is character
Initialize character vector:
x <- character(10)
x
[1] "" "" "" "" "" "" "" "" "" ""
Coerce to character:
v <- c(10, 20, 22, 43)
x <- as.character(v)
x
[1] "10" "20" "22" "43"
Test if object is a character vector:
x <- c("PID", "Age", "Sex", "Handedness")
is.character(x)
[1] TRUE
cat(): Concatenate and print
cat() concatenates its inputs and prints them to the screen (console) or to a file.
cat() does not return a useful object (it invisibly returns NULL). It is therefore used for its side effect of producing informative messages in your programs.
It can concatenate strings along with the output of expressions:
sbp <- 130
temp <- 98.4
cat("The blood pressure was", sbp, "and the temperature was", temp, "\n")
The blood pressure was 130 and the temperature was 98.4
weight <- 74
height <- 1.78
cat("The weight is", weight, "and the height is", height,
"giving a BMI of", signif(weight/height^2, 3), "\n")
The weight is 74 and the height is 1.78 giving a BMI of 23.4
Use the file argument to write to a text file. The append argument allows using multiple consecutive cat() calls to append to the same file.
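A minimal sketch of writing to a file with consecutive cat() calls. The output path is a hypothetical temporary file created with tempfile(), and the values are reused from the example above:

```r
# Hypothetical output path: a temporary text file
out_file <- tempfile(fileext = ".txt")

sbp <- 130
temp <- 98.4

# First call creates (or overwrites) the file
cat("The blood pressure was", sbp, "\n", file = out_file)
# Second call appends to the same file instead of overwriting it
cat("The temperature was", temp, "\n", file = out_file, append = TRUE)

# Read the file back to check its contents: one element per line
readLines(out_file)
```

Without append = TRUE, the second call would overwrite the file and only the last line would remain.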
If you use cat() on a factor, it will print the integer index of the factor levels.
If you wish to print the levels themselves, use as.character() on the factor vector:
cat(as.character(head(iris$Species)), "\n")
setosa setosa setosa setosa setosa setosa
paste(): Concatenate character vectors
In its simplest form, paste() acts like as.character():
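For instance, a short sketch using the numeric vector from the coercion example earlier in this chapter:

```r
v <- c(10, 20, 22, 43)

# With a single vector and no other arguments, paste() converts
# each element to character
paste(v)
# [1] "10" "20" "22" "43"

# The result is the same as that of as.character()
identical(paste(v), as.character(v))
# [1] TRUE
```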
But, its main job is to combine strings.
It can combine two or more strings into one:
first <- "Jane"
last <- "Artiste"
ID <- "8001"
paste(ID, last, first)
[1] "8001 Artiste Jane"
The sep argument defaults to a single space (" ") and defines the separator:
paste(ID, last, first, sep = " | ")
[1] "8001 | Artiste | Jane"
paste() is vectorized, which means it can combine character vectors elementwise:
id <- c("001", "010", "018", "020", "021", "051")
dept <- c("Emergency", "Cardiology", "Neurology",
"Anesthesia", "Surgery", "Psychiatry")
paste(id, dept)
[1] "001 Emergency" "010 Cardiology" "018 Neurology" "020 Anesthesia"
[5] "021 Surgery" "051 Psychiatry"
paste0() is an alias for the commonly used paste(..., sep = ""):
paste0(id, dept)
[1] "001Emergency" "010Cardiology" "018Neurology" "020Anesthesia"
[5] "021Surgery" "051Psychiatry"
As with other vectorized operations, value recycling can be very convenient. In the example below, the shorter vector (i.e. “Feature_”, a character vector of length 1) is recycled to match the length of the longest vector (1:10).
paste0("Feature_", 1:10)
[1] "Feature_1" "Feature_2" "Feature_3" "Feature_4" "Feature_5"
[6] "Feature_6" "Feature_7" "Feature_8" "Feature_9" "Feature_10"
The collapse argument joins the resulting elements into a single string, separated by the string you provide:
paste0("Feature_", 1:10, collapse = ", ")
[1] "Feature_1, Feature_2, Feature_3, Feature_4, Feature_5, Feature_6, Feature_7, Feature_8, Feature_9, Feature_10"
nchar()
nchar() counts the number of characters in each element of a character vector:
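A brief illustration, using a few department names of the kind that appear elsewhere in this chapter:

```r
x <- c("Emergency", "Cardiology", "Surgery")

# One count per element, returned as an integer vector
nchar(x)
# [1]  9 10  7
```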
substr()
substr() allows you to get and set individual (literal) characters from a character vector, by position.
For example, extract the first three characters of each element:
x <- c("001Emergency", "010Cardiology", "018Neurology",
"020Anesthesia", "021Surgery", "051Psychiatry")
substr(x, start = 1, stop = 3)
[1] "001" "010" "018" "020" "021" "051"
Neither start nor stop needs to be a valid character position.
For example, if you want to get all characters from the fourth one to the last one, you can specify a very large stop:
substr(x, 4, 99)
[1] "Emergency" "Cardiology" "Neurology" "Anesthesia" "Surgery"
[6] "Psychiatry"
If you start with too high an index, you end up with empty strings:
substr(x, 20, 24)
[1] "" "" "" "" "" ""
Note: substring() is also available, with similar syntax to substr(): (first, last) instead of (start, stop). It is available for compatibility with S. Check its source code to see that it is a thin wrapper around the same internal routine as substr().
To replace the first three letters, use:
x <- c("Jan_1987")
x
[1] "Jan_1987"
substr(x, 1, 3) <- "Feb"
x
[1] "Feb_1987"
Note that if the replacement is longer, it is “cropped” to the length of the substring being replaced:
substr(x, 1, 3) <- "April"
x
[1] "Apr_1987"
strsplit()
strsplit() allows you to split a character vector’s elements based on any character or regular expression.
For example, extract individual words by splitting a sentence on each space character:
x <- "This is one sentence"
strsplit(x, " ")
[[1]]
[1] "This" "is" "one" "sentence"
x <- "14,910"
strsplit(x, ",")
[[1]]
[1] "14" "910"
As with all functions, you can compose multiple string operations in complex ways. As with any function composition, remember to build and test each step separately before combining them.
x <- c("1,950", "2,347")
x
[1] "1,950" "2,347"
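One possible way to turn these strings into numbers, built one step at a time. This is a sketch of the idea, not necessarily the composition originally intended here, and the intermediate names (parts, joined, x_num) are illustrative:

```r
x <- c("1,950", "2,347")

# Step 1: split each element on the comma; the result is a list
parts <- strsplit(x, ",")

# Step 2: paste the pieces of each element back together with no separator
joined <- sapply(parts, paste, collapse = "")

# Step 3: convert the cleaned strings to numeric
x_num <- as.numeric(joined)
x_num
# [1] 1950 2347
```

Once each step has been checked, the three calls can be nested into a single expression if preferred.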
toupper() and tolower()
features <- c("id", "age", "sex", "sbp", "dbp", "hct", "urea", "creatinine")
features
[1] "id" "age" "sex" "sbp" "dbp"
[6] "hct" "urea" "creatinine"
features_upper <- toupper(features)
features_upper
[1] "ID" "AGE" "SEX" "SBP" "DBP"
[6] "HCT" "UREA" "CREATININE"
features_lower <- tolower(features_upper)
features_lower
[1] "id" "age" "sex" "sbp" "dbp"
[6] "hct" "urea" "creatinine"
The tools package comes with the base R installation, but is not loaded at startup, because it contains rather specialized functions for package development, administration, and documentation. However, it includes the toTitleCase() function, which can be handy for formatting variable names, e.g. before plotting.
features <- c("full name", "admission type", "attending name", "date of admission")
tools::toTitleCase(features)
[1] "Full Name" "Admission Type" "Attending Name"
[4] "Date of Admission"
abbreviate() allows reducing character vector elements to short, unique abbreviations of a minimum length (defaults to 4). For example,
x <- c("Emergency", "Cardiology", "Surgery", "Anesthesia",
"Neurology", "Psychiatry", "Clinical Psychology")
abbreviate(x)
Emergency Cardiology Surgery Anesthesia
"Emrg" "Crdl" "Srgr" "Anst"
Neurology Psychiatry Clinical Psychology
"Nrlg" "Psyc" "ClnP"
abbreviate(x, minlength = 4)
Emergency Cardiology Surgery Anesthesia
"Emrg" "Crdl" "Srgr" "Anst"
Neurology Psychiatry Clinical Psychology
"Nrlg" "Psyc" "ClnP"
abbreviate(x, minlength = 5)
Emergency Cardiology Surgery Anesthesia
"Emrgn" "Crdlg" "Srgry" "Ansth"
Neurology Psychiatry Clinical Psychology
"Nrlgy" "Psych" "ClncP"
A very common task in programming is to find, and optionally replace, string patterns in a vector of strings.
- grep() and grepl() help find strings that contain a given pattern.
- sub() and gsub() help find and replace strings.
grep(): Get integer index of elements that match a pattern
x <- c("001Age", "002Sex", "010Temp", "014SBP",
"018Hct", "022PFratio", "030GCS", "112SBP-DBP")
grep(pattern = "SBP", x = x)
[1] 4 8
grep()’s value argument, which defaults to FALSE, allows returning the matched string itself (the value of the element) instead of its integer index, e.g.
grep("SBP", x, value = TRUE)
[1] "014SBP" "112SBP-DBP"
grepl(): Get logical index of elements that match a pattern
grepl() is similar to grep(), but returns a logical index:
grepl("SBP", x)
[1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE
sub(): Replace first match of a pattern
x <- c("The most important variable was PF ratio. Other significant variables are listed
in the supplementary information.")
x
[1] "The most important variable was PF ratio. Other significant variables are listed \nin the supplementary information."
sub(pattern = "variable", replacement = "feature", x = x)
[1] "The most important feature was PF ratio. Other significant variables are listed \nin the supplementary information."
sub() is vectorized and “first match” refers to each element of a character vector:
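A small sketch with a made-up two-element vector, in which the pattern occurs twice in each element:

```r
# Illustrative input: "var" appears twice in each element
x <- c("var_a_var", "var_b_var")

# Only the first match in each element is replaced
sub(pattern = "var", replacement = "feat", x = x)
# [1] "feat_a_var" "feat_b_var"
```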
gsub(): Replace all matches of a pattern
x <- c("The most important variable was P/F ratio. Other significant variables are listed in the supplementary information.")
x
[1] "The most important variable was P/F ratio. Other significant variables are listed in the supplementary information."
gsub(pattern = "variable", replacement = "feature", x = x)
[1] "The most important feature was P/F ratio. Other significant features are listed in the supplementary information."
“All matches” means all matches across all elements:
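A small sketch with a made-up two-element vector, in which the pattern occurs twice in each element:

```r
# Illustrative input: "var" appears twice in each element
x <- c("var_a_var", "var_b_var")

# Every match in every element is replaced
gsub(pattern = "var", replacement = "feat", x = x)
# [1] "feat_a_feat" "feat_b_feat"
```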
You can use a vertical bar (|) in the pattern string to match one of multiple patterns:
x <- c("Emergency", "Cardiology", "Neurology", "Anesthesia",
"Surgery", "Psychiatry")
grep("Cardio|Neuro", x, value = TRUE)
[1] "Cardiology" "Neurology"
grep("Emerg|Surg|Anesth", x, value = TRUE)
[1] "Emergency" "Anesthesia" "Surgery"
Regular expressions allow you to perform flexible pattern matching. For example, you can look for a pattern specifically at the beginning or the end of a word, or for a variable pattern with certain characteristics.
Regular expressions are very powerful and heavily used. They exist in multiple programming languages - with many similarities and some differences in their syntax.
There are many rules in defining regular expressions and they take a little getting used to. You can read the R manual by typing ?base::regex.
Some of the most important rules are listed below:
^ and \\<
Use the caret sign ^ at the beginning of a pattern to only match strings that begin with that pattern.
Pattern 012 matches both the 2nd and 3rd elements:
x <- c("001_xyz_993", "012_qwe_764", "029_aqw_012")
x
[1] "001_xyz_993" "012_qwe_764" "029_aqw_012"
grep("012", x)
[1] 2 3
By adding ^ or \\<, only the 2nd element in our character vector matches:
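A short sketch using the same vector. Note that \\< anchors at the beginning of a word, and the underscore counts as a word character, so the 012 that follows an underscore in the 3rd element does not qualify:

```r
x <- c("001_xyz_993", "012_qwe_764", "029_aqw_012")

# Anchor the pattern to the beginning of the string
grep("^012", x)
# [1] 2

# Anchor the pattern to the beginning of a word
grep("\\<012", x)
# [1] 2
```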
$ and \\>
The dollar sign $ is used at the end of a pattern to only match strings that end with this pattern:
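A short sketch, reusing the vector from the previous section. Here \\> anchors at the end of a word, and the underscore counts as a word character, so the 012 that is followed by an underscore in the 2nd element does not qualify:

```r
x <- c("001_xyz_993", "012_qwe_764", "029_aqw_012")

# Anchor the pattern to the end of the string
grep("012$", x)
# [1] 3

# Anchor the pattern to the end of a word
grep("012\\>", x)
# [1] 3
```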
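.
The period matches any single character.
+
The plus sign matches the preceding item one or more times.
Match a specific number of repetitions using curly braces:
- n times with {n}
- n or more times with {n,}
- at least n times and no more than m times with {n,m}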
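The sketch below illustrates these rules on a small made-up vector. The patterns are anchored with ^ and $ so that the entire string has to satisfy the repetition count:

```r
# Illustrative input
x <- c("a", "aa", "aaa", "aaaa", "b")

# +: one or more "a"
grep("^a+$", x, value = TRUE)
# [1] "a"    "aa"   "aaa"  "aaaa"

# {n}: exactly 2 "a"
grep("^a{2}$", x, value = TRUE)
# [1] "aa"

# {n,}: 2 or more "a"
grep("^a{2,}$", x, value = TRUE)
# [1] "aa"   "aaa"  "aaaa"

# {n,m}: at least 2 and no more than 3 "a"
grep("^a{2,3}$", x, value = TRUE)
# [1] "aa"  "aaa"

# .: any character, here repeated exactly 3 times
grep("^.{3}$", x, value = TRUE)
# [1] "aaa"
```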
You can define a set of characters to be matched using square brackets. A bracket expression matches any single character in the set, so gsub() will replace every occurrence of any of the listed characters.
For example, match and replace $ and/or @ with an underscore:
[1] "Feat1_alpha" "Feat2_gamma_5" "Feat9_zeta2"
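The input and call that produced the output above are not shown. The following is a minimal sketch that assumes a hypothetical input vector, chosen so that the result matches that output:

```r
# Hypothetical input (an assumption, not taken from the original text)
x <- c("Feat1$alpha", "Feat2$gamma@5", "Feat9@zeta2")

# Inside square brackets, $ is a literal character and needs no escaping
gsub("[$@]", "_", x)
# [1] "Feat1_alpha"   "Feat2_gamma_5" "Feat9_zeta2"
```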
A number of character classes are predefined. Perhaps confusingly, they are themselves surrounded by brackets and to use them as a character class, you need a second set of brackets around them. Some of the most common ones include:
[:alnum:]: alphanumeric, i.e. all letters and numbers
[:alpha:]: all letters
[:digit:]: all numbers
[:lower:]: all lowercase letters
[:upper:]: all uppercase letters
[:punct:]: all punctuation characters (! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~)
[:blank:]: all spaces and tabs
[:space:]: all spaces, tabs, newline characters, and some more
Let’s look at some examples using them.
Here we use [:digit:] to remove all numbers:
x <- c("001Emergency", "010Cardiology", "018Neurology", "020Anesthesia",
"021Surgery", "051Psychiatry")
x
[1] "001Emergency" "010Cardiology" "018Neurology" "020Anesthesia"
[5] "021Surgery" "051Psychiatry"
gsub("[[:digit:]]", "", x)
[1] "Emergency" "Cardiology" "Neurology" "Anesthesia" "Surgery"
[6] "Psychiatry"
We can use [:alpha:] to remove all letters instead:
gsub("[[:alpha:]]", "", x)
[1] "001" "010" "018" "020" "021" "051"
We can use a caret ^ at the beginning of a character class to match any character not in the character set:
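A brief sketch, using a shortened version of the vector from the examples above:

```r
x <- c("001Emergency", "010Cardiology", "018Neurology")

# Remove everything that is NOT a digit
gsub("[^[:digit:]]", "", x)
# [1] "001" "010" "018"

# Remove everything that is NOT a letter
gsub("[^[:alpha:]]", "", x)
# [1] "Emergency"  "Cardiology" "Neurology"
```

Note that the caret has this meaning only as the first character inside the outer brackets. Outside brackets it anchors the pattern to the beginning of the string.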
For more information on regular expressions, start by reading the built-in documentation: ?regex.
Metacharacters are characters that have a special meaning within a regular expression. They include: . \ | ( ) [ { ^ $ * + ?
For example, we have seen above that the period matches any character and the square brackets are used to define character classes. If you want to match one of these characters itself, you must “escape” it using a double backslash. Escaping a character simply means “this is not part of a regular expression, match it as is”.
For example, to match a period (.) and replace it with an underscore:
x <- c("systolic.blood.pressure", "diastolic.blood.pressure")
x
[1] "systolic.blood.pressure" "diastolic.blood.pressure"
gsub("\\.", "_", x)
[1] "systolic_blood_pressure" "diastolic_blood_pressure"
If we didn’t escape the period above, it would have matched any character:
gsub(".", "_", x)
[1] "_______________________" "________________________"
As another example, we can include an escaped metacharacter within a longer regular expression. In the example below we want to remove everything up to and including the dollar sign:
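The original input for this example is not shown. The following is a minimal sketch that assumes a hypothetical two-element vector of strings containing a dollar sign:

```r
# Hypothetical input (an assumption, not taken from the original text)
x <- c("Cost: $120", "Total: $4500")

# Escape the dollar sign so it is matched literally instead of
# being read as the end-of-string anchor
gsub(".*\\$", "", x)
# [1] "120"  "4500"
```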
Our regular expression .*\\$, decomposed:
.: match any character
.*: match any character any number of times
.*\\$: match any character any number of times till you find a dollar sign
If we had not escaped the $, it wouldn’t have worked:
gsub(".*$", "", x)
[1] "" ""