30  Strings

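“Strings”, in programming, refer to collections of characters, i.e. text data. In R, text is usually stored as objects of class “character” (see Chapter 8). In statistics, machine learning, and data science, we come across text in a few different ways, and in R all of them are represented as character vectors.

This chapter covers some important string operations: creating and combining character vectors, common string utilities, string formatting, pattern matching, and regular expressions.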

30.1 Creating character vectors

Reminder: To initialize, coerce, or test character vectors, use character(), as.character(), and is.character(), respectively.

Initialize character vector:

x <- character(10)
x
 [1] "" "" "" "" "" "" "" "" "" ""

Coerce to character:

v <- c(10, 20, 22, 43)
x <- as.character(v)
x
[1] "10" "20" "22" "43"

Test whether an object is a character vector:

x <- c("PID", "Age", "Sex", "Handedness")
is.character(x)
[1] TRUE

30.1.1 cat(): Concatenate and print

cat() concatenates its inputs and prints them to the screen (console) or to a file.

cat() does not return any useful object (it invisibly returns NULL); it is called for its side effect of printing. It is therefore useful for producing informative messages in your programs.

It can concatenate strings along with the output of expressions:

sbp <- 130
temp <- 98.4
cat("The blood pressure was", sbp, "and the temperature was", temp, "\n")
The blood pressure was 130 and the temperature was 98.4 
weight <- 74
height <- 1.78
cat("The weight is", weight, "and the height is", height, 
    "giving a BMI of", signif(weight/height^2, 3), "\n")
The weight is 74 and the height is 1.78 giving a BMI of 23.4 

Use the file argument to write to a text file. The append argument allows using multiple consecutive cat() calls to append to the same file.
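As a minimal sketch of writing to a file, the example below sends two consecutive cat() calls to the same temporary file and reads the result back. The file path comes from tempfile() and the variable name logfile is only for illustration; in practice you would supply your own path.

```r
# Hypothetical output file, created in the session's temporary directory
logfile <- tempfile(fileext = ".txt")

sbp <- 130
temp <- 98.4

# First call creates (or overwrites) the file
cat("Systolic blood pressure:", sbp, "\n", file = logfile)
# Second call adds to the same file because append = TRUE
cat("Temperature:", temp, "\n", file = logfile, append = TRUE)

# Read the file back to check that both lines were written
out <- readLines(logfile)
out
```

Without append = TRUE, the second call would overwrite the file and only the temperature line would remain.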

Note

If you use cat() on a factor, it will print the integer codes (indices) of the factor levels, not the level names.

If you wish to print the levels themselves, use as.character() on the factor vector:

cat(head(iris$Species), "\n")
1 1 1 1 1 1 
cat(as.character(head(iris$Species)), "\n")
setosa setosa setosa setosa setosa setosa 

30.1.2 paste(): Concatenate character vectors

In its simplest form, paste() acts like as.character():

v <- c(10, 20, 22, 43)
paste(v)
[1] "10" "20" "22" "43"

However, its main job is to combine strings.

It can combine two or more strings into one:

first <- "Jane"
last <- "Artiste"
ID <- "8001"
paste(ID, last, first)
[1] "8001 Artiste Jane"

The sep argument defaults to a single space (" ") and defines the separator:

paste(ID, last, first, sep = " | ")
[1] "8001 | Artiste | Jane"

paste() is vectorized, which means it can combine character vectors elementwise:

id <- c("001", "010", "018", "020", "021", "051")
dept <- c("Emergency", "Cardiology", "Neurology",
         "Anesthesia", "Surgery", "Psychiatry")
paste(id, dept)
[1] "001 Emergency"  "010 Cardiology" "018 Neurology"  "020 Anesthesia"
[5] "021 Surgery"    "051 Psychiatry"

paste0() is an alias for the commonly used paste(..., sep = ""):

paste0(id, dept)
[1] "001Emergency"  "010Cardiology" "018Neurology"  "020Anesthesia"
[5] "021Surgery"    "051Psychiatry"

As with other vectorized operations, value recycling can be very convenient. In the example below, the shorter vector (i.e. "Feature_", a character vector of length 1) is recycled to match the length of the longer vector (1:10).

paste0("Feature_", 1:10)
 [1] "Feature_1"  "Feature_2"  "Feature_3"  "Feature_4"  "Feature_5" 
 [6] "Feature_6"  "Feature_7"  "Feature_8"  "Feature_9"  "Feature_10"

The collapse argument joins the elements of the result into a single string, placing the given separator between them:

paste0("Feature_", 1:10, collapse = ", ")
[1] "Feature_1, Feature_2, Feature_3, Feature_4, Feature_5, Feature_6, Feature_7, Feature_8, Feature_9, Feature_10"
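Because sep and collapse are easy to confuse, here is a short sketch that uses both in one call. sep is placed between corresponding elements of the input vectors, and collapse is then placed between the elements of the resulting vector. The variable names by_element and single are only for illustration.

```r
id <- c("001", "010", "018")
dept <- c("Emergency", "Cardiology", "Neurology")

# sep joins corresponding elements: the result has one element per pair
by_element <- paste(id, dept, sep = "-")
by_element

# collapse then joins those elements into one string of length 1
single <- paste(id, dept, sep = "-", collapse = "; ")
single
```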

30.2 Common string utilities

30.2.1 Get number of characters in element with nchar():

nchar() counts the number of characters in each element of a character vector:

x <- c("a", "bb", "ccc")
nchar(x)
[1] 1 2 3

30.2.2 Extract/replace substring with substr():

substr() allows you to get and set individual (literal) characters from a character vector, by position.

For example, extract the first three characters of each character element as:

x <- c("001Emergency", "010Cardiology", "018Neurology", 
       "020Anesthesia", "021Surgery", "051Psychiatry")
substr(x, start = 1, stop = 3)
[1] "001" "010" "018" "020" "021" "051"

Neither start nor stop needs to be a valid character position.

For example, if you want to get all characters from the fourth one to the last one, you can specify a very large stop as:

substr(x, 4, 99)
[1] "Emergency"  "Cardiology" "Neurology"  "Anesthesia" "Surgery"   
[6] "Psychiatry"

If you start with too high an index, you end up with empty strings:

substr(x, 20, 24)
[1] "" "" "" "" "" ""

Note: substring() is also available, with similar syntax to substr(): (first, last) instead of (start, stop). It is available for compatibility with S. Check its source code to see how it wraps substr(): unlike substr(), its last argument has a default value (1000000L), so it can be omitted.
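A brief sketch of two cases where substring() is more convenient than substr(): omitting the end position, and extracting several pieces from the same string by passing vectors of positions. The variable names rest and pieces are only for illustration.

```r
x <- c("001Emergency", "010Cardiology")

# last can be omitted: get everything from the 4th character onwards
rest <- substring(x, 4)
rest

# first and last can be vectors: the text is recycled to match them,
# giving characters 1-3, 2-4, and 3-5 of the same string
pieces <- substring("Emergency", 1:3, 3:5)
pieces
```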

To replace the first three letters, use:

x <- c("Jan_1987")
x
[1] "Jan_1987"
substr(x, 1, 3) <- "Feb"
x
[1] "Feb_1987"

Note that if the replacement is longer, it is “cropped” to the length of the substring being replaced:

substr(x, 1, 3) <- "April"
x
[1] "Apr_1987"

30.2.3 Split strings with strsplit():

strsplit() allows you to split a character vector’s elements based on any character or regular expression.

For example, extract individual words by splitting a sentence on each space character:

x <- "This is one sentence"
strsplit(x, " ")
[[1]]
[1] "This"     "is"       "one"      "sentence"
x <- "14,910"
strsplit(x, ",")
[[1]]
[1] "14"  "910"
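Note that strsplit() always returns a list with one element per input string, because each string may split into a different number of pieces. The sketch below shows one way to work with that list, and uses fixed = TRUE to split on a literal period, which would otherwise be interpreted as a regular expression (see Section 30.5). The variable names are only for illustration.

```r
x <- c("14,910", "2,347,120")
parts <- strsplit(x, ",")

# Each list element holds the pieces of one input string
parts[[1]]

# The number of pieces can differ between elements
n_pieces <- lengths(parts)
n_pieces

# Extract the first piece of every element as a character vector
first_piece <- sapply(parts, \(p) p[1])
first_piece

# fixed = TRUE treats the split string literally, not as a regular expression
words <- strsplit("systolic.blood.pressure", ".", fixed = TRUE)[[1]]
words
```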

As with all functions, you can compose multiple string operations in complex ways. As with any function composition, remember to build and test it step by step.

x <- c("1,950", "2,347")
x
[1] "1,950" "2,347"
lapply(strsplit(x, ","), \(i) 
  paste(i, c("thousand", "dollars"), collapse = " and "))
[[1]]
[1] "1 thousand and 950 dollars"

[[2]]
[1] "2 thousand and 347 dollars"
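One possible way to build the composition above step by step is sketched below: first inspect the output of strsplit(), then test the inner paste() call on a single element, and only then apply it to all elements. Here sapply() is used instead of lapply() to get a character vector back; this is a choice made for the illustration, as are the intermediate variable names.

```r
x <- c("1,950", "2,347")

# Step 1: split and inspect the resulting list
parts <- strsplit(x, ",")
parts

# Step 2: test the inner operation on the first element only
step2 <- paste(parts[[1]], c("thousand", "dollars"))
step2

# Step 3: add collapse to get a single string for that element
step3 <- paste(parts[[1]], c("thousand", "dollars"), collapse = " and ")
step3

# Step 4: apply the tested operation to every element
amounts <- sapply(parts, \(i)
  paste(i, c("thousand", "dollars"), collapse = " and "))
amounts
```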

30.3 String formatting

30.3.1 Change case with toupper() and tolower()

features <- c("id", "age", "sex", "sbp", "dbp", "hct", "urea", "creatinine")
features
[1] "id"         "age"        "sex"        "sbp"        "dbp"       
[6] "hct"        "urea"       "creatinine"
features_upper <- toupper(features)
features_upper
[1] "ID"         "AGE"        "SEX"        "SBP"        "DBP"       
[6] "HCT"        "UREA"       "CREATININE"
features_lower <- tolower(features_upper)
features_lower
[1] "id"         "age"        "sex"        "sbp"        "dbp"       
[6] "hct"        "urea"       "creatinine"

30.3.2 Convert to Title Case

The tools package comes with the base R installation, but is not loaded at startup, because it contains rather specialized functions for package development, administration, and documentation. However, it includes the toTitleCase() function, which can be handy for formatting variable names, e.g. before plotting, etc.

features <- c("full name", "admission type", "attending name", "date of admission")
tools::toTitleCase(features)
[1] "Full Name"         "Admission Type"    "Attending Name"   
[4] "Date of Admission"

30.3.3 Abbreviate

abbreviate() allows reducing character vector elements to short, unique abbreviations of a minimum length (defaults to 4). For example,

x <- c("Emergency", "Cardiology", "Surgery", "Anesthesia", 
       "Neurology", "Psychiatry", "Clinical Psychology")
abbreviate(x)
          Emergency          Cardiology             Surgery          Anesthesia 
             "Emrg"              "Crdl"              "Srgr"              "Anst" 
          Neurology          Psychiatry Clinical Psychology 
             "Nrlg"              "Psyc"              "ClnP" 
abbreviate(x, minlength = 4)
          Emergency          Cardiology             Surgery          Anesthesia 
             "Emrg"              "Crdl"              "Srgr"              "Anst" 
          Neurology          Psychiatry Clinical Psychology 
             "Nrlg"              "Psyc"              "ClnP" 
abbreviate(x, minlength = 5)
          Emergency          Cardiology             Surgery          Anesthesia 
            "Emrgn"             "Crdlg"             "Srgry"             "Ansth" 
          Neurology          Psychiatry Clinical Psychology 
            "Nrlgy"             "Psych"             "ClncP" 

30.4 Pattern matching

A very common task in programming is to find, and optionally replace, string patterns in a vector of strings.


- grep() and grepl() help find strings that contain a given pattern.
- sub() and gsub() help find and replace strings.

30.4.1 grep(): Get integer index of elements that match a pattern

x <- c("001Age", "002Sex", "010Temp", "014SBP", 
       "018Hct", "022PFratio", "030GCS", "112SBP-DBP")
grep(pattern = "SBP", x = x)
[1] 4 8

grep()’s value argument, which defaults to FALSE, allows returning the matched string itself (the value of the element) instead of its integer index, e.g.

grep("SBP", x, value = TRUE)
[1] "014SBP"     "112SBP-DBP"

30.4.2 grepl(): Get logical index of elements that match a pattern

grepl() is similar to grep(), but returns a logical index:

grepl("SBP", x)
[1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
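A logical index is convenient for subsetting, negating, and counting matches. The sketch below shows these uses on the same vector; ignore.case is an argument of grepl() (and grep()) that makes matching case-insensitive. The variable names are only for illustration.

```r
x <- c("001Age", "002Sex", "010Temp", "014SBP",
       "018Hct", "022PFratio", "030GCS", "112SBP-DBP")

# Keep only the elements that match
has_sbp <- x[grepl("SBP", x)]
has_sbp

# Negate the logical index to keep the elements that do not match
no_sbp <- x[!grepl("SBP", x)]
no_sbp

# Count matches, ignoring case
n_match <- sum(grepl("sbp", x, ignore.case = TRUE))
n_match
```

Negating a logical index works even when nothing matches, whereas negative indexing with the output of grep() returns an empty vector in that case.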

30.4.3 sub(): Replace first match of a pattern

x <- c("The most important variable was PF ratio. Other significant variables are listed 
in the supplementary information.")
x
[1] "The most important variable was PF ratio. Other significant variables are listed \nin the supplementary information."
sub(pattern = "variable", replacement = "feature", x = x)
[1] "The most important feature was PF ratio. Other significant variables are listed \nin the supplementary information."

sub() is vectorized and “first match” refers to each element of a character vector:

x <- c("var 1, var 2", "var 3, var 4")
sub("var", "feat", x)
[1] "feat 1, var 2" "feat 3, var 4"

30.4.4 gsub(): Replace all matches of a pattern

x <- c("The most important variable was P/F ratio. Other significant variables are listed in the supplementary information.")
x
[1] "The most important variable was P/F ratio. Other significant variables are listed in the supplementary information."
gsub(pattern = "variable", replacement = "feature", x = x)
[1] "The most important feature was P/F ratio. Other significant features are listed in the supplementary information."

“All matches” means all matches across all elements:

x <- c("var 1, var 2", "var 3, var 4")
gsub("var", "feat", x)
[1] "feat 1, feat 2" "feat 3, feat 4"

30.4.5 Match one or more patterns

You can use a vertical bar (|) in the pattern string to match one of multiple patterns:

x <- c("Emergency", "Cardiology", "Neurology", "Anesthesia", 
       "Surgery", "Psychiatry")
grep("Cardio|Neuro", x, value = TRUE)
[1] "Cardiology" "Neurology" 
grep("Emerg|Surg|Anesth", x, value = TRUE)
[1] "Emergency"  "Anesthesia" "Surgery"   
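When the alternatives are stored in a character vector, the pattern can be built with paste() and its collapse argument. This is a minimal sketch; the vector keys is a made-up example, and it assumes the search terms contain no regular expression metacharacters (see Section 30.5.10).

```r
x <- c("Emergency", "Cardiology", "Neurology", "Anesthesia",
       "Surgery", "Psychiatry")

# Hypothetical search terms
keys <- c("Cardio", "Neuro", "Psych")

# Join the terms with a vertical bar to form a single pattern
pattern <- paste(keys, collapse = "|")
pattern

matched <- grep(pattern, x, value = TRUE)
matched
```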

30.5 Regular expressions

Regular expressions allow you to perform flexible pattern matching. For example, you can look for a pattern specifically at the beginning or the end of a word, or for a variable pattern with certain characteristics.

Regular expressions are very powerful and heavily used. They exist in multiple programming languages - with many similarities and some differences in their syntax.

There are many rules in defining regular expressions and they take a little getting used to. You can read the R manual by typing ?base::regex.

Some of the most important rules are listed below:

30.5.1 Match a pattern at the beginning of a line/string with ^ and \\<

Use the caret sign ^ at the beginning of a pattern to only match strings that begin with that pattern. \\< is similar, but matches at the beginning of any word, not only at the beginning of the string.

Pattern 012 matches both 2nd and 3rd elements:

x <- c("001_xyz_993", "012_qwe_764", "029_aqw_012")
x
[1] "001_xyz_993" "012_qwe_764" "029_aqw_012"
grep("012", x)
[1] 2 3

By adding ^ or \\<, only the 2nd element in our character vector matches:

grep("^012", x)
[1] 2
grep("\\<012", x)
[1] 2
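In the example above ^ and \\< give the same result, because the underscore counts as a word character, so the "012" that follows an underscore does not start a new word. The sketch below uses a made-up vector in which the two differ: a space does start a new word.

```r
x <- c("012 first", "second 012", "third_012", "x012")

# ^ matches only at the beginning of the string
at_string_start <- grep("^012", x)
at_string_start

# \\< matches at the beginning of any word, including after a space
at_word_start <- grep("\\<012", x)
at_word_start
```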

30.5.2 Match a pattern at the end of a line/string with $ and \\>

Use the dollar sign $ at the end of a pattern to only match strings that end with that pattern. Similarly, \\> matches at the end of a word:

x
[1] "001_xyz_993" "012_qwe_764" "029_aqw_012"
grep("012$", x)
[1] 3
grep("012\\>", x)
[1] 3
x <- c("1one", "2one", "3two", "3three")
grep("one$", x)
[1] 1 2
grep("one\\>", x)
[1] 1 2

30.5.3 Match any character with .

grep("e.X", c("eX", "enX", "ennX", "ennnX", "ennnnX"))
[1] 2

30.5.4 Match preceding character one or more times with +

grep("en+X", c("eX", "enX", "ennX", "ennnX", "ennnnX"))
[1] 2 3 4 5

30.5.5 Match preceding character n times with {n}

grep("en{2}X", c("eX", "enX", "ennX", "ennnX", "ennnnX"))
[1] 3

30.5.6 Match preceding character n or more times with {n,}

grep("en{2,}X", c("eX", "enX", "ennX", "ennnX", "ennnnX"))
[1] 3 4 5

30.5.7 Match preceding character at least n times and no more than m times with {n,m}

grep("en{2,3}X", c("eX", "enX", "ennX", "ennnX", "ennnnX"))
[1] 3 4
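Two related quantifiers, not shown above but listed among the metacharacters in Section 30.5.10, are * (match the preceding character zero or more times) and ? (match it zero or one time). A short sketch using the same vector:

```r
v <- c("eX", "enX", "ennX", "ennnX", "ennnnX")

# * allows zero or more "n" between "e" and "X"
zero_or_more <- grep("en*X", v)
zero_or_more

# ? allows at most one "n" between "e" and "X"
zero_or_one <- grep("en?X", v)
zero_or_one
```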

30.5.8 Character classes

You can define a set of characters to be matched using square brackets. A character class matches any single character from the set; with gsub(), every such occurrence is replaced.

For example, match and replace $ and/or @ with an underscore:

x <- c("Feat1$alpha", "Feat2$gamma@5", "Feat9@zeta2")
gsub("[$@]", "_", x)
[1] "Feat1_alpha"   "Feat2_gamma_5" "Feat9_zeta2"  

30.5.8.1 Predefined character classes

A number of character classes are predefined. Perhaps confusingly, they are themselves surrounded by brackets and to use them as a character class, you need a second set of brackets around them. Some of the most common ones include:

  • [:alnum:]: alphanumeric, i.e. all letters and numbers
  • [:alpha:]: all letters
  • [:digit:]: all numbers
  • [:lower:]: all lowercase letters
  • [:upper:]: all uppercase letters
  • [:punct:]: all punctuation characters (! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~)
  • [:blank:]: all spaces and tabs
  • [:space:]: all spaces, tabs, newline characters, and some more

Let’s look at some examples using them.

Here we use [:digit:] to remove all numbers:

x <- c("001Emergency", "010Cardiology", "018Neurology", "020Anesthesia", 
       "021Surgery", "051Psychiatry")
x
[1] "001Emergency"  "010Cardiology" "018Neurology"  "020Anesthesia"
[5] "021Surgery"    "051Psychiatry"
gsub("[[:digit:]]", "", x)
[1] "Emergency"  "Cardiology" "Neurology"  "Anesthesia" "Surgery"   
[6] "Psychiatry"

We can use [:alpha:] to remove all letters instead:

gsub("[[:alpha:]]", "", x)
[1] "001" "010" "018" "020" "021" "051"

We can use a caret ^ in the beginning of a character class to match any character not in the character set:

x <- c("001$Emergency", "010@Cardiology", "018*Neurology", "020!Anesthesia", 
       "021!Surgery", "051*Psychiatry")
gsub("[^[:alnum:]]", "_", x)
[1] "001_Emergency"  "010_Cardiology" "018_Neurology"  "020_Anesthesia"
[5] "021_Surgery"    "051_Psychiatry"
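Character classes combine naturally with the quantifiers and anchors introduced earlier. The sketch below shows one possible way to turn free-form labels into clean variable names; the input labels and the cleaning steps are made up for illustration and are not a standard recipe.

```r
# Hypothetical column labels
x <- c("Systolic BP (mmHg)", "PF ratio", "Hct, %")

# Replace each run of one or more non-alphanumeric characters with "_"
clean <- gsub("[^[:alnum:]]+", "_", x)
clean

# Remove an underscore at the beginning or at the end of each string
clean <- gsub("^_|_$", "", clean)

# Convert to lowercase
clean <- tolower(clean)
clean
```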

30.5.9 Match from multiple character classes

x <- c("123#$%alphaBeta")
gsub("[[:digit:][:punct:]]", "", x)
[1] "alphaBeta"

Note

For more information on regular expressions, start by reading the built-in documentation: ?regex.

30.5.10 Escaping metacharacters

Metacharacters are characters that have a special meaning within a regular expression. They include:

. \ | ( ) [ { ^ $ * + ?

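For example, we have seen above that the period matches any character and that square brackets are used to define character classes. If you want to match one of these characters itself, you must “escape” it using a double backslash. Escaping a character simply means “this is not a regular expression operator, match it as is”. The backslash is doubled because R itself uses the backslash as an escape character in strings: the R string "\\." contains the two characters \. and that is what the regular expression engine sees.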

For example, to match a period (.) and replace it with underscores:

x <- c("systolic.blood.pressure", "diastolic.blood.pressure")
x
[1] "systolic.blood.pressure"  "diastolic.blood.pressure"
gsub("\\.", "_", x)
[1] "systolic_blood_pressure"  "diastolic_blood_pressure"

If we didn’t escape the period above, it would have matched any character:

gsub(".", "_", x)
[1] "_______________________"  "________________________"
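There are alternatives to the double backslash. A metacharacter such as the period loses its special meaning inside a character class, and the fixed argument of gsub() (also available in sub(), grep(), and grepl()) turns off regular expressions entirely so that the pattern is matched literally. A short sketch showing that all three approaches give the same result here; the variable names are only for illustration:

```r
x <- c("systolic.blood.pressure", "diastolic.blood.pressure")

# Escape the period with a double backslash
escaped <- gsub("\\.", "_", x)

# Place the period in a character class, where it is literal
bracketed <- gsub("[.]", "_", x)

# Treat the whole pattern as a literal string
literal <- gsub(".", "_", x, fixed = TRUE)

escaped
identical(escaped, bracketed)
identical(escaped, literal)
```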

As another example, we can include an escaped metacharacter within a longer regular expression. In the example below we want to remove everything up to and including the dollar sign:

x <- c("df$ID", "df$Age")
gsub(".*\\$", "", x)
[1] "ID"  "Age"

Our regular expression .*\\$, decomposed:

  • .: match any character
  • .*: match any character any number of times
  • .*\\$: match any character any number of times till you find a dollar sign

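If we had not escaped the $, it would have been interpreted as the end-of-string anchor, so .*$ would match the entire string and everything would be removed: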

gsub(".*$", "", x)
[1] "" ""
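One more detail about .*\\$ is that * is “greedy”: it matches as many characters as possible, so with more than one dollar sign in the string, everything up to the last one is removed. The sketch below uses a made-up string to show this, along with two possible ways to stop at the first dollar sign instead: a negated character class, or the non-greedy quantifier *? with perl = TRUE.

```r
# Hypothetical string with two dollar signs
x <- "df$data$ID"

# Greedy: removes everything up to and including the last "$"
greedy <- sub(".*\\$", "", x)
greedy

# Negated character class: match non-"$" characters, then the first "$"
first_only <- sub("^[^$]*\\$", "", x)
first_only

# Non-greedy quantifier with Perl-compatible regular expressions
lazy <- sub(".*?\\$", "", x, perl = TRUE)
lazy
```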