4 Static Graphics
Visualization is a central part of any data analysis pipeline. It is hard to overemphasize the importance of visualizing your data. Ideally, you want to visualize data before and after most operations. Depending on the kind and amount of data you are working with, this can range from straightforward to quite challenging. Here, we introduce some data visualization functions built using base R graphics. Some advantages of using base graphics are:
- They are easy to extend if you are familiar with base graphics, and to combine with the output of other functions that use base graphics.
- They are very fast to draw. This becomes particularly important when monitoring learning algorithms live, or building shiny applications.
High-dimensional data can sometimes be indirectly visualized after dimensionality reduction.
4.1 Density and Histograms
mplot3_x(iris$Sepal.Length, type = "density")
mplot3_x(iris$Sepal.Length, type = "hist")
We can also directly plot grouped data by inputting a list. Note that partial matching allows us to use just "d" for type:
set.seed(2019)
xl <- list(A = rnorm(500, mean = 0, sd = 1),
           B = rnorm(200, mean = 3, sd = 1.5))
mplot3_x(xl, "d")
mplot3_x(xl, "hist", hist.breaks = 24)
mplot3_x(split(iris$Sepal.Length, iris$Species), "d")
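Two base R mechanisms are at work in the calls above: partial matching lets "d" resolve to "density", and split() divides a vector into a named list of groups by a factor. A minimal sketch of both, assuming mplot3_x resolves its type argument the way match.arg() does:

```r
# Partial matching of an option, as match.arg() performs it
get_type <- function(type = c("density", "histogram")) match.arg(type)
get_type("d")   # resolves to "density"

# split() returns one vector per factor level, named by level
sl <- split(iris$Sepal.Length, iris$Species)
names(sl)       # "setosa" "versicolor" "virginica"
lengths(sl)     # 50 50 50
```

Any list with this shape, however it was built, can be passed to mplot3_x for grouped plotting.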
mplot3_x(iris)
4.2 Scatter plots
Here we are going to look at the static mplot3_xy() and mplot3_xym(), and the interactive dplot3_xy().
Some synthetic data:
set.seed(2019)
x <- rnorm(200)
y <- x^3 + rnorm(200, 3, 1.5)
We create some synthetic data and plot using mplot3_xy(). We can ask for any supervised learner to be used to fit the data. For linear relationships, that would be glm; for non-linear fits there are many options, but gam is a great one.
4.2.1 mplot3_xy
mplot3_xy(x, y, fit = 'gam', se.fit = TRUE)
mplot3_xy() allows you to easily group data in a few different ways.
You can pass x or y or both as a list of vectors:
set.seed(2019)
x <- rnorm(200)
y1 <- x^2 + rnorm(200)
y2 <- -x^2 + 10 + rnorm(200)/4
mplot3_xy(x, y = list(y1 = y1, y2 = y2), fit = 'gam')
Or you can use the group argument, which will accept either a variable name, if data is defined, or a factor vector:
x <- rnorm(400)
id <- sample(400, 200)
y1 <- x[id]^2 + rnorm(200)
y2 <- x[-id]^3 + rnorm(200)
group <- rep(1, 400)
group[-id] <- 2
y <- rep(0, length(x))
y[id] <- y1
y[-id] <- y2
dat <- data.frame(x, y, group)
mplot3_xy(x, y, data = dat, group = group, fit = "gam")
4.2.2 mplot3_xym()
This extension of mplot3_xy() adds marginal density / histogram plots to a scatter plot:
set.seed(2019)
x <- rnorm(200)
y <- x^3 + 12 + rnorm(200)
mplot3_xym(x, y)
4.2.3 Fit custom functions
mplot3_xy includes a formula argument as an alternative to fit. This allows the user to define the formula of the fitting function, if that is known. As an example, let’s look at power curves. Power curves can help us model a number of important relationships that occur in nature. Let’s see how we can plot these in rtemis.
4.2.3.1 y = b * m ^ x
First, we create some synthetic data:
set.seed(8102)
x <- rnorm(200)
y.true <- .8 * 2.7 ^ x
y <- y.true + .9 * rnorm(200)
Let’s plot the data:
mplot3_xy(x, y)
Now, let’s add a fit line. There are two ways to add a fit line in mplot3_xy:
- The fit argument, e.g. fit = 'glm'
- The formula argument, e.g. formula = y ~ a * x + b
In this case, a linear model (both 'lm' and 'glm' work) is not a good idea:
mplot3_xy(x, y, fit = 'glm')
A generalized additive model (GAM) is our best bet if we know nothing about the relationship between x and y. (fit is the third argument to mplot3_xy, so we can skip naming it.)
mplot3_xy(x, y, 'gam')
Even better, if we do know the type of relationship between x and y, we can provide a formula. This will be solved using the Nonlinear Least Squares learner (s_NLS).
mplot3_xy(x, y, formula = y ~ b * m ^ x)
We can plot the true function along with the fit.
fitted <- s_NLS(x, y, formula = y ~ b * m ^ x)$fitted
01-07-24 00:23:27 Hello, egenn [s_NLS]
.:Regression Input Summary
Training features: 200 x 1
Training outcome: 200 x 1
Testing features: Not available
Testing outcome: Not available
01-07-24 00:23:27 Initializing all parameters as 0.1 [s_NLS]
01-07-24 00:23:27 Training NLS model... [s_NLS]
.:NLS Regression Training Summary
MSE = 0.68 (89.24%)
RMSE = 0.82 (67.19%)
MAE = 0.65 (54.06%)
r = 0.94 (p = 7.9e-98)
R sq = 0.89
01-07-24 00:23:27 Completed in 1.5e-04 minutes (Real: 0.01; User: 0.01; System: 1e-03) [s_NLS]
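For reference, this kind of power-curve fit can also be reproduced with base R's nls(), the standard nonlinear least squares routine. The sketch below re-simulates the data so it stands alone; it is not necessarily how s_NLS is implemented, and the starting values here are chosen by hand rather than initialized at 0.1:

```r
set.seed(8102)
x <- rnorm(200)
y.true <- .8 * 2.7 ^ x
y <- y.true + .9 * rnorm(200)

# Fit y = b * m^x by nonlinear least squares
fit <- nls(y ~ b * m ^ x, start = list(b = 1, m = 2))
coef(fit)  # estimates should land near the true b = .8, m = 2.7
```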
mplot3_xy(x, y = list(Observed = y, True = y.true, Fitted = fitted),
type = c('p', 'l', 'l'), marker.alpha = .85)
4.2.4 Scatterplot + Cluster
We already saw that we can use any learner to draw a fit line in a scatter plot. You can similarly use any clustering algorithm to cluster the data and color points by cluster membership. Let’s use HOPACH (Van der Laan and Pollard 2003) to cluster the famous iris dataset. Learn more about [Clustering].
mplot3_xy(iris$Sepal.Length, iris$Petal.Length,
cluster = "hopach")
4.3 Heatmaps
x <- rnormmat(20, 20, seed = 2018)
x.cor <- cor(x)
mplot3_heatmap(x.cor)
Notice how mplot3_heatmap’s colorbar defaults to 10 overlapping discs on either side of zero, representing a 10% change from one to the next.
Turn off hierarchical clustering and dendrogram:
mplot3_heatmap(x.cor, Colv = NA, Rowv = NA)
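The row and column ordering in a clustered heatmap comes from hierarchical clustering, which base R provides via hclust(). A minimal sketch of that ordering step, using plain matrix() instead of rtemis's rnormmat() so the block is self-contained:

```r
set.seed(2018)
x <- matrix(rnorm(400), 20, 20)
x.cor <- cor(x)

# Hierarchical clustering: distance matrix -> dendrogram -> row ordering
hc <- hclust(dist(x.cor))
hc$order  # the permutation of rows a clustered heatmap would display
```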
4.4 Barplots
mplot3_bar(VADeaths,
col = colorRampPalette(c("#82afd3", "#000f3a"))(nrow(VADeaths)),
group.names = rownames(VADeaths),
group.legend = TRUE)
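colorRampPalette(), used above for the bar colors, returns a function: given a count n, it interpolates n colors between the supplied endpoints, so the call produces exactly one color per row of VADeaths:

```r
pal <- colorRampPalette(c("#82afd3", "#000f3a"))
pal(5)  # five colors from light blue to near-black; endpoints are preserved
```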
4.5 Boxplots
Some synthetic data:
x <- rnormmat(200, 4, return.df = TRUE, seed = 2019)
colnames(x) <- c("mango", "banana", "tangerine", "sugar")
mplot3_box(x)
4.6 Mosaic Plots
Mosaic plots are a great way to visualize count data, e.g. from a contingency table.
Some synthetic data from R’s documentation:
party <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(party) <- list(gender = c("F", "M"),
                        party = c("Democrat", "Independent", "Republican"))
mplot3_mosaic(party)
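In a mosaic plot, tile areas are proportional to cell counts. The proportions that determine the tile sizes can be inspected directly with base R's margin.table() and prop.table() (the table construction is repeated so this block stands alone):

```r
party <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(party) <- list(gender = c("F", "M"),
                        party = c("Democrat", "Independent", "Republican"))

margin.table(party, 1)          # total counts per gender
round(prop.table(party, 1), 2)  # within-gender proportions; each row sums to 1
```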
4.7 Decision Boundaries
The goal of a classifier is to establish a decision boundary in feature space separating the different outcome classes. While most feature spaces are high-dimensional and cannot be directly visualized, it can still be helpful to look at decision boundaries in low-dimensional problems. We can compare different algorithms or the effects of hyperparameter tuning for a given algorithm.
4.7.1 2D synthetic data
Let’s create some 2D synthetic data using the mlbench package, and plot it, coloring by group, using mplot3_xy.
set.seed(2018)
data2D <- mlbench::mlbench.2dnormals(200)
dat <- data.frame(data2D$x, y = data2D$classes)
mplot3_xy(dat$X1, dat$X2, group = dat$y, marker.col = c("#18A3AC", "#F48024"))
4.7.2 Logistic Regression
mod.glm <- s_GLM(dat, verbose = FALSE, print.plot = FALSE)
Warning in eval(family$initialize): non-integer #successes in a binomial glm!
mplot3_decision(mod.glm, dat)
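A decision boundary plot like the one above can be sketched in base R by evaluating a classifier over a dense grid of points and coloring each grid point by its predicted class. The sketch below uses glm() and hand-simulated Gaussian data so it is self-contained and does not require mlbench; mplot3_decision's internals may differ:

```r
set.seed(2018)
# Two Gaussian classes in 2D, centered at (-1, -1) and (1, 1)
n <- 100
dat <- data.frame(X1 = c(rnorm(n, -1), rnorm(n, 1)),
                  X2 = c(rnorm(n, -1), rnorm(n, 1)),
                  y = factor(rep(0:1, each = n)))

# Logistic regression, then predictions over a grid covering the feature space
mod <- glm(y ~ X1 + X2, family = binomial, data = dat)
grid <- expand.grid(X1 = seq(-4, 4, length.out = 100),
                    X2 = seq(-4, 4, length.out = 100))
grid$class <- as.integer(predict(mod, grid, type = "response") > .5)

# Decision regions as a colored grid, data points drawn on top
plot(grid$X1, grid$X2, pch = 15, cex = .5,
     col = c("#B7E0E4", "#FCDCBC")[grid$class + 1],
     xlab = "X1", ylab = "X2")
points(dat$X1, dat$X2, pch = 16,
       col = c("#18A3AC", "#F48024")[as.integer(dat$y)])
```

Because logistic regression is linear in X1 and X2, the boundary between the two colored regions is a straight line; the tree-based models below produce axis-aligned, piecewise boundaries instead.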
4.7.3 CART
mod.cart <- s_CART(dat, verbose = FALSE, print.plot = FALSE)
mplot3_decision(mod.cart, dat)
4.7.4 RF
mod.rf <- s_Ranger(dat, verbose = FALSE, print.plot = FALSE)
mplot3_decision(mod.rf, dat)
4.8 Multiplots with mplot3
rtemis provides a convenience function to plot multiple graphs together, rtlayout(). It is based on the graphics::layout function and integrates behind the scenes with all mplot3 functions. You specify the number of rows and columns; optional arguments allow you to arrange plots by row or by column and automatically create labels for each plot. As with most visualization functions in rtemis, there is an option to save to PDF. This means you can create a publication-quality multipanel plot in a few lines of code.
Start by defining the number of rows and columns with rtlayout(), draw your plots using mplot3 functions, and close the layout by calling rtlayout() with no arguments.
set.seed(2019)
x <- runif(200, min = -20, max = 20)
z <- rnorm(200, mean = 0, sd = 4)
y <- .8 * x^2 + .6 * z^3 + rnorm(200)
rtlayout(2, 2, byrow = TRUE, autolabel = TRUE)
mplot3_x(x, 'd')
mplot3_x(z, 'd')
mplot3_xy(x, y, fit = 'gam')
mplot3_xy(z, y, fit = 'gam')
rtlayout()
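Since rtlayout() is based on graphics::layout(), the base equivalent of the 2-by-2, by-row arrangement above is a matrix of panel numbers (a sketch; rtlayout additionally handles autolabeling and mplot3 integration). pdf(NULL) opens a null device so the sketch runs without a display:

```r
# Panel numbers placed by row; layout() draws into positions in this order
m <- matrix(1:4, nrow = 2, ncol = 2, byrow = TRUE)

pdf(NULL)          # null graphics device
layout(m)
for (i in 1:4) plot(rnorm(20), main = paste("Panel", i))
layout(matrix(1))  # reset to a single panel
dev.off()
```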