library(rtemis)
16 Handling Imbalanced Data
In classification problems, it is common for the outcome classes to appear with different frequencies; such data are called imbalanced. Consider, for example, a binary classification problem where the positive class (the ‘events’) appears with 5% probability. Applying a learning algorithm naively, without accounting for this class imbalance, may lead to a model that always predicts the majority class, which automatically yields 95% accuracy while detecting no events.
Handling imbalanced data requires attention both during model training and during model assessment.
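As a quick illustration of the problem, here is a minimal sketch in base R using a made-up outcome vector with 5% events:

```r
# Made-up outcome: 950 controls ("0"), 50 events ("1")
y <- factor(c(rep(0, 950), rep(1, 50)))

# A degenerate "model" that always predicts the majority class
pred <- factor(rep(0, 1000), levels = levels(y))

mean(pred == y)                         # accuracy: 0.95
sum(pred == 1 & y == 1) / sum(y == 1)   # sensitivity for the events: 0
```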
16.1 Model Training
There are a few different ways to address the problem of imbalanced data during training. We’ll consider the three main ones:

* Inverse Frequency Weighting: We weight each case by the inverse of its class frequency, so that less frequent classes are up-weighted. This is called Inverse Frequency Weighting (IFW) and is enabled by default in rtemis for all classification learning algorithms that support case weights. The logical argument ifw controls whether IFW is used; it is TRUE by default in all learners.
* Upsampling the minority class: We randomly sample, with replacement, from the minority class until it reaches the size of the majority class. The effect is not very different from up-weighting with IFW. The logical argument upsample, available in all rtemis learners that support classification, controls whether upsampling of the minority class is performed. (If it is set to TRUE, the ifw argument becomes irrelevant, since the sample is now balanced.)
* Downsampling the majority class: Conversely, we randomly subsample the majority class down to the size of the minority class. The logical argument downsample controls this behavior.
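To make the first approach concrete, here is a minimal base-R sketch of inverse frequency weights (illustrative only, not rtemis’s internal code):

```r
# Hypothetical outcome with a 9:1 class imbalance
y <- factor(c(rep("a", 90), rep("b", 10)))

# Inverse frequency weights: each case is weighted by 1 / (its class's count)
freq <- table(y)
case.weights <- as.numeric(1 / freq[y])

# Each class now contributes the same total weight
tapply(case.weights, y, sum)  # both classes sum to 1
```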
16.2 Classification model performance metrics
During model selection as well as model assessment, it is crucial to use metrics that take into consideration imbalanced outcomes.
The following metrics address the issue in different ways and are reported by the modError function for all classification problems:
* Balanced Accuracy: the mean per-class Sensitivity, \[\frac{1}{K}\sum_{i=1}^K Sensitivity_i\] In the binary case, this equals the mean of Sensitivity and Specificity.
* F1: the harmonic mean of Sensitivity (a.k.a. Recall) and Positive Predictive Value (a.k.a. Precision), \[F_1 = 2\cdot\frac{precision \cdot recall}{precision + recall}\]
* AUROC: the Area Under the ROC curve, i.e. the curve of True Positive Rate vs. False Positive Rate (Sensitivity vs. 1 - Specificity).
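These quantities are simple to compute by hand from a 2×2 confusion matrix; a short base-R sketch with made-up counts:

```r
# Made-up confusion matrix counts
tp <- 90; fn <- 10   # positives: 90 caught, 10 missed
fp <- 40; tn <- 60   # negatives: 40 false alarms, 60 correct rejections

sensitivity <- tp / (tp + fn)   # recall: 0.9
specificity <- tn / (tn + fp)   # 0.6
precision   <- tp / (tp + fp)   # PPV

balanced.accuracy <- (sensitivity + specificity) / 2   # 0.75
f1 <- 2 * precision * sensitivity / (precision + sensitivity)
```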
16.3 Example dataset
Let’s look at a very imbalanced dataset from the Penn ML Benchmarks repository:
dat <- read("https://github.com/EpistasisLab/pmlb/raw/master/datasets/hypothyroid/hypothyroid.tsv.gz")
06-30-24 10:57:38 ▶ Reading hypothyroid.tsv.gz using data.table... [read]
06-30-24 10:57:39 Read in 3,163 x 26 [read]
06-30-24 10:57:39 Removed 77 duplicate rows. [read]
06-30-24 10:57:39 New dimensions: 3,086 x 26 [read]
06-30-24 10:57:39 Completed in 0.01 minutes (Real: 0.47; User: 0.05; System: 0.01) [read]
dat$target <- factor(dat$target, levels = c(1, 0))
check_data(dat)
dat: A data.table with 3086 rows and 26 columns
Data types
* 0 numeric features
* 25 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 0 missing values
Recommendations
* Everything looks good
Get the frequency of the target classes:
table(dat$target)
1 0
2945 141
16.3.1 Class Imbalance
We can compute the Class Imbalance index with the class_imbalance() function:
\[I = K\cdot\sum_{i=1}^K (n_i/N - 1/K)^2\]
class_imbalance(dat$target)
[1] 0.8255895
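The formula is easy to reproduce in base R; the following sketch (not rtemis’s internal code) should match the output above:

```r
# I = K * sum_i (n_i/N - 1/K)^2; equals 0 when classes are perfectly balanced
class.imbalance <- function(x) {
  n <- table(x)    # per-class counts
  K <- length(n)   # number of classes
  K * sum((n / sum(n) - 1 / K)^2)
}

class.imbalance(dat$target)  # 0.8255895
```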
Let’s create some resamples to train and test models:
res <- resample(dat, seed = 2019)
06-30-24 10:57:39 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
n.resamples: 10
resampler: strat.sub
stratify.var: y
train.p: 0.75
strat.n.bins: 4
06-30-24 10:57:39 Using max n bins possible = 2 [strat.sub]
06-30-24 10:57:39 Created 10 stratified subsamples [resample]
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
16.4 GLM
16.4.1 No imbalance correction
Let’s train a GLM without inverse frequency weighting or upsampling. Since IFW is set to TRUE by default in all rtemis supervised learning functions that support it, we have to set it to FALSE:

mod.glm.imb <- s_GLM(dat.train, dat.test,
                     ifw = FALSE)
06-30-24 10:57:39 Hello, egenn [s_GLM]
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
06-30-24 10:57:39 Training GLM... [s_GLM]
.:LOGISTIC Classification Training Summary
Reference
Estimated 1 0
1 2195 72
0 13 33
Overall
Sensitivity 0.9941
Specificity 0.3143
Balanced Accuracy 0.6542
PPV 0.9682
NPV 0.7174
F1 0.9810
Accuracy 0.9633
AUC 0.9431
Brier Score 0.0284
Positive Class: 1
.:LOGISTIC Classification Testing Summary
Reference
Estimated 1 0
1 728 24
0 9 12
Overall
Sensitivity 0.9878
Specificity 0.3333
Balanced Accuracy 0.6606
PPV 0.9681
NPV 0.5714
F1 0.9778
Accuracy 0.9573
AUC 0.9177
Brier Score 0.0340
Positive Class: 1
06-30-24 10:57:39 Completed in 2e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) [s_GLM]
We get almost perfect Sensitivity, but very low Specificity.
16.4.2 IFW
Let’s enable IFW:
mod.glm.ifw <- s_GLM(dat.train, dat.test,
                     ifw = TRUE)
06-30-24 10:57:39 Hello, egenn [s_GLM]
06-30-24 10:57:39 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
06-30-24 10:57:39 Training GLM... [s_GLM]
.:LOGISTIC Classification Training Summary
Reference
Estimated 1 0
1 1912 8
0 296 97
Overall
Sensitivity 0.8659
Specificity 0.9238
Balanced Accuracy 0.8949
PPV 0.9958
NPV 0.2468
F1 0.9264
Accuracy 0.8686
AUC 0.9469
Brier Score 0.0968
Positive Class: 1
.:LOGISTIC Classification Testing Summary
Reference
Estimated 1 0
1 624 6
0 113 30
Overall
Sensitivity 0.8467
Specificity 0.8333
Balanced Accuracy 0.8400
PPV 0.9905
NPV 0.2098
F1 0.9129
Accuracy 0.8461
AUC 0.9085
Brier Score 0.1104
Positive Class: 1
06-30-24 10:57:39 Completed in 1.1e-03 minutes (Real: 0.07; User: 0.06; System: 4e-03) [s_GLM]
Sensitivity dropped a little, but Specificity improved a lot and they are now very close.
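For intuition, IFW with a GLM is roughly equivalent to passing inverse-frequency case weights to stats::glm() directly. A self-contained sketch with simulated data (hypothetical variables, not rtemis’s internal code; note that non-integer weights with family = binomial trigger a harmless warning):

```r
set.seed(2019)
# Simulated imbalanced binary outcome with one informative predictor
y <- rbinom(500, 1, 0.05)
x <- rnorm(500, mean = 2 * y)

# Inverse frequency case weights
w <- as.numeric(1 / table(y)[as.character(y)])

# Weighted logistic regression: minority cases count for more
fit <- glm(y ~ x, family = binomial, weights = w)
```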
16.4.3 Upsampling
Let’s try upsampling instead of IFW:
mod.glm.ups <- s_GLM(dat.train, dat.test,
                     ifw = FALSE,
                     upsample = TRUE)
06-30-24 10:57:39 Hello, egenn [s_GLM]
06-30-24 10:57:39 Upsampling to create balanced set... [prepare_data]
06-30-24 10:57:39 1 is majority outcome with length = 2208 [prepare_data]
.:Classification Input Summary
Training features: 4416 x 25
Training outcome: 4416 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
06-30-24 10:57:39 Training GLM... [s_GLM]
.:LOGISTIC Classification Training Summary
Reference
Estimated 1 0
1 1913 124
0 295 2084
Overall
Sensitivity 0.8664
Specificity 0.9438
Balanced Accuracy 0.9051
PPV 0.9391
NPV 0.8760
F1 0.9013
Accuracy 0.9051
AUC 0.9476
Brier Score 0.0808
Positive Class: 1
.:LOGISTIC Classification Testing Summary
Reference
Estimated 1 0
1 630 6
0 107 30
Overall
Sensitivity 0.8548
Specificity 0.8333
Balanced Accuracy 0.8441
PPV 0.9906
NPV 0.2190
F1 0.9177
Accuracy 0.8538
AUC 0.9086
Brier Score 0.1100
Positive Class: 1
06-30-24 10:57:39 Completed in 2.3e-03 minutes (Real: 0.14; User: 0.13; System: 0.01) [s_GLM]
In this example, upsampling the minority class improved Specificity substantially at the cost of some Sensitivity, with test-set results very close to those obtained with IFW.
16.4.4 Downsampling
mod.glm.downs <- s_GLM(dat.train, dat.test,
                       ifw = FALSE,
                       downsample = TRUE)
06-30-24 10:57:40 Hello, egenn [s_GLM]
06-30-24 10:57:40 Downsampling to balance outcome classes... [prepare_data]
06-30-24 10:57:40 0 is the minority outcome with 105 cases [prepare_data]
.:Classification Input Summary
Training features: 210 x 25
Training outcome: 210 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
06-30-24 10:57:40 Training GLM... [s_GLM]
.:LOGISTIC Classification Training Summary
Reference
Estimated 1 0
1 96 6
0 9 99
Overall
Sensitivity 0.9143
Specificity 0.9429
Balanced Accuracy 0.9286
PPV 0.9412
NPV 0.9167
F1 0.9275
Accuracy 0.9286
AUC 0.9640
Brier Score 0.0675
Positive Class: 1
.:LOGISTIC Classification Testing Summary
Reference
Estimated 1 0
1 608 3
0 129 33
Overall
Sensitivity 0.8250
Specificity 0.9167
Balanced Accuracy 0.8708
PPV 0.9951
NPV 0.2037
F1 0.9021
Accuracy 0.8292
AUC 0.9129
Brier Score 0.1243
Positive Class: 1
06-30-24 10:57:40 Completed in 3.5e-04 minutes (Real: 0.02; User: 0.02; System: 1e-03) [s_GLM]
Similar results to upsampling, in this case.
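Both resampling strategies boil down to simple index manipulation. A minimal base-R sketch of what a learner may do internally (toy data, illustrative only):

```r
y <- factor(c(rep(1, 200), rep(0, 20)))  # imbalanced toy outcome
idx.min <- which(y == "0")               # minority class indices
idx.maj <- which(y == "1")               # majority class indices

# Upsampling: draw minority indices with replacement up to majority size
idx.up <- c(idx.maj, sample(idx.min, length(idx.maj), replace = TRUE))

# Downsampling: draw a minority-sized subset of the majority class
idx.down <- c(sample(idx.maj, length(idx.min)), idx.min)

table(y[idx.up])    # balanced: 200 / 200
table(y[idx.down])  # balanced: 20 / 20
```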
16.5 Random forest
Some algorithms allow multiple ways to handle imbalanced data. See this Tech Report for techniques to handle imbalanced classes with Random Forest. The report describes the “Balanced Random Forest” and “Weighted Random Forest” approaches.
16.5.1 No imbalance correction
Again, let’s begin by training a model with no correction for imbalanced data:
mod.rf.imb <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]
.:Ranger Classification Training Summary
Reference
Estimated 1 0
1 2208 1
0 0 104
Overall
Sensitivity 1.0000
Specificity 0.9905
Balanced Accuracy 0.9952
PPV 0.9995
NPV 1.0000
F1 0.9998
Accuracy 0.9996
AUC 1.0000
Brier Score 3.1e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Reference
Estimated 1 0
1 732 14
0 5 22
Overall
Sensitivity 0.9932
Specificity 0.6111
Balanced Accuracy 0.8022
PPV 0.9812
NPV 0.8148
F1 0.9872
Accuracy 0.9754
AUC 0.9785
Brier Score 0.0193
Positive Class: 1
06-30-24 10:57:40 Completed in 4.2e-03 minutes (Real: 0.25; User: 1.00; System: 0.03) [s_Ranger]
16.5.2 IFW: Case weights
Now, with IFW. By default, s_Ranger() uses IFW to define case weights (i.e. ifw.case.weights = TRUE):

mod.rf.ifw <- s_Ranger(dat.train, dat.test,
                       ifw = TRUE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]
06-30-24 10:57:40 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]
.:Ranger Classification Training Summary
Reference
Estimated 1 0
1 2194 0
0 14 105
Overall
Sensitivity 0.9937
Specificity 1.0000
Balanced Accuracy 0.9968
PPV 1.0000
NPV 0.8824
F1 0.9968
Accuracy 0.9939
AUC 1.0000
Brier Score 4.8e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Reference
Estimated 1 0
1 728 10
0 9 26
Overall
Sensitivity 0.9878
Specificity 0.7222
Balanced Accuracy 0.8550
PPV 0.9864
NPV 0.7429
F1 0.9871
Accuracy 0.9754
AUC 0.9840
Brier Score 0.0187
Positive Class: 1
06-30-24 10:57:40 Completed in 0.01 minutes (Real: 0.31; User: 1.20; System: 0.02) [s_Ranger]
Again, IFW increases the Specificity.
16.5.3 IFW: Class weights
Alternatively, we can use IFW to define class weights:
mod.rf.cw <- s_Ranger(dat.train, dat.test,
                      ifw = TRUE,
                      ifw.case.weights = FALSE,
                      ifw.class.weights = TRUE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]
06-30-24 10:57:40 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]
.:Classification Input Summary
Training features: 2313 x 25
Training outcome: 2313 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]
.:Ranger Classification Training Summary
Reference
Estimated 1 0
1 2208 1
0 0 104
Overall
Sensitivity 1.0000
Specificity 0.9905
Balanced Accuracy 0.9952
PPV 0.9995
NPV 1.0000
F1 0.9998
Accuracy 0.9996
AUC 1.0000
Brier Score 3.1e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Reference
Estimated 1 0
1 732 15
0 5 21
Overall
Sensitivity 0.9932
Specificity 0.5833
Balanced Accuracy 0.7883
PPV 0.9799
NPV 0.8077
F1 0.9865
Accuracy 0.9741
AUC 0.9813
Brier Score 0.0191
Positive Class: 1
06-30-24 10:57:40 Completed in 3.9e-03 minutes (Real: 0.23; User: 0.97; System: 0.02) [s_Ranger]
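The two options map onto ranger’s own arguments: ifw.case.weights populates ranger’s case.weights (one weight per observation), while ifw.class.weights populates class.weights (one weight per outcome class, used in the splitting rule). A direct ranger sketch of both (a rough equivalent, not rtemis’s internal code):

```r
library(ranger)

# Inverse class frequencies from the training outcome
freq <- table(dat.train$target)

# Weighted Random Forest via per-observation case weights
fit.case <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                   case.weights = as.numeric(1 / freq[dat.train$target]))

# Per-class weights applied in the splitting rule
fit.class <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                    class.weights = as.numeric(1 / freq))
```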
16.5.4 Upsampling
Now try upsampling:
mod.rf.ups <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE,
                       upsample = TRUE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]
06-30-24 10:57:40 Upsampling to create balanced set... [prepare_data]
06-30-24 10:57:40 1 is majority outcome with length = 2208 [prepare_data]
.:Classification Input Summary
Training features: 4416 x 25
Training outcome: 4416 x 1
Testing features: 773 x 25
Testing outcome: 773 x 1
.:Parameters
n.trees: 1000
mtry: NULL
06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]
.:Ranger Classification Training Summary
Reference
Estimated 1 0
1 2205 0
0 3 2208
Overall
Sensitivity 0.9986
Specificity 1.0000
Balanced Accuracy 0.9993
PPV 1.0000
NPV 0.9986
F1 0.9993
Accuracy 0.9993
AUC 1.0000
Brier Score 1.4e-03
Positive Class: 1
.:Ranger Classification Testing Summary
Reference
Estimated 1 0
1 729 12
0 8 24
Overall
Sensitivity 0.9891
Specificity 0.6667
Balanced Accuracy 0.8279
PPV 0.9838
NPV 0.7500
F1 0.9865
Accuracy 0.9741
AUC 0.9817
Brier Score 0.0181
Positive Class: 1
06-30-24 10:57:41 Completed in 0.01 minutes (Real: 0.63; User: 2.32; System: 0.05) [s_Ranger]