16  Handling Imbalanced Data

library(rtemis)

In classification problems, it is common for outcome classes to appear with different frequencies. This is called imbalanced data. Consider, for example, a binary classification problem where the positive class (the ‘events’) appears with 5% probability. Naively applying a learning algorithm without accounting for this class imbalance may lead the algorithm to always predict the majority class, which automatically results in 95% accuracy.

To handle imbalanced data, we take it into account both during model training and during model assessment.

16.1 Model Training

There are a few different ways to address the problem of imbalanced data during training. We’ll consider the three main ones:

  • Inverse Frequency Weighting
    We weight each case by the inverse of the frequency of its class, so that cases from less frequent classes are up-weighted. This is called Inverse Frequency Weighting (IFW) and is enabled by default in rtemis for all classification learning algorithms that support case weights. The logical argument ifw controls whether IFW is used; it is TRUE by default in all learners.

  • Upsampling the minority class
    We randomly sample, with replacement, from the minority class until it reaches the size of the majority class. The effect is not very different from up-weighting using IFW. The logical argument upsample in all rtemis learners that support classification controls whether upsampling of the minority class is performed. (If it is set to TRUE, the ifw argument becomes irrelevant, since the resampled training set is balanced.)

  • Downsampling the majority class
    Conversely, we randomly subsample the majority class down to the size of the minority class. The logical argument downsample controls this behavior. A minimal base-R sketch of all three approaches follows this list.
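
To make these approaches concrete, here is a minimal base-R sketch of what each one does to a factor outcome y with a 95%/5% class split. This is for illustration only; rtemis performs the equivalent steps internally via the ifw, upsample, and downsample arguments.

y <- factor(c(rep("no", 95), rep("yes", 5)))

# Inverse Frequency Weighting: weight each case by 1 / (frequency of its class)
class.freq <- table(y) / length(y)
case.weights <- 1 / class.freq[as.character(y)]  # "yes" cases get weight 20, "no" cases ~1.05

idx.yes <- which(y == "yes")
idx.no <- which(y == "no")

# Upsampling: resample the minority class with replacement up to the majority size
idx.up <- c(idx.no, sample(idx.yes, size = length(idx.no), replace = TRUE))

# Downsampling: randomly subsample the majority class down to the minority size
idx.down <- c(sample(idx.no, size = length(idx.yes)), idx.yes)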

16.2 Classification model performance metrics

During model selection as well as model assessment, it is crucial to use metrics that account for imbalanced outcomes.
The following metrics address the issue in different ways and are reported by the modError function in all classification problems (a short base-R sketch computing the first two from a confusion matrix follows the list):

  • Balanced Accuracy: the mean per-class Sensitivity, \[\frac{1}{K}\sum_{i=1}^K Sensitivity_i\] where K is the number of classes. In the binary case, this is equal to the mean of Sensitivity and Specificity.

  • F1: the harmonic mean of Sensitivity (a.k.a. Recall) and Positive Predictive Value (a.k.a. Precision), \[F_1 = 2\frac{precision \cdot recall}{precision + recall}\]

  • AUROC (Area Under the ROC Curve), i.e. the area under the curve of True Positive Rate vs. False Positive Rate (equivalently, Sensitivity vs. 1 - Specificity).
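
To make the first two definitions concrete, here is a small base-R sketch that computes Balanced Accuracy and F1 from a hypothetical binary confusion matrix (the counts are made up for illustration; in rtemis these metrics are computed by modError):

# Hypothetical confusion matrix: rows = Estimated, columns = Reference,
# with the positive class in the first row/column
cm <- matrix(c(90, 5, 10, 15), nrow = 2,
             dimnames = list(Estimated = c("pos", "neg"),
                             Reference = c("pos", "neg")))
tp <- cm["pos", "pos"]; fn <- cm["neg", "pos"]
fp <- cm["pos", "neg"]; tn <- cm["neg", "neg"]

sensitivity <- tp / (tp + fn)  # a.k.a. Recall
specificity <- tn / (tn + fp)
precision <- tp / (tp + fp)    # a.k.a. PPV

balanced.accuracy <- (sensitivity + specificity) / 2
f1 <- 2 * precision * sensitivity / (precision + sensitivity)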

16.3 Example dataset

Let’s look at a highly imbalanced dataset from the Penn Machine Learning Benchmarks (PMLB) repository:

dat <- read("https://github.com/EpistasisLab/pmlb/raw/master/datasets/hypothyroid/hypothyroid.tsv.gz")
06-30-24 10:57:38  Reading hypothyroid.tsv.gz using data.table... [read]
06-30-24 10:57:39 Read in 3,163 x 26 [read]
06-30-24 10:57:39 Removed 77 duplicate rows. [read]
06-30-24 10:57:39 New dimensions: 3,086 x 26 [read]
06-30-24 10:57:39 Completed in 0.01 minutes (Real: 0.47; User: 0.05; System: 0.01) [read]

dat$target <- factor(dat$target, levels = c(1, 0))
check_data(dat)
  dat: A data.table with 3086 rows and 26 columns

  Data types
  * 0 numeric features
  * 25 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 0 missing values

  Recommendations
  * Everything looks good 

Get the frequency of the target classes:

table(dat$target)

   1    0 
2945  141 

16.3.1 Class Imbalance

We can quantify the imbalance using the class_imbalance() function, which implements the following formula, where K is the number of classes, n_i is the number of cases in class i, and N is the total number of cases:

\[I = K\cdot\sum_{i=1}^K \left(\frac{n_i}{N} - \frac{1}{K}\right)^2\]

class_imbalance(dat$target)
[1] 0.8255895
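
The same value can be reproduced directly from the formula above (an illustrative sketch; class_imbalance() is the rtemis function):

n <- table(dat$target)      # class counts: 2945 and 141
K <- length(n)              # number of classes
N <- sum(n)                 # total number of cases
K * sum((n / N - 1 / K)^2)  # 0.8255895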

Let’s create some resamples to train and test models:

res <- resample(dat, seed = 2019)
06-30-24 10:57:39 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
    n.resamples: 10 
      resampler: strat.sub 
   stratify.var: y 
        train.p: 0.75 
   strat.n.bins: 4 
06-30-24 10:57:39 Using max n bins possible = 2 [strat.sub]
06-30-24 10:57:39 Created 10 stratified subsamples [resample]

dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
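
Because resampling was stratified on the outcome, the training and testing sets should preserve roughly the full data’s ~95%/5% class split. A quick check:

# Class proportions in the training and testing sets;
# both should be close to the ~95% / ~5% split of the full data
prop.table(table(dat.train$target))
prop.table(table(dat.test$target))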

16.4 GLM

16.4.1 No imbalance correction

Let’s train a GLM without inverse frequency weighting or upsampling. Since ifw is TRUE by default in all rtemis supervised learning functions that support it, we have to set it to FALSE:

mod.glm.imb <- s_GLM(dat.train, dat.test,
                     ifw = FALSE)
06-30-24 10:57:39 Hello, egenn [s_GLM]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

06-30-24 10:57:39 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1     0   
                1  2195  72
                0    13  33

                   Overall  
      Sensitivity  0.9941 
      Specificity  0.3143 
Balanced Accuracy  0.6542 
              PPV  0.9682 
              NPV  0.7174 
               F1  0.9810 
         Accuracy  0.9633 
              AUC  0.9431 
      Brier Score  0.0284 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  728  24
                0    9  12

                   Overall  
      Sensitivity  0.9878 
      Specificity  0.3333 
Balanced Accuracy  0.6606 
              PPV  0.9681 
              NPV  0.5714 
               F1  0.9778 
         Accuracy  0.9573 
              AUC  0.9177 
      Brier Score  0.0340 

  Positive Class:  1 
06-30-24 10:57:39 Completed in 2e-03 minutes (Real: 0.12; User: 0.11; System: 0.01) [s_GLM]

We get almost perfect Sensitivity, but very low Specificity.
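
To see where these numbers come from, we can recompute Sensitivity, Specificity, and Balanced Accuracy from the testing confusion matrix above:

# Recomputed from the testing confusion matrix (positive class: 1)
sens <- 728 / (728 + 9)  # Sensitivity = 0.9878
spec <- 12 / (12 + 24)   # Specificity = 0.3333
(sens + spec) / 2        # Balanced Accuracy = 0.6606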

16.4.2 IFW

Let’s enable IFW:

mod.glm.ifw <- s_GLM(dat.train, dat.test,
                     ifw = TRUE)
06-30-24 10:57:39 Hello, egenn [s_GLM]

06-30-24 10:57:39 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

06-30-24 10:57:39 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1     0   
                1  1912   8
                0   296  97

                   Overall  
      Sensitivity  0.8659 
      Specificity  0.9238 
Balanced Accuracy  0.8949 
              PPV  0.9958 
              NPV  0.2468 
               F1  0.9264 
         Accuracy  0.8686 
              AUC  0.9469 
      Brier Score  0.0968 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  624   6
                0  113  30

                   Overall  
      Sensitivity  0.8467 
      Specificity  0.8333 
Balanced Accuracy  0.8400 
              PPV  0.9905 
              NPV  0.2098 
               F1  0.9129 
         Accuracy  0.8461 
              AUC  0.9085 
      Brier Score  0.1104 

  Positive Class:  1 
06-30-24 10:57:39 Completed in 1.1e-03 minutes (Real: 0.07; User: 0.06; System: 4e-03) [s_GLM]

Sensitivity dropped somewhat, but Specificity improved dramatically, and the two are now much closer.

16.4.3 Upsampling

Let’s try upsampling instead of IFW:

mod.glm.ups <- s_GLM(dat.train, dat.test,
                     ifw = FALSE,
                     upsample = TRUE)
06-30-24 10:57:39 Hello, egenn [s_GLM]

06-30-24 10:57:39 Upsampling to create balanced set... [prepare_data]
06-30-24 10:57:39 1 is majority outcome with length = 2208 [prepare_data]

.:Classification Input Summary
Training features: 4416 x 25 
 Training outcome: 4416 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

06-30-24 10:57:39 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1     0     
                1  1913   124
                0   295  2084

                   Overall  
      Sensitivity  0.8664 
      Specificity  0.9438 
Balanced Accuracy  0.9051 
              PPV  0.9391 
              NPV  0.8760 
               F1  0.9013 
         Accuracy  0.9051 
              AUC  0.9476 
      Brier Score  0.0808 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  630   6
                0  107  30

                   Overall  
      Sensitivity  0.8548 
      Specificity  0.8333 
Balanced Accuracy  0.8441 
              PPV  0.9906 
              NPV  0.2190 
               F1  0.9177 
         Accuracy  0.8538 
              AUC  0.9086 
      Brier Score  0.1100 

  Positive Class:  1 
06-30-24 10:57:39 Completed in 2.3e-03 minutes (Real: 0.14; User: 0.13; System: 0.01) [s_GLM]

In this example, upsampling the minority class gives results very similar to IFW: Specificity improves substantially at the cost of some Sensitivity.

16.4.4 Downsampling

Finally, let’s try downsampling the majority class:

mod.glm.downs <- s_GLM(dat.train, dat.test,
                       ifw = FALSE,
                       downsample = TRUE)
06-30-24 10:57:40 Hello, egenn [s_GLM]

06-30-24 10:57:40 Downsampling to balance outcome classes... [prepare_data]
06-30-24 10:57:40 0 is the minority outcome with 105 cases [prepare_data]

.:Classification Input Summary
Training features: 210 x 25 
 Training outcome: 210 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

06-30-24 10:57:40 Training GLM... [s_GLM]

.:LOGISTIC Classification Training Summary
                   Reference 
        Estimated  1   0   
                1  96   6
                0   9  99

                   Overall  
      Sensitivity  0.9143 
      Specificity  0.9429 
Balanced Accuracy  0.9286 
              PPV  0.9412 
              NPV  0.9167 
               F1  0.9275 
         Accuracy  0.9286 
              AUC  0.9640 
      Brier Score  0.0675 

  Positive Class:  1 

.:LOGISTIC Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  608   3
                0  129  33

                   Overall  
      Sensitivity  0.8250 
      Specificity  0.9167 
Balanced Accuracy  0.8708 
              PPV  0.9951 
              NPV  0.2037 
               F1  0.9021 
         Accuracy  0.8292 
              AUC  0.9129 
      Brier Score  0.1243 

  Positive Class:  1 
06-30-24 10:57:40 Completed in 3.5e-04 minutes (Real: 0.02; User: 0.02; System: 1e-03) [s_GLM]

In this case, downsampling gives results similar to upsampling.

16.5 Random forest

Some algorithms allow more than one way to handle imbalanced data. For Random Forest, see the Tech Report by Chen, Liaw, and Breiman, “Using Random Forest to Learn Imbalanced Data”, which describes the “Balanced Random Forest” and “Weighted Random Forest” approaches.
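
For reference, here is a rough sketch of how the weighted approach maps onto the ranger package directly, using its case.weights and class.weights arguments. This is an illustration outside rtemis; the s_Ranger() calls below achieve the same through the ifw, ifw.case.weights, and ifw.class.weights arguments.

library(ranger)

# Inverse class frequencies in the training set
freq <- table(dat.train$target) / nrow(dat.train)

# "Weighted RF" via per-case weights: each case weighted by 1 / its class frequency
rf.casew <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                   case.weights = as.numeric(1 / freq[as.character(dat.train$target)]))

# Or via per-class weights, given in the order of the factor levels
rf.classw <- ranger(target ~ ., data = dat.train, num.trees = 1000,
                    class.weights = as.numeric(1 / freq))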

16.5.1 No imbalance correction

Again, let’s begin by training a model with no correction for imbalanced data:

mod.rf.imb <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0    
                1  2208    1
                0     0  104

                   Overall  
      Sensitivity  1.0000 
      Specificity  0.9905 
Balanced Accuracy  0.9952 
              PPV  0.9995 
              NPV  1.0000 
               F1  0.9998 
         Accuracy  0.9996 
              AUC  1.0000 
      Brier Score  3.1e-03

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  732  14
                0    5  22

                   Overall  
      Sensitivity  0.9932 
      Specificity  0.6111 
Balanced Accuracy  0.8022 
              PPV  0.9812 
              NPV  0.8148 
               F1  0.9872 
         Accuracy  0.9754 
              AUC  0.9785 
      Brier Score  0.0193 

  Positive Class:  1 
06-30-24 10:57:40 Completed in 4.2e-03 minutes (Real: 0.25; User: 1.00; System: 0.03) [s_Ranger]

16.5.2 IFW: Case weights

Now, with IFW. By default, s_Ranger() uses IFW to define case weights (i.e. ifw.case.weights = TRUE):

mod.rf.ifw <- s_Ranger(dat.train, dat.test,
                       ifw = TRUE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]

06-30-24 10:57:40 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0    
                1  2194    0
                0    14  105

                   Overall  
      Sensitivity  0.9937 
      Specificity  1.0000 
Balanced Accuracy  0.9968 
              PPV  1.0000 
              NPV  0.8824 
               F1  0.9968 
         Accuracy  0.9939 
              AUC  1.0000 
      Brier Score  4.8e-03

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  728  10
                0    9  26

                   Overall  
      Sensitivity  0.9878 
      Specificity  0.7222 
Balanced Accuracy  0.8550 
              PPV  0.9864 
              NPV  0.7429 
               F1  0.9871 
         Accuracy  0.9754 
              AUC  0.9840 
      Brier Score  0.0187 

  Positive Class:  1 
06-30-24 10:57:40 Completed in 0.01 minutes (Real: 0.31; User: 1.20; System: 0.02) [s_Ranger]

Again, IFW increases the Specificity.

16.5.3 IFW: Class weights

Alternatively, we can use IFW to define class weights:

mod.rf.cw <- s_Ranger(dat.train, dat.test,
                      ifw = TRUE,
                      ifw.case.weights = FALSE,
                      ifw.class.weights = TRUE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]

06-30-24 10:57:40 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]

.:Classification Input Summary
Training features: 2313 x 25 
 Training outcome: 2313 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0    
                1  2208    1
                0     0  104

                   Overall  
      Sensitivity  1.0000 
      Specificity  0.9905 
Balanced Accuracy  0.9952 
              PPV  0.9995 
              NPV  1.0000 
               F1  0.9998 
         Accuracy  0.9996 
              AUC  1.0000 
      Brier Score  3.1e-03

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  732  15
                0    5  21

                   Overall  
      Sensitivity  0.9932 
      Specificity  0.5833 
Balanced Accuracy  0.7883 
              PPV  0.9799 
              NPV  0.8077 
               F1  0.9865 
         Accuracy  0.9741 
              AUC  0.9813 
      Brier Score  0.0191 

  Positive Class:  1 
06-30-24 10:57:40 Completed in 3.9e-03 minutes (Real: 0.23; User: 0.97; System: 0.02) [s_Ranger]

16.5.4 Upsampling

Now try upsampling:

mod.rf.ups <- s_Ranger(dat.train, dat.test,
                       ifw = FALSE,
                       upsample = TRUE)
06-30-24 10:57:40 Hello, egenn [s_Ranger]

06-30-24 10:57:40 Upsampling to create balanced set... [prepare_data]
06-30-24 10:57:40 1 is majority outcome with length = 2208 [prepare_data]

.:Classification Input Summary
Training features: 4416 x 25 
 Training outcome: 4416 x 1 
 Testing features: 773 x 25 
  Testing outcome: 773 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

06-30-24 10:57:40 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  1     0     
                1  2205     0
                0     3  2208

                   Overall  
      Sensitivity  0.9986 
      Specificity  1.0000 
Balanced Accuracy  0.9993 
              PPV  1.0000 
              NPV  0.9986 
               F1  0.9993 
         Accuracy  0.9993 
              AUC  1.0000 
      Brier Score  1.4e-03

  Positive Class:  1 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  1    0   
                1  729  12
                0    8  24

                   Overall  
      Sensitivity  0.9891 
      Specificity  0.6667 
Balanced Accuracy  0.8279 
              PPV  0.9838 
              NPV  0.7500 
               F1  0.9865 
         Accuracy  0.9741 
              AUC  0.9817 
      Brier Score  0.0181 

  Positive Class:  1 
06-30-24 10:57:41 Completed in 0.01 minutes (Real: 0.63; User: 2.32; System: 0.05) [s_Ranger]