Bagging, i.e. bootstrap aggregating, is a core ML ensemble technique that reduces the variance of a learner. In bagging, the training set is resampled with replacement, a model is trained on each resample, and the models' predictions are aggregated - averaged for regression, majority-voted for classification - to give the final estimate. Random Forest is the most popular application of bagging. rtemis
allows you to easily bag any learner - but don’t try bagging a linear model: bagging helps unstable, high-variance learners like decision trees, while a linear model is stable, so its bagged average is essentially the original model.
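To make the mechanics concrete, here is a minimal hand-rolled sketch of bagging for regression. It is illustrative only - not what rtemis’s bag() does internally - and it assumes a data frame whose outcome column is named y, as in the regression example below; bag_rpart() is a hypothetical helper built on the rpart package.

library(rpart)

# Minimal manual bagging sketch (illustrative, not the rtemis implementation):
# draw k bootstrap resamples, fit a tree to each, average the predictions.
bag_rpart <- function(dat.train, dat.test, k = 20, maxdepth = 3) {
  preds <- sapply(seq_len(k), function(i) {
    idx <- sample(nrow(dat.train), replace = TRUE)   # bootstrap resample
    fit <- rpart(y ~ ., data = dat.train[idx, ],
                 control = rpart.control(maxdepth = maxdepth))
    predict(fit, newdata = dat.test)                 # per-tree predictions
  })
  rowMeans(preds)                                    # average across trees
}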
Regression
First, create some synthetic data:
set.seed(2018)
x <- rnormmat(500, 50)                  # 500 cases x 50 features of random normal values
colnames(x) <- paste0("Feature", 1:50)
y <- x[, 3]^2 + x[, 10] * 2.3 + x[, 15] * .5 + x[, 20] * .7 +
  x[, 27] * 3.4 + x[, 31] * 1.2 + rnorm(500)        # outcome depends on 6 features
dat <- data.frame(x, y)
res <- resample(dat, seed = 2018)
01-07-24 00:31:43 Input contains more than one columns; will stratify on last [resample]
.: Resampling Parameters
n.resamples : 10
resampler : strat.sub
stratify.var : y
train.p : 0.75
strat.n.bins : 4
01-07-24 00:31:43 Created 10 stratified subsamples [resample]
dat.train <- dat[res$Subsample_1, ]
dat.test <- dat[-res$Subsample_1, ]
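As a quick sanity check - illustrative, not part of the original workflow - the training fraction should match train.p:

# The training subsample should contain ~75% of cases (train.p = 0.75)
length(res$Subsample_1) / nrow(dat)    # 373/500 = 0.746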
Single CART
Let’s start by training a single CART with a maximum depth of 3:
mod.cart <- s_CART(dat.train, dat.test,
                   maxdepth = 3,
                   print.plot = FALSE)
01-07-24 00:31:43 Hello, egenn [s_CART]
.: Regression Input Summary
Training features: 373 x 50
Training outcome: 373 x 1
Testing features: 127 x 50
Testing outcome: 127 x 1
01-07-24 00:31:43 Training CART... [s_CART]
.: CART Regression Training Summary
MSE = 7.54 (67.32%)
RMSE = 2.75 (42.84%)
MAE = 2.17 (43.35%)
r = 0.82 (p = 3.9e-92)
R sq = 0.67
.: CART Regression Testing Summary
MSE = 7.83 (64.11%)
RMSE = 2.80 (40.09%)
MAE = 2.17 (40.52%)
r = 0.80 (p = 7.3e-30)
R sq = 0.64
01-07-24 00:31:43 Completed in 3.2e-04 minutes (Real: 0.02; User: 0.02; System: 1e-03) [s_CART]
rtlayout(1, 2)
mod.cart$plotFitted()
mod.cart$plotPredicted()
Bagged CARTs
Let’s bag 20 CARTs:
bag.cart <- bag(dat.train, dat.test,
                mod = 'cart', k = 20,
                mod.params = list(maxdepth = 3),
                .resample = setup.resample(resampler = "bootstrap",
                                           n.resamples = 20),
                print.plot = FALSE)
01-07-24 00:31:43 Hello, egenn [bag]
.: Regression Input Summary
Training features: 373 x 50
Training outcome: 373 x 1
Testing features: 127 x 50
Testing outcome: 127 x 1
.: Parameters
mod : cart
mod.params :
maxdepth : 3
01-07-24 00:31:43 Bagging 20 Classification and Regression Trees... [bag]
01-07-24 00:31:43 Outer resampling: Future plan set to multicore with 8 workers [resLearn]
.: Regression Training Summary
MSE = 6.07 (73.72%)
RMSE = 2.46 (48.73%)
MAE = 1.92 (50.09%)
r = 0.87 (p = 5e-117)
R sq = 0.74
.: Regression Testing Summary
MSE = 6.75 (69.06%)
RMSE = 2.60 (44.38%)
MAE = 1.98 (45.56%)
r = 0.84 (p = 3e-35)
R sq = 0.69
01-07-24 00:31:44 Completed in 0.01 minutes (Real: 0.49; User: 0.14; System: 0.20) [bag]
rtlayout(1, 2)
bag.cart$plotFitted()
bag.cart$plotPredicted()
Comparing the bagged model to the single CART, we make two important observations:
1. Both training and testing errors are reduced.
2. Generalizability is increased, i.e. the gap between training and testing error is decreased.
The same pattern can be reproduced with the hand-rolled sketch from the introduction, as shown below.
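A minimal check using the hypothetical bag_rpart() helper from the introduction (illustrative; exact numbers will differ from the rtemis output above):

# Single rpart tree vs. a hand-rolled bag of 20 trees on the same split
mse <- function(pred, obs) mean((pred - obs)^2)
fit1 <- rpart(y ~ ., data = dat.train, control = rpart.control(maxdepth = 3))
mse(predict(fit1, dat.test), dat.test$y)                 # single tree test MSE
mse(bag_rpart(dat.train, dat.test, k = 20), dat.test$y)  # typically lower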
Classification
We’ll use the Sonar dataset:
library(mlbench)
data(Sonar)
res <- resample(Sonar)
01-07-24 00:31:44 Input contains more than one columns; will stratify on last [resample]
.: Resampling Parameters
n.resamples : 10
resampler : strat.sub
stratify.var : y
train.p : 0.75
strat.n.bins : 4
01-07-24 00:31:44 Using max n bins possible = 2 [strat.sub]
01-07-24 00:31:44 Created 10 stratified subsamples [resample]
sonar.train <- Sonar[res$Subsample_1, ]
sonar.test <- Sonar[-res$Subsample_1, ]
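Because strat.sub stratifies on the outcome, the class balance should be similar in the two subsets. A quick check - illustrative, not part of the original workflow:

# Class balance should be preserved across the stratified split
table(sonar.train$Class)
table(sonar.test$Class)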
Single CART
mod.cart <- s_CART(sonar.train, sonar.test,
                   maxdepth = 10,
                   print.plot = FALSE)
01-07-24 00:31:44 Hello, egenn [s_CART]
01-07-24 00:31:44 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]
.: Classification Input Summary
Training features: 155 x 60
Training outcome: 155 x 1
Testing features: 53 x 60
Testing outcome: 53 x 1
01-07-24 00:31:44 Training CART... [s_CART]
.: CART Classification Training Summary
Reference
Estimated M R
M 83 0
R 0 72
Overall
Sensitivity 1
Specificity 1
Balanced Accuracy 1
PPV 1
NPV 1
F1 1
Accuracy 1
AUC 1
Positive Class: M
.: CART Classification Testing Summary
Reference
Estimated M R
M 22 6
R 6 19
Overall
Sensitivity 0.7857
Specificity 0.7600
Balanced Accuracy 0.7729
PPV 0.7857
NPV 0.7600
F1 0.7857
Accuracy 0.7736
AUC 0.7729
Positive Class: M
01-07-24 00:31:45 Completed in 1.1e-03 minutes (Real: 0.07; User: 0.06; System: 0.01) [s_CART]
rtlayout(1, 2)
mod.cart$plotFitted()
mod.cart$plotPredicted()
Bagged CARTs
bag.cart <- bag(sonar.train, sonar.test,
                mod = 'cart', k = 20,
                mod.params = list(maxdepth = 10),
                .resample = setup.resample(resampler = "bootstrap",
                                           n.resamples = 20),
                print.plot = FALSE)
01-07-24 00:31:45 Hello, egenn [bag]
01-07-24 00:31:45 Imbalanced classes: using Inverse Frequency Weighting [prepare_data]
.: Classification Input Summary
Training features: 155 x 60
Training outcome: 155 x 1
Testing features: 53 x 60
Testing outcome: 53 x 1
.: Parameters
mod : cart
mod.params :
maxdepth : 10
01-07-24 00:31:45 Bagging 20 Classification and Regression Trees... [bag]
01-07-24 00:31:45 Outer resampling: Future plan set to multicore with 8 workers [resLearn]
.: Classification Training Summary
Reference
Estimated M R
M 82 0
R 1 72
Overall
Sensitivity 0.9880
Specificity 1.0000
Balanced Accuracy 0.9940
PPV 1.0000
NPV 0.9863
F1 0.9939
Accuracy 0.9935
Positive Class: M
.: Classification Testing Summary
Reference
Estimated M R
M 23 6
R 5 19
Overall
Sensitivity 0.8214
Specificity 0.7600
Balanced Accuracy 0.7907
PPV 0.7931
NPV 0.7917
F1 0.8070
Accuracy 0.7925
Positive Class: M
01-07-24 00:31:45 Completed in 0.01 minutes (Real: 0.67; User: 0.19; System: 0.23) [bag]
rtlayout(1, 2)
bag.cart$plotFitted()
bag.cart$plotPredicted()
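Finally, to connect back to the introduction: for classification, the aggregation step is a majority vote across the bagged learners (or an average of class probabilities). Here is a hand-rolled sketch analogous to bag_rpart() above - illustrative, not the bag() internals; bag_rpart_class() is a hypothetical helper:

# Manual classification bagging: majority vote across bootstrapped trees
bag_rpart_class <- function(dat.train, dat.test, k = 20, maxdepth = 10) {
  votes <- sapply(seq_len(k), function(i) {
    idx <- sample(nrow(dat.train), replace = TRUE)      # bootstrap resample
    fit <- rpart(Class ~ ., data = dat.train[idx, ],
                 control = rpart.control(maxdepth = maxdepth))
    as.character(predict(fit, newdata = dat.test, type = "class"))
  })
  apply(votes, 1, function(v) names(which.max(table(v))))  # majority vote
}

table(Estimated = bag_rpart_class(sonar.train, sonar.test),
      Reference = sonar.test$Class)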