New Package for Ensembling R Models

March 13, 2013 • By Zach Deane-Mayer

I’ve written a new R package called caretEnsemble for creating ensembles of caret models in R. It currently works well for regression models, and I’ve written some preliminary support for binary classification models.

At this point, I’ve got 2 different algorithms for combining models:

Greedy stepwise ensembles (returns a weight for each model)
Stacks of caret models

(You can also manually specify weights for a greedy ensemble)

The greedy algorithm is based on the work of Caruana et al., 2004, and inspired by the medley package on github. The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.

All the models in the ensemble must use the same training/test folds. Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here’s a brief script demonstrating how to use the package:

#Setup
rm(list = ls(all = TRUE)) 
gc(reset=TRUE)
#>          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 486269   26    1039192 55.5         NA   486269   26
#> Vcells 912964    7    8388608 64.0      98304   912964    7
set.seed(42)

#Libraries
library(caret)
library(devtools)
library(caretEnsemble)

#Data
library(mlbench)
data(BostonHousing2)
X <- model.matrix(cmedv~crim+zn+indus+chas+nox+rm+age+dis+
                    rad+tax+ptratio+b+lstat+lat+lon, BostonHousing2)[,-1]
X <- data.frame(X)
Y <- BostonHousing2$cmedv

#Split train/test
train <- runif(nrow(X)) <= .66

#Setup CV Folds
#returnData=FALSE saves some space
folds=5
repeats=1
myControl <- trainControl(method='cv', number=folds, repeats=repeats, returnResamp='none', 
                          returnData=FALSE, savePredictions=TRUE, 
                          verboseIter=FALSE, allowParallel=TRUE,
                          index=createMultiFolds(Y[train], k=folds, times=repeats))
PP <- c('center', 'scale')

#Train some models
model1 <- train(X[train,], Y[train], method='gbm', verbose=FALSE, trControl=myControl,
                tuneGrid=expand.grid(.n.trees=500, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode=10))
model2 <- train(X[train,], Y[train], method='blackboost', trControl=myControl)
model3 <- train(X[train,], Y[train], method='parRF', trControl=myControl)
model4 <- train(X[train,], Y[train], method='mlpWeightDecay', trControl=myControl, trace=FALSE, preProcess=PP)
model5 <- train(X[train,], Y[train], method='ppr', trControl=myControl, preProcess=PP)
model6 <- train(X[train,], Y[train], method='earth', trControl=myControl, preProcess=PP)
model7 <- train(X[train,], Y[train], method='glm', trControl=myControl, preProcess=PP)
model8 <- train(X[train,], Y[train], method='svmRadial', trControl=myControl, preProcess=PP)
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl, preProcess=PP)
model10 <- train(X[train,], Y[train], method='glmnet', trControl=myControl, preProcess=PP)

#Make a list of all the models
all.models <- list(model1, model2, model3, model4, model5, model6, model7, model8, model9, model10)
names(all.models) <- sapply(all.models, function(x) x$method)
all.models <- as.caretList(all.models)
sort(sapply(all.models, function(x) min(x$results$RMSE)))
#>          parRF            gbm          earth            ppr     blackboost            gam      svmRadial         glmnet 
#>       3.306669       3.478900       3.670235       3.818690       3.888422       3.905597       4.167405       4.921640 
#>            glm mlpWeightDecay 
#>       4.922111       7.377662

#Make a greedy ensemble - currently can only use RMSE
greedy <- caretEnsemble(all.models, iter=1000L)
sort(greedy$weights, decreasing=TRUE)
#> NULL
greedy$error
#>   max_iter    RMSE  Rsquared      MAE    RMSESD RsquaredSD    MAESD
#> 1      100 3.29449 0.8808035 2.154081 0.3982098 0.05101488 0.124263

#Make a linear regression ensemble
linear <- caretStack(all.models, method='glm', trControl=trainControl(method='cv'))
summary(linear$ens_model$finalModel)
#> 
#> Call:
#> NULL
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    -0.84504    0.56565  -1.494 0.136143    
#> gbm            -0.13031    0.16813  -0.775 0.438859    
#> blackboost     -0.32620    0.11365  -2.870 0.004364 ** 
#> parRF           1.04531    0.18591   5.623 3.99e-08 ***
#> mlpWeightDecay -0.02962    0.03125  -0.948 0.343782    
#> ppr             0.16638    0.07715   2.157 0.031759 *  
#> earth           0.42152    0.07696   5.477 8.55e-08 ***
#> glm             9.09473    2.76016   3.295 0.001090 ** 
#> svmRadial       0.20700    0.07664   2.701 0.007269 ** 
#> gam            -0.19904    0.08942  -2.226 0.026686 *  
#> glmnet         -9.21373    2.77226  -3.324 0.000988 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 9.305633)
#> 
#>     Null deviance: 30592.2  on 342  degrees of freedom
#> Residual deviance:  3089.5  on 332  degrees of freedom
#> AIC: 1751.3
#> 
#> Number of Fisher Scoring iterations: 2
linear$error
#>   parameter    RMSE  Rsquared      MAE    RMSESD RsquaredSD     MAESD
#> 1      none 3.12004 0.8949955 2.114804 0.8736797 0.04463439 0.3865752

#Predict for test set:
preds <- predict(all.models, newdata=X[!train,])
preds$ENS_greedy <- predict(greedy, newdata=X[!train,])
preds$ENS_linear <- predict(linear, newdata=X[!train,])
sort(sqrt(colMeans((preds - Y[!train]) ^ 2)))
#>     ENS_linear     ENS_greedy          parRF            gbm      svmRadial            ppr            gam          earth 
#>       2.996924       3.057801       3.186682       3.214948       3.437198       3.551648       3.596338       3.655737 
#>     blackboost         glmnet            glm mlpWeightDecay 
#>       3.804047       4.490592       4.493315       6.965890

Please feel free to submit any comments here or on github. I’d also be happy to include any patches you feel like submitting. In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.

New Package for Ensembling R Models

Ready to ship your next AI feature?