It’s relatively simple to calculate the storage costs of different drive models based on $/TB, but that doesn’t take into account my main source of anxiety: drive failures. I want to be reasonably confident that my backup drives will still be functional in 5 years, especially if I put them under heavy load for writing and recovering data. Fortunately, the cloud storage provider Backblaze has some great data on hard drive reliability. They do a quarterly analysis of this data on their blog, but since I’m a data scientist, I also did my own analysis. Check out the link for all the data and code!

I like 16TB drives a lot. They’re not top-of-the-line in terms of size, so their price is reasonable, and they’ve proven to be extremely reliable. In particular, the Seagate 16TB Exos drive is a good price (I currently see recertified ones on eBay for $165). Look for the model st16000nm001g.

I estimate that the 16TB Seagate Exos drive has a 5-year survival rate of about 95%, which works out to about a 1-in-20 chance of failure after 5 years of heavy use. At $165 to store 16TB, that’s $10.31 per TB. To adjust for 5-year survival, we divide by the survival rate: 10.31/.95 = $10.85 to store 1 TB for 5 years, including the probability of failure. That’s about $2 per TB per year over that period: not bad!

If you’ve got more money to spend, and are highly anxious about drive failures, check out WDC (formerly HGST, formerly Hitachi). WDC makes the top-of-the-line drives in terms of reliability, but they can also be a lot more expensive. The 16TB WDC Ultrastar drive tops my reliability estimates, with a 5-year survival rate of about 98%. This is pretty impressive: a 98% survival rate means a 1-in-50 chance of failure over 5 years! However, this drive costs a lot more than the equivalent Seagate drive. I currently see one on eBay for $209, but you’re more likely to pay something like $280 for a new one. At $209 to store 16TB, that’s $13.06 per TB. To adjust for 5-year survival, we divide by the survival rate: 13.06/.98 = $13.33 to store 1 TB for 5 years, including the probability of failure.
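Since the same arithmetic applies to both drives, it can be wrapped in a tiny helper (the function name is mine; the prices and survival rates are the estimates above):

```
# Survival-adjusted cost per TB: raw $/TB divided by the 5-year survival
# rate, to price in the expected cost of replacing failed drives.
cost_per_tb <- function(price, tb, survival) {
  (price / tb) / survival
}

# Seagate Exos 16TB: $165, ~95% 5-year survival -> ~$10.86/TB over 5 years
cost_per_tb(165, 16, 0.95)
# WDC Ultrastar 16TB: $209, ~98% 5-year survival -> ~$13.33/TB over 5 years
cost_per_tb(209, 16, 0.98)
```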

Seagate makes great drives at a reasonable price, but if you’re willing to pay 25%-50% more, you can get what I think is the most reliable drive in Backblaze’s dataset: WDC model wuh721816ale6l4.

Be careful buying 3TB drives, especially used. The least reliable drive in Backblaze’s data is a 3TB drive, with a 5-year survival rate of less than 20%.

Here’s the raw survival data, for the most reliable drive model at each size category. Note that we only have 2 years of data for the 16TB drives, so we’re extrapolating 5-year survival for this drive class. I’m pretty confident that these drives will continue to be highly reliable, but time will tell.

The x axis is years since installation. The y axis is the % of drives that survived that long. Note that the 18TB line is a very small sample size and I wouldn’t read too much into it yet.
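If you want to reproduce this kind of curve yourself, the standard tool is the Kaplan-Meier estimator, which correctly handles drives that are still alive (censored) alongside failures. Here’s a minimal base-R sketch; the function name and toy data are mine, not Backblaze’s actual schema:

```
# Bare-bones Kaplan-Meier estimator: at each failure time, multiply the
# running survival rate by (1 - failures / drives still at risk).
km_survival <- function(time, failed) {
  fail_times <- sort(unique(time[failed]))
  surv <- 1
  est <- vapply(fail_times, function(t) {
    at_risk <- sum(time >= t)           # drives that survived at least to t
    deaths  <- sum(time == t & failed)  # drives that failed exactly at t
    surv <<- surv * (1 - deaths / at_risk)
    surv
  }, numeric(1))
  data.frame(years = fail_times, survival = est)
}

# Toy data: 4 drives, two failed (at years 1 and 3), two still running
km_survival(time = c(1, 2, 3, 4), failed = c(TRUE, FALSE, TRUE, FALSE))
```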

I first wrote this package when I was active on Kaggle, as model stacks are a common technique for winning Kaggle competitions. I loved using the caret package to cross-validate models in R, and caretEnsemble was born out of all my custom R scripts for stacking caret models. I wrote the first draft of caretEnsemble when I was on an Amtrak train from New York to Boston in 2013, and I’ve been maintaining it ever since.

caretEnsemble 4.0, the first update in years, introduces several new features. The highlight is the reintroduction of the greedy ensembling algorithm. This algorithm, present in version 0.0 but removed from the 1.0 CRAN submission due to complexity, has been reimplemented in a clean, modular way. I’m proud of the new implementation: it will be much more maintainable in the future.

This greedy selection algorithm was first described in Ensemble Selection from Libraries of Models, and it has some nice properties:

1. The weights of the component models are always positive (negative weights can lead to odd results on out-of-sample data).
2. If it is hard to find a good ensemble, the greedy optimizer will simply select the best single model.
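A toy version of the algorithm fits in a few lines of base R. This is just the core idea, not caretEnsemble’s actual implementation: repeatedly add (with replacement) whichever model’s predictions most reduce the ensemble’s RMSE, and report selection frequencies as weights:

```
# Greedy ensemble selection (after Caruana et al., 2004), sketched:
# weights are selection counts, so they are always non-negative, and if
# no blend helps, the single best model is picked over and over.
greedy_select <- function(preds, y, iter = 100L) {
  rmse <- function(p) sqrt(mean((p - y)^2))
  counts <- setNames(numeric(ncol(preds)), colnames(preds))
  total <- numeric(length(y))
  for (i in seq_len(iter)) {
    # Score each candidate: RMSE of the ensemble if this model is added
    scores <- apply(preds, 2, function(p) rmse((total + p) / i))
    best <- which.min(scores)
    total <- total + preds[, best]
    counts[best] <- counts[best] + 1
  }
  counts / iter
}

# Toy data: model m1 tracks the target closely, m2 is pure noise
set.seed(1)
y <- rnorm(100)
preds <- cbind(m1 = y + rnorm(100, sd = 0.1), m2 = rnorm(100))
greedy_select(preds, y)  # m1 gets essentially all the weight
```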

At the end of the day, it’s a simple weighted average of the input models, which turns out to be a robust way to combine models. But if you like complexity, you can also use caretEnsemble to do more complex stacks, using any caret model for the ensemble model. Many winning Kaggle solutions used XGBoost to ensemble a variety of input models.

Version 4.0 adds a lot of new features, including greedy stacks, multiclass support, support for mixed-type ensembles, and transfer learning.

First, you will likely need to install the package from CRAN:

```
install.packages("caretEnsemble", repos = "https://cloud.r-project.org/")
```

For this demo, I’m going to use the AmesHousing and ranger packages, so you should install those too.

```
install.packages(c("AmesHousing", "ranger"), repos = "https://cloud.r-project.org/")
```

Take a quick look at the data:

```
set.seed(42)
dat <- AmesHousing::make_ames()
head(dat)
#> # A tibble: 6 × 81
#> MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour Utilities Lot_Config Land_Slope
#> <fct> <fct> <dbl> <int> <fct> <fct> <fct> <fct> <fct> <fct> <fct>
#> 1 One_Story_1946_an… Resident… 141 31770 Pave No_A… Slightly… Lvl AllPub Corner Gtl
#> 2 One_Story_1946_an… Resident… 80 11622 Pave No_A… Regular Lvl AllPub Inside Gtl
#> 3 One_Story_1946_an… Resident… 81 14267 Pave No_A… Slightly… Lvl AllPub Corner Gtl
#> 4 One_Story_1946_an… Resident… 93 11160 Pave No_A… Regular Lvl AllPub Corner Gtl
#> 5 Two_Story_1946_an… Resident… 74 13830 Pave No_A… Slightly… Lvl AllPub Inside Gtl
#> 6 Two_Story_1946_an… Resident… 78 9978 Pave No_A… Slightly… Lvl AllPub Inside Gtl
#> # ℹ 70 more variables: Neighborhood <fct>, Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#> # Overall_Qual <fct>, Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>, Roof_Style <fct>, Roof_Matl <fct>,
#> # Exterior_1st <fct>, Exterior_2nd <fct>, Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Qual <fct>, Exter_Cond <fct>,
#> # Foundation <fct>, Bsmt_Qual <fct>, Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, BsmtFin_SF_1 <dbl>,
#> # BsmtFin_Type_2 <fct>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>, Total_Bsmt_SF <dbl>, Heating <fct>, Heating_QC <fct>,
#> # Central_Air <fct>, Electrical <fct>, First_Flr_SF <int>, Second_Flr_SF <int>, Low_Qual_Fin_SF <int>,
#> # Gr_Liv_Area <int>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>, Full_Bath <int>, Half_Bath <int>, …
```

To make things a little easier for this demo, we’ll model the log of the sale price, and also center and scale it:

```
dat$Sale_Price <- log(dat$Sale_Price)
dat$Sale_Price <- scale(dat$Sale_Price, center=TRUE, scale=TRUE)
```

Now let’s split it into a training set and a test set:

```
set.seed(780L)
train_index <- caret::createDataPartition(dat$Sale_Price, p = 0.8, list = FALSE)
train_data <- dat[train_index, ]
test_data <- dat[-train_index, ]
```

Now that we have a train/test split, we can use caretList to train a set of base models. We use the ranger package to fit a random forest, which requires no preprocessing and works well out-of-the-box. We also fit a simple linear regression model with some preprocessing steps: removing zero-variance predictors and applying PCA, since zero-variance and correlated predictors can both cause problems for linear models.

These models take a few minutes to train, so grab a cup of coffee while you wait!

```
set.seed(506L)
model_list <- caretEnsemble::caretList(
  Sale_Price ~ .,
  data = train_data,
  methodList = "ranger",
  tuneList = list(caretEnsemble::caretModelSpec(method = "lm", preProcess = c("zv", "pca")))
)
summary(model_list)
#> The following models were ensembled: lm, ranger
#>
#> Model accuracy:
#> model_name metric value sd
#> <char> <char> <num> <num>
#> 1: lm RMSE 0.3624948 0.04756348
#> 2: ranger RMSE 0.3428007 0.05774693
```

When looking at these results, remember that lower is better for RMSE.

Stacking these models is easy:

```
set.seed(961L)
greedy_stack <- caretEnsemble::caretEnsemble(
  model_list,
  tuneLength = 1
)
summary(greedy_stack)
#> The following models were ensembled: lm, ranger
#>
#> Model Importance:
#> lm ranger
#> 0.4476 0.5524
#>
#> Model accuracy:
#> model_name metric value sd
#> <char> <char> <num> <num>
#> 1: ensemble RMSE 0.3199006 0.04130564
#> 2: lm RMSE 0.3624948 0.04756348
#> 3: ranger RMSE 0.3428007 0.05774693
```

You can directly view the weights for the weighted average:

```
print(round(greedy_stack$ens_model$finalModel$model_weights, 3))
#> [,1]
#> lm 0.43
#> ranger 0.57
```

If you want a different stacking model, you can use any model from the caret package. For example, here’s how you would use a glm:

```
set.seed(961L)
glm_stack <- caretEnsemble::caretStack(
  model_list,
  method = "glm",
  tuneLength = 1
)
summary(glm_stack)
#> The following models were ensembled: lm, ranger
#>
#> Model Importance:
#> lm ranger
#> 0.4112 0.5888
#>
#> Model accuracy:
#> model_name metric value sd
#> <char> <char> <num> <num>
#> 1: ensemble RMSE 0.3174498 0.04087683
#> 2: lm RMSE 0.3624948 0.04756348
#> 3: ranger RMSE 0.3428007 0.05774693
```

The glm is less constrained in the weights that it learns:

```
print(round(coef(glm_stack$ens_model$finalModel), 3))
#> (Intercept) lm ranger
#> 0.001 0.413 0.635
```

We can predict on new data with the base models and both stacked ensembles:

```
preds <- data.table::data.table(predict(model_list, newdata = test_data))
preds[,greedy := predict(greedy_stack, newdata = test_data)]
preds[,glm := predict(glm_stack, newdata = test_data)]
```

And then calculate RMSE:

```
rmse <- sapply(preds, function(x) sqrt(mean((x - test_data$Sale_Price)^2)))
print(round(sort(rmse), 3))
#> glm greedy lm ranger
#> 0.311 0.313 0.325 0.342
```

Both ensembles do better on the test data than the base models.

caretEnsemble includes some useful plotting utilities. Use the base plot function to show the error of each model in the stack, as well as the overall error for the stack itself:

```
plot(glm_stack)
```

Note that while the linear model is generally worse and higher variance than the ranger model, including it in the stack both improves the ensemble and lowers the ensemble’s variance.

You can also use autoplot from ggplot2 to plot several diagnostics simultaneously:

```
ggplot2::autoplot(glm_stack, training_data=train_data, xvars=c("Year_Built", "Gr_Liv_Area"))
```

The 5 outliers in the training data with high residuals would be a good place to start investigating this model’s performance.

A Brief Introduction to caretEnsemble walks through the basics of model stacking with caretEnsemble, while Version 4.0 New Features is an overview of all the new features in version 4.0.

You can also check out the source code on GitHub. Make a pull request if you spot any bugs!

All posts prior to this one are from my old blog, Modern Toolmaking, which I no longer maintain. Blogger lost the ability to post links to source code. I’ve moved all of my posts to this new blog, and restored the source code and model outputs to their former glory. In some cases, the old code no longer runs, due to:

- Package updates (I hope to eventually fix these)
- Packages removed from CRAN (these may or may not be fixable)
- Lost data sources and dead links (these may or may not be fixable)

Posts where the code doesn’t work any more are tagged “code-broken,” but I have preserved them for posterity. I did spellcheck some posts and make minor code edits for clarity or to fix compatibility with the modern version of R. I may retroactively add one or two posts from my hiatus period describing some of the other Kaggle competitions I participated in.

In the future, you will see a lot more python on this blog 😃.

Check it out, and let me know what you think! (Submit bug reports and feature requests to the issue tracker.)

If you are using R 2.15, follow these instructions to change your BLAS from the default to vecLib:

```
cd /Library/Frameworks/R.framework/Resources/lib
# for vecLib use
ln -sf libRblas.vecLib.dylib libRblas.dylib
# for R reference BLAS use
ln -sf libRblas.0.dylib libRblas.dylib
```

However, as noted in r-sig-mac, these instructions do not work for R 3.0. You have to directly link to the accelerate framework’s version of vecLib:

```
cd /Library/Frameworks/R.framework/Resources/lib
# for vecLib use
ln -sf /System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib /Library/Frameworks/R.framework/Resources/lib/libRblas.dylib
# for R reference BLAS use
ln -sf libRblas.0.dylib libRblas.dylib
```

Finally, test your new BLAS using this script:

```
Rscript -e 'source("http://r.research.att.com/benchmarks/R-benchmark-25.R")'
```

On my system (a Retina MacBook Pro), the default BLAS takes 141 seconds and vecLib takes 43 seconds, which is a significant speedup. If you plan to use vecLib, note the following warning from the R development team: “Although fast, it is not under our control and may possibly deliver inaccurate results.”

So far, I have not encountered any issues using vecLib, but it’s only been a few hours `:-)`.

UPDATE: you can also install OpenBLAS on a Mac:

```
brew install r
```

Homebrew uses OpenBLAS by default! On my modern M2 Mac, the benchmark script now takes 27 seconds.

```
#Setup
rm(list = ls(all = TRUE))
gc(reset=TRUE)
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 483861 25.9 1032254 55.2 NA 483861 25.9
#> Vcells 915310 7.0 8388608 64.0 98304 915310 7.0
set.seed(1234)
#Libraries
library(caret)
library(caretEnsemble)
#Data
library(mlbench)
dat <- mlbench.xor(500, 2)
X <- data.frame(dat$x)
Y <- factor(ifelse(dat$classes=='1', 'Yes', 'No'))
#Split train/test
train <- runif(nrow(X)) <= .66
#Setup CV Folds
#returnData=FALSE saves some space
folds=5
repeats=1
myControl <- trainControl(method='cv', number=folds, repeats=repeats,
                          returnResamp='none', classProbs=TRUE,
                          returnData=FALSE, savePredictions=TRUE,
                          verboseIter=FALSE, allowParallel=TRUE,
                          summaryFunction=twoClassSummary,
                          index=createMultiFolds(Y[train], k=folds, times=repeats))
PP <- c('center', 'scale')
#Train some models
model1 <- train(X[train,], Y[train], method='gbm', verbose=FALSE, trControl=myControl,
                tuneGrid=expand.grid(.n.trees=500, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode=10))
model2 <- train(X[train,], Y[train], method='blackboost', trControl=myControl)
model3 <- train(X[train,], Y[train], method='parRF', tuneLength=1, trControl=myControl)
model4 <- train(X[train,], Y[train], method='mlpWeightDecay', trControl=myControl, trace=FALSE, preProcess=PP)
model5 <- train(X[train,], Y[train], method='knn', trControl=myControl, preProcess=PP)
model6 <- train(X[train,], Y[train], method='earth', trControl=myControl, preProcess=PP)
model7 <- train(X[train,], Y[train], method='glm', trControl=myControl, preProcess=PP)
model8 <- train(X[train,], Y[train], method='svmRadial', trControl=myControl, preProcess=PP)
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl, preProcess=PP)
model10 <- train(X[train,], Y[train], method='glmnet', trControl=myControl, preProcess=PP)
#Make a list of all the models
all.models <- list(model1, model2, model3, model4, model5, model6, model7, model8, model9, model10)
names(all.models) <- sapply(all.models, function(x) x$method)
all.models <- as.caretList(all.models)
sort(sapply(all.models, function(x) min(x$results$ROC)))
#> mlpWeightDecay glm glmnet gam earth blackboost knn svmRadial
#> 0.3856384 0.4099975 0.4147880 0.4346771 0.4882353 0.5000000 0.9915315 0.9964432
#> parRF gbm
#> 0.9994703 0.9998268
#Make a greedy ensemble - currently can only use RMSE
greedy <- caretEnsemble(all.models, iter=1000L)
print(greedy$ens_model$finalModel$model_weights)
#> No Yes
#> gbm_No 1 0
#> gbm_Yes 0 1
#> blackboost_No 0 0
#> blackboost_Yes 0 0
#> parRF_No 0 0
#> parRF_Yes 0 0
#> mlpWeightDecay_No 0 0
#> mlpWeightDecay_Yes 0 0
#> knn_No 0 0
#> knn_Yes 0 0
#> earth_No 0 0
#> earth_Yes 0 0
#> glm_No 0 0
#> glm_Yes 0 0
#> svmRadial_No 0 0
#> svmRadial_Yes 0 0
#> gam_No 0 0
#> gam_Yes 0 0
#> glmnet_No 0 0
#> glmnet_Yes 0 0
greedy$error
#> max_iter ROC Sens Spec ROCSD SensSD SpecSD
#> 1 100 0.9998217 0.9884034 0.9878788 0.0003985861 0.01588213 0.01659765
#Make a linear regression ensemble
linear <- caretStack(all.models, method='glm')
print(round(coef(linear$ens_model$finalModel), 2))
#> (Intercept) gbm blackboost parRF mlpWeightDecay knn earth glm
#> 64968.70 642.80 -92459.55 6.92 218.60 481.55 -35355.58 13868.35
#> svmRadial gam glmnet
#> 10.76 -1110.63 -19911.04
linear$error
#> parameter ROC Sens Spec ROCSD SensSD SpecSD
#> 1 none 0.9806543 0.9768067 0.9575758 0.011322 0.02400921 0.02710385
#Predict for test set:
library(caTools)
preds <- predict(all.models, newdata=X[!train,])
preds$ENS_greedy <- predict(greedy, newdata=X[!train,])[,'Yes']
preds$ENS_linear <- predict(linear, newdata=X[!train,])[,'Yes']
sort(colAUC(preds, Y[!train])[1,])
#> blackboost earth gam glmnet glm knn mlpWeightDecay svmRadial
#> 0.5000000 0.5000000 0.5490459 0.5638044 0.5642516 0.9950805 0.9992546 0.9997018
#> gbm parRF ENS_greedy ENS_linear
#> 1.0000000 1.0000000 1.0000000 1.0000000
```

Right now, this code fails for me if I try a model like a nnet or an SVM for stacking, so there are clearly bugs to fix.

The greedy model relies 100% on the gbm, which makes sense as the gbm has an AUC of 1 on the training set. The linear model uses all of the models, and achieves an AUC of 1, but it contains some very large, negative weights, which makes it a harder model to understand.

At this point, I’ve got two different algorithms for combining models:

- Greedy stepwise ensembles (returns a weight for each model)
- Stacks of caret models

(You can also manually specify weights for a greedy ensemble.)

The greedy algorithm is based on the work of Caruana et al., 2004, and inspired by the medley package on GitHub. The stacking algorithm simply builds a second caret model on top of the existing models (using their predictions as input), and employs all of the flexibility of the caret package.
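The stacking half can be illustrated without any of the caret machinery. The toy data and model below are mine; the point is just that the second-level model treats the base models’ out-of-fold predictions as ordinary features:

```
# Minimal stacking sketch: regress the target on the base models'
# out-of-fold predictions. caretStack does this with a caret model;
# here a plain lm() stands in for the second-level learner.
set.seed(42)
y <- rnorm(200)
oof_preds <- data.frame(
  m1 = y + rnorm(200, sd = 0.3),  # pretend out-of-fold predictions, good model
  m2 = y + rnorm(200, sd = 0.9)   # pretend out-of-fold predictions, weak model
)
stack <- lm(y ~ m1 + m2, data = data.frame(y = y, oof_preds))
coef(stack)  # unconstrained blending weights, unlike the greedy ensemble
```

Because lm places no constraints on the coefficients, the stack can learn negative or very large weights, which is exactly the behavior the greedy algorithm avoids.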

All the models in the ensemble must use the same training/test folds. Both algorithms use the out-of-sample predictions to find the weights and train the stack. Here’s a brief script demonstrating how to use the package:

```
#Setup
rm(list = ls(all = TRUE))
gc(reset=TRUE)
#> used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
#> Ncells 483922 25.9 1032428 55.2 NA 483922 25.9
#> Vcells 915717 7.0 8388608 64.0 98304 915717 7.0
set.seed(42)
#Libraries
library(caret)
library(devtools)
library(caretEnsemble)
#Data
library(mlbench)
data(BostonHousing2)
X <- model.matrix(cmedv~crim+zn+indus+chas+nox+rm+age+dis+
                    rad+tax+ptratio+b+lstat+lat+lon, BostonHousing2)[,-1]
X <- data.frame(X)
Y <- BostonHousing2$cmedv
#Split train/test
train <- runif(nrow(X)) <= .66
#Setup CV Folds
#returnData=FALSE saves some space
folds=5
repeats=1
myControl <- trainControl(method='cv', number=folds, repeats=repeats, returnResamp='none',
                          returnData=FALSE, savePredictions=TRUE,
                          verboseIter=FALSE, allowParallel=TRUE,
                          index=createMultiFolds(Y[train], k=folds, times=repeats))
PP <- c('center', 'scale')
#Train some models
model1 <- train(X[train,], Y[train], method='gbm', verbose=FALSE, trControl=myControl,
                tuneGrid=expand.grid(.n.trees=500, .interaction.depth=15, .shrinkage = 0.01, .n.minobsinnode=10))
model2 <- train(X[train,], Y[train], method='blackboost', trControl=myControl)
model3 <- train(X[train,], Y[train], method='parRF', trControl=myControl)
model4 <- train(X[train,], Y[train], method='mlpWeightDecay', trControl=myControl, trace=FALSE, preProcess=PP)
model5 <- train(X[train,], Y[train], method='ppr', trControl=myControl, preProcess=PP)
model6 <- train(X[train,], Y[train], method='earth', trControl=myControl, preProcess=PP)
model7 <- train(X[train,], Y[train], method='glm', trControl=myControl, preProcess=PP)
model8 <- train(X[train,], Y[train], method='svmRadial', trControl=myControl, preProcess=PP)
model9 <- train(X[train,], Y[train], method='gam', trControl=myControl, preProcess=PP)
model10 <- train(X[train,], Y[train], method='glmnet', trControl=myControl, preProcess=PP)
#Make a list of all the models
all.models <- list(model1, model2, model3, model4, model5, model6, model7, model8, model9, model10)
names(all.models) <- sapply(all.models, function(x) x$method)
all.models <- as.caretList(all.models)
sort(sapply(all.models, function(x) min(x$results$RMSE)))
#> parRF gbm earth ppr blackboost gam svmRadial glmnet
#> 3.306669 3.478900 3.670235 3.818690 3.888422 3.905597 4.167405 4.921640
#> glm mlpWeightDecay
#> 4.922111 7.377662
#Make a greedy ensemble - currently can only use RMSE
greedy <- caretEnsemble(all.models, iter=1000L)
sort(greedy$weights, decreasing=TRUE)
#> NULL
greedy$error
#> max_iter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
#> 1 100 3.29449 0.8808035 2.154081 0.3982098 0.05101488 0.124263
#Make a linear regression ensemble
linear <- caretStack(all.models, method='glm', trControl=trainControl(method='cv'))
summary(linear$ens_model$finalModel)
#>
#> Call:
#> NULL
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.84504 0.56565 -1.494 0.136143
#> gbm -0.13031 0.16813 -0.775 0.438859
#> blackboost -0.32620 0.11365 -2.870 0.004364 **
#> parRF 1.04531 0.18591 5.623 3.99e-08 ***
#> mlpWeightDecay -0.02962 0.03125 -0.948 0.343782
#> ppr 0.16638 0.07715 2.157 0.031759 *
#> earth 0.42152 0.07696 5.477 8.55e-08 ***
#> glm 9.09473 2.76016 3.295 0.001090 **
#> svmRadial 0.20700 0.07664 2.701 0.007269 **
#> gam -0.19904 0.08942 -2.226 0.026686 *
#> glmnet -9.21373 2.77226 -3.324 0.000988 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for gaussian family taken to be 9.305633)
#>
#> Null deviance: 30592.2 on 342 degrees of freedom
#> Residual deviance: 3089.5 on 332 degrees of freedom
#> AIC: 1751.3
#>
#> Number of Fisher Scoring iterations: 2
linear$error
#> parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
#> 1 none 3.12004 0.8949955 2.114804 0.8736797 0.04463439 0.3865752
#Predict for test set:
preds <- predict(all.models, newdata=X[!train,])
preds$ENS_greedy <- predict(greedy, newdata=X[!train,])
preds$ENS_linear <- predict(linear, newdata=X[!train,])
sort(sqrt(colMeans((preds - Y[!train]) ^ 2)))
#> ENS_linear ENS_greedy parRF gbm svmRadial ppr gam earth
#> 2.996924 3.057801 3.186682 3.214948 3.437198 3.551648 3.596338 3.655737
#> blackboost glmnet glm mlpWeightDecay
#> 3.804047 4.490592 4.493315 6.965890
```

Please feel free to submit any comments here or on github. I’d also be happy to include any patches you feel like submitting. In particular, I could use some help writing support for multi-class models, writing more tests, and fixing bugs.

The caret package for R now supports time series cross-validation! (Look for version 5.15-052 in the news file). You can use the createTimeSlices function to do time-series cross-validation with a fixed window, as well as a growing window. This function generates a list of indexes for the training set, as well as a list of indexes for the test set, which you can then pass to the trainControl object.
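To see what createTimeSlices produces, here’s a base-R sketch of the fixed-window case (the helper name is mine; caret’s function also supports growing windows and a few other options):

```
# Fixed-window time slices: each training window is initial_window points
# long, and the test set is the following horizon points; the whole
# window then slides forward one step at a time.
time_slices <- function(n, initial_window, horizon) {
  starts <- seq_len(n - initial_window - horizon + 1)
  list(
    train = lapply(starts, function(s) s:(s + initial_window - 1)),
    test  = lapply(starts, function(s) (s + initial_window):(s + initial_window + horizon - 1))
  )
}

slices <- time_slices(n = 10, initial_window = 5, horizon = 2)
slices$train[[1]]  # 1 2 3 4 5
slices$test[[1]]   # 6 7
slices$train[[2]]  # 2 3 4 5 6 (the window slides forward one step)
```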

Caret does not currently support univariate time series models (like arima, auto.arima and ets), but perhaps that functionality is coming in the future? I’d also love to see someone write a timeSeriesSummary function for caret that calculates error at each horizon in the test set and a createTimeResamples function, perhaps using the Maximum Entropy Bootstrap.

Here’s a quick demo of how you might use this new functionality:

```
#Download S&P 500 data, adjust, and convert to monthly
set.seed(42)
library(quantmod)
getSymbols('^GSPC', from='1990-01-01')
#> [1] "GSPC"
GSPC <- adjustOHLC(GSPC, symbol.name='^GSPC')
GSPC <- to.monthly(GSPC, indexAt='lastof')
Target <- ClCl(GSPC)
#Calculate some co-variates
periods <- c(3, 6, 9, 12)
Lags <- data.frame(lapply(c(1:2, periods), function(x) Lag(Target, x)))
EMAs <- data.frame(lapply(periods, function(x) {
  out <- EMA(Target, x)
  names(out) <- paste('EMA', x, sep='.')
  return(out)
}))
RSIs <- data.frame(lapply(periods, function(x) {
  out <- RSI(Cl(GSPC), x)
  names(out) <- paste('RSI', x, sep='.')
  return(out)
}))
DVIs <- data.frame(lapply(periods, function(x) {
  out <- DVI(Cl(GSPC), x)
  out <- out$dvi
  names(out) <- paste('DVI', x, sep='.')
  return(out)
}))
dat <- data.frame(Next(Target), Lags, EMAs, RSIs, DVIs)
dat <- na.omit(dat)
#Custom Summary Function
mySummary <- function (data, lev = NULL, model = NULL) {
  positions <- sign(data[, "pred"])
  trades <- abs(c(1,diff(positions)))
  profits <- positions*data[, "obs"] + trades*0.01
  profit <- prod(1+profits)
  names(profit) <- 'profit'
  return(profit)
}
#Fit a model
library(caret)
model <- train(dat[,-1], dat[,1], method='rpart',
               metric='profit', maximize=TRUE,
               trControl=trainControl(
                 method='timeslice',
                 initialWindow=12, fixedWindow=TRUE,
                 horizon=12, summaryFunction=mySummary,
                 verboseIter=FALSE))
model
#> CART
#>
#> 300 samples
#> 18 predictor
#>
#> No pre-processing
#> Resampling: Rolling Forecasting Origin Resampling (12 held-out with a fixed window)
#> Summary of sample sizes: 12, 12, 12, 12, 12, 12, ...
#> Resampling results across tuning parameters:
#>
#> cp profit
#> 0.02432977 1.068372
#> 0.02449800 1.068372
#> 0.06424035 1.068372
#>
#> profit was used to select the optimal model using the largest value.
#> The final value used for the model was cp = 0.06424035.
```

```
#Multi-Class Summary Function
#Based on caret::twoClassSummary
require(compiler)
multiClassSummary <- cmpfun(function (data, lev = NULL, model = NULL){
  #Load Libraries
  require(Metrics)
  require(caret)
  #Check data
  if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
    stop("levels of observed and predicted data do not match")
  #Calculate custom one-vs-all stats for each class
  prob_stats <- lapply(levels(data[, "pred"]), function(class){
    #Grab one-vs-all data for the class
    pred <- ifelse(data[, "pred"] == class, 1, 0)
    obs <- ifelse(data[, "obs"] == class, 1, 0)
    prob <- data[,class]
    #Calculate one-vs-all AUC and logLoss and return
    cap_prob <- pmin(pmax(prob, .000001), .999999)
    prob_stats <- c(auc(obs, prob), logLoss(obs, cap_prob))
    names(prob_stats) <- c('ROC', 'logLoss')
    return(prob_stats)
  })
  prob_stats <- do.call(rbind, prob_stats)
  rownames(prob_stats) <- paste('Class:', levels(data[, "pred"]))
  #Calculate confusion matrix-based statistics
  CM <- confusionMatrix(data[, "pred"], data[, "obs"])
  #Aggregate and average class-wise stats
  #Todo: add weights
  class_stats <- cbind(CM$byClass, prob_stats)
  class_stats <- colMeans(class_stats)
  #Aggregate overall stats
  overall_stats <- c(CM$overall)
  #Combine overall with class-wise stats and remove some stats we don't want
  stats <- c(overall_stats, class_stats)
  stats <- stats[! names(stats) %in% c('AccuracyNull',
                                       'Prevalence', 'Detection Prevalence')]
  #Clean names and return
  names(stats) <- gsub('[[:blank:]]+', '_', names(stats))
  return(stats)
})
```

This function was prompted by a question on Cross Validated, asking what the optimal value of k is for a knn model fit to the iris dataset. I wanted to look at statistics besides accuracy and kappa, so I wrote a wrapper function around caret::confusionMatrix and the auc and logLoss functions from the Metrics package. Use the following code to fit a knn model to the iris dataset:

```
library(caret)
set.seed(19556)
model <- train(
  Species~.,
  data=iris,
  method='knn',
  tuneGrid=expand.grid(.k=1:30),
  metric='Accuracy',
  trControl=trainControl(
    method='repeatedcv',
    number=10,
    repeats=15,
    classProbs=TRUE,
    summaryFunction=multiClassSummary))
```

This demonstrates that, depending on what metric you use, you will end up with a different model. For example, Accuracy seems to peak around 17, while AUC and logLoss seem to peak around 6:

```
# All possible metrics:
# c('Accuracy', 'Kappa', 'AccuracyLower', 'AccuracyUpper', 'AccuracyPValue',
# 'Sensitivity', 'Specificity', 'Pos_Pred_Value',
# 'Neg_Pred_Value', 'Detection_Rate', 'ROC', 'logLoss')
print(plot(model, metric='Accuracy'))
```

```
print(plot(model, metric='ROC'))
```

```
print(plot(model, metric='logLoss'))
```

You can also increase the number of cross-validation repeats, or use a different method of re-sampling, such as bootstrap re-sampling.

I finally got around to publishing my time series cross-validation package to GitHub, and I plan to push it out to CRAN shortly:

```
devtools::install_github("zachmayer/cv.ts")
```

Then run the following script to check it out:

```
library(forecast)
library(cv.ts)
set.seed(42)
#Download S&P 500 data and adjust from splits/dividends
library(quantmod)
getSymbols('^GSPC', from='1990-01-01')
#> [1] "GSPC"
GSPC <- adjustOHLC(GSPC, symbol.name='^GSPC')
#Calculate monthly returns
GSPC <- to.monthly(GSPC, indexAt='lastof')
GSPC <- Cl(GSPC)
#Convert from xts to ts
GSPC <- ts(GSPC, start=c(1990,1), frequency=12)
#Define cross validation parameters
myControl <- tseriesControl(
  minObs=60,
  stepSize=1,
  maxHorizon=12,
  fixedWindow=TRUE,
  preProcess=FALSE,
  ppMethod='guerrero',
  summaryFunc=tsSummary
)
#Forecast using several models
result_naive <- cv.ts(GSPC, naiveForecast, myControl, progress=FALSE)
myControl$preProcess <- TRUE
result_autoarima <- cv.ts(GSPC, auto.arimaForecast, myControl, ic='bic', progress=FALSE)
result_ets <- cv.ts(GSPC, etsForecast, myControl, ic='bic', progress=FALSE)
```

```
library(reshape2)
library(ggplot2)
plotData <- data.frame(
  horizon=1:12
  ,naive=result_naive$results$MAPE[1:12]
  ,arima=result_autoarima$results$MAPE[1:12]
  ,ets=result_ets$results$MAPE[1:12]
)
plotData <- melt(plotData, id.vars='horizon', value.name='MAPE', variable.name='model')
print(ggplot(plotData, aes(horizon, MAPE, color=model)) + geom_line())
```

Forecasting equities prices is hard!
