We can think of boosting as iteratively reweighted sampling: we build a sequence of models, giving higher sampling weights to the observations that earlier models scored incorrectly. Our simple boosting algorithm is as follows:
1. Build a model on a weighted sample of points.
2. Check which observations it scores incorrectly/correctly, and multiply their weights by 110%/90% respectively.
3. Repeat from step 1.
Then we take a majority vote over the models' predictions to get the final classification. This should improve on a single model, at least on the data it was trained on.
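To make the reweighting rule concrete, here is a tiny sketch of one update step; the weights and scoring outcomes are invented purely for illustration:
# toy sketch of the reweighting step (made-up outcomes)
w <- rep(1/5, 5)                              # start with uniform weights
correct <- c(TRUE, TRUE, FALSE, TRUE, FALSE)  # hypothetical scoring results
w <- w * ifelse(correct, 0.9, 1.1)            # 90% if right, 110% if wrong
w <- w / sum(w)                               # renormalise so the weights sum to 1
round(w, 3)                                   # rows 3 and 5 now carry more weight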
As a quick example, below is some R code which demonstrates this:
#' simple boosting algorithm
#' (caret is also used below, via caret::confusionMatrix)
library(mlbench)
library(rpart)
#' let's work off the PimaIndiansDiabetes dataset
data(PimaIndiansDiabetes)
simple_model <- rpart(diabetes ~ ., data = PimaIndiansDiabetes)
#' Boosting is essentially repeated, weighted resampling
#' of the training data. In this case we will
#' combine 10 rpart models to create
#' a naive boosted tree ensemble...
#' normalises a vector so that its sum is 1
normalise <- function(vec) {
  vec / sum(vec)
}
#' returns the modal (most frequent) value of a vector; from stackoverflow
getMode <- function(InVec) {
  names(which.max(table(InVec)))
}
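As a quick sanity check (my addition, not part of the original snippet), getMode simply returns the most frequent level:
getMode(c("neg", "pos", "neg"))  # returns "neg"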
#' majority-vote prediction across a list of models
predict.boost <- function(modelList, data) {
  # one column of class predictions per model
  model_pred_all <- as.data.frame(lapply(modelList, function(mod) {
    predict(mod, data, type = "class")
  }))
  # for each observation (row), return the most common prediction
  return(apply(model_pred_all, 1, getMode))
}
n <- 10
index_weight <- normalise(rep(1, nrow(PimaIndiansDiabetes)))
mod_list <- list()
for (i in 1:n) {
  # draw a weighted sample (with replacement, twice the data size)
  samp_idx <- sample(seq_len(nrow(PimaIndiansDiabetes)),
                     nrow(PimaIndiansDiabetes) * 2,
                     prob = index_weight, replace = TRUE)
  new_model <- rpart(diabetes ~ ., data = PimaIndiansDiabetes[samp_idx, ])
  mod_list[[i]] <- new_model
  # score the sampled observations once, then reweight them:
  # down-weight the ones this model got right (x0.9),
  # up-weight the ones it got wrong (x1.1)
  reweight_idx <- unique(samp_idx)
  new_pred <- predict(new_model, PimaIndiansDiabetes[reweight_idx, ], type = "class")
  correct <- new_pred == PimaIndiansDiabetes[reweight_idx, "diabetes"]
  index_weight[reweight_idx] <- index_weight[reweight_idx] * ifelse(correct, 0.9, 1.1)
  index_weight <- normalise(index_weight)
}
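If you want to see the reweighting at work, you can peek at the final weights; the observations the successive trees kept misclassifying should have drifted to the top (this inspection is my addition, not part of the original code):
# indices of the observations with the largest sampling weights -
# i.e. the rows the trees struggled with most
head(order(index_weight, decreasing = TRUE))
summary(index_weight)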
boost_pred <- predict.boost(mod_list, PimaIndiansDiabetes)
# compare in-sample performance of the single tree vs the boosted ensemble...
caret::confusionMatrix(PimaIndiansDiabetes$diabetes, predict(simple_model, PimaIndiansDiabetes, type="class"))
caret::confusionMatrix(PimaIndiansDiabetes$diabetes, boost_pred)
With the output:
> caret::confusionMatrix(PimaIndiansDiabetes$diabetes, predict(simple_model, PimaIndiansDiabetes, type="class"))
Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 449  51
       pos  72 196

               Accuracy : 0.8398
                 95% CI : (0.812, 0.8651)
    No Information Rate : 0.6784
    P-Value [Acc > NIR] : < 2e-16

                  Kappa : 0.641

 Mcnemar's Test P-Value : 0.07133

            Sensitivity : 0.8618
            Specificity : 0.7935
         Pos Pred Value : 0.8980
         Neg Pred Value : 0.7313
             Prevalence : 0.6784
         Detection Rate : 0.5846
   Detection Prevalence : 0.6510
      Balanced Accuracy : 0.8277

       'Positive' Class : neg
> caret::confusionMatrix(PimaIndiansDiabetes$diabetes, boost_pred)
Confusion Matrix and Statistics

          Reference
Prediction neg pos
       neg 473  27
       pos  65 203

               Accuracy : 0.8802
                 95% CI : (0.8551, 0.9023)
    No Information Rate : 0.7005
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.7274

 Mcnemar's Test P-Value : 0.0001145

            Sensitivity : 0.8792
            Specificity : 0.8826
         Pos Pred Value : 0.9460
         Neg Pred Value : 0.7575
             Prevalence : 0.7005
         Detection Rate : 0.6159
   Detection Prevalence : 0.6510
      Balanced Accuracy : 0.8809

       'Positive' Class : neg
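One caveat: both confusion matrices above score the models on the same data they were trained on, so part of the boosted ensemble's gain may simply be deeper fitting. A fairer check would retrain on a training portion and evaluate on held-out rows; below is a minimal sketch of that, assuming a 70/30 random split and reusing the functions defined earlier (the split, seed, and variable names here are my own, not from the original post):
# hedged sketch: out-of-sample comparison under an assumed 70/30 split
set.seed(42)
train_idx <- sample(seq_len(nrow(PimaIndiansDiabetes)),
                    floor(0.7 * nrow(PimaIndiansDiabetes)))
train <- PimaIndiansDiabetes[train_idx, ]
test  <- PimaIndiansDiabetes[-train_idx, ]

w <- normalise(rep(1, nrow(train)))
mods <- list()
for (i in 1:10) {
  idx <- sample(seq_len(nrow(train)), nrow(train) * 2, prob = w, replace = TRUE)
  mods[[i]] <- rpart(diabetes ~ ., data = train[idx, ])
  ri <- unique(idx)
  ok <- predict(mods[[i]], train[ri, ], type = "class") == train[ri, "diabetes"]
  w[ri] <- w[ri] * ifelse(ok, 0.9, 1.1)
  w <- normalise(w)
}
# note the argument order here: predictions first, reference second
caret::confusionMatrix(factor(predict.boost(mods, test),
                              levels = levels(test$diabetes)),
                       test$diabetes)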