How can you build an auto-modeller?
Off the back of my first assignment (which we have been given permission to share), I thought I'd begin generalising some of my code so that I could apply a bunch of classifiers to a data set and produce graphs and summaries, hopefully enough to make an informed decision on what action to take next.
The goal of such a script isn't to remove all human knowledge (domain knowledge used to build orthogonal features will still beat naively applying models), but if it can even reach 50% of what a human can do, that is valuable information when estimating the amount of effort needed, or even the feasibility of the problem at hand.
See here for the project repository.
Building an Automatic Classifier
Essentially, the idea is to use the caret library to train a bunch of models. We have to make sure the models are comparable so that we can then choose the "best" one. Luckily we can achieve this with minimal code.
Stratified Sampling
The first important step is stratified sampling. This keeps the random train/test split "fair" with respect to the class proportions. It certainly isn't the only way to split the data, and it is admittedly quite limited, but it is a sensible default.
The function below shows the strength of the dplyr library through the group_by_ function (the standard-evaluation variant of group_by).
stratify <- function(data, response, frac = .5, seed = 10, ...) {
  set.seed(seed)
  # Track original row positions explicitly; the row names returned by
  # sample_frac() are not a reliable index into the original data.
  data$.row_id <- seq_len(nrow(data))
  if (is.factor(data[[response]])) {
    # Sample within each level of the response so class proportions are preserved
    train <- data %>% group_by_(response) %>% sample_frac(frac) %>% ungroup()
  } else {
    train <- sample_frac(data, frac)
  }
  test <- data[-train$.row_id, ]
  train$.row_id <- NULL
  test$.row_id <- NULL
  return(list(train = as.data.frame(train), test = as.data.frame(test)))
}
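For example, on the built-in iris data (with dplyr loaded), a stratified 50/50 split keeps the class balance in both halves:

splits <- stratify(iris, "Species", frac = .5)
table(splits$train$Species)   # 25 of each species in the training half
dim(splits$test)              # the remaining 75 rows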
Training all our models
To train, we can simply apply caret::train and iterate over the methods to see which model performs best.
train_caret <- function(train, response,
                        methods = c("rpart", "nnet", "svmRadial", "gam")) {
  # Fix the resampling folds up front so every method is trained and
  # assessed on identical splits; createFolds(..., returnTrain = TRUE)
  # returns the training indices that trainControl's `index` expects.
  control <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                          index = createFolds(train[[response]], k = 5,
                                              returnTrain = TRUE))
  models <- Map(function(method) {
    # A method that errors on this data set simply comes back as NULL
    tryCatch(
      train(as.formula(paste(response, "~ .")), data = train,
            method = method, trControl = control),
      error = function(e) NULL)
  }, methods)
  return(models)
}
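Continuing the iris example from the stratify section (a sketch only; the exact methods that succeed will depend on the data set), a call looks like this:

models <- train_caret(splits$train, "Species")
names(models)   # "rpart" "nnet" "svmRadial" "gam"; any failed fits are NULL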
Confusion Matrix
In a similar way to the train_caret function, we can build and examine the confusion matrix for each of the models on the test set.
confusion_matrix <- function(test, response, models) {
  # Drop any models that failed to train (NULL entries from train_caret)
  models <- Filter(Negate(is.null), models)
  return(Map(function(model) {
    # Predict on the hold-out predictors and compare with the true labels
    fit_test <- predict(model, newdata = test[, names(test) != response])
    confusionMatrix(fit_test, test[[response]])
  }, models))
}
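Putting the pieces together (again only a sketch on iris, continuing the objects from the examples above), we can score each surviving model on the hold-out set and keep the most accurate one; cm$overall["Accuracy"] is the overall accuracy reported by caret::confusionMatrix:

cms <- confusion_matrix(splits$test, "Species", models)

# Hold-out accuracy per model, then pick the winner
accuracy <- sapply(cms, function(cm) unname(cm$overall["Accuracy"]))
best_model <- models[[names(which.max(accuracy))]]
accuracy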
Where to now?
There is still much work to do, although this is a good start. Here are some of the things worth working through in the future when there is more time:
- Preprocessing: when would it be better to do this? (A small sketch of one option follows this list.)
- Feature selection: some algorithms perform better with altered or transformed features. How can this be incorporated?
- Postprocessing: following on from preprocessing, is there anything we can do in that regard?
- Improving the output document to become more polished: see the output below; we could produce more tailored commentary based on what was observed. This would probably have to be built with tools other than R.
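On the preprocessing point above: caret::train can centre and scale the predictors (or apply PCA) inside each resampling iteration via its preProcess argument, so one option is simply to thread such an argument through train_caret. A minimal sketch, not yet wired into the function:

# Sketch: let caret re-estimate centring/scaling within each fold
model <- train(Species ~ ., data = splits$train, method = "svmRadial",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 5))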
Sample Usage
To see some sample usage, below is the R Markdown document produced by my automatic modeller tool; all in roughly 30 LOC! The document is modelled after The Automatic Statistician.
This is my attempt at building an automatic classifier.
Abstract
This is an automatic report for the
data(iris)
response <- "Species"
data set. This approach to automated classification involves heavy use of the caret library, with cross-validation on half of the data to determine the optimal classifier. The optimal model of each class is chosen via caret::train, which in turn is compared against the other models. We shall choose the model with the best performance against our hold-out sample.
1 Brief description of the data set
To confirm that I have interpreted the data correctly a short summary of the data set follows. The target of the classification is:
## [1] "Species"
The dimensions of the data set are:
## [1] 150 5
A summary of these variables is given below:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
2 Summary of the model construction
I have compared a number of different model classes. The performance in the cross validation scheme is shown below:
## $rpart
##
## $nnet
##
## $svmRadial
##
## $gam
3 Model Description
In this section I have described the model I have constructed to explain the data. A quick summary is below, followed by the plots of the estimated Accuracy and Kappa coefficient for all the models.
bwplot(resamples(models))
INSERT VARIOUS CONDITIONS FOR SPECIFIC MODELS?
4 Model Criticism
In this section I have attempted to falsify the model I presented above to understand what aspects of the data it is not capturing well.
INSERT SOMETHING GENERIC