How can you build an auto-modeller?
Off the back of my first assignment (which we have been given permission to share), I thought I'd begin generalising some of my code so that I could apply a bunch of classifiers to a data set and produce graphs and summaries, hopefully enough to make an informed decision on what action to take next.
The goal of such a script isn't to remove all human knowledge (domain knowledge used to build orthogonal features will still beat naively applying models), but if it can even reach 50% of what a human can do, that is valuable information when estimating the amount of effort needed, or even the feasibility of the problem at hand.
See here for the project repository.
Building an Automatic Classifier
Essentially, the idea is to use the caret library to train a bunch of models. We have to make sure the models are comparable so that we can then choose the "best" one. Luckily we can achieve this with minimal code.
Stratified Sampling
The first important step is stratified sampling. This keeps the random train/test split "fair" with respect to the class proportions. It certainly isn't the only way to split the data, and it is admittedly quite limited, but it is a sensible default.
The function below shows the strength of the dplyr library through the group_by_ function (the standard-evaluation variant of group_by).
stratify <- function(data, response, frac = .5, seed = 10, ...) {
  set.seed(seed)
  # Track original row positions explicitly; the row names returned by
  # sample_frac() are not a reliable index into the original data.
  data$.row_id <- seq_len(nrow(data))
  if (is.factor(data[[response]])) {
    # Sample within each level of the response so class proportions are preserved
    train <- data %>% group_by_(response) %>% sample_frac(frac) %>% ungroup()
  } else {
    train <- sample_frac(data, frac)
  }
  test <- data[-train$.row_id, ]
  train$.row_id <- NULL
  test$.row_id <- NULL
  return(list(train = as.data.frame(train), test = as.data.frame(test)))
}
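For example, on the built-in iris data (with dplyr loaded), a stratified 50/50 split keeps the class balance in both halves:

splits <- stratify(iris, "Species", frac = .5)
table(splits$train$Species)   # 25 of each species in the training half
dim(splits$test)              # the remaining 75 rows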
Training all our models
To train, we can simply apply caret::train and iterate over the methods to see which model performs best.
train_caret <- function(train, response,
                        methods = c("rpart", "nnet", "svmRadial", "gam")) {
  # Fix the resampling folds up front so every method is trained and
  # assessed on identical splits; createFolds(..., returnTrain = TRUE)
  # returns the training indices that trainControl's `index` expects.
  control <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                          index = createFolds(train[[response]], k = 5,
                                              returnTrain = TRUE))
  models <- Map(function(method) {
    # A method that errors on this data set simply comes back as NULL
    tryCatch(
      train(as.formula(paste(response, "~ .")), data = train,
            method = method, trControl = control),
      error = function(e) NULL)
  }, methods)
  return(models)
}
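Continuing the iris example from the stratify section (a sketch only; the exact methods that succeed will depend on the data set), a call looks like this:

models <- train_caret(splits$train, "Species")
names(models)   # "rpart" "nnet" "svmRadial" "gam"; any failed fits are NULL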
Confusion Matrix
In a similar way to the train_caret function, we can build and examine the confusion matrix for each of the models on the test set.
confusion_matrix <- function(test, response, models) {
  # Drop any models that failed to train (NULL entries from train_caret)
  models <- Filter(Negate(is.null), models)
  return(Map(function(model) {
    # Predict on the hold-out predictors and compare with the true labels
    fit_test <- predict(model, newdata = test[, names(test) != response])
    confusionMatrix(fit_test, test[[response]])
  }, models))
}
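Putting the pieces together (again only a sketch on iris, continuing the objects from the examples above), we can score each surviving model on the hold-out set and keep the most accurate one; cm$overall["Accuracy"] is the overall accuracy reported by caret::confusionMatrix:

cms <- confusion_matrix(splits$test, "Species", models)

# Hold-out accuracy per model, then pick the winner
accuracy <- sapply(cms, function(cm) unname(cm$overall["Accuracy"]))
best_model <- models[[names(which.max(accuracy))]]
accuracy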
Where to now?
There is still much work to do, although this is a good start. Here are some of the things worth working through in the future when there is more time:
- Preprocessing: when would it be better to do this? (A small sketch of one option follows this list.)
- Feature selection: some algorithms perform better with altered or transformed features. How can this be incorporated?
- Postprocessing: following on from preprocessing, is there anything we can do in that regard?
- Improving the output document to become more polished: see the output below; we could produce more tailored commentary based on what was observed. This would probably have to be built with tools other than R.
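On the preprocessing point above: caret::train can centre and scale the predictors (or apply PCA) inside each resampling iteration via its preProcess argument, so one option is simply to thread such an argument through train_caret. A minimal sketch, not yet wired into the function:

# Sketch: let caret re-estimate centring/scaling within each fold
model <- train(Species ~ ., data = splits$train, method = "svmRadial",
               preProcess = c("center", "scale"),
               trControl = trainControl(method = "cv", number = 5))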
Sample Usage
To see some sample usage, below is the R Markdown document produced by my automatic modeller tool; all in roughly 30 LOC! The document is modelled after The Automatic Statistician.
This is my attempt at building an automatic classifier.
Abstract
This is an automatic report for the
data(iris)
response <- "Species"
data set. This approach to automated classification involves heavy use of the caret library, with cross-validation on half of the data to determine the optimal classifier. The optimal model of each class is chosen via caret::train, which in turn is compared against the other models. We shall choose the model with the best performance against our hold-out sample.
1 Brief description of the data set
To confirm that I have interpreted the data correctly a short summary of the data set follows. The target of the classification is:
## [1] "Species"
The dimensions of the data set are:
## [1] 150 5
A summary of these variables is given below:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
2 Summary of the model construction
I have compared a number of different model classes. The performance in the cross validation scheme is shown below:
## $rpart
##
## $nnet
##
## $svmRadial
##
## $gam
3 Model Description
In this section I have described the model I have constructed to explain the data. A quick summary is below, followed by the plots of the estimated Accuracy and Kappa coefficient for all the models.
bwplot(resamples(models))
INSERT VARIOUS CONDITIONS FOR SPECIFIC MODELS?
4 Model Criticism
In this section I have attempted to falsify the model I presented above to understand what aspects of the data it is not capturing well.
INSERT SOMETHING GENERIC