Following my previous post, I had the idea of pulling together a series of modules demonstrating how you could build supervised models from the bottom up. Working through them should give you a strong appreciation of the underlying processes and models used in supervised learning. I think this is also a really good opportunity to learn a bit of Julia.
abstract type Learner end

mutable struct LinearModel <: Learner
    coef
    function LinearModel(X, y)
        this = new()
        this.coef = X \ y  # least-squares solution via the backslash operator
        this
    end
end

function predict(lrn::LinearModel, X)
    X * lrn.coef
end

# Usage:
# a = [1 2; 3 4]
# b = [1; 0]
# linModel = LinearModel(a, b)
# predict(linModel, a)
Let's consider the code block above, which implements a very simple linear regression model. In Julia we can create an abstract type which is akin to the "superclass" of all our supervised learners.
Building on top of this, a linear regression can be solved directly through linear algebra, as shown above. We can also write a generic predict
function that works with all our learners by using multiple dispatch.
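For comparison, here is a minimal sketch of the same least-squares solve in Python/NumPy (my own translation, not from the original post); `np.linalg.lstsq` plays the role of Julia's backslash operator:

```python
import numpy as np

class LinearModel:
    """Ordinary least squares, fitted at construction like the Julia type above."""
    def __init__(self, X, y):
        # lstsq returns (coefficients, residuals, rank, singular values);
        # we only keep the coefficients
        self.coef, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(self, X):
        return X @ self.coef

# Usage:
# a = np.array([[1.0, 2.0], [3.0, 4.0]])
# b = np.array([1.0, 0.0])
# lin_model = LinearModel(a, b)
# lin_model.predict(a)
```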
We can follow this up with KNN:
using Statistics  # for mean

mutable struct KNN <: Learner
    dataX
    dataY
    N
    function KNN(X, y, neighbours)
        this = new()
        this.dataX = X
        this.dataY = y
        this.N = neighbours
        this
    end
end

function predict(lrn::KNN, X)
    predX = zeros(size(X, 1))
    for ix in 1:size(X, 1)
        # squared Euclidean distance from this query row to every training row
        dists = sum(abs2.(lrn.dataX .- X[ix:ix, :]), dims=2)[:]
        y_indx = sortperm(dists)
        predX[ix] = mean(lrn.dataY[y_indx[1:lrn.N]])  # average the first N entries
    end
    return predX
end
The most interesting function here is sortperm, which I used in place of np.argsort.
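To make the comparison concrete, here is a NumPy sketch of the same nearest-neighbour prediction (function and variable names are my own), with np.argsort doing the job of sortperm:

```python
import numpy as np

def knn_predict(data_X, data_y, X, n_neighbours):
    """Predict each row of X as the mean response of its n nearest training rows."""
    preds = np.zeros(X.shape[0])
    for ix in range(X.shape[0]):
        # squared Euclidean distances from this query row to every training row
        dists = np.sum((data_X - X[ix]) ** 2, axis=1)
        order = np.argsort(dists)  # argsort ~ Julia's sortperm
        preds[ix] = data_y[order[:n_neighbours]].mean()
    return preds
```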
Things get harder when you start building regression trees. I attempted it, but my Julia knowledge is rather lacking, and that shows in my poor code decisions; the result seems scrappily put together:
mutable struct RegressionTree <: Learner
    # this implementation focuses on variance splits instead of entropy or gini impurity
    # the tree is stored as an array, where column...:
    # 1 : is the index of the parent node; 0 implies it is the root node
    # 2 : is 1 if the condition in the parent node was satisfied, 0 if it wasn't
    # 3 : is the variable (by column index) that we should split on
    # 4 : is the value of the split. We always assume the predicate has the form value > ?
    decisionTree
    dataX
    dataY
    function RegressionTree(X, y, depth)
        function getSubset(decisionTree, X, y, leaf)
            # traverses the decision tree to get the subset applicable to the
            # current branch
            if leaf == 1
                return X, y
            else
                parent = Int(decisionTree[leaf, 1])
                isParent = decisionTree[leaf, 2]
                col = Int(decisionTree[parent, 3])
                var = decisionTree[parent, 4]
                if isParent == 1
                    subsetX = X[X[:, col] .> var, :]
                    subsety = y[X[:, col] .> var]
                else
                    subsetX = X[X[:, col] .<= var, :]
                    subsety = y[X[:, col] .<= var]
                end
                return getSubset(decisionTree, subsetX, subsety, parent)
            end
        end
        function getBestVariance(xs, y)
            # takes a feature vector and the response, and returns the
            # criterion value together with the best value to split on
            if size(xs, 1) == 0
                return [0.0, 0.0]
            end
            colTable = zeros(size(xs, 1), 2)
            for (row, x) in enumerate(xs)
                mask = xs .> x
                g1 = sum(abs2.(xs[mask] - y[mask]))
                g2 = sum(abs2.(xs[.!mask] - y[.!mask]))
                colTable[row, 1] = g1 + g2
                colTable[row, 2] = x
            end
            # return the row with the smallest criterion value
            return colTable[argmin(colTable[:, 1]), :]
        end
        this = new()
        this.dataX = X
        this.dataY = y
        # the size of the tree can be predetermined, and the first two columns
        # filled in directly. For depth 3 (2^3 - 1 = 7 nodes) they look like:
        # 0 1   (root)
        # 1 1
        # 1 0
        # 2 1
        # 2 0
        # 3 1
        # 3 0
        this.decisionTree = zeros(2^depth - 1, 4)
        nNodes = size(this.decisionTree, 1)
        this.decisionTree[2:nNodes, 1] = vcat([i * ones(2) for i in 1:(nNodes - 1) ÷ 2]...)
        this.decisionTree[2:nNodes, 2] = repeat([1, 0], (nNodes - 1) ÷ 2)
        # now fill in the tree; for simplicity we always fill to the full depth
        # (in practice it isn't this simplistic)
        for branch in 1:nNodes
            subsetX, subsety = getSubset(this.decisionTree, X, y, branch)
            # per-column table: 1 : column index, 2 : criterion value, 3 : best split
            varTable = zeros(size(subsetX, 2), 3)
            varTable[:, 1] = 1:size(subsetX, 2)
            for column in 1:size(subsetX, 2)
                cols = subsetX[:, column]
                varTable[column, 2:3] = getBestVariance(cols, subsety)
            end
            bestSplit = varTable[argmin(varTable[:, 2]), :]
            this.decisionTree[branch, 3] = bestSplit[1]
            this.decisionTree[branch, 4] = bestSplit[3]
        end
        this
    end
end
Extending this, we can go further with bagging and boosting. I won't show these in Julia here, but the idea in Python is fairly straightforward: simply resample the points based on the classification error; if the classification error on a point is high, then increase its sampling probability.
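As a minimal sketch of that resampling idea (my own Python, with an arbitrary up-weighting factor): after each round, multiply the sampling weight of each misclassified point, renormalise so the weights form a distribution, and draw the next round's sample from it.

```python
import numpy as np

def update_weights(weights, misclassified, factor=2.0):
    """Boosting-style reweighting: up-weight misclassified points by `factor`,
    then renormalise so the weights sum to one."""
    weights = np.where(misclassified, weights * factor, weights)
    return weights / weights.sum()

def resample(X, y, weights, rng):
    """Draw a bootstrap sample of (X, y) according to the current weights."""
    idx = rng.choice(len(y), size=len(y), p=weights)
    return X[idx], y[idx]
```

Here `misclassified` would be a boolean mask produced by the previous round's learner; hard points then appear more often in the next round's training sample.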