Today I am releasing binst, an “optimal binning” package using supervised and unsupervised methods, including:

  • kmeans
  • entropy
  • decision trees

Motivations

This package was first spurred by smbinning, which I found very confusing to use; that experience motivated the decision tree method in this package. Although smbinning “worked”, I had trouble, from an interoperability perspective, applying it to H2OFrames.

For example I wished to perform something as simple as:

h2o_makecuts <- function(x, y) {
  # using cutPoints from the discretization package
  breaks <- c(min(as.vector(x)), max(as.vector(x)), cutPoints(as.vector(x), as.vector(y)))
  h2o.cut(x, breaks, include.lowest = TRUE)
}

However, this was very difficult with smbinning; replicating its approach ended up looking like this:

h2o_makecuts <- function(x, y) {
  # using the same approach as smbinning
  breaks <- make_ctreebreaks(as.vector(x), as.vector(y))
  h2o.cut(x, c(min(as.vector(x)), max(as.vector(x)), breaks), include.lowest = TRUE)
}

make_ctreebreaks <- function(x, y) {
  # fit a conditional inference tree (partykit::ctree) and extract its split points
  df <- data.frame(x = x, y = y)
  build_tree <- ctree(y ~ x, df)
  breaks <- as.numeric(unique(unlist(list.rules.party(build_tree))))
  return(breaks)
}

list.rules.party <- function(x, i = NULL, ...) {
  # adapted from the partykit source code
  if (is.null(i)) i <- nodeids(x, terminal = TRUE)
  if (length(i) > 1) {
    ret <- sapply(i, list.rules.party, x = x)
    names(ret) <- if (is.character(i)) i else names(x)[i]
    return(ret)
  }
  if (is.character(i) && !is.null(names(x)))
    i <- which(names(x) %in% i)
  stopifnot(length(i) == 1 & is.numeric(i))
  stopifnot(i <= length(x) & i >= 1)
  i <- as.integer(i)
  dat <- data_party(x, i)  
  if (!is.null(x$fitted)) {
    findx <- which("(fitted)" == names(dat))[1]  
    fit <- dat[,findx:ncol(dat), drop = FALSE]   
    dat <- dat[,-(findx:ncol(dat)), drop = FALSE]
    if (ncol(dat) == 0)
      dat <- x$data
  } else {
    fit <- NULL  
    dat <- x$data
  }
  
  rule <- c()
  
  recFun <- function(node) {
    if (id_node(node) == i) return(NULL)   
    kid <- sapply(kids_node(node), id_node)
    whichkid <- max(which(kid <= i))
    split <- split_node(node)
    ivar <- varid_split(split)
    svar <- names(dat)[ivar]
    index <- index_split(split)
    if (is.factor(dat[, svar])) {
      slevels <- levels(dat[, svar])[index == whichkid]
      srule <- paste(svar, " %in% c(\"", 
                     paste(slevels, collapse = "\", \"", sep = ""), "\")",
                     sep = "")
    } else {
      if (is.null(index)) index <- 1:length(kid)
      breaks <- cbind(c(-Inf, breaks_split(split)), 
                      c(breaks_split(split), Inf))
      sbreak <- breaks[index == whichkid,]
      right <- right_split(split)
      srule <- c()
      if (is.finite(sbreak[1]))
        srule <- c(srule, sbreak[1])
      if (is.finite(sbreak[2]))
        srule <- c(srule, sbreak[2])
      #srule <- paste(srule, collapse = " & ")
    }
    rule <<- c(rule, srule)
    return(recFun(node[[whichkid]]))
  }
  node <- recFun(node_party(x))
  return(unlist(rule))
}

test <- make_ctreebreaks(iris[, 1], iris[, 5])

iris.hex <- as.h2o(iris, "iris.hex")
iris.hex$SL_Bin <- h2o_makecuts(iris.hex[, 1], iris.hex[, 5])
iris.hex$SL_Bin %>% as.vector() %>% unique

Extensions

kmeans was mostly inspired by a similar approach to one-dimensional clustering that I used in my CS 6601 course at Georgia Tech (note that in that course we had to implement our own variant of kmeans in Python, so you won’t get any hints here!).
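As a rough illustration of the general idea (this is a sketch, not the actual binst implementation), one-dimensional kmeans binning can be done by clustering the values and placing breakpoints midway between adjacent cluster centres:

```r
# a minimal sketch of kmeans-based binning using base R's stats::kmeans;
# an illustration only, not the binst implementation
kmeans_breaks <- function(x, k = 3) {
  centers <- sort(stats::kmeans(x, centers = k)$centers[, 1])
  # place each break halfway between adjacent cluster centres
  head(centers, -1) + diff(centers) / 2
}

breaks <- kmeans_breaks(iris$Sepal.Length, k = 3)
cut(iris$Sepal.Length, c(-Inf, breaks, Inf))
```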

The entropy approach is simply a wrapper around the discretization package in R.
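For example, the entropy-based (MDLP) cut points used in the h2o_makecuts snippet above come straight from discretization::cutPoints:

```r
library(discretization)

# entropy (MDLP) cut points for Sepal.Length against Species
cuts <- cutPoints(iris$Sepal.Length, iris$Species)

# bin the variable using the discovered cut points
cut(iris$Sepal.Length, c(-Inf, cuts, Inf))
```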

Future Work

If there is enough interest, I hope to add other approaches to binning, probably including:

  • Jenks Natural Breaks
  • Support for categorical data
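For reference, Jenks natural breaks are already available in R via the classInt package; a wrapper might look something like the sketch below (the use of classInt is my assumption, not a committed design for binst):

```r
library(classInt)

# Jenks natural breaks on a numeric vector; classInt::classIntervals is a
# hedged suggestion for how this could be wrapped, not part of binst yet
jenks <- classIntervals(iris$Sepal.Length, n = 3, style = "jenks")
jenks$brks
```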