An inventor’s notebook is a space to record ideas, processes and experimental results. Over the years, I’ve kept a rather uncurated series of “web logs” in the purest sense, which were more of a stream of consciousness. I’ll probably still occasionally add some things of a technical nature which I think are interesting and worthwhile.

For now I’ve moved everything here, with no guarantee that all the links will work. For the canonical versions please visit chappers.github.io - I’ll keep both websites updated until I merge them together.

ONNX API and Retrieving Intermediary Output

It’s quite frustrating trying to navigate through comments and posts on how to do “simple” things, so I’ll just document how this is currently done, as well as why it is the way it is. Of course, down the track these issues may get resolved - however for now, let’s run through this! Problem Statement: How do we get intermediary layer output as part of an ONNX model? Many issues have been raised about this, with no completely working code. [Read More]

Revisiting TangleJS

Sometimes you want an interactive document, whereby you move a slider and values on the page change. A good example of this is on distill.pub and of course Tangle. Let’s look at some of these components and see what we can do with them in a more “modern” framework, seeing that the library itself hasn’t been updated in years! At its heart, Tangle is essentially a fancy “range” input: [Read More]

Shortcodes in Hugo - Thinking about data lookup

It’s now been over 7 years since starting out this blog, and I’ve been thinking about a refresh! In this process I’ve looked at Hugo, another static site generator. Shortcode Shortcodes are similar to Jinja templates which can inject and access variables. In our setup, we want to access key-value data pairs so that we can easily manage links and other content en masse. In our setup, we’ll add some data to the data/test. [Read More]

Strapdown is dead, long live Strapmarked?

Okay, so Strapdown hasn’t been updated in over 5 years, so it’s effectively dead (right?). In that case what are our options? Can we still generate beautiful documents by writing markdown directly in HTML? Yes of course! The answer is actually fairly straightforward: use marked to handle the markdown parsing, and use any drop-in CSS framework for styling the page. You can view a selection here. What does this look like? [Read More]

Gimp Python Fu

Bonus To make a document look “scanned” - maybe we should add a python-fu script to do this as well! convert -density 90 input.pdf -rotate 0.5 -attenuate 0.2 +noise Multiplicative -colorspace Gray output.pdf I’ve recently been working on custom workflows using Gimp’s python-fu plugins; I’m actually surprised I haven’t come across it more in image based workflows! In this workflow, you can add layers and do calculations on an image, without manually forcing the pixel-level changes you might otherwise have to make. [Read More]

Categorical Encoding For Knn

In this post we’ll look at an implementation of ABDM (Association-Based Distance Metric) and MVDM (Modified Value Difference Metric) for categorical encoding which can be used in k-nearest-neighbours. There currently isn’t a paper on this, but one is forthcoming at the time of writing. This is a quick implementation based on the Seldon Alibi implementation, mainly because their implementation wasn’t very “sklearn-esque”. This is also for my own edification! High Level View of Shared Components What’s the Algorithm? [Read More]

Torch Lightning Using Iris

Okay, so there are many articles on using torch with lightning and training with pytorch. But for whatever reason many of them are overly complicated and talk through complicated workflows. For me, the details are important, but to start off, oftentimes we just want to know how to do a “fit” and a “predict”. In this post, we’ll look briefly at how to set up a minimal example for pytorch. [Read More]

Nim Python Starting Out

In an effort to learn something new and speed up some existing experiments, I’ve started dipping my toes into the Nim programming language. In theory it should offer a substantial speed up to parts of my Python code but we’ll see what actually happens. Here are some patterns which I’m sure I’ll need to refer to multiple times in the future. Sorting a Table To sort a table by value, you need to create something to help with the sorting. [Read More]

Auto Differentiation From Scratch

In this post we’ll look at building auto-differentiation from scratch in Python. Although there are many frameworks for doing this already, it’s always interesting to peek behind the covers and see how you might do it yourself! The broad approach is to compose the variables and functions into components whereby the value and the gradients can be easily passed alongside each other. The most basic unit is a single variable [Read More]
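To make the “value and gradient passed alongside each other” idea concrete, here is a minimal sketch. It assumes a forward-mode design, and the names `Var`, `_lift` and `sin` are my own for illustration; the full post may well structure things differently.

```python
import math


class Var:
    """A value carried together with its derivative (forward mode)."""

    def __init__(self, value, grad=0.0):
        self.value = value
        self.grad = grad

    def __add__(self, other):
        other = _lift(other)
        # Sum rule: derivatives add.
        return Var(self.value + other.value, self.grad + other.grad)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'.
        return Var(self.value * other.value,
                   self.grad * other.value + self.value * other.grad)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (derivative zero)."""
    return x if isinstance(x, Var) else Var(float(x))


def sin(x):
    """Chain rule for sine: (sin u)' = cos(u) * u'."""
    return Var(math.sin(x.value), math.cos(x.value) * x.grad)


# Seed grad=1.0 on the variable we differentiate with respect to.
x = Var(2.0, grad=1.0)
y = x * x + 3 * x + sin(x)
print(y.value, y.grad)  # y.grad holds dy/dx = 2x + 3 + cos(x) at x = 2
```

Every operation returns a new `Var`, so the gradient travels through the expression without any separate bookkeeping.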

Notes On Extending Multiagent Environments

In this post we’ll quickly go through how to add “single” agent policies to a QMIX (multi-agent) environment, specifically in RLlib. Why would we want to do this? In my mind, algorithms like MAVEN, which make use of hierarchical policies, would need constructs like this. Side note: I dislike using PyMARL as an environment as there aren’t enough examples to get things working quickly. Whilst RLlib is fairly opinionated, at the very least it’s easier to mix and match items together, especially when working with other problems which are not MARL (multi-agent reinforcement learning). [Read More]

Plan Attend Generate In Pytorch

There have been several different blog posts which talk about Encoder Decoder mechanisms and annotate their implementation. In this blog post we will do likewise and annotate “Plan, Attend, Generate: Planning for Sequence-to-Sequence Models”, which was part of NIPS 2017. Although this isn’t a particularly famous or popular paper (with 5 citations), it is a generalisation of an RL approach called STRAW from “Strategic Attentive Writer for Learning Macro-Actions”, which was part of NIPS 2016. [Read More]

Beyond Gridworld

Gridworld is a commonly used environment for Reinforcement Learning (RL) tasks; it is simple to implement and understand for the purposes of evaluating a wide number of different agents. In a nutshell Gridworld presents: a grid-based world, which can be easily represented as a matrix (or by extension a tensor, if it’s a world with different ‘layers’); a simple turn-based mechanism through which agents interact with the world; and a set of rules and actions which are evaluated per ‘step’ within the RL environment. [Read More]
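As a rough illustration of those three ingredients (a matrix world, a turn-based step, and rules evaluated per step), here is a minimal sketch. The class name, the wall encoding and the reward values are all assumptions made for the example, not something taken from the full post.

```python
# Actions map to (row, column) offsets on the grid.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


class GridWorld:
    def __init__(self, grid, start, goal):
        self.grid = grid  # matrix: 0 = free cell, 1 = wall (assumed encoding)
        self.pos = start
        self.goal = goal

    def step(self, action):
        """Apply one action and return (position, reward, done)."""
        dr, dc = MOVES[action]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        inside = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[0])
        # Rule: moves into walls or off the grid leave the agent in place.
        if inside and self.grid[r][c] == 0:
            self.pos = (r, c)
        done = self.pos == self.goal
        reward = 1.0 if done else -0.1  # hypothetical reward scheme
        return self.pos, reward, done


env = GridWorld(grid=[[0, 0], [1, 0]], start=(0, 0), goal=(1, 1))
trajectory = [env.step(a) for a in ("down", "right", "down")]
print(trajectory)
```

The first "down" is blocked by the wall, so the agent only reaches the goal on the third step.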

My Notes On Graph Neural Networks

As the year draws to an end I thought it’s time to reflect on what I’ve been reading and researching. One topic on which I’ve realised I’ve been very “light” with respect to notes is Graph Neural Networks. This post is to set out my high-level thoughts on Graph Neural Networks and how they relate to other models in a Deep Learning sense. But before we dive in let’s talk about the purpose of graph neural networks. [Read More]

Reproducibility Is Overrated

What’s with the popular research papers in Dota 2 and StarCraft? Yes, OpenAI and DeepMind have done amazing work, but their benchmarks and specs make it intractable for a normal research lab to even think about reproducing the results! Baseline estimate for either approach is in the order of 5M USD of training time! (This is computed through the use of 500 GPUs for a period of roughly 12 months, at say 4. [Read More]

Using Spektral Top K Api

There’s a lot of somewhat undocumented APIs in Spektral. As I’m interested in graph coarsening, I thought we’d look at how some of the graph networks are implemented and how they are used. Firstly for a node-wise graph - it can be implemented where the inputs are 2 items: the node-feature input, and the adjacency matrix. nodes = 5 feats = 4 X = np. [Read More]

How Hard Could It Be

Is it really that hard to teach a software engineer machine learning? This is a thought experiment that I had: that implementing simple variations of popular algorithms is sufficient for a software engineer to build out pipelines and help the wider team be more effective. Gradient Descent - Or how to iteratively find the “line of best fit” One of the simplest algorithms you could use for line of best fit is simply doing an interval halving method, or a binary search method. [Read More]
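One way the interval halving idea could look in code is sketched below. This is my own assumption of the setup rather than the post’s implementation: it fits only a slope (no intercept) and halves the interval based on the sign of the squared-error gradient; `fit_slope` and the toy data are hypothetical.

```python
def loss_gradient(m, xs, ys):
    """Derivative of sum((m*x - y)^2) with respect to the slope m."""
    return sum(2 * x * (m * x - y) for x, y in zip(xs, ys))


def fit_slope(xs, ys, lo=-100.0, hi=100.0, tol=1e-9):
    """Interval halving: keep the half of [lo, hi] that contains the minimum."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if loss_gradient(mid, xs, ys) > 0:
            hi = mid  # loss is rising at mid, so the minimum is to the left
        else:
            lo = mid  # loss is falling at mid, so the minimum is to the right
    return (lo + hi) / 2


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]
slope = fit_slope(xs, ys)
print(slope)  # should agree with the closed form sum(x*y) / sum(x*x)
```

This works because the squared-error loss is convex in the slope, so its gradient changes sign exactly once inside the search interval.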

Thinking About Linear Ensembles

Can linear ensembles yield non-linear boundaries when the base learner is a linear model? On the surface, the answer surely is “no”. How could a linear model, when combined in a linear combination, yield a non-linear boundary? On closer inspection, the answer lies in how we may ensemble, and also how we may introduce non-linearities, similar to how neural networks are trained to represent non-linear decision boundaries. Thinking about neural networks If we had simple linear models, which are ensembled with another linear operator, this could be interpreted as a two layer neural network where the first layer represents your linear models and the second layer represents your simple ensemble. [Read More]
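The “surely no” intuition can be written out in one line. Using my own notation (assumed here, not taken from the post), let the $K$ base models be $f_k(x) = w_k^\top x + b_k$ with ensemble weights $\alpha_k$. A purely linear combination collapses back into a single linear model:

$$\sum_{k=1}^{K} \alpha_k \left( w_k^\top x + b_k \right) = \Big( \sum_{k=1}^{K} \alpha_k w_k \Big)^\top x + \sum_{k=1}^{K} \alpha_k b_k$$

So any non-linear boundary has to come from a non-linearity applied between the two “layers”, for example thresholding each $f_k(x)$ before combining.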

Building Out Stable Baseline Benchmarks

In this post we’ll go through how one could build a Keras model using the stable-baselines library as well as the conditions to create a default gym environment. The simple premise is that if we want to use the Keras APIs to build the basis for a Policy using PPO, we should be able to do this in a fairly straightforward manner. Suppose you wanted to create a simple MLP model: import tensorflow as tf # . [Read More]

Testing Out Ksql For Ml

In this post we’ll test out how we can deploy a machine learning model over KSQL when it has been successfully transcompiled. This leverages new features which I had a part in adding: new mathematical functions in KSQL. Getting Started To get started ensure that you have Kafka and KSQL installed. You may need to install KSQL from source if it has not been updated in the monthly snapshots. For the purposes of this example, we will also leverage the inbuilt default topic generator in Kafka: [Read More]

Convolutions For Time Series Data Mining

When we’re moving towards time series or transactional data, it is interesting to think about how features are generated and how we perform data mining within this context. The “state of the art” approaches may involve simply throwing neural networks at it, however this may be challenging when additional engineering constraints are put on top. The good news is that this can be solved through purely engineering approaches, rather than fancy algorithms or theorems. [Read More]

Difflib And Sequence Matching

How do we detect plagiarism? There are probably many state of the art ways one could approach this problem; in this post we’ll explore how we can use the standard Python library to do some simple diffs and comparisons across large corpuses to find out when things overlap or don’t overlap. Using the difflib library to match arbitrary strings is quite easy: import difflib text1 = 'text1 says hello world because there is only one! [Read More]
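Since the excerpt cuts off mid-example, here is a separate minimal sketch of matching two strings with `difflib.SequenceMatcher`. The two sample sentences are my own and are not the ones used in the post.

```python
import difflib

# Hypothetical sample strings, differing by a single word.
text1 = "the quick brown fox jumps over the lazy dog"
text2 = "the quick brown cat jumps over the lazy dog"

matcher = difflib.SequenceMatcher(None, text1, text2)

# ratio() is a similarity score between 0 and 1.
ratio = matcher.ratio()

# get_matching_blocks() yields the shared runs; the final block has size 0.
blocks = [text1[m.a:m.a + m.size]
          for m in matcher.get_matching_blocks() if m.size > 0]

print(ratio)
print(blocks)
```

For plagiarism-style checks the matching blocks are usually more informative than the ratio alone, since they show which passages overlap.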

Approximations To Array Operations

When you think about Machine Learning operations, often the complexity arises when the operations which are used operate over arrays rather than over records. For example, elementary operations which are row-based generally have simple SQL query analogues, whether you are doing something like addition, subtraction or even calculating the mean via a group by. However if you are doing group by comparisons or operations over sequential data, you will essentially be performing an operation which is like an inner loop (or an all-pairs operation). [Read More]

My Research Workflow

When you work in the data analytics realm, whether you are performing exploratory analysis or building some production grade machine learning model, the workflow really shouldn’t be all that different. This post is just a summary of what I do at this current point in time. In essence it is reduced to two broad ideas: reduce context switching, and make your stuff portable. Reduce Context Switching To reduce context switching means that one should aim to use the same set of tools no matter what they are doing. [Read More]

Thinking About Differentiable Functions

Dirac Delta Function If we wanted to have a function that was differentiable, and behaved like a discrete function how would we do it? As a starting point we could use the Dirac Delta Function for a point estimation. library(ggplot2) a = 0.1 dirac_delta <- function(x, a, loc=0){ (1/(abs(a)*(pi^(0.5))))* exp(-((x-loc)/a)^2) } xrange = (-100:100)/100 y = dirac_delta(xrange, a=a) ggplot() + aes(x=xrange) + aes(y=y)+ geom_line() But perhaps we want to have a function where it is defined to be 1 within a range, and 0 elsewhere. [Read More]
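For comparison with the R snippet in the excerpt, here is the same Gaussian approximation sketched in plain Python, with a quick numerical check that the area under the curve stays close to 1. The `integrate` helper and the integration range are my own additions for illustration.

```python
import math


def dirac_delta(x, a, loc=0.0):
    """Gaussian approximation to the Dirac delta; narrower as a shrinks."""
    return math.exp(-((x - loc) / a) ** 2) / (abs(a) * math.sqrt(math.pi))


def integrate(f, lo, hi, steps=20000):
    """Midpoint rule, good enough for a smooth bump like this one."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


area = integrate(lambda x: dirac_delta(x, a=0.1), -1.0, 1.0)
print(area)  # close to 1 for small a

# The peak gets taller as a shrinks, while the area is preserved.
print(dirac_delta(0.0, a=0.1), dirac_delta(0.0, a=0.01))
```

The function is differentiable everywhere, which is what makes it a useful stand-in for a discrete spike.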

Designing Online Boosting Algorithms

This post revolves around several papers on online boosting including “Optimal and Adaptive Algorithms for Online Boosting” (ICML 2015) and “Online Gradient Boosting” (NIPS 2015). The ideas presented here are not the original algorithms, but seek to gain an understanding of the intuition when switching from batch to online for boosting algorithms, such as the implications of moving to a regression variant of online AdaBoost as done in “Improving Regressors using Boosting Techniques” (1997). [Read More]

Interpreting Text Embedding Models

This post borrows code from https://github.com/hiranumn/IntegratedGradients which in itself is based on the Integrated Gradients paper which was part of the WHI workshop at ICML 2018. Interpreting word embedding models is fairly difficult - how do we know what words (or phrases) were an indication of why a particular instance was predicted in a certain way? In this post we primarily go through the code that can be used to describe this. [Read More]

Linear Decision Boundaries

This post is to quickly go through linear SVMs and decision boundaries. Decision Rules in Binary Classification Consider the simple case where we have one predictor and it is a binary classification problem. Then the formulation for linear SVM would be: $$\hat{y} = wx + b$$ where if $\hat{y} \geq 0$, the positive class is predicted, and the negative class otherwise. In this simple formulation, the decision rule surrounding this model would be: [Read More]
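The one-predictor case is small enough to write out directly. The sketch below just encodes the sign rule stated above; the weight and bias values are made up for illustration, and the threshold $x = -b/w$ follows from setting $\hat{y} = 0$.

```python
def predict(x, w, b):
    """Positive class (+1) when w*x + b >= 0, negative class (-1) otherwise."""
    y_hat = w * x + b
    return 1 if y_hat >= 0 else -1


def boundary(w, b):
    """The single point where w*x + b == 0 (requires w != 0)."""
    return -b / w


w, b = 2.0, -3.0  # hypothetical fitted parameters
threshold = boundary(w, b)
labels = [predict(x, w, b) for x in (1.0, 1.5, 2.0)]
print(threshold, labels)
```

With a positive `w` everything at or to the right of the threshold is predicted positive; a negative `w` flips the sides.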

Reflections On Interview Process

This post is a reaction to this article: https://www.businessinsider.fr/us/microsoft-new-developer-interview-process-2018-12 How do we do more objective interviews? In a recent article, Microsoft promotes an approach in order to reduce bias, ensure the interview is more relevant and promote greater empathy. In this post, we’ll summarise what was actually done, and comment on how this could be applied to data analytics interview processes. Share Interview Questions with the Candidate: one of the more challenging things is to first share with the candidate what interview questions will be asked and what they will be working on (pair programming). [Read More]

Build Ml Pipelines Once In Databases

In a previous post we thought through how we can build a Python pipeline and deploy to JavaScript. This does indeed allow us to deploy on practically anything with a computer, but let’s take a different spin on it - can we deploy onto anything with a database? There are many reasons why we want to do this. On an enterprise level, there would be a preference for deploying onto the same shared intermediary architecture. [Read More]

Build Ml Pipeline Once Deploy Everywhere

There is a certain appeal to building a machine learning pipeline once and deploying everywhere. Now often this refers to pipelines which are built via batch, and deployed as a batch, an API or via a stream; in this post, I thought I’d explore what it may mean if we build a Python pipeline once and deploy over node. Advantages Easier integration into the node ecosystem - we no longer have to worry about mixing languages, nor are we necessarily tied to a pure Python server-side stack. [Read More]

From Local To Cloud Using Colab For Model Training

In this post I’ll quickly go through my tips and tricks for using Colab in conjunction with my local environment. Setup My general setup is to write a batch file locally and run it, then change all references to local directories to the Google Drive location. This involves adding a new cell at the top of the file: from google.colab import drive drive.mount('/content/gdrive') To confirm this is working as intended you can verify by writing a dummy file like this: [Read More]

Lessons On Design And Engineering Decisions

One of the interesting posts to pop up in June 2018 was on Airbnb and their experiences with React. Whilst I’m obviously not a React developer, there are a lot of golden nuggets to think about in my own work and how I should approach decision making. In this post I thought I’d comment on Airbnb’s reflections and what they mean for me, when I’m thinking about taking engineering one step further. [Read More]

Weekend Adventures In Typescript And Jupyter Variable Explorer

For data scientists moving their workflows from their desktop computer to the cloud, one of the hardest parts of “letting go” is the lack of an IDE. More specifically, one of the most requested features is a variable explorer like in Spyder. Over the weekend I decided to take a stab at Jupyterlab extensions, working off some of the initial work here: https://github.com/lckr/jupyterlab-variableInspector Lessons learnt and additions made: dealing with changing kernels and languages; improvements for Numpy and DataFrame; ability to interact with Tensorflow and Spark items. Changing kernels: [Read More]

Decision Trees Via Sgd

This post was more to get myself thinking; nothing here is rigorous or necessarily makes sense - leaving it here for historical reasons. How would you train a decision tree via stochastic gradient descent? This idea was covered in Efficient Non-greedy Optimization of Decision Trees. What does it mean to be “non-greedy”? When we consider CART or similar algorithms, they are often greedy, that is, when a split is found it cannot be changed. [Read More]

Relationship Between Resnet Boosting

The paper “Learning Deep ResNet Blocks Sequentially using Boosting Theory”[1] is a paper to appear in ICML 2018. At a high level, the paper talks through the relationship between ResNet and Boosting, and how a ResNet can be trained in a single forward pass, in a manner similar to how boosted models are trained, in lieu of back propagation. The idea here is that we can then train non-differentiable layers in a neural network. [Read More]

Reflections On Akins Law Of Spacecraft Design

Any run-of-the-mill engineer can design something which is elegant. A good engineer designs systems to be efficient. A great engineer designs them to be effective. (McBryan’s Law) You can’t make it better until you make it work. One of the most important aspects of design is simply to defer decisions as long as possible (see Clean Architecture). This avoids issues such as lock-in and perhaps to some extent bike shedding. [Read More]

Art Of Guessing

Human beings are always lazy, and nature always has in it really interesting and random phenomena. In this post we will explore a few key ideas: What if the Pareto Principle holds? Rule of 5 Pareto Principle The Pareto Principle is the idea that life and nature generally adhere to an 80:20 rule: that 80% of the effect can be explained by 20% of the cause. But what if we want to know 90%, 70% or 50% of the effect - what is the corresponding proportion of cause? [Read More]
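One possible way to answer that question is sketched below. It rests on an assumption of mine that the post may not share: causes follow a Pareto distribution, so the top fraction $p$ of causes accounts for a share $p^k$ of the effect, with $k$ pinned down by the 80:20 point. The function name `cause_share` is hypothetical.

```python
import math


def cause_share(effect_share, effect=0.8, cause=0.2):
    """Fraction of causes needed for a given share of the effect.

    Assumes share_of_effect = share_of_causes ** k, calibrated so that
    `cause` of the causes explains `effect` of the effect.
    """
    k = math.log(effect) / math.log(cause)
    return effect_share ** (1.0 / k)


for target in (0.9, 0.8, 0.7, 0.5):
    print(target, cause_share(target))
```

Under this assumption the relationship is very steep: small reductions in the share of effect you want to explain cut the required share of causes dramatically.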

On Advice And Tools

How did I learn programming and machine learning? What tools should you learn? What tools should I use? A lot of this advice is difficult to give - simply because the landscape has changed in such a short time, and also the path that I decided to take. For example: Previously, I would recommend workflows using tools like luigi, or airflow or even rolling your own using doit or go - now I think plain old make is better. [Read More]

Stopit Patterns In Python

One useful pattern in machine learning is having time-based constraints for testing and comparing your algorithms - after all that is the easiest way to assess its efficacy. The easiest way to do this is through the stopit library. The simplest pattern I got working is this one: from stopit import threading_timeoutable as timeoutable import time class SimpleObject(object): """ >>> ss = SimpleObject() >>> ss.ticker(timeout=5) >>> ss.items """ def __init__(self): self.items = [] @timeoutable() def ticker(self): while True: self. [Read More]

Things I Wish I Learnt Sooner On Linux

Without a doubt, I wish I learnt a bit of vim and screen earlier on in my Linux journey. Although I now use both tools on a semi-regular basis, I would still be considered an amateur - knowing only the basic commands. However, the truth is you only need the basic commands to get a lot of work done! Here is my hack guide to using vim and screen in order to save yourself a lot of pain for the simple things; without trying to know or do everything in the terminal. [Read More]

Data Engineering What And How

Data Engineering. In my view, Data Engineers can be thought of as the “Type B” Data Scientists - the builders. The Data Engineering role is a blend between Data Scientists and Software Engineers. Recently, that’s a term that has been appearing more and more (at least to me). I thought I’d look into two things: What is it? How does one become proficient at it? To answer both of these questions, the easiest way is to look at the training which the major cloud providers have. [Read More]

On Data Engineering And Cloud Providers

Recently, I’ve been thinking “what are the skill sets required for a data engineer” and “how would one demonstrate data engineering knowledge”? Quick searches reveal several things: there aren’t a lot of (free) MOOCs on data engineering specifically; and all cloud providers have exams specific to their cloud offering for “data engineering” (being Google, AWS, Microsoft/Azure) - none of which are more “generalist”. But it’s clear that there are some commonalities in what “data engineering” is. [Read More]

Paper Writing Revisited

Whilst the first post on Paper Writing was focused on delivering a paper, and providing structure for people who have yet to write a paper, one year on, what would I change about it? Is it still relevant? In the original focus, we begin with no topic, no ideas, and spend half the time reading and crafting a proposal (the why). My Thought Process Oftentimes writing a paper is iterative. [Read More]

Feedback Loops Lessons

Series of personal essays inspired by the blog post “Things I have learnt as the software engineering lead of a multinational”. On Planning There is a time for everything: a time to share detailed messages, and a time to share higher level thoughts. When leading a team there are two very different scenarios you will end up in: tasks which are well-defined with easily measured goals; however, most of the time, they fall under tasks which are unclear, with unknown requirements and without a clear plan. Based on these two situations, our communication style should change accordingly. [Read More]

On Feedback Loops

We all hear of things like: good is the enemy of the best Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. Simple can be harder than complex: You have to work hard to get your thinking clean to make it simple. But have we truly thought of the consequences of thinking in this way? [Read More]

Short Notes On Gensim

When designing interfaces for software, a commonly cited principle is the “Principle of Least Astonishment”; interfaces should work the way that you want them to! However sometimes there are weird oddities which pop up which people need to be aware of. One of these is the current (in development) interface from gensim to scikit-learn. Imagine you are doing a TFIDF model in scikit-learn right now. The code to do this looks like: [Read More]

Orchestrating In Memory Jobs In Luigi

Orchestrating in-memory jobs in Luigi need not be hard. The easiest way is to roll your own luigi.Target. For example the following works perfectly: _data = {} class MemoryTarget(luigi.Target): _data = {} def __init__(self, path): self.path = path def exists(self): return self.path in _data def put(self, value): _data[self.path] = value def get(self): return _data[self.path] The reason why I’ve kept the _data outside the class is simply to ensure that Python objects persist outside of their respective tasks (of course you could bring it into MemoryTarget if you didn’t want that to happen). [Read More]
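To see why the module-level `_data` matters, here is a dependency-free sketch of the same pattern. Note that this stand-in deliberately does not subclass `luigi.Target` so it runs without Luigi installed; in a real pipeline the class would subclass it as in the excerpt. The `"features"` key is a made-up example.

```python
_data = {}  # module-level store shared by every target instance


class MemoryTarget:
    """Stand-in with the same methods as the luigi.Target subclass above."""

    def __init__(self, path):
        self.path = path

    def exists(self):
        return self.path in _data

    def put(self, value):
        _data[self.path] = value

    def get(self):
        return _data[self.path]


# One "task" writes through its own target instance...
MemoryTarget("features").put([1, 2, 3])

# ...and a different instance, as a downstream task would create, reads it.
downstream = MemoryTarget("features")
print(downstream.exists(), downstream.get())
```

Because the store lives outside any single instance, the object written by one task is still there when another task builds a fresh target for the same path.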

Lessons From Developing In Graphene

For the sake of my own sanity, everything I developed was in Graphene, so whenever I mention GraphQL please substitute with “GraphQL/Graphene” where it makes sense. GraphQL is an interesting approach to graph queries (after all it stands for Graph Query Language), as it does not explicitly sit on a graph database. Rather it seems to make use of various data loaders and constructors in order to create a graph-like experience. [Read More]

Generalising Transfer Learning

Typically when we think of transfer learning, we naturally think of Deep Learning algorithms. After all it makes it easy to “transfer” learnings from a similar problem to the current domain. For example, we can generate image features from one domain and use them to “kick-start” another problem. Can we do the same thing for other machine learning problems? What is Transfer Learning I like to think of transfer learning as an extension of online learning, in the sense that we can learn off another problem, and without retaining any knowledge of the underlying dataset, use that information to inform us better on how we should go ahead and tackle new information for our machine learning problem. [Read More]

Creating Partial Plots

Partial dependency plots are an important part of post-hoc modelling, particularly when we are dealing with complex ensemble based models. The idea behind these plots is to simply show the effect of modifying a single variable assuming that all other variables are the same. However there are several ways this could be achieved, each with different assumptions. Using Base Data One approach for calculating partial dependency is to simply calculate at $n$ points spread evenly for the variable of interest. [Read More]
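A minimal sketch of that first approach is below, under my own assumptions about the details: the grid spans the observed minimum and maximum of the feature, and the partial dependence at each grid point is the average prediction with that feature overwritten in every row. `partial_dependence` and `toy_model` are hypothetical names, and the model is a stand-in for any fitted predictor.

```python
def partial_dependence(predict, rows, feature, n_points=5):
    """Average prediction as one feature sweeps an evenly spaced grid."""
    values = [row[feature] for row in rows]
    lo, hi = min(values), max(values)
    grid = [lo + (hi - lo) * i / (n_points - 1) for i in range(n_points)]
    curve = []
    for v in grid:
        # Overwrite the feature of interest, keep all other columns as observed.
        modified = [row[:feature] + [v] + row[feature + 1:] for row in rows]
        curve.append(sum(predict(r) for r in modified) / len(modified))
    return grid, curve


def toy_model(row):
    """Hypothetical model: linear in feature 0, quadratic in feature 1."""
    return 2.0 * row[0] + row[1] ** 2


rows = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
grid, curve = partial_dependence(toy_model, rows, feature=0, n_points=3)
print(grid, curve)
```

For this toy model the curve rises by 2 per unit of feature 0, which recovers the coefficient on that feature regardless of the other column.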

Approximating Correlation

There are of course several ways one can approximate correlation. In this post I thought I would outline the use of kernel approximation and how to relate that to correlation measures. Rough outline: realise cosine similarity is the same as correlation when centered; use a kernel approximation method (Nystroem). Cosine Similarity The link to cosine similarity is best described in this post. The important aspect is that $$ \rho_{xy} = \frac{1}{n} \sum_i z_{x,i} z_{y,i}$$ [Read More]
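The first step of the outline is easy to check numerically. This is a small sketch with made-up data, using the population standard deviation so that it matches the $\frac{1}{n}$ in the formula above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def center(u):
    """Subtract the mean from every element."""
    mean = sum(u) / len(u)
    return [a - mean for a in u]


def pearson(x, y):
    """Correlation as the mean product of standard scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n


x = [1.0, 2.0, 4.0, 7.0]  # hypothetical data
y = [2.0, 1.0, 5.0, 6.0]
print(cosine(center(x), center(y)), pearson(x, y))  # the two should agree
```

Once correlation is expressed as a cosine similarity, any method that approximates a cosine or linear kernel can be reused to approximate correlation.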

Ideas For Next 10 Months

Planning for 10 months is always going to be hard, but sometimes it is worthwhile to note it down so that you can review later to see what has changed and arrange any crazy ideas that you might have. Research Within the next 6 months, one achievable goal is to complete a paper which is “good enough” to think about publication. The goal isn’t necessarily to publish, but to have something which we can at least make a decision whether it is sufficient down the road. [Read More]

A Somewhat Wrong Overview Of Yolo Framework

The YOLO (you only look once) framework is a cutting edge approach to object detection in images. In this post I thought I’d go through in high level detail how it works, and how we might build our own YOLO architecture (if we wish). The aim of this post is not to provide a comprehensive view, but rather to demonstrate some of the ideas that might be different for users who don’t have a background in object detection. [Read More]

Determinantal Point Process In Bad Pseudocode

In attempting to write a pure Python sampling version of determinantal point process I thought I’d write through some pseudo-code and go through what worked. I have provided my current work in progress here. It is based on the Matlab code of Alex Kulesza, which was converted (partially) using the SMOP package. Changes to the Original Implementation The changes to the Matlab implementation revolve around how to calculate the new orthogonal basis for the subspace relative to the vector which has been chosen. [Read More]

Relational Data Mining

Relational data mining is data mining where the data that is provided isn’t a single “flat” table, but rather a series of relational data tables. There are several ways to deal with this in data mining research, but in this post we will only cover a really small and specific type - propositionalization. In propositionalization, the solution to the problem is simply to convert the relational data to a flat structure. [Read More]

The Search For The Boring

I’ve been thinking a lot about research and the attempt to solve practical problems (after all that is in a way the goal of my PhD). The question is: what is the easiest way to gain acceptance, or to ensure other people use what you are going to build? Now part of this is undoubtedly my own world view and not necessarily the view of my supervisor or company, because I believe several facets determine whether something is worthwhile. [Read More]

Feature Engineering And Deep Learning

In this post I will have a look at the first two projects within Udacity’s self-driving car project. More specifically I will share some of my thoughts in a “meta-learning” sense: how do things that I know and currently do relate to machine perception/deep learning problems? Finding Lane Lines on the Road Traffic Sign Recognition The aspect of these projects which interests me most isn’t so much the “deep learning” portion, but rather tackling problems which I will describe as perception problems. [Read More]

Taking Bayesian Optimization For A Test Run

Some notes on Bayesian Optimization using the Matern kernel as per the NIPS Practical Bayesian Optimization paper. # see http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_prior_posterior.html import numpy as np import matplotlib.pyplot as plt import matplotlib.image as mpimg %matplotlib inline from sklearn.gaussian_process import GaussianProcessRegressor from sklearn.gaussian_process.kernels import Matern kernel = Matern(nu=2.5) gp = GaussianProcessRegressor(kernel=kernel) # suppose we are fitting this function.. X = np.linspace(0, 5, 100) y = np.sin((X - 2.5) ** 2) plt. [Read More]

Software Design Advice Taken Out Of Context

Lately I’ve been thinking about software design. What is considered “good” design? As a meta question, does design even matter? The Unix Way (Keep it simple stupid!) There are a lot of different views of Unix philosophy, and many people have tried to summarise it. The general feeling is that Unix favours: Simple: make each program do one thing and one thing well. Modular: output of a program should be an input to another, unknown, program. [Read More]

A Quick Look Into Python Doit

Make files have been around for a long time. They are probably one of the most popular build automation tools being used. In this post, we will look at a “lighter” version that is implemented in Python called doit. Since the goal is not necessarily to have a complicated build tool, but rather automate simple tasks, doit is perfect for this situation. In an analytics situation, the goal of this automation tool generally is: [Read More]
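For a flavour of what this looks like, here is a small sketch of a `dodo.py`, the file doit reads by default. Functions whose names start with `task_` define tasks and return a dictionary of metadata; the file names and scripts below are hypothetical placeholders, not taken from the post.

```python
# dodo.py - doit picks up every function whose name starts with "task_".


def task_clean_data():
    """Hypothetical step: turn raw.csv into clean.csv."""
    return {
        "file_dep": ["raw.csv"],      # rerun only when this input changes
        "targets": ["clean.csv"],     # files this task produces
        "actions": ["python clean.py raw.csv clean.csv"],
    }


def task_report():
    """Hypothetical step: build a report from the cleaned data."""
    return {
        "file_dep": ["clean.csv"],
        "targets": ["report.txt"],
        "actions": ["python report.py clean.csv report.txt"],
    }
```

Running `doit` in the same directory executes the tasks; because `clean.csv` is a target of one task and a file dependency of the other, the ordering is worked out for you, much like a Makefile rule.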

Testing Simple Modules In R And Python

Testing is an important part of writing reusable code and having confidence when deploying production code. How should we write tests? In many analytics applications, code is generally in the form of scripts, and there is rarely a need to completely “package” an application; especially when you are still prototyping code or in a “discovery” or “experiment” phase. In this post, I will look at how we can create simple templates in Python and R in order to be able to write and develop tests with confidence when we have simple scripts. [Read More]

Should We Teach

My colleague had an interesting question: “Why should we do showcases if no one will action our work?” Their point was fairly valid. We had set up a git repository with our work, sample code to get started on a variety of “difficult” problems which people across the organisation have generally been interested in. However, the consensus within the team was that no one outside of the team had actually cloned the repository; much less tried running any of the code. [Read More]

Simple Concurrency Workflow Using Go

I’ve been thinking about designing my own concurrency workflow (or at least thinking about what it should look like). There are several tools which already exist, which I would probably use in a production setting. As they say, what I can not build, I do not understand… The idea is that if we have shell scripts which run our code and are modular, then could we run our code concurrently? And once certain portions are done, could it then know to combine the outputs together so that it could continue in some other workflow? [Read More]

Graphql Graphene Nodes

Here are some random notes for myself based on my experience in GraphQL/Graphene. Graphene has two main parts: Queries: which are just for querying data Mutation: when you have a query which modifies data in some way Mutation To write a mutation there are a few components: Structure of the output Accepting input parameters Modifying data Sample mutation (taken from docs): import graphene class Person(graphene.ObjectType): # this is the object which is called in the line # `` name = graphene. [Read More]

Understanding The Conditions For Ai

We hear about the amazing things done by various AI software, for example Watson. At first glance it would appear that things like Watson will replace humans very quickly, but how exactly are the metrics for Watson or other frameworks measured? Some of the biggest and most important algorithms are centered around correlation. These are extremely powerful models which pop up in many situations, such as recommender systems, or how the medical portion of Watson works; whereby it “intelligently” (via determining correlations) sees the interaction among different kinds of medicines or diagnoses which describe a patient’s state. [Read More]

Naive Ways For Automatic Labelling Of Topic Models

Trying to decipher LDA topics is hard. In this post I propose an extremely naïve way of labelling topics which was inspired by the (unsurprisingly) named paper Automatic Labelling of Topic Models. The gist of the approach is that we can use web search in an information retrieval sense to improve the topic labelling of our LDA model. Algorithm (Extracting Relevant Information): Input: top keywords for an LDA topic for each search result in Wikipedia for each keyword: if any other keyword appears in article result summary: yield article header end if end for human topic = keywords(list of article headers) return human topic Of course this can be extended in many ways. [Read More]
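The pseudocode above can be sketched in Python roughly as follows. This is only a minimal sketch under my own assumptions: `search` is a hypothetical callable standing in for a Wikipedia search (it is not a real API), each result is assumed to be a dict with `title` and `summary` keys, and the final "keywords of the headers" step is left out.

```python
from typing import Callable, Iterable, Iterator


def candidate_labels(
    keywords: list[str],
    search: Callable[[str], Iterable[dict]],
) -> Iterator[str]:
    """Yield article titles whose summary mentions another topic keyword.

    `search` is a hypothetical callable: it takes a keyword and returns
    results shaped like {"title": ..., "summary": ...}.
    """
    for keyword in keywords:
        others = [k.lower() for k in keywords if k != keyword]
        for result in search(keyword):
            summary = result["summary"].lower()
            if any(other in summary for other in others):
                yield result["title"]


# Toy stand-in for a Wikipedia search, for illustration only.
FAKE_INDEX = {
    "pitch": [
        {"title": "Baseball", "summary": "A bat-and-ball game with a pitch and an inning."},
        {"title": "Pitch (music)", "summary": "A perceptual property of sounds."},
    ],
    "inning": [
        {"title": "Cricket", "summary": "Each inning ends when ten batters are out."},
    ],
}


def fake_search(keyword: str) -> list[dict]:
    return FAKE_INDEX.get(keyword, [])


labels = list(candidate_labels(["pitch", "inning"], fake_search))
```

Passing the search function in as an argument keeps the labelling logic testable without any network access.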

Naive Ways Of Parallelising Gradient Descent

It is easy to see that stochastic gradient descent is somewhat an “online learning” algorithm. It can be computed in a minibatch manner that can be parallelised. If we take this idea to the extreme, we actually have an algorithm called the “Hogwild!” algorithm. The gist of this algorithm is: Loop For some observation, with current solution "x": grad_v <- calculate gradient for only the "v"th column update "x_v" <- "x_v" - learning_rate * grad_v end end For a sufficiently large dataset with many columns, the chance of collision is small, and the worst-case scenario that the algorithm updates twice is rather minimal. [Read More]
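To make the loop above concrete, here is a small sketch of lock-free, per-column updates on a toy least-squares problem. The data, `TRUE_W`, the thread count and the learning rate are all made up for illustration, and because of Python's GIL this only shows the shape of the algorithm rather than any real speed-up.

```python
import random
import threading

TRUE_W = [1.0, -2.0, 0.5, 3.0, -1.5]  # hypothetical "true" coefficients
DIM = len(TRUE_W)

rng = random.Random(0)
DATA = []
for _ in range(200):
    row = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    target = sum(a * w for a, w in zip(row, TRUE_W))
    DATA.append((row, target))

x = [0.0] * DIM  # shared solution, updated by every thread without locks


def worker(seed: int, steps: int, learning_rate: float) -> None:
    local_rng = random.Random(seed)
    for _ in range(steps):
        row, target = local_rng.choice(DATA)   # some observation
        v = local_rng.randrange(DIM)           # only the v-th column
        residual = sum(a * xi for a, xi in zip(row, x)) - target
        grad_v = residual * row[v]             # d/dx_v of 0.5 * residual^2
        x[v] -= learning_rate * grad_v         # unsynchronised write


threads = [
    threading.Thread(target=worker, args=(seed, 2000, 0.1)) for seed in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

max_error = max(abs(xi - wi) for xi, wi in zip(x, TRUE_W))
```

Since each update touches a single coordinate, two threads only collide when they pick the same column at the same moment, which is the situation the excerpt argues is rare.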

Considerations When Extending Online Lda With Decay

Gensim has a lovely online LDA module which works extremely well. One of the challenges when placing it in a modelling environment is the assumption that the topics shouldn’t change too much over time. This assumption when you have a massive dataset is more than reasonable, however sometimes when you are just starting to build the model out you don’t really have “good” data, hence the need to retrain and refit the dictionary. [Read More]

Things I Learnt This Month

I’ve always tried to regularly string together something interesting that I’ve picked up. This month, however, was a bit of a strange month; the many things that I’ve picked up weren’t necessarily difficult but in some ways obvious to people who might have been a bit more observant (or should have been obvious to me earlier!) Neural Networks This month two insights which I already knew, but never managed to piece together, finally clicked: [Read More]

Quick Models In Keras

Keras is an interesting framework which allows one to easily define and train neural networks. After a long time “avoiding” deep learning libraries, I have finally taken a dive using Keras. Here are some notes and examples for getting started! Multinomial Regression The cornerstone of any neural network is the last layer plus activation function. If we use softmax as the activation function, we will essentially have a softmax regression problem (multinomial regression). [Read More]

Teaching Is Hard

At some stage of your career someone will ask “Well, how did you learn this?”. One of the difficulties I found with answering it is that when we think about how we solved something retrospectively we generally see it as a linear experience; however that is very often far from the truth. Very few people like the idea of learning slowly, and articles like “Teach Yourself Programming in Ten Years” are generally quite rare. [Read More]

Teaching Machines To Think Pre Release

Teaching Machines to Think is a series which I am intending on starting. The focus of the series is on the engineering considerations made when we train machine learning and artificial intelligence algorithms. We can create a Naive Linear Regression Solver: Guess m and b. Make another guess on the incremental change for the best observed m and b. Compare the proposed improvement with our best m and b. [Read More]
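The three steps of the naive solver can be sketched as a random hill climb. This is my own minimal reading of the steps, not the code from the full post: the toy data, the Gaussian step size and the squared-error comparison are all assumptions.

```python
import random

XS = [-2.0, -1.0, 0.0, 1.0, 2.0]
YS = [2.0 * x + 1.0 for x in XS]  # toy data generated from y = 2x + 1


def loss(m: float, b: float) -> float:
    """Mean squared error of the line y = m*x + b on the toy data."""
    return sum((m * x + b - y) ** 2 for x, y in zip(XS, YS)) / len(XS)


def naive_solver(steps: int = 5000, step_size: float = 0.1, seed: int = 0):
    rng = random.Random(seed)
    best_m, best_b = 0.0, 0.0            # step 1: guess m and b
    best_loss = loss(best_m, best_b)
    for _ in range(steps):
        # step 2: guess an incremental change from the best m and b so far
        m = best_m + rng.gauss(0.0, step_size)
        b = best_b + rng.gauss(0.0, step_size)
        # step 3: keep the proposal only if it beats the current best
        candidate_loss = loss(m, b)
        if candidate_loss < best_loss:
            best_m, best_b, best_loss = m, b, candidate_loss
    return best_m, best_b, best_loss


m, b, final_loss = naive_solver()
```

No gradients are used anywhere, which is what makes it "naive": the only engineering decisions are how big a step to propose and how to compare two candidates.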

Pydoc Thoughts

If you Google documentation with Python, there are a lot of posts and articles on how you can use sphinx with automatic documentation generation. These work well, however for a simple script pydoc is more than enough for the task.

pydoc script > script.man

Or for html files (which seem a bit more ugly):

pydoc -w script

There is certainly an element of elegance to the minimalistic man pages.

As this mimics the docstrings used within Python, in some ways it is cleaner than literate programming approaches such as Pycco.
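For reference, this is the kind of script that works well with the commands above. The file name `stats_helpers.py` and its contents are made up for illustration; pydoc simply renders whatever module and function docstrings it finds.

```python
"""stats_helpers: a tiny script whose docstrings pydoc can render."""


def mean(values):
    """Return the arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


if __name__ == "__main__":
    print(mean([1, 2, 3]))
```

Saved as `stats_helpers.py`, running `pydoc stats_helpers` in the same directory would show the module docstring followed by the documented function.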

Paper Writing

Here is the spark notes edition on the scheduling for OMSCS course cs6460, which is quite fast-paced and something I would like to imitate in the future. Week 1-4 The first week, you got familiar with the overall landscape. The second week, you zoomed into a particular area of that landscape. The third week, you zoomed in further to a particular problem or question in that area. This week, focus on deciding what you want to do to contribute to the area of this problem or question. [Read More]

Reflections On Omscs

It is all over, I have finally finished my last assessment for the OMSCS program. Some of the common questions I got were: Was this worth it? How difficult is the program? Why this program? Would you recommend this program to someone else? Overall if I was to rate my experience, I have to say that OMSCS is a bit of a mixed bag. It is difficult to compare with another course or program since this is my first program which I worked through whilst working full-time. [Read More]

Imagine This

**Here are some of my notes from the “speaking to inform” series on selecting speech topics. This is not my speech, but rather an altered version to make it more suitable for reading.** You’ve been thinking about a speech topic for a long time, you’ve started writing up your speech, maybe it’s already all written up - word for word. Now what? Ladies and gentlemen and fellow toastmasters, tonight I’m going to talk about something that isn’t necessarily my strong suit, but something that is really important - what exactly is my thought process when transferring something that’s written, to something that’s spoken. [Read More]

Spark Custom User Aggregated Functions

Not much to really say here. This was a question which was asked from OMSCS about creating your own user defined aggregation function, so here it is. It is clear there are influences around the notion of a monoid. initialize is essentially your identity. merge is where you put the two objects together. and so on… Unfortunately this function is (poorly?) documented, though perusing Google suggests that this will be fixed in Spark 2. [Read More]

On Spark Things Which Should Have Cookbooks

There are some things which to me are completely inexplicable in the world of Spark. One of them is finding help for things which really should be recipes. What makes it worse is that sometimes when you find the question asked on Stackoverflow, it is marked as a duplicate, despite the question not being answered in the other thread. For me it is this one: How to access the values of denseVector in PySpark [Read More]

Extending Sparklyr

Okay, so you’ve looked at sparkhello; now what? How can I extend and make use of some scala code! Simple example: def addOne(df:DataFrame) : DataFrame = { val colname: String = "test" df.withColumn("test1", df("test") + 1) } How do we add this in R? #' @import sparklyr #' @export spark_addOne <- function(df) { sdf_register( sparklyr::invoke_static(spark_connection(df), "SparkHello.HelloWorld", "addOne", spark_dataframe(df)) ) } Important to enforce the spark_dataframe object, what is interesting is that we don’t need to specify “sc” as it is given based on spark_connection [Read More]

Random Notes On Sparklyr

I recently started playing with sparklyr and have found it an amazing package. Here are some notes from reading the source code and the documentation: sparklyr::compile_package_jars This method compiles the jars, but for whatever reason basically disobeys every single default installation path! How can we get this working? Reading the documentation didn’t help me either. It claimed that the compilers had to be in the following paths: /opt/scala /opt/local/scala /usr/local/scala ~/scala (Windows-only) Finally realised it means that the location of scalac should be: [Read More]

Exporting Parquet Files Via Drill

A problem I’ve been struggling with is this: How can I export/import a parquet file without Spark? There are various efforts in the form of Apache Arrow, and a Python Parquet package which allows you to read, but how might one export the data? Enter Apache Drill This isn’t exactly a new approach, but rather I thought I’ll document what I did to get it working locally as there aren’t any really clear instructions online! [Read More]

Programming In Big Data Land

This post is more of a stream of consciousness about big data technology and the things I do every day. Big data is great; we all talk about it, and when you run fancy tutorials there are nicely configured virtual machines with point and click (also with GUIs!) to allow you to get acquainted with big data. However on the practical side, companies don’t often have nice vendor solutions for you, and often you find yourself staring at a blank terminal screen… [Read More]

Apache Tika With Spark

Since there isn’t really any tutorial on this (possibly because it’s too simplistic) here is some starter code for working with automatic document conversion. Apache Tika will automatically guess file formats based on the MIME type. This is well documented on the Tika site, and easy to use: val content = new BodyContentHandler val parser = new AutoDetectParser val metadata = new Metadata val stream = new BufferedInputStream(new FileInputStream(path)) parser.parse(stream, content, metadata) If we are after plain text from the body content, then we simply can do: [Read More]

Four Schools Of Ai

I really like this diagram, so I’ll keep it here for future reference. This is one way of thinking about AI approaches. The layman may often only associate “tangible” things with AI. That is the bottom two quadrants. This could be the form of self-driving cars (agents that act optimally), or robots (agents that act like humans). At the same time we have to consider the “invisible” forces, for example face recognition software, or photo recognition which classifies faces, or locations given a set of images. [Read More]

Generalising Order Statistics

Order statistics problems are very common in 2nd year statistics courses. Often they come in the following form: Suppose \(X_1, X_2, X_3, …, X_k\) are iid from some distribution \(F\). What is the pdf of \(Y = \max(X_1, X_2, X_3, …, X_k) \)? This problem is generally solved as follows: $$ P(Y \leq x) = P(X_1 \leq x, …, X_k \leq x) = F(x)^k $$ How can we generalise this? [Read More]
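To finish the calculation for the maximum of \(k\) iid variables, independence lets the joint probability factor, and differentiating the cdf gives the pdf. Here \(f = F'\) is the density of each variable, a symbol I am introducing for this sketch:

$$ F_Y(x) = P(Y \leq x) = \prod_{i=1}^{k} P(X_i \leq x) = F(x)^k, \qquad f_Y(x) = \frac{d}{dx} F(x)^k = k \, F(x)^{k-1} f(x) $$

As a quick sanity check, for the Uniform(0, 1) distribution \(F(x) = x\) and \(f(x) = 1\) on \([0, 1]\), so \(f_Y(x) = k x^{k-1}\), which integrates to \(1\) over \([0, 1]\) as a density should.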

Playing With Ocr Using Tesseract

Recently I began playing around with OCR using tesseract at work. Getting it to work proved to be a pain, since there were no administration rights when installing the software. A brief outline of the approach is highlighted here. pypdfocr was the package which was used to convert PDFs (or even non-pdf files could be used here). Based on the documentation, the external requirements that were used were: Portable version of Tesseract (download the “tesseract-XXX-win32-portable. [Read More]

Sphinx Autodoc Is Needlessly Complicated

Sphinx and automatic documentation; where do I even begin! I would hope that a solution similar to roxygen2 in R would exist; where I would simply write all the docstrings and run devtools::document() and it would all be done; however that is not the case for Python. Here are the instructions which I had to source from two (yes two!) locations in order to make sense of it. Perhaps you shouldn’t be generating all documentation automatically, however due to the nature of solo projects I think the reality is you want to depend on as much automation as possible and in that sense automatic documentation is the only sane way to do it. [Read More]

Introducing Binst

Today I am releasing binst, which is an “optimal binning” package using supervised and unsupervised methods including: kmeans entropy decision trees Motivations This package was firstly spurred by smbinning which to me seemed to be very confusing to use. This was what spurred the decision tree method in this package. Although this package “worked”, I had trouble from an interoperability perspective to apply it on H2OFrames. For example I wished to perform something as simple as: [Read More]

Regression Trees

What about regression trees? There are implementations in SciPy and different R libraries (for example rpart), but how do they actually work intuitively compared with decision trees? In general the implementation is based on the famous CART (classification and regression trees) and works by using recursive partitions to separate your continuous response. Once the stopping criterion is reached it will use local regression techniques to finally predict your answer. Essentially they are created through piece-wise linear models. [Read More]

Relearning Things Again

When I’m revisiting computer science and thinking about Georgia Tech’s AI and ML courses one question which pops up is: Why are you looking at this again? Yet if you were to ask me truthfully; especially for these two courses I would say without a doubt I do not regret doing them. Why do I feel this way? In some way it is a reflection of the education system, where we constantly re-learn ideas over and over again. [Read More]

Writing Angular App The Convoluted Way

I’ve been playing around with angular as part of my Health of Informatics at Georgia Tech and as part of an experiment I decided to try to write and “deploy” an app using only github pages. You can see my attempt here. The restriction in this scenario was that I chose not to “preview” or “check” that any changes actually worked before pushing to the repository, so here you can see all the changes which I made and all the history of what I was doing. [Read More]

You Are Not Average

The chances are if you are reading this, you are not average. You are probably above average. According to Time the average American reads 19 minutes a day! I have no reason to think that this is any different for any other western country. Even discounting that, if you are reading this, you probably have a degree and have an interest in computer science or analytics, since those are the topics I will mostly write about. [Read More]

Veblen Goods

The new Veblen Profession Veblen goods generally refers to commodities where demand increases as price increases. This occurs when the increase in price is linked to a good being exclusive and hence prestigious. However I would argue this extends even to education. Having a South-East Asian ethnic background means that there are two main occupations which are highly valued: Medical Doctor Lawyer This prestige around the world is highlighted by the sheer difficulty in completing and earning their respective titles. [Read More]

Better Than You

**Here are some of my notes for International Speech Contest at Toastmasters. This is not my speech, but rather an altered version to make it more suitable for reading.** Being Australian born Chinese generally would mean that you would hear things like “It’s great you got 90 in your last exam, but Jackie got 95! Why can’t you be like Jackie?” Luckily for me, those comparisons quickly disappeared, especially between my younger brother and me. [Read More]

Implementing Simple Melt Function For Pyspark

With the introduction of the pivot function within Spark 1.6.0, I thought I’d give implementing a simple version of melt a go. Currently it isn’t as flexible as the reshape2 library within R but it already does a pretty good job, following the same approach as the reshape library. The essential idea behind the code is using flatMap functionality on DataFrame objects to emit multiple rows (observations) for each row in the data frame and remapping the resulting values. [Read More]
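The flatMap idea can be illustrated without Spark at all. The sketch below uses plain Python dicts as rows and `itertools.chain.from_iterable` as a stand-in for `flatMap`; the function name, the column names and the output keys `variable` and `value` are my own choices, not taken from the full post.

```python
from itertools import chain


def melt(rows, id_vars, value_vars):
    """Emit one record per value column per row (wide to long)."""

    def explode(row):
        ids = {key: row[key] for key in id_vars}
        return [{**ids, "variable": var, "value": row[var]} for var in value_vars]

    # chain.from_iterable plays the role of flatMap here.
    return list(chain.from_iterable(explode(row) for row in rows))


wide = [
    {"id": 1, "height": 180, "weight": 75},
    {"id": 2, "height": 165, "weight": 60},
]
long = melt(wide, id_vars=["id"], value_vars=["height", "weight"])
```

Each input row yields `len(value_vars)` output rows, which is exactly the "multiple observations per row" behaviour described above.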

Simple Boosting Algorithm

We can think of boosting as some kind of weighted sample. Essentially we build models and have higher weights assigned to the observations which we score incorrectly. Our simple boosting algorithm is as follows: Build a model, on a weighted sample of points See which ones we score incorrectly/correctly and assign a penalty of 110%/90% Repeat 1. Then we can average out the predictions to get the final score. [Read More]
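Here is a rough sketch of that loop. The weak learner (a one-dimensional threshold "stump") and the toy data are assumptions of mine, and for brevity the weights are used directly in the error rather than by drawing a weighted sample.

```python
def fit_stump(xs, ys, weights):
    """Pick the threshold t minimising the weighted error of 'predict 1 if x > t'."""
    best_t, best_err = None, float("inf")
    for t in sorted(set(xs)):
        err = sum(w for x, y, w in zip(xs, ys, weights) if (x > t) != (y == 1))
        if err < best_err:
            best_t, best_err = t, err
    return best_t


def boost(xs, ys, rounds=5):
    weights = [1.0 / len(xs)] * len(xs)
    stumps = []
    for _ in range(rounds):
        t = fit_stump(xs, ys, weights)
        stumps.append(t)
        # 110% penalty for misses, 90% for hits, then renormalise.
        weights = [
            w * (1.1 if (x > t) != (y == 1) else 0.9)
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return stumps


def predict(stumps, x):
    """Average the votes of all stumps to get a score in [0, 1]."""
    return sum(1.0 if x > t else 0.0 for t in stumps) / len(stumps)


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
stumps = boost(xs, ys)
```

On this separable toy data every round picks the same perfect threshold; the reweighting only starts to matter once some points are scored incorrectly.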

Thinking About The Design Of Interactive Books

Recently I’ve been thinking about the design of interactive books. If you check out ideas shown by Bret Victor, you can see some of the ideas he has showcased with regards to interactive documents. I think the challenges of creating interactive books stem not only from the technical challenge, but also from making them interesting even when they are in a hardcopy format. I think the move from newspaper to digital news is one interesting example which is worth considering and comparing. [Read More]

Being Wrong

**Here are some of my notes from the “Speaking to Inform” from the advance competent speaker series. This is not my speech, but rather an altered version to make it more suitable for reading.** Has anyone ever said to you “Well that’s your opinion”? What if the context isn’t really, say, a value judgement, but something which ought to be rooted in scientific fact? On one hand we have a group which perhaps has entirely eroded trust in science - namely anything about diet and fitness; ideas like superfoods, eating chocolate, and the notion that sitting is killing you. [Read More]

Playing With Tensorflow

To play around with TensorFlow I used vagrant. The first thing that happened was a bug due to an older version of virtualbox; which didn’t solve itself as I needed to upgrade my vagrant installation… 30 minutes later, vagrant up finally works, and another 30 minutes later everything is ready and installed. We can then login via ssh: Host: 127.0.0.1 Port: 2222 Username: vagrant Private key: C:/Users/XXXXX/.vagrant.d/insecure_private_key Now you can run through the tensorflow examples! [Read More]

Understanding Scene Completion

I have always been interested in the Scene Completion using Millions of Photographs paper and I was fortunate enough to learn enough computational photography to be able to replicate it in a rather naive way during my final project within my Georgia Tech course. Pipeline What exactly does “scene completion” try to accomplish? It can be broken down into the following three points: Select a suitable scene with the hole boundary from the original image. [Read More]

Simple Syntax Highlighting Using Nltk

Programming and coding is usually done with some kind of syntax highlighting, to make it easier to read and reason about a program. It helps determine where we might have a number or string in our SQL query, or determine where the start and end of a function code block is. Then why doesn’t one of these exist for say essay writing? Is it actually difficult to build a syntax highlighter for the English language? [Read More]

Reflections Of Four Years After University

It’s the end of the year, and like many places it is time to complete development plans and aspirations (for work!). For myself, I already have a rough guide of what I wish to achieve in the coming year, however for my colleagues who have just finished university it’s a crazy world out there. It feels like you know so much and at the same time know so little. Four years after finishing my undergraduate studies I think it’s time to reflect on the books, courses and principles that have brought me to where I am today. [Read More]

Build A Supervised Learning Bootcamp

Following my previous post, I had the idea to pull together a series of modules which demonstrate how you could build supervised models from the bottom up. Through this you would have a strong appreciation of the underlying processes and models used in supervised learning. I think this is also a really good time to learn a bit of Julia. abstract Learner type LinearModel <: Learner coef function LinearModel(X, y) this = new() this. [Read More]

Thinking About Teaching Data Science

Recently someone asked me, If you had a three day data science bootcamp, what would be in it? Placed on the spot I really didn’t have any idea what could be covered in three days. Now that I have had enough time to reflect on this problem, here is my proposed outline: Day 1 - Supervised Learning The assumption would be that anyone participating would have working knowledge of linear regression, and also model training, validation and testing. [Read More]

Standalone Spark With Python And R On Windows

Trying to figure out how to install Spark on Windows is a bit of a pain so here are some basic instructions: Download any of the prebuilt Hadoop distributions for Spark. For me downloading it for Hadoop version 2.6 worked perfectly. Download the winutils for Hadoop. Google should bring up several results. At the time of writing this one worked fine. Now we have to make sure all our environment variables are set up correctly. [Read More]

Blending Images Using Python And Opencv

I’m currently taking the Georgia Tech course on computational photography and I thought I’d give some of the lecture materials a go, more specifically the sections on blending images. In brief, if we convert all the image channels (RGB) to be in the range between \(0\) and \(1\), we can then apply the following blending approaches to images \(a\) and \(b\). Divide the two images (brightens) Addition (brightens, but adds too many whites) Subtract (darkens, but too many blacks) Darken, \(f(a,b) = min(a,b)\) Lighten, \(f(a,b) = max(a,b)\) Multiply (darkens) Screen \(f(a,b) = 1-(1-a)(1-b)\) Overlay \(f(a,b) = 2ab\), if \(a < 0. [Read More]
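A few of the blend modes listed above are simple enough to sketch per pixel. To keep this self-contained I use nested lists of floats in \([0, 1]\) as a single-channel "image" rather than OpenCV arrays, and the helper name `blend` is my own; overlay is left out since its formula is cut off above.

```python
def darken(a, b):
    return min(a, b)


def lighten(a, b):
    return max(a, b)


def multiply(a, b):
    return a * b


def screen(a, b):
    return 1 - (1 - a) * (1 - b)


def blend(img_a, img_b, mode):
    """Apply a per-pixel blend function to two equally sized images."""
    return [
        [mode(pa, pb) for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


image_a = [[0.2, 0.8]]
image_b = [[0.5, 0.5]]
darkened = blend(image_a, image_b, darken)
lightened = blend(image_a, image_b, lighten)
```

With real images the same formulas would typically be applied to whole arrays at once after scaling the 8-bit channels down to the \([0, 1]\) range.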

Genetic Algorithms Parallelism In R

The easiest way to get some performance gain when using genetic algorithms in R (using the library GA and on Windows) is to set parallel to snow. This is a documented method, but not exactly clear. library(GA) Rastrigin <- function(x1, x2) { Sys.sleep(10) 20 + x1^2 + x2^2 - 10*(cos(2*pi*x1) + cos(2*pi*x2)) } system.time(GA4 <- ga(type = "real-valued", fitness = function(x) -Rastrigin(x[1], x[2]), min = c(-5.12, -5.12), max = c(5.12, 5. [Read More]

Educationals Impromptu Speaking And Other Thoughts

Starting a new speaking club means redoing educationals. Since the club itself is a corporate club, there would be limited time to accomplish prepared speeches and limited experience to complete roles like evaluators. The first educational was done on impromptu speaking. I focused on the key idea that building public speaking skills revolves around not only increasing our confidence, but also “faking” our confidence. The key ideas behind starting table topics are three simple points: [Read More]

Selecting A Speech Topic

**Here are some of my notes from the “better speakers” series on selecting speech topics. This manual can be freely downloaded online from the Toastmasters international shop. This is not my speech, but rather an altered version to make it more suitable for reading.** If I was to ask any member here “would you like to do a speech next meeting” there are two possible answers Yes! I have a burning topic I want to talk about No! [Read More]

On Implementing Algorithms

Recently I have had time to implement a variation of the RIDIT score in R. The variation used was in the form $$ B_i = \sum_{j < i} P_j - \sum_{j > i} P_j $$ Where \(B_i\) is the (transformed) RIDIT score for the value \(i\) and \(P_j\) is the probability of observing the value \(j\). This RIDIT score was proposed by Bross and used for Fraud Detection. The desirable features of this score are that it is in the range \([-1, 1]\), with a median value of \(0\). [Read More]
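The formula for \(B_i\) translates almost directly into code. The post's implementation is in R; this is only a small Python sketch of the same formula, with a function name and example probabilities that I made up.

```python
def ridit_scores(probabilities):
    """Return B_i = sum_{j < i} P_j - sum_{j > i} P_j for ordered categories."""
    total = sum(probabilities)
    scores = []
    below = 0.0  # running sum of P_j for j < i
    for p in probabilities:
        above = total - below - p  # sum of P_j for j > i
        scores.append(below - above)
        below += p
    return scores


probabilities = [0.2, 0.3, 0.5]
scores = ridit_scores(probabilities)
```

For these probabilities the scores come out at roughly -0.8, -0.3 and 0.5, all inside \([-1, 1]\), and the probability-weighted average of the scores is zero.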

How Long Does It Take To Setup Intellij With Scala

How long does it take, if you only have half an idea of what you’re doing? After a few days of trying and somewhat succeeding, I decided to try installing everything from scratch on a new install of Windows. It took me almost 30 minutes to get Scala with IntelliJ 14 working. Granted if you had a better idea of what to do, you probably could cut some time off it (and if you were more attentive than me to the computer screen), but nevertheless it shows it takes a while for things to get going; not that I actually understand the cryptic warning messages splashed over my screen! [Read More]

The Rule Of Three

or what in the world is \(-log(0.05)\) If you type into a calculator \(-log(0.05)\), you will realise this number is approximately three. This is basically where the rule of three within statistics comes from. This rule comes from the binomial proportion confidence interval, where we may want to calculate the confidence interval around the parameter \(p\). However in the case where \(\hat{p} = 0 \), what can we do? Since we have 0 actual cases, we then know the probability of this occurring is: [Read More]
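For a quick numerical check: if we see zero events in \(n\) independent trials, the largest \(p\) still consistent with that at the 5% level satisfies \((1 - p)^n = 0.05\), so \(p = 1 - 0.05^{1/n} \approx -\ln(0.05)/n \approx 3/n\). The sketch below compares the exact bound with the rule of three; the function names are my own, and \(log\) is taken to be the natural logarithm.

```python
import math


def exact_upper_bound(n, alpha=0.05):
    """Largest p with (1 - p)^n >= alpha, i.e. zero events is still plausible."""
    return 1.0 - alpha ** (1.0 / n)


def rule_of_three(n):
    """Approximate 95% upper bound on p after n trials with zero events."""
    return 3.0 / n


minus_log = -math.log(0.05)
exact = exact_upper_bound(100)
approx = rule_of_three(100)
```

The approximation is slightly conservative (it sits just above the exact bound) and the gap shrinks as \(n\) grows.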

What Would I Tell My 18 Year Old Self

For the last year, there has been constant talk about fee deregulation within Australia. This has made me think about what precisely I would have advised myself, if I was finishing up high school this year. Expert opinion suggests that fees would rise in imitation of the US college system. In this thought experiment I will divide it into three broad areas: Global Education Trends University Experience Practical Thoughts. But firstly, what is the goal of post-secondary school education? [Read More]

Binary Classification With Pca

One question which puzzled me was how can PCA be used in a classification sense? Well here is an approach which is used in an unsupervised setting based on my reading on PRIDIT modelling. Basically you approach PCA from a factor analysis perspective, providing ranks on your variables. Then you can segment your scores in the normal way and group them as your classification. In general it has been found that this approach has worse accuracy than other approaches (unsurprising since this is an unsupervised technique), however it performs better in the situation where you have unbalanced data and the statistical power is highly important. [Read More]

Asking The Right Question

I’ve spent the last two days being acquainted with stack exchange. One of the things I have discovered is the sheer inadequacy in the quality of the questions. Many of the questions simply can not be answered in a sane manner. This comes as a shock to me simply because my previous casual usage (through landing via Google) has been so positive. It is clear that underneath all that there is a lot of rubbish which we may not see. [Read More]

Getting Started With Elm

elm is an interesting language which compiles to JavaScript. It has many advanced features (which I am definitely not across). I was trying to integrate elm into Cordova, so here are some of my notes on how to pull together an example webpage. Running an example Consider the simple markdown example: import Markdown main = Markdown.toElement """ # Welcome to Elm This [markdown](http://daringfireball.net/projects/markdown/) parser lives in the [elm-markdown][] package, allowing you to generate `Element` and `Html` blocks. [Read More]

Comparison Of Ngram Fuzzy Matching Approaches

String fuzzy matching to me has always been a rather curious part of text mining. There are the canonical and intuitive Hamming and Levenshtein distances, which consider the difference between two sequences of characters, but there is also a less commonly heard of approach, the n-gram approach. Within text mining, n-grams can be addressed at a word or character level; in this post we shall only consider the character representation for purposes of approximate matching. [Read More]
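As a small illustration of character n-grams for approximate matching, the sketch below splits each string into overlapping character bigrams and compares the two sets. Jaccard similarity is my own choice of measure here, picked for simplicity, and is not necessarily the one compared in the full post.

```python
def char_ngrams(text, n=2):
    """Return the set of overlapping character n-grams in `text`."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a, b, n=2):
    """Share of n-grams the two strings have in common (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two strings too short to have any n-grams
    return len(grams_a & grams_b) / len(grams_a | grams_b)


similarity = jaccard_similarity("night", "nacht")
```

Here "night" and "nacht" each produce four bigrams and share only "ht", giving one shared bigram out of seven distinct ones.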

Spark And Aggregate

Getting started with Spark The easiest way to play with Spark is to check out the Developer Resources over at databricks; it was the only Windows solution that worked “out-of-the-box” without installing a VM. A quick look at Spark Many of the normal functions in Spark are similar to what you expect. We have our typical map and reduce functions, but Spark introduces an aggregate function. How does it work? Consider the example below, which calculates the sum of all the values, and keeps track of how many values have been processed: [Read More]
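To show what the two functions passed to aggregate are doing, here is a pure Python emulation that needs no Spark install. The `aggregate` helper and the list-of-lists "partitions" are stand-ins of my own; in PySpark the equivalent call would be `rdd.aggregate((0, 0), seq_op, comb_op)`.

```python
from functools import reduce


def aggregate(partitions, zero, seq_op, comb_op):
    """Fold each partition with seq_op, then merge the partial results with comb_op."""
    partials = [reduce(seq_op, partition, zero) for partition in partitions]
    return reduce(comb_op, partials, zero)


def seq_op(acc, value):
    # Within a partition: add the value to the sum and bump the count.
    return (acc[0] + value, acc[1] + 1)


def comb_op(left, right):
    # Across partitions: add the partial sums and the partial counts.
    return (left[0] + right[0], left[1] + right[1])


total, count = aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
```

The zero value `(0, 0)` has to be neutral for both functions, because it is used as the starting point inside every partition as well as for the final merge.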

The Relentless Ladder Of Success

Over the weekend I was asked this question by a family friend (friend of parents): “You and your brother are obviously very intelligent people, how do you relate to other people who aren’t as intelligent as you?” This was interesting for a number of reasons. In this family, if I recall correctly, the father was a medical doctor and they had a daughter (a few years younger than me) who was starting her journey in the medical profession - clearly another very intelligent family. [Read More]

The Friends You Have

This speech officially marks the end of my toastmaster competent communicator journey. It’s been quite a ride, and I’ve certainly improved a lot. Initially this speech was to be on “Lockhart’s Lament” but was changed when a friend of mine came to spectate. Here are some of my notes from a toastmasters CC8 speech I completed recently. This is not my speech, but rather an altered version to make it more suitable for reading. [Read More]

What In The World Is Advance Analytics

Advance analytics is a term that suddenly popped up in my world in the last few months. Since this is my area of expertise I thought I would investigate what exactly advance analytics is. What is analytics? It is clear to me that the notion of analytics consists of two parties; those who make use of tools and programming languages, and those who insist on making use of Excel, and powerpoint slides as the basis of their analysis. [Read More]

Getting Started With Scalaz

These are just some notes on my way to learning scalaz. Of course I got help from the actual tutorial…learning scalaz. The contents are based on my knowledge and as such are most likely wrong. Please message me any mistakes! Step 1 Assuming you have scala and sbt installed, to launch scalaz in your console simply create a file called build.sbt with the following lines (thanks stackoverflow): scalaVersion := "2.11.2" val scalazVersion = "7. [Read More]

Random Projections An Implementation In R

Random projections is an interesting concept. It focusses mainly on the result from the Johnson-Lindenstrauss lemma, which essentially says, give me an error, and I will give you a projection to a lower dimension in such a way that the distances between points are nearly preserved (with respect to the error you gave me). This is interesting and can be used in dimension reduction problems, in a similar way that PCA is used. [Read More]

Feature Selection With Genetic Algorithms

One way which you can use randomized optimization is to determine the best subset of features for your particular model. The idea is that we want to find the best vector of 0’s and 1’s which represent the selection of features. The setup of the problem is quite easy. Simply define the function which you want to maximize and wrap it around the optimization problem. For example: fitFun <- function(ind, x, y) { ind <- which(ind == 1) if (length(ind) < 2) return(0) out <- caret::train(x[, ind], y, method="rpart") caret:::getTrainPerf(out)[, "TrainAccuracy"] } In the function above, we have used caret::train to model the data of interest. [Read More]

Parsing Xml Using Ramble

What is interesting is when you actually start using things that you build beyond simply just toys. Recently I had a conversation with a colleague on why you shouldn’t use regex to parse XML data. The main points for why it should be permissible are mostly around practicality, and that our jobs might be ad-hoc. This begged the question: if we weren’t going to use a library and wanted to parse XML files, how would we do it, in R in particular? [Read More]

Probably Approximately Correct

There are some things that I don’t quite understand, but I thought that writing the high level information about it may help my intuition. Judging from the title it is the idea of being “Probably Approximately Correct” (PAC) in the realm of machine learning. Let’s not worry about the details, but simply look at a few conclusions we can draw: For consistent learners… We can create some equality on the number of training examples \(m\), and the allowed error \(\epsilon\), with some probability of failure at some level \(\delta\), with the size of our Hypothesis \(H\), we have the following: [Read More]

Automatic Modeller

How can you build an auto-modeller? Off the back of my first assignment (which we’ve been given permission to share with others) I thought I’ll begin to generalise some of my code so that I could apply a bunch of classifiers on a data set and produce some graphs and summaries to be able to hopefully make an informed decision on what action you would take next. The goal of such a script wouldn’t be to remove all human knowledge (domain knowledge in building orthogonal data sets will still beat applying models naively), but if it can even meet 50% of what a human can do, that is important information in estimating the amount of effort needed or even the feasibility of the problem on hand. [Read More]

Extending Abagail

I’m currently working through my second assignment for CS7641 at Georgia Tech and I thought I’ll record some of my personal notes and comments on the ABAGAIL library. Since I’m not primarily a Java developer, working on this code obviously took longer than it would for a seasoned Java developer. As such I wouldn’t be surprised if the methods and approaches I used here are: Not efficient Outright wrong! Nevertheless, here are my attempts to extend the functionality in ABAGAIL. [Read More]

Swimming For Success

Here are some of my notes from a toastmasters CC10 speech I completed recently. This is not my speech, but rather an altered version to make it more suitable for reading. This was the basis for my Toastmaster International Speech Contest Speech. When I was in school they gave out participation awards at swimming carnivals. I remember participating in the 50 metre butterfly event; one of the most exhausting swimming strokes. [Read More]

Why I Run

Why are you always so busy miss? Don’t we all have 24 hours in a day? That was a statement I made to my teacher when I was starting high school. The teacher in question was a single mother, always out of time, always frazzled. It never really occurred to me why this was the case. I had so much time, why is it so different for someone else? [Read More]

Value Driven Planning

Since embarking on the journey to pursue further studies, the most important aspect is currently planning. How can I possibly plan to accomplish? Should it be done at a daily, weekly or some other level? How can I gauge whether I’m on track or behind the times? I believe in principles, they’re far easier to remember and adhere to: Always remember to exercise, especially if your workout is as simple as going for a 15 minute run. [Read More]

Opening Thoughts On Omscs

I’ve now started my Masters of Computer Science with Georgia Tech (commonly shortened to OMSCS, being an online degree). I thought that it might be worthwhile to share some of my opening thoughts to the program. This is more for my benefit as I think it would be useful to gauge how my thoughts change over time the further into the program I get. 1. The program is hard The warning signs were there. [Read More]

Using Ipython Notebooks As A Form Of Learning

One of the more interesting aspects of using iPython is the ability to combine different languages into one notebook. Below is a session that I was doing as part of reddit’s dailyprogrammer challenges. As you can see below, notebooks are a great way to comment and display your code in an informative way to “show all working”. I believe that this model is best for when you first begin to learn how to code, especially since it is hard to “visualise” the progress of your code. [Read More]

Reflections Of Last Year

… we are the average of the five people we spend the most time with - Jim Rohn Walk with the wise and become wise, for a companion of fools suffers harm. - Proverbs 13:20 Often when I examine myself I think “Oh I haven’t changed too much, I’m very much the same person that I was in the previous year.” But on closer introspection I think I have changed far more than I anticipated. [Read More]

Introduction To Causal Inference

Causal inference is concerned with what would happen to an outcome if we hypothetically did something else to it (i.e. treatment). This idea is commonly used in medicine to determine the effect that medicine might have on a person. It is important to remember that you cannot substantiate causal claims from associations alone - behind any causal conclusion there must lie causal assumptions that are not testable in observational studies (i. [Read More]

Extending Ramble To Build A Programming Language

In a previous post, I looked at how I built Ramble and an example of building a calculator. Since then, I have changed how the parser works and I thought it would be a good opportunity to demonstrate something you could do as a weekend hack in Ramble. Here I will take you through how to write a parser for a subset of the PL/0 language. I would probably need to spend a little more time to get the whole language working. [Read More]

A Quick Attempt On Converting Ziffersystem To Lilypond

If you are to look at the collection of guitar music which my dad owns, you will find that a fair amount of it is handwritten using the ziffersystem. This is due to his very strong sense of relative pitch. The natural question is how might we convert this to a more traditional system? Indeed you can do it using Jianpu-ly (at least judging by the website, I haven’t actually tried it myself). [Read More]

Numbers Divisible By Three

Why is it that if a number is divisible by three, then the sum of its digits is also divisible by three? Consider that every non-negative integer \(x\) can be written as $$ x = \sum_{i \in Z} a_i 10^i $$ where \(Z\) is the set of non-negative integers, and \(a_i \in \{0, 1, \dots, 9\}\) for all \(i \in Z\). This can be rewritten as $$ x = \sum_{i \in Z} a_i (10^i-1+1) $$ [Read More]
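As a quick numerical sanity check of the claim (not a proof), the sketch below compares divisibility of a number with divisibility of its digit sum. The helper name `digit_sum` is just for illustration. It relies on the fact that \(10^i - 1\) is a string of nines, so each term \(a_i (10^i - 1)\) is a multiple of three.

```python
def digit_sum(x):
    """Sum of the decimal digits of an integer."""
    return sum(int(d) for d in str(abs(x)))


# x and its digit sum leave the same remainder when divided by three.
for x in range(10_000):
    assert x % 3 == digit_sum(x) % 3

print(digit_sum(123), 123 % 3)  # 6 0
```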

Building A Toy Parser In R

In this blog post I will go through the process of building a combinatory parser in R. This post reflects my intent of developing an R package called Ramble. Please note that at the time of writing Ramble was still a work in progress and may differ to what I will explore in this blog post. R is a functional What do I mean by this? In R functions are first-class, and higher-order. [Read More]

A Quick Look At Bagging

Continuing on the previous post, I thought we’ll have a quick look at bagging (bootstrap aggregating). We will have a look at using bagging for regression and classification. Firstly what is bootstrapping? Bootstrap Bootstrap is where we try to estimate something based on a sample. This is important when we don’t know the distribution of the variable, and we want to get a feel for the error which we may have for our estimate. [Read More]
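To make the bootstrap idea concrete, here is a minimal sketch using only the standard library. The function name `bootstrap_se`, the sample values and the number of resamples are all assumptions for illustration; the idea is simply to resample with replacement and look at how much the estimate moves around.

```python
import random
import statistics


def bootstrap_se(sample, stat=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard error of `stat` by resampling with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Each resample has the same size as the original sample.
    estimates = [stat(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # hypothetical observations
print(bootstrap_se(sample))
```

Bagging applies the same resampling step to model fitting: fit one model per resample, then average (regression) or vote (classification) over the fitted models.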

A Quick Look At Stacked Regression

The Netflix Prize competition has always been interesting to me, however I had never really taken the time to really think about the implications of blending models together. In this post I will take a look at the easiest case of blending models; looking at a linear model example. I will preface this by saying that my R code is not the most perfect, but I hope easy to follow. I have not found any blog posts looking at blending/ensemble modelling in the context of the linear models. [Read More]

Using Dimple

I’m currently looking at the Udacity course called Data Visualization and D3.js. Working through Lesson 2 involves recreating some visualisations which were “badly” made on the internet. So here are some of the ones which I’ve come up with using dimple.js. Firstly let’s consider the image showing the sales of various fast food restaurants: We can see that, although the graphics attempt to be to scale, it is very difficult to determine if the Taco Bell logo is indeed half the size of the Pizza Hut logo. [Read More]

First Attempt At Writing Choral Music For The Left Hand

Ostinato is my attempt at writing an ostinato for choral music. At the time of writing this blog post, I must say it was a lot harder than I anticipated for the following reasons: Limited support of any modules for music editing Personally, limited knowledge and patience for the existing modules! I used the abjad module, which, although it contains documentation, quite frankly might as well not have any. [Read More]

When The Ben Franklin Effect Goes Wrong

Anchoring Anchoring is a cognitive bias based on our tendency to rely too much on our first impression in how we may act, respond and make decisions. Anchoring has the ability to influence the answers we provide, whether at a conscious or subconscious level. This post isn’t aimed at how we might defeat our anchoring bias but more so on some observations that I’ve come across in my own life about how anchoring could be used to perhaps influence other people. [Read More]

A Look At Folding

Currently I’m sitting the FP101x from edX. A portion of the homework was on foldr and foldl (i.e. right fold and left fold). This got me thinking how I might think about these ideas. Let’s firstly compare the two. Prelude > foldr (-) 0 [1..4] -2 Prelude > foldl (-) 0 [1..4] -10 Why are they different? How do they work? foldl is probably the intuitive one. It simply “inserts” the function between the elements. [Read More]
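The two results above can be reproduced outside Haskell. The sketch below is a Python rendering (the names `foldl` and `foldr` are defined here for illustration, built on `functools.reduce`), showing that the only difference is the direction in which the brackets nest.

```python
from functools import reduce
import operator


def foldl(f, z, xs):
    """Left fold: f(f(f(z, x1), x2), x3) ..."""
    return reduce(f, xs, z)


def foldr(f, z, xs):
    """Right fold: f(x1, f(x2, f(x3, z))) ..."""
    # Walk the list from the right, keeping the accumulator as the second argument.
    return reduce(lambda acc, x: f(x, acc), reversed(xs), z)


xs = [1, 2, 3, 4]
print(foldr(operator.sub, 0, xs))  # -2,  i.e. 1 - (2 - (3 - (4 - 0)))
print(foldl(operator.sub, 0, xs))  # -10, i.e. (((0 - 1) - 2) - 3) - 4
```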

How To Debate

Typically in a competitive debate the goal might be to sway the audience. The team (or person) that has convinced more audience members is declared the winner. However, outside of a competition, the goal is generally quite simple: to change your opponent’s mind. How might we do this? Why would we want to ever do this? Iterated Prisoner’s Dilemma In game theory we have prisoner’s dilemma. From Wikipedia: If A and B both betray the other, each of them serves 2 years in prison If A betrays B but B remains silent, A will be set free and B will serve 3 years in prison (and vice versa) If A and B both remain silent, both of them will only serve 1 year in prison (on the lesser charge) In iterated prisoner’s dilemma, this game may be played more than once. [Read More]
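The payoffs quoted above are enough to simulate the iterated game. The following is a minimal sketch; the two strategies (`always_betray` and `tit_for_tat`) and the `play` helper are illustrative assumptions and are not taken from the post.

```python
# Years in prison for (A, B), keyed by each player's move, as quoted above.
PAYOFF = {
    ("betray", "betray"): (2, 2),
    ("betray", "silent"): (0, 3),
    ("silent", "betray"): (3, 0),
    ("silent", "silent"): (1, 1),
}


def always_betray(opponent_history):
    return "betray"


def tit_for_tat(opponent_history):
    # Start silent, then copy whatever the opponent did last round.
    return opponent_history[-1] if opponent_history else "silent"


def play(strategy_a, strategy_b, rounds):
    """Return total years served by A and B over the given number of rounds."""
    history_a, history_b = [], []
    years_a = years_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        ya, yb = PAYOFF[(move_a, move_b)]
        years_a += ya
        years_b += yb
        history_a.append(move_a)
        history_b.append(move_b)
    return years_a, years_b


print(play(tit_for_tat, always_betray, 5))  # (11, 8)
print(play(tit_for_tat, tit_for_tat, 5))    # (5, 5)
```

Two cooperating players serve far fewer total years than a pair where one always betrays, which is why repetition changes the incentives.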

Using J To Solve Euler Project

Today I thought I’ll have a quick look at J and how to use J to solve some Euler project problems. This post will go through one possible way to solve the first Euler project problem. If we list all the natural numbers below 10 that are multiples of 3 or 5, we get 3, 5, 6 and 9. The sum of these multiples is 23. Find the sum of all the multiples of 3 or 5 below 1000. [Read More]
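For comparison with whatever the J solution looks like, here is the same problem as a short Python sketch (Python is swapped in here; it is not the language of the post). The function name `sum_multiples` is illustrative.

```python
def sum_multiples(limit):
    """Sum of all natural numbers below `limit` that are multiples of 3 or 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


print(sum_multiples(10))  # 23, matching the worked example in the problem statement
```

Calling `sum_multiples(1000)` answers the actual question.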

The Minto Pyramid Method For Planning Presentations

Roughly a month ago I was introduced to the Minto pyramid principle for planning presentations. In this post I will show you what it is, not really distilling whether I personally like it, but rather just treating it as an additional tool to assist you with planning presentations. Structure Whether you’re presenting or writing, there is a general structure that we follow. It always is (with some variations) like this: [Read More]

Download Stuff From Reddit Using Java

Previously I’ve looked at how to download stuff from Reddit. So I thought I’ll re-apply the same code, but using Java. Here is the same code to do just that. import java.io.*; import java.net.URL; import java.nio.channels.Channels; import java.nio.channels.ReadableByteChannel; import java.nio.charset.Charset; import java.nio.file.*; import java.util.Arrays; import org.json.JSONArray; import org.json.JSONException; import org.json.JSONObject; public class getWallpaper { private static final String OPEN_SUBREDDIT = "http://www.reddit.com/r/%s/.json"; public static String readAll(Reader rd) throws IOException { StringBuilder sb = new StringBuilder(); int cp; while ((cp = rd. [Read More]

Developing On Android Adding Dialogs And Wrapping Up

This is a series of posts as I will hopefully build a simple wallpaper app. In this post, we will build a simple “guess the number game”. There will be an input area, in which you enter your guess, and a simple prompt to tell you whether your guess was too high or too low. <RelativeLayout xmlns:android="http://schemas.android.com/apk/res/android" xmlns:tools="http://schemas.android.com/tools" android:layout_width="match_parent" android:layout_height="match_parent" android:paddingLeft="@dimen/activity_horizontal_margin" android:paddingRight="@dimen/activity_horizontal_margin" android:paddingTop="@dimen/activity_vertical_margin" android:paddingBottom="@dimen/activity_vertical_margin" tools:context=".LowerHigher"> <TextView android:text="Enter a Guess. The random number will be between 1-10 inclusive. [Read More]

Developing On Android Using Buttons

This is a series of posts as I will hopefully build a simple wallpaper app. In this app, we will build a simple counter which counts the number of times a button in the app was pressed. Firstly we should design the layout of the app. It should have text which is the counter and a button. We give the button and the counter ids so that they can be referenced in our code. [Read More]

Developing On Android Hello World

This is a series of posts as I will hopefully build a simple wallpaper app. When creating a new application using Android Studio, you will immediately be given a “Hello World!” boilerplate. But how can you extend it? But firstly, let’s have a quick look at the different files which build up an Android app. res/layouts I like to think of layouts as your HTML code. It tells you where everything sits. [Read More]

Downloading Stuff From Reddit

Recently I’ve been playing around with changing wallpapers, but I realised it is just so much work curating images to use. So I thought one way was to use Reddit and rely on various subreddits as inspiration. As I got into it, I found that if the subreddit was relevant, I was just downloading everything. So why not automate this! Using Reddit’s API, where all we have to do is add . [Read More]

Rough Guide To Dcjs

dc.js is quite a nice library which allows you to filter across multiple graphs intuitively. Surprisingly it is not too difficult to develop on. If you have an understanding of .dimension and .group(), you’re basically ready to go and start creating beautiful interactive dashboards. Below I have a simple example which I will quickly describe, particularly the parts of the library which I really like. Exploring the Code .dimension allows you to easily define how you would like to “cut” the data. [Read More]

Adding Tooltips To Vega Visualisations

Recently I have been looking at a visualisation grammar called Vega. The main reason is the simplicity with generating graphs for the web when you come from other non-web areas such as Python. But why Vega? The visualisation grammar is something which I believe is easily accessible for people who don’t have knowledge in HTML or JavaScript. D3, although a wonderful and powerful library, is way too difficult for even the simplest of actions, whilst many of the other libraries seem to connect to D3, removing parts of its abstractions. [Read More]

Introducing Formdown

Announcing formdown. This is a simple python module which assists with generating pesky forms without all the html tags. This can be combined with datepicker. Example: form = """name=____ date: datepicker=_____""" print formdown.parse_form(form) Output <p><label>Name</label><input type="text" name="name" id="name" /></p> <p><label>Date</label><input type="text" name="date" id="date" class="datepicker"/></p> <p><input type="submit" value="Submit"></p> Note that since we have class="datepicker", we will need to change the sample on the homepage: $(function() { $( ".datepicker" ).datepicker(); }); Hopefully this can be helpful! [Read More]

Bad Arguments

Here are some of my notes from a toastmasters CC5 speech I completed recently. This is not my speech, but rather an altered version to make it more suitable for reading. Bad Arguments What do you want to do when you grow up? This is a question which you might have been asked many times throughout your childhood. But there are certain times when the answer actually does matter; like high school, or universities (which actually define who you are). [Read More]

Using Gtfs Sydney Buses Data With Dplyr

General Transit Feed Specification (GTFS) data is freely available for Sydney Buses. I thought I would go through and see how easy it is to query and create something meaningful. Reading the files in was rather trivial and required the work of the very common eval(parse()) combination: tables <- c("trips", "stops", "stop_times", "routes") for(tbl in tables) { eval(parse(text=paste0(tbl, " <- read.csv(unz('./sydney_buses_gtfs_static.zip','", tbl, ".txt'), row.names=NULL)") )) } My first attempt made use of plyr library, but I ended up finding that slightly messy. [Read More]

Python Forms With Tkinter And Py2Exe

Decided to quickly pull together a simple application which could add entries to a sqlite database. Some quick observations about tkinter: Entry : makes it really easy to pick up input fields. Label : feels extremely hacky. .grid() : must be one of the ugliest things I’ve had to work with, but it works. Packaging up the app into a single file (i.e. bundle_files : 1) required some small changes to the py2exe/build_exe. [Read More]

Geocoding Using Osm

Using API calls is probably the easiest way to geocode an address. Here is some sample R code which demonstrates how this can be done using two different sites: library(rjson) address = "321 kent st, sydney,nsw 2000, australia" format = "json" addressdetails = 0 # nominatim API has a "fair usage" policy but is more or less unlimited # you could in theory just install your own stack, and query that... url = paste0("http://nominatim. [Read More]

Convergence Of Kmeans

k-means clustering is a popular cluster analysis method. When using it, questions might arise, such as: does the algorithm always converge? Does it converge to the global minimum? (It doesn’t, but it does converge to a local minimum.) Here, I will take a brief look only at the convergence of k-means. Suppose a single cluster \(C\) is assigned representative \(z\). The cost is then $$ cost(C; z) = \sum_{x \in C} ||x-z||^2$$ [Read More]
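The cost function above is easy to compute directly. The sketch below (helper names `cost` and `centroid` and the sample points are illustrative) evaluates it and checks the standard fact that, for a fixed cluster, the mean is the representative with the lowest cost. That this is the step the post's argument uses is an assumption, since the excerpt is truncated.

```python
def cost(cluster, z):
    """Sum of squared Euclidean distances from each point in the cluster to z."""
    return sum(sum((xi - zi) ** 2 for xi, zi in zip(x, z)) for x in cluster)


def centroid(cluster):
    """Coordinate-wise mean of the points in the cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]  # hypothetical points
mean = centroid(cluster)

print(mean, cost(cluster, mean))   # (1.0, 1.0) 8.0
print(cost(cluster, (0.0, 0.0)))   # 14.0, any other representative costs more
```

Because both the assignment step and the update step can only lower (or keep) the total cost, the cost sequence is non-increasing, which is the usual route to showing convergence.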

Love And Creative Rut

Lately I feel as though I have been stuck in a creative rut. Perhaps it is a winter thing, where it doesn’t feel like there is any energy to do something exciting. I tried picking up various game frameworks to give a hand at making a simple game. Above is my version of breakout written in Lua using the LOVE framework. It has provided me with simple abstraction to create a simple game. [Read More]

Feedback Sandwich

An important component of toastmaster’s leadership component is evaluations. The commonly cited method to accomplish this is the commend-recommend-commend (CRC) method. Now when we have time to prepare feedback it may be quite straightforward to construct useful points of development, however when an evaluation is to be given in a short time after the speech, it may be difficult to construct an effective evaluation. The truth is effective evaluations are difficult; if not the most difficult speech you could present at toastmasters. [Read More]

Whistling Vivaldi

Here are some of my notes from a toastmasters CC4 speech I completed recently. This is not my speech, but rather an altered version to make it more suitable for reading. Whistling Vivaldi Should you put toastmasters on your resume? The typical response is “Of course you should!” but for me in particular, it will remain purposely absent. Micro-affirmations Imagine a generic Asian male who has a technical, male dominated role sitting in a work-related meeting. [Read More]

Build Your Own Degree

With the increase in the popularity of MOOCs why not try to build your own degree! The easiest way is to consider some guidelines while copying a sample course outline in university. Course distribution guidelines (for bachelor degree): 2/3 of time dedicated to your major 1/3 of time dedicated to general education (subjects not belonging to your school) 500 hours of study per semester (full-time) or 250 hours (part-time), or 3000 hours per degree This is roughly 4 courses with 10-15 hours of work / week The best way (with course information) would be to make use of MIT Open Courseware curriculum. [Read More]

Recorded Crime Rates In Nsw From 1995 To 2009

Learn About Tableau I thought I would have a look at some open data and I came across the crime data set. The data used is: Recorded Crime Dataset (NSW) Postcode 2011 to Local Government Area 2011 Using R I thought this was a good opportunity to practise some R skills, though it could have easily been accomplished in Python as well. In order to read the data into R “nicely” I saved the relevant data into csv format. [Read More]

Career Readiness

What is a ‘career’? To different people, a career can mean different things. It is shaped by context (what stage of life you’re in) and experience. For me, the best metaphors which describe my career right now are: Ascending: I want to climb as high as I can Learning: It is important to be able to exploit opportunities which allow me to pursue my interests Expressing: Can I be myself? [Read More]

How To Get That Job

I recently had a one-on-one meeting with my manager and I asked this question: How would I get into management? (If I want to go down that route) Given my technical background it is a strange question; quite frankly it would probably be to the surprise of some people if I went full management and relinquish all programming at such a young age. My leader gave some extremely simple advice which would apply to any job. [Read More]

Career Skills

Notes on the MOOC “Enhance your career and Employability skills” Week 2 Skill: Teamwork. Description: Respecting others, co-operating, negotiating, persuading, contributing to discussions, your awareness of interdependence with others. Rating: 2. Evidence: I work in a rather solo field, the communication overhead is really where teamwork comes into play; am I doing something which someone else has done before? Having regular catchups is difficult since technical people find it a waste of time (me included) though it has the potential of saving yourself time! [Read More]

Career Values

Notes on the MOOC “Enhance your career and Employability skills” Week 1 Integrity. “Form follows function” - results driven, challenging work. Flexibility with respect to tooling (flexibility/innovation over process). Data driven decisions. The customer is not always right; sometimes they don’t know what they’re looking for. Mass collaboration over individual work. Automate relentlessly. Sense of purpose and results. Lifeline Positive Emotions: curiosity, hope, gratitude, joy, enthusiasm, pride, generosity Negative Emotions: worry, dread, anger, sorrow, frustration, envy, selfishness var graph = new Rickshaw. [Read More]

Disc

Here are some notes based on a toastmasters session (May 2014) with Ben Reeve from Inform; these notes are just for my own sake, and may be helpful for any readers. The most important thing when using this framework is to consider what your behaviour is. And then, think about the behaviours of others. As you will notice, I belong to the “Dominance” category. In a nutshell, you could describe any person using two attributes: [Read More]

Framing

Here are some notes based on a toastmasters session (November 2013) with Ben Reeve from Inform; these notes are just for my own sake, and may be helpful for any readers. The goal of framing is how to approach difficult questions. This can be important for people who may sometimes be too direct, or perhaps require some structure. Placing objections is often one of the hardest things you have to do, whether it is in your personal or professional life. [Read More]

Cmusphinx Quickstart On Windows

CMUSphinx is an open source toolkit for speech recognition. Here I have some notes which helped me with getting started with CMUSphinx. This is not a comprehensive tutorial, but rather just to give you the bare minimum to get it up and running. On the download page, download (at the time of writing, it was version sphinxbase-0.8 and pocketsphinx-0.8): sphinxbase (windows binaries) and pocketsphinx (windows binaries). Extract them in the same folder (say CMUSphinx), and rename them so that the folder only has two folders called “pocketsphinx” and “sphinxbase” And that’s it! [Read More]

Failure Is An Option

Here are some of my notes from a toastmasters CC2 speech I completed recently. This is not my speech, but rather an altered version to make it more suitable for reading. Failure is an Option Failure is a difficult topic. Most people (and rightly so) have the notion that failure is a negative thing. But it should not universally be the case: Failure is necessary for learning allows for rapid growth important for creativity Failure is necessary for learning Failure begins in childhood. [Read More]

Pokemon Master

For April fools, Google released their Pokemon Challenge. As a challenge from a friend I decided to start a quick prototype of how you could clone this. The library of choice I used was leaflet. Goals: Create a map which spawns pokemon. When you click on a pokemon to “catch” it, it turns into a pokemon. More complex interactions like a Pokedex could then be easily added. Limitations: [Read More]

The Non Linear Path To Learning

After stumbling across The Odin Project, I thought how could I approach this but using data science instead; perhaps I could write a rails application! But then after thinking about it just a little deeper I realised that it would be difficult to be comprehensive (though it is possible). Why is it difficult? Well people come to data analytics from a variety of backgrounds: IT backgrounds (databases) Computer Science backgrounds (programming and software engineers) Mathematics, engineering (hard sciences) Economics, social sciences (soft sciences) Each one would take a different path for data science. [Read More]

Pseudo Log Transformation

\(log\) transforms are commonly used for data visualisation. But there is a downside. The domain of \(log(x)\) is \(x > 0 \). Which then leads to the next question. What if my data can extend across the whole real line? For example, profit and loss numbers. How could you transform it? Maybe the immediate thought is to do $$sign(x) log(|x|+1)$$ Which is an acceptable answer. Another possible solution is using \(sinh(x)\). [Read More]
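The first transform above is straightforward to sketch. The function name `signed_log` is illustrative. The comparison with `math.asinh` is an assumption on my part: it reads the hyperbolic-sine suggestion as the inverse function, which behaves like a logarithm for large \(|x|\) while staying defined across the whole real line.

```python
import math


def signed_log(x):
    """sign(x) * log(|x| + 1), defined for every real x."""
    return math.copysign(math.log1p(abs(x)), x)


for x in (-1000.0, -1.0, 0.0, 1.0, 1000.0):  # hypothetical profit and loss values
    # asinh is shown only for comparison; see the assumption noted above.
    print(x, signed_log(x), math.asinh(x))
```

Both functions are odd (symmetric about the origin) and pass through zero, so losses and profits of the same size land at equal distances from the axis.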

Productivity And Lines Of Code

How do you determine if you are productive? Let’s assume that coding is not a creative role. Assumptions: Typing speed is 67wpm (assuming 6 characters a word, this will be ~400 characters a minute) Line length is 80 characters per line of code This means that an extremely conservative estimate is 5 lines of code a minute. With that in mind, to create a 1,000-line piece of code would take 200 minutes, which is only half a day. [Read More]

Notes On Node And Express

Express gets updated relatively frequently, so unfortunately tutorials go out of date (and due to the nature of the internet, are updated infrequently). Inevitably, this tutorial will also go out of date. Nevertheless, here are just some notes from following online tutorials on building web applications using node and express. Getting Started Assuming node is installed, you can go ahead and install all the necessary modules that would be used for making a webapp. [Read More]

A Degree Is A Degree

After applying to (and being rejected from) Georgia Tech’s Online Masters degree, I wondered what the options for me actually were. As a sidenote, I met the minimum requirements, but of course coming from a non-computer science background could not have possibly helped. So here are some breakdowns (in AUD) of the cost of various degrees (in no particular order): Georgia Tech (Online Masters of Science - Computer Science) ~7,500 University of Newcastle (Masters of Information Technology) ~17,000 University of New South Wales (Graduate Certificate in Computer) ~10,000 Central Queensland University (Graduate Certificate in Information Technology) ~4,500 Deakin University (Graduate Certificate in Information Technology) ~11,000 Charles Sturt University (Graduate Certificate in Information Technology) ~10,000 University of Liverpool (Masters of Science - Software Engineering) ~22,000 University of Hertfordshire (Masters of Science - Computer Science) ~11,000 Walden University (Masters of Science - Information Technology) ~32,000 Southern New Hampshire University (Masters of Science - Information Technology) ~13,000 Swinburne University (Graduate Certificate of Information Technology) ~10,000 Stanford University (Master of Science - Computer Science) ~65,000 There are many more universities which offer degrees and certificates but this provides some indication of what the fees look like. [Read More]

Introduction To Flask

Flask is not something I haven’t heard of before, but I thought I’d have a look at this microframework just to see what’s different, and perhaps learn a different way to build web applications. Getting Started Let’s begin with the typical “Hello World” example. from flask import Flask app = Flask(__name__) @app.route('/') def get_tasks(): return "Hello World" if __name__ == '__main__': app.run(debug=True, port=8080) This isn’t anything overly complicated or magical. Though when compared with some other frameworks this is beautifully simple. [Read More]

Documentation Generator For Sas In Python

After playing around with pycco I decided to have a go at creating a custom one for SAS. When applying pycco onto SAS, you will run into strange formatting since when you have: some sas code /*a comment*/ you would expect the comment to reference that line of code. pycco handles this correctly. In comparison since SAS doesn’t have a concept of “functions” (yes there are macros, but let’s ignore that for a moment). [Read More]

Flex And One Random Thought

After working on parsing SAS code over the last few weeks, I thought I’d revisit the tools I used. I used pyparsing. There’s nothing wrong with using pyparsing (in fact it was immensely helpful to get things done quickly) but I thought it would be a good opportunity to examine flex and bison. I’m more interested in looking at and improving on my C code. The structure of the lexer and grammar file isn’t completely new to me, since I had been exposed to ply through a Udacity course, but it still took more time than I would have liked to get things “up and running”. [Read More]

I Wish Sas By Statement Was Retired

I’ve been debating whether by statement should be retained in Stan. This is quite easily my least favourite “feature” of SAS. The modern methodology of “split-apply-combine” in R and Pandas is far superior and easier to read compared with the many hacks which come along. For example: data class; set sashelp.class; by sex; /*just pretend its been sorted...*/ retain cumulative_age name_list; if first.sex then do; cumulative_age = 0; name_list = ''; end; cumulative_age = cumulative_age + age; name_list = catx(', ', name); if last. [Read More]

What I Have Learnt Writing A Sas Parser

So I’ve just released v0.01-alpha of my SAS transcompiler to Python (Stan). Here is just a list of things I’ve learnt : SAS is very similar to a PL/0 language by statements are inferior to the split-apply-combine strategy pyparsing makes life very easy (compared with dealing with lots of regex) iPython magics are ridiculously easy to write writing Python packages isn’t that hard, but there are a lot of extraneous options Some of the (many) things which are missing: [Read More]

Learn For The Sake Of Learning

If I could go back in time and give advice to myself before starting my full time job what would I say? You see many people will tell you something similar to: What you learn in university is irrelevant to the real world. And the truth is, they’re right. But that doesn’t mean you should just forget everything and make no progress in knowledge. You see to function in the real world is like being an end-user. [Read More]

Who Owes What Revisited

Perhaps it’s just an itch, but I couldn’t resist. So here is Wow. As stated before I thought the best way was to emulate APScheduler. Usage is fairly similar. To start Wow simply use. {% highlight python %} z = Wow() {% endhighlight %} Then to add a payment: {% highlight python %} z.add_payment("deposit", 39.25, "Chapman", "Tim") {% endhighlight %} Finally to calculate, simply do {% highlight python %} print z. [Read More]

No Resolutions

This year I’m not making any resolutions. I’m just going to do what I enjoy and what I want and go from there. Since working full time (only been working for 2 years or so) I’ve been doing too much. Been spread too thin. This year is the year of doing less, but at the same time, doing more. Doing fewer “things” but doing them to a higher standard of quality. [Read More]

Who Owes What

Over the Christmas break I went on a holiday with a group of friends, and we stumbled on a bit of a dilemma: Who owes what? Of course it’s easy to determine that before your trip, with the plane tickets and accommodation. But what about during your trip? The taxi fares, the meals, souvenirs, etc. You wouldn’t want to go through it with a fine-tooth comb whilst trying to enjoy your holiday! [Read More]

Gentle Introduction Into Bayesian Inference

Following on from a previous post, let’s look into Bayesian inference. Recap Recall that under Bayesian probability we have this formula. $$ P(A \cap B) = P(A)\times P(B | A) $$ We can rearrange this: $$ P(A | B) = P(A)\times \frac{P(B|A)}{P(B)} $$ Now we have the equation for Bayesian inference. The central idea for Bayesian inference is this: $$ posterior \propto prior \times likelihood $$ Right now all these things don’t really make sense, so let’s revisit the equation we have above. [Read More]
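To make the rearranged formula in the excerpt above concrete, here is a small numeric sketch. The probabilities are made up purely for illustration, and the function name `posterior` is my own choice rather than anything from the post.

```python
def posterior(prior, likelihood, evidence):
    """P(A | B) = P(A) * P(B | A) / P(B)."""
    return prior * likelihood / evidence


# Hypothetical numbers, chosen only to illustrate the formula.
p_a = 0.3               # prior, P(A)
p_b_given_a = 0.8       # likelihood, P(B | A)
p_b_given_not_a = 0.1   # P(B | not A)

# P(B) via the law of total probability.
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a

p_a_given_b = posterior(p_a, p_b_given_a, p_b)
print(f"P(B) = {p_b:.2f}, P(A | B) = {p_a_given_b:.4f}")
```

Both ways of writing the joint probability agree here: `p_a * p_b_given_a` and `p_b * p_a_given_b` give the same value, which is exactly the rearrangement used in the excerpt.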

Sydney Hat Restaurants

This weekend, I decided to create a visualisation, to the best of my ability using the hat restaurant list from Gault and Millau Sydney Restaurant Guide. The result is here, and the screenshot is below: Lessons Learnt Scraping the Table This is not the first (and probably not the last) time I will scrape data off websites like that. Firstly the table from www.noodlies.com is slightly malformed. It isn’t quite in the format you expect, and is not in unicode format which makes apostrophes look strange in python (when you force it into unicode). [Read More]

Fostering Creativity

There is a current trend, where company values are “creativity”, “innovation”. But how do we encourage these things? Learning to Fail Creativity : the use of imagination or original ideas to create something; inventiveness. – Google (Side note, using the word “create” as a definition of “creativity” is probably poor form…since if you didn’t know the definition of creativity, you probably don’t know the definition of create!) You can’t be creative, without creating. [Read More]

Create R Package As Fast As Possible

Apparently creating R packages is a good idea for code reuse. So what’s the best way to do it? Hopefully in this short blog post I will take you from start to finish as quickly as possible (omitting details for you to fill in) Useful Packages Firstly install devtools: install.packages("devtools", dependencies = TRUE) This is for using dev_mode() which will isolate an environment for you to do more testing. library(devtools) dev_mode() create() Navigate to the working directory and just create a package: [Read More]

A Glimpse Into Bayes Probability

Let’s firstly think about the probability of two events happening. Let’s say Coin toss is heads, \(P(Heads)\) It will rain, \(P(Rain)\) Now these two events are independent of each other, so the probability of both occurring is: $$P(Heads \cap Rain) = P(Heads)P(Rain)$$ But!! The probability of two events happening is not always this simple! Let’s take another two events: Next person you meet is male, \(P(Male)\) Next person you meet is wearing a dress, \(P(Dress)\) These two events are not independent, since we know (or at least have a prior belief) that there are not too many males who wear dresses. [Read More]

Data Visualisation Using Fitbit Data

As a farewell gift from my colleagues at Westpac, I was given a Fitbit Flex! Lucky me! One of the cool things about Fitbit is the dashboard which has a wealth of information in it. But for me, it’s not enough. I don’t want 15 minute segments, I want it at a finer scale! Also, what if I don’t really care about steps taken, but rather I want to know distance travelled. [Read More]

Unintentionally Disconnected

I am a minimalist. Perhaps not an extreme one, but am always looking for ways to do less. This isn’t always easy, in fact it is increasingly hard. Often times you find yourself more and more engrossed in elements such as the daily news for example, or the latest trends. I had come to a moment of serendipity (is this correct usage?) when I aimed to replace my current smartphone (HTC Trophy) with the ZTE Open, a Firefox OS phone. [Read More]

Coprimes

Claim: If \(n\) is coprime with \(a\) and \(b\), then \(n\) is coprime with \(ab\). Firstly let us define what we mean by coprime. If \(n\) and \(a\) are coprime, then $$ gcd(n, a) = 1 $$ Using the Euclidean algorithm, we know that if (and only if) the above is true then there must exist integers \(x\) and \(y\) such that $$ ax + ny = 1 $$ So then to prove the result above, let \(v\), \(w\), \(x\), \(y\) be some integers such that [Read More]
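The claim in the excerpt above can be checked numerically. The sketch below uses the extended Euclidean algorithm to find the integers in \(ax + ny = 1\) and \(bv + nw = 1\), then multiplies the two identities together, which is one standard way to finish the argument (it may not be the exact route the full post takes). The values of `n`, `a` and `b` are arbitrary examples.

```python
from math import gcd


def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


# Arbitrary example: 35 is coprime with both 12 and 11.
n, a, b = 35, 12, 11
_, x, y = extended_gcd(a, n)   # a*x + n*y == 1
_, v, w = extended_gcd(b, n)   # b*v + n*w == 1

# Multiplying the two identities gives
#   ab*(x*v) + n*(a*x*w + y*b*v + n*y*w) == 1,
# so gcd(n, ab) must be 1.
combined = a * b * (x * v) + n * (a * x * w + y * b * v + n * y * w)
print(combined, gcd(n, a * b))
```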

Contingency Tables Measures Of Fitness

In the area of machine learning and statistical modelling, logistic regression and the grouping of objects into groups are extremely important. There are plenty of documented ways to assess model suitability, primarily dealing with false positive and true positive ratios. But when items are difficult (or expensive) to access and type III errors (correct classification for the wrong reasons, which strangely enough is quite important in some areas) are prevalent, different methods have to be employed to think about model fitness. [Read More]

Extremely Short Guide To Web Scraping Tables

Scraping information from the internet can be very handy. But often, information is located somewhere within the tags of a webpage. Scraping this information into a format which can be used by another source can be easily achieved using Beautiful Soup. An Example In this example we shall retrieve the Barclays Premier League table located here. To open this webpage we shall use urllib2. import urllib2 webpage = urllib2.urlopen('http://www.premierleague.com/en-gb/matchday/league-table.html') We can then “soup” up the webpage using Beautiful Soup. [Read More]

Constructors In R And Python

The ability to construct your own abstractions is an important part of any object-oriented programming language. In this post we will quickly run through the differences between building constructors in R and Python. Declaring constructors Python In Python to declare a constructor we use: class Person: def __init__(self, name): self.name = name In Python methods can be added dynamically. e.g. from types import MethodType class Person: def __init__(self, name): self. [Read More]
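As a rough illustration of adding a method dynamically with `MethodType`, as mentioned in the excerpt above, here is a minimal sketch. The `greet` function and the example names are hypothetical and not taken from the original post.

```python
from types import MethodType


class Person:
    def __init__(self, name):
        self.name = name


def greet(self):
    # A plain function; it only becomes a method once bound to an instance.
    return "Hello, " + self.name


alice = Person("Alice")
bob = Person("Bob")

# Bind greet to this one instance; other Person objects are unaffected.
alice.greet = MethodType(greet, alice)

print(alice.greet())
print(hasattr(bob, "greet"))
```

Assigning `Person.greet = greet` instead would make the method available on every instance, which is the other common way of doing this.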

Kantorovich Inequality

The Kantorovich Inequality is used to show linear convergence for the steepest descent algorithm (for the quadratic case). This result is important in some optimization algorithms. Kantorovich Inequality Suppose \(x_1 < x_2 < \dots < x_n\) are given positive numbers. Let \(\lambda_1, \dots , \lambda_n \geq 0\) and \(\sum_{i = 1}^n \lambda_i = 1\) then $$ \left( \sum_{i = 1}^n \lambda_i x_i \right) \left( \sum_{i = 1}^n \lambda_i x_i^{-1} \right) \leq \frac{1}{4} \frac{(x_1 + x_n)^2}{x_1 x_n}$$ [Read More]
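A numerical spot check is no substitute for the proof, but it can help build intuition for the inequality stated above. The sketch below draws random positive \(x_i\) and random weights \(\lambda_i\) summing to one, then compares both sides; the function name and the sampling ranges are my own choices.

```python
import random


def kantorovich_sides(xs, lams):
    """Return (lhs, rhs) of the inequality for sorted positive xs and weights lams."""
    lhs = sum(l * x for l, x in zip(lams, xs)) * sum(l / x for l, x in zip(lams, xs))
    rhs = 0.25 * (xs[0] + xs[-1]) ** 2 / (xs[0] * xs[-1])
    return lhs, rhs


random.seed(0)
checks = []
for _ in range(1000):
    xs = sorted(random.uniform(0.1, 10.0) for _ in range(5))
    raw = [random.random() for _ in xs]
    total = sum(raw)
    lams = [r / total for r in raw]          # non-negative, sums to 1
    lhs, rhs = kantorovich_sides(xs, lams)
    checks.append(lhs <= rhs + 1e-12)        # small tolerance for rounding

# The bound is attained with two points and equal weights.
tight_lhs, tight_rhs = kantorovich_sides([1.0, 4.0], [0.5, 0.5])
print(all(checks), tight_lhs, tight_rhs)
```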

Short Introduction To Ggplot2 In R

ggplot2 (grammar of graphics plot) is a twist on the traditional way of displaying graphics, and will differ slightly compared with the plot functions in R. Layers The general idea of GOG (grammar of graphics) is that graphics can be separated into layers. From Wickham’s paper, the layers can be summarised as follows: data or aesthetic mappings, which describe the variables (e.g. x ~ y plot) geometric objects, which describe what you see, (e. [Read More]

Simple Memoize In Scheme

Here is some simple code to memoize a recursive function in Scheme. I originally got this from a Stack Overflow question, but the link is now dead, so I thought I would post it here. Memoize function: (define (memoize op) (letrec ((get (lambda (key) (list #f))) (set (lambda (key item) (let ((old-get get)) (set! get (lambda (new-key) (if (equal? key new-key) (cons #t item) (old-get new-key)))))))) (lambda args (let ((ans (get args))) (if (car ans) (cdr ans) (let ((new-ans (apply op args))) (set args new-ans) new-ans)))))) After we have memoized this, we can use this within our recursive function. [Read More]
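For comparison, a rough Python analogue of the same idea is sketched below. It is not a translation of the Scheme code: it swaps the chain of closures for a dictionary keyed on the argument tuple, and the Fibonacci example is my own. The key point carried over is that the recursive function must call the memoized name for the cache to help.

```python
def memoize(op):
    """Wrap op so each distinct argument tuple is only computed once."""
    cache = {}

    def wrapper(*args):
        if args not in cache:
            cache[args] = op(*args)
        return cache[args]

    return wrapper


calls = []  # records every real (non-cached) evaluation


def fib(n):
    calls.append(n)
    return n if n < 2 else fib(n - 1) + fib(n - 2)


# Rebind the name so the recursive calls inside fib hit the cache too.
fib = memoize(fib)

result = fib(30)
print(result, len(calls))
```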

Moocs Can Not Make You A Data Scientist

This is a statement which I’ve been thinking about extensively for a while now. How do you become a data scientist? (or rather how should I become a data scientist). The conclusion that I have now (subject to change) is not MOOCs. Without a doubt, MOOCs will help, in fact I’m eyeing an article right now and following it as the basis of improving my knowledge. To be fair the article also links an additional two articles on management and [technical] (http://www. [Read More]

Short Guide To Bazaar Shared Repositories

This guide is meant as an introduction to shared repositories in Bazaar. I am by no means an expert at Bazaar, hence there is no guarantee that all terms provided are correct. Shared Repositories Whenever we want to share our repositories across different users, it may be wise to use shared repositories. To create a shared repository: $ bzr init-repo foo-repo init-repo : Creates a shared repository (this is the same command as init-repository) foo-repo : this is the name of the repository [Read More]

Short Derivation Of Log Normal Distribution

If \(X \sim LN(\mu,\sigma^2)\) then \(Y = ln(X)\) is distributed \(N(\mu,\sigma^2)\).

Derivation of log-normal distribution

\(\begin{align}
Pr(X < k) &= Pr(e^{Y} < k) \\
&= Pr(Y < ln(k)) \\
&= \int_{-\infty}^{ln(k)} \frac{1}{\sqrt{2\pi \sigma^2}} e^{- \frac{(y-\mu)^2}{2\sigma^2}} dy \\
&= \int_{0}^{k} \frac{1}{\sqrt{2\pi \sigma^2}} e^{- \frac{(ln(x)-\mu)^2}{2\sigma^2}} \frac{dy}{dx} dx \quad \text{where } y = ln(x),\ \frac{dy}{dx} = \frac{1}{x} \\
&= \int_{0}^{k} \frac{1}{x\sqrt{2\pi \sigma^2}} e^{- \frac{(ln(x)-\mu)^2}{2\sigma^2}} dx \end{align}\)
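As a sanity check on the derivation, the two ends of the chain can be compared numerically: \(Pr(Y < ln(k))\) from the normal distribution, and the integral of the log-normal density over \((0, k)\). This is only a sketch; the parameter values are arbitrary, the helper names are mine, and the integration uses a simple midpoint rule.

```python
import math


def normal_cdf(z, mu, sigma):
    """Pr(Y < z) for Y ~ N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))


def lognormal_pdf(x, mu, sigma):
    """Density from the last line of the derivation, valid for x > 0."""
    norm = x * sigma * math.sqrt(2.0 * math.pi)
    return math.exp(-((math.log(x) - mu) ** 2) / (2.0 * sigma ** 2)) / norm


def midpoint_integral(f, lo, hi, steps):
    """Midpoint rule; never evaluates f at the endpoints, so x = 0 is avoided."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


mu, sigma, k = 0.0, 1.0, 2.0   # arbitrary example values
via_normal = normal_cdf(math.log(k), mu, sigma)
via_density = midpoint_integral(lambda x: lognormal_pdf(x, mu, sigma), 0.0, k, 200_000)

print(via_normal, via_density)
```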

Getting Started With Measure Theory

Rigour is lost in the real world. But that does not mean I shouldn’t continue to pursue and practise my mathematical skills. Just like a programmer might learn to code by writing programs, to improve my mathematical thinking I must read other people’s proofs and write my own. (emphasis mine) Adapted from “Why there is no Hitchhiker’s Guide to Mathematics for Programmers”. So perhaps here will be a collection of (mostly other people’s) proofs, until I feel confident writing my own. [Read More]

Missing The Point

After working full time for one year, one of the stumbling blocks for (good) analysts producing results is tools. There is such a huge emphasis on what tools you should be using, rather than how you should use them! Big Data is all the rage and interest right now, and perhaps that might be what is to blame. A Shifting Focus? The industry may be moving in the right direction. With the introduction of MOOCs like Udacity, Coursera, and EdX, vigilant and proactive computer programmers or statisticians can increasingly improve and widen their knowledge. [Read More]

Tangle.Js And Fangle

gist here Here’s just an update on what I’m working on right now. I’m quite interested in markdown as an alternative for quick prototyping for many applications which we would normally associate with Office suites. For example, in my previous post, I’ve talked about deck.js and how it could be easily combined with a simple script to produce presentations quickly. Now, I’m considering what I believe could replace spreadsheets (though unlikely). [Read More]

Markdown Deck.Js

update: check out Puma.js which uses the same ideas but with no server-side compilation needed. I’ve been interested in various forms of markdown for a while (after all jekyll uses markdown heavily). That’s when I discovered deck.js. A very short python script will allow you to generate presentations without having to worry about layout. The structure of the presentations is using markdown as normal and separating each slide by ---. For example, the markdown presentation to generate this presentation is as below: [Read More]

Technically Right

Being technically right. There are many variations on this theme, including code smell. But what about data analytics? We know that we can obviously have code smell within our own programming, whether it be in SAS, R or any other programming language we intend on using. But what about when dealing with business users? Is there some equivalent to code smell? Consider this situation: A business user requests information You promptly and quickly provide precisely what was requested; in a spreadsheet with 5 tabs, each tab containing 1000 rows. [Read More]

Responsive D3 Venn Diagrams

Here is an outline of how I managed to make SVG elements ‘responsive’. I am by no means an expert in javascript, in fact I would actually say I’m a copy and paste programmer when it comes to javascript. My train of thought on how you could make SVG elements responsive is through jQuery. More specifically through $(window).resize. After defining a function which will draw the SVG element based on the browser size, then $(window). [Read More]

Cron

I have found a useful cron-like script for Windows using Python. Although it is relatively old (last update being in 2011) it still works really well. I have applied some changes below to extend its functionality to allow you to put a “dash” between numbers. e.g. 1-5 0-59 and so on, so that it is even more cron-like. Changes are as follows: def listing(expr): listing = [dash(x) for x in string. [Read More]
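To illustrate the “dash” idea described above, here is a minimal sketch of expanding a field such as 1-5 into the individual numbers it covers. The helper `expand_dash` is hypothetical; it is not the `dash` function from the original script, whose body is not shown in this excerpt.

```python
def expand_dash(field):
    """Expand a cron-style field like '1-5' into [1, 2, 3, 4, 5].

    A field without a dash, such as '7', becomes a one-element list.
    """
    if "-" in field:
        start, end = field.split("-", 1)
        return list(range(int(start), int(end) + 1))  # the range is inclusive
    return [int(field)]


print(expand_dash("1-5"))
print(len(expand_dash("0-59")))
```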

Get That Google Drive Static Webpage

My weekend project! Recently I’ve been on Codecademy learning jQuery and javascript. So as an exercise I’ve made a simple website which tries to get the webpage from your Google Drive account based on this blog post. The way it tries to get the website is using regex on your Google Drive URL /\/(\w{13,})\//. No error checking was done to ensure that a valid URL was placed in any of the fields. [Read More]
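The regular expression quoted above can be tried out in Python as well. The sketch below is an assumption-laden illustration: the URL is a made-up placeholder rather than a real Google Drive address, and `extract_id` is my own helper name. The pattern simply looks for a run of 13 or more word characters sitting between two slashes.

```python
import re

# Same pattern as the JavaScript /\/(\w{13,})\//; re.ASCII keeps \w to [A-Za-z0-9_].
ID_PATTERN = re.compile(r"/(\w{13,})/", re.ASCII)


def extract_id(url):
    """Return the first 13+ character identifier between slashes, or None."""
    match = ID_PATTERN.search(url)
    return match.group(1) if match else None


# Placeholder URL for illustration only.
example_url = "https://example.com/folder/0B1a2b3c4d5e6f7g8h/edit"
print(extract_id(example_url))
print(extract_id("https://example.com/short/"))
```

Returning `None` when nothing matches is one simple way to add the error checking that the post notes was left out.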

Openshift And Webpy

I’ve just finished my first web app on Openshift using the web.py format! This will be a list of random thoughts and ideas where I will record them. I like to think of this as my Moleskine notebook which people carry around. There are still various improvements which I can make: Adding a framework. Bootstrap seems to be Skeleton’s big brother, so it would be a good starting point to apply it to this website Responsive web is not the main goal, though for learning purposes; why not! [Read More]

Openshift And Postgresql

I’ve been giving Openshift a shot for a few days now. What I’ve got running: webpy PostgreSQL Annoyances I’ve had: Getting multiline text to run with proper escaping. This really is the only gripe I’ve had with Postgresql (not with Openshift). But I think enough is enough for now, I’ll revisit you later! Lessons learnt: Use insert in mingw32 to paste Escaping strings is a POA Using webpy on a framework better suited to Django or Rails means lots of experimentation (i. [Read More]

Jquery Table Of Contents

Based on Janko’s jQuery table of contents. Repository: github repository, using Skeleton as a basis for responsive web design. Click here to view it in action. Update: I’ve moved and refactored my code here; this post will be kept for legacy, though the code is more or less the same as before (23 February 2013) Goals Create something which was independent of css (in the sense that you have to manually add additional css code in the html file) Create something which would fit with responsive web design Although I can see many portions which can still be improved, this is my edited version of Janko’s jQuery ToC. [Read More]

Single Source Of Truth

The biggest lesson I’ve learnt in my short time in analytics is being vigilant about sticking with a single source of truth. Every extract should be run once, cleansed once, and then thrown into production. There should not be multiple extracts for the same ‘theme’ as this serves only to confuse not only other people but yourself (at least in my experience). Simplicity should be strived for; having one go-to source will solve that issue. [Read More]

Open Source Data Analytics

Recently, I’ve been exploring other programming languages and ideas, in particular open source data analytics and d3. This space has really broadened my mind with what can be achieved outside the boundaries of SAS in data mining, especially with zero-installation modules and applications. Python has been my main weapon; making use of modules such as pyodbc, pyper and networkx has really allowed me to extract (pyodbc), perform analysis (pyper/R) and visualize data effectively (networkx). [Read More]

Prototypes

Too often, prototype creep occurs. Prototypes end up being the production version, regardless of whether it is appropriate. “We don’t have enough time” is the most common excuse for not creating a production copy. However, we really should plan accordingly. There are many reasons for creating production versions from scratch. One being: Once in production, a prototype will never die. This is why “we don’t have enough time. [Read More]

Dry

A great sin of many data analysts (especially statisticians) is repeating themselves. We need to learn from programmers over the last few decades how to code efficiently and effectively. Perhaps the easiest and quickest starting point is documentation. Why do we keep code descriptions in our header comments and maintain separate documentation in a Word document? One easy solution would be to write a program which extracts the comments from all your header files and generates the required documentation automatically through a markup language. [Read More]
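The idea in the last sentence above could look something like the following sketch. Everything here is an assumption for illustration: it expects each program to start with a single `/* ... */` header comment (as in SAS or C), it emits Markdown, and the file name and sample text are invented.

```python
import re

# Matches one /* ... */ block at the very top of the source text.
HEADER_COMMENT = re.compile(r"\A\s*/\*(.*?)\*/", re.DOTALL)


def header_to_markdown(name, source):
    """Turn the leading header comment of a program into a Markdown section."""
    match = HEADER_COMMENT.search(source)
    body = match.group(1) if match else ""
    # Drop the decorative asterisks and surrounding whitespace on each line.
    lines = [line.strip(" *\t") for line in body.strip().splitlines()]
    return "## " + name + "\n\n" + "\n".join(lines) + "\n"


# Invented sample program, standing in for a real file read from disk.
sample = "/*\n * Purpose: monthly extract\n * Author: someone\n */\ndata a; set b; run;\n"
doc = header_to_markdown("extract.sas", sample)
print(doc)
```

Running this over every program in a directory and concatenating the sections would give one generated document, so the header comments remain the only place descriptions are maintained.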