Thinking About Teaching Data Science

Recently someone asked me,

If you had a three-day data science bootcamp, what would be in it?

Put on the spot, I really had no idea what could be covered in three days. Now that I have had time to reflect on the problem, here is my proposed outline:

Day 1 - Supervised Learning

The assumption would be that every participant has a working knowledge of linear regression, as well as model training, validation and testing. Hence day 1 would simply consist of three parts:

K-nearest-neighbours
Decision Trees
Ensemble methods

Being able to reason about and distinguish between linear regression, K-nearest-neighbours and decision trees already gives you a solid background in the more common supervised learning methods.
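To make the contrast with linear regression concrete, here is a minimal sketch of K-nearest-neighbours in plain Python. Nothing is fitted at all: the model simply stores the training data and votes among the closest points at prediction time. The function name `knn_predict` and the toy data are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs, where features are
    equal-length tuples of numbers.
    """
    # Sort the training points by Euclidean distance to the query.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Collect the labels of the k closest points and return the most common.
    labels = [label for _, label in by_distance[:k]]
    return Counter(labels).most_common(1)[0][0]

# Toy data (invented for this example): two well-separated groups.
train = [
    ((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

A useful discussion point for the class is what happens as `k` grows towards the size of the training set, where the prediction collapses to the overall majority label.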

Ensemble methods, such as basic bagging and boosting, should be covered towards the end of the first day if there is time, with the aim of teaching the theory behind them and the ways models can be combined.
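As a rough illustration of the bagging idea, the sketch below trains several one-feature decision stumps on bootstrap resamples of the data and combines them by majority vote. All of the names (`fit_stump`, `fit_bagged`, `predict_bagged`) and the toy data are hypothetical, and the stump is deliberately the simplest base learner I could think of; a real course would more likely use full decision trees.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a decision stump on (x, label) pairs with labels 0 or 1.

    Tries every observed x as a threshold, in both orientations, and keeps
    the one with the fewest training errors.
    Returns (threshold, label_below, label_above).
    """
    best = None
    for threshold in sorted({x for x, _ in points}):
        for below, above in ((0, 1), (1, 0)):
            errors = sum(
                (below if x <= threshold else above) != y for x, y in points
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]

def predict_stump(stump, x):
    threshold, below, above = stump
    return below if x <= threshold else above

def fit_bagged(points, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def predict_bagged(models, x):
    """Combine the stumps by majority vote."""
    votes = Counter(predict_stump(model, x) for model in models)
    return votes.most_common(1)[0][0]

# Toy data (invented): label 0 for small x, label 1 for large x.
data = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (6, 7, 8, 9, 10)]
models = fit_bagged(data)
print(predict_bagged(models, 0), predict_bagged(models, 12))
```

Boosting differs in that the models are trained in sequence, each one concentrating on the examples the previous ones got wrong, rather than independently on resamples.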

Day 2 - Unsupervised Learning (feature engineering and reduction)

The unsupervised learning day would actually focus more on feature engineering and feature reduction. These come up with sparse data sets, for example in text mining, and this section could look at ways of generating such features and cutting them down. Hence the topics would be:

TFIDF, N-grams and information theory
PCA, ICA
Clustering approaches, mainly K-means and EM approaches

Tying these ideas back to the first day will probably be the most important part of this day, particularly what TFIDF and information-theoretic measures mean when these features are used to build supervised models.
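To show what a TFIDF weight actually measures, here is a small sketch in plain Python. It assumes the simplest textbook variant, raw term frequency multiplied by log(N / document frequency), with whitespace tokenisation and non-empty documents; libraries usually add smoothing and normalisation on top. The function name `tf_idf` and the three documents are invented for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: how many documents contain each term at least once.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
]
weights = tf_idf(docs)

# "the" appears in every document, so it carries no information here.
print(weights[0]["the"])
# "mat" is unique to the first document, so it outweighs "cat".
print(weights[0]["mat"] > weights[0]["cat"])
```

The resulting dictionaries are exactly the kind of sparse feature vectors that the day 1 models could be trained on, which is the link back to supervised learning.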

Day 3 - Miscellaneous Topics

There are many other fields that we could consider, and the final day would either combine the first two days or look at other areas, depending on the focus of the participants:

Q learning (reinforcement learning)
Deep learning
Introduction to big data frameworks (Spark, Hadoop, AWS etc.) and a comparison of tools and languages.

Reinforcement learning would round out the big three areas of machine learning (supervised, unsupervised and reinforcement learning), whilst deep learning/neural networks would deepen the understanding of the supervised learning space.
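For a sense of how small a first Q learning exercise can be, here is a sketch of tabular Q learning on a made-up five-state chain where the agent starts at the left end and is rewarded for reaching the right end. The environment, the constants and the function names are all assumptions chosen for illustration; the behaviour policy is purely random, which works because Q learning is off-policy.

```python
import random

N_STATES = 5          # states 0..4; state 4 is the goal and ends the episode
ACTIONS = (-1, +1)    # index 0 moves left, index 1 moves right
ALPHA, GAMMA = 0.5, 0.9

def step(state, action):
    """Deterministic chain: reward 1 on reaching the goal, otherwise 0."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # One row per state, one column per action, all starting at zero.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            a = rng.randrange(len(ACTIONS))  # explore uniformly at random
            next_state, reward = step(state, ACTIONS[a])
            # Q learning update: move towards reward + discounted best
            # value of the next state.
            target = reward + GAMMA * max(q[next_state])
            q[state][a] += ALPHA * (target - q[state][a])
            state = next_state
    return q

q = train()
# Greedy policy for the non-terminal states (1 means "move right").
policy = [row.index(max(row)) for row in q[:-1]]
print(policy)
```

The learned values fall off by a factor of `GAMMA` for each extra step away from the goal, which is a nice way to introduce discounting before moving on to anything larger.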

Finally, with regard to tooling, there are just so many different tools. Learning how they all work and becoming a generalist could be another avenue for the final day.

In the end, covering this much in such a short timeframe is tremendously difficult. But at the same time, I think these areas could be highly rewarding.