Planning for 10 months is always going to be hard, but sometimes it is worthwhile to write the plan down so that you can review it later, see what has changed, and organise any crazy ideas that you might have.
Research
Within the next 6 months, one achievable goal is to complete a paper that is “good enough” to consider for publication. The goal isn’t necessarily to publish, but to have something concrete so that we can decide down the road whether it is sufficient. In fact, I would like to pull this forward to the next 2 months, although that timing depends heavily on actually producing a framework and repeatable experiments.
Idea
Determinantal Point Process (DPP) for diverse feature selection on streaming data. The high-level goal is to ensure that new incoming features are sufficiently different from the existing ones, so as to promote exploring new feature sets which may contribute to the machine learning/data mining problem. This is done in two parts:
- DPP selection algorithm, which presumes no streaming
- Conditional DPP, which selects streaming data conditional on the current feature set
Initial results are promising under certain conditions.
Other applications for this could be using DPP to encourage diverse model sets for ensemble/stacking.
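The two parts above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the algorithm used for the initial results: the kernel is assumed to be the feature correlation matrix, selection is a simple greedy log-determinant maximisation, and the names `similarity_kernel`, `log_det` and `greedy_dpp` are hypothetical. Passing an existing feature set via `selected` stands in for the conditional step, where new features are scored given what has already been kept.

```python
import numpy as np


def similarity_kernel(X):
    """Correlation matrix between the columns (features) of X.

    Using correlation as the DPP kernel is an assumption made for this sketch.
    """
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    return (Z.T @ Z) / X.shape[0]


def log_det(L, idx):
    """Log-determinant of the sub-kernel indexed by idx (0.0 for the empty set)."""
    if len(idx) == 0:
        return 0.0
    sign, value = np.linalg.slogdet(L[np.ix_(idx, idx)])
    return value if sign > 0 else -np.inf


def greedy_dpp(L, k, selected=()):
    """Greedily add up to k features that most increase log det(L_S).

    `selected` holds features that are already in the set, so the new
    picks are chosen conditional on them.
    """
    selected = list(selected)
    candidates = [j for j in range(L.shape[0]) if j not in selected]
    for _ in range(k):
        if not candidates:
            break
        base = log_det(L, selected)
        gains = {j: log_det(L, selected + [j]) - base for j in candidates}
        best = max(gains, key=gains.get)
        if not np.isfinite(gains[best]):
            break  # every remaining feature is redundant with the current set
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: features 0 and 1 are near-duplicates, feature 2 is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
L = similarity_kernel(X)

picked = greedy_dpp(L, k=2)                   # no streaming
extended = greedy_dpp(L, k=1, selected=[0])   # conditional on feature 0
```

On this toy data the near-duplicate pair should never be selected together, since adding the second copy barely increases the determinant.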
Blockers and Challengers
One of the biggest blockers is having worthwhile challengers to compare against on this problem. There are a couple of candidates which I can think of:
- Grafting
- Online Group Feature Selection
- Alpha-investing
Frameworks to Build
In order to have a reusable process for streaming data, my current plan is to build a framework on top of scikit-learn (using the pipeline idea). Broadly speaking, it will be composed of components which look like this:
- “base pipeline”
- “incoming stream 1”
- “incoming stream 2”
and so on. Then the pipeline can be composed as follows:
FeatureSplitter(
    FeatureUnion(
        [('base', BasePipeline),
         ('stream1', StreamPipeline),
         ('stream2', StreamPipeline)]
    )
)
Here, FeatureSplitter will be a custom class that splits off a stream of features by name, so that streaming features do not trip up the rest of the pipeline.
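One possible reading of this design is sketched below. It is only an illustration: the interface of `FeatureSplitter` (a `columns` argument, selection from a pandas DataFrame), the helper `make_branch`, the column names, and the use of `StandardScaler` as a placeholder step are all assumptions. It also places the splitter at the start of each branch rather than wrapping the whole `FeatureUnion`, which differs from the composition above.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class FeatureSplitter(TransformerMixin, BaseEstimator):
    """Select a named group of columns from a DataFrame (hypothetical interface)."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn: the column names are fixed at construction time.
        return self

    def transform(self, X):
        return X[list(self.columns)]


def make_branch(columns):
    """One branch: split off the named columns, then apply a placeholder step."""
    return Pipeline([
        ("split", FeatureSplitter(columns)),
        ("scale", StandardScaler()),
    ])


union = FeatureUnion([
    ("base", make_branch(["b1", "b2"])),
    ("stream1", make_branch(["s1"])),
    ("stream2", make_branch(["s2", "s3"])),
])

# Toy frame with two base features and three streamed features.
rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.normal(size=(20, 5)), columns=["b1", "b2", "s1", "s2", "s3"]
)
features = union.fit_transform(frame)
```

Because each branch only sees its own named columns, a new stream can be added as another entry in the union without touching the existing branches.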
Broad Plan and Milestones
For this piece of work the upcoming milestones which I will keep track of are:
- Create framework (1-2 weeks)
- Port grafting algorithm, OGFS, alpha-investing (2 weeks)
- Run experiments on synthetic datasets (1-2 weeks)
- Write draft paper (2-4 weeks)
- Re-run experiments on ML benchmark datasets/replicating various papers (3 weeks)
Possible parts where I will trip up:
- Framework might be hard to code up well and might have to be scrapped and re-built several times
- May underestimate the time required to do grafting in a “nice” setup. OGFS and alpha-investing have draft layouts created already and should be easier. In the worst case, I may remove the alpha-investing approach, as it is somewhat dependent on the criterion used to assess incoming features. With multiple criteria and multiple choices of statistical tests, it may become very difficult to experiment with, and it might be sensible to leave it for later.
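To illustrate the dependence on the criterion, here is a minimal sketch of the alpha-investing wealth rule as it is usually described for streamwise feature selection. It takes the p-values as input, so the choice of statistical test stays outside the function. The function name, the parameter names `initial_wealth` and `payout`, their default values, and the example p-values are all assumptions made for illustration.

```python
def alpha_investing(p_values, initial_wealth=0.5, payout=0.5):
    """Return the indices of accepted features, given one p-value per feature.

    The p-values are assumed to come from whichever test is chosen to
    assess each incoming feature; that choice is not made here.
    """
    wealth = initial_wealth
    accepted = []
    for i, p in enumerate(p_values, start=1):
        if wealth <= 0:
            break  # no budget left to test further features
        alpha = wealth / (2 * i)  # threshold for the i-th incoming feature
        if p < alpha:
            accepted.append(i - 1)
            wealth += payout - alpha  # accepting a feature earns wealth back
        else:
            wealth -= alpha  # a rejected feature costs the amount bet on it
    return accepted


# Made-up p-values for four incoming features.
chosen = alpha_investing([0.001, 0.8, 0.02, 0.5])
```

Swapping the statistical test only changes how `p_values` is produced, which is the part that multiplies the number of experiments.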
Future Direction
Beyond DPP, the focus I would like to have is on grafting and gradient descent approaches. I believe that extending these ideas to non-linear classes of models, especially tree-based models, would have large implications.
Some ideas include:
- extending the VFDT algorithm to a boosting variant, using grafting ideas in a stacking sense
- making use of the RuleFit algorithm to graft future features, and then reconstructing decision trees afterwards
Fitness and Health
Health has always taken a backseat when study gets tough, which is why I am putting something in here about it. Health comes in two parts:
- Diet
- Exercise
Diet
Without a “steady” or proper diet plan, there may come a time when I should do some kind of meal prep. What this looks like is still unknown at this stage. Portion control and not overeating are probably the two most important parts.
In the long run, the goal is to lose a bit of weight, while aiming mostly to maintain what I have currently rather than making drastic changes.
Exercise
The exercise routine will probably consist of the “benchmark” CrossFit workouts. In a weird kind of way, these are quick and at least push my limits.
Workouts completed:
Fran
15-12-9 (should be 21-15-9)
Thrusters (40kg) + pullups
Jackie
Row 1000m, 50 Thrusters (20kg), 30 pullups
Helen
3 rounds
run 400m, 21 KB swings (24kg), 12 pullups
Future workouts:
Elizabeth
21-15-9
cleans (60kg), ring dip
Grace
30 clean and jerks
Bear Complex
7 unbroken sets for weight
power clean, front squat, push press, back squat, push press
DT
5 rounds
12 deadlifts, 9 hang power clean, 6 push press/jerk (all @70kg)
… and probably many more! There are probably deficiencies in my movement around things like:
- Handstand
- Dips
- Muscle ups
but I guess I will have to address them slowly.