Although Vowpal Wabbit has mostly lost out in the world of Deep Learning, I still think there is much value in studying how it serves features: launching a Vowpal Wabbit daemon and streaming new data into it is very much the kind of pipeline I would like to see when thinking about how a feature store could be built.

The Data Format

The data format follows Unix-like ideals: it presumes denormalised data in the form:

<label> <weight> <tag> | <feat1> <feat2>

with each feature encoded sparsely in the form <key>:<value>. The pipeline can then load from what is essentially a jsonlines-like interface to perform both scoring and training.
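
For illustration, a single line in this format (with a made-up label, tag, and feature names) might look like:

1 1.0 zebra| height:1.5 length:2.0

which reads as: a label of 1, an importance weight of 1.0, a tag of "zebra", and two sparse features.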

Why Should We Care?

This kind of file-based format is very powerful for generating data for consumption in a pipeline. As a first pass, a similar approach can serve as a loose prototype, though we can probably rely on a JSON interface rather than the raw VW format. Something like:

mq --table my_table --data my_data

will consume the incoming row of data and perform "ETL" on it to generate records in a denormalised table. This format allows us to "play back" items to repopulate the table, or even to overwrite items, depending on how it is implemented.

from tinydb import TinyDB, Query
import argparse
import tempfile
import json
import os


parser = argparse.ArgumentParser(description="Maquette interface")
parser.add_argument("--table", type=str, action="store")
parser.add_argument("--data", type=str, action="store")


def insert_data(table, data):
    # parse the incoming row; silently skip anything that isn't valid JSON
    try:
        json_data = json.loads(data)
    except ValueError:
        return None

    # append the row to the named table as-is (no aggregation yet);
    # note: relies on the module-level `db` created below
    tbl = db.table(table)
    tbl.insert(json_data)
    return None


# back the prototype with a TinyDB file in the system temp directory
tmpdir = tempfile.gettempdir()
db_file = os.path.join(tmpdir, 'test.db')
db = TinyDB(db_file)

# simulate two CLI invocations of the mq-style interface
args = parser.parse_args(['--table', 'mytable', '--data', '{"hello":1, "world":2}'])
insert_data(args.table, args.data)

args = parser.parse_args(['--table', 'mytable', '--data', '{"hello":1, "world":3}'])
insert_data(args.table, args.data)


tbl = db.table('mytable')
query = Query()
tbl.search(query.hello == 1)  # [{'hello': 1, 'world': 2}, {'hello': 1, 'world': 3}]
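
Because each incoming row is a single line of JSON, the "play back" idea mentioned above falls out almost for free: repopulating a table is just a loop over a log file. A minimal sketch, assuming rows have been appended to a hypothetical events.jsonl:

# replay a (hypothetical) jsonlines log to rebuild the table from scratch
with open("events.jsonl") as f:
    for line in f:
        insert_data("mytable", line)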

Okay, so we have a start; but this isn't quite enough. We still need to define:

  • The primary key
  • What aggregation types we want to perform
  • What timestamp or time granularity we should consider

Let's ignore time for now (a discussion for another day) and tackle the first two items. Say we want to ingest data against a primary key and maintain both a count and a sum over the incoming values.

from tinydb import TinyDB, where
import argparse
import tempfile
import json
import os


parser = argparse.ArgumentParser(description="Maquette interface")
parser.add_argument("--table", type=str, action="store")
parser.add_argument("--data", type=str, action="store")
parser.add_argument("--id", type=str, action="store")


def insert_data(db, table, data, key):
    # parse the incoming row; silently skip anything that isn't valid JSON
    try:
        json_data = json.loads(data)
    except ValueError:
        return None

    tbl = db.table(table)
    feats = tbl.search(where(key) == json_data[key])
    if feats:
        # a record with this primary key already exists, so update its aggregates
        old_data = feats[0]
        for k, v in json_data.items():
            if k == key:
                continue

            # hard code the aggregations (sum and count) for now
            sum_field = f"{k}_sum"
            count_field = f"{k}_count"
            # on the first update the aggregate fields don't exist yet,
            # so fold in the raw value stored by the original insert
            sum_val = old_data.get(sum_field, old_data.get(k, 0)) + v
            count_val = old_data.get(count_field, 1 if k in old_data else 0) + 1
            old_data[sum_field] = sum_val
            old_data[count_field] = count_val
            old_data[k] = v  # also keep the latest raw value
        json_data = old_data.copy()

    # insert the record, or replace the existing one with the same key
    tbl.upsert(json_data, where(key) == json_data[key])
    return None


# reuse the same TinyDB file as before; delete test.db first for a clean run
tmpdir = tempfile.gettempdir()
db_file = os.path.join(tmpdir, 'test.db')
db = TinyDB(db_file)
tbl = db.table('mytable')

args = parser.parse_args(['--table', 'mytable', '--data', '{"hello":1, "world":2, "id":1}', "--id", "id"])
insert_data(db, args.table, args.data, args.id)
print(tbl.search(where(args.id) == 1))
# [{'hello': 1, 'world': 2, 'id': 1}]

args = parser.parse_args(['--table', 'mytable', '--data', '{"hello":1, "world":3, "id":1}', "--id", "id"])
insert_data(db, args.table, args.data, args.id)
print(tbl.search(where(args.id) == 1))
# [{'hello': 1, 'world': 3, 'id': 1, 'hello_sum': 2, 'hello_count': 2, 'world_sum': 5, 'world_count': 2}]

Okay, so now we have the start of an update mechanism for incoming data! In the future, we'll look into how we can "daemonise" it.
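
As a teaser, one possible shape for that daemon (a minimal sketch using Python's built-in socketserver; the port and table name are arbitrary choices, not a settled design) is a TCP server that reads newline-delimited JSON rows and feeds them through insert_data, much as the Vowpal Wabbit daemon consumes its own format over a socket:

import socketserver


class IngestHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # each client streams newline-delimited JSON rows, VW-daemon style
        for line in self.rfile:
            insert_data(db, "mytable", line.decode("utf-8"), "id")


if __name__ == "__main__":
    # arbitrary port; reuses the db and insert_data defined above
    with socketserver.TCPServer(("localhost", 9999), IngestHandler) as server:
        server.serve_forever()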