When building a feature store on “small data”, the easiest approach is to delay computation as late as possible: define a pipeline (in Spark or otherwise) so that features are computed on demand, rather than eagerly. This works well when the feature set is small, but with a massive feature set you end up recomputing the same features over and over again.

To this end, a reproducible feature pipeline is fairly straightforward to achieve: parse the raw data in a repeatable pipeline, and don’t persist the intermediary steps.
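As a minimal sketch of that idea (using pandas and hypothetical column names, rather than the vaex example that follows), a repeatable pipeline is just a pure function from raw data to features, with nothing intermediate persisted:

```python
import pandas as pd

def build_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Pure function: the same raw input always yields the same features."""
    # Intermediary steps live only inside this function; nothing is stored.
    cleaned = raw.dropna(subset=["amount"])
    return cleaned.groupby("customer_id", as_index=False)["amount"].sum()

raw = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [10.0, 5.0, None],
})
features = build_features(raw)
```

Re-running `build_features` on the same raw data always reproduces the same feature table, which is exactly the property that makes storing intermediaries optional.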

But in an enterprise feature store, this might not be wise; it boils down to the needs and expectations of the particular organisation and the roles and responsibilities within it. If the feature store needs to be governed, who looks after it? If the intermediary pipeline steps require monitoring, how would that be done?

import featuretools as ft
import datetime
import vaex
import numpy as np

data = ft.demo.load_mock_customer()
data_vx = {k:vaex.from_pandas(v) for k,v in data.items()}

def feature_customer(data_vx, end_time, start_time):
    # Total transaction amount per customer, using only the customers,
    # sessions and transactions visible within the given time window.
    customer_df = data_vx['customers']
    session_df = data_vx['sessions']
    transaction_df = data_vx['transactions']

    cdf = customer_df[customer_df['join_date'] < end_time]
    sdf = session_df[(session_df['session_start'] < end_time) & (session_df['session_start'] > start_time)]
    tdf = transaction_df[(transaction_df['transaction_time'] < end_time) & (transaction_df['transaction_time'] > start_time)]

    joined = (cdf.join(sdf, on='customer_id', allow_duplication=True)
                 .join(tdf, on='session_id', allow_duplication=True))
    return joined[['customer_id', 'amount']].groupby("customer_id").agg("sum")



end_time = np.datetime64(datetime.datetime(2014, 6, 20))
start_time = np.datetime64(datetime.datetime(2014, 6, 20) - datetime.timedelta(days=300))

feature_customer(data_vx, end_time, start_time)


end_time = np.datetime64(datetime.datetime(2014, 2, 20))
start_time = np.datetime64(datetime.datetime(2014, 2, 20) - datetime.timedelta(days=10))

feature_customer(data_vx, end_time, start_time)  # this errors: the window selects no rows and we do no error handling
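A hedged sketch of the missing error handling, written in pandas rather than vaex for brevity (`window_sum` is a hypothetical helper, not part of the example above): treat an empty window as a valid answer and return an empty result instead of failing.

```python
import pandas as pd

def window_sum(transactions: pd.DataFrame, start_time, end_time) -> pd.DataFrame:
    """Sum amounts per customer inside [start_time, end_time); empty-safe."""
    mask = ((transactions["transaction_time"] >= start_time)
            & (transactions["transaction_time"] < end_time))
    window = transactions[mask]
    if window.empty:
        # An empty window is a legitimate outcome, not an error.
        return pd.DataFrame(columns=["customer_id", "amount"])
    return window.groupby("customer_id", as_index=False)["amount"].sum()

tx = pd.DataFrame({
    "customer_id": [1, 2],
    "transaction_time": pd.to_datetime(["2014-06-01", "2014-06-10"]),
    "amount": [10.0, 20.0],
})
empty = window_sum(tx, pd.Timestamp("2014-02-10"), pd.Timestamp("2014-02-20"))
full = window_sum(tx, pd.Timestamp("2014-01-01"), pd.Timestamp("2014-12-31"))
```

The same guard (check the filtered frame before joining and aggregating) would slot into `feature_customer` above.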

featuretools takes a similar approach through its cutoff-time mechanism, though in a less transparent fashion.
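The cutoff-time semantics can be sketched in plain pandas (hypothetical data and helper; in featuretools itself this is the `cutoff_time` argument to `ft.dfs` / `calculate_feature_matrix`): each feature value is computed using only the rows visible before the cutoff.

```python
import pandas as pd

def total_spend_at(transactions: pd.DataFrame, customer_id, cutoff) -> float:
    """Total spend for one customer, using only rows before the cutoff time."""
    visible = transactions[
        (transactions["customer_id"] == customer_id)
        & (transactions["transaction_time"] < cutoff)
    ]
    return float(visible["amount"].sum())

tx = pd.DataFrame({
    "customer_id": [1, 1, 1],
    "transaction_time": pd.to_datetime(["2014-01-05", "2014-03-01", "2014-07-01"]),
    "amount": [10.0, 20.0, 30.0],
})
early = total_spend_at(tx, 1, pd.Timestamp("2014-02-20"))  # only the first row is visible
late = total_spend_at(tx, 1, pd.Timestamp("2014-12-31"))   # all three rows are visible
```

Moving the cutoff changes which history the feature sees, which is the same lazy, point-in-time recomputation the vaex example above performs with its `start_time`/`end_time` window.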