Probably the more common data structure is the “slowly changing dimension” (SCD). We’ll frame it in the following form, whereby the table includes:
- Entity
- Attribute/value
- Effective date
- Current flag
That said, any form which captures the history will do.
```python
import pandas as pd
import numpy as np


def gen_entity(entity=1, max_rows=10):
    """Generate a random SCD history for a single entity."""
    gen_rows = np.random.choice(range(max_rows))  # number of historical rows
    rows = []
    time = 0
    # historical (superseded) rows
    for _ in range(gen_rows):
        time += np.random.uniform(0, 10)
        rows.append({
            'entity': entity,
            'attribute0': np.random.choice(range(10)),
            'attribute1': np.random.choice(range(10)),
            'eff_date': time,
            'current_flag': 0
        })
    # the final row is the current record
    rows.append({
        'entity': entity,
        'attribute0': np.random.choice(range(10)),
        'attribute1': np.random.choice(range(10)),
        'eff_date': time + np.random.uniform(0, 10),
        'current_flag': 1
    })
    return rows


scd_df = []
for idx in range(10):
    scd_df.extend(gen_entity(idx))
scd_df = pd.DataFrame(scd_df)
```
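Before using such a table, it is worth stating the invariants it should satisfy. A small hand-written table (hypothetical values) makes them concrete:

```python
import pandas as pd

# A tiny hand-written SCD table (hypothetical values) used only to state the
# invariants: eff_date increases within an entity, and exactly one row per
# entity carries current_flag = 1.
example_scd = pd.DataFrame([
    {'entity': 0, 'eff_date': 2.0, 'current_flag': 0},
    {'entity': 0, 'eff_date': 9.0, 'current_flag': 1},
    {'entity': 1, 'eff_date': 5.0, 'current_flag': 1},
])

# Exactly one current row per entity
assert (example_scd.groupby('entity')['current_flag'].sum() == 1).all()

# Effective dates strictly increase within each entity
assert example_scd.groupby('entity')['eff_date'].apply(
    lambda s: s.is_monotonic_increasing).all()
```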
Then, to compute the entity-attribute-value (EAV) table as of a particular time, one needs to extract the appropriate row looking backwards from that time. Although this is conceptually “easy” to do, it is computationally “expensive”, because there is no guarantee on how far back the look-behind needs to reach. The easiest way to observe this is to imagine the SQL construction that is needed.
```sql
WITH my_temp_table AS (
    SELECT
        entity,
        MAX(eff_date) AS eff_date
    FROM my_table
    WHERE eff_date < snapshotdate
    GROUP BY entity
)
SELECT
    my_table.entity,
    my_table.attribute0,
    my_table.attribute1,
    my_table.eff_date
FROM my_table
INNER JOIN my_temp_table
    ON my_table.entity = my_temp_table.entity
    AND my_table.eff_date = my_temp_table.eff_date;
```
We can do this in Python as well; it is arguably simpler, as the filter, sort, and per-entity lookup become a one-liner:
```python
snapshot = 15
(scd_df[scd_df['eff_date'] < snapshot]
 .sort_values('eff_date', ascending=False)
 .groupby('entity')
 .head(1))
```
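As an aside, the same lookup can be expressed as a backward as-of join with `pd.merge_asof`. A minimal sketch, using a hypothetical hand-written table:

```python
import pandas as pd

# A hypothetical SCD table (columns as in the text) to demonstrate the join
scd_small = pd.DataFrame([
    {'entity': 0, 'attribute0': 3, 'attribute1': 1, 'eff_date': 2.0, 'current_flag': 0},
    {'entity': 0, 'attribute0': 5, 'attribute1': 4, 'eff_date': 9.0, 'current_flag': 0},
    {'entity': 0, 'attribute0': 7, 'attribute1': 2, 'eff_date': 20.0, 'current_flag': 1},
    {'entity': 1, 'attribute0': 1, 'attribute1': 8, 'eff_date': 5.0, 'current_flag': 1},
])

snapshot = 15.0

# One probe row per entity, stamped with the snapshot time
probe = pd.DataFrame({'entity': scd_small['entity'].unique(),
                      'eff_date': snapshot})

# Keep the matched row's own effective date under a separate name,
# since merge_asof reports the left-hand 'on' values in the result
right = scd_small.assign(row_eff_date=scd_small['eff_date'])

# Backward as-of join: latest row per entity strictly before the snapshot
as_of = pd.merge_asof(
    probe.sort_values('eff_date'),
    right.sort_values('eff_date'),
    on='eff_date',
    by='entity',
    direction='backward',
    allow_exact_matches=False,
)
```

The `allow_exact_matches=False` flag mirrors the strict `<` comparison used above; databases often expose the same idea as a lateral or as-of join.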
Wrapped as a scikit-learn transformer:

```python
from sklearn.base import TransformerMixin


class SCDSnapshot(TransformerMixin):
    def __init__(self, time):
        self.time = time

    def fit(self, X, y=None):
        return self

    def transform(self, X, **kwargs):
        # allow the snapshot time to be overridden at transform time
        time = kwargs.get('time', self.time)
        return (X[X['eff_date'] < time]
                .sort_values('eff_date', ascending=False)
                .groupby('entity')
                .head(1))
```
This kind of transformation requires a sort and a lookup, which is ideally performed within a database. It also makes generating point-in-time tables quite effective; users may be tempted to generate additional time-based features off this data using the snapshot date.
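For instance, point-in-time tables for several snapshot dates can be stacked by reusing the same filter/sort/head pattern; a sketch, with a hypothetical history and arbitrary snapshot dates:

```python
import pandas as pd

# Hypothetical SCD history for two entities
history = pd.DataFrame([
    {'entity': 0, 'attribute0': 3, 'eff_date': 2.0},
    {'entity': 0, 'attribute0': 5, 'eff_date': 9.0},
    {'entity': 0, 'attribute0': 7, 'eff_date': 20.0},
    {'entity': 1, 'attribute0': 1, 'eff_date': 5.0},
])

def as_of(df, snapshot):
    """Latest row per entity strictly before the snapshot date."""
    return (df[df['eff_date'] < snapshot]
            .sort_values('eff_date', ascending=False)
            .groupby('entity').head(1)
            .assign(snapshot=snapshot))

# One point-in-time table per snapshot date, stacked
pit = pd.concat([as_of(history, s) for s in (10.0, 25.0)],
                ignore_index=True)
```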
Personally, I do not believe these should form part of the feature store, but rather a feature pipeline where they are calculated dynamically; the nature of snapshots and time-based features is fairly arbitrary, which suggests that composing such features may cause an exponential explosion in stored features for little performance benefit.
This kind of format also poses the problem of where to draw the boundary between event-driven and SCD approaches, particularly when you begin performing aggregations; in that scenario it is better to focus on event-driven records.
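To make the contrast concrete, here is a sketch of the event-driven side, where aggregations are naturally point-in-time: store raw events and reduce them up to the snapshot date. Table and column names here are hypothetical.

```python
import pandas as pd

# Hypothetical raw event log, one row per event
events = pd.DataFrame([
    {'entity': 0, 'event_time': 1.0, 'amount': 10.0},
    {'entity': 0, 'event_time': 4.0, 'amount': 5.0},
    {'entity': 0, 'event_time': 20.0, 'amount': 2.0},
    {'entity': 1, 'event_time': 3.0, 'amount': 7.0},
])

snapshot = 15.0

# Aggregations are naturally point-in-time: filter, then reduce per entity
features = (events[events['event_time'] < snapshot]
            .groupby('entity')
            .agg(n_events=('event_time', 'size'),
                 total_amount=('amount', 'sum'))
            .reset_index())
```

Unlike the SCD lookup, no row "as of" a date needs to be resolved; the filter on `event_time` alone makes the aggregate correct for any snapshot.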