Probably the most common data structure is the “slowly changing dimension” (SCD). We’ll frame it in the following form, whereby the table includes:

  • entity
  • attribute/value
  • effective date
  • current flag

That said, any form which captures history will do.

import pandas as pd
import numpy as np


def gen_entity(entity=1, max_rows=10):
    """Generate a random SCD history for a single entity."""
    gen_rows = np.random.choice(range(max_rows))
    rows = []
    time = 0
    # historical rows, marked with current_flag = 0
    for _ in range(gen_rows):
        time += np.random.uniform(0, 10)
        rows.append({
            'entity': entity,
            'attribute0': np.random.choice(range(10)),
            'attribute1': np.random.choice(range(10)),
            'eff_date': time,
            'current_flag': 0
        })
    # final row, marked with current_flag = 1 as the active record
    rows.append({
        'entity': entity,
        'attribute0': np.random.choice(range(10)),
        'attribute1': np.random.choice(range(10)),
        'eff_date': time + np.random.uniform(0, 10),
        'current_flag': 1
    })
    return rows

scd_df = []
for idx in range(10):
    scd_df.extend(gen_entity(idx))

scd_df = pd.DataFrame(scd_df)

Then, to compute the EAR table, one needs to extract the appropriate row as of a particular time (looking backwards). Although this is conceptually “easy” to do, it is computationally “expensive” because there is no guarantee on the look-behind construct. The easiest way to observe this is to imagine the SQL code construction that is needed.

WITH my_temp_table AS (
    SELECT
        entity,
        MAX(eff_date) AS eff_date
    FROM my_table
    WHERE eff_date < snapshotdate
    GROUP BY entity
)
SELECT
    my_table.entity,
    my_table.attribute0,
    my_table.attribute1,
    my_table.eff_date
FROM my_table
INNER JOIN my_temp_table
    ON my_table.entity = my_temp_table.entity
    AND my_table.eff_date = my_temp_table.eff_date;
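The same join-based construction can be mirrored directly in pandas: compute the latest effective date per entity, then join back to recover the full rows. A sketch, using a small hand-built table in place of the generated scd_df:

```python
import pandas as pd

# toy SCD table standing in for the generated scd_df
scd_df = pd.DataFrame([
    {'entity': 0, 'attribute0': 1, 'eff_date': 3.0, 'current_flag': 0},
    {'entity': 0, 'attribute0': 2, 'eff_date': 9.0, 'current_flag': 1},
    {'entity': 1, 'attribute0': 5, 'eff_date': 12.0, 'current_flag': 1},
])

snapshot = 15

# mirror of my_temp_table: latest eff_date per entity before the snapshot
latest = (scd_df[scd_df['eff_date'] < snapshot]
          .groupby('entity', as_index=False)['eff_date'].max())

# mirror of the INNER JOIN: recover the full row for each (entity, eff_date)
as_of = scd_df.merge(latest, on=['entity', 'eff_date'], how='inner')
```

This is structurally identical to the SQL above, though as we'll see below, pandas allows a more compact formulation.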

We can do this in Python as well; it is a bit simpler since pandas can sort for us. It becomes a one-liner:

snapshot = 15
scd_df[scd_df['eff_date'] < snapshot].sort_values('eff_date', ascending=False).groupby('entity').head(1)

from sklearn.base import TransformerMixin


class SCDSnapshot(TransformerMixin):
    def __init__(self, time):
        self.time = time

    def fit(self, X, y=None):
        return self

    def transform(self, X, **kwargs):
        # allow overriding the configured snapshot time per call
        time = kwargs.get('time', self.time)
        scd_df = (X[X['eff_date'] < time]
                    .sort_values('eff_date', ascending=False)
                    .groupby('entity').head(1))
        return scd_df
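Usage is then straightforward; a sketch with a small hand-built table (the class is repeated here so the example is self-contained):

```python
import pandas as pd
from sklearn.base import TransformerMixin


class SCDSnapshot(TransformerMixin):
    """As-of snapshot over an SCD table (as defined above)."""
    def __init__(self, time):
        self.time = time

    def fit(self, X, y=None):
        return self

    def transform(self, X, **kwargs):
        time = kwargs.get('time', self.time)
        return (X[X['eff_date'] < time]
                .sort_values('eff_date', ascending=False)
                .groupby('entity').head(1))


scd_df = pd.DataFrame([
    {'entity': 0, 'attribute0': 1, 'eff_date': 3.0},
    {'entity': 0, 'attribute0': 2, 'eff_date': 9.0},
    {'entity': 1, 'attribute0': 5, 'eff_date': 12.0},
])

snap = SCDSnapshot(time=15)
asof = snap.transform(scd_df)             # one row per entity as of t=15
earlier = snap.transform(scd_df, time=5)  # override: only entity 0 existed
```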

This kind of transformation requires a sort and a lookup, which is ideally completed within a database. It also allows for generating point-in-time tables quite effectively; users may be tempted to generate additional time-based features off this data using the snapshot date.

Personally, I do not believe these should form part of the feature store, but rather a feature pipeline where they are calculated dynamically; the nature of snapshots and time-based features is fairly arbitrary, which suggests that composing and storing such features may cause an exponential explosion for little performance benefit.
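As a sketch of what such a dynamically computed feature might look like: a hypothetical `time_since_update` column derived from the snapshot date on the fly, rather than persisted in the store.

```python
import pandas as pd

snapshot = 15

scd_df = pd.DataFrame([
    {'entity': 0, 'eff_date': 3.0},
    {'entity': 0, 'eff_date': 9.0},
    {'entity': 1, 'eff_date': 12.0},
])

# as-of rows, as produced by the snapshot transform above
asof = (scd_df[scd_df['eff_date'] < snapshot]
        .sort_values('eff_date', ascending=False)
        .groupby('entity').head(1))

# a time-based feature computed at pipeline time, never stored
asof = asof.assign(time_since_update=snapshot - asof['eff_date'])
```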

This kind of format also poses problems around where to draw the boundary between event-driven approaches and SCD approaches, particularly once you begin aggregations; in that scenario it is better to focus on event-driven records.
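For illustration, an event-driven record simply logs each change as an event, and aggregations reduce over all events observed before the snapshot. A sketch with a hypothetical events table and column names:

```python
import pandas as pd

# hypothetical event log: one row per change, rather than SCD rows
events = pd.DataFrame([
    {'entity': 0, 'event_time': 3.0, 'value': 10},
    {'entity': 0, 'event_time': 9.0, 'value': 12},
    {'entity': 1, 'event_time': 12.0, 'value': 7},
])

snapshot = 15

# aggregate over everything observed before the snapshot
agg = (events[events['event_time'] < snapshot]
       .groupby('entity')
       .agg(n_events=('event_time', 'count'),
            mean_value=('value', 'mean')))
```

With events, the aggregation boundary is just a filter on event time; with SCD rows, the same aggregation would first require reconstructing which rows were valid over the window.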