If we consider lambda architecture as a reference point to Maquette, what kind of considerations do we have to make when designing it? Are all components really necessarily?

In my mind, this delineation revolves around trade-offs involving:

  • Push versus Pull
  • Stateful versus Stateless
graph LR; A[New Data] --> B[Stream Processing] B --> D[Real-time View] E --> F[Application] A --> C[Master Dataset] C --> E[Batch View] D --> F subgraph b[Maybe Speed Layer] B end subgraph c[Batch Layer] C end subgraph d[Serving Layer] D E end style A fill:#FFF,stroke:#000,color:#FFF style F fill:#FFF,stroke:#000,color:#FFF style b fill:#aec7e8,stroke:#000,color:#FFF style c fill:#aec7e8,stroke:#000,color:#FFF style d fill:#aec7e8,stroke:#000,color:#FFF style B fill:#b4cbc5,stroke:#000,color:#FFF style C fill:#b4cbc5,stroke:#000,color:#FFF style D fill:#b4cbc5,stroke:#000,color:#FFF style E fill:#b4cbc5,stroke:#000,color:#FFF

In reference to a “pure” lambda architecture, the focus is on primarily a push approach, as well as injecting the notion of state in its processing. It is fundamentally an event driven approach whereby:

  • New data arrives
  • It gets pushed into a streaming processing (which keeps track of its internal state)
  • Finally moves to a real-time table
  • Made available for client to consume (to pull in this instance of the feature store).

However, even in the scenario with Michelangelo, the fundamental assumption is that the models being served will be under a pull pattern. With that in mind let’s just make our lives easier!

graph LR; A[New Data] --> B[On-Demand Processing] B --> D[???] E --> F[Application] A --> C[Master Dataset] C --> E[Batch View] D --> F subgraph b[Maybe Speed Layer] B end subgraph c[Batch Layer] C end subgraph d[Serving Layer] D E end style A fill:#FFF,stroke:#000,color:#FFF style F fill:#FFF,stroke:#000,color:#FFF style b fill:#aec7e8,stroke:#000,color:#FFF style c fill:#DDD,stroke:#000,color:#FFF style d fill:#DDD,stroke:#000,color:#FFF style B fill:#b4cbc5,stroke:#000,color:#FFF style C fill:#DDD,stroke:#000,color:#FFF style D fill:#b4cbc5,stroke:#000,color:#FFF style E fill:#DDD,stroke:#000,color:#FFF

However if we do this, we then will lose our “real-time” view - or do we? One way to resolve this, is to reset expectations on what this “real-time” view entails. Instead the guarentees should be around “best efforts” to provide a “batch with augmented latest data” as part of the serving layer. For example, if we want the latest information of a customer, we will retrieve the latest information we have as part of the batch store, and then perform an upsert based on any information provided as part of the “on-demand” processing component.

graph LR; A[New Data] --> B[On-Demand Processing] E --> F[Application] A --> C[Master Dataset] C --> E[Batch View] C --> D[Batch View
Best Efforts Upsert] B --> D D --> F subgraph b[Speed Layer] B end subgraph c[Batch Layer] C end subgraph d[Serving Layer] D E end style A fill:#FFF,stroke:#000,color:#FFF style F fill:#FFF,stroke:#000,color:#FFF style b fill:#aec7e8,stroke:#000,color:#FFF style c fill:#DDD,stroke:#000,color:#FFF style d fill:#DDD,stroke:#000,color:#FFF style B fill:#b4cbc5,stroke:#000,color:#FFF style C fill:#DDD,stroke:#000,color:#FFF style D fill:#b4cbc5,stroke:#000,color:#FFF style E fill:#DDD,stroke:#000,color:#FFF

In this way, we can retrieve workflows which are near-realtime whilst lowering the complexity of the workflow. This approach isn’t perfect though, as the inability to perform true “streaming” views may place limits on the ability to perform aggregations or other stateful features.

However what it means is that the act of retrieving on-demand features can be completely decoupled from the upstream ETL processes! Ultimately this vastly simplifies the architecture with respect to what we need to build.

Recap

What’s changed?

  • We’ve reverted to a pull model rather than an event driven push model
  • We’ve decoupled the different layers by making them completely stateless

Furthermore, this means that the components can be built out in a somewhat logical manner:

  1. Build out the batch layer, whereby we acquire data into our “data lake”
  2. Off the batch layer construct the appropriate denormalized data tables
  3. Once the denormalized data tables are built, construct a way to pull data in realtime to perform a virtual “upsert” when feature serving is needed “on-demand”

Going through the technologies - what would we actually need? What is clear from this workflow is that we do not need Spark, or any Big Data technology and that this is achieveable using modest data analytics approaches.

Thinking Mathematically

What are the machine learning consequences of removing some of these constraints? The best way to visualise this is to use probabilistic graphical models to reason around the utility of statefulness. Let \(x_i\) represents the query at time \(i\), similarly let \(s_i\) represents the state of the database at time \(t\). Then under streaming scenario with stateful interactions, the PGM is reflected as

graph LR x0((x0)) --> s0((s0)) x1((x1)) --> s1((s1)) x2((x2)) --> s2((s2)) s0 --> x1 s1 --> x2 s2 --> x3((x3)) x0 --> x1 x0 --> x2 x0 --> x3 x1 --> x2 x1 --> x3 x2 --> x3

In the stateless scenario, the previous queries have no bearing (i.e. no state is passed into the query of the database)

graph LR x0((x0)) --> s0((s0)) x1((x1)) --> s1((s1)) x2((x2)) --> s2((s2)) s0 --> x1 s1 --> x2 s2 --> x3((x3))

This decoupling of state will mean less complex transformations are possible, unless they are wholly encapsulated as part of the database design. As such there will be features missing if we presume statelessness to be present, but it will vastly simplify deployment considerations - which of course is an ideal candidate when we are trying to build and design a solution ourselves!