Designing a Feature Store - Reflecting on Uber's Michelangelo

Of course, Maquette is named after Uber's Michelangelo, which makes it an easy place to start. Michelangelo is probably the platform we all point to and say “we want that!”.

Sketching Michelangelo

Michelangelo has numerous goals, but we’ll leave those for another day. First, let’s examine the choice of technology:

Serverless framework or any other container framework

Kafka or any other stream processing engine, for a push (or event-driven) framework
Cassandra or any wide-column database, for the express purpose of serving denormalized data
Apache Spark or any framework for handling ETL workloads
Hive or any kind of data store for hosting data marts that have already been processed
Data Lake (Michelangelo explicitly calls out Hadoop), which is used to populate data marts in downstream systems
Serverless containers, which increasingly means Kubernetes clusters, for serving models in an API farm

But what is the function of each component? Is each one necessary, or even important? To understand how this framework was likely thought out, we should consider the Lambda architecture, which splits the work into a batch layer, a speed layer and a serving layer.

graph LR;
A[New Data] --> B[Stream Processing]
B --> D[Real-time View]
E --> F[Application]
A --> C[Master Dataset]
C --> E[Batch View]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style b fill:#aec7e8,stroke:#000,color:#FFF
style c fill:#aec7e8,stroke:#000,color:#FFF
style d fill:#aec7e8,stroke:#000,color:#FFF
style B fill:#b4cbc5,stroke:#000,color:#FFF
style C fill:#b4cbc5,stroke:#000,color:#FFF
style D fill:#b4cbc5,stroke:#000,color:#FFF
style E fill:#b4cbc5,stroke:#000,color:#FFF
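To make the three layers concrete, here is a minimal pure-Python sketch of the flow in the diagram above. It is a toy under stated assumptions, not how Michelangelo implements anything: the `LambdaStore` class, its method names and the choice of a simple running sum as the "view" are all hypothetical, picked only to show new data landing in both layers and the application reading a merge of the two views.

```python
from collections import defaultdict


class LambdaStore:
    """Toy Lambda architecture (hypothetical): batch view + real-time view, merged on read."""

    def __init__(self):
        # Batch layer: an append-only master dataset.
        self.master_dataset = []
        # Serving layer: a view precomputed by the batch layer.
        self.batch_view = {}
        # Serving layer: a view kept up to date by the speed layer.
        self.realtime_view = defaultdict(int)

    def ingest(self, key, value):
        """New data is sent to both the batch layer and the speed layer."""
        self.master_dataset.append((key, value))
        self.realtime_view[key] += value  # speed layer: cheap incremental update

    def run_batch(self):
        """Recompute the batch view from the full master dataset.

        Once the batch view has caught up, the real-time deltas it now
        covers are discarded.
        """
        view = defaultdict(int)
        for key, value in self.master_dataset:
            view[key] += value
        self.batch_view = dict(view)
        self.realtime_view.clear()

    def query(self, key):
        """The application sees the batch view plus any newer real-time deltas."""
        return self.batch_view.get(key, 0) + self.realtime_view.get(key, 0)


store = LambdaStore()
store.ingest("rider_1", 2)
store.ingest("rider_1", 3)
before_batch = store.query("rider_1")  # served entirely by the real-time view
store.run_batch()
after_batch = store.query("rider_1")   # same answer, now served by the batch view
```

The point of the split is that the batch recomputation can be slow and thorough while the speed layer only has to cover whatever arrived since the last batch run.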

If we overlay the components on this architecture, it looks like this:

graph LR;
A[New Data] --> B[Kafka]
B --> D[Cassandra]
E --> F[Application]
A --> C[Hadoop]
C --> E[Hive]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style b fill:#aec7e8,stroke:#000,color:#FFF
style c fill:#aec7e8,stroke:#000,color:#FFF
style d fill:#aec7e8,stroke:#000,color:#FFF
style B fill:#b4cbc5,stroke:#000,color:#FFF
style C fill:#b4cbc5,stroke:#000,color:#FFF
style D fill:#b4cbc5,stroke:#000,color:#FFF
style E fill:#b4cbc5,stroke:#000,color:#FFF

An astute reader might ask “where is Spark in all this?” In many frameworks, Spark is the bridge, handling both streaming and batch processing.

graph LR;
A[New Data] --> B[Kafka]
B --Apache Spark--> D[Cassandra]
E --> F[Application]
A --> C[Hadoop]
C --Apache Spark--> E[Hive]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style B fill:#DDD,stroke:#000,color:#000
style C fill:#DDD,stroke:#000,color:#000
style D fill:#DDD,stroke:#000,color:#000
style E fill:#DDD,stroke:#000,color:#000
style b fill:#DDD,stroke:#000,color:#000
style c fill:#DDD,stroke:#000,color:#000
style d fill:#DDD,stroke:#000,color:#000
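The appeal of a single bridge is that the transformation logic can be written once and applied on both paths. The sketch below illustrates that idea in plain Python. It does not use Spark's API, and `trip_features`, `batch_job`, `stream_job` and the event fields are all hypothetical names invented for the example.

```python
from typing import Dict, Iterable, Iterator, List


def trip_features(event: Dict) -> Dict:
    """Hypothetical feature logic, written once and shared by both paths."""
    return {
        "rider_id": event["rider_id"],
        # Guard against zero-distance trips when dividing.
        "fare_per_km": event["fare"] / max(event["distance_km"], 0.1),
    }


def batch_job(events: Iterable[Dict]) -> List[Dict]:
    """Batch path: transform a whole, finite dataset in one go."""
    return [trip_features(e) for e in events]


def stream_job(events: Iterable[Dict]) -> Iterator[Dict]:
    """Streaming path: transform events lazily, one at a time, as they arrive."""
    for e in events:
        yield trip_features(e)


events = [
    {"rider_id": "r1", "fare": 10.0, "distance_km": 5.0},
    {"rider_id": "r2", "fare": 9.0, "distance_km": 3.0},
]
batch_result = batch_job(events)
stream_result = list(stream_job(iter(events)))
```

Because both paths call the same function, the features served in real time and the features computed in batch cannot drift apart through duplicated logic, which is one of the main headaches of running two separate processing engines.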

Within the original framework, Spark was used only for batch, whilst a niche framework called Samza was used for stream processing. I’ve noticed Samza is notably absent from subsequent articles on the Michelangelo framework, which makes me think it has been replaced by Spark or Kafka (Kafka has moved towards KSQL to bridge this gap).

This framework has a lot of complexity and many moving parts. Furthermore, each component would need a team to manage and maintain it in order for it to be useful. Project Maquette is my mini-project: an attempt to build a ‘pure Python’ machine learning systems framework without this overhead. It won’t necessarily be as performant, but along the way we’ll learn about some of the design decisions together.

graph LR;
A[New Data] --> B[To Be Confirmed?]
B --> D[To Be Confirmed?]
E --> F[Application]
A --> C[To Be Confirmed?]
C --> E[To Be Confirmed?]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style B fill:#DDD,stroke:#000,color:#000
style C fill:#DDD,stroke:#000,color:#000
style D fill:#DDD,stroke:#000,color:#000
style E fill:#DDD,stroke:#000,color:#000
style b fill:#DDD,stroke:#000,color:#000
style c fill:#DDD,stroke:#000,color:#000
style d fill:#DDD,stroke:#000,color:#000

New posts on Project Maquette every Wednesday