Designing a Feature Store - Reflecting on Uber's Michelangelo
Maquette is, of course, named after Uber's Michelangelo, which makes it an easy place to start. Michelangelo is probably the platform we all point to and say “we want that!”
Sketching Michelangelo
The goals of Michelangelo are numerous, but we’ll leave those for another day. First, let’s examine the choice of technology:
Kafka, or any other stream processing engine, for a push (event-driven) framework
Cassandra, or any wide-column store, for the express purpose of serving denormalized data
Apache Spark, or any framework for handling ETL workloads
Hive, or any kind of data store for hosting data marts that have already been processed
A data lake (Michelangelo explicitly calls out Hadoop), which is used to populate data marts in downstream systems
Serverless containers, which increasingly seems to mean Kubernetes clusters, for serving models in an API farm
But what is the function of each component? Are they all necessary, or even important? To understand how this framework might have been thought out, we should consider the Lambda architecture.
graph LR;
A[New Data] --> B[Stream Processing]
B --> D[Real-time View]
E --> F[Application]
A --> C[Master Dataset]
C --> E[Batch View]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style b fill:#aec7e8,stroke:#000,color:#FFF
style c fill:#aec7e8,stroke:#000,color:#FFF
style d fill:#aec7e8,stroke:#000,color:#FFF
style B fill:#b4cbc5,stroke:#000,color:#FFF
style C fill:#b4cbc5,stroke:#000,color:#FFF
style D fill:#b4cbc5,stroke:#000,color:#FFF
style E fill:#b4cbc5,stroke:#000,color:#FFF
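To make the diagram concrete, here is a minimal pure-Python sketch of the same flow. It is only an illustration under stated assumptions: the `LambdaStore` class, its method names, and the "sum per key" aggregation are all hypothetical and are not taken from Michelangelo. New data lands in both the master dataset and the speed layer, the batch layer periodically recomputes its view from scratch, and the application reads from both views.

```python
from collections import defaultdict


class LambdaStore:
    """Toy Lambda architecture: a batch view plus a real-time view, merged on read.

    Hypothetical illustration only; the aggregation here is a running sum per key.
    """

    def __init__(self):
        self.master = []                        # batch layer: append-only master dataset
        self.batch_view = {}                    # serving layer: precomputed from master
        self.realtime_view = defaultdict(int)   # serving layer: updated on every event

    def ingest(self, key, value):
        """New data goes to both the master dataset and the speed layer."""
        self.master.append((key, value))
        self.realtime_view[key] += value

    def run_batch(self):
        """Recompute the batch view from the whole master dataset.

        Afterwards the real-time view is cleared, because everything it held
        is now covered by the batch view.
        """
        view = defaultdict(int)
        for key, value in self.master:
            view[key] += value
        self.batch_view = dict(view)
        self.realtime_view.clear()

    def query(self, key):
        """The application combines the batch view and the real-time view."""
        return self.batch_view.get(key, 0) + self.realtime_view.get(key, 0)


store = LambdaStore()
store.ingest("rider_1", 3)
store.ingest("rider_1", 2)
before_batch = store.query("rider_1")   # served entirely by the real-time view
store.run_batch()
store.ingest("rider_1", 4)
after_batch = store.query("rider_1")    # batch view plus events since the batch run
```

The point of the split is that the batch view can always be rebuilt from the master dataset, so the real-time view only ever has to cover the gap since the last batch run.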
If we overlay the components on this architecture, it looks like this:
graph LR;
A[New Data] --> B[Kafka]
B --> D[Cassandra]
E --> F[Application]
A --> C[Hadoop]
C --> E[Hive]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style b fill:#aec7e8,stroke:#000,color:#FFF
style c fill:#aec7e8,stroke:#000,color:#FFF
style d fill:#aec7e8,stroke:#000,color:#FFF
style B fill:#b4cbc5,stroke:#000,color:#FFF
style C fill:#b4cbc5,stroke:#000,color:#FFF
style D fill:#b4cbc5,stroke:#000,color:#FFF
style E fill:#b4cbc5,stroke:#000,color:#FFF
An astute reader might ask “where is Spark in all this?” In many frameworks, Spark has been used as the bridge for both streaming and batch processing.
graph LR;
A[New Data] --> B[Kafka]
B --Apache Spark--> D[Cassandra]
E --> F[Application]
A --> C[Hadoop]
C --Apache Spark--> E[Hive]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style B fill:#DDD,stroke:#000,color:#000
style C fill:#DDD,stroke:#000,color:#000
style D fill:#DDD,stroke:#000,color:#000
style E fill:#DDD,stroke:#000,color:#000
style b fill:#DDD,stroke:#000,color:#000
style c fill:#DDD,stroke:#000,color:#000
style d fill:#DDD,stroke:#000,color:#000
Within the original framework, Spark was used only for batch, whilst a niche framework called Samza was used for stream processing. I’ve noticed that Samza is absent from subsequent articles on the Michelangelo framework, which makes me think it has been replaced by Spark or Kafka (the Kafka ecosystem has moved towards KSQL, now ksqlDB, to bridge this gap).
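One reason a single engine for both paths is attractive is that the feature logic only has to be written once. The sketch below shows that idea in plain Python rather than in Spark or Samza. The function names and the event fields (`trip_id`, `distance_km`, `duration_s`) are hypothetical and chosen purely for illustration.

```python
from typing import Iterable, Iterator, List


def trip_features(event: dict) -> dict:
    """Hypothetical feature logic, written once and reused by both paths."""
    minutes = event["duration_s"] / 60.0
    return {
        "trip_id": event["trip_id"],
        "km_per_min": event["distance_km"] / minutes,
    }


def batch_job(events: List[dict]) -> List[dict]:
    """Batch path: transform a complete, bounded dataset in one go."""
    return [trip_features(e) for e in events]


def stream_job(events: Iterable[dict]) -> Iterator[dict]:
    """Streaming path: transform events one at a time as they arrive."""
    for e in events:
        yield trip_features(e)


events = [
    {"trip_id": "a", "distance_km": 5.0, "duration_s": 600},
    {"trip_id": "b", "distance_km": 12.0, "duration_s": 1200},
]

batch_result = batch_job(events)
stream_result = list(stream_job(iter(events)))
```

Because both jobs call the same `trip_features` function, the batch and streaming outputs cannot drift apart. When two separate engines each carry their own copy of the logic, that consistency has to be maintained by hand.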
This framework has a lot of complexity and a lot of moving parts. Furthermore, each component needs a team to manage and maintain it before it is useful. Project Maquette is my mini-project to attempt to build a ‘pure Python’ machine learning systems framework without this overhead. It won’t necessarily be as performant, but along the way we’ll learn together about some of the design decisions.
graph LR;
A[New Data] --> B[To Be Confirmed?]
B --> D[To Be Confirmed?]
E --> F[Application]
A --> C[To Be Confirmed?]
C --> E[To Be Confirmed?]
D --> F
subgraph b[Speed Layer]
B
end
subgraph c[Batch Layer]
C
end
subgraph d[Serving Layer]
D
E
end
style A fill:#FFF,stroke:#000,color:#FFF
style F fill:#FFF,stroke:#000,color:#FFF
style B fill:#DDD,stroke:#000,color:#000
style C fill:#DDD,stroke:#000,color:#000
style D fill:#DDD,stroke:#000,color:#000
style E fill:#DDD,stroke:#000,color:#000
style b fill:#DDD,stroke:#000,color:#000
style c fill:#DDD,stroke:#000,color:#000
style d fill:#DDD,stroke:#000,color:#000