This article is a technical follow-up to our KubeCon 2026 coverage: a deep dive into the architecture and evolution of Uber’s machine learning platform.
When Uber presented at KubeCon Europe 2026, the numbers they shared silenced the room: 1 million+ diverse workloads deployed onto 200 Kubernetes clusters, 20,000 models trained monthly, 5,300 models actively in production, and over 30 million peak predictions per second.
For most organizations, achieving even 1% of that scale is a multi-year roadmap. Uber’s platform doesn't just support their business; it is their business. From surge pricing and ETA estimation to fraud detection and Generative AI-driven customer support, machine learning sits in the critical path of every user interaction.
But Uber didn't arrive at this architecture overnight. Their journey from scattered Python scripts to a globally federated, Kubernetes-native AI control plane is a masterclass in platform engineering.
Here is the deep dive into how Uber industrialized machine learning, the bottlenecks they hit along the way, and the architectural blueprints they’ve proven at hyperscale.
1. The Pre-Platform Era: The Fragmentation Tax (Pre-2017)
Before 2017, data science at Uber looked like data science at most fast-growing startups today: entirely fragmented.
- The How: Data scientists worked on individual laptops or dedicated EC2 instances using a fragmented toolkit (R, scikit-learn, bespoke Python scripts).
- The What: Each team built separate, one-off systems to pull data, train models, and serve predictions.
- The Bottleneck: Models could only be as large as what fit on a single machine. Once a model was trained, "deploying" it often meant handing an opaque pickle file to a backend engineering team to rewrite in Java or Go.
This lack of standardization meant high operational friction. Teams couldn't easily share features, monitor model drift, or scale prediction serving. Uber realized that building custom infrastructure for every ML use case was economically and operationally unsustainable. They needed a centralized factory.
2. Michelangelo: Standardizing the ML Factory (2017–2022)
To solve the fragmentation tax, Uber built Michelangelo, an end-to-end internal machine learning platform designed to democratize ML across the company. The goal was to standardize the entire lifecycle—from data prep to model deployment.
Michelangelo introduced several architectural patterns that have since become industry standard:
- The Centralized Feature Store: Instead of every team writing their own Spark jobs to calculate "user's trip frequency in the last 30 days," features were calculated once, stored, and shared.
- Offline vs. Online Split: Michelangelo cleanly separated batch feature computation (using Apache Spark and Hive for historical data) from real-time feature computation (using Apache Kafka and Flink for streaming data like GPS coordinates).
- Deployment Standardization: Models were deployed in three specific modes: Offline (Spark batch jobs for overnight predictions), Online (load-balanced API endpoints responding in <10ms), and Library (embedded directly into microservices for the absolute lowest latency).
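To make the feature-store idea concrete, here is a minimal, hypothetical sketch of the pattern (the class and method names are illustrative, not Michelangelo’s actual API): a batch job materializes a feature once, and every online consumer reads the same shared value instead of recomputing it.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative sketch of the centralized feature-store pattern:
# features are computed once by a batch job, then shared by all
# online consumers. Names here are made up for the example.

@dataclass
class FeatureStore:
    _online: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def materialize(self, entity_id: str, features: Dict[str, float]) -> None:
        """A nightly Spark-style job writes computed features for an entity."""
        self._online.setdefault(entity_id, {}).update(features)

    def get_online(self, entity_id: str, names: List[str]) -> Dict[str, float]:
        """Low-latency lookup used at prediction time."""
        row = self._online.get(entity_id, {})
        return {n: row[n] for n in names if n in row}

store = FeatureStore()
# One team computes "trip frequency in the last 30 days" once...
store.materialize("rider-42", {"trips_last_30d": 17.0, "avg_fare": 12.5})
# ...and any team's online model can read the shared value:
print(store.get_online("rider-42", ["trips_last_30d"]))
```

The same store backs both paths: training pipelines read historical snapshots offline, while serving reads the latest values online, which is what keeps training and serving features consistent.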
Michelangelo was a massive success, bringing hundreds of use cases into production. However, as the industry shifted toward Deep Learning and Large Language Models (LLMs), Michelangelo’s underlying orchestration layer began to crack under the weight.
3. Hitting the Wall: The Kubernetes & Ray Migration (2023–2024)
By mid-2023, Uber’s ML workloads were primarily running on a legacy job gateway service called MADLJ (Michelangelo Deep Learning Jobs). While functional, it forced ML engineers to manually handle resource management—choosing specific regions, zones, and clusters based on GPU availability.
This led to the "stranded compute" problem: Cluster A would be operating at 100% capacity with a massive queue of training jobs, while Cluster B sat 50% empty because engineers hadn't manually targeted it.
To prepare for the Generative AI boom, Uber executed a massive architectural shift: moving the entire ML platform to Kubernetes and Ray.
Curing Stranded Compute via Federation
Uber decoupled the user experience from the infrastructure. They introduced a Global Control Plane built on standard Kubernetes architecture.
- Developers now submit declarative jobs (via a Python-native workflow service called Uniflow) simply stating: "I need to train this PyTorch model on 8 A100 GPUs."
- The Global Control Plane's custom Job Controller automatically scans dozens of regional Kubernetes clusters (the Local Control Plane), identifies available capacity, and schedules the Ray workers accordingly.
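The placement logic can be sketched in a few lines. This is an illustrative simplification, not Uber’s actual Job Controller: given a declarative GPU request, the global scheduler places the job on whichever cluster has the most free capacity, which is exactly what eliminates the stranded-compute pattern described above.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of federated placement: the global control plane
# scans regional clusters and schedules the job wherever capacity exists.

@dataclass
class Cluster:
    name: str
    total_gpus: int
    used_gpus: int

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.used_gpus

def schedule(job_gpus: int, clusters: List[Cluster]) -> str:
    """Place a job on the cluster with the most free GPUs that fits it."""
    candidates = [c for c in clusters if c.free_gpus >= job_gpus]
    if not candidates:
        raise RuntimeError("no cluster can satisfy the request")
    best = max(candidates, key=lambda c: c.free_gpus)
    best.used_gpus += job_gpus
    return best.name

clusters = [
    Cluster("cluster-a", total_gpus=64, used_gpus=64),  # fully loaded
    Cluster("cluster-b", total_gpus=64, used_gpus=32),  # half idle
]
print(schedule(8, clusters))  # → cluster-b
```

Instead of queuing behind the saturated cluster-a, the job lands on the half-idle cluster-b automatically, with no engineer hand-picking a region or zone.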
Overcoming ETCD Limits with Transparent Persistence
Scaling Kubernetes to handle 100+ purpose-built Custom Resource Definitions (CRDs) representing the ML lifecycle introduced a new problem: etcd (Kubernetes’ default datastore) could not keep up with the high-cardinality metadata generated by a platform serving 30 million predictions a second.
To solve this, Uber engineered a transparent storage abstraction. Clients still interact with standard Kubernetes objects via the API, but the underlying metadata is seamlessly synchronized with a horizontally scalable MySQL backend, sidestepping etcd’s limitations.
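Conceptually, the abstraction looks something like the sketch below. This is a hypothetical simplification (using sqlite3 as a stand-in for the MySQL tier, and plain dicts for Kubernetes objects); the real system sits behind the Kubernetes API machinery, but the core idea is the same: Kubernetes-shaped objects in, SQL storage out.

```python
import json
import sqlite3

# Illustrative sketch of "transparent persistence": callers still speak
# the Kubernetes object model (kind, metadata, spec), but high-cardinality
# ML metadata lands in a SQL backend instead of etcd. sqlite3 stands in
# for the horizontally scalable MySQL tier in this toy example.

class SqlBackedStore:
    def __init__(self) -> None:
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE objects (kind TEXT, name TEXT, spec TEXT, "
            "PRIMARY KEY (kind, name))")

    def apply(self, obj: dict) -> None:
        """Upsert a Kubernetes-style object, as an API server would."""
        self.db.execute(
            "INSERT OR REPLACE INTO objects VALUES (?, ?, ?)",
            (obj["kind"], obj["metadata"]["name"], json.dumps(obj["spec"])))

    def get(self, kind: str, name: str) -> dict:
        row = self.db.execute(
            "SELECT spec FROM objects WHERE kind=? AND name=?",
            (kind, name)).fetchone()
        return json.loads(row[0])

store = SqlBackedStore()
store.apply({"kind": "TrainingJob",
             "metadata": {"name": "llama-ft-001"},
             "spec": {"gpus": 8, "framework": "pytorch"}})
print(store.get("TrainingJob", "llama-ft-001")["gpus"])  # → 8
```

Because the SQL tier scales horizontally, the number and churn rate of ML objects is no longer bounded by a single etcd cluster.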
4. The GenAI & Agentic Era (2024–2026)
With a federated Kubernetes and Ray foundation in place, Uber was uniquely positioned to absorb the immense compute requirements of Generative AI and Agentic systems.
Uber leverages a hybrid hardware approach: heavily utilizing on-prem A100 GPU clusters alongside Google Cloud H100 instances. To maximize GPU utilization (MFU, Model FLOPs Utilization) when training massive open-source models (like Llama or Mixtral), the platform engineering team implemented aggressive infrastructure-level optimizations:
- Distributed Memory Offloading: Because GPU memory is prohibitively expensive, Uber implemented advanced CPU offloading—keeping active computations on the GPU while shifting optimizer states to CPU RAM or NVMe SSDs. This effectively doubled training throughput and allowed them to train models that previously wouldn't fit in VRAM.
- Software/Hardware Co-design: By utilizing optimized frameworks like TensorRT-LLM tuned specifically for their H100 instances, Uber achieved a 2x improvement in response latency and a 6x boost in throughput.
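Open-source frameworks such as DeepSpeed expose this optimizer-offloading pattern as plain configuration (whether Uber uses DeepSpeed specifically is not stated in their talks; this is just a common way to express the idea). A minimal ZeRO-style config sketch: optimizer states move to pinned CPU memory while forward and backward passes stay on the GPU.

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```

The trade-off is PCIe transfer overhead in exchange for fitting models whose optimizer states alone would otherwise exhaust VRAM.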
The Shift to Agentic AI
Most recently, Uber has expanded beyond simple GenAI content generation into Agentic AI—systems capable of autonomous task decomposition, multi-agent collaboration, and real-time adaptability. By combining generative capabilities with their massive data annotation and testing engines (like uLabel and uTest), Uber is building systems where GenAI provides creative options, and Agentic logic evaluates, selects, and executes them reliably.
5. The Architecture Blueprint
Today, Uber’s ML platform can be distilled into four highly decoupled layers:
- Hardware Layer (Layer 0): A hybrid mix of on-premise A100 clusters and cloud-based H100 instances, connected via 100GB/s high-bandwidth networking.
- Orchestration Layer (Layer 1): Kubernetes handles the primitive scheduling and hardware constraints, while Ray (via the KubeRay operator) distributes the actual mathematical workloads across the worker nodes.
- Federation Layer (Layer 2): A global control plane that treats dozens of individual Kubernetes clusters as a single, unified compute mesh, dynamically routing workloads to eliminate idle GPU time.
- Developer Experience (Layer 3): Python-native workflows (Uniflow) and centralized Feature Stores that allow data scientists to focus entirely on modeling rather than infrastructure plumbing.
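As a concrete illustration of where Layers 1 and 2 meet, here is what a declarative training job looks like when expressed as a KubeRay RayJob custom resource. The manifest below is a minimal generic example (the name, image, and GPU counts are made up, and it is not an Uber manifest): Kubernetes schedules the pods, while the KubeRay operator wires them into a Ray cluster that runs the entrypoint.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: train-demo
spec:
  entrypoint: python train.py
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
    workerGroupSpecs:
      - groupName: gpu-workers
        replicas: 2
        rayStartParams: {}
        template:
          spec:
            containers:
              - name: ray-worker
                image: rayproject/ray:2.9.0
                resources:
                  limits:
                    nvidia.com/gpu: 4
```

In a federated setup like Uber’s, the developer never writes this YAML against a specific cluster; the global control plane decides which regional cluster ultimately receives it.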
6. The Lesson for the Enterprise
Uber’s architectural journey validates a crucial reality for modern platform engineering: AI scale exposes design flaws. An architecture that works for 1,000 predictions an hour will spectacularly collapse at 30 million predictions a second.
The primary takeaway from Uber's Michelangelo evolution is that successful, scalable AI is not fundamentally about having the smartest neural network. It is about robust data plumbing and distributed state management. By treating machine learning not as a special, fragile science project, but as standard, declarative, Kubernetes-native infrastructure, Uber has built the blueprint for the next decade of enterprise AI.
References & Further Reading
- Scaling Machine Learning at Uber with Michelangelo - Uber Engineering Blog
- Uber’s Journey to Ray on Kubernetes - Uber Engineering Blog
- From monolith to global mesh: How Uber standardized ML at scale - The New Stack
- Open Source and In-House: How Uber Optimizes LLM Training - Uber Engineering Blog
- Agentic AI + Generative AI: The Future of Enterprise Decision-Making - Uber AI Solutions
This article draws from sessions and discussions at KubeCon + CloudNativeCon EU 2026, including Agentics Day, Open Source SecurityCon, and contributions from the CNCF TAG Security community.
By Soumia, a developer advocate focused on making complex infrastructure legible through writing, speaking, and helping technical and non-technical audiences find common ground, working at the intersection of cloud-native systems, AI, and editorial craft.