<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: YK Sugi</title>
    <description>The latest articles on DEV Community by YK Sugi (@yks).</description>
    <link>https://dev.to/yks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3579164%2F2e342ed1-6352-4fef-8e6d-ee819e501720.jpg</url>
      <title>DEV Community: YK Sugi</title>
      <link>https://dev.to/yks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yks"/>
    <language>en</language>
    <item>
      <title>How We Cut LLM Batch Inference Time in Half with Dynamic Prefix Bucketing</title>
      <dc:creator>YK Sugi</dc:creator>
      <pubDate>Mon, 10 Nov 2025 23:12:46 +0000</pubDate>
      <link>https://dev.to/yks/how-we-cut-llm-batch-inference-time-in-half-with-dynamic-prefix-bucketing-183e</link>
      <guid>https://dev.to/yks/how-we-cut-llm-batch-inference-time-in-half-with-dynamic-prefix-bucketing-183e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLM batch inference is often difficult, costly, and slow - but it doesn't have to be that way. We developed a technique that cuts batch inference time in half by intelligently routing prompts with common prefixes to maximize cache usage. On a cluster of 128 GPUs processing 200k prompts (128 million tokens), we achieved a 50.7% speedup compared to naive batching approaches.&lt;/p&gt;

&lt;p&gt;We achieved this by combining the power of the &lt;a href="https://docs.vllm.ai/en/latest/" rel="noopener noreferrer"&gt;vLLM serving engine&lt;/a&gt; with distributed execution to implement two key techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Prefix Bucketing&lt;/strong&gt; - improving LLM cache usage by bucketing and routing by prompt prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-Based Continuous Batching&lt;/strong&gt; - Pipeline data processing with LLM inference to fully utilize GPUs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combined, these two strategies yield significant performance improvements and cost savings that scale to massive workloads. We observe that on a cluster of 128 GPUs (NVIDIA L4), we are able to complete an inference workload of 200k prompts totaling 128 million tokens up to &lt;strong&gt;50.7% faster&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building on Provider Abstractions
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://www.daft.ai/" rel="noopener noreferrer"&gt;Daft&lt;/a&gt;, we develop a distributed data processing framework with native AI capabilities. The key to our optimization approach was separating the model execution layer from the application layer through provider abstractions. This design allowed us to implement complex prefix caching logic without changing how users interact with their models.&lt;/p&gt;

&lt;p&gt;Consider running batch inference with OpenAI's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pydict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For self-hosted models with our prefix caching optimizations, you simply switch providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm-prefix-caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Same interface, optimized execution
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provider automatically handles prefix detection, dynamic bucketing, and intelligent routing across GPU replicas to maximize cache hits - all the sophisticated mechanisms we developed to achieve our 50.7% speedup.&lt;/p&gt;

&lt;p&gt;The following sections detail how we implemented these optimizations and the performance gains they delivered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to LLM Batch Inference
&lt;/h2&gt;

&lt;p&gt;LLM inference workloads fall into two distinct camps with fundamentally different optimization targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online inference&lt;/strong&gt; serves real-time requests: ChatGPT conversations, IDE code suggestions, agentic workflows. The model sits directly in the user loop. What matters: &lt;strong&gt;Time-to-first-token&lt;/strong&gt; and &lt;strong&gt;individual completion tokens per second&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch inference&lt;/strong&gt; pre-processes entire datasets offline: computing embeddings for vector DBs, labeling datasets for analysis, generating synthetic training data. No user waiting on the other end. What matters: &lt;strong&gt;tokens per dollar&lt;/strong&gt; and &lt;strong&gt;aggregate tokens/second&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Batch inference presents several unique challenges and opportunities over online inference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Online Inference&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Batch Inference&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latency of individual requests is critical. (TTFT, Tokens/sec)&lt;/td&gt;
&lt;td&gt;Overall throughput of the inference pipeline is the main concern. (Tokens/$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size of data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Typically handles one or few inputs at a time, so memory limits are rarely an issue.&lt;/td&gt;
&lt;td&gt;The entire dataset may not fit into CPU or GPU memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost and GPU utilization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Costs depend on per request or per token usage; GPUs may be underutilized between requests.&lt;/td&gt;
&lt;td&gt;Costs are tied to GPU hours; effective utilization across the batch is essential for efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data distribution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompts arrive in real time, so data distribution is unknown ahead of time.&lt;/td&gt;
&lt;td&gt;All prompts are known in advance, allowing optimizations that leverage data distribution.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Streaming-Based Continuous Batching
&lt;/h2&gt;

&lt;p&gt;A simple and scalable method of doing batch inference is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up N replicas of an LLM serving engine across a compute cluster such that all GPUs are occupied.&lt;/li&gt;
&lt;li&gt;Split your dataset into batches that are small enough to fit into memory.&lt;/li&gt;
&lt;li&gt;Distribute those batches evenly across the replicas.&lt;/li&gt;
&lt;li&gt;Run inference on one batch at a time.&lt;/li&gt;
&lt;/ol&gt;
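&lt;p&gt;The four steps above can be sketched as follows. This is a minimal illustration, not Daft code; &lt;code&gt;run_inference&lt;/code&gt; is a hypothetical stand-in for a call to a serving-engine replica:&lt;/p&gt;

```python
# Minimal sketch of naive batch inference: split the dataset into
# fixed-size batches and assign them round-robin across N replicas.
# run_inference is a hypothetical stand-in for a serving-engine call.

def make_batches(prompts, batch_size):
    # Step 2: split the dataset into memory-sized batches.
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

def assign_round_robin(batches, num_replicas):
    # Step 3: distribute batches evenly across the replicas.
    assignments = [[] for _ in range(num_replicas)]
    for i, batch in enumerate(batches):
        assignments[i % num_replicas].append(batch)
    return assignments

def run_naive(prompts, batch_size, num_replicas, run_inference):
    outputs = []
    for replica_batches in assign_round_robin(make_batches(prompts, batch_size), num_replicas):
        for batch in replica_batches:
            # Step 4: one batch at a time; the GPU idles between batches.
            outputs.extend(run_inference(batch))
    return outputs
```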

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95j4xvjasvbgmnnincu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95j4xvjasvbgmnnincu9.png" alt="Batch inference architecture showing dataset split into batch queues distributed across 4 LLM replicas" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, you’ll observe two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The GPU is idle between the end of one batch and the start of the next.

&lt;ul&gt;
&lt;li&gt;This is because a series of pre-inference and post-inference steps, including tokenization, data transfers, and batching, runs while the GPU sits idle.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Within a batch, some requests complete before others, leading to a lagging tail of longer sequences where the GPU isn’t fully utilized.

&lt;ul&gt;
&lt;li&gt;Since LLM inputs and outputs have variable length, some sequences require more generation steps than others.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu3wabg2yjf6oi0u6f6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu3wabg2yjf6oi0u6f6p.png" alt="Timeline showing GPU and CPU utilization with gaps between batches and variable sequence completion times" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple batch inference across two batches. Notice the gaps in GPU compute.&lt;/p&gt;

&lt;p&gt;To solve this, we can leverage a technique in vLLM called continuous batching. Its fundamental improvement is that we can now batch inference on a per-token basis instead of per sequence, which lets us start inference on prompts from the next batch as sequences in the previous batch complete. There is an &lt;a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" rel="noopener noreferrer"&gt;excellent blog post&lt;/a&gt; about continuous batching if you’d like to learn more about how it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqhny4pwjh6dszcolb6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqhny4pwjh6dszcolb6e.png" alt="Continuous batching visualization showing sequences at different stages with some completing while new ones start" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram about continuous batching from that blog post.&lt;/p&gt;

&lt;p&gt;To implement continuous batching across an entire dataset, we leverage Daft’s streaming execution capabilities to implement a “streaming sink”, a class of operators that are able to stream batches in and out while accumulating state across batches.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip&lt;/strong&gt;&lt;br&gt;
Learn more about streaming execution in &lt;a href="https://www.daft.ai/blog/exploring-daft-swordfish-execution-mechanism" rel="noopener noreferrer"&gt;our blog about Swordfish&lt;/a&gt;, our local execution engine!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this LLM operator, we collect input batches into a buffer that is fed into vLLM using the &lt;code&gt;AsyncLLMEngine&lt;/code&gt; API. This ensures that there is always more data for a serving engine to add to the batch. The serving engine pushes completed sequences into an output buffer, which gets streamed out into later pipeline stages.&lt;/p&gt;
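&lt;p&gt;The buffering pattern above can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;. This is an illustration of the streaming-sink idea, not the Daft implementation; &lt;code&gt;fake_generate&lt;/code&gt; is a toy stand-in for vLLM's &lt;code&gt;AsyncLLMEngine&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio

# Toy sketch of the streaming-sink pattern: a bounded pool of in-flight
# requests keeps the "engine" saturated, and completed sequences stream
# into an output list for downstream stages. fake_generate is a toy
# stand-in for vLLM's AsyncLLMEngine; not the Daft implementation.

async def fake_generate(prompt):
    await asyncio.sleep(0.001 * len(prompt))  # variable generation time
    return prompt + " -) output"

async def streaming_sink(input_batches, max_in_flight=4):
    in_flight, outputs = set(), []
    for batch in input_batches:
        for prompt in batch:
            # Top up the pool as sequences finish, instead of waiting
            # for a whole batch to drain before submitting the next.
            while len(in_flight) >= max_in_flight:
                done, in_flight = await asyncio.wait(
                    in_flight, return_when=asyncio.FIRST_COMPLETED)
                outputs.extend(t.result() for t in done)
            in_flight.add(asyncio.create_task(fake_generate(prompt)))
    if in_flight:
        done, _ = await asyncio.wait(in_flight)
        outputs.extend(t.result() for t in done)
    return outputs
```

&lt;p&gt;Because requests are replenished one at a time, the engine never drains between batches, which is the property the streaming sink provides.&lt;/p&gt;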

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwynaupjtqhd6mn3u41sk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwynaupjtqhd6mn3u41sk.png" alt="LLM streaming sink architecture with input/output buffers feeding Async LLM engines across multiple replicas" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamic Prefix Bucketing
&lt;/h2&gt;

&lt;p&gt;Model prompts often contain repetitive content, like system prompts and common instructions. In those cases, we can leverage prompt caching to avoid recomputing common prefixes. In vLLM, this is called &lt;a href="https://docs.vllm.ai/en/v0.10.2/features/automatic_prefix_caching.html" rel="noopener noreferrer"&gt;automatic prefix caching&lt;/a&gt;. When enabled, vLLM attempts to cache the computed values of a sequence across requests, storing them in GPU memory (VRAM).&lt;/p&gt;

&lt;p&gt;This means that if you have inputs with common prefixes, a significant amount of the computation can be avoided as long as the previous cached result is still in GPU memory.&lt;/p&gt;

&lt;p&gt;In batch inference workloads, the challenge with effectively using the prefix cache is twofold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache Eviction&lt;/strong&gt; - GPU VRAM is a limited resource, so a prefix cache block may be quickly evicted. If you have two sequences with a common prefix, but their requests are spaced far apart, prefix caching will not take effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Locality&lt;/strong&gt; - The prefix cache is local to an individual serving engine. In a cluster with multiple replicas, if two requests with the same prefix are served by different replicas, we are unable to reap the benefits of prefix caching either.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One straightforward method to improve the cache hit rate is to do a distributed sort prior to inference. That way, inputs with common prefixes are grouped together on the same machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipc8piws5wczv4zp2j4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipc8piws5wczv4zp2j4q.png" alt="Distributed sort architecture showing data shuffled across nodes before being sent to LLM replicas" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, sorting is a blocking operation, meaning GPUs are sitting idle until it completes. It also requires full materialization of your dataset, which may not be possible for large-scale data. &lt;/p&gt;

&lt;p&gt;Instead, we developed “dynamic prefix bucketing”, a method that simultaneously improves prefix cache hits while achieving high GPU utilization throughout an entire query. Dynamic prefix bucketing consists of two components: local prefix bucketing and prefix-aware routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local Prefix Bucketing
&lt;/h3&gt;

&lt;p&gt;On each local machine, we maintain a buffer of inputs, bucketed by prefix. When popping from the buffer, we remove buckets in order of size, largest first. Insertions and removals are interleaved, so small buckets are retained until they grow large enough to submit.&lt;/p&gt;

&lt;p&gt;Buckets are computed dynamically by first sorting the buffer, then determining bucket boundaries by checking the common prefix length of adjacent prompts. If the common prefix is under a certain threshold (e.g. 30% of each prompt), start a new bucket. Otherwise, add the next prompt into the current bucket. &lt;/p&gt;
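&lt;p&gt;A minimal sketch of this bucketing logic, assuming string prompts and the 30% threshold from the example above (illustrative only, not Daft's implementation):&lt;/p&gt;

```python
# Sketch of local prefix bucketing: sort the buffer, then cut a new
# bucket whenever adjacent prompts share too short a common prefix.
# The 30% threshold is the example figure from the text.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def bucket_by_prefix(buffer, threshold=0.3):
    buckets = []
    for prompt in sorted(buffer):
        if buckets:
            prev = buckets[-1][-1]
            shared = common_prefix_len(prev, prompt)
            # Extend the current bucket only if the shared prefix covers
            # at least `threshold` of both prompts.
            if shared >= threshold * len(prev) and shared >= threshold * len(prompt):
                buckets[-1].append(prompt)
                continue
        buckets.append([prompt])
    return buckets

def pop_largest(buckets):
    # Submit the largest bucket first; small buckets stay buffered so
    # they can keep growing as new inputs arrive.
    buckets.sort(key=len)
    return buckets.pop()
```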

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rhn1t5iuz4bco6twiro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rhn1t5iuz4bco6twiro.png" alt="Local prefix bucketing showing input buffer sorted and grouped into color-coded prefix buckets" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefix-Aware Routing
&lt;/h3&gt;

&lt;p&gt;To determine which replica to send a batch to, local executors query a global LLM router. The router picks the best replica by factoring in both prefix cache locality and load balancing: among the replicas with the lowest load (within a threshold), it selects the one that most recently saw the given prefix.&lt;/p&gt;

&lt;p&gt;This router ensures that all replicas are sufficiently utilized, while allowing prefix caching over data from separate machines. It is also effective against data skew, because if there are some prefixes that are very common across the dataset, it will avoid routing all prompts with such a prefix to a single serving engine.&lt;/p&gt;
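&lt;p&gt;A toy sketch of such a routing policy. The class name, load accounting, and slack threshold here are assumptions for illustration, not Daft's internals:&lt;/p&gt;

```python
import time

# Sketch of a prefix-aware routing policy: among replicas whose load is
# within a slack threshold of the least-loaded one, prefer the replica
# that saw this prefix most recently. Illustrative only; the names and
# threshold are assumptions, not Daft's internals.

class PrefixRouter:
    def __init__(self, num_replicas, load_slack=2):
        self.load = [0] * num_replicas
        self.last_seen = {}          # (prefix, replica) maps to timestamp
        self.load_slack = load_slack

    def route(self, prefix, batch_size):
        min_load = min(self.load)
        # Candidates: replicas not overloaded relative to the minimum.
        candidates = [r for r, load in enumerate(self.load)
                      if self.load_slack + min_load >= load]
        # Among candidates, pick the most recent holder of this prefix;
        # replicas that never saw it default to timestamp 0.
        best = max(candidates,
                   key=lambda r: self.last_seen.get((prefix, r), 0.0))
        self.load[best] += batch_size
        self.last_seen[(prefix, best)] = time.monotonic()
        return best

    def complete(self, replica, batch_size):
        # Called when a batch finishes, releasing the replica's load.
        self.load[replica] -= batch_size
```

&lt;p&gt;Because heavily loaded replicas drop out of the candidate set, a very common prefix spills over to other replicas instead of piling onto one serving engine, which is the skew-resistance property described above.&lt;/p&gt;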

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficcyx2fk4h4tpb814v9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficcyx2fk4h4tpb814v9v.png" alt="Prefix-aware routing showing local prefix buckets being routed through a central router to appropriate LLM sinks" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By combining local bucketing and global routing, we are able to improve cache hits across the cluster, all the while streaming data through. This method makes use of GPUs almost instantly once data is available and does not require full dataset materialization. As a result, even if your dataset is too large to fit into memory, dynamic prefix bucketing is still able to run batch inference over it with high performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Today
&lt;/h2&gt;

&lt;p&gt;You can try this today on the latest version of Daft by setting your provider to “vllm-prefix-caching” on our &lt;a href="https://docs.daft.ai/en/stable/api/functions/prompt/" rel="noopener noreferrer"&gt;&lt;code&gt;prompt&lt;/code&gt; AI function&lt;/a&gt;. Here’s a quick example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pydict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s are in strawberry?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm-prefix-caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmarking Setup
&lt;/h2&gt;

&lt;p&gt;All benchmarking and dataset generation scripts can be found in &lt;a href="https://github.com/Eventual-Inc/Daft/tree/main/benchmarking/vllm" rel="noopener noreferrer"&gt;the Daft repository on Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;To evaluate and benchmark our system, we used vLLM’s &lt;a href="https://docs.vllm.ai/en/stable/api/vllm/benchmarks/datasets.html#vllm.benchmarks.datasets.PrefixRepetitionRandomDataset" rel="noopener noreferrer"&gt;PrefixRepetitionRandomDataset&lt;/a&gt; to generate a 102-million-token dataset of 200k prompts, each 512 tokens long, drawn from 512 unique prefixes of 256 tokens (half of each prompt).&lt;/p&gt;

&lt;h3&gt;
  
  
  Workload
&lt;/h3&gt;

&lt;p&gt;We chose the &lt;a href="https://huggingface.co/Qwen/Qwen3-8B" rel="noopener noreferrer"&gt;Qwen/Qwen3-8B&lt;/a&gt; model in bfloat16 precision, a popular model used in batch inference for tasks such as synthetic data generation, product enrichment, and structured extraction.&lt;/p&gt;

&lt;p&gt;For each input prompt, we generated 128 output tokens and used a temperature of 1. This generates around 25.6M output tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware
&lt;/h3&gt;

&lt;p&gt;For our hardware, we use NVIDIA &lt;a href="https://www.nvidia.com/en-us/data-center/l4/" rel="noopener noreferrer"&gt;L4 GPUs&lt;/a&gt;, which have 24 GB of memory and can comfortably host Qwen3-8B in bfloat16 with room for the KV cache.&lt;/p&gt;

&lt;p&gt;Our pick for servers was the &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;, each with 4 L4 GPUs, 48 CPU cores, 192 GB of DRAM, and 40 Gbps of network bandwidth.&lt;/p&gt;

&lt;p&gt;We ran our setup in 3 configurations to test the scalability of our methods.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Number of GPUs&lt;/th&gt;
&lt;th&gt;CPU cores&lt;/th&gt;
&lt;th&gt;Network (Gbps)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 x &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 x &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32 x &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;1280&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Methods
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Naive Batching (Baseline)
&lt;/h4&gt;

&lt;p&gt;Our baseline method consists of simply splitting the input data into batches of 512 prompts and sending them into the serving engines sequentially. We implemented this via Daft’s class-based batch UDFs. &lt;/p&gt;

&lt;p&gt;Naive Batching on our 128 GPU configuration takes &lt;strong&gt;977 seconds&lt;/strong&gt; and has a &lt;strong&gt;29.2%&lt;/strong&gt; Cache Hit Rate.&lt;/p&gt;

&lt;p&gt;Our next step is to try continuous batching that could potentially improve pipelining and combat the issue of stragglers. &lt;/p&gt;

&lt;h4&gt;
  
  
  Continuous Batching
&lt;/h4&gt;

&lt;p&gt;With continuous batching, we instead maintain a buffer of tasks for each serving engine to process, implemented as a pool of async tasks that call &lt;code&gt;AsyncLLMEngine.generate&lt;/code&gt; on vLLM. The serving engine pops prompts from the task pool in order to maintain a consistent batch of sequences to run inference over.&lt;/p&gt;

&lt;p&gt;Continuous batching takes &lt;strong&gt;869 seconds&lt;/strong&gt; and yields an &lt;strong&gt;11% speedup&lt;/strong&gt;. We also see the cache hit rate decrease from &lt;strong&gt;29.2%&lt;/strong&gt; to &lt;strong&gt;26.5%&lt;/strong&gt;. We believe this is because continuous batching processes a larger batch of sequences at a time on average, leading to more cache evictions.&lt;/p&gt;

&lt;p&gt;Our next step is to improve the cache hit rate, which we can do by grouping common prefixes together. A simple way to do this is to globally sort the data, which is what we try next.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sorting
&lt;/h4&gt;

&lt;p&gt;For this method, we run the same continuous batching technique, along with a synchronous global sort of the data at the start of the workload. This ensures that for the most part, prompts with common prefixes end up in the same batch.&lt;/p&gt;

&lt;p&gt;Synchronously sorting the data and then running the continuous batching method takes &lt;strong&gt;563 seconds&lt;/strong&gt;, a &lt;strong&gt;35.2% speedup&lt;/strong&gt; relative to continuous batching alone. We can verify that this comes from better caching by looking at the cache hit rate, which increases from &lt;strong&gt;26.5%&lt;/strong&gt; to &lt;strong&gt;54.5%&lt;/strong&gt;, meaning more than half of the input tokens now hit the cache.&lt;/p&gt;

&lt;p&gt;One downside of this method is that our GPUs sit idle while the distributed sort is happening. Our next attempt is to do the continuous batching inference and the prefix grouping at the same time, so that our GPUs do useful work for the full duration of the workload. We do this by relaxing the requirement of globally sorting the data and using the Dynamic Prefix Bucketing scheme that we previously discussed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dynamic Prefix Bucketing
&lt;/h4&gt;

&lt;p&gt;By employing Dynamic Prefix Bucketing locally and Prefix-Aware Routing globally, we are able to avoid the GPU idle time caused by the global sort, while still achieving good prefix cache hit rates across the cluster. In this method, we also make use of continuous batching, sending prefix-bucketed prompts to the inference input buffers in a streaming fashion.&lt;/p&gt;

&lt;p&gt;Our Dynamic Prefix Bucketing method took &lt;strong&gt;482 seconds&lt;/strong&gt;, a &lt;strong&gt;14.4% speedup&lt;/strong&gt; relative to the synchronous global sort method and a &lt;strong&gt;50.7% total speedup&lt;/strong&gt; over our baseline. Furthermore, we maintain a cache hit rate of &lt;strong&gt;54%&lt;/strong&gt;. This means Dynamic Prefix Bucketing pays a cache hit rate penalty of only &lt;strong&gt;0.5 percentage points&lt;/strong&gt; compared to globally sorting the data, while still being able to pipeline with LLM inference!&lt;/p&gt;

&lt;h4&gt;
  
  
  Ray Data
&lt;/h4&gt;

&lt;p&gt;As an additional baseline, we use Ray Data with their off-the-shelf &lt;code&gt;ray.data.llm&lt;/code&gt; &lt;a href="https://docs.ray.io/en/latest/data/working-with-llms.html#batch-inference-llm" rel="noopener noreferrer"&gt;batch processing APIs&lt;/a&gt; [1]. Since it also uses vLLM under the hood, we were able to set it to the exact same configurations as our own benchmarking scripts. The one thing we changed was the batch size, which we set to 16, since we observed that a smaller batch size performed better on their setup.&lt;/p&gt;

&lt;p&gt;With Ray Data, we observe a runtime of &lt;strong&gt;842 seconds&lt;/strong&gt;, which is similar to our continuous batching method. Since Ray Data also utilizes continuous batching, this validates the performance of our methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2jjzp2y97ji11jldrc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2jjzp2y97ji11jldrc0.png" alt="Performance comparison bar chart showing runtime in seconds for different batching methods on 128 L4 GPUs" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifvyylsf9ogwbcpl174p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifvyylsf9ogwbcpl174p.png" alt="Prefix cache hit rates bar chart showing different methods with Dynamic Prefix Bucketing achieving 54% hit rate" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;We next test the scalability of Daft with Dynamic Prefix Bucketing and Ray Data.&lt;/p&gt;

&lt;p&gt;To do this, we run both systems on our 32, 64, and 128 GPU configurations and measure the wall time. From this we derive a scaling factor that captures how well each system scales as we increase the cluster size.&lt;/p&gt;
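
&lt;p&gt;Concretely, the scaling factor is the observed speedup divided by the ideal linear speedup from the added GPUs:&lt;/p&gt;

```python
def scaling_factor(base_runtime, runtime, base_gpus, gpus):
    """Observed speedup divided by the ideal (linear) speedup
    from adding GPUs; 1.0 means perfect scaling."""
    speedup = base_runtime / runtime
    ideal = gpus / base_gpus
    return speedup / ideal

# Daft runtimes in seconds, from the table below
print(round(scaling_factor(1682, 865, 32, 64), 2))   # → 0.97
print(round(scaling_factor(1682, 481, 32, 128), 2))  # → 0.87
```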

&lt;p&gt;For Daft, we see near-linear scaling going from 32 to 64 GPUs, and 87% efficiency going from 32 to 128 GPUs. At that scale, the overhead of downloading model weights and initializing the model on every GPU (all 128 of them) becomes the bottleneck for further scalability, since it is a constant cost.&lt;/p&gt;

&lt;p&gt;We also see that, in all configurations below, Daft with Dynamic Prefix Bucketing scales slightly better than Ray Data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Daft Runtime (s)&lt;/th&gt;
&lt;th&gt;Daft Speedup (vs 32 GPU)&lt;/th&gt;
&lt;th&gt;Daft Scaling Factor (vs 32 GPU)&lt;/th&gt;
&lt;th&gt;Ray Data Runtime (s)&lt;/th&gt;
&lt;th&gt;Ray Data Speedup (vs 32 GPU)&lt;/th&gt;
&lt;th&gt;Ray Data Scaling Factor (vs 32 GPU)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;32 GPUs&lt;/td&gt;
&lt;td&gt;1682&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2915&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64 GPUs&lt;/td&gt;
&lt;td&gt;865&lt;/td&gt;
&lt;td&gt;1.94&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;1548&lt;/td&gt;
&lt;td&gt;1.88&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128 GPUs&lt;/td&gt;
&lt;td&gt;481&lt;/td&gt;
&lt;td&gt;3.49&lt;/td&gt;
&lt;td&gt;0.87&lt;/td&gt;
&lt;td&gt;842&lt;/td&gt;
&lt;td&gt;3.46&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mybuopa7fsmjdd6jlt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mybuopa7fsmjdd6jlt8.png" alt="Scaling speedup chart showing near-linear scaling from 32 to 128 GPUs with both ideal and real speedup lines" width="667" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ablation on Prefix Count
&lt;/h3&gt;

&lt;p&gt;Finally, we examine how Daft with Dynamic Prefix Bucketing adapts to datasets with varying numbers of prefixes. We sweep the number of unique prefixes in a 102M-token dataset of 200k prompts.&lt;/p&gt;

&lt;p&gt;Dynamic Prefix Bucketing works better when a common prefix appears more often in the dataset: the more prompts share a prefix, the faster the workload runs and the higher the cache hit rate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr4pcxokjd8b2bux12qf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr4pcxokjd8b2bux12qf.png" alt="Performance vs number of unique prefixes bar chart showing faster runtime with fewer unique prefixes" width="609" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxhq6myzwlczj3uhk2b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxhq6myzwlczj3uhk2b1.png" alt="Cache hit rates vs number of prefixes showing decreasing hit rates as prefix diversity increases" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;vLLM Prefix Caching&lt;/em&gt; model provider is available to try today. Below are some future improvements that we would like to make to the implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Beyond text generation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;em&gt;vLLM Prefix Caching&lt;/em&gt; provider currently only supports text generation with our &lt;code&gt;prompt&lt;/code&gt; function, but the same techniques described in this post can also be applied to embedding generation and structured outputs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Smarter load balancing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The router currently load balances using the number of prompts sent to each serving engine replica. This assumes that all GPUs are equally as fast and that sequences take around the same time to generate, which may not be true in real-world scenarios. Instead, the router should monitor the actual number of unfinished requests on each replica to better load balance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;More accurate cache modeling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The router estimates the prefix cache on each replica via a bounded queue of sent prefixes for each replica. We found that this is already very effective, but a more accurate model of the prefix caches or a method to inspect the cache metrics on serving engines would improve the ability to route batches to the best replica.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Further improve scaling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;We should investigate the current bottlenecks for scaling the system to larger clusters. In theory it should be possible to achieve super-linear scaling, where 2x more GPUs can achieve more than 2x speedup, since a larger cluster will have a larger total prefix cache.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
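
&lt;p&gt;The load balancing and cache modeling ideas above can be sketched together as a toy router (hypothetical names, greatly simplified; not Daft's actual code): each replica keeps a bounded queue of recently routed prefixes, and a bucket goes to a replica that already remembers its prefix, falling back to the least-loaded one:&lt;/p&gt;

```python
from collections import deque

class ToyPrefixRouter:
    """Toy sketch of prefix-aware routing across serving engine replicas."""

    def __init__(self, n_replicas, memory=128):
        self.sent = [0] * n_replicas  # prompts sent per replica
        # Bounded queue of recently routed prefixes per replica,
        # used as a cheap estimate of each replica's prefix cache.
        self.recent = [deque(maxlen=memory) for _ in range(n_replicas)]

    def route(self, prefix, batch_size):
        # Prefer a replica whose bounded prefix memory already
        # contains this prefix (its cache is likely warm)...
        candidates = [i for i, d in enumerate(self.recent) if prefix in d]
        # ...otherwise fall back to the least-loaded replica.
        target = min(candidates or range(len(self.sent)),
                     key=lambda i: self.sent[i])
        self.sent[target] += batch_size
        self.recent[target].append(prefix)
        return target

router = ToyPrefixRouter(n_replicas=2)
a = router.route("SYSTEM-A", 16)  # no warm replica: least-loaded wins
b = router.route("SYSTEM-A", 16)  # same prefix routes to the same replica
```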

&lt;p&gt;In addition, we welcome your feedback on these features! Let us know how we can improve Daft and what you would like to see by submitting an issue on &lt;a href="https://github.com/Eventual-Inc/Daft/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or sending us a message on our &lt;a href="https://join.slack.com/t/dist-data/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix
&lt;/h2&gt;

&lt;p&gt;[1] We encountered an issue using Ray Data’s &lt;code&gt;build_llm_processor&lt;/code&gt; where we would get an error about no running async event loop. We were able to resolve this issue by downgrading our &lt;code&gt;uvloop&lt;/code&gt; version to v0.21.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>aiops</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Daft vs Ray Data: A Comprehensive Comparison for Multimodal Data Processing</title>
      <dc:creator>YK Sugi</dc:creator>
      <pubDate>Mon, 27 Oct 2025 16:17:59 +0000</pubDate>
      <link>https://dev.to/yks/daft-vs-ray-data-a-comprehensive-comparison-for-multimodal-data-processing-3686</link>
      <guid>https://dev.to/yks/daft-vs-ray-data-a-comprehensive-comparison-for-multimodal-data-processing-3686</guid>
      <description>&lt;p&gt;Multimodal AI workloads break traditional data engines. They need to embed documents, classify images, and transcribe audio, not just run aggregations and joins. These multimodal workloads are tough: memory usage balloons mid-pipeline, processing requires both CPU and GPU, and a single machine can't handle the data volume.&lt;/p&gt;

&lt;p&gt;This post provides a comprehensive comparison of &lt;strong&gt;Daft&lt;/strong&gt; and &lt;strong&gt;Ray Data&lt;/strong&gt; for multimodal data processing, examining their architectures and performance. &lt;a href="https://www.daft.ai/blog/benchmarks-for-multimodal-ai-workloads" rel="noopener noreferrer"&gt;Benchmarks&lt;/a&gt; across large-scale audio, video, document, and image workloads found Daft ran 2-7x faster than Ray Data and 4-18x faster than Spark, while finishing jobs reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multimodal Data Challenge
&lt;/h2&gt;

&lt;p&gt;Multimodal data processing presents unique challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Explosions&lt;/strong&gt;: A compressed image like a JPEG inflates 20x in memory once decoded. A single video file can be decoded into thousands of frames, each being megabytes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous Compute&lt;/strong&gt;: These workloads stress CPU, GPU, and network simultaneously. Processing steps include resampling, feature extraction, transcription, downloading, decoding, resizing, normalizing, and classification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Volume&lt;/strong&gt;: The benchmarked workloads included 113,800 audio files from Common Voice 17, 10,000 PDFs from Common Crawl, 803,580 images from ImageNet, and 1,000 videos from Hollywood2.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Introducing the Contenders
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Daft
&lt;/h3&gt;

&lt;p&gt;Daft is designed to handle petabyte-scale workloads with multimodal data (audio, video, images, text, embeddings) as first-class citizens.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native multimodal operations&lt;/strong&gt;: Built-in image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, reading video to image frames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Declarative DataFrame/SQL API&lt;/strong&gt;: With schema validation and query optimizer that automatically handles projection pushdowns, filter pushdowns, and join reordering - optimizations users get "for free" without manual tuning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comprehensive I/O support&lt;/strong&gt;: Native readers and writers for Parquet, CSV, JSON, Lance, Iceberg, Delta Lake, and WARC formats, tightly integrated with the streaming execution model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ray Data
&lt;/h3&gt;

&lt;p&gt;Ray Data is a data processing library built on top of Ray, a framework for building distributed Python applications.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-level operators&lt;/strong&gt;: Provides operations like &lt;code&gt;map_batches&lt;/code&gt; that work directly on PyArrow record batches or pandas DataFrames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ray ecosystem integration&lt;/strong&gt;: Tight integration with Ray Train for distributed training and Ray Serve for model serving&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Daft's Streaming Execution Model
&lt;/h3&gt;

&lt;p&gt;Daft's architecture revolves around its Swordfish streaming execution engine. Data is always "in flight": batches flow through the pipeline as soon as they are ready. For a partition of 100k images, the first 1000 can be fed into model inference while the next 1000 are being downloaded or decoded. The entire partition never has to be fully materialized in an intermediate buffer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpressure mechanism&lt;/strong&gt;: If GPU inference becomes the bottleneck, the upstream steps automatically slow down so memory usage remains bounded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive batch sizing&lt;/strong&gt;: Daft shrinks batch sizes on memory-heavy operations like &lt;code&gt;url_download&lt;/code&gt; or &lt;code&gt;image_decode&lt;/code&gt;, keeping throughput high without ballooning memory usage.&lt;/p&gt;
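
&lt;p&gt;As a back-of-the-envelope illustration of the idea (our example numbers, not Daft's internals): given an estimate of how much each row inflates when decoded, a batch size can be chosen so the decoded output stays within a memory budget:&lt;/p&gt;

```python
def safe_batch_size(budget_bytes, avg_compressed_bytes, inflation=20):
    """Rows per batch so decoded output stays within a memory budget.

    `inflation` is the assumed decode blow-up factor (e.g. a JPEG
    inflating roughly 20x once decoded to raw pixels).
    """
    per_row = avg_compressed_bytes * inflation
    return max(1, budget_bytes // per_row)

# 2 GiB budget, 500 KB average JPEGs, ~20x decode inflation
print(safe_batch_size(2 * 1024**3, 500_000))  # → 214
```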

&lt;p&gt;&lt;strong&gt;Flotilla distributed engine&lt;/strong&gt;: Daft's distributed runner deploys one Swordfish worker per node, enabling the same streaming execution model to scale across clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feznrf8ousj4i8vlz301k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feznrf8ousj4i8vlz301k.gif" alt="Animated diagram of Daft's Flotilla distributed architecture showing three Swordfish workers processing data in parallel. Each worker streams data through four sequential stages: Scan, Download, Embed, and Write, with arrows indicating continuous data flow between stages. The animation demonstrates how batches flow through the pipeline without materializing entire partitions." width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ray Data's Execution Model
&lt;/h3&gt;

&lt;p&gt;Ray Data streams data between heterogeneous operations (e.g., CPU → GPU) that users define via classes or resource requests. Within homogeneous operations, Ray Data fuses sequential operations into the same task and executes them sequentially, which can cause memory issues without careful tuning of block sizes. You can work around this by using classes instead of functions in &lt;code&gt;map&lt;/code&gt;/&lt;code&gt;map_batches&lt;/code&gt;, but this materializes intermediates in Ray's object store, adding serialization and memory copy overhead. Ray's object store is by default only 30% of machine memory, and this limitation can lead to excessive disk spilling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Benchmarks
&lt;/h2&gt;

&lt;p&gt;Based on &lt;a href="https://www.daft.ai/blog/benchmarks-for-multimodal-ai-workloads" rel="noopener noreferrer"&gt;recent benchmarks&lt;/a&gt; conducted on identical AWS clusters (8 x g6.xlarge instances with NVIDIA L4 GPUs, each with 4 vCPUs, 16 GB memory, and 100 GB EBS volume), here's how the two frameworks compare:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsidc5fbr3m74u2a7ypmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsidc5fbr3m74u2a7ypmc.png" alt="Bar chart showing Daft significantly outperforming Ray Data across four AI workloads: Audio Transcription, Document Embedding, Image Classification, and Video Object Detection, with Daft completing tasks 2-7x faster" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Daft&lt;/th&gt;
&lt;th&gt;Ray Data&lt;/th&gt;
&lt;th&gt;Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Audio Transcription&lt;/strong&gt; (113,800 files)&lt;/td&gt;
&lt;td&gt;6m 22s&lt;/td&gt;
&lt;td&gt;29m 20s (4.6x slower)&lt;/td&gt;
&lt;td&gt;25m 46s (4.0x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Document Embedding&lt;/strong&gt; (10,000 PDFs)&lt;/td&gt;
&lt;td&gt;1m 54s&lt;/td&gt;
&lt;td&gt;14m 32s (7.6x slower)&lt;/td&gt;
&lt;td&gt;8m 4s (4.2x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Image Classification&lt;/strong&gt; (803,580 images)&lt;/td&gt;
&lt;td&gt;4m 23s&lt;/td&gt;
&lt;td&gt;23m 30s (5.4x slower)&lt;/td&gt;
&lt;td&gt;45m 7s (10.3x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Video Object Detection&lt;/strong&gt; (1,000 videos)&lt;/td&gt;
&lt;td&gt;11m 46s&lt;/td&gt;
&lt;td&gt;25m 54s (2.2x slower)&lt;/td&gt;
&lt;td&gt;3h 36m (18.4x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Such Large Performance Differences?
&lt;/h3&gt;

&lt;p&gt;Several architectural decisions contribute to Daft's performance advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native Operations vs Python UDFs&lt;/strong&gt;: Daft has native multimodal expressions including image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, and reading video to image frames. These native expressions are highly optimized in Daft. In Ray Data, you have to write your own Python UDFs on top of external dependencies like Pillow, NumPy, spaCy, Hugging Face, etc. This comes at the cost of extra data movement, because each of these libraries has its own data format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Management - Streaming vs Materialization&lt;/strong&gt;: Daft streams data through network, CPU, and GPU in a continuous stream without materializing entire partitions. Ray Data fuses sequential operations which can cause memory issues. While you can work around this by using classes to materialize intermediates in the object store, this adds serialization and memory copy overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Utilization&lt;/strong&gt;: Daft pipelines everything inside a single Swordfish worker, which has control over all resources of the machine. Data asynchronously streams from cloud storage, into the CPUs to run pre-processing steps, then into GPU memory for inference, and back out for results to be uploaded. CPUs, GPUs, and the network stay saturated together for optimal throughput. In contrast, Ray Data by default reserves a CPU core for I/O-heavy operations like downloading large videos, which can leave that core unavailable for CPU-bound processing work, requiring manual tuning of fractional CPU requests to optimize resource usage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When to Choose Which?
&lt;/h2&gt;

&lt;p&gt;Based on the benchmark results and architectural differences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daft&lt;/strong&gt; shows significant advantages for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multimodal data processing (images, documents, video, audio)&lt;/li&gt;
&lt;li&gt;Workloads requiring reliable execution without extensive tuning&lt;/li&gt;
&lt;li&gt;Complex queries with joins, aggregations, and multiple transformations&lt;/li&gt;
&lt;li&gt;Teams preferring DataFrame/SQL semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ray Data&lt;/strong&gt; may be preferred when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have tight integration needs with the Ray ecosystem (Ray Train, Ray Serve)&lt;/li&gt;
&lt;li&gt;You need fine-grained control over CPU/GPU allocation per operation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Practitioners Are Saying
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Daft battle-tested enough for production?
&lt;/h3&gt;

&lt;p&gt;When &lt;a href="https://www.linkedin.com/in/timr11/" rel="noopener noreferrer"&gt;Tim Romanski&lt;/a&gt; of Essential AI set out to taxonomize 23.6 billion web documents from Common Crawl (24 trillion tokens), his team pushed Daft to its limits - scaling from local development to 32,000 requests per second per VM. As he shared in a &lt;a href="https://youtu.be/y5hs7q_LaLM?t=466" rel="noopener noreferrer"&gt;panel discussion&lt;/a&gt;: "We pushed Daft to the limit and it's battle tested... If we had to do the same thing in Spark, we would have to have the JVM installed, go through all of its nuts and bolts just to get something running. So the time to get something running in the first place was a lot shorter. And then once we got it running locally, we just scaled up to multiple machines."&lt;/p&gt;

&lt;h3&gt;
  
  
  What gap does Daft fill in the Ray ecosystem?
&lt;/h3&gt;

&lt;p&gt;CloudKitchens rebuilt their entire ML infrastructure around what they call the "DREAM stack" (Daft, Ray, poEtry, Argo, Metaflow). When selecting their data processing layer, they identified specific limitations with Ray Data and chose Daft to complement Ray's compute capabilities. As their infrastructure team &lt;a href="https://techblog.cloudkitchens.com/p/ml-infrastructure-doesnt-have-to" rel="noopener noreferrer"&gt;explained&lt;/a&gt;, "one issue with the Ray library for data processing, Ray Data, is that it doesn't cover the full range of DataFrame/ETL functions and its performance could be improved." They chose Daft because "it fills the gap of Ray Data by providing amazing DataFrame APIs" and noted that "in our tests, it's faster than Spark and uses fewer resources."&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Daft perform on even larger datasets?
&lt;/h3&gt;

&lt;p&gt;A data engineer from ByteDance commented on Daft's &lt;a href="https://www.daft.ai/blog/processing-300k-images-without-oom" rel="noopener noreferrer"&gt;300K image processing demonstration&lt;/a&gt;, sharing his own experience with an even larger image classification workload: "Not just 300,000 images - we ran image classification evaluations on the ImageNet dataset with approximately 1.28 million images, and Daft was about 20% faster than Ray Data." Additionally, in a separate technical analysis of Daft's architecture, he praised its "excellent execution performance and resource efficiency" and highlighted how it "effortlessly enables streaming processing of large-scale image datasets."&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.daft.ai/blog/benchmarks-for-multimodal-ai-workloads" rel="noopener noreferrer"&gt;Benchmarks for Multimodal AI Workloads&lt;/a&gt; - Primary source for performance data and architectural comparisons&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Eventual-Inc/Daft/tree/main/benchmarking/ai" rel="noopener noreferrer"&gt;Benchmark Code Repository&lt;/a&gt; - Open-source code to reproduce all benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dist-data.slack.com/join/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg" rel="noopener noreferrer"&gt;Distributed Data Community Slack&lt;/a&gt; - Join the community to discuss with Daft developers and users&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>aiops</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
