<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: YK Sugi</title>
    <description>The latest articles on DEV Community by YK Sugi (@yks).</description>
    <link>https://dev.to/yks</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3579164%2F2e342ed1-6352-4fef-8e6d-ee819e501720.jpg</url>
      <title>DEV Community: YK Sugi</title>
      <link>https://dev.to/yks</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yks"/>
    <language>en</language>
    <item>
      <title>How We Cut LLM Batch Inference Time in Half with Dynamic Prefix Bucketing</title>
      <dc:creator>YK Sugi</dc:creator>
      <pubDate>Mon, 10 Nov 2025 23:12:46 +0000</pubDate>
      <link>https://dev.to/yks/how-we-cut-llm-batch-inference-time-in-half-with-dynamic-prefix-bucketing-183e</link>
      <guid>https://dev.to/yks/how-we-cut-llm-batch-inference-time-in-half-with-dynamic-prefix-bucketing-183e</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;TL;DR&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLM batch inference is often difficult, costly, and slow - but it doesn't have to be that way. We developed a technique that cuts batch inference time in half by intelligently routing prompts with common prefixes to maximize cache usage. On a cluster of 128 GPUs processing 200k prompts (128 million tokens), we achieved a 50.7% speedup compared to naive batching approaches.&lt;/p&gt;

&lt;p&gt;We achieved this by combining the power of the &lt;a href="https://docs.vllm.ai/en/latest/" rel="noopener noreferrer"&gt;vLLM serving engine&lt;/a&gt; with distributed execution to implement two key techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Prefix Bucketing&lt;/strong&gt; - improving LLM cache usage by bucketing and routing by prompt prefix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming-Based Continuous Batching&lt;/strong&gt; - Pipeline data processing with LLM inference to fully utilize GPUs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Combined, these two strategies yield significant performance improvements and cost savings that scale to massive workloads. We observe that on a cluster of 128 GPUs (NVIDIA L4), we are able to complete an inference workload of 200k prompts totaling 128 million tokens up to &lt;strong&gt;50.7% faster&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building on Provider Abstractions
&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://www.daft.ai/" rel="noopener noreferrer"&gt;Daft&lt;/a&gt;, we develop a distributed data processing framework with native AI capabilities. The key to our optimization approach was separating the model execution layer from the application layer through provider abstractions. This design allowed us to implement complex prefix caching logic without changing how users interact with their models.&lt;/p&gt;

&lt;p&gt;Consider running batch inference with OpenAI's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pydict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt 1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt 2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]})&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For self-hosted models with our prefix caching optimizations, you simply switch providers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm-prefix-caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Same interface, optimized execution
&lt;/span&gt;        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This provider automatically handles prefix detection, dynamic bucketing, and intelligent routing across GPU replicas to maximize cache hits - all the sophisticated mechanisms we developed to achieve our 50.7% speedup.&lt;/p&gt;

&lt;p&gt;The following sections detail how we implemented these optimizations and the performance gains they delivered.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to LLM Batch Inference
&lt;/h2&gt;

&lt;p&gt;LLM inference workloads fall into two distinct camps with fundamentally different optimization targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Online inference&lt;/strong&gt; serves real-time requests: ChatGPT conversations, IDE code suggestions, agentic workflows. The model sits directly in the user loop. What matters: &lt;strong&gt;Time-to-first-token&lt;/strong&gt; and &lt;strong&gt;individual completion tokens per second&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Batch inference&lt;/strong&gt; pre-processes entire datasets offline: computing embeddings for vector DBs, labeling datasets for analysis, generating synthetic training data. No user waiting on the other end. What matters: &lt;strong&gt;tokens per dollar&lt;/strong&gt; and &lt;strong&gt;aggregate tokens/second&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Batch inference presents several unique challenges and opportunities over online inference:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Online Inference&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Batch Inference&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Performance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Latency of individual requests is critical. (TTFT, Tokens/sec)&lt;/td&gt;
&lt;td&gt;Overall throughput of the inference pipeline is the main concern. (Tokens/$)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size of data&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Typically handles one or few inputs at a time, so memory limits are rarely an issue.&lt;/td&gt;
&lt;td&gt;The entire dataset may not fit into CPU or GPU memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost and GPU utilization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Costs depend on per request or per token usage; GPUs may be underutilized between requests.&lt;/td&gt;
&lt;td&gt;Costs are tied to GPU hours; effective utilization across the batch is essential for efficiency.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Data distribution&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Prompts arrive in real time, so data distribution is unknown ahead of time.&lt;/td&gt;
&lt;td&gt;All prompts are known in advance, allowing optimizations that leverage data distribution.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Streaming-Based Continuous Batching
&lt;/h2&gt;

&lt;p&gt;A simple and scalable method of doing batch inference is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Spin up N replicas of an LLM serving engine across a compute cluster such that all GPUs are occupied.&lt;/li&gt;
&lt;li&gt;Split your dataset into batches that are small enough to fit into memory.&lt;/li&gt;
&lt;li&gt;Distribute those batches evenly across the replicas.&lt;/li&gt;
&lt;li&gt;Run inference on one batch at a time.&lt;/li&gt;
&lt;/ol&gt;
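&lt;p&gt;The four steps above can be sketched as follows. This is a minimal illustration, not Daft code; &lt;code&gt;run_inference&lt;/code&gt; is a hypothetical stand-in for a call to a serving-engine replica:&lt;/p&gt;

```python
# Minimal sketch of naive batch inference: split the dataset into
# fixed-size batches and assign them round-robin across N replicas.
# run_inference is a hypothetical stand-in for a serving-engine call.

def make_batches(prompts, batch_size):
    # Step 2: split the dataset into memory-sized batches.
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

def assign_round_robin(batches, num_replicas):
    # Step 3: distribute batches evenly across the replicas.
    assignments = [[] for _ in range(num_replicas)]
    for i, batch in enumerate(batches):
        assignments[i % num_replicas].append(batch)
    return assignments

def run_naive(prompts, batch_size, num_replicas, run_inference):
    outputs = []
    for replica_batches in assign_round_robin(make_batches(prompts, batch_size), num_replicas):
        for batch in replica_batches:
            # Step 4: one batch at a time; the GPU idles between batches.
            outputs.extend(run_inference(batch))
    return outputs
```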

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95j4xvjasvbgmnnincu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95j4xvjasvbgmnnincu9.png" alt="Batch inference architecture showing dataset split into batch queues distributed across 4 LLM replicas" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, you’ll observe two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The GPU is idle between the end of one batch and the start of the next.

&lt;ul&gt;
&lt;li&gt;This is because a series of pre-inference and post-inference steps, including tokenization, data transfers, and batching, runs while the GPU sits idle.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Within a batch, some requests complete before others, leading to a lagging tail of longer sequences where the GPU isn’t fully utilized.

&lt;ul&gt;
&lt;li&gt;Since LLM inputs and outputs have variable length, some sequences require more generation steps than others.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu3wabg2yjf6oi0u6f6p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzu3wabg2yjf6oi0u6f6p.png" alt="Timeline showing GPU and CPU utilization with gaps between batches and variable sequence completion times" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simple batch inference across two batches. Notice the gaps in GPU compute.&lt;/p&gt;

&lt;p&gt;To solve this, we can leverage a technique in vLLM called continuous batching. Its fundamental improvement is that we can now batch inference on a per-token basis instead of per sequence, which lets us start inference on prompts from the next batch as sequences in the previous batch complete. There is an &lt;a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" rel="noopener noreferrer"&gt;excellent blog post&lt;/a&gt; about continuous batching if you’d like to learn more about how it works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqhny4pwjh6dszcolb6e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiqhny4pwjh6dszcolb6e.png" alt="Continuous batching visualization showing sequences at different stages with some completing while new ones start" width="800" height="182"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diagram about continuous batching from that blog post.&lt;/p&gt;

&lt;p&gt;To implement continuous batching across an entire dataset, we leverage Daft’s streaming execution capabilities to implement a “streaming sink”, a class of operators that are able to stream batches in and out while accumulating state across batches.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;💡 Tip&lt;/strong&gt;&lt;br&gt;
Learn more about streaming execution in &lt;a href="https://www.daft.ai/blog/exploring-daft-swordfish-execution-mechanism" rel="noopener noreferrer"&gt;our blog about Swordfish&lt;/a&gt;, our local execution engine!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this LLM operator, we collect input batches into a buffer that is fed into vLLM using the &lt;code&gt;AsyncLLMEngine&lt;/code&gt; API. This ensures that there is always more data for a serving engine to add to the batch. The serving engine pushes completed sequences into an output buffer, which gets streamed out into later pipeline stages.&lt;/p&gt;
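&lt;p&gt;The buffering pattern above can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;. This is an illustration of the streaming-sink idea, not the Daft implementation; &lt;code&gt;fake_generate&lt;/code&gt; is a toy stand-in for vLLM's &lt;code&gt;AsyncLLMEngine&lt;/code&gt;:&lt;/p&gt;

```python
import asyncio

# Toy sketch of the streaming-sink pattern: a bounded pool of in-flight
# requests keeps the "engine" saturated, and completed sequences stream
# into an output list for downstream stages. fake_generate is a toy
# stand-in for vLLM's AsyncLLMEngine; not the Daft implementation.

async def fake_generate(prompt):
    await asyncio.sleep(0.001 * len(prompt))  # variable generation time
    return prompt + " -) output"

async def streaming_sink(input_batches, max_in_flight=4):
    in_flight, outputs = set(), []
    for batch in input_batches:
        for prompt in batch:
            # Top up the pool as sequences finish, instead of waiting
            # for a whole batch to drain before submitting the next.
            while len(in_flight) >= max_in_flight:
                done, in_flight = await asyncio.wait(
                    in_flight, return_when=asyncio.FIRST_COMPLETED)
                outputs.extend(t.result() for t in done)
            in_flight.add(asyncio.create_task(fake_generate(prompt)))
    if in_flight:
        done, _ = await asyncio.wait(in_flight)
        outputs.extend(t.result() for t in done)
    return outputs
```

&lt;p&gt;Because requests are replenished one at a time, the engine never drains between batches, which is the property the streaming sink provides.&lt;/p&gt;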

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwynaupjtqhd6mn3u41sk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwynaupjtqhd6mn3u41sk.png" alt="LLM streaming sink architecture with input/output buffers feeding Async LLM engines across multiple replicas" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamic Prefix Bucketing
&lt;/h2&gt;

&lt;p&gt;Model prompts often contain repetitive content, like system prompts and common instructions. In those cases, we can leverage prompt caching to avoid recomputing common prefixes. In vLLM, this is called &lt;a href="https://docs.vllm.ai/en/v0.10.2/features/automatic_prefix_caching.html" rel="noopener noreferrer"&gt;automatic prefix caching&lt;/a&gt;. When enabled, vLLM attempts to cache the computed values of a sequence across requests, storing them in GPU memory (VRAM).&lt;/p&gt;

&lt;p&gt;This means that if you have inputs with common prefixes, a significant amount of the computation can be avoided as long as the previous cached result is still in GPU memory.&lt;/p&gt;

&lt;p&gt;In batch inference workloads, the challenge with effectively using the prefix cache is twofold:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cache Eviction&lt;/strong&gt; - GPU VRAM is a limited resource, so a prefix cache block may be quickly evicted. If you have two sequences with a common prefix, but their requests are spaced far apart, prefix caching will not take effect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache Locality&lt;/strong&gt; - The prefix cache is local to an individual serving engine. In a cluster with multiple replicas, if two requests with the same prefix are served by different replicas, we are unable to reap the benefits of prefix caching either.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One straightforward method to improve the cache hit rate is to do a distributed sort prior to inference. That way, inputs with common prefixes are grouped together on the same machine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipc8piws5wczv4zp2j4q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fipc8piws5wczv4zp2j4q.png" alt="Distributed sort architecture showing data shuffled across nodes before being sent to LLM replicas" width="800" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, sorting is a blocking operation, meaning GPUs are sitting idle until it completes. It also requires full materialization of your dataset, which may not be possible for large-scale data. &lt;/p&gt;

&lt;p&gt;Instead, we developed “dynamic prefix bucketing”, a method that simultaneously improves prefix cache hits while achieving high GPU utilization throughout an entire query. Dynamic prefix bucketing consists of two components: local prefix bucketing and prefix-aware routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local Prefix Bucketing
&lt;/h3&gt;

&lt;p&gt;On each local machine, we maintain a buffer of inputs, bucketed by prefix. When popping from the buffer, we remove buckets in order of size, largest first. Insertions and removals are interleaved, so small buckets are retained until they grow large enough to submit.&lt;/p&gt;

&lt;p&gt;Buckets are computed dynamically by first sorting the buffer, then determining bucket boundaries by checking the common prefix length of adjacent prompts. If the common prefix is under a certain threshold (e.g. 30% of each prompt), start a new bucket. Otherwise, add the next prompt into the current bucket. &lt;/p&gt;
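&lt;p&gt;A minimal sketch of this bucketing logic, assuming string prompts and the 30% threshold from the example above (illustrative only, not Daft's implementation):&lt;/p&gt;

```python
# Sketch of local prefix bucketing: sort the buffer, then cut a new
# bucket whenever adjacent prompts share too short a common prefix.
# The 30% threshold is the example figure from the text.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def bucket_by_prefix(buffer, threshold=0.3):
    buckets = []
    for prompt in sorted(buffer):
        if buckets:
            prev = buckets[-1][-1]
            shared = common_prefix_len(prev, prompt)
            # Extend the current bucket only if the shared prefix covers
            # at least `threshold` of both prompts.
            if shared >= threshold * len(prev) and shared >= threshold * len(prompt):
                buckets[-1].append(prompt)
                continue
        buckets.append([prompt])
    return buckets

def pop_largest(buckets):
    # Submit the largest bucket first; small buckets stay buffered so
    # they can keep growing as new inputs arrive.
    buckets.sort(key=len)
    return buckets.pop()
```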

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rhn1t5iuz4bco6twiro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rhn1t5iuz4bco6twiro.png" alt="Local prefix bucketing showing input buffer sorted and grouped into color-coded prefix buckets" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prefix-Aware Routing
&lt;/h3&gt;

&lt;p&gt;To determine which replica to send a batch to, local executors query a global LLM router. The router picks the best replica by factoring in both prefix cache locality and load balancing: among the replicas with the lowest load (within a threshold), it selects the one that most recently saw the given prefix.&lt;/p&gt;

&lt;p&gt;This router ensures that all replicas are sufficiently utilized, while allowing prefix caching over data from separate machines. It is also effective against data skew, because if there are some prefixes that are very common across the dataset, it will avoid routing all prompts with such a prefix to a single serving engine.&lt;/p&gt;
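&lt;p&gt;A toy sketch of such a routing policy. The class name, load accounting, and slack threshold here are assumptions for illustration, not Daft's internals:&lt;/p&gt;

```python
import time

# Sketch of a prefix-aware routing policy: among replicas whose load is
# within a slack threshold of the least-loaded one, prefer the replica
# that saw this prefix most recently. Illustrative only; the names and
# threshold are assumptions, not Daft's internals.

class PrefixRouter:
    def __init__(self, num_replicas, load_slack=2):
        self.load = [0] * num_replicas
        self.last_seen = {}          # (prefix, replica) maps to timestamp
        self.load_slack = load_slack

    def route(self, prefix, batch_size):
        min_load = min(self.load)
        # Candidates: replicas not overloaded relative to the minimum.
        candidates = [r for r, load in enumerate(self.load)
                      if self.load_slack + min_load >= load]
        # Among candidates, pick the most recent holder of this prefix;
        # replicas that never saw it default to timestamp 0.
        best = max(candidates,
                   key=lambda r: self.last_seen.get((prefix, r), 0.0))
        self.load[best] += batch_size
        self.last_seen[(prefix, best)] = time.monotonic()
        return best

    def complete(self, replica, batch_size):
        # Called when a batch finishes, releasing the replica's load.
        self.load[replica] -= batch_size
```

&lt;p&gt;Because heavily loaded replicas drop out of the candidate set, a very common prefix spills over to other replicas instead of piling onto one serving engine, which is the skew-resistance property described above.&lt;/p&gt;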

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficcyx2fk4h4tpb814v9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficcyx2fk4h4tpb814v9v.png" alt="Prefix-aware routing showing local prefix buckets being routed through a central router to appropriate LLM sinks" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By combining local bucketing and global routing, we are able to improve cache hits across the cluster, all the while streaming data through. This method makes use of GPUs almost instantly once data is available and does not require full dataset materialization. As a result, even if your dataset is too large to fit into memory, dynamic prefix bucketing is still able to run batch inference over it with high performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try Today
&lt;/h2&gt;

&lt;p&gt;You can try this today on the latest version of Daft by setting your provider to “vllm-prefix-caching” on our &lt;a href="https://docs.daft.ai/en/stable/api/functions/prompt/" rel="noopener noreferrer"&gt;&lt;code&gt;prompt&lt;/code&gt; AI function&lt;/a&gt;. Here’s a quick example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;daft.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;daft&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pydict&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How many r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s are in strawberry?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_column&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nf"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;input&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
        &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vllm-prefix-caching&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Qwen/Qwen-8B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Benchmarking Setup
&lt;/h2&gt;

&lt;p&gt;All benchmarking and dataset generation scripts can be found in &lt;a href="https://github.com/Eventual-Inc/Daft/tree/main/benchmarking/vllm" rel="noopener noreferrer"&gt;the Daft repository on Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;To evaluate and benchmark our system, we used vLLM’s &lt;a href="https://docs.vllm.ai/en/stable/api/vllm/benchmarks/datasets.html#vllm.benchmarks.datasets.PrefixRepetitionRandomDataset" rel="noopener noreferrer"&gt;PrefixRepetitionRandomDataset&lt;/a&gt; to generate a 102-million-token dataset of 200k prompts, each 512 tokens long, drawn from 512 unique prefixes of 256 tokens (half of each prompt).&lt;/p&gt;

&lt;h3&gt;
  
  
  Workload
&lt;/h3&gt;

&lt;p&gt;We chose the &lt;a href="https://huggingface.co/Qwen/Qwen3-8B" rel="noopener noreferrer"&gt;Qwen/Qwen3-8B&lt;/a&gt; model in bfloat16 precision, a popular model used in batch inference for tasks such as synthetic data generation, product enrichment, and structured extraction.&lt;/p&gt;

&lt;p&gt;For each input prompt, we generated 128 output tokens and used a temperature of 1. This generates around 25.6M output tokens.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware
&lt;/h3&gt;

&lt;p&gt;For our hardware, we use NVIDIA &lt;a href="https://www.nvidia.com/en-us/data-center/l4/" rel="noopener noreferrer"&gt;L4 GPUs&lt;/a&gt;, which have 24 GB of memory and can comfortably host Qwen3-8B in bfloat16 with room for the KV cache.&lt;/p&gt;

&lt;p&gt;Our pick for servers was the &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;, each with 4 L4 GPUs, 48 CPU cores, 192 GB of DRAM, and 40 Gbps of network bandwidth.&lt;/p&gt;

&lt;p&gt;We ran our setup in 3 configurations to test the scalability of our methods.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;Number of GPUs&lt;/th&gt;
&lt;th&gt;CPU cores&lt;/th&gt;
&lt;th&gt;Network (Gbps)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8 x &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;384&lt;/td&gt;
&lt;td&gt;320&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16 x &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32 x &lt;a href="https://aws.amazon.com/ec2/instance-types/g6/" rel="noopener noreferrer"&gt;g6.12xlarge&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;1536&lt;/td&gt;
&lt;td&gt;1280&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Methods
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Naive Batching (Baseline)
&lt;/h4&gt;

&lt;p&gt;Our baseline method consists of simply splitting the input data into batches of 512 prompts and sending them into the serving engines sequentially. We implemented this via Daft’s class-based batch UDFs. &lt;/p&gt;

&lt;p&gt;Naive Batching on our 128 GPU configuration takes &lt;strong&gt;977 seconds&lt;/strong&gt; and has a &lt;strong&gt;29.2%&lt;/strong&gt; Cache Hit Rate.&lt;/p&gt;

&lt;p&gt;Our next step is to try continuous batching that could potentially improve pipelining and combat the issue of stragglers. &lt;/p&gt;

&lt;h4&gt;
  
  
  Continuous Batching
&lt;/h4&gt;

&lt;p&gt;With continuous batching, we instead maintain a buffer of tasks for each serving engine to process, implemented as a pool of async tasks that call &lt;code&gt;AsyncLLMEngine.generate&lt;/code&gt; on vLLM. The serving engine pops prompts from the task pool in order to maintain a consistent batch of sequences to run inference over.&lt;/p&gt;

&lt;p&gt;Continuous batching takes &lt;strong&gt;869 seconds&lt;/strong&gt; and yields an &lt;strong&gt;11% speedup&lt;/strong&gt;. We also see the cache hit rate decrease from &lt;strong&gt;29.2%&lt;/strong&gt; to &lt;strong&gt;26.5%&lt;/strong&gt;. We believe this is because continuous batching processes a larger batch of sequences at a time on average, leading to more cache evictions.&lt;/p&gt;

&lt;p&gt;Our next step is to improve the cache hit rate, which we can do by grouping common prefixes together. A simple way to do this is to globally sort the data, which is what we try next.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sorting
&lt;/h4&gt;

&lt;p&gt;For this method, we run the same continuous batching technique, along with a synchronous global sort of the data at the start of the workload. This ensures that for the most part, prompts with common prefixes end up in the same batch.&lt;/p&gt;

&lt;p&gt;Synchronously sorting the data and then running the continuous batching method takes &lt;strong&gt;563 seconds&lt;/strong&gt;, a &lt;strong&gt;35.2% speedup&lt;/strong&gt; relative to continuous batching alone. We can verify that this comes from better caching by looking at the cache hit rate, which increases from &lt;strong&gt;26.5%&lt;/strong&gt; to &lt;strong&gt;54.5%&lt;/strong&gt;, meaning more than half of the input tokens now hit the cache.&lt;/p&gt;

&lt;p&gt;One downside of this method is that our GPUs sit idle while the distributed sort is happening. Our next attempt is to do the continuous batching inference and the prefix grouping at the same time, so that our GPUs do useful work for the full duration of the workload. We do this by relaxing the requirement of globally sorting the data and using the Dynamic Prefix Bucketing scheme that we previously discussed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dynamic Prefix Bucketing
&lt;/h4&gt;

&lt;p&gt;By employing Dynamic Prefix Bucketing locally and Prefix-Aware Routing globally, we are able to avoid the GPU idle time caused by the global sort, while still achieving good prefix cache hit rates across the cluster. In this method, we also make use of continuous batching, sending prefix-bucketed prompts to the inference input buffers in a streaming fashion.&lt;/p&gt;

&lt;p&gt;Our Dynamic Prefix Bucketing method took &lt;strong&gt;482 seconds&lt;/strong&gt;, a &lt;strong&gt;14.4% speedup&lt;/strong&gt; relative to the synchronous global sort method and a &lt;strong&gt;50.7% total speedup&lt;/strong&gt; over our baseline. Furthermore, we maintain a cache hit rate of &lt;strong&gt;54%&lt;/strong&gt;. This means Dynamic Prefix Bucketing pays a cache hit rate penalty of only &lt;strong&gt;0.5 percentage points&lt;/strong&gt; compared to globally sorting the data, while still being able to pipeline with LLM inference!&lt;/p&gt;

&lt;h4&gt;
  
  
  Ray Data
&lt;/h4&gt;

&lt;p&gt;As an additional baseline, we use Ray Data with their off-the-shelf &lt;code&gt;ray.data.llm&lt;/code&gt; &lt;a href="https://docs.ray.io/en/latest/data/working-with-llms.html#batch-inference-llm" rel="noopener noreferrer"&gt;batch processing APIs&lt;/a&gt; [1]. Since it also uses vLLM under the hood, we were able to set it to the exact same configurations as our own benchmarking scripts. The one thing we changed was the batch size, which we set to 16, since we observed that a smaller batch size performed better on their setup.&lt;/p&gt;

&lt;p&gt;With Ray Data, we observe a runtime of &lt;strong&gt;842 seconds&lt;/strong&gt;, which is similar to our continuous batching method. Since Ray Data also utilizes continuous batching, this validates the performance of our methods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2jjzp2y97ji11jldrc0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc2jjzp2y97ji11jldrc0.png" alt="Performance comparison bar chart showing runtime in seconds for different batching methods on 128 L4 GPUs" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifvyylsf9ogwbcpl174p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fifvyylsf9ogwbcpl174p.png" alt="Prefix cache hit rates bar chart showing different methods with Dynamic Prefix Bucketing achieving 54% hit rate" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;We next test the scalability of Daft with Dynamic Prefix Bucketing and Ray Data.&lt;/p&gt;

&lt;p&gt;To do this, we run both systems on our 32, 64, and 128 GPU configurations and measure the wall time. From this we derive a scaling factor that captures how well each system scales as we increase the cluster size.&lt;/p&gt;
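
&lt;p&gt;Concretely, the scaling factor is the observed speedup divided by the ideal linear speedup from the added GPUs:&lt;/p&gt;

```python
def scaling_factor(base_runtime, runtime, base_gpus, gpus):
    """Observed speedup divided by the ideal (linear) speedup
    from adding GPUs; 1.0 means perfect scaling."""
    speedup = base_runtime / runtime
    ideal = gpus / base_gpus
    return speedup / ideal

# Daft runtimes in seconds, from the table below
print(round(scaling_factor(1682, 865, 32, 64), 2))   # → 0.97
print(round(scaling_factor(1682, 481, 32, 128), 2))  # → 0.87
```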

&lt;p&gt;For Daft, we see near-linear scaling going from 32 to 64 GPUs, and 87% efficiency going from 32 to 128 GPUs. At that scale, the overhead of downloading model weights and initializing the model on every GPU (all 128 of them) becomes the bottleneck for further scalability, since it is a constant cost.&lt;/p&gt;

&lt;p&gt;We also see that, in all configurations below, Daft with Dynamic Prefix Bucketing scales slightly better than Ray Data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Daft Runtime (s)&lt;/th&gt;
&lt;th&gt;Daft Speedup (vs 32 GPU)&lt;/th&gt;
&lt;th&gt;Daft Scaling Factor (vs 32 GPU)&lt;/th&gt;
&lt;th&gt;Ray Data Runtime (s)&lt;/th&gt;
&lt;th&gt;Ray Data Speedup (vs 32 GPU)&lt;/th&gt;
&lt;th&gt;Ray Data Scaling Factor (vs 32 GPU)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;32 GPUs&lt;/td&gt;
&lt;td&gt;1682&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2915&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64 GPUs&lt;/td&gt;
&lt;td&gt;865&lt;/td&gt;
&lt;td&gt;1.94&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;1548&lt;/td&gt;
&lt;td&gt;1.88&lt;/td&gt;
&lt;td&gt;0.94&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128 GPUs&lt;/td&gt;
&lt;td&gt;481&lt;/td&gt;
&lt;td&gt;3.49&lt;/td&gt;
&lt;td&gt;0.87&lt;/td&gt;
&lt;td&gt;842&lt;/td&gt;
&lt;td&gt;3.46&lt;/td&gt;
&lt;td&gt;0.86&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mybuopa7fsmjdd6jlt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2mybuopa7fsmjdd6jlt8.png" alt="Scaling speedup chart showing near-linear scaling from 32 to 128 GPUs with both ideal and real speedup lines" width="667" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ablation on Prefix Count
&lt;/h3&gt;

&lt;p&gt;Finally, we examine how Daft with Dynamic Prefix Bucketing adapts to datasets with varying numbers of prefixes. We sweep the number of unique prefixes in a 102M-token dataset of 200k prompts.&lt;/p&gt;

&lt;p&gt;Dynamic Prefix Bucketing works better when a common prefix appears more often in the dataset: the more prompts share a prefix, the faster the workload runs and the higher the cache hit rate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr4pcxokjd8b2bux12qf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr4pcxokjd8b2bux12qf.png" alt="Performance vs number of unique prefixes bar chart showing faster runtime with fewer unique prefixes" width="609" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxhq6myzwlczj3uhk2b1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvxhq6myzwlczj3uhk2b1.png" alt="Cache hit rates vs number of prefixes showing decreasing hit rates as prefix diversity increases" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;The &lt;em&gt;vLLM Prefix Caching&lt;/em&gt; model provider is available to try today. Below are some future improvements that we would like to make to the implementation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Beyond text generation&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;em&gt;vLLM Prefix Caching&lt;/em&gt; provider currently only supports text generation with our &lt;code&gt;prompt&lt;/code&gt; function, but the same techniques described in this post can also be applied to embedding generation and structured outputs.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Smarter load balancing&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The router currently load balances using the number of prompts sent to each serving engine replica. This assumes that all GPUs are equally as fast and that sequences take around the same time to generate, which may not be true in real-world scenarios. Instead, the router should monitor the actual number of unfinished requests on each replica to better load balance.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;More accurate cache modeling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;The router estimates the prefix cache on each replica via a bounded queue of sent prefixes for each replica. We found that this is already very effective, but a more accurate model of the prefix caches or a method to inspect the cache metrics on serving engines would improve the ability to route batches to the best replica.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Further improve scaling&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;We should investigate the current bottlenecks for scaling the system to larger clusters. In theory it should be possible to achieve super-linear scaling, where 2x more GPUs can achieve more than 2x speedup, since a larger cluster will have a larger total prefix cache.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
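
&lt;p&gt;The load balancing and cache modeling ideas above can be sketched together as a toy router (hypothetical names, greatly simplified; not Daft's actual code): each replica keeps a bounded queue of recently routed prefixes, and a bucket goes to a replica that already remembers its prefix, falling back to the least-loaded one:&lt;/p&gt;

```python
from collections import deque

class ToyPrefixRouter:
    """Toy sketch of prefix-aware routing across serving engine replicas."""

    def __init__(self, n_replicas, memory=128):
        self.sent = [0] * n_replicas  # prompts sent per replica
        # Bounded queue of recently routed prefixes per replica,
        # used as a cheap estimate of each replica's prefix cache.
        self.recent = [deque(maxlen=memory) for _ in range(n_replicas)]

    def route(self, prefix, batch_size):
        # Prefer a replica whose bounded prefix memory already
        # contains this prefix (its cache is likely warm)...
        candidates = [i for i, d in enumerate(self.recent) if prefix in d]
        # ...otherwise fall back to the least-loaded replica.
        target = min(candidates or range(len(self.sent)),
                     key=lambda i: self.sent[i])
        self.sent[target] += batch_size
        self.recent[target].append(prefix)
        return target

router = ToyPrefixRouter(n_replicas=2)
a = router.route("SYSTEM-A", 16)  # no warm replica: least-loaded wins
b = router.route("SYSTEM-A", 16)  # same prefix routes to the same replica
```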

&lt;p&gt;In addition, we welcome your feedback on these features! Let us know how we can improve Daft and what you would like to see by submitting an issue on &lt;a href="https://github.com/Eventual-Inc/Daft/issues" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or sending us a message on our &lt;a href="https://join.slack.com/t/dist-data/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg" rel="noopener noreferrer"&gt;community Slack&lt;/a&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Appendix
&lt;/h2&gt;

&lt;p&gt;[1] We encountered an issue using Ray Data’s &lt;code&gt;build_llm_processor&lt;/code&gt; where we would get an error about no running async event loop. We were able to resolve this issue by downgrading our &lt;code&gt;uvloop&lt;/code&gt; version to v0.21.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>aiops</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Daft vs Ray Data: A Comprehensive Comparison for Multimodal Data Processing</title>
      <dc:creator>YK Sugi</dc:creator>
      <pubDate>Mon, 27 Oct 2025 16:17:59 +0000</pubDate>
      <link>https://dev.to/yks/daft-vs-ray-data-a-comprehensive-comparison-for-multimodal-data-processing-3686</link>
      <guid>https://dev.to/yks/daft-vs-ray-data-a-comprehensive-comparison-for-multimodal-data-processing-3686</guid>
      <description>&lt;p&gt;Multimodal AI workloads break traditional data engines. They need to embed documents, classify images, and transcribe audio, not just run aggregations and joins. These multimodal workloads are tough: memory usage balloons mid-pipeline, processing requires both CPU and GPU, and a single machine can't handle the data volume.&lt;/p&gt;

&lt;p&gt;This post provides a comprehensive comparison of &lt;strong&gt;Daft&lt;/strong&gt; and &lt;strong&gt;Ray Data&lt;/strong&gt; for multimodal data processing, examining their architectures and performance. &lt;a href="https://www.daft.ai/blog/benchmarks-for-multimodal-ai-workloads" rel="noopener noreferrer"&gt;Benchmarks&lt;/a&gt; across large-scale audio, video, document, and image workloads found Daft ran 2-7x faster than Ray Data and 4-18x faster than Spark, while finishing jobs reliably.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Multimodal Data Challenge
&lt;/h2&gt;

&lt;p&gt;Multimodal data processing presents unique challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Explosions&lt;/strong&gt;: A compressed image like a JPEG inflates 20x in memory once decoded. A single video file can be decoded into thousands of frames, each being megabytes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Heterogeneous Compute&lt;/strong&gt;: These workloads stress CPU, GPU, and network simultaneously. Processing steps include resampling, feature extraction, transcription, downloading, decoding, resizing, normalizing, and classification.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Volume&lt;/strong&gt;: The benchmarked workloads included 113,800 audio files from Common Voice 17, 10,000 PDFs from Common Crawl, 803,580 images from ImageNet, and 1,000 videos from Hollywood2.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Introducing the Contenders
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Daft
&lt;/h3&gt;

&lt;p&gt;Daft is designed to handle petabyte-scale workloads with multimodal data (audio, video, images, text, embeddings) as first-class citizens.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native multimodal operations&lt;/strong&gt;: Built-in image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, reading video to image frames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Declarative DataFrame/SQL API&lt;/strong&gt;: With schema validation and query optimizer that automatically handles projection pushdowns, filter pushdowns, and join reordering - optimizations users get "for free" without manual tuning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Comprehensive I/O support&lt;/strong&gt;: Native readers and writers for Parquet, CSV, JSON, Lance, Iceberg, Delta Lake, and WARC formats, tightly integrated with the streaming execution model&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ray Data
&lt;/h3&gt;

&lt;p&gt;Ray Data is a data processing library built on top of Ray, a framework for building distributed Python applications.&lt;/p&gt;

&lt;p&gt;Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low-level operators&lt;/strong&gt;: Provides operations like &lt;code&gt;map_batches&lt;/code&gt; that work directly on PyArrow record batches or pandas DataFrames&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ray ecosystem integration&lt;/strong&gt;: Tight integration with Ray Train for distributed training and Ray Serve for model serving&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Daft's Streaming Execution Model
&lt;/h3&gt;

&lt;p&gt;Daft's architecture revolves around its Swordfish streaming execution engine. Data is always "in flight": batches flow through the pipeline as soon as they are ready. For a partition of 100k images, the first 1000 can be fed into model inference while the next 1000 are being downloaded or decoded. The entire partition never has to be fully materialized in an intermediate buffer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backpressure mechanism&lt;/strong&gt;: If GPU inference becomes the bottleneck, the upstream steps automatically slow down so memory usage remains bounded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adaptive batch sizing&lt;/strong&gt;: Daft shrinks batch sizes on memory-heavy operations like &lt;code&gt;url_download&lt;/code&gt; or &lt;code&gt;image_decode&lt;/code&gt;, keeping throughput high without ballooning memory usage.&lt;/p&gt;
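
&lt;p&gt;As a back-of-the-envelope illustration of the idea (our example numbers, not Daft's internals): given an estimate of how much each row inflates when decoded, a batch size can be chosen so the decoded output stays within a memory budget:&lt;/p&gt;

```python
def safe_batch_size(budget_bytes, avg_compressed_bytes, inflation=20):
    """Rows per batch so decoded output stays within a memory budget.

    `inflation` is the assumed decode blow-up factor (e.g. a JPEG
    inflating roughly 20x once decoded to raw pixels).
    """
    per_row = avg_compressed_bytes * inflation
    return max(1, budget_bytes // per_row)

# 2 GiB budget, 500 KB average JPEGs, ~20x decode inflation
print(safe_batch_size(2 * 1024**3, 500_000))  # → 214
```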

&lt;p&gt;&lt;strong&gt;Flotilla distributed engine&lt;/strong&gt;: Daft's distributed runner deploys one Swordfish worker per node, enabling the same streaming execution model to scale across clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feznrf8ousj4i8vlz301k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feznrf8ousj4i8vlz301k.gif" alt="Animated diagram of Daft's Flotilla distributed architecture showing three Swordfish workers processing data in parallel. Each worker streams data through four sequential stages: Scan, Download, Embed, and Write, with arrows indicating continuous data flow between stages. The animation demonstrates how batches flow through the pipeline without materializing entire partitions." width="720" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Ray Data's Execution Model
&lt;/h3&gt;

&lt;p&gt;Ray Data streams data between heterogeneous operations (e.g., CPU → GPU) that users define via classes or resource requests. Within homogeneous operations, Ray Data fuses sequential operations into the same task and executes them sequentially, which can cause memory issues without careful tuning of block sizes. You can work around this by using classes instead of functions in &lt;code&gt;map&lt;/code&gt;/&lt;code&gt;map_batches&lt;/code&gt;, but this materializes intermediates in Ray's object store, adding serialization and memory copy overhead. Ray's object store is by default only 30% of machine memory, and this limitation can lead to excessive disk spilling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Benchmarks
&lt;/h2&gt;

&lt;p&gt;Based on &lt;a href="https://www.daft.ai/blog/benchmarks-for-multimodal-ai-workloads" rel="noopener noreferrer"&gt;recent benchmarks&lt;/a&gt; conducted on identical AWS clusters (8 x g6.xlarge instances with NVIDIA L4 GPUs, each with 4 vCPUs, 16 GB memory, and 100 GB EBS volume), here's how the two frameworks compare:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsidc5fbr3m74u2a7ypmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsidc5fbr3m74u2a7ypmc.png" alt="Bar chart showing Daft significantly outperforming Ray Data across four AI workloads: Audio Transcription, Document Embedding, Image Classification, and Video Object Detection, with Daft completing tasks 2-7x faster" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Daft&lt;/th&gt;
&lt;th&gt;Ray Data&lt;/th&gt;
&lt;th&gt;Spark&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Audio Transcription&lt;/strong&gt; (113,800 files)&lt;/td&gt;
&lt;td&gt;6m 22s&lt;/td&gt;
&lt;td&gt;29m 20s (4.6x slower)&lt;/td&gt;
&lt;td&gt;25m 46s (4.0x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Document Embedding&lt;/strong&gt; (10,000 PDFs)&lt;/td&gt;
&lt;td&gt;1m 54s&lt;/td&gt;
&lt;td&gt;14m 32s (7.6x slower)&lt;/td&gt;
&lt;td&gt;8m 4s (4.2x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Image Classification&lt;/strong&gt; (803,580 images)&lt;/td&gt;
&lt;td&gt;4m 23s&lt;/td&gt;
&lt;td&gt;23m 30s (5.4x slower)&lt;/td&gt;
&lt;td&gt;45m 7s (10.3x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Video Object Detection&lt;/strong&gt; (1,000 videos)&lt;/td&gt;
&lt;td&gt;11m 46s&lt;/td&gt;
&lt;td&gt;25m 54s (2.2x slower)&lt;/td&gt;
&lt;td&gt;3h 36m (18.4x slower)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why Such Large Performance Differences?
&lt;/h3&gt;

&lt;p&gt;Several architectural decisions contribute to Daft's performance advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Native Operations vs Python UDFs&lt;/strong&gt;: Daft has native multimodal expressions including image decoding/encoding/cropping/resizing, text and image embedding/classification APIs, LLM APIs, text tokenization, cosine similarity, URL downloads/uploads, and reading video to image frames. These native expressions are highly optimized in Daft. In Ray Data, you have to write your own Python UDFs on top of external dependencies like Pillow, NumPy, spaCy, Hugging Face, etc. This comes at the cost of extra data movement, because each of these libraries has its own data format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Management - Streaming vs Materialization&lt;/strong&gt;: Daft streams data through network, CPU, and GPU in a continuous stream without materializing entire partitions. Ray Data fuses sequential operations which can cause memory issues. While you can work around this by using classes to materialize intermediates in the object store, this adds serialization and memory copy overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Resource Utilization&lt;/strong&gt;: Daft pipelines everything inside a single Swordfish worker, which has control over all resources of the machine. Data asynchronously streams from cloud storage, into the CPUs to run pre-processing steps, then into GPU memory for inference, and back out for results to be uploaded. CPUs, GPUs, and the network stay saturated together for optimal throughput. In contrast, Ray Data by default reserves a CPU core for I/O-heavy operations like downloading large videos, which can leave that core unavailable for CPU-bound processing work, requiring manual tuning of fractional CPU requests to optimize resource usage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  When to Choose Which?
&lt;/h2&gt;

&lt;p&gt;Based on the benchmark results and architectural differences:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Daft&lt;/strong&gt; shows significant advantages for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multimodal data processing (images, documents, video, audio)&lt;/li&gt;
&lt;li&gt;Workloads requiring reliable execution without extensive tuning&lt;/li&gt;
&lt;li&gt;Complex queries with joins, aggregations, and multiple transformations&lt;/li&gt;
&lt;li&gt;Teams preferring DataFrame/SQL semantics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ray Data&lt;/strong&gt; may be preferred when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have tight integration needs with the Ray ecosystem (Ray Train, Ray Serve)&lt;/li&gt;
&lt;li&gt;You need fine-grained control over CPU/GPU allocation per operation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Practitioners Are Saying
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is Daft battle-tested enough for production?
&lt;/h3&gt;

&lt;p&gt;When &lt;a href="https://www.linkedin.com/in/timr11/" rel="noopener noreferrer"&gt;Tim Romanski&lt;/a&gt; of Essential AI set out to taxonomize 23.6 billion web documents from Common Crawl (24 trillion tokens), his team pushed Daft to its limits - scaling from local development to 32,000 requests per second per VM. As he shared in a &lt;a href="https://youtu.be/y5hs7q_LaLM?t=466" rel="noopener noreferrer"&gt;panel discussion&lt;/a&gt;: "We pushed Daft to the limit and it's battle tested... If we had to do the same thing in Spark, we would have to have the JVM installed, go through all of its nuts and bolts just to get something running. So the time to get something running in the first place was a lot shorter. And then once we got it running locally, we just scaled up to multiple machines."&lt;/p&gt;

&lt;h3&gt;
  
  
  What gap does Daft fill in the Ray ecosystem?
&lt;/h3&gt;

&lt;p&gt;CloudKitchens rebuilt their entire ML infrastructure around what they call the "DREAM stack" (Daft, Ray, poEtry, Argo, Metaflow). When selecting their data processing layer, they identified specific limitations with Ray Data and chose Daft to complement Ray's compute capabilities. As their infrastructure team &lt;a href="https://techblog.cloudkitchens.com/p/ml-infrastructure-doesnt-have-to" rel="noopener noreferrer"&gt;explained&lt;/a&gt;, "one issue with the Ray library for data processing, Ray Data, is that it doesn't cover the full range of DataFrame/ETL functions and its performance could be improved." They chose Daft because "it fills the gap of Ray Data by providing amazing DataFrame APIs" and noted that "in our tests, it's faster than Spark and uses fewer resources."&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Daft perform on even larger datasets?
&lt;/h3&gt;

&lt;p&gt;A data engineer from ByteDance commented on Daft's &lt;a href="https://www.daft.ai/blog/processing-300k-images-without-oom" rel="noopener noreferrer"&gt;300K image processing demonstration&lt;/a&gt;, sharing his own experience with an even larger image classification workload: "Not just 300,000 images - we ran image classification evaluations on the ImageNet dataset with approximately 1.28 million images, and Daft was about 20% faster than Ray Data." Additionally, in a separate technical analysis of Daft's architecture, he praised its "excellent execution performance and resource efficiency" and highlighted how it "effortlessly enables streaming processing of large-scale image datasets."&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.daft.ai/blog/benchmarks-for-multimodal-ai-workloads" rel="noopener noreferrer"&gt;Benchmarks for Multimodal AI Workloads&lt;/a&gt; - Primary source for performance data and architectural comparisons&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Eventual-Inc/Daft/tree/main/benchmarking/ai" rel="noopener noreferrer"&gt;Benchmark Code Repository&lt;/a&gt; - Open-source code to reproduce all benchmarks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dist-data.slack.com/join/shared_invite/zt-2e77olvxw-uyZcPPV1SRchhi8ah6ZCtg" rel="noopener noreferrer"&gt;Distributed Data Community Slack&lt;/a&gt; - Join the community to discuss with Daft developers and users&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>dataengineering</category>
      <category>aiops</category>
      <category>distributedsystems</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
