<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: interviewgpt</title>
    <description>The latest articles on DEV Community by interviewgpt (@interviewgpt_fd26fed0b5cf).</description>
    <link>https://dev.to/interviewgpt_fd26fed0b5cf</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3866603%2F55901fde-f633-435c-b51a-2f3d50fb0540.png</url>
      <title>DEV Community: interviewgpt</title>
      <link>https://dev.to/interviewgpt_fd26fed0b5cf</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/interviewgpt_fd26fed0b5cf"/>
    <language>en</language>
    <item>
      <title>High-Throughput GPU Inference Batching System Design</title>
      <dc:creator>interviewgpt</dc:creator>
      <pubDate>Tue, 07 Apr 2026 22:19:43 +0000</pubDate>
      <link>https://dev.to/interviewgpt_fd26fed0b5cf/high-throughput-gpu-inference-batching-system-design-ad5</link>
      <guid>https://dev.to/interviewgpt_fd26fed0b5cf/high-throughput-gpu-inference-batching-system-design-ad5</guid>
      <description>&lt;p&gt;&lt;a href="https://interviewgpt.deepchill.app/blogs/design/high-throughput-gpu-inference-batching-system-pNDGaPKW4teLBw7yf4C9dA" rel="noopener noreferrer"&gt;High-Throughput GPU Inference Batching System&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; How do you build a system that supports high-concurrency requests against an API you cannot change? This article walks through a complete infrastructure design for a GPU inference batching system — one that optimizes GPU utilization via a server-side batching mechanism that intelligently balances latency and throughput. From clarifying questions to deep trade-off analysis, this is a FAANG-level deep dive into one of the hardest infrastructure problems in applied ML.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clarifying Questions&lt;/li&gt;
&lt;li&gt;Crash Strategy &amp;amp; Key Points&lt;/li&gt;
&lt;li&gt;Elite Bonus Points (FAANG Rubrics)&lt;/li&gt;
&lt;li&gt;Functional Requirements&lt;/li&gt;
&lt;li&gt;Non-Functional Requirements&lt;/li&gt;
&lt;li&gt;Back-of-Envelope Estimation&lt;/li&gt;
&lt;li&gt;High-Level Design&lt;/li&gt;
&lt;li&gt;Low-Level Design&lt;/li&gt;
&lt;li&gt;Trade-offs, Alternatives &amp;amp; Optimizations&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. Clarifying Questions
&lt;/h2&gt;

&lt;p&gt;Before designing anything, you need to nail down assumptions. Here are the key questions — and the assumptions we'll carry forward:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Question&lt;/th&gt;
&lt;th&gt;Assumption&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What is the peak QPS and the target latency SLO?&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;10,000 QPS&lt;/strong&gt; with a p99 latency requirement of &lt;strong&gt;&amp;lt; 500ms&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What is the maximum batch size supported by the fixed inference API?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Max batch size is 64 requests&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;What is the payload size for input and output?&lt;/td&gt;
&lt;td&gt;Text-based, &lt;strong&gt;~2KB per request&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Is client communication synchronous or asynchronous?&lt;/td&gt;
&lt;td&gt;Clients expect a synchronous-like experience; we use an &lt;strong&gt;async-polling or long-polling pattern&lt;/strong&gt; internally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Do we need to handle request priorities (e.g., premium vs. free users)?&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No — FIFO for the MVP&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Clarifying questions are not just a formality. Each assumption here directly shapes an architectural decision downstream. The batch size cap (64) determines our batcher's flush trigger. The 500ms SLO sets our tolerable wait window. Always clarify before drawing boxes.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Crash Strategy &amp;amp; Key Points
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Core Bottleneck
&lt;/h3&gt;

&lt;p&gt;When requests arrive individually at a GPU worker, two problems emerge: &lt;strong&gt;under-utilization&lt;/strong&gt; (GPUs thrive on parallelism) and &lt;strong&gt;memory overhead&lt;/strong&gt; (per-request context switching). The naive design — one request in, one inference out — kills throughput.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Key Strategy: Dynamic Batching
&lt;/h3&gt;

&lt;p&gt;The solution is a &lt;strong&gt;Dynamic Batching Service&lt;/strong&gt; that acts as a buffer between your high-concurrency HTTP API and the fixed GPU workers. It's a traffic shaper: absorb the spikes, group requests intelligently, dispatch in bulk.&lt;/p&gt;

&lt;h3&gt;
  
  
  Progressive Problem Decomposition
&lt;/h3&gt;

&lt;p&gt;Think through this layer by layer:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How do we ingest 10k+ requests without blocking?&lt;/strong&gt;&lt;br&gt;
Use a distributed message queue. Each incoming request is a lightweight enqueue operation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How do we group them efficiently?&lt;/strong&gt;&lt;br&gt;
The Batcher implements &lt;strong&gt;"Wait-or-Full" logic&lt;/strong&gt; — flush when batch size hits 64, or when 50ms elapses, whichever comes first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How do we deliver results back to the user?&lt;/strong&gt;&lt;br&gt;
A &lt;strong&gt;Result Store&lt;/strong&gt; (Redis) holds completed inference outputs. Clients poll with a &lt;code&gt;task_id&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How do we scale the Batcher itself?&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Partition-based batching&lt;/strong&gt; — each Batcher instance consumes from a dedicated partition of the queue, eliminating global locks and contention.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
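&lt;p&gt;The "Wait-or-Full" logic in step 2 can be sketched in a few lines of Python. This is a minimal sketch (class and method names are ours); the 64-message and 50ms constants come from this design:&lt;/p&gt;

```python
import time

class WaitOrFullBatcher:
    """Flush when the buffer reaches max_size, or when max_wait_s has
    elapsed since the first message of the current window, whichever
    comes first."""

    def __init__(self, max_size=64, max_wait_s=0.050, clock=time.monotonic):
        self.max_size = max_size
        self.max_wait_s = max_wait_s
        self.clock = clock          # injectable for testing
        self.buffer = []
        self.window_start = None    # set when the first message arrives

    def add(self, msg):
        if not self.buffer:
            self.window_start = self.clock()
        self.buffer.append(msg)

    def should_flush(self):
        if not self.buffer:
            return False
        if len(self.buffer) >= self.max_size:
            return True             # size trigger
        # time trigger
        return self.clock() - self.window_start >= self.max_wait_s

    def flush(self):
        batch, self.buffer = self.buffer, []
        self.window_start = None
        return batch
```

&lt;p&gt;A real Batcher would check &lt;code&gt;should_flush&lt;/code&gt; on every message arrival and on a timer tick, so a partially filled window still flushes at 50ms.&lt;/p&gt;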




&lt;h2&gt;
  
  
  3. Elite Bonus Points (FAANG Rubrics)
&lt;/h2&gt;

&lt;p&gt;These are the insights that separate a good design from a great one:&lt;/p&gt;

&lt;h3&gt;
  
  
  Adaptive Batching
&lt;/h3&gt;

&lt;p&gt;Dynamically adjust the &lt;code&gt;wait_time&lt;/code&gt; based on current traffic volume. During low-traffic periods, flush quickly to minimize latency. During spikes, extend the wait window to fill batches and maximize GPU throughput. A simple EWMA (Exponentially Weighted Moving Average) on the arrival rate drives this.&lt;/p&gt;
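&lt;p&gt;One way to drive this, sketched with assumed parameters (the smoothing factor and wait bounds are illustrative, not from the article): smooth the arrival rate with an EWMA, then size the wait window to the time a full batch needs at that rate.&lt;/p&gt;

```python
class AdaptiveWait:
    """Smooth the request arrival rate with an exponentially weighted
    moving average, then size the flush window to the time one full
    batch needs at that rate, clamped to [min_wait_ms, max_wait_ms]."""

    def __init__(self, alpha=0.2, min_wait_ms=5.0, max_wait_ms=50.0, batch_size=64):
        self.alpha = alpha          # EWMA smoothing factor (assumed)
        self.rate_ewma = 0.0        # requests per second
        self.min_wait_ms = min_wait_ms
        self.max_wait_ms = max_wait_ms
        self.batch_size = batch_size

    def observe(self, rate_sample):
        """Feed a fresh arrival-rate sample (req/s)."""
        self.rate_ewma = self.alpha * rate_sample + (1 - self.alpha) * self.rate_ewma

    def wait_ms(self):
        if self.rate_ewma == 0:
            return self.min_wait_ms     # no traffic yet: flush fast
        # expected time to fill one batch at the smoothed rate
        fill_ms = 1000.0 * self.batch_size / self.rate_ewma
        return max(self.min_wait_ms, min(self.max_wait_ms, fill_ms))
```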

&lt;h3&gt;
  
  
  Zero-Copy Serialization
&lt;/h3&gt;

&lt;p&gt;Use &lt;strong&gt;Protobuf&lt;/strong&gt; or &lt;strong&gt;Apache Arrow&lt;/strong&gt; for internal data transfer. This reduces CPU overhead during batch construction and deconstruction — critical when you're processing millions of requests per hour.&lt;/p&gt;

&lt;h3&gt;
  
  
  GPU Backpressure Propagation
&lt;/h3&gt;

&lt;p&gt;Implement a feedback loop: if the GPU Worker's internal queue or memory utilization exceeds &lt;strong&gt;90%&lt;/strong&gt;, the Batcher slows ingestion. This prevents the queue from becoming a buffer for an already-saturated backend. Your system should degrade gracefully, not catastrophically.&lt;/p&gt;
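&lt;p&gt;The check itself is tiny; the 90% threshold is the figure above, the function name is ours:&lt;/p&gt;

```python
def ingest_allowed(gpu_queue_util, gpu_mem_util, threshold=0.90):
    """Backpressure gate: the Batcher stops pulling from the request
    queue while the GPU worker reports queue or memory utilization at
    or above the threshold (90% in this design)."""
    saturated = gpu_queue_util >= threshold or gpu_mem_util >= threshold
    return not saturated
```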

&lt;h3&gt;
  
  
  Locality-Aware Batching
&lt;/h3&gt;

&lt;p&gt;At global scale, ensure batching happens &lt;strong&gt;at the edge or within the same Availability Zone&lt;/strong&gt;. Cross-region data transfer costs are real, and cross-AZ latency adds up quickly under a 500ms SLO.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Functional Requirements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core Use Cases
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Users submit inference requests via a REST API.&lt;/li&gt;
&lt;li&gt;Requests are batched and processed by the GPU model.&lt;/li&gt;
&lt;li&gt;Users retrieve the inference result via polling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scope Control
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;In-Scope&lt;/th&gt;
&lt;th&gt;Out-of-Scope&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;Model training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request Queue&lt;/td&gt;
&lt;td&gt;Model optimization (TensorRT/ONNX)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batching Logic&lt;/td&gt;
&lt;td&gt;User authentication service&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result Storage&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scope control is not laziness — it's clarity. Defining the boundary prevents scope creep and lets you go deep on what matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Non-Functional Requirements
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scale&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handle &lt;strong&gt;10k QPS&lt;/strong&gt;, scale horizontally&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Batching overhead &lt;strong&gt;&amp;lt; 50ms&lt;/strong&gt;; total E2E latency &lt;strong&gt;&amp;lt; 500ms&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;99.9% uptime&lt;/strong&gt;; requests must not be lost on worker failure (at-least-once delivery)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Consistency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Eventual consistency for results; strict ordering within a batch is not required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fault Tolerance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dead-letter queues (DLQ) for failed inference attempts&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  6. Back-of-Envelope Estimation
&lt;/h2&gt;

&lt;p&gt;Let's do the math to validate our architecture can actually work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Traffic
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;10,000 requests/sec&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;10,000 req/s × 2KB/req = 20 MB/s&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;For 1 hour of retention: &lt;strong&gt;~72 GB&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bandwidth
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Ingress: &lt;code&gt;10,000 × 2KB = 20 MB/s&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Egress (Results): &lt;strong&gt;~20 MB/s&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GPU Worker Count
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;One batch of 64 requests takes ~200ms to process&lt;/li&gt;
&lt;li&gt;One GPU worker handles: &lt;code&gt;64 / 0.2s = 320 QPS&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Workers needed: &lt;code&gt;10,000 / 320 ≈ 32 GPU workers&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives us a concrete target — 32 GPU workers — and validates that the batching approach is load-bearing, not cosmetic.&lt;/p&gt;
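&lt;p&gt;The arithmetic above, reproduced as a short script (decimal units, matching the prose):&lt;/p&gt;

```python
import math

QPS = 10_000           # peak requests per second
REQ_KB = 2             # payload per request
BATCH = 64             # max batch size of the fixed API
BATCH_LATENCY_S = 0.2  # time to process one full batch

ingress_mb_s = QPS * REQ_KB / 1000      # 20.0 MB/s of ingress
hourly_gb = ingress_mb_s * 3600 / 1000  # 72.0 GB for one hour of retention
worker_qps = BATCH / BATCH_LATENCY_S    # 320 requests/s per GPU worker
workers = math.ceil(QPS / worker_qps)   # 32 workers needed at peak
```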




&lt;h2&gt;
  
  
  7. High-Level Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Design Summary
&lt;/h3&gt;

&lt;p&gt;A high-throughput pipeline using a &lt;strong&gt;distributed queue&lt;/strong&gt; to decouple request ingestion from GPU execution, featuring a dedicated &lt;strong&gt;Batcher Service&lt;/strong&gt; for optimal GPU utilization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Major Components
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Gateway&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Entry point — SSL termination, request validation, rate limiting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Request Queue (Redis Streams)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, in-memory buffer for incoming inference tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Batcher Service&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core logic — aggregates N messages or waits T milliseconds before calling the GPU API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Result Store (Redis)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Short-lived storage for finished inference results&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  System Architecture Diagram
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌────────┐      ┌─────────────┐      ┌───────────────┐      ┌──────────────────┐      ┌───────────────┐
│ Client │ ───► │ API Gateway │ ───► │ Request Queue │ ───► │ Batcher Service  │ ───► │ GPU Worker API│
└────────┘      └─────────────┘      │ (Redis Stream)│      │  (Wait-or-Full)  │      └───────┬───────┘
     ▲                ▲              └───────────────┘      └──────────────────┘              │
     │                │                                                                        ▼
     └────────────────┴──────────────────────────────────── Result Store (Redis) ◄────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Simplicity Audit
&lt;/h3&gt;

&lt;p&gt;This architecture intentionally avoids complex stream-processing frameworks like Apache Flink. A lightweight &lt;strong&gt;consumer-group-based batcher&lt;/strong&gt; is easier to deploy, debug, and scale for an MVP — and can be upgraded later if needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Low-Level Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Edge Layer
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Routing:&lt;/strong&gt; A Global Server Load Balancer (GSLB) routes traffic to the nearest regional API Gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; The API Gateway handles JWT validation and rate limiting — &lt;strong&gt;1,000 requests per user per minute&lt;/strong&gt; — to protect the GPU cluster from abuse.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  8.2 Service Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Topology:&lt;/strong&gt; Stateless API instances deployed in Kubernetes, auto-scaling on CPU (70% threshold).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API Schema:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /v1/inference
Body: { "input": "...", "client_id": "..." }
Response: { "task_id": "abc-123" }

GET /v1/result/{task_id}
Response: { "status": "PENDING" | "SUCCESS", "output": "..." }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
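&lt;p&gt;On the client side, the schema implies a polling loop against &lt;code&gt;GET /v1/result/{task_id}&lt;/code&gt;. A sketch with the HTTP call injected as &lt;code&gt;fetch&lt;/code&gt;; the interval and timeout values are illustrative, not from the design:&lt;/p&gt;

```python
import time

def poll_result(fetch, task_id, interval_s=0.05, timeout_s=2.0,
                sleep=time.sleep, clock=time.monotonic):
    """Client-side polling loop. `fetch` is an injected call that issues
    GET /v1/result/{task_id} and returns the decoded JSON body."""
    deadline = clock() + timeout_s
    while True:
        body = fetch(task_id)
        if body["status"] == "SUCCESS":
            return body["output"]
        if clock() >= deadline:
            raise TimeoutError(f"result for {task_id} not ready in {timeout_s}s")
        sleep(interval_s)  # back off before the next poll
```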



&lt;p&gt;&lt;strong&gt;Resilience:&lt;/strong&gt; 3 retries with exponential backoff for the Batcher when calling the GPU API.&lt;/p&gt;
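&lt;p&gt;A sketch of that retry policy, with the GPU call and the sleep injected for testability; the base delay and jitter factor are assumptions, not from the design:&lt;/p&gt;

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay_s=0.05, sleep=time.sleep):
    """Retry `fn` (standing in for the GPU API call) up to `attempts`
    times with exponential backoff plus jitter; on exhaustion re-raise
    so the caller can route the batch to the DLQ."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = base_delay_s * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```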

&lt;h3&gt;
  
  
  8.3 Storage Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Access Pattern:&lt;/strong&gt; High volume, with writes and reads roughly balanced (1:1 ratio). Data is transient — results expire after 10 minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result Table (Redis Hash):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key: &lt;code&gt;task_id&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Fields: &lt;code&gt;status&lt;/code&gt;, &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;timestamp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Distribution:&lt;/strong&gt; Partitioning by &lt;code&gt;task_id&lt;/code&gt; using &lt;strong&gt;Redis Cluster&lt;/strong&gt; to handle 20k+ operations per second.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.4 Cache Layer
&lt;/h3&gt;

&lt;p&gt;Deduplicate identical inference requests (e.g., the same prompt submitted multiple times) to save GPU cycles.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Key:&lt;/strong&gt; &lt;code&gt;SHA256(input_payload)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value:&lt;/strong&gt; &lt;code&gt;task_id&lt;/code&gt; or &lt;code&gt;cached_result&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TTL:&lt;/strong&gt; 5 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Handling:&lt;/strong&gt; If Redis fails, bypass the cache and go straight to the queue.&lt;/li&gt;
&lt;/ul&gt;
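&lt;p&gt;A sketch of the dedup flow, with an in-process dict standing in for Redis; in production &lt;code&gt;get&lt;/code&gt;/&lt;code&gt;put&lt;/code&gt; would be Redis reads and writes with an expiry:&lt;/p&gt;

```python
import hashlib
import time

class DedupCache:
    """Dedup layer keyed by SHA-256 of the payload; a dict stands in for
    Redis, and entries expire after ttl_s (300s, the 5-minute TTL above)."""

    def __init__(self, ttl_s=300, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries = {}  # digest -> (value, expiry)

    @staticmethod
    def key(payload):
        return hashlib.sha256(payload).hexdigest()

    def get(self, payload):
        hit = self.entries.get(self.key(payload))
        if hit is None:
            return None
        value, expiry = hit
        if self.clock() >= expiry:   # expired: treat as a miss
            del self.entries[self.key(payload)]
            return None
        return value

    def put(self, payload, value):
        self.entries[self.key(payload)] = (value, self.clock() + self.ttl_s)
```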

&lt;h3&gt;
  
  
  8.5 Messaging Layer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Topic Schema — &lt;code&gt;inference-requests&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"abc-123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-29T19:50:55Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Throughput:&lt;/strong&gt; Redis Streams partitioned across &lt;strong&gt;16 stream keys&lt;/strong&gt; to allow parallel Batcher consumers (a single stream has no built-in sharding).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Redis Streams?&lt;/strong&gt; Low latency, built-in consumer groups for at-least-once delivery, and operational simplicity compared to Kafka for this scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  8.6 Data Processing Layer — The Batcher (Core Design)
&lt;/h3&gt;

&lt;p&gt;This is the heart of the system. The Batcher Service uses a &lt;strong&gt;hybrid trigger model&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Trigger&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Size Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64 messages accumulated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time Trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50ms elapsed since the first message in the current window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Whichever fires first wins. This guarantees that no request waits in the batcher longer than 50ms, regardless of traffic volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Processing DAG:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Read from Stream
      │
      ▼
Accumulate in Memory (per-partition buffer)
      │
  Size=64 OR Time=50ms
      │
      ▼
Call Fixed GPU Worker API (one HTTP/gRPC call with full batch)
      │
      ▼
Disperse Results → Write each result to Redis by task_id
      │
      ▼
ACK Stream (mark messages as processed)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Scalability:&lt;/strong&gt; Multiple Batcher instances consume from different partitions of the Redis Stream — no shared state, no global locks.&lt;/p&gt;
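&lt;p&gt;The DAG above, reduced to one testable pass with its dependencies injected (the stream read, the GPU API, the Redis writes, and the ACK all become parameters in this sketch):&lt;/p&gt;

```python
def process_batch(messages, gpu_call, result_store, ack):
    """One pass of the DAG with dependencies injected: `messages` are
    dicts read from the stream, `gpu_call` wraps the fixed GPU API,
    `result_store` stands in for Redis, `ack` for the stream XACK."""
    outputs = gpu_call([m["payload"] for m in messages])  # one bulk call
    for m, out in zip(messages, outputs):
        result_store[m["task_id"]] = {"status": "SUCCESS", "output": out}
    # ACK only after every result is durable, preserving at-least-once
    ack([m["msg_id"] for m in messages])
```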

&lt;h3&gt;
  
  
  8.7 Infrastructure &amp;amp; Observability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics to Track:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;batch_size_distribution&lt;/code&gt; — Are we consistently filling batches? Or flushing early on timeouts?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;gpu_worker_latency&lt;/code&gt; — Are workers saturating?&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;queue_depth&lt;/code&gt; — Leading indicator of system stress.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Distributed Tracing:&lt;/strong&gt; Jaeger for end-to-end tracing from the API Gateway through the Batcher to the GPU API response. Essential for diagnosing latency regressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technology Stack:&lt;/strong&gt; Redis Streams, Redis Cluster, Kubernetes, Envoy, gRPC, Prometheus, Jaeger, JWT, mTLS.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Trade-offs, Alternatives &amp;amp; Optimizations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Fundamental Trade-off: Latency vs. Throughput
&lt;/h3&gt;

&lt;p&gt;We accept up to &lt;strong&gt;50ms of additional latency per request&lt;/strong&gt; in exchange for dramatically higher GPU utilization. A single request arriving at an idle batcher window waits up to 50ms. In return, the system can sustain &lt;strong&gt;10x the load&lt;/strong&gt; of a naïve pass-through design.&lt;/p&gt;

&lt;p&gt;This is the right trade-off for batch inference workloads, where throughput matters more than tail latency for individual requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability: At-Least-Once Delivery
&lt;/h3&gt;

&lt;p&gt;Redis Streams' &lt;strong&gt;Consumer Groups&lt;/strong&gt; ensure durability. If a Batcher instance crashes mid-processing, its unacknowledged messages remain in the stream's Pending Entries List (PEL) and can be claimed by another healthy instance via &lt;code&gt;XAUTOCLAIM&lt;/code&gt;. No request is silently dropped.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bottleneck Analysis
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Fixed GPU API is the ultimate bottleneck&lt;/strong&gt;. If it slows down, the Request Queue depth grows. Two mitigations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;TTL on the Queue:&lt;/strong&gt; Drop requests that have already aged past user tolerance instead of spending GPU time on answers nobody is waiting for.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure:&lt;/strong&gt; Signal upstream components to slow ingestion when GPU memory &amp;gt; 90%.&lt;/li&gt;
&lt;/ol&gt;
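&lt;p&gt;The TTL mitigation is a one-line check applied while draining the queue; the 0.5s constant here mirrors the 500ms SLO and is illustrative:&lt;/p&gt;

```python
def is_stale(enqueued_at, now, ttl_s=0.5):
    """True when a request has already waited past the tolerance window
    (0.5s here, mirroring the 500ms SLO) and should be dropped rather
    than sent to the GPU."""
    return now - enqueued_at >= ttl_s
```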

&lt;h3&gt;
  
  
  Security
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;All internal communication between the Batcher and GPU API uses &lt;strong&gt;mTLS&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Input sanitization is performed at the API Gateway to prevent prompt injection attacks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The "Hot Key" Problem: Request Collapsing
&lt;/h3&gt;

&lt;p&gt;When many users submit the same inference request (e.g., a trending query), the Batcher can implement &lt;strong&gt;request collapsing&lt;/strong&gt;: identify duplicate payloads within the same batch using a hash, send only one to the GPU, then replicate the result to all matching &lt;code&gt;task_id&lt;/code&gt; entries. This is a powerful optimization at scale.&lt;/p&gt;
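&lt;p&gt;Request collapsing splits cleanly into two pure steps, sketched here with names of our choosing: dedupe before the GPU call, then fan the outputs back out afterward.&lt;/p&gt;

```python
import hashlib

def collapse(batch):
    """Dedupe within one batch: `batch` is a list of (task_id, payload)
    pairs; returns the unique payloads to send to the GPU plus a map
    from each payload digest to every task_id awaiting that answer."""
    unique, waiters = [], {}
    for task_id, payload in batch:
        digest = hashlib.sha256(payload).hexdigest()
        if digest not in waiters:
            waiters[digest] = []
            unique.append((digest, payload))
        waiters[digest].append(task_id)
    return unique, waiters

def fan_out(unique, waiters, outputs):
    """Replicate each GPU output to all task_ids sharing its payload."""
    results = {}
    for (digest, _), out in zip(unique, outputs):
        for task_id in waiters[digest]:
            results[task_id] = out
    return results
```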




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Key Design Decision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;Envoy / K8s Ingress&lt;/td&gt;
&lt;td&gt;Rate limiting, JWT validation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Request Queue&lt;/td&gt;
&lt;td&gt;Redis Streams (16 partitions)&lt;/td&gt;
&lt;td&gt;Partitioned for parallel consumers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batcher&lt;/td&gt;
&lt;td&gt;Custom service, per-partition&lt;/td&gt;
&lt;td&gt;Wait-or-Full hybrid trigger (N=64, T=50ms)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU Worker&lt;/td&gt;
&lt;td&gt;Fixed external API&lt;/td&gt;
&lt;td&gt;Called with full batch payload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result Store&lt;/td&gt;
&lt;td&gt;Redis Cluster&lt;/td&gt;
&lt;td&gt;TTL=10min, keyed by task_id&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Observability&lt;/td&gt;
&lt;td&gt;Prometheus + Jaeger&lt;/td&gt;
&lt;td&gt;Batch size distribution, queue depth, GPU latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The architecture centers on one insight: &lt;strong&gt;GPUs are expensive, and idle GPUs are waste&lt;/strong&gt;. Every design decision — the queue, the batcher, the partitioning strategy, the cache — serves the goal of keeping GPU workers saturated while keeping the user experience fast and reliable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Designed to handle 10,000 QPS with a p99 latency under 500ms. 32 GPU workers. Zero request loss. One core idea: batch everything.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For more system-design articles, check the &lt;a href="https://interviewgpt.deepchill.app/blogs" rel="noopener noreferrer"&gt;InterviewGPT blog&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
