DEV Community

UnitBuilds for UnitBuilds CC

Posted on Jul 5

Token Factory: Understanding the pipeline

#ai #games #machinelearning #discuss

Have you ever wondered how high-performance LLM serving frameworks like vLLM, TensorRT-LLM, or Hugging Face TGI actually optimize model serving? While you wait for tokens to stream into your chat window, the infrastructure under the hood is performing a delicate balancing act: scheduling prompt prefill, paging KV-cache memory, verifying speculative token chains, and avoiding the bottlenecks and crashes that stall the whole system.

To teach you how LLMs are deployed, optimized, and served under high concurrent loads, I built an interactive factory simulation game:

🏭 Inference Pipeline Tycoon

Play in Fullscreen Mode (if the embed sizing is tight)

🛠️ Choose Your Terminal Level

Your journey as an infrastructure architect is split into three distinct serving terminals, each introducing advanced optimizations:

⚡ Level 1: Prefill & Decode Basics (Easy): Route prompts from the input queue into a Prefill Core to compute key-value activations, then link them to a Decode Core to generate autoregressive token streams. Target: 30.0 TPS.
🔋 Level 2: KV-Cache Paged Memory (Medium): Process large context windows under tight VRAM constraints. You must connect virtual paging allocators to compress memory allocations and prevent CUDA Out-of-Memory crashes. Target: 60.0 TPS on a restricted 3072 MB VRAM card.
🚀 Level 3: Speculative Speedup (Hard): Autoregressive decode is too slow to hit the client quota. You must deploy lightweight draft models and validation gates to generate and verify 3 tokens in parallel per step. Target: 120.0 TPS.

🧬 Playable ML Concepts Explained

This isn't just a basic puzzle game: every component, routing direction, and memory rule represents a real-world concept in modern machine learning infrastructure. Here is how the in-game mechanics map to the way large language models are optimized and served in production:

1. 🎯 Prefill vs. Decode (The Sequential Pipeline)

In-Game: You must place a Prefill Core (PREF) to process green prompt packets into magenta activation vectors, then route them to a Decode Core (DECO) to begin autoregressive token generation.
The Real-World Counterpart: LLM serving divides inference into two phases:
- Prefill: Processes all of the user's prompt tokens in parallel in a single forward pass, producing the initial key-value (KV) cache entries for every attention layer.
- Decode: Generates one token at a time. Each step feeds the newly generated token back in, attends over the cached keys and values, and appends that token's own keys and values to the cache, so every output token costs one forward pass.
How it affects LLMs: Every decode step has to read the full set of model weights from GPU memory to produce a single token per sequence, so decode is bound by memory bandwidth and is much slower per token than prefill, which is bound by compute. Placing cores far apart adds routing latency. Clumping them together represents standard hardware co-location to maximize throughput.
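To put rough numbers on the gap between the two phases, here is a back-of-envelope sketch. It is not taken from the game's code: the `ModelSpec` and `GpuSpec` types, the 7-billion-parameter model, and the GPU figures are all illustrative assumptions, and the formulas are the usual simplifications (one full weight read per decode step at batch size 1, about 2 FLOPs per parameter per prefill token).

```typescript
// Back-of-envelope throughput model. Every number here is an illustrative
// assumption, not a measurement of any real GPU or model.
interface ModelSpec {
  params: number;        // parameter count
  bytesPerParam: number; // 2 for fp16/bf16 weights
}

interface GpuSpec {
  memBandwidthBytesPerSec: number;
  flopsPerSec: number;
}

// Decode at batch size 1: each step streams every weight once from memory
// to emit one token (KV-cache reads are ignored for simplicity).
function decodeTokensPerSec(model: ModelSpec, gpu: GpuSpec): number {
  const weightBytes = model.params * model.bytesPerParam;
  return gpu.memBandwidthBytesPerSec / weightBytes;
}

// Prefill: roughly 2 FLOPs per parameter per prompt token, assumed to be
// limited by compute (the extra cost of attention is ignored).
function prefillTokensPerSec(model: ModelSpec, gpu: GpuSpec): number {
  return gpu.flopsPerSec / (2 * model.params);
}

// Hypothetical 7B-parameter fp16 model on a hypothetical accelerator.
const toyModel: ModelSpec = { params: 7e9, bytesPerParam: 2 };
const toyGpu: GpuSpec = { memBandwidthBytesPerSec: 1e12, flopsPerSec: 1e14 };

const decodeTps = decodeTokensPerSec(toyModel, toyGpu);
const prefillTps = prefillTokensPerSec(toyModel, toyGpu);
console.log(`decode ~${decodeTps.toFixed(1)} tok/s, prefill ~${prefillTps.toFixed(1)} tok/s`);
```

With these made-up figures, prefill comes out about 100 times faster per token than single-sequence decode. Real engines narrow the gap by batching many sequences into each decode step, which this sketch deliberately leaves out.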

2. 🔋 KV-Cache Paging (vLLM Page Allocator)

In-Game: Placing Page Allocators (vLLM) immediately adjacent to Prefill and Decode cores automatically compresses their VRAM cache footprint by 40%.
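The Real-World Counterpart: Key-value caching keeps the keys and values of past tokens in GPU VRAM so they don't have to be recomputed at every step. However, prompt and output lengths vary per request and are not known in advance, so engines that reserve one contiguous buffer per request suffer severe memory fragmentation, hit allocation limits early, and fail with CUDA out-of-memory errors.
How it affects LLMs: Modern engines implement PagedAttention (introduced by vLLM). The KV cache is split into fixed-size blocks, and a per-sequence block table maps logical positions to physical blocks, much like virtual memory paging in an operating system. This removes external fragmentation and limits waste to the last, partially filled block of each sequence. Unlike the in-game 40% bonus, real paging does not compress the cache; it reduces wasted memory, which the vLLM authors report can raise serving throughput by roughly 2-4x over earlier systems.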
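The block-table idea can be sketched in a few lines. This is a simplified illustration, not vLLM's implementation and not the game's code: the `PagedKvCache` class, its method names, and the block size of 16 tokens are all hypothetical, and it only tracks which blocks belong to which sequence rather than storing any tensors.

```typescript
// Toy paged KV-cache bookkeeping: fixed-size blocks handed out on demand.
// Class and method names are hypothetical, chosen for this sketch only.
class PagedKvCache {
  private readonly blockSize: number;
  private readonly freeBlocks: number[] = [];
  private readonly blockTables: Record<string, number[]> = {};
  private readonly tokenCounts: Record<string, number> = {};

  constructor(totalBlocks: number, blockSize: number) {
    this.blockSize = blockSize;
    for (let b = totalBlocks - 1; b >= 0; b--) this.freeBlocks.push(b);
  }

  // Reserve room for `count` more tokens; returns false instead of
  // "crashing" when the pool cannot satisfy the request.
  appendTokens(seqId: string, count: number): boolean {
    const table = this.blockTables[seqId] ?? [];
    const tokens = (this.tokenCounts[seqId] ?? 0) + count;
    const needed = Math.ceil(tokens / this.blockSize) - table.length;
    if (needed > this.freeBlocks.length) return false;
    for (let i = 0; i < needed; i++) {
      const block = this.freeBlocks.pop();
      if (block !== undefined) table.push(block);
    }
    this.blockTables[seqId] = table;
    this.tokenCounts[seqId] = tokens;
    return true;
  }

  // Return all of a finished sequence's blocks to the free pool.
  release(seqId: string): void {
    const table = this.blockTables[seqId] ?? [];
    for (const block of table) this.freeBlocks.push(block);
    delete this.blockTables[seqId];
    delete this.tokenCounts[seqId];
  }

  blocksUsedBy(seqId: string): number {
    return (this.blockTables[seqId] ?? []).length;
  }
}

// Three requests of very different lengths share one pool of 64 blocks.
const kvCache = new PagedKvCache(64, 16);
kvCache.appendTokens("a", 100); // 7 blocks
kvCache.appendTokens("b", 37);  // 3 blocks
kvCache.appendTokens("c", 500); // 32 blocks
const pagedSlots =
  (kvCache.blocksUsedBy("a") + kvCache.blocksUsedBy("b") + kvCache.blocksUsedBy("c")) * 16;
console.log(`paged slots reserved: ${pagedSlots} for 637 tokens`);
```

Here the three sequences hold 637 tokens in 672 reserved slots. Reserving a contiguous 2048-token buffer for each (an assumed maximum length) would take 6144 slots for the same requests.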

3. 🚀 Speculative Decoding (Drafter & Validation Gates)

In-Game: Placing a Draft Model (DRAF) adjacent to a Decode Core allows it to generate a draft of 3 speculative tokens per step. These drafts must pass through a Validation Gate (VALI) to verify them before reaching the output sink.
The Real-World Counterpart: Speculative decoding pairs a massive, accurate target LLM with a tiny, lightweight draft model that runs extremely fast. The draft model speculatively generates a sequence of $K$ tokens. The target model then verifies all $K$ tokens in parallel in a single forward pass.
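How it affects LLMs: Because decode is memory-bound, the target model can verify several tokens in roughly the time it takes to generate one token autoregressively, so speculative decoding can cut latency substantially when the draft is usually right. If all $K$ draft tokens match, we gain $K$ tokens (plus one more from the target's own prediction) in a single step; at the first mismatch, the remaining draft tokens are discarded and the target's token is used instead, so every step still produces at least one token.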
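The accept-or-roll-back logic is easier to follow in code. The sketch below shows the greedy variant of speculative decoding under stated assumptions: `draftModel` and `targetModel` are hypothetical stand-ins that pick a token from the context length alone, and the target is called once per position here, whereas a real engine scores all positions in one batched forward pass.

```typescript
// A "model" here is just a function from context to the next token id.
type NextToken = (context: readonly number[]) => number;

// One speculative step (greedy variant): returns the tokens to append.
function speculativeStep(
  context: readonly number[],
  draft: NextToken,
  target: NextToken,
  k: number,
): number[] {
  // 1. The draft model proposes k tokens autoregressively.
  const proposed: number[] = [];
  for (let i = 0; i < k; i++) {
    proposed.push(draft([...context, ...proposed]));
  }

  // 2. The target checks each position. (Emulated with one call per
  //    position; a real engine does this in a single forward pass.)
  const accepted: number[] = [];
  for (let i = 0; i < k; i++) {
    const expected = target([...context, ...accepted]);
    accepted.push(expected);
    if (proposed[i] !== expected) {
      // First mismatch: keep the target's token, drop the rest of the draft.
      return accepted;
    }
  }

  // 3. Every draft token matched: the target's pass also yields a bonus token.
  accepted.push(target([...context, ...accepted]));
  return accepted;
}

// Hypothetical toy models that agree on short contexts and diverge later.
const targetModel: NextToken = (context) => context.length % 7;
const draftModel: NextToken = (context) => context.length % 5;

const allAccepted = speculativeStep([], draftModel, targetModel, 3);
const partlyAccepted = speculativeStep([0, 1, 2, 3], draftModel, targetModel, 3);
console.log(allAccepted, partlyAccepted);
```

In the first call all three draft tokens are accepted and the step yields four tokens. In the second the draft diverges at its second token, so the step yields two. The output always matches what the target model would have produced on its own, which is why the technique does not change quality in the greedy case.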

🛠️ The Under-the-Hood Engineering Journey

Building a high-throughput simulation game with canvas rendering and real-time audio synthesis presented some fascinating web development challenges:

1. Fixed Timestep Physics (Decoupling FPS from TPS)

When rendering hundreds of active token particles simultaneously, canvas draw overhead can drop the browser's render rate to 15–20 FPS on older devices. In early drafts the simulation clock was tied to the render rate, so this slowdown capped throughput at 66 TPS even with optimized pipelines.

The Solution: We implemented a fixed-timestep accumulator running at 60 ticks/sec. Even if the browser's frame rate lags, the accumulator catches up by running multiple simulation ticks per rendered frame, so the throughput (TPS) metrics track real wall-clock time.
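A minimal sketch of such an accumulator follows. It is not the game's actual code: the `FixedTimestep` class is hypothetical, and the 250 ms clamp on a single frame is an added assumption that stops a long stall (such as a backgrounded tab) from triggering a huge burst of catch-up ticks.

```typescript
const TICK_MS = 1000 / 60;  // 60 simulation ticks per second
const MAX_FRAME_MS = 250;   // assumed safety clamp, not from the article

class FixedTimestep {
  private accumulatorMs = 0;

  // Call once per rendered frame with the elapsed wall-clock time.
  // Runs as many fixed-size simulation ticks as that time covers and
  // returns how many ticks were executed.
  advance(frameDeltaMs: number, tick: () => void): number {
    this.accumulatorMs += Math.min(frameDeltaMs, MAX_FRAME_MS);
    let ticksRun = 0;
    while (this.accumulatorMs >= TICK_MS) {
      tick();
      this.accumulatorMs -= TICK_MS;
      ticksRun++;
    }
    return ticksRun;
  }
}

// Simulate one second of wall-clock time rendered at only 15 FPS.
const clock = new FixedTimestep();
let simulationTicks = 0;
for (let frame = 0; frame < 15; frame++) {
  clock.advance(1000 / 15, () => { simulationTicks++; });
}
console.log(`ticks after 1 s at 15 FPS: ${simulationTicks}`);
```

Ticking once per frame would advance the simulation only 15 times in that second, whereas the accumulator runs about 60 ticks (floating-point rounding can leave the last one for the next frame). In a browser, `frameDeltaMs` would typically be the difference between successive `requestAnimationFrame` timestamps.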

2. Resolving Conduit Queue Propagation

Conduits initially processed one packet per tile per tick. When speculative validation released batches of 3 tokens at once, packets piled up in the conduit queues, capping throughput at one token per conduit per tick. Changing conduit propagation to a while loop let wire tiles behave like physical conductors, transferring all arrived tokens in the same tick.
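The difference between the two propagation rules can be reproduced with a toy queue. This is a sketch under assumptions, not the game's code: `propagateOne`, `propagateAll`, and `simulateConduit` are hypothetical names, and a conduit is reduced to a single inbox and outbox.

```typescript
type Propagate = (inbox: number[], outbox: number[]) => void;

// Original behaviour: forward at most one packet per tick.
const propagateOne: Propagate = (inbox, outbox) => {
  const packet = inbox.shift();
  if (packet !== undefined) outbox.push(packet);
};

// Fixed behaviour: drain everything that has arrived this tick.
const propagateAll: Propagate = (inbox, outbox) => {
  while (inbox.length > 0) {
    const packet = inbox.shift();
    if (packet !== undefined) outbox.push(packet);
  }
};

// Feed a burst of packets into one conduit every tick and count the result.
function simulateConduit(
  propagate: Propagate,
  ticks: number,
  burstSize: number,
): { delivered: number; backlog: number } {
  const inbox: number[] = [];
  const outbox: number[] = [];
  let nextId = 0;
  for (let t = 0; t < ticks; t++) {
    for (let i = 0; i < burstSize; i++) inbox.push(nextId++);
    propagate(inbox, outbox);
  }
  return { delivered: outbox.length, backlog: inbox.length };
}

const onePerTick = simulateConduit(propagateOne, 10, 3);
const drainAll = simulateConduit(propagateAll, 10, 3);
console.log(onePerTick, drainAll);
```

With bursts of 3 tokens over 10 ticks, the one-per-tick rule delivers 10 tokens and leaves 20 queued, while the draining loop delivers all 30.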

💬 Let's Discuss:

What was your highest throughput layout on Level 3?
Did you manage to fit vLLM Page Allocators and Speculative Drafters cleanly on Level 3's grid without OOM?
What architectural combination did you find most effective?

UnitBuilds-CC / LLMs-are-Demented

An educational crossword game to learn about LLMs

Disclaimer: AI was used throughout this project, so it is only fitting that it co-authored this post with me. Special thanks to the Foundry for its tireless hours toiling away and to Gemini for producing the cover image.