Joud Awad

Posted on Jun 25

50/60 Days System Design Questions!

#abotwrotethis #machinelearning #distributedsystems #systemdesign

Your ML team trained a model that hits 94% accuracy in the notebook.

Then you deploy it.

Peak traffic: 3,000 inference requests/second.
P99 latency shoots to 4.2 seconds.
GPU utilization: 23%.

The model works. The serving layer is the bottleneck.

Here's the setup:

• PyTorch model, ~6B params
• Single A100 GPU, 80GB VRAM
• FastAPI wrapper calling model.predict() one request at a time
• No batching, FP32 weights, no quantization

Users are hitting timeouts. GPUs are sitting mostly idle. Your infra bill is climbing.

What do you fix first?

A) Enable dynamic batching — buffer incoming requests, group into batches, process in one forward pass.

B) Quantize the model to INT8 — reduce weight precision from 32-bit to 8-bit, shrink memory footprint and speed up inference.

C) Switch to tensor parallelism — split the model across multiple GPUs, distribute the compute.

D) Add a request queue + async workers — decouple HTTP receiving from model inference, process jobs in background.

All four are real production patterns. Only one directly fixes the combination of low GPU utilization + high latency at 3K RPS.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

Top comments (5)

Joud Awad • Jun 25

Answer: A — Dynamic Batching ✅

Here's the breakdown:

Why A wins (Dynamic Batching):
The smoking gun is the combination: 3K RPS, P99 = 4.2s, GPU utilization = 23%.

GPU utilization at 23% tells you the GPU is starving. It's not overloaded; it's bored. Every request runs alone, paying the full GPU kernel launch overhead for a batch of one, and the GPU idles between calls. GPUs are built for parallelism. They want dozens or hundreds of inputs to process simultaneously, not 1.

Dynamic batching buffers requests for a short window (a few milliseconds up to a few tens of milliseconds), collects 32–128 of them, and fires a single forward pass. At 3K RPS, even a 20ms window gathers about 60 requests. That same GPU that was at 23% utilization can climb much higher, often into the 85–90% range. Latency improves because batch throughput dramatically outpaces sequential per-request calls, even with the small buffer wait.

This is the pattern that NVIDIA Triton Inference Server and TorchServe support once you enable it in the model configuration, and that vLLM (for LLMs) applies out of the box as continuous batching. Batching is the first thing you reach for when GPU util is low and RPS is high.
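To make the buffering idea concrete, here is a minimal sketch of a dynamic batcher in plain asyncio. It is not how Triton, TorchServe, or vLLM implement batching internally. `DynamicBatcher`, `fake_model`, and the `max_batch_size` / `max_wait_s` values are hypothetical names and numbers chosen for illustration, and `fake_model` stands in for a real batched forward pass.

```python
import asyncio


class DynamicBatcher:
    """Collect requests until the batch is full or a short deadline passes,
    then run one batched call. Illustrative sketch only."""

    def __init__(self, batch_fn, max_batch_size=32, max_wait_s=0.005):
        self.batch_fn = batch_fn            # hypothetical: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []               # kept only so the demo can inspect batching

    async def submit(self, item):
        # Each caller gets a future that resolves when its batch has run.
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request arrives, then start the window.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One call for the whole batch. A real server would run this off
            # the event loop (for example in an executor), since it blocks.
            outputs = self.batch_fn([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


def fake_model(batch):
    # Stand-in for a batched forward pass (assumption, not a real model).
    return [x * 2 for x in batch]


async def demo():
    batcher = DynamicBatcher(fake_model, max_batch_size=8, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(20)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results)
print(batch_sizes)
```

`batch_sizes` records how many requests each call handled, so you can see the 20 requests being served by far fewer than 20 model calls. The two knobs trade off against each other: a longer window fills bigger batches but adds wait time to every request.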

Joud Awad • Jun 25

Why B is the trap answer (INT8 Quantization):
Quantization is legitimate and valuable, but it solves a different problem. Going from FP32 to INT8 shrinks the weights about 4x and speeds up individual forward passes. If you were memory-constrained or couldn't fit the model at all, quantization would be critical. But 6B FP32 weights take roughly 24GB, so this model fits comfortably in 80GB VRAM, and the GPU isn't saturated.

INT8 quantization also introduces precision loss you need to validate carefully, and the speedup on a modern A100 for a 6B param model in isolation is modest — maybe 1.5–2x. That doesn't fix 3K RPS with sequential inference. You'll still have sequential execution, just slightly faster per call.

The right order: batch first, quantize second (for memory or cost, not latency emergencies).
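The memory side of that argument is simple arithmetic, sketched below. It counts weights only (activations, caches, and framework overhead are ignored), uses the "~6B params" figure from the setup, and includes FP16 purely for comparison. `weight_footprint_gb` is a hypothetical helper name.

```python
PARAMS = 6_000_000_000  # "~6B params" from the setup
BYTES_PER_WEIGHT = {"fp32": 4, "fp16": 2, "int8": 1}
VRAM_GB = 80            # single A100 from the setup


def weight_footprint_gb(num_params, dtype):
    """Weights-only footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


for dtype in BYTES_PER_WEIGHT:
    gb = weight_footprint_gb(PARAMS, dtype)
    print(f"{dtype}: {gb:.0f} GB of {VRAM_GB} GB VRAM")
```

Even at FP32 the weights use well under half the card, which is why memory is not the constraint in this scenario.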

Joud Awad • Jun 25

Why C is wrong here (Tensor Parallelism):
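Tensor parallelism splits the weight matrices inside each layer across multiple GPUs, which exchange partial results over NVLink or another interconnect. (Assigning whole layers to different GPUs is pipeline parallelism.) It reduces per-request latency for very large models that don't fit on one device, or when you need sub-100ms latency on a single huge model.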

But this is a 6B param model on an 80GB A100 — it fits. Tensor parallelism adds coordination overhead and complexity, and it doesn't help when the bottleneck is sequential request processing. You'd also need 2+ GPUs, which inflates cost before you've even tried the cheap fix.
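For intuition on what tensor parallelism actually splits, here is a toy column-parallel matrix multiply in pure Python. Each "shard" stands in for one GPU holding a slice of a layer's weight columns. The function names are hypothetical, and real frameworks do this with GPU tensors and collective communication rather than Python lists.

```python
def matmul(x, w):
    """Plain matrix multiply: x is n x k, w is k x m, both lists of rows."""
    return [[sum(row[t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))]
            for row in x]


def split_columns(w, num_shards):
    """Give each shard an equal block of w's columns."""
    m = len(w[0])
    assert m % num_shards == 0, "columns must divide evenly in this sketch"
    step = m // num_shards
    return [[row[s * step:(s + 1) * step] for row in w]
            for s in range(num_shards)]


def column_parallel_matmul(x, w, num_shards):
    shards = split_columns(w, num_shards)
    # Each partial product would run on its own GPU in a real setup.
    partials = [matmul(x, shard) for shard in shards]
    # Stand-in for the all-gather step: stitch the column blocks back together.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, num_shards=2)
print(full == sharded)
```

The stitching step is the coordination overhead mentioned above: every layer needs a communication round between devices, which only pays off when one GPU cannot hold or serve the model on its own.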

Joud Awad • Jun 25

Why D is wrong (Async Queue + Workers):
An async queue decouples HTTP from inference and prevents timeout failures — that's a reliability win. But it doesn't improve GPU utilization or actual throughput. You're still processing one request at a time, just asynchronously. Users now get job IDs instead of timeouts, but the wall-clock wait is the same or worse.

Queuing without batching is adding a lobby to a slow restaurant. It hides the congestion — it doesn't fix it.
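A back-of-the-envelope sketch of the lobby effect. The 3K RPS arrival rate comes from the post. The service times (5ms per single request, 16ms per batch of 64) are hypothetical numbers chosen only to illustrate the shape of the problem, not measurements.

```python
ARRIVAL_RPS = 3000  # peak traffic from the post


def backlog_after(seconds, arrival_rps, service_time_ms, batch_size=1):
    """Requests still waiting after `seconds`, assuming steady arrivals and
    one worker that handles `batch_size` requests per `service_time_ms`."""
    capacity_rps = batch_size * 1000 / service_time_ms
    return max(0.0, (arrival_rps - capacity_rps) * seconds)


# Hypothetical: a queue in front of one-at-a-time inference at 5 ms per request.
queue_only = backlog_after(10, ARRIVAL_RPS, service_time_ms=5)

# Hypothetical: the same worker running batches of 64 in 16 ms each.
with_batching = backlog_after(10, ARRIVAL_RPS, service_time_ms=16, batch_size=64)

print(queue_only, with_batching)
```

Under these assumed numbers the queue-only setup falls further behind every second, while the batched worker keeps up. The queue changes where requests wait, not how fast they are served.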
