DEV Community: ambarish pathak

Three Rate Limiting Algorithms, One Race Condition, and a Demo That Was Lying to Me

ambarish pathak — Wed, 17 Jun 2026 23:30:10 +0000

Building a distributed rate limiter taught me that atomicity, distributed state, and demo fidelity matter more than which algorithm you pick. Here's what I found when I actually read my own code closely.

"Design a rate limiter" is one of those system design interview questions that sounds simple until you actually try to build one correctly. I built a distributed rate limiter with three different algorithms, Redis as the shared state store, a circuit breaker for resilience, and a full observability stack with Prometheus and Jaeger. The README calls it production-grade. Going back through the code carefully to write this post, I found three places where that label needs an asterisk, and one bug in the demo dashboard that was quietly faking its own results. All of them are useful lessons, and I think they're more interesting than the algorithms themselves.
The three algorithms, briefly
Token bucket gives each client a bucket that holds up to `rate` tokens and refills continuously based on elapsed time. Requests cost tokens, and you're allowed through as long as you have enough in the bucket. It's the most burst-friendly option, since a quiet client can save up tokens and spend them all at once.
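To make the refill behavior concrete, here is a minimal in-memory sketch. It is not the project's code: the class name, the separate `capacity` parameter, and the injectable `clock` are assumptions added so the example can run without Redis or real waiting.

```python
import time


class TokenBucket:
    """In-memory token bucket: holds up to `capacity` tokens, refills at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock                # injectable so the example can fake time
        self.tokens = float(capacity)     # start with a full bucket
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last_refill
        # Continuous refill: credit tokens for the time that has passed, up to capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


fake_now = [0.0]
bucket = TokenBucket(rate=1.0, capacity=5, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(6)]        # a quiet client spends its saved tokens
fake_now[0] += 2.0                                # two quiet seconds refill two tokens
after_wait = [bucket.allow() for _ in range(3)]
```

The first five requests in the burst go through and the sixth is rejected. After two seconds of silence, exactly two more are allowed.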
Fixed window counts requests inside a clock-aligned time slice, like "this minute," and resets the count when the slice rolls over. It's the cheapest of the three in Redis, a single INCR plus an EXPIRE, but it has the well-known boundary spike problem: a client can send a full window's worth of requests in the last second of one window and another full window's worth in the first second of the next, doubling their effective rate right at the seam.
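The boundary spike is easy to reproduce in a few lines. This is a sketch under assumptions: a dictionary stands in for the per-window Redis counter, and the limit of 100 per 60 seconds is an illustrative number, not the project's configuration.

```python
class FixedWindow:
    """Clock-aligned counter: one integer per window, like an INCR on a per-window key."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}   # window index -> request count

    def allow(self, now):
        index = int(now // self.window)            # which clock-aligned slice we are in
        count = self.counts.get(index, 0) + 1      # the equivalent of INCR
        self.counts[index] = count
        return count <= self.limit


limiter = FixedWindow(limit=100, window_seconds=60)
late = sum(limiter.allow(59.5) for _ in range(100))    # last second of window 0
early = sum(limiter.allow(60.5) for _ in range(100))   # first second of window 1
allowed_in_one_second = late + early
```

Both batches are fully allowed, so the client gets 200 requests through in a one-second span against a nominal limit of 100 per minute.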
Sliding window fixes the boundary problem by tracking the actual timestamp of every request in a Redis sorted set, trimming anything older than the window on every check, and counting what's left. It's the most accurate and the most expensive, both in Redis round trips and in memory, since it has to store one entry per request rather than one integer per window.
Where "atomic" stops being true
The README states that Redis operations are atomic with no race conditions. For fixed window, that's accurate: INCR is a single atomic command, full stop. For token bucket, it isn't quite true, and the gap is instructive.
The token bucket check reads the current token count and the last refill timestamp with two separate, unpipelined GET calls, computes the new token count in Python, and only then writes the result back with a pipelined SET. Atomicity in Redis describes what happens to a single command on the server. It says nothing about what happens between two separate round trips made from application code. If two requests from the same client arrive close enough together, both can read the same pre-update token count, both compute "yes, I have enough," and both get allowed, when only one of them should have been. The data store is atomic. The read-then-write sequence built on top of it is not, and that distinction is exactly the kind of thing that's easy to miss because the word "Redis" is doing a lot of reassuring work in your head that it hasn't actually earned here.
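The interleaving can be replayed deterministically without any threads. In this sketch a plain dictionary stands in for the Redis key, and the two helper functions are hypothetical stand-ins for the separate GET and SET round trips.

```python
store = {"tokens": 1}   # stand-in for the Redis key: exactly one token left


def read_tokens():
    """Models the GET round trip: returns whatever the store holds right now."""
    return store["tokens"]


def decide_and_write(tokens_seen):
    """Models the Python-side check followed by the later SET."""
    if tokens_seen >= 1:
        store["tokens"] = tokens_seen - 1
        return True
    return False


# Both requests read before either one writes.
seen_a = read_tokens()
seen_b = read_tokens()
allowed_a = decide_and_write(seen_a)
allowed_b = decide_and_write(seen_b)
```

Both requests are allowed even though only one token existed, and the store ends at zero rather than going negative, so nothing in the stored state reveals that the limit was exceeded.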
The fix is to push the whole read-compute-write sequence into a single Lua script executed with EVAL, so Redis runs it as one atomic unit server-side rather than as three separate network round trips from the client. That's the standard pattern for correct token bucket implementations in Redis, and its absence here is a fair thing to flag in an interview if someone hands you this code and asks what you'd change.
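A rough sketch of what that could look like with redis-py's `register_script`, which wraps EVAL/EVALSHA. The key names, the argument order, and passing the caller's clock in as `now` are all assumptions of this example, not the project's actual fix.

```python
import time

# Lua source executed atomically inside Redis. Key layout and arguments are illustrative.
TOKEN_BUCKET_LUA = """
local tokens_key, stamp_key = KEYS[1], KEYS[2]
local rate     = tonumber(ARGV[1])  -- tokens added per second
local capacity = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])  -- caller's clock, in seconds
local cost     = tonumber(ARGV[4])

-- A missing key comes back as false, so tonumber() yields nil and the default applies.
local tokens = tonumber(redis.call('GET', tokens_key)) or capacity
local last   = tonumber(redis.call('GET', stamp_key)) or now

tokens = math.min(capacity, tokens + math.max(0, now - last) * rate)

local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end

redis.call('SET', tokens_key, tostring(tokens))
redis.call('SET', stamp_key, tostring(now))
return allowed
"""


def make_token_bucket(client, rate, capacity):
    """Return an allow(client_id) function backed by one atomic script call."""
    script = client.register_script(TOKEN_BUCKET_LUA)

    def allow(client_id, cost=1, now=None):
        now = time.time() if now is None else now
        keys = [f"tb:{client_id}:tokens", f"tb:{client_id}:ts"]   # hypothetical key names
        return script(keys=keys, args=[rate, capacity, now, cost]) == 1

    return allow
```

With a real connection this would be used as `allow = make_token_bucket(redis.Redis(), rate=5, capacity=10)` followed by `allow("client-123")`. The read, the refill math, and the write now happen in one server-side step, so two concurrent requests can no longer both see the same pre-update count.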
A subtler bug: punishing yourself for retrying
The sliding window implementation has a quieter issue. It adds the current request's timestamp into the sorted set in the same pipeline that counts existing entries, before it has decided whether to allow or block the request. That means a blocked request still gets written into the window. If a client gets rate limited and retries aggressively, every retry adds a fresh timestamp to its own sliding window, which keeps the window full of recent entries and can keep the client blocked well past when it would have naturally recovered if it had just waited. The algorithm doesn't distinguish between "this request consumed quota" and "this request was rejected." Whether a rejected request should count against the window at all is a real design decision, and here it was decided implicitly by where a line of code happened to sit in the function, not on purpose.
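The effect of that ordering shows up clearly in a small simulation. This is an in-memory sketch, with a list standing in for the sorted set and a `count_rejected` flag (an assumption of this example) switching between the two policies.

```python
class SlidingWindow:
    """Sliding-window log. A list of timestamps stands in for the Redis sorted set."""

    def __init__(self, limit, window_seconds, count_rejected):
        self.limit = limit
        self.window = window_seconds
        self.count_rejected = count_rejected   # True mimics the behavior described above
        self.stamps = []

    def allow(self, now):
        cutoff = now - self.window
        self.stamps = [t for t in self.stamps if t > cutoff]   # trim expired entries
        if self.count_rejected:
            self.stamps.append(now)            # recorded before the decision is made
            return len(self.stamps) <= self.limit
        if len(self.stamps) < self.limit:
            self.stamps.append(now)            # recorded only when the request is allowed
            return True
        return False


def first_recovery(limiter):
    """Spend the quota at t=0, retry once per second, return when a retry first succeeds."""
    for _ in range(limiter.limit):
        limiter.allow(0.0)
    for t in range(1, 61):
        if limiter.allow(float(t)):
            return t
    return None


patient = first_recovery(SlidingWindow(limit=3, window_seconds=10, count_rejected=False))
punished = first_recovery(SlidingWindow(limit=3, window_seconds=10, count_rejected=True))
```

When rejected requests are not recorded, the client recovers at t=10, as soon as the original three entries age out. When they are recorded, a client retrying once per second never recovers within the simulated minute, because its own retries keep the window full.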
A distributed rate limiter with a circuit breaker that isn't distributed
The architecture has two FastAPI replicas behind an nginx load balancer, which is the whole point of calling this a distributed rate limiter. But the circuit breaker that protects against Redis failures keeps its failure count, success count, and state as plain Python instance attributes inside each process. Each replica has its own circuit breaker with no shared knowledge of the other one.
That means if Redis starts failing, one replica can independently accumulate five failures and open its circuit while the other replica, having happened to receive a slightly different mix of requests, is still on its third failure and sending full traffic at a Redis instance that's already struggling. Depending on which replica nginx routes you to, you could see completely different failure behavior for the exact same backend outage. For a project explicitly named "distributed," having a circuit breaker whose state doesn't travel past the boundary of one process is the kind of gap that's worth being able to name out loud, even if the honest fix, storing circuit state in Redis itself, adds a dependency on the very system the circuit breaker exists to protect against.
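A stripped-down sketch of the divergence, assuming a breaker that opens after five consecutive failures. The class below is illustrative and omits the success count and half-open state the real one tracks.

```python
class CircuitBreaker:
    """Per-process breaker: all state lives in instance attributes, nothing is shared."""

    def __init__(self, failure_threshold=5):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.state = "closed"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.state = "open"


replica_a = CircuitBreaker()
replica_b = CircuitBreaker()

# The load balancer happens to send five failing requests to A and three to B.
for _ in range(5):
    replica_a.record_failure()
for _ in range(3):
    replica_b.record_failure()

states = (replica_a.state, replica_b.state)
```

Replica A has opened its circuit while replica B is still closed and still sending traffic to the same failing Redis.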
What's wired up, and what's just the schema
There's a full SQLAlchemy schema for persisting rate limit rules and historical metrics in PostgreSQL, with tables and columns ready to go. The actual rule read and write path in the API, though, operates on a plain in-memory Python dictionary that gets reset to hardcoded defaults every time the service restarts. The code comment next to those defaults says exactly that: hardcoded for MVP. Nothing in the live request path queries or writes that PostgreSQL schema. The README's own roadmap lists "database-backed hot-reload rules" as a future enhancement, so this isn't a hidden gap, it's an honestly labeled one. But it's a good reminder that a schema existing in the codebase and a feature being live are two different claims, and a README full of checkmarks can make the first one read like the second if you don't go look.
The demo that was generating its own results from a coin flip
This is the one that made me laugh a little when I found it. The React dashboard has an interactive "Demo" page meant to visually show the three algorithms behaving differently as you crank up request volume. Until a recent commit, the page wasn't running any of the three algorithms at all. It was deciding whether to "allow" each simulated request with Math.random() > (blockedSoFar / totalSoFar), a formula that produces a plausible-looking, self-correcting block rate with zero actual rate limiting logic behind it. Switch the dropdown to sliding window and the chart would shift slightly because the displayed latency number changed, not because a different algorithm was running.
The fix replaces that with real client-side implementations of all three algorithms, a refilling token count, a clock-aligned window counter, and an actual sliding array of timestamps, so the visual you watch is the algorithm you selected. The lesson isn't really about the bug itself. It's that a demo can look completely correct, animate smoothly, produce sensible-looking numbers, and still be approximating instead of implementing, and the only way I caught it was reading my own frontend code with the same suspicion I'd apply to someone else's pull request.
Why this is the part worth talking about in an interview
If someone hands you a "design a rate limiter" question, picking an algorithm is the easy fifteen percent. The interesting part, and the part these four issues all point at, is whether the thing you build is actually correct under concurrency, whether the parts you call distributed actually share their state, whether the features in your README are wired into the live path or just scaffolded, and whether the demo you'd show someone is honestly running the system you built or just performing a convincing impression of it. Those are the questions I'd rather be asked, and now, having gone looking, they're the ones I can actually answer about my own code.

Why GPUs Beat CPUs for AI Training (and Why You Can't Just Build a Bigger CPU)

ambarish pathak — Wed, 17 Jun 2026 23:19:40 +0000

A question came up in a recent interview at Intel that I keep turning over: why is a GPU actually better than a CPU for AI training, and why can't you just solve the problem by making the CPU bigger? It's a deceptively simple question. The honest answer isn't "GPUs are faster," it's that CPUs and GPUs are solving different problems by design, and AI training happens to be exactly the kind of problem a CPU's design actively works against.
The shape of the workload matters more than the chip
Training a neural network is, at its core, repeated matrix multiplication. Millions or billions of independent multiply-accumulate operations, applied to tensors, over and over, for every layer, every batch, every step. The defining property of that workload is that almost none of those operations depend on each other. Multiplying element 47 of a weight matrix doesn't need to know what happened to element 48. That's what people mean when they call it "embarrassingly parallel."
A CPU core is built for the opposite kind of problem. Most of a CPU core's silicon goes toward figuring out what to do next when the next instruction is unpredictable: branch prediction, out-of-order execution, speculative execution, deep multi-level caches, all of it exists to keep one core fed and busy when the code in front of it is full of conditionals and data-dependent branches. That's exactly what you want for parsing a web request or querying a database. It's almost entirely wasted on a workload where the instruction is "multiply these two numbers" repeated a billion times with no branching at all.
A GPU throws most of that machinery away. Instead of a handful of cores that are individually very smart, it packs thousands of simple cores that share control logic across groups and execute the same instruction in lockstep across many threads at once (NVIDIA calls this SIMT, single instruction, multiple threads). Each individual GPU core is far less capable than a CPU core. But the workload doesn't need capable, it needs many.
The numbers, as of right now
Intel's newest server chip, Xeon 6+ ("Clearwater Forest," launched mid-2026), tops out at 288 cores per socket, 576 in a dual-socket box. That's an enormous number for a CPU and represents real engineering work to pack that many general-purpose cores into one die.
NVIDIA's current flagship training and inference GPU, the Blackwell Ultra B300, has 160 streaming multiprocessors, each containing 128 CUDA cores, for 20,480 CUDA cores on a single chip. That's about seventy times the core count of the biggest CPU Intel currently ships, on a single accelerator.
Power tells the same story from a different angle. The top Xeon 6+ part draws up to 450W across 288 cores, around 1.5W per core. The B300 draws up to 1,400W across 20,480 cores, around 0.07W per core. Each individual CPU core costs roughly twenty times more power than each GPU core, because the CPU core is spending that power budget on branch prediction and speculative execution that a matrix multiply will never use.
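The arithmetic behind those ratios, using only the figures quoted above (the prose rounds the results):

```python
xeon_cores, xeon_watts = 288, 450
b300_cores, b300_watts = 160 * 128, 1400     # 160 SMs x 128 CUDA cores each

core_ratio = b300_cores / xeon_cores             # how many GPU cores per CPU core
xeon_w_per_core = xeon_watts / xeon_cores
b300_w_per_core = b300_watts / b300_cores
power_ratio = xeon_w_per_core / b300_w_per_core  # CPU core power relative to GPU core

print(f"GPU cores: {b300_cores}, ratio: {core_ratio:.1f}x")
print(f"W per core: CPU {xeon_w_per_core:.2f}, GPU {b300_w_per_core:.3f}, ratio {power_ratio:.1f}x")
```

That works out to roughly 71 times the core count and roughly 23 times the per-core power, consistent with the "about seventy" and "roughly twenty" figures in the text.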
Memory bandwidth is the other half of the answer
Compute alone doesn't tell the full story. Every one of those cores needs data to chew on, and that's where the gap widens even further. Xeon 6+ supports twelve channels of DDR5-8000 memory, which works out to roughly 770 GB/s of aggregate memory bandwidth per socket. The B300 uses HBM3e, high bandwidth memory stacked directly on the GPU package, and delivers 8 TB/s, more than ten times the CPU's bandwidth.
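The 770 GB/s figure can be reconstructed from the channel count and transfer rate. The 8 bytes per transfer is an assumption of this sketch (a 64-bit data path per channel); the other numbers are the ones quoted above.

```python
channels = 12
transfers_per_second = 8000e6    # DDR5-8000 means 8000 mega-transfers per second
bytes_per_transfer = 8           # assumed 64-bit data path per channel

cpu_gb_per_s = channels * transfers_per_second * bytes_per_transfer / 1e9
gpu_gb_per_s = 8000.0            # 8 TB/s, as quoted for the B300
bandwidth_ratio = gpu_gb_per_s / cpu_gb_per_s

print(f"CPU: {cpu_gb_per_s:.0f} GB/s, GPU: {gpu_gb_per_s:.0f} GB/s, ratio: {bandwidth_ratio:.1f}x")
```

This gives 768 GB/s of theoretical peak per socket, which is the "roughly 770" above, and a ratio of a little over ten.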
That gap matters because thousands of parallel cores are useless if they're sitting idle waiting on data. GPUs pair massive parallelism in compute with massive parallelism in memory access specifically so the two scale together. A CPU optimized for low-latency access to a small working set, which is the right design for general-purpose software, is the wrong design when you need to stream enormous tensors through thousands of execution units continuously.
Tensor cores: purpose-built silicon for the exact operation that matters
Beyond the general-purpose CUDA cores, the B300 has dedicated Tensor Cores, four per streaming multiprocessor, fifth generation, purpose-built to execute matrix multiply-accumulate operations directly in hardware rather than through general-purpose instructions. A CPU has to express the same matrix math through instruction set extensions like AVX-512 (vector) or Intel's AMX (tile-based matrix multiply), which help, but are still add-ons to a general-purpose core doing their best impression of a matrix engine. A Tensor Core is not doing an impression. It is silicon built for exactly one job, and it does that job at a fraction of the energy and time cost.
So why not just make the CPU bigger?
This is the part of the question that actually matters. If parallelism is what wins, why not put 20,000 cores on a CPU instead of a GPU?
The honest answer is that you'd be removing the parts that make it a CPU. Every CPU core's value comes from being good at handling unpredictable, branchy, latency-sensitive work, and that capability has a fixed silicon and power cost per core that doesn't shrink no matter how many cores you add. Stack 20,000 of those cores on one die and you hit a power wall and a heat wall long before you hit a useful core count. Clearwater Forest already needs 450W and serious cooling to reach 288 cores. Scaling that same design to GPU-level core counts isn't an engineering inconvenience, it's a different chip with different goals, which is precisely what a GPU already is.
There's also a parallel efficiency mismatch. Even if you somehow fit 20,000 full CPU cores on a die, AI training doesn't need 20,000 cores that are each capable of branch prediction and out-of-order execution. It needs 20,000 cores that can each multiply and add, fast, with shared control logic across groups of them so you're not paying the "smart core" tax 20,000 times over for a job that never uses that intelligence. That's the actual design insight behind a GPU: not "more cores," but "cheaper cores, because the workload doesn't need expensive ones."
A useful way to picture it: a CPU core is a Formula 1 car, exceptional at navigating unpredictable terrain, expensive, and there's a hard limit on how many you can build and fuel. A GPU core is a forklift. Less sophisticated individually, but if the job is moving an identical pallet ten thousand times, a fleet of forklifts beats one very fast car, and you can afford the fleet precisely because each forklift is cheap.
Even Intel agrees, and that's the real answer
Here's what makes this question interesting to ask at Intel specifically: Intel is the company that has spent fifty years perfecting the general-purpose CPU, and even Intel's own roadmap concedes this point. Alongside Xeon 6+, Intel is shipping its own data center GPU line, codenamed Crescent Island, built specifically for AI workloads, with its own memory subsystem optimized for AI inference and training rather than general compute. Intel isn't trying to make Xeon win the AI training race by adding more cores. Intel's own public position is that the CPU's role in AI infrastructure is to be the control plane and handle data preprocessing and orchestration, while purpose-built accelerator silicon, GPUs, in Intel's case increasingly its own, handles the actual tensor math.
That's the answer that actually lands in an interview, I think. It's not "GPUs are better, full stop." It's that modern AI infrastructure is heterogeneous by design: a CPU that's excellent at the unpredictable, control-heavy parts of the job, paired with an accelerator that's excellent at the predictable, massively parallel parts of the job. The interesting systems engineering question isn't which one wins, it's how you split the work between them, and that's a question that shows up constantly in the kind of infrastructure I want to be building.

Building a RAG Pipeline From Scratch: What SmartQueue Taught Me About Retrieval

ambarish pathak — Wed, 17 Jun 2026 22:21:45 +0000

When I set out to add an AI assistant to SmartQueue, a distributed task queue I'd already built in Go for handling IT support tickets, the obvious move was to bolt on an LLM and call it done. Type a question, get an answer. But a generic LLM doesn't know your company's password reset procedure, your P1 outage runbook, or that refunds need manager approval above $500. It needed grounding in actual internal knowledge. That's the job retrieval-augmented generation (RAG) is built for: pull the relevant facts out of your own documents first, then hand them to the model as context instead of trusting it to know your business.

This post walks through how that pipeline actually works, the architectural decision I reversed midway through (and why), the numbers I picked for things like retrieval depth and temperature, and an honest take on whether any of it counts as "real" RAG.

What the assistant actually does

SmartQueue Bot lives inside the Queue Health and AI Bot tabs of the dashboard. An agent picks a ticket, asks a question like "what are the immediate steps for this database outage," and the bot streams back an answer token by token, grounded in a small internal knowledge base of IT runbooks. The request flow looks like this:

agent question
|
v
prompt-injection check (regex guardrails)
|
v
BM25 search over 10 runbooks --> top 4 matches
|
v
system prompt assembled: ticket context + runbook excerpts
|
v
Groq (LLaMA 3.3 70B) streamed via SSE, with last 10 turns of session history
|
v
response streamed to client + written back to Redis session memory

Three things happen before any text reaches the model: the user's message is checked for prompt injection attempts, the message is used as a query against the knowledge base, and the top matches get woven into a system prompt alongside the ticket's category, priority, and description. The model never sees raw documents without that framing. It sees a structured brief.
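A sketch of the first and third steps, the guardrail check and the prompt assembly. The regex patterns, the prompt wording, and the function names are hypothetical; the post doesn't show the real ones. Only the ticket fields (category, priority, description) come from the description above.

```python
import re

# Hypothetical guardrail patterns; the real list is not shown in the post.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]


def looks_like_injection(message):
    """Return True if the message matches any guardrail pattern."""
    return any(pattern.search(message) for pattern in INJECTION_PATTERNS)


def build_system_prompt(ticket, excerpts):
    """Frame ticket context and retrieved runbook excerpts as one structured brief."""
    lines = [
        "You are an IT support assistant. Answer from the runbook excerpts below.",
        f"Ticket category: {ticket['category']}",
        f"Ticket priority: {ticket['priority']}",
        f"Ticket description: {ticket['description']}",
        "Runbook excerpts:",
    ]
    lines += [f"[{i}] {text}" for i, text in enumerate(excerpts, start=1)]
    return "\n".join(lines)


ticket = {"category": "database", "priority": "P1", "description": "Primary DB unreachable"}
prompt = build_system_prompt(ticket, ["Page the on-call engineer.", "Fail over to the replica."])
```

Regex guardrails like these only catch known phrasings, so they work as a cheap first filter rather than a complete defense.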

The decision I reversed: ChromaDB, then BM25

The first version of the knowledge base used ChromaDB with its default ONNX embedding function: proper vector search, no torch dependency, queried through a thread pool so it wouldn't block the event loop. That's the textbook RAG setup, and it worked locally. It fell apart the moment I tried to deploy the whole stack as a single container on Hugging Face Spaces.

The deployment used supervisord to run Redis, the Go API, two Go worker replicas, and the FastAPI AI service all inside one container, and originally a separate ChromaDB process alongside them. Counting ChromaDB, that's six long-running processes competing for a small amount of memory and CPU in a free-tier container, with supervisord responsible for starting them in the right order and keeping them alive. ChromaDB was the one that kept causing startup races and silent failures. After enough commits with messages like "fix: remove ChromaDB from supervisord" and "fix: replace ChromaDB with in-memory BM25 search," I made the call to rip it out entirely.

The replacement is about 50 lines of pure Python, with no embedding model, no external process, and no network call:

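from collections import Counter

# _CORPUS (the tokenized runbooks) and _idf are module-level names from the rest of the file, not shown here.
def _bm25_score(query_tokens, doc_tokens, k1=1.5, b=0.75):
    # Average document length across the corpus, used for length normalization.
    avg_dl = sum(len(d) for d in _CORPUS) / len(_CORPUS)
    tf = Counter(doc_tokens)  # term frequencies within this document
    dl = len(doc_tokens)      # this document's length, the same for every query term
    score = 0.0
    for term in query_tokens:
        if term not in tf:
            continue
        idf = _idf(term, _CORPUS)
        # Okapi BM25 term score: saturating tf, normalized by relative document length.
        score += idf * (tf[term] * (k1 + 1)) / (tf[term] + k1 * (1 - b + b * dl / avg_dl))
    return score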

This is the standard Okapi BM25 formula, computed fresh against the in-memory runbook corpus on every query. No index to build, no daemon to keep alive, no embedding latency on cold start. The trade-off is real: BM25 only matches on term overlap, so a query phrased very differently from the runbook's wording (synonyms, paraphrasing) won't score well. But for a fixed set of 10 short, keyword-dense IT runbooks where users are typically searching with the same vocabulary the runbooks use ("VPN," "password reset," "outage"), that weakness barely shows up in practice. The thing that mattered more than retrieval quality at this scale was that the service now starts reliably every single time.
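For a version that runs end to end, here is a self-contained sketch of the same idea with a top-k search on top. Everything outside the scoring formula is an assumption: the three toy runbooks, the tokenizer, and the IDF variant (the smoothed, non-negative form) are illustrative and may differ from what the service actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical runbooks standing in for the real knowledge base.
RUNBOOKS = {
    "vpn": "VPN connection fails: restart the VPN client and check the certificate",
    "password": "Password reset: verify identity then send the password reset link",
    "outage": "Database outage: page the on-call engineer and fail over to the replica",
}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


CORPUS = {name: tokenize(text) for name, text in RUNBOOKS.items()}
AVG_DL = sum(len(doc) for doc in CORPUS.values()) / len(CORPUS)


def idf(term):
    """Smoothed IDF (assumed variant): rarer terms score higher, never negative."""
    n = sum(1 for doc in CORPUS.values() if term in doc)
    return math.log((len(CORPUS) - n + 0.5) / (n + 0.5) + 1)


def bm25(query_tokens, doc_tokens, k1=1.5, b=0.75):
    tf = Counter(doc_tokens)
    norm = k1 * (1 - b + b * len(doc_tokens) / AVG_DL)   # length normalization term
    return sum(idf(t) * tf[t] * (k1 + 1) / (tf[t] + norm) for t in query_tokens if t in tf)


def search(query, k=4):
    """Score every runbook against the query and return the names of the top k matches."""
    q = tokenize(query)
    scored = [(bm25(q, doc), name) for name, doc in CORPUS.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:k]
```

With this toy corpus, `search("how do I reset my password")` returns only the password runbook, while `search("cannot log in")` returns nothing at all, which is the paraphrase weakness in miniature.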

The numbers, and why those numbers

A few of the constants in this pipeline were deliberate tuning decisions rather than defaults I left untouched. None of this is a RAGAS-style evaluation with precision/recall/faithfulness scores. There's no eval harness here, just systems-level tuning based on the constraints I was working under (a free-tier LLM provider, a single demo container, and a knowledge base that doesn't change).

Constant	Value	Why
Retrieved docs (`k`)	4	Enough runbook context to usually cover the right answer without bloating the prompt against the 800-token response budget
BM25 `k1` / `b`	1.5 / 0.75	Standard Robertson defaults, since with only 10 documents there isn't enough signal to meaningfully tune these per-corpus
Bot temperature	0.2	Troubleshooting answers should be literal and repeatable, not creative
Classifier temperature	0.1	Output is parsed as JSON; near-deterministic reduces malformed responses
Recommender temperature	0.3	Slightly more room since it's reasoning over queue state, not just extracting fields
Bot `max_tokens`	800	Long enough for multi-step troubleshooting guidance, short enough to keep streaming snappy
Classifier `max_tokens`	250	The schema is small, just eight short fields and no prose
Session history window	last 10 turns, capped at 20 stored, 1-hour TTL in Redis	Enough continuity for a real troubleshooting conversation without memory growing unbounded
Rate limit	30 requests/minute per session	Protects the free Groq quota from being burned by a single runaway client
LLM client retries	0, with a 10s timeout	Every caller already has its own fallback (keyword classifier, rule-based recommender, canned bot response), so retrying into the same failure just adds latency before falling back anyway

That last one is worth dwelling on. Every AI-backed endpoint in this system has a non-LLM fallback path. If Groq is rate-limited or down, the classifier falls back to keyword matching, the recommender falls back to threshold-based rules on queue depth, and the bot falls back to a templated response built from the same retrieved runbook excerpts. The system was designed to degrade, not fail, which matters a lot more when you're running on a free API tier than it would on a paid, SLA-backed one.
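The degrade-don't-fail shape looks roughly like this for the classifier. The keyword table and function names are hypothetical, and `llm_call` is assumed to be a callable that enforces its own timeout and raises on failure.

```python
# Hypothetical keyword table; the project's real fallback rules are not shown in the post.
KEYWORD_CATEGORIES = {"vpn": "network", "password": "account", "outage": "incident"}


def keyword_classify(text):
    """Non-LLM fallback: first matching keyword wins."""
    lowered = text.lower()
    for keyword, category in KEYWORD_CATEGORIES.items():
        if keyword in lowered:
            return {"category": category, "source": "keyword-fallback"}
    return {"category": "general", "source": "keyword-fallback"}


def classify(text, llm_call):
    """Try the LLM once; on any failure, fall back immediately instead of retrying."""
    try:
        result = dict(llm_call(text))
    except Exception:
        return keyword_classify(text)
    result["source"] = "llm"
    return result


def flaky_llm(text):
    raise TimeoutError("provider did not answer within the timeout")


degraded = classify("VPN keeps dropping", flaky_llm)
healthy = classify("VPN keeps dropping", lambda text: {"category": "network"})
```

Tagging each result with its source makes it possible to see, after the fact, how often the system was actually running on the fallback path.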

Is this actually "RAG," and is it better?

Strictly, yes: it retrieves before it generates, and the generation is conditioned on what's retrieved. But it's a narrow slice of what RAG can mean. There's no chunking (each runbook is indexed as one flat document), no re-ranking step, no hybrid retrieval, and no evaluation loop telling me whether the right runbook actually got surfaced for a given question. It's RAG sized correctly for the problem: a small, static, keyword-friendly knowledge base where the cost of building anything more elaborate would have outweighed the benefit.

Whether picking BM25 over ChromaDB was "better" depends on what you're optimizing for. For retrieval quality on a larger, more varied corpus, an embedding-based approach would win, since BM25 degrades once questions stop reusing the document's own vocabulary. But for this deployment, with this knowledge base size and this hosting constraint, dropping the vector store was unambiguously the right call: it eliminated an entire class of deployment failures and removed a dependency for a problem that ten short documents don't actually need solved with embeddings.

If I were extending this rather than rebuilding it, the next real upgrades would be a basic retrieval eval (even just "did the correct runbook end up in the top 4 for a labeled set of test questions"), splitting the longer runbooks into smaller chunks so the model gets more relevant text per retrieved slot, and a hybrid approach once the knowledge base grows past roughly fifty documents. Somewhere around that scale, pure keyword overlap stops being enough to catch the paraphrased queries vector search handles for free.
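The basic retrieval eval mentioned above can be very small. In this sketch the labeled questions and the toy retriever are made up for illustration; in practice `retrieve` would be the real BM25 search.

```python
def hit_rate_at_k(labeled_questions, retrieve, k=4):
    """Fraction of questions whose expected runbook appears in the top-k results."""
    hits = sum(
        1 for question, expected in labeled_questions
        if expected in retrieve(question)[:k]
    )
    return hits / len(labeled_questions)


# Hypothetical labeled set: (question, name of the runbook that should be retrieved).
LABELED = [
    ("vpn will not connect", "vpn"),
    ("reset my password", "password"),
    ("cannot sign in to my account", "password"),   # paraphrase with no shared keywords
]


def toy_retrieve(question):
    """Stand-in retriever: returns runbooks whose name appears in the question."""
    return [name for name in ("vpn", "password", "outage") if name in question]


score = hit_rate_at_k(LABELED, toy_retrieve)
```

Here the score is two out of three, and the miss is the paraphrased question, which is exactly the kind of failure this metric is meant to expose.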

That's also roughly the direction I went on a separate project, AskMyDoc, where I paired BM25 with ChromaDB in a hybrid retriever, added HyDE-style query rewriting to bridge the vocabulary gap, and built a RAGAS-based evaluation harness to actually measure retrieval quality instead of eyeballing it. SmartQueue's BM25-only pipeline was the right tool for a ten-document, single-container helpdesk demo. It's not the pipeline I'd reach for if the knowledge base were a thousand documents instead of ten, but knowing the difference, and being able to justify it with a real deployment failure rather than a hunch, is the actual lesson this project taught me.