Why Groq Feels Like Cheating

#agents #ai #llm #rag

I've been building a multi-agent LangGraph pipeline recently, and like most people stitching together free-tier LLM providers, I ended up comparing Groq against the usual suspects. The difference wasn't subtle. Other providers felt like a normal API call — you send a request, you wait, text streams back at a pace you've gotten used to. Groq felt like cheating. A 70B-parameter model returning a full response faster than I could finish reading the prompt I'd just typed.

My first assumption was the boring one: Groq must just have better GPUs, or more of them, or some clever load balancing trick. That assumption is wrong, and the actual answer is a lot more interesting than "more compute."

Groq doesn't run GPUs at all. They built their own chip from scratch, called the LPU — Language Processing Unit — and the entire thing exists because of a bet that the way GPUs are designed is fundamentally mismatched to what LLM inference actually needs.

The wrong intuition: "just better hardware"

It's worth sitting with why "better GPUs" feels like the obvious answer. Every AI performance story for the last five years has been a GPU story — bigger clusters, more H100s, better interconnects. Nvidia's dominance is so total that "AI chip" and "GPU" have basically become synonymous in casual conversation.

But GPUs were never actually designed for this job. They were designed for graphics rendering, then repurposed for deep learning because their parallel architecture happened to suit matrix multiplication well — which is most of what neural networks are made of. That repurposing worked brilliantly for training, where you're crunching enormous batches of data and raw parallel throughput is what matters.

Inference, especially single-request, real-time token generation, is a different problem. And it turns out GPUs carry a lot of training-oriented baggage that actively works against them here.

The actual bottleneck: the memory wall

The real constraint in LLM inference isn't how many calculations a chip can do — it's how fast it can feed those calculations with data. This is sometimes called the "memory wall," and it's the whole reason Groq's architecture exists.

On a standard GPU, model weights live in HBM (High Bandwidth Memory), which sits physically separate from the compute cores, on its own memory stack. HBM is fast compared to ordinary system memory, but it's still a clear hop away from where the actual math happens. Every time the chip needs a weight, it has to reach across that gap. During training, with massive batch sizes, that overhead gets amortized away, because each weight fetched is reused across every example in the batch. During single-request inference, where you're generating one token at a time, essentially the whole set of weights has to be read again for every token, so the compute units spend a huge fraction of their time just waiting on data to arrive.

Groq's answer was almost stubbornly simple: stop putting memory off-chip. The LPU uses SRAM — the same type of memory normally reserved for tiny CPU caches — as its primary weight storage, not a cache layer sitting in front of slower memory. Hundreds of megabytes of it, sitting directly on the compute die.

The bandwidth difference this produces is enormous. Groq's on-chip SRAM delivers memory bandwidth upwards of 80 terabytes per second, compared to roughly 8 terabytes per second for GPU off-chip HBM — about a 10x gap before you even account for anything else. Independent teardowns put the access-speed advantage of SRAM over HBM at roughly 20x once you factor in latency, not just raw bandwidth.
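To make the memory wall concrete, here's a back-of-envelope sketch. It assumes decode is purely bandwidth-bound, that every weight is read exactly once per generated token, and that the weights are stored at 2 bytes per parameter (FP16). Those are my simplifications, not Groq's numbers, and the function name is made up for illustration. It also treats each bandwidth figure as if it applied to the whole model at once, which a real multi-chip deployment wouldn't do.

```python
def decode_ceiling_tokens_per_s(params_billion: float,
                                bytes_per_param: float,
                                bandwidth_tb_s: float) -> float:
    """Rough upper bound on single-stream decode speed.

    Assumes every weight is read once per token and nothing else
    (compute, KV cache, interconnect) is the bottleneck.
    """
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_s * 1e12
    return bandwidth_bytes_per_s / model_bytes


# 70B parameters at FP16 (assumed), using the two bandwidth figures above.
hbm_ceiling = decode_ceiling_tokens_per_s(70, 2, 8)    # off-chip HBM
sram_ceiling = decode_ceiling_tokens_per_s(70, 2, 80)  # on-chip SRAM

print(f"HBM ceiling:  {hbm_ceiling:.0f} tokens/s")
print(f"SRAM ceiling: {sram_ceiling:.0f} tokens/s")
```

Under those assumptions the ceiling works out to about 57 tokens per second at 8 TB/s and about 571 at 80 TB/s. Real systems land below their ceiling, but the point stands: for one-token-at-a-time decoding, bandwidth sets the limit long before raw compute does.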

The second piece: throwing out unpredictability

Memory bandwidth alone doesn't fully explain Groq's numbers. The other half of the story is determinism, and it's the part that took me longer to actually appreciate.

GPUs are general-purpose chips, which means they rely on dynamic, runtime scheduling — hardware queues, cache hit/miss decisions, arbitration between competing tasks — all figured out on the fly, while the chip is running. That flexibility is a feature for the huge range of workloads a GPU has to support. But it has a cost: the timing of any given operation isn't fully knowable in advance, and when hundreds of cores need to synchronize, any delay in one place propagates through the whole system.

Groq's compiler takes the opposite approach. Because the LPU's architecture has no caches, no dynamic memory allocation, and no runtime scheduling decisions to make, the entire execution graph — every operation, every instruction, every chip-to-chip handoff — can be calculated ahead of time, down to individual clock cycles. The chip isn't deciding what to do next while it runs; it's executing a schedule that was already fully solved before a single token was generated.
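Here's a toy sketch of what "solving the schedule ahead of time" means. This is not Groq's compiler, just the general idea: if every operation has a fixed, known latency and there are no caches or arbitration to add jitter, the start cycle of every op can be computed before anything runs. The op names and cycle counts are hypothetical, and the sketch assumes there is always a free functional unit, so it ignores resource conflicts.

```python
from graphlib import TopologicalSorter

# Hypothetical ops with fixed latencies in clock cycles (made-up numbers).
LATENCY = {
    "load_x": 2,
    "load_w": 3,
    "matmul_qkv": 8,
    "attention": 6,
    "matmul_ffn": 8,
    "store_y": 2,
}

# Each op maps to the set of ops whose results it needs first.
DEPS = {
    "load_x": set(),
    "load_w": set(),
    "matmul_qkv": {"load_x", "load_w"},
    "attention": {"matmul_qkv"},
    "matmul_ffn": {"attention"},
    "store_y": {"matmul_ffn"},
}


def static_schedule(deps, latency):
    """Assign every op a fixed start cycle before execution begins."""
    start = {}
    # static_order() yields each op only after all of its dependencies.
    for op in TopologicalSorter(deps).static_order():
        # An op starts the cycle its slowest dependency finishes.
        start[op] = max((start[d] + latency[d] for d in deps[op]), default=0)
    return start


schedule = static_schedule(DEPS, LATENCY)
total_cycles = max(schedule[op] + LATENCY[op] for op in schedule)

for op, cycle in sorted(schedule.items(), key=lambda item: item[1]):
    print(f"cycle {cycle:>2}: {op}")
print(f"total: {total_cycles} cycles")
```

The total cycle count is known before a single instruction executes. On hardware with caches and runtime arbitration, the latencies in that table wouldn't be constants, and this calculation wouldn't be possible.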

This sounds like it should just be "less flexible," and it is. But for a workload as repetitive and predictable as transformer inference, giving up that flexibility removes almost all of the overhead that makes GPU inference timing fuzzy. Groq's own engineering material describes this as a software-first design philosophy: rather than adapting a chip built for something else, they built hardware specifically to make the compiler's job of scheduling deterministic and complete.

What this actually buys you, in real numbers

The architecture story is nice, but the numbers are what make it land. Independent benchmarking has shown Llama 2 70B running at around 300 tokens per second on Groq, versus 30–40 tokens per second on an Nvidia H100, a gap of somewhere between 7.5x and 10x. For a smaller model like Llama 3 8B, Groq has demonstrated 1,300+ tokens per second against roughly 100 tokens per second on an H100, or about 13x.
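Tokens per second is an abstract unit, so here's a quick sketch that turns it into wall-clock time for one response. The 500-token response length is a number I picked for illustration, and 35 tokens per second is just the midpoint of the H100 range quoted above. It ignores time to first token and network latency.

```python
def generation_seconds(num_tokens: int, tokens_per_s: float) -> float:
    """Time to stream num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_s


RESPONSE_TOKENS = 500  # hypothetical response length

groq_seconds = generation_seconds(RESPONSE_TOKENS, 300)
h100_seconds = generation_seconds(RESPONSE_TOKENS, 35)

print(f"Groq: {groq_seconds:.1f} s")
print(f"H100: {h100_seconds:.1f} s")
```

That's roughly 1.7 seconds versus 14.3 seconds for the same answer. In a multi-agent pipeline where one step's output feeds the next, that gap compounds at every hop.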

There's also a counterintuitive energy story here. You'd expect a chip doing the same work faster to use more power in the moment, and it does — but it spends so much less time doing the work that the total energy cost per token actually comes out lower. Groq reports something in the range of 1–3 joules per token, compared to 10–30 joules per token on H100-based systems. Less data movement (the expensive part, energetically) plus less total time in a high-power active state nets out to a real efficiency win, not just a speed one.
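The arithmetic behind that is simple: energy per token is power divided by throughput. The sketch below uses two made-up systems, not measurements of any real chip, purely to show how a system that draws more power can still come out ahead per token.

```python
def joules_per_token(power_watts: float, tokens_per_s: float) -> float:
    """Energy per token: watts are joules per second, so divide by tokens per second."""
    return power_watts / tokens_per_s


# Hypothetical systems: the faster one draws twice the power
# but produces ten times as many tokens per second.
baseline = joules_per_token(power_watts=500, tokens_per_s=25)
faster = joules_per_token(power_watts=1000, tokens_per_s=250)

print(f"baseline: {baseline:.0f} J/token")
print(f"faster:   {faster:.0f} J/token")
```

With these illustrative inputs the baseline comes out at 20 joules per token and the faster system at 4. Doubling power while multiplying throughput by ten cuts energy per token by a factor of five.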

The catch — because there's always a catch

None of this is free, and the tradeoffs are exactly where the SRAM decision bites back.

SRAM is fast, but it's also physically large and expensive per bit compared to HBM or DRAM, which is precisely why almost every other chip in the industry uses it only for small cache layers instead of primary storage. Groq's bet means each individual LPU can only hold a relatively small amount of memory on-chip. A 70B-parameter model doesn't fit on one chip; it doesn't really fit on a handful of chips either. Serving something that size at scale reportedly requires hundreds of LPUs working in close coordination. They are wired together with a custom high-speed interconnect running what Groq calls a plesiosynchronous protocol, designed specifically so chip-to-chip communication stays just as deterministic as everything happening inside a single chip.
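A rough sanity check on "hundreds of LPUs": divide the size of the weights by the SRAM on one chip. I'm assuming 230 MB of SRAM per chip here as a stand-in for "hundreds of megabytes," and I'm counting weights only, with no room for the KV cache, activations, or any replication. Treat the result as a floor, not a deployment plan.

```python
import math


def chips_needed(params_billion: float,
                 bytes_per_param: float,
                 sram_mb_per_chip: float) -> int:
    """Minimum chips required just to hold the weights in on-chip SRAM."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    sram_bytes = sram_mb_per_chip * 1e6
    return math.ceil(model_bytes / sram_bytes)


SRAM_MB = 230  # assumed per-chip SRAM capacity

fp16_chips = chips_needed(70, 2, SRAM_MB)  # 2 bytes per parameter
int8_chips = chips_needed(70, 1, SRAM_MB)  # 1 byte per parameter

print(f"FP16: {fp16_chips} chips")
print(f"INT8: {int8_chips} chips")
```

That gives 609 chips at FP16 and 305 at 8-bit, which lines up with the "hundreds" figure and makes it obvious why the interconnect matters as much as the chip itself.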

That's a lot of physical silicon for one model, and it shows up directly in the economics: estimates put Groq's hardware cost meaningfully higher than a GPU setup delivering equivalent throughput. Groq's bet is that for latency-critical workloads, this is the right trade to make. When an application genuinely needs sub-300ms response times, the comparison isn't "Groq versus a cheaper GPU option," it's "Groq versus the feature not working at all."

It's also why Groq isn't trying to train the next frontier model. The LPU's entire design is optimized around the fixed, predictable computation pattern of running an already-trained model. Training is a much messier, less predictable workload, exactly the kind of job determinism doesn't help with. Groq has made peace with this division of labor: run other labs' open-weight models (Llama, Mixtral, DeepSeek, Qwen) as quickly and cheaply as possible, rather than compete on building the model itself.

Where this is heading

The most interesting recent development here is that this is no longer framed as "Groq vs. Nvidia." Nvidia's upcoming Vera Rubin platform actually pairs Groq's LPUs with Rubin GPUs in a single heterogeneous system: GPUs handle the prefill stage and attention computation during decode, while LPUs take over the latency-sensitive feed-forward and mixture-of-experts execution. Rather than one architecture replacing the other, the industry seems to be converging on using each chip for what it's actually good at: GPUs for the parallel, throughput-heavy parts of inference, LPUs for the sequential, latency-sensitive parts.

That's a pretty satisfying resolution to the "wait, how is this so fast" question I started with. It's not that Groq found a trick GPU makers missed — it's that they asked a genuinely different question. Nvidia optimized for "how do we do as much computation as possible." Groq optimized for "how do we make sure compute is never waiting on anything." For training, the first question matters more. For the kind of real-time, conversational, agentic systems a lot of us are building right now, the second one might matter more than we've been giving it credit for.

I write about what I'm learning while building AI systems — currently a multi-agent financial due-diligence pipeline on LangGraph, and a RAG-based healthcare triage tool. More on both soon.