BeeLlama v0.2.0: 164 tok/s on a 27B model, one RTX 3090

#ai #llm #inference #opensource

Speculative decoding has been the rumored 3-5x throughput multiplier for about 18 months. The numbers have stayed muddled because most of the public benchmarks ride on H100s with batch sizes greater than one, where the speedup gets folded into pricing tables nobody outside a serving team reads. What teams running a single workstation actually measure has been harder to find.

The BeeLlama v0.2.0 release pins down a specific point on that map. The setup is small enough to reproduce in a weekend: one RTX 3090, 32 GB of DDR4, a Ryzen 7 5700X3D, and llama.cpp build b9275 as the baseline. The two target models are Qwen 3.6 27B at Q5_K_S and Gemma 4 31B at the same quantization. The drafter for each is a Q4_K_M DFlash variant. The benchmark prompts and configs are pinned in the README and the GGUFs are on Hugging Face under Apache 2.0.

The Qwen row is the easier of the two to read. Baseline llama.cpp turns out 37.2 tokens per second on a ~1K-token completion task. BeeLlama's DFlash path runs the same prompt at a 163.9 tok/s median, with a best run of 181.9. That is a 4.40x median multiplier on a card that costs around $700 used. The Gemma 4 31B row reports an even larger ratio: 36.1 tok/s baseline against 177.8 tok/s median, a 4.93x multiplier on a model with about 15% more parameters than the Qwen. The pattern (bigger model, slightly more speedup) is consistent with what speculative decoding theory predicts: per-token cost is dominated by the target model's verification step, and the drafter is much cheaper to run in either case.
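The arithmetic behind those multipliers is easy to redo by hand. The sketch below recomputes them from the rounded throughputs quoted above and converts them into wall-clock time for a 1,000-token completion. The function names are illustrative, and because the inputs are already rounded, the last digit can differ slightly from the published multipliers.

```python
# Median decode throughput (tokens/second) as quoted in the text above.
ROWS = {
    "Qwen 3.6 27B": {"baseline": 37.2, "dflash": 163.9},
    "Gemma 4 31B": {"baseline": 36.1, "dflash": 177.8},
}


def speedup(baseline_tps: float, accelerated_tps: float) -> float:
    """Throughput multiplier: accelerated tokens/s over baseline tokens/s."""
    return accelerated_tps / baseline_tps


def decode_seconds(tokens: int, tps: float) -> float:
    """Wall-clock seconds to generate `tokens` at a steady rate of `tps`."""
    return tokens / tps


if __name__ == "__main__":
    for name, row in ROWS.items():
        ratio = speedup(row["baseline"], row["dflash"])
        before = decode_seconds(1000, row["baseline"])
        after = decode_seconds(1000, row["dflash"])
        print(f"{name}: {ratio:.2f}x, 1000 tokens in {before:.1f}s -> {after:.1f}s")
```

In wall-clock terms, a reply that takes close to half a minute at baseline arrives in roughly six seconds.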

What the speedup actually costs is hidden in the acceptance numbers, and this is where the BeeLlama table earns its space. The Qwen row reports "67.7% / 89.2%" for the DFlash run. Read those as the two diagnostic rates that matter for speculative decoding economics: the fraction of drafter-proposed tokens that the target validates, and the fraction of drafted sequences that the target accepts without falling back. When the first number drops below about 50%, the drafter's compute starts costing more than it saves. When the second number drops below about 60%, the per-sequence overhead of the verifier path begins to dominate. Both Qwen and Gemma sit comfortably above those thresholds in BeeLlama's report, which is why the median speedups are close to the best-case numbers rather than spread across an order of magnitude.
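To see why acceptance rate drives the economics, it helps to plug numbers into the simplified model from the original speculative decoding analysis: the drafter proposes `gamma` tokens one at a time, each is accepted independently with probability `alpha`, and one drafter pass costs `cost_ratio` of a target pass. This is a sketch under those assumptions, not a description of how BeeLlama's DFlash path drafts or verifies. The values of `GAMMA` and `COST_RATIO` are hypothetical and do not come from the README.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target verification pass.

    alpha: probability that a drafted token is accepted (assumed i.i.d.).
    gamma: number of tokens drafted before each verification pass.
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Modeled wall-clock speedup over plain autoregressive decoding.

    cost_ratio: time of one drafter pass divided by time of one target pass.
    """
    step_cost = gamma * cost_ratio + 1.0  # gamma drafter passes + 1 target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Hypothetical settings, chosen only to illustrate the shape of the curve.
GAMMA = 5
COST_RATIO = 0.2

if __name__ == "__main__":
    for alpha in (0.3, 0.5, 0.7, 0.9):
        s = modeled_speedup(alpha, GAMMA, COST_RATIO)
        print(f"alpha={alpha:.1f}: modeled speedup {s:.2f}x")
```

With these particular hypothetical settings the model breaks even close to an acceptance rate of 50%. A cheaper drafter moves the break-even point lower, so treat any fixed threshold as a rule of thumb that depends on drafter cost and draft length.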

Prefill stays near the llama.cpp baseline in every row of the table. That is the expected shape: prefill already processes the prompt's tokens in parallel, so the speculative path has nothing to add. The 4-5x speedup is a decode-phase number. Practitioners who serve a workload of short prompts and long generations (agentic loops, chat completions, code suggestion streams) will see something near the headline. Workloads dominated by long prompts and short answers, like RAG with a 32K-token context and a one-sentence reply, will see almost no benefit because most of the wall-clock time is prefill.
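The prefill and decode split is an Amdahl's-law calculation, and a short sketch makes the two workload shapes concrete. The decode throughput and multiplier below are the Qwen figures quoted earlier. `PREFILL_TPS` is a hypothetical placeholder, since the article does not quote a prefill rate, and the token counts are illustrative.

```python
def end_to_end_speedup(
    prompt_tokens: int,
    output_tokens: int,
    prefill_tps: float,
    decode_tps: float,
    decode_multiplier: float,
) -> float:
    """Whole-request speedup when only the decode phase is accelerated."""
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    baseline_s = prefill_s + decode_s
    accelerated_s = prefill_s + decode_s / decode_multiplier
    return baseline_s / accelerated_s


PREFILL_TPS = 1000.0      # hypothetical prefill rate, not from the README
DECODE_TPS = 37.2         # Qwen baseline decode rate quoted above
DECODE_MULTIPLIER = 4.4   # Qwen median multiplier quoted above

if __name__ == "__main__":
    chat = end_to_end_speedup(200, 1000, PREFILL_TPS, DECODE_TPS, DECODE_MULTIPLIER)
    rag = end_to_end_speedup(32_000, 30, PREFILL_TPS, DECODE_TPS, DECODE_MULTIPLIER)
    print(f"short prompt, long answer: {chat:.2f}x end to end")
    print(f"32K prompt, short answer:  {rag:.2f}x end to end")
```

Under these assumptions the chat-shaped request keeps most of the decode multiplier, while the RAG-shaped request gains only a few percent. A faster or slower prefill rate shifts the exact figures but not the ordering.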

A few caveats sit underneath the table and should travel with it. The reasoning-on configuration is excluded from the chat benchmark in the README, and the changelog notes a stricter fallback to full logits when "grammar, sampler state, or reasoning requires it." Reasoning models stream tokens with more entropy at each step, which tends to reduce drafter acceptance rates and push the speedup back toward 2-3x. The 3090's 24 GB of VRAM is also doing real work in these numbers: it holds the Q5_K_S target, the Q4_K_M drafter, and the KV cache for both at the same time. A 12 GB card running the same models with the same quantizations would either spill to system memory or refuse to load, and the latency in either case would erase the win.
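A back-of-the-envelope memory budget shows why the 24 GB figure matters. The sketch below assumes roughly 5.5 bits per weight for Q5_K_S, which is an approximation. The drafter and KV cache sizes are hypothetical placeholders, because neither is stated in the article. It also ignores compute buffers and the difference between GB and GiB.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(vram_gb: float, *components_gb: float) -> bool:
    """True if the listed components fit in the VRAM budget together."""
    return sum(components_gb) <= vram_gb


Q5_K_S_BPW = 5.5   # assumed approximate bits per weight for Q5_K_S
DRAFTER_GB = 1.5   # hypothetical drafter footprint, not from the README
KV_CACHE_GB = 2.0  # hypothetical KV cache footprint, not from the README

if __name__ == "__main__":
    target_gb = weights_gb(27, Q5_K_S_BPW)
    print(f"27B target weights at Q5_K_S: about {target_gb:.1f} GB")
    for vram in (24, 12):
        ok = fits(vram, target_gb, DRAFTER_GB, KV_CACHE_GB)
        print(f"{vram} GB card: {'fits' if ok else 'does not fit'}")
```

Under that bits-per-weight assumption the target weights alone exceed 12 GB, so the conclusion for the smaller card does not depend on the placeholder drafter and cache sizes.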

The lesson is small and useful. Speculative decoding is not a free 5x: it is a 5x conditional on the drafter being trained well enough that its top-1 predictions match the target's most of the time, and conditional on the workload being decode-heavy. BeeLlama v0.2.0 ships both halves: the DFlash drafters trained against current open weights, and the verifier path tightened enough that the published acceptance rates hold. For a learner who has read the original speculative decoding paper but never seen the technique applied to a model they could run themselves, the README plus the GGUFs are a complete worked example. Clone the repo, pull either GGUF pair, and the throughput numbers should reproduce on comparable hardware.

Repo and quickstart guides: https://github.com/Anbeeld/beellama.cpp
