======================================================================
EVAL -- The AI Tooling Intelligence Report
Issue #001 | March 2026
The Great LLM Inference Engine Showdown:
vLLM vs TGI vs TensorRT-LLM vs SGLang vs llama.cpp vs Ollama
Hey there.
Welcome to the first issue of EVAL. No fluff, no hype cycles, no
"10 AI tools that will CHANGE YOUR LIFE" listicles. Just one senior
engineer talking to another about the tools that actually matter.
And for issue one, we're going straight for the jugular: inference
engines.
Here's the uncomfortable truth nobody on Twitter will tell you:
picking your LLM inference engine is one of the highest-leverage
decisions you'll make in your AI stack, and most teams get it wrong.
They either over-engineer it (congrats on your TensorRT-LLM setup
that took three sprints to deploy and now needs a dedicated DevOps
engineer to babysit) or under-engineer it (no, Ollama is not your
production serving layer, please stop). I've watched teams burn
entire quarters migrating between engines because they didn't do
the homework upfront. Don't be that team.
So let's break down the six engines that matter in March 2026, with
actual opinions instead of marketing copy.
======================================================================
THE QUICK RUNDOWN
Here's your cheat sheet. Pin this somewhere.
| Engine | Stars | Throughput* | Ease | Hardware | Vibe |
|---|---|---|---|---|---|
| vLLM v0.7.3 | ~50k | 1000-2000 | Med | GPU-first | Reliable workhorse |
| TGI v3.0 | ~10k | 800-1500 | Med | GPU-first | Corporate solid |
| TensorRT-LLM | ~10k | 2500-4000+ | Hard | NVIDIA only | Speed demon |
| SGLang v0.4 | ~10k | Very High | Med | GPU-first | Dark horse |
| llama.cpp | ~75k | 80-100** | Easy | Everywhere | Swiss army knife |
| Ollama | ~120k | Low | Trivial | Via llama.cpp | Gateway drug |
* tok/s on A100/H100 for Llama-70B-class models, except where noted.
** 7B model on M2 Ultra (CPU/Metal); not comparable to the GPU numbers.
All Apache 2.0 licensed except llama.cpp and Ollama (MIT). Yes, this
matters when legal comes knocking.
======================================================================
THE DEEP DIVE: ENGINE BY ENGINE
--- vLLM v0.7.3 -------------------------------------------------
"The one you'll probably end up using"
Stars: ~50k | License: Apache 2.0
vLLM is the Honda Civic of inference engines. Is it the fastest?
No. Is it the most exciting? No. Will it reliably get you from A
to B without drama? Absolutely.
PagedAttention was genuinely revolutionary when it dropped -- treating
KV cache like virtual memory pages was one of those "why didn't we
think of this earlier" ideas. Continuous batching means you're not
leaving GPU cycles on the table. The OpenAI-compatible API means
your application code is basically engine-agnostic. That's huge.
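The PagedAttention idea is simple enough to sketch in a few lines. Here's a toy allocator that treats the KV cache as fixed-size pages, purely illustrative (the block size happens to match vLLM's default of 16 tokens, but none of this is vLLM internals):

```python
# Toy sketch of PagedAttention-style bookkeeping: treat the KV cache as
# fixed-size blocks ("pages") handed out on demand, so per-sequence waste
# is at most one partial block. Illustrative only -- not vLLM code.

class PagedKVCache:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # physical block ids
        self.tables = {}                       # seq_id -> [block ids]
        self.lengths = {}                      # seq_id -> tokens stored

    def append_token(self, seq_id):
        """Reserve KV space for one generated token."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:           # crossed a page boundary
            if not self.free:
                raise MemoryError("cache full: preempt or swap a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: its pages go straight back to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(20):                # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]))  # -> 2
cache.release("req-1")
print(len(cache.free))             # -> 8, every page reclaimed
```

Continuous batching is then just interleaving append_token calls from many live requests against this one shared pool, admitting new sequences the moment pages free up.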
The V1 engine became default in v0.7.0, and it shows. Things just
work. Anyscale, IBM, Databricks, Cloudflare -- these aren't exactly
hobby projects. When companies with serious SLAs pick your engine,
that says something.
The honest downsides: GPU memory overhead is real. vLLM is hungry,
and if you're trying to squeeze a 70B model onto the minimum viable
GPU count, you'll feel it. AMD ROCm support exists but it's... let's
call it "maturing." If you're on MI300X, budget extra time for
debugging.
Best for: General-purpose production serving, teams that want a
large community and proven reliability, OpenAI API drop-in
replacement scenarios.
Verdict: Your default choice unless you have a specific reason to
pick something else. The boring-but-correct answer.
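What "OpenAI-compatible" buys you in practice: your client code only knows the wire format, not the engine. A minimal sketch of the request shape (model name and port are illustrative assumptions, not from this article):

```python
# Chat Completions payload for vLLM's OpenAI-compatible server. Model name
# and endpoint are illustrative; vLLM serves on port 8000 by default.
# You can also point the official openai SDK at base_url=".../v1".

VLLM_ENDPOINT = "http://localhost:8000/v1/chat/completions"

def chat_request(model, user_msg, max_tokens=256, temperature=0.7):
    """Build a Chat Completions payload; POST it to VLLM_ENDPOINT
    with any HTTP client."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = chat_request("meta-llama/Llama-3.1-70B-Instruct", "ping")
print(payload["messages"][0]["role"])  # -> user
```

Swap the engine and this code doesn't change -- that's the engine-agnostic point.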
--- TGI v3.0 -----------------------------------------------------
"The enterprise's chosen one"
Stars: ~10k | License: Apache 2.0
HuggingFace's Text Generation Inference is what happens when you
have the world's largest model hub and decide you should also serve
those models. The Rust+Python hybrid is genuinely clever -- Rust for
the hot path, Python for the model loading and config. Flash
Attention 2 integration is solid.
800-1500 tok/s on A100 for 70B models. Not chart-topping, but
respectable. The real story here is ecosystem integration. If you're
already on HuggingFace Inference Endpoints or Amazon SageMaker, TGI
is the path of least resistance. Sometimes the best tool is the one
that's already integrated.
The downsides are real though. That Rust codebase? Good luck if
you're an ML engineer who needs to debug a serving issue at 3 AM.
Cargo and PyTorch don't exactly play nice at the boundary. Model
support consistently lags vLLM by a few weeks to months -- if you
need day-one support for the latest architecture, look elsewhere.
Best for: Teams already invested in the HuggingFace ecosystem,
SageMaker deployments, organizations that value corporate backing
and support contracts over raw community size.
Verdict: Great if you're in the HuggingFace/AWS ecosystem. Otherwise,
hard to justify over vLLM unless you really love Rust.
--- TensorRT-LLM v0.17 -------------------------------------------
"The speed freak's playground"
Stars: ~10k | License: Apache 2.0 (with caveats)
Let me be blunt: if you're serving on NVIDIA hardware and every
millisecond matters, TensorRT-LLM is the answer. 2500-4000+ tok/s
on H100 with FP8 quantization. That's not a typo. We're talking
10-30% faster than vLLM on equivalent NVIDIA hardware, sometimes
more.
Perplexity uses it. Major cloud providers use it behind the scenes.
When you need to serve millions of requests and your GPU bill looks
like a mortgage, that 30% matters. It's real money.
But -- and this is a big but -- the developer experience is, to put
it diplomatically, not great. The compilation step alone will make
you question your career choices. You're building engine-specific
plans for specific model configurations on specific hardware. Change
your GPU? Recompile. Change your batch size? Recompile. Sneeze?
Believe it or not, recompile.
It's NVIDIA-only. Obviously. This is a feature and a limitation
depending on your worldview. The learning curve is steep enough that
you should budget engineering time measured in weeks, not days.
Best for: High-traffic production serving where latency is a
competitive differentiator, teams with strong CUDA/systems
engineering talent, anyone whose GPU bill exceeds their rent.
Verdict: The right choice when you're at scale on NVIDIA and have
the engineering team to support it. The wrong choice for nearly
everyone else. If your team doesn't have at least one person who's
comfortable reading CUDA kernels, think twice.
--- SGLang v0.4 --------------------------------------------------
"The one that might eat everyone's lunch"
Stars: ~10k | License: Apache 2.0
Okay, this is where it gets interesting. SGLang came out of UC
Berkeley and LMSYS (the Chatbot Arena folks), and it's been
quietly demolishing benchmarks while nobody was paying attention.
RadixAttention for prefix caching is elegant. The constrained
decoding support is best-in-class. And the numbers are wild --
3.1x faster than vLLM on DeepSeek V3 in their benchmarks. Now,
take any "we're Nx faster" claim with appropriate skepticism
(benchmark configurations matter), but even if you halve that
number, it's impressive.
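The RadixAttention win is easiest to see with a toy prefix cache: requests that share a system prompt reuse the cached prefix instead of recomputing it. A crude dict-based sketch (real SGLang uses a radix tree over token ids with LRU eviction; this is illustrative only):

```python
# Toy illustration of prefix caching a la RadixAttention: if two requests
# share a token prefix, the shared part's KV cache is computed once.
# SGLang's real structure is a radix tree with eviction; this is a sketch.

class PrefixCache:
    def __init__(self):
        self.cache = {}    # tuple(prefix tokens) -> KV placeholder
        self.computed = 0  # tokens we actually had to prefill

    def prefill(self, tokens):
        """Return length of the longest cached prefix; compute the rest."""
        hit = 0
        for i in range(len(tokens), 0, -1):
            if tuple(tokens[:i]) in self.cache:
                hit = i
                break
        self.computed += len(tokens) - hit
        # remember every prefix of this request for future sharers
        for i in range(hit + 1, len(tokens) + 1):
            self.cache[tuple(tokens[:i])] = object()  # stand-in for KV blocks
        return hit

system = list(range(100))            # 100-token shared system prompt
pc = PrefixCache()
pc.prefill(system + [1000, 1001])    # first request: full prefill, no hit
hit = pc.prefill(system + [2000])    # second request reuses the prefix
print(hit)                           # -> 100
print(pc.computed)                   # -> 103, not 203, tokens prefilled
```

Scale the prompt-to-query ratio up (long system prompt, short user turns) and you can see where the big multipliers in their benchmarks come from.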
xAI chose it for Grok. LMSYS runs their arena on it. These are
demanding workloads with smart people making the decisions.
The catch: smaller community means fewer Stack Overflow answers
when things break. It's less battle-tested in diverse production
environments. The documentation is improving but still has that
"academic project" feel in places. You're betting on a trajectory
here, not a track record.
Best for: Research teams, structured output heavy workloads,
anyone serving models with shared system prompts across requests,
teams willing to be early adopters for a potentially big payoff.
Verdict: The most exciting engine in this list. If I were starting
a new project today with GPU serving needs, I'd seriously evaluate
SGLang before defaulting to vLLM. Watch this space closely.
--- llama.cpp -----------------------------------------------------
"The cockroach (complimentary)"
Stars: ~75k | License: MIT
llama.cpp will survive the apocalypse. Georgi Gerganov's C/C++ masterwork
runs on literally everything: CUDA, ROCm, Metal, Vulkan, SYCL, CPU,
and yes, even WebAssembly. The GGUF format has become a de facto
standard for local model distribution. If you've ever downloaded a
model from a random person on HuggingFace, it was probably GGUF.
80-100 tok/s for a 7B model on M2 Ultra via Metal. Not going to
win any datacenter benchmarks, but that's not the point. The point
is that it runs, everywhere, on everything, with minimal fuss.
The quantization support is extraordinary -- from Q2_K to Q8_0,
you can trade quality for speed with granularity that the GPU
engines don't touch.
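The quality-for-size trade is easy to put numbers on. A back-of-envelope estimator (the effective bits-per-weight figures below are my rough approximations including block scales, not llama.cpp's exact accounting):

```python
# Back-of-envelope GGUF file sizes at different quantization levels.
# Effective bits-per-weight are rough approximations (block scales
# included), not llama.cpp's exact accounting.

BITS_PER_WEIGHT = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "F16": 16.0}

def approx_size_gb(n_params, quant):
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9

for q in ("Q2_K", "Q4_K_M", "Q8_0", "F16"):
    print(f"7B @ {q}: ~{approx_size_gb(7e9, q):.1f} GB")
```

That roughly 2-to-14 GB spread for the same 7B model is why llama.cpp fits places the GPU engines can't even consider.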
It's also the foundation that Ollama is built on, which means it
indirectly powers the local AI experience for millions of
developers.
The limitation is obvious: this is not a high-throughput serving
solution. If you're trying to serve concurrent users at scale,
you need one of the GPU-first engines above. llama.cpp is for
running models locally, for edge deployment, for weird hardware,
for places where a Python runtime is a luxury you can't afford.
Verdict: Indispensable for local/edge use cases. The widest hardware
support in the ecosystem by a country mile. Not your production
serving engine (and it doesn't pretend to be).
--- Ollama --------------------------------------------------------
"The people's champion"
Stars: ~120k | License: MIT
120,000 GitHub stars. Let that sink in. Ollama has more stars than
any other project on this list, and it's fundamentally a Go wrapper
around llama.cpp with a nice CLI and a model registry.
And you know what? That's exactly what it should be.
"ollama run llama3" -- that's it. Model downloaded, quantized,
running, chat interface ready. Your product manager can do this.
Your CEO can do this. My mom could probably do this (hi mom).
The Modelfile concept borrowed from Dockerfile is genuinely clever.
The local API is clean. The model library is curated. It's the
single best onboarding experience in all of AI tooling.
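The Dockerfile analogy is direct. A minimal Modelfile (the base model, parameter value, and system prompt here are illustrative):

```
FROM llama3
PARAMETER temperature 0.3
SYSTEM "You are a terse code reviewer. Answer in bullet points."
```

Then "ollama create reviewer -f Modelfile" and "ollama run reviewer" gives you a named, reproducible local model config. That's the whole workflow.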
But let's be clear about what it is and isn't. It is not a
production serving solution. It does not do continuous batching. It
does not do multi-GPU tensor parallelism. It is not optimized for
throughput. If you see Ollama in a production architecture diagram,
someone made a mistake (or it's a very unusual use case).
Verdict: Perfect for what it is -- the fastest path from "I want
to try a local LLM" to actually running one. Put it on every
developer's laptop. Do not put it behind a load balancer.
======================================================================
HEAD TO HEAD: THE BENCHMARK DISCUSSION
Let's talk numbers honestly, because benchmarks in this space are
a minefield of misleading comparisons.
The throughput hierarchy on NVIDIA hardware is clear:
TensorRT-LLM > SGLang >= vLLM > TGI >> llama.cpp > Ollama
TensorRT-LLM's 10-30% advantage over vLLM is real but comes with
massive operational complexity. The interesting story is SGLang
closing the gap with vLLM and sometimes surpassing it, especially
on newer architectures like DeepSeek V3 where RadixAttention and
their optimized scheduling really shine.
But raw throughput isn't everything. Here's what the benchmarks
usually DON'T measure:
Time to first deployment: Ollama wins by a wide margin, in minutes.
vLLM and TGI take minutes to hours. TensorRT-LLM takes days to weeks.
Recovery from failures: vLLM and TGI have mature health checks
and restart logic. TensorRT-LLM's compiled plans mean a failed
node isn't just a restart -- it might be a recompilation.
Long-tail latency: SGLang's RadixAttention is incredible for
workloads with shared prefixes (think: same system prompt across
requests). For random diverse queries, the advantage shrinks.
Cost efficiency: The fastest engine isn't always the cheapest.
vLLM's broader hardware support means you can shop AWS, GCP, and
Azure spot instances. TensorRT-LLM locks you into NVIDIA's
pricing power.
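The cost point deserves actual arithmetic: cost per million output tokens is just hourly GPU cost divided by tokens generated per hour. A sketch (every price and throughput number below is an illustrative placeholder, not a quote):

```python
# Cost per million output tokens = hourly GPU cost / tokens per hour.
# All prices and throughputs below are illustrative placeholders.

def usd_per_million_tokens(gpu_usd_per_hr, tok_per_s):
    return gpu_usd_per_hr / (tok_per_s * 3600) * 1e6

# A 30% throughput edge on pricier locked-in hardware can still lose:
print(round(usd_per_million_tokens(2.0, 1000), 2))  # commodity spot GPU -> 0.56
print(round(usd_per_million_tokens(3.5, 1300), 2))  # faster engine, pricier GPU -> 0.75
```

Run your own numbers before assuming the fastest engine is the cheapest one.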
The hardware flexibility ranking tells its own story:
llama.cpp > Ollama > vLLM > TGI > SGLang > TensorRT-LLM
And model support breadth:
llama.cpp > vLLM > SGLang > TGI > Ollama > TensorRT-LLM
Notice a pattern? The engines optimized for raw speed tend to
sacrifice flexibility. The engines that run everywhere sacrifice
throughput. There is no free lunch in inference. Anyone telling
you otherwise is selling something (probably GPUs).
My honest take: for most teams, the difference between vLLM and
SGLang is smaller than the difference between either of them and
having a well-tuned deployment configuration. Spend your engineering
hours on batching strategies, quantization choices, and prompt
optimization before you spend them switching inference engines.
The engine matters, but it matters less than people think relative
to everything else in your serving stack.
======================================================================
THE RECOMMENDATION MATRIX: JUST TELL ME WHAT TO USE
Fine. Here's the opinionated guide.
YOU'RE A...                                 | USE THIS
--------------------------------------------|------------------------
Solo dev exploring LLMs                     | Ollama
Startup building an AI product (< Series B) | vLLM
Enterprise with existing HF/AWS stack       | TGI
High-scale serving, performance-critical    | TensorRT-LLM or SGLang
Deploying to edge / weird hardware          | llama.cpp
Research team / academia                    | SGLang
Building a desktop AI app                   | llama.cpp (via binding)
Running inference on AMD GPUs               | vLLM (with patience)
Need structured / constrained output        | SGLang
Budget-constrained, CPU-only servers        | llama.cpp
Want to future-proof your bet               | vLLM (safe) or SGLang (bold)
The meta-advice: if you're asking "which inference engine should I
use?" and you don't already have strong opinions, the answer is
vLLM. It's the default for a reason. Graduate to something more
specialized when you've hit a specific wall -- and you'll know when
you have, because you'll be staring at latency dashboards at 2 AM
wondering why your P99 looks like a hockey stick.
======================================================================
THE CHANGELOG: WHAT SHIPPED THIS MONTH
Notable releases and updates from the inference engine world,
March 2026:
[vLLM v0.7.3] Landed automatic FP8 weight calibration for Hopper
GPUs. No more manual scale-factor hunting. Also: speculative
decoding now supports Medusa heads with Eagle-2 fallback. Memory
efficiency improved ~12% for long-context workloads (>64k tokens).
[TGI v3.0.2] Hotfix for the CUDA graph capture regression that was
causing OOMs on A10G instances. Added native Gemma 3 support.
Prometheus metrics endpoint now includes per-request KV cache
utilization. About time.
[TensorRT-LLM v0.17.1] Added Blackwell (B200) support with FP4
quantization. Yes, FP4 -- we've officially entered the "how low
can you go" era of number formats. Build times reduced 40% with
the new incremental compilation pipeline. Still not fast, but
less painful.
[SGLang v0.4.3] Merged the async constrained decoding PR that
eliminates the grammar-guided generation overhead on long outputs.
DeepSeek V3/R1 serving now uses 35% less KV cache via their
dynamic MLA compression. The most interesting changelog item
nobody's talking about.
[llama.cpp] Gerganov merged the 1-bit (ternary) weight format
experimental branch. BitNet-style models now run natively. Also:
SYCL backend got a major overhaul, Intel Arc GPUs seeing 2x
performance improvement. Vulkan compute shaders rewritten for
better mobile GPU compatibility.
[Ollama v0.6.0] Added "ollama compose" for multi-model pipelines.
Think docker-compose but for chaining a router model with
specialist models. Clever concept, early days on execution.
Also shipped a built-in benchmarking tool: "ollama bench" gives
you tok/s, memory usage, and time-to-first-token in one command.
======================================================================
PARTING THOUGHTS
The inference engine landscape is consolidating and fragmenting at
the same time. Consolidating around a few winners (vLLM for general
purpose, TensorRT-LLM for max performance, llama.cpp for local).
Fragmenting because SGLang is proving that the "settled" approaches
have significant room for improvement.
My prediction: by end of 2026, the vLLM vs SGLang rivalry will be
the story, with TensorRT-LLM maintaining its performance crown but
becoming increasingly niche as open engines close the gap. llama.cpp
will quietly become the most important piece of software in AI that
nobody in the enterprise talks about. And Ollama will hit 200k
stars while remaining blissfully inappropriate for production.
That's it for issue one. If this was useful, tell a friend. If it
wasn't, tell me -- I can take it.
Until next time, keep your batch sizes high and your latencies low.
-- The EVAL Team
EVAL -- The AI Tooling Intelligence Report
No hype. No fluff. Just tools.
To subscribe: [eval-newsletter.ai]
To unsubscribe: close this email and pretend it never happened
EVAL is the monthly AI tooling intelligence report. We eval the tools so you can ship the models.
Subscribe for free: buttondown.com/ultradune
Skill Packs for agents: github.com/softwealth/eval-report-skills
Follow: @eval_report on X