In this blog post, we will see how to use NVIDIA AIPerf to expose a hidden performance problem that most LLM deployments never catch until real users start complaining.
I ran three simple tests against a local model. The results tell a story that every performance engineer should see.
## The Setup
For this experiment, I used:

- Model: `granite4:350m` running locally via Ollama
- Endpoint: `http://localhost:11434`
- Tool: NVIDIA AIPerf (the official successor to GenAI-Perf)
Head to https://github.com/ai-dynamo/aiperf to install AIPerf. It is a single pip install:

```bash
pip install aiperf
```
Granite 4 350M is a small, fast model perfect for local testing on a MacBook or a dev machine without a beefy GPU. The principles you will see here apply equally to larger models in cloud deployments.
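Before running any load, it is worth a quick sanity check that the endpoint answers at all. Here is a minimal sketch in Python, assuming Ollama's OpenAI-compatible chat route (the same chat-completions shape that AIPerf's `chat` endpoint type speaks) and that the model is already pulled via `ollama pull granite4:350m`:

```python
import requests

# Sanity check: one chat completion against the local Ollama server.
# Assumes Ollama's OpenAI-compatible route at /v1/chat/completions.
resp = requests.post(
    "http://localhost:11434/v1/chat/completions",
    json={
        "model": "granite4:350m",
        "messages": [{"role": "user", "content": "Say hello in five words."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

If this prints a greeting, the endpoint is ready for benchmarking.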
## Run 1: The Baseline That Lies
I started with the most common mistake in LLM performance testing: a single-user baseline.
```bash
aiperf profile \
  --model "granite4:350m" \
  --streaming \
  --endpoint-type chat \
  --url http://localhost:11434 \
  --tokenizer builtin \
  --request-count 50 \
  --concurrency 1
```
The results looked great, as shown below.
Key numbers from this run (TTFT = Time to First Token, TTST = Time to Second Token, ITL = Inter-Token Latency):
| Metric | avg | p50 | p99 |
|---|---|---|---|
| TTFT (ms) | 223.11 | 217.60 | 317.61 |
| TTST (ms) | 10.94 | 9.99 | 18.00 |
| ITL (ms) | 10.67 | 10.51 | 12.35 |
| Request Latency (ms) | 1,309.30 | 1,043.95 | 3,251.73 |
| Request Throughput (req/sec) | 0.76 | N/A | N/A |
223ms average TTFT. Smooth inter-token latency at 10.67ms. If you stopped here, you would call this production-ready.
Most people stop here. That is the problem.
## Run 2: The Wake-Up Call
Next, I pushed concurrency to 50, a more realistic number for a shared endpoint. I also added a warmup of 10 requests to eliminate cold-start noise, and ran for 60 seconds.
```bash
aiperf profile \
  --model "granite4:350m" \
  --url http://localhost:11434 \
  --endpoint-type chat \
  --concurrency 50 \
  --tokenizer builtin \
  --warmup-request-count 10 \
  --benchmark-duration 60 \
  --streaming
```
The results were a shock, as shown below.
| Metric | avg | p50 | p99 |
|---|---|---|---|
| TTFT (ms) | 41,660.92 | 50,870.37 | 64,201.68 |
| TTST (ms) | 10.21 | 10.11 | 13.10 |
| ITL (ms) | 10.38 | 10.18 | 13.29 |
| E2E Output Token Throughput (tokens/sec/user) | 4.86 | 1.85 | 60.87 |
| Request Throughput (req/sec) | 0.88 | N/A | N/A |
TTFT went from 223ms to 41,660ms. That is a 186x increase.
At p99, users were waiting over 64 seconds just to see the first token.
Your monitoring dashboard probably still shows green. Your users are staring at a blank screen.
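The scale of the wait is not mysterious; it is ordinary queueing arithmetic. Here is a back-of-envelope sketch using Little's law with the measured numbers from Run 2 (a rough steady-state approximation that ignores batching):

```python
# Little's law: requests in system = arrival rate x time in system.
# Rearranged: time in system ~= in-flight requests / completion rate.
in_flight = 50            # --concurrency 50
completion_rate = 0.88    # req/sec, the measured Request Throughput in Run 2

time_in_system = in_flight / completion_rate
print(f"~{time_in_system:.0f}s per request in the system")  # ~57s

# Generation itself took only ~1-3s at concurrency 1 (Run 1), so almost
# all of that time is queue wait before the first token: the same order
# of magnitude as the observed 41-64s TTFT.
```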
## Run 3: Goodput Exposes the Real Truth
This is where AIPerf separates itself from basic benchmarking tools. I added the `--goodput` flag with a TTFT SLO of 500ms. Goodput measures the throughput of requests that actually met the SLO, not just all requests indiscriminately.
```bash
aiperf profile \
  --model "granite4:350m" \
  --url http://localhost:11434 \
  --endpoint-type chat \
  --concurrency 50 \
  --tokenizer builtin \
  --benchmark-duration 60 \
  --goodput 'time_to_first_token:500' \
  --streaming
```
As shown below, the result is the most important number in this entire experiment.
| Metric | Value |
|---|---|
| Request Throughput (req/sec) | 0.91 |
| Goodput (req/sec) | 0.01 |
| TTFT avg (ms) | 37,380.20 |
| TTFT p99 (ms) | 55,777.69 |
Request throughput says 0.91 req/sec. Looks reasonable.
Goodput says 0.01 req/sec.
That means roughly 99% of requests failed the 500ms TTFT SLO. Your system is processing requests. It is not serving users.
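AIPerf computes goodput for you, but the definition is simple enough to replicate from your own request logs, which makes it easy to adopt in any monitoring pipeline. A minimal sketch with hypothetical per-request TTFTs (illustrative numbers, not AIPerf internals):

```python
# Hypothetical per-request TTFTs (ms) from a fixed-duration benchmark.
ttfts_ms = [223.0, 480.0, 512.0, 41_660.0, 55_777.0]
duration_s = 60.0
slo_ms = 500.0  # the TTFT SLO passed via --goodput

completed = len(ttfts_ms)
good = sum(1 for t in ttfts_ms if t <= slo_ms)

throughput = completed / duration_s  # what "Request Throughput" reports
goodput = good / duration_s          # only SLO-meeting requests count
print(f"throughput={throughput:.3f} req/s, goodput={goodput:.3f} req/s")
```

Same denominator, very different numerators: that gap is the invisible user pain.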
## The Hidden Insight: ITL Stays Rock Solid
Here is what most people miss when they first see these numbers. Look at ITL across all three runs:
| Run | TTFT avg (ms) | ITL avg (ms) |
|---|---|---|
| Concurrency 1 | 223.11 | 10.67 |
| Concurrency 50 | 41,660.92 | 10.38 |
| Concurrency 50 + Goodput | 37,380.20 | 9.71 |
ITL barely moves. TTST (Time to Second Token) also stayed consistent around 10ms across all runs.
The model is not the problem. The queue is.
Once the model starts generating for a request, it flies. Tokens come out at a consistent 10ms pace regardless of how many other requests are in flight. The bottleneck is entirely in the prefill phase: requests pile up waiting for the model to even begin processing them.
This is a critical distinction for capacity planning. If ITL were also degrading, you would need a faster model or better hardware. Since only TTFT is exploding, the fix is architectural: better queue management, request routing, or horizontal scaling of the inference server.
You cannot arrive at this insight without separating TTFT from ITL. A single "response time" metric would have buried it entirely.
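If your stack only reports a single latency number, you can recover the split yourself from any streaming response. A rough sketch against Ollama's native streaming chat API at `/api/chat`, with the assumption that each streamed chunk approximates one token (close enough for this purpose, though the mapping is not guaranteed):

```python
import json
import time

import requests

url = "http://localhost:11434/api/chat"
payload = {
    "model": "granite4:350m",
    "messages": [{"role": "user", "content": "Explain queueing in one paragraph."}],
    "stream": True,  # Ollama emits one JSON object per line while streaming
}

start = time.perf_counter()
chunk_times = []
with requests.post(url, json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        if json.loads(line).get("done"):
            break
        chunk_times.append(time.perf_counter())

if not chunk_times:
    raise SystemExit("no tokens received")

ttft_ms = (chunk_times[0] - start) * 1000
itls_ms = [(b - a) * 1000 for a, b in zip(chunk_times, chunk_times[1:])]
print(f"TTFT: {ttft_ms:.1f} ms")
if itls_ms:
    print(f"ITL avg: {sum(itls_ms) / len(itls_ms):.1f} ms over {len(itls_ms)} gaps")
```

Run this once on an idle server and once while a load test is hammering it, and you will watch TTFT balloon while the ITL number barely moves.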
## The Lesson
Three commands. Three minutes. A completely different picture of your system.
| What you measured | What you learned |
|---|---|
| Single-user baseline | False confidence |
| Concurrency 50 | The real TTFT behavior under load |
| Goodput with SLO | How many users are actually being served |
The takeaway is simple: always test with realistic concurrency. Always set an SLO and measure goodput against it. And always look at TTFT and ITL separately; they tell completely different stories.
A system with great ITL and terrible TTFT under load has a queue problem, not a model problem. Knowing that changes everything about how you fix it.
Happy Testing!
Over to you: Have you ever shipped an LLM feature that looked great in testing but struggled under real user load? What metric finally exposed it? Drop a comment below; I would love to hear your story.



## Top comments (1)
The TTFT / ITL split is the single most useful framing for LLM perf and it's wild how few dashboards expose it separately — most teams ship a single "latency" number that averages a 40-second wait for the first byte with a 10ms inter-token rate and call the service "fast on average."
The deeper lesson buried in your goodput finding is that LLM serving is a queueing system, not a throughput system, and queueing systems fail in ways that look like "everything is fine until it isn't." 50 concurrent users on a 350M model isn't a thermal or model-capacity problem; it's a scheduling problem — the server is willing to accept all 50, batches them, and now request 50's TTFT is essentially "wait for batches 1 through 49 to drain." If you graph TTFT against in-flight request count rather than against wall-clock time, the cliff usually shows up cleanly and you can SLA against it.
The other piece I'd add to the toolkit: report percentiles per SLO bucket, not just p95 of TTFT. "p95 TTFT = 8s" is meaningless if your SLO is 500ms and only the lucky 5% of requests meet it. Your goodput-vs-throughput chart already implies this, but stating it as an explicit metric — "SLO attainment rate at concurrency N" — is the thing that gets latency taken seriously by everyone outside the perf team.
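That last suggestion is easy to prototype. If you log an (in-flight count, TTFT) pair for every request, attainment per concurrency bucket is a few lines; here is a sketch with hypothetical logged values:

```python
from collections import defaultdict

# Hypothetical (in_flight, ttft_ms) pairs logged per request.
records = [(1, 223.0), (1, 318.0), (50, 480.0), (50, 41_660.0), (50, 50_870.0)]
slo_ms = 500.0

# Group SLO pass/fail results by how many requests were in flight.
buckets = defaultdict(list)
for in_flight, ttft in records:
    buckets[in_flight].append(ttft <= slo_ms)

for n, met in sorted(buckets.items()):
    rate = 100 * sum(met) / len(met)
    print(f"concurrency {n}: {rate:.0f}% of requests met the {slo_ms:.0f}ms TTFT SLO")
```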