Harish Kotra (he/him)

Posted on Jul 2

Building LLM Drag Race: A Live Benchmark That Proves Your Gateway Is Working

#ai #programming #python #dailybuild2026

How I built a real-time streaming demo that races OpenAI and Groq side-by-side through agentgateway — and surfaces live observability data to prove every request flows through the proxy.

When I first heard about agentgateway — a Rust-based, open-source AI-native proxy — my first question was: "how do I actually see it working?" The admin UI is nice, but I wanted something more visceral. I wanted to watch it route live traffic.

So I built LLM Drag Race: a single-page web app that races OpenAI GPT-4o-mini against Groq Llama 3.3 70B in real time, both routed through agentgateway, with a live telemetry panel pulled directly from the gateway's Prometheus metrics endpoint.

This post walks through every technical decision, code snippet, and architecture choice.

The Architecture

Browser (index.html)
      │
      │  POST /race  →  SSE stream of token events
      ▼
FastAPI backend (port 8000)
      │
      │  LangChain ChatOpenAI → http://localhost:4000/v1
      │  (model name in body determines which backend the gateway uses)
      ▼
agentgateway (port 4000)      ←── Prometheus metrics: port 15020
      │                │
      ▼                ▼
  OpenAI API       Groq API

Three layers: a vanilla JS frontend, a Python/FastAPI backend, and agentgateway as the LLM proxy. The interesting part is how they fit together.

The Gateway Config

The first thing I learned: agentgateway's standalone config doesn't use URL-based routing. You don't send OpenAI traffic to /openai/... and Groq traffic to /groq/.... Instead, you declare models and the gateway routes by the model field in each request body.

# agentgateway.yaml
llm:
  models:
  - name: gpt-4o-mini
    provider: openai
    params:
      apiKey: "$OPENAI_API_KEY"
  - name: llama-3.3-70b-versatile
    provider: groq
    params:
      apiKey: "$GROQ_API_KEY"

Both providers are reachable at the same URL: http://localhost:4000/v1/chat/completions. The gateway reads the "model" key in the JSON body and dispatches accordingly. This means the backend code is identical for both providers — just different model names.

Start it with:

set -a && source .env && set +a
agentgateway -f agentgateway.yaml

Note: agentgateway's llm: config (without an explicit binds: block) defaults to port 4000, not 8080.

Concurrent Streaming with asyncio

The core challenge: both LLM calls must run simultaneously, but we want a single SSE stream going to the browser. The solution is a shared asyncio.Queue.

async def run_agent(agent, llm, prompt, do_stream, queue, start):
    try:
        if do_stream:
            first = True
            async for chunk in llm.astream([HumanMessage(content=prompt)]):
                if not chunk.content:
                    continue
                if first:
                    ttft_ms = round((time.perf_counter() - start) * 1000)
                    await queue.put({"agent": agent, "type": "ttft", "ms": ttft_ms})
                    first = False
                await queue.put({"agent": agent, "type": "token", "content": chunk.content})
            total_ms = round((time.perf_counter() - start) * 1000)
            await queue.put({"agent": agent, "type": "done", "total_ms": total_ms, ...})
    except Exception as e:
        await queue.put({"type": "error", "agent": agent, "message": str(e)})

The /race endpoint creates two tasks — one per provider — and a single consumer reads from the queue and yields SSE:

@app.post("/race")
async def race(req: RaceRequest):
    queue = asyncio.Queue()
    start = time.perf_counter()

    async def stream_events():
        tasks = [
            asyncio.create_task(run_agent("openai", openai_llm, req.prompt, ...)),
            asyncio.create_task(run_agent("groq", groq_llm, req.prompt, ...)),
        ]
        done_count = 0
        while done_count < 2:
            event = await queue.get()
            if event.get("type") in ("done", "error"):
                done_count += 1
            yield f"data: {json.dumps(event)}\n\n"

    return StreamingResponse(stream_events(), media_type="text/event-stream",
                             headers={"Cache-Control": "no-cache"})

No threads, no multiprocessing — pure async cooperative multitasking. Python's event loop interleaves the two astream() generators naturally.

Measuring Time to First Token

TTFT is recorded at the application layer: time.perf_counter() at request start, then delta when chunk.content is non-empty for the first time. This is the client-perceived TTFT — it includes network latency to the gateway.

The gateway measures its own TTFT independently and exposes it via Prometheus. Comparing the two gives you the overhead of your Python layer and network stack.

The SSE Protocol

Five event types flow over the stream:

data: {"agent": "openai", "type": "ttft", "ms": 342}
data: {"agent": "groq", "type": "token", "content": "The "}
data: {"agent": "openai", "type": "token", "content": "Apollo"}
data: {"agent": "groq", "type": "done", "total_ms": 4200, "total_tokens": 892}
data: {"agent": "openai", "type": "done", "total_ms": 12480, "total_tokens": 764}

The browser parses this with a ReadableStream reader and TextDecoder, splitting on \n\n and parsing each data: line as JSON:

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const parts = buffer.split('\n\n');
  buffer = parts.pop(); // keep incomplete chunk
  for (const part of parts) {
    const line = part.trim();
    if (!line.startsWith('data:')) continue;
    handleEvent(JSON.parse(line.slice(5).trim()));
  }
}

No EventSource API — that doesn't support POST requests. Manual fetch + ReadableStream is the right tool for SSE over POST.

Rendering Markdown as It Streams

LLMs emit markdown. Rendering it live is tricky: mid-stream you might have **bol — half a bold span — which will render as raw asterisks if you parse it immediately.

The solution: accumulate the full raw text and re-parse on every token.

const rawText = { openai: '', groq: '' };

function appendToken(agent, content) {
  rawText[agent] += content;
  const out = document.getElementById(`output${cap(agent)}`);
  out.innerHTML = marked.parse(rawText[agent]) + '<span class="cursor"></span>';
  out.scrollTop = out.scrollHeight;
}

marked.js is fast enough that re-parsing hundreds of kilobytes on each token doesn't cause perceptible jank. The partial markdown resolves correctly because marked.parse() is lenient — an unclosed ** just renders as text until the closing ** arrives.

The Gateway Telemetry Panel

This is the "proof it works" feature. agentgateway exposes Prometheus metrics on port 15020:

agentgateway_gen_ai_client_token_usage_sum{gen_ai_token_type="output",gen_ai_system="groq",...} 3323.0
agentgateway_gen_ai_server_time_to_first_token_sum{gen_ai_system="openai",...} 1.58
agentgateway_gen_ai_server_request_duration_count{gen_ai_system="openai",...} 3

A /gateway-stats endpoint in FastAPI fetches these, parses the Prometheus text format with a regex, and returns structured JSON:

def parse_prometheus(text):
    result = {}
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        m = re.match(r'^(\w+)\{([^}]*)\}\s+([\d.e+\-]+)', line)
        if not m:
            continue
        metric_name, labels_str, value = m.group(1), m.group(2), float(m.group(3))
        if not metric_name.startswith("agentgateway_gen_ai"):
            continue
        labels = dict(re.findall(r'(\w+)="([^"]*)"', labels_str))
        system = labels.get("gen_ai_system", "")
        token_type = labels.get("gen_ai_token_type", "")
        result[(metric_name, system, token_type)] = result.get(..., 0) + value
    return result

The UI shows per-provider: requests proxied, input/output tokens, avg TTFT (gateway-measured), avg time-per-output-token (TPOT), and avg total duration. These are cumulative since gateway start — you can watch them increment with each race.

The Streaming vs Non-Streaming Demo

One of the app's best features: per-provider streaming toggles. When streaming is OFF, the backend does a standard ainvoke() instead of astream(), then emits all content as a single token event followed by done. The UI shows a spinner until the wall of text appears.

This contrast — watching one panel stream token by token while the other spins and then dumps — is the clearest possible demonstration of why streaming matters for perceived latency.

What I Learned

agentgateway's standalone config is simpler than the Kubernetes CRD format. The llm: top-level block handles most use cases — multi-provider routing, API key management, rate limiting, guardrails — all in one place.

asyncio.Queue is the right abstraction for fan-in streaming. Don't try to merge async generators directly; push to a queue and consume from one place.

Re-parsing markdown on every token sounds expensive but isn't. marked.js is implemented in C via WASM in some builds, and even the pure JS version handles this fine at typical LLM token rates (10-50 tokens/second).

Prometheus metrics are your ground truth for gateway observability. The TTFT the gateway measures is independent of your application code — if they diverge significantly, you have overhead to investigate.

Fork It and Extend It

Some features worth adding:

More providers: Anthropic, Mistral, Bedrock. Add a YAML entry and a new panel in the HTML.
Race history: Store results in SQLite, build a win-rate leaderboard.
Cost tracking: Token counts × provider pricing = $/race, shown after each run.
Rate limit demo: Add localRateLimit to agentgateway.yaml, then spam the race button to trigger it and show the 429 in the UI.
Prompt library: Categorized benchmark prompts (coding, math, creative) for reproducible comparisons.