Request tagging for LLM evals with Bifrost dimension headers

#machinelearning #mlops #llm

TL;DR: Request tagging with Bifrost dimension headers (x-bf-dim-*) stamps checkpoint and run metadata onto every LLM eval call, so you slice scores by model version instead of guessing which change moved the aggregate.

We ran roughly 12,000 eval requests across four fine-tuned checkpoints last sprint, and when aggregate accuracy moved three points I couldn't tell which checkpoint produced which response. Our eval harness stored prompts and scores in one table; the routing layer recorded latency and provider somewhere else, and nothing carried the experiment ID end to end. We moved the eval traffic behind Bifrost, the open-source AI gateway from Maxim AI, and used its custom dimension headers to stamp each request with the checkpoint and run ID. Request tagging turned a join-by-timestamp guessing game into a filter.

What request tagging means for LLM evals

Request tagging attaches key-value metadata to each LLM API call so downstream logs, traces, and metrics can be grouped by that metadata. In Bifrost, any header prefixed x-bf-dim-* becomes a custom dimension that is auto-forwarded to logs, traces, and Prometheus, which lets you group eval scores by checkpoint, prompt version, or suite without modifying your harness.

I lead the fine-tuning and evaluation team at Nexus Labs, a Series B company building enterprise agent automation. Our problem was attribution, not measurement. A scoring function that returns 0.81 is useless if you can't tie that number to agentqa-v7-lora-r16 versus agentqa-v6. Most eval setups solve this by threading an experiment ID through every layer of application code, which breaks the moment someone forgets a kwarg. Pushing the metadata into a request header at the gateway means the harness stays dumb and the dimension travels with the request.

Stamping requests with x-bf-dim headers

Bifrost is a drop-in replacement for the OpenAI base URL, so the only change to our harness was the base_url and three extra headers. The gateway holds the provider keys, so the client API key is unused.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused-bifrost-holds-keys",
)

resp = client.chat.completions.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": eval_case.prompt}],
    extra_headers={
        "x-bf-dim-checkpoint": "agentqa-v7-lora-r16",
        "x-bf-dim-run-id": "eval-2026-06-19-batch3",
        "x-bf-dim-suite": "tool-routing-adversarial",
    },
)

Every request in that batch now carries three dimensions. When the scorer writes its verdict, I don't need to correlate anything by hand; the gateway already recorded the dimensions next to the latency, token counts, and resolved provider. The same endpoint fronts 20+ providers, so when I shadow a hosted model against a self-hosted checkpoint, both legs of the comparison get tagged identically and land in the same store.

Slicing eval results in observability

The dimensions are only useful if the read path is cheap. Bifrost writes telemetry through async observability with under 0.1ms of added overhead, using SQLite by default and Postgres for production volume. The sinks include Prometheus, OpenTelemetry, Datadog, and BigQuery, so I query the same dimensions from whichever tool the rest of the team already watches.

In practice I pull a Prometheus query grouped by checkpoint and suite, then compute per-slice accuracy from the scorer table joined on run_id. That is where the three-point aggregate move resolved: checkpoint v7 gained on the general suite and lost on the adversarial tool-routing suite, which the average had flattened. This kind of per-segment attribution is the whole reason I distrust single-number eval reports. Aggregate metrics are a summary statistic, and summary statistics hide structure by design. The methodology argument is old; the HELM evaluation work made the case for multi-metric, multi-scenario reporting years ago. Tagging at the gateway is the plumbing that makes per-scenario reporting cheap enough to actually do on every run.

One detail that saved me time: the dimensions are arbitrary strings, so I tag prompt-template hashes too. When a template edit slipped into a run, the prompt_hash dimension showed two distinct values inside one supposedly clean batch, and I caught a contaminated comparison before it reached a decision.

Trade-offs and limitations

This is not free infrastructure. Bifrost runs as a separate Go service, so you operate one more process, and a serious deployment needs Postgres rather than the default SQLite once you push real eval volume through it. If your stack is pure Python and you want everything in-process, a library like LiteLLM keeps fewer moving parts, at the cost of the gateway-level telemetry I'm describing here. Bifrost's ecosystem is also younger than LiteLLM's, so you will find fewer community examples for edge integrations.

The dimension headers are forwarded, not validated. Nothing stops a typo in x-bf-dim-checkpoint from creating a phantom slice, so I keep the tag values in one constants module and assert against it in the harness. Cluster-mode horizontal scaling is an enterprise feature, not part of the open-source core, which matters if your eval fleet outgrows a single instance. For a four-checkpoint sprint on one box, none of this bit me. Know your scale before you assume it won't.

Wrapping up

Request tagging with x-bf-dim-* dimension headers moved attribution out of my eval code and into the gateway, which is where it belongs when many checkpoints and suites share one pipeline. The model was never the hard part. Knowing which model produced which number was. If you want to see the tagging and observability path end to end, book a demo: https://getmaxim.ai/bifrost/book-a-demo