See how Bifrost handles LLM observability through default request logging, OTEL-compliant tracing, Prometheus metrics, and OTLP backend integrations.
Running LLM workloads in production without observability has stopped being a viable strategy. Once requests start fanning out across providers, models, and internal teams, the inputs, outputs, token counts, latencies, and dollar costs that govern reliability and unit economics disappear from view unless a dedicated telemetry layer captures them. Bifrost, the open-source AI gateway from Maxim AI, treats LLM observability as a first-class concern rather than an optional add-on. Every request flowing through the gateway is logged, traced, and measured against OpenTelemetry (OTEL) semantic conventions, with native Prometheus metrics and direct paths into Grafana, Datadog, New Relic, Honeycomb, and any other OTLP-compatible backend. This guide covers how LLM logging, OTEL tracing, and gateway metrics fit together inside Bifrost, and how to wire them up for a production deployment.
Defining LLM Observability
LLM observability is the discipline of capturing, structuring, and analyzing every signal that a large language model application produces: prompts, completions, token counts, latency, cost, errors, tool calls, and more. It extends classic application performance monitoring with model-specific dimensions, including provider, model version, temperature, prompt template, and embedding operations. The objective is to make AI request behavior measurable, debuggable, and attributable across services, teams, and providers.
Three signals form the foundation of any LLM observability stack:
- Logs: Complete request and response payloads, enriched with metadata such as tenant ID, virtual key, prompt version, and retry trail.
- Traces: Distributed spans that describe the full lifecycle of an LLM call, including upstream provider latency, retry attempts, fallbacks, and tool execution.
- Metrics: Aggregated counters and histograms tracking tokens, cost, cache hits, errors, and time-to-first-token.
Why the Gateway Is the Right Place for LLM Observability
Capturing telemetry inside individual application services produces fragmentation. Each service ends up with its own logging stack, its own cost dashboard, and its own provider client code. When anomalies appear in latency, spend, or model drift, correlating them across the platform becomes effectively impossible. An AI gateway sits between every application and every provider, which is exactly the architectural position where unified observability belongs.
By capturing telemetry at the gateway, Bifrost gives platform teams:
- A centralized log stream for every LLM request, no matter which service or developer issued it.
- Cost attribution by team, tenant, virtual key, or project, without any application-side instrumentation.
- Provider-level latency and error visibility across the full portfolio of integrated providers.
- Compliance-ready audit trails for SOC 2, HIPAA, GDPR, and ISO 27001 reporting through Bifrost's audit logs.
This placement also closes the loop with governance. The same telemetry that powers dashboards becomes the input to virtual key budgets, rate limits, and access controls, converting observability data into enforceable policy. For a deeper view of how telemetry and policy reinforce each other in regulated environments, the Bifrost governance resource walks through the integration in detail.
How Bifrost's Built-In LLM Logging Operates
Out of the box, Bifrost ships with built-in observability that captures every AI request and response in real time, with no application code changes required. The logging plugin runs asynchronously: database writes happen in background goroutines, with sync.Pool memory optimization keeping the hot path clean. Benchmark numbers show the logging plugin adding less than 0.1 ms of overhead per request, in line with Bifrost's broader 11-microsecond gateway overhead at 5,000 RPS.
Every log entry contains:
- Input messages: The full conversation history, prompts, and request parameters such as temperature, max_tokens, and stop sequences.
- Provider and model context: Which provider served the request and which model handled it.
- Output messages: Completions, tool calls, and function results.
- Performance metrics: Latency, prompt tokens, completion tokens, total tokens, and computed cost in USD.
- Retry and key selection trail: An ordered list of every attempt, the key used on each try, and the reason any retry failed.
-
Custom metadata: Any HTTP header listed under
logging_headers, plus any header with thex-bf-lh-prefix, captured automatically into log metadata.
On the storage side, Bifrost defaults to SQLite for development and self-hosted setups, with PostgreSQL available for high-volume production. MySQL and ClickHouse backends are on the roadmap, aimed at large-scale time-series workloads. Logs are reachable through a built-in dashboard, a REST API with rich filters (provider, model, status, latency range, token range, cost range, content search), and a WebSocket endpoint that streams live updates.
OTEL Tracing Inside Bifrost
Bifrost includes a native OTEL plugin that emits LLM traces in OTLP format to any OpenTelemetry collector. The trace schema follows the OpenTelemetry GenAI semantic conventions, which is the standard format for tracking prompts, model responses, token usage, tool calls, and provider metadata. Because Bifrost emits to this standard, traces drop into any OTLP-compatible backend without vendor-specific instrumentation work.
Every LLM span Bifrost emits carries a rich attribute set:
-
Operation type:
gen_ai.chat,gen_ai.text,gen_ai.embedding,gen_ai.speech,gen_ai.transcription, orgen_ai.responses. -
Provider and model:
gen_ai.provider.nameandgen_ai.request.model. - Request parameters: Temperature, max_tokens, top_p, presence_penalty, frequency_penalty, and tool configurations.
-
Token usage:
gen_ai.usage.prompt_tokens,gen_ai.usage.completion_tokens,gen_ai.usage.total_tokens. -
Cost:
gen_ai.usage.cost, computed in USD. - Input and output: Full chat history with role-tagged messages, prompt text, tool calls, and tool results.
- Performance: Start and end timestamps, plus error details with status codes.
Streaming requests are handled by an accumulator that waits for the stream to finish before emitting one complete span, so token counts and cost figures stay accurate. Span tracking uses a sync.Map implementation with a 20-minute TTL, which prevents memory leaks in long-running processes. Emission itself runs asynchronously in background goroutines, so request latency is unaffected.
OTEL Backends Bifrost Supports
The OTEL plugin works against any OTLP-compatible backend over HTTP (port 4318) or gRPC (port 4317). Bifrost's OTEL integration ships with ready-to-use configuration recipes for:
- Grafana Cloud: Native OTLP HTTP endpoint, authenticated with Basic auth.
-
Datadog: APM trace endpoint with the
DD-API-KEYheader, paired with the native LLM Observability dashboards. -
New Relic: OTLP HTTP endpoint with the
api-keyheader. -
Honeycomb: OTLP HTTP endpoint with
x-honeycomb-teamand dataset headers. - Langfuse: Open-source LLM observability platform reached via OTLP HTTP.
-
Self-hosted collectors: Any OpenTelemetry Collector instance, with optional TLS configured through
tls_ca_cert.
The standard OTEL_RESOURCE_ATTRIBUTES environment variable is also honored, so attributes like deployment.environment=production, service.version=1.2.3, and team.name=platform attach to every span Bifrost emits.
Prometheus Metrics and Cross-Node Observability
In addition to logs and traces, Bifrost exposes native Prometheus metrics that cover every dimension of a request. The default set includes:
-
bifrost_upstream_requests_total,bifrost_success_requests_total,bifrost_error_requests_total -
bifrost_input_tokens_total,bifrost_output_tokens_total,bifrost_cost_total -
bifrost_cache_hits_totalfor tracking semantic cache effectiveness -
bifrost_upstream_latency_seconds,bifrost_stream_first_token_latency_seconds,bifrost_stream_inter_token_latency_seconds
On a single-node deployment, Prometheus can simply scrape the /metrics endpoint. For multi-node deployments sitting behind a load balancer, the OTEL plugin offers push-based metrics export instead: every Bifrost node pushes metrics to a central OTEL Collector on a configurable interval (15 seconds by default). This sidesteps the service discovery and per-node scrape configuration that pull-based collection demands, which is exactly the part that breaks when nodes are scaled dynamically. Push metrics pair with Bifrost's clustering mode to give accurate cross-node aggregation in high-availability deployments.
Setting Up OTEL Tracing in Bifrost
Turning on OTEL tracing in Gateway mode is a single plugin entry in config.json:
{
"plugins": [
{
"enabled": true,
"name": "otel",
"config": {
"service_name": "bifrost",
"collector_url": "http://localhost:4318",
"trace_type": "genai_extension",
"protocol": "http",
"headers": {
"Authorization": "env.OTEL_API_KEY"
}
}
}
]
}
Headers accept environment variable substitution via the env. prefix, so API keys and tokens are read at runtime instead of being committed to configuration files. For gRPC transport, set protocol to grpc and target port 4317 on the collector_url. Collectors that require client certificate authentication can be wired up through the optional tls_ca_cert field for TLS support.
To turn on push-based metrics export, add metrics_enabled, metrics_endpoint, and an optional metrics_push_interval to the same plugin configuration. The Prometheus-style metrics then flow via OTLP to a central collector, which can route them onward to Datadog, Grafana Cloud, or any backend that accepts OTLP metrics.
Production Best Practices for LLM Observability
A handful of practices keep LLM observability tidy as Bifrost scales:
-
Attribute tenants through logging headers: Add
X-Tenant-ID,X-Correlation-ID, or any custom header tologging_headersso per-tenant filtering and cost attribution work cleanly. -
Tag every environment with resource attributes: Setting
deployment.environment,service.version, andteam.nameon every trace prevents staging and production telemetry from contaminating the same dashboards. - Switch to push metrics in clustered deployments: Pull-based scraping tends to miss nodes behind load balancers, while push-based OTLP metrics guarantee complete aggregation across the fleet.
-
Turn off content logging for sensitive workloads: The
disable_content_loggingflag keeps usage metadata (tokens, cost, latency) while dropping prompt and completion text, which is essential for regulated industries that handle PHI or PII. Teams operating under those constraints can pair this with Bifrost's guardrails for input and output policy enforcement at the gateway layer. - Stitch gateway traces into application traces: Since Bifrost emits OTLP-compliant spans, they join existing distributed traces automatically whenever the application propagates the W3C trace context header.
For organizations consolidating on a single observability backend, the OpenTelemetry Collector ecosystem becomes the routing layer between Bifrost and downstream tools, enabling redaction, sampling, and multi-destination export without touching gateway configuration.
Putting Bifrost LLM Observability to Work
Bifrost makes LLM observability the default behavior rather than something teams have to bolt on later. Built-in logging captures every request with zero code changes, the OTEL plugin pushes OTLP traces to any compatible backend using GenAI semantic conventions, and native Prometheus metrics drop into existing dashboards. For platform teams running AI in production, the practical effect is that cost attribution, latency analysis, error debugging, and compliance reporting all originate from a single telemetry source. To explore how Bifrost can consolidate LLM logging and OTEL tracing across your AI infrastructure, schedule a Bifrost demo.
Top comments (0)