Exporting LLM Metrics to Prometheus and Datadog

#observability #prometheus #datadog #llm

A guide to instrumenting AI applications for production observability, comparing the open-source Prometheus stack with Datadog's integrated platform. For teams building with LLMs, an AI gateway like Bifrost can centralize and simplify metric collection for either destination.

LLM observability extends traditional application monitoring by capturing signals unique to generative AI, such as token usage, response quality, and cost. For a conventional service, a 200 OK status code signals success; for an LLM, it means only that a response was generated. Whether that response is accurate, relevant, and safe is a separate question that requires a deeper layer of monitoring, one that tracks not just system health but also the semantic behavior of models in production.

Two leading platforms for this are Prometheus, an open-source metrics toolkit, and Datadog, a commercial all-in-one observability service. Engineering teams often place an AI gateway such as Bifrost, an open-source project from Maxim AI, between applications and model providers. The gateway centralizes metric collection, making it simpler to export consistent, standardized telemetry to either monitoring backend without instrumenting every application individually.

Key Metrics for LLM Observability

Before exporting data, it is essential to know what to track. Standard LLM observability focuses on several key areas:

Token Usage and Cost: Tracking prompt and completion tokens per request is fundamental for cost management. A minor change to a prompt template can have significant cost implications at scale.
Latency: Monitoring end-to-end request duration, time-to-first-token (TTFT), and per-provider latency helps identify performance bottlenecks. LLM applications often show long-tail latency that averages hide, so the 90th and 99th percentiles (P90 and P99) are critical.
Error Rates: Differentiating between standard HTTP errors (e.g., 5xx from a provider) and model-specific errors (e.g., content moderation blocks) is crucial for diagnostics.
Quality and Behavior: This includes tracking user feedback, detecting hallucinations, and monitoring for prompt injection attempts.
Provider and Model Usage: Understanding which models are being used, by which teams, and for what purpose helps optimize both cost and performance.
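To make the cost and latency points concrete, here is a minimal sketch using only the Python standard library. The prices, token counts, and latency samples are invented for illustration, and the helper names (request_cost, percentile) are hypothetical rather than part of any tool discussed here.

```python
import math
import statistics

# Hypothetical prices in USD per one million tokens; real prices vary by provider and model.
PROMPT_PRICE_PER_MILLION = 3.00
COMPLETION_PRICE_PER_MILLION = 15.00


def request_cost(prompt_tokens, completion_tokens):
    """Return the estimated USD cost of one request from its token counts."""
    return (prompt_tokens * PROMPT_PRICE_PER_MILLION
            + completion_tokens * COMPLETION_PRICE_PER_MILLION) / 1_000_000


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented sample: 95 fast requests and 5 slow ones, in seconds.
latencies = [0.8] * 95 + [12.0] * 5

mean_latency = statistics.mean(latencies)  # about 1.36 s, which matches no real request
p90_latency = percentile(latencies, 90)    # still in the fast group
p99_latency = percentile(latencies, 99)    # lands in the slow tail

# Adding 200 tokens to a prompt template raises the cost of every request.
extra_cost_per_request = request_cost(1400, 300) - request_cost(1200, 300)
```

Under these assumed prices, the 200 extra prompt tokens add roughly $0.0006 per request, or about $6,000 across 10 million requests. The latency sample shows why the mean alone misleads: it sits between the two groups of requests, while P99 exposes the 12-second tail.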

The OpenTelemetry GenAI Semantic Conventions provide a vendor-neutral standard for naming these metrics and attributes, ensuring consistency across different models and observability platforms.
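As a rough illustration of what these conventions look like in practice, the sketch below builds a set of span attributes using attribute names from the GenAI conventions. The conventions are still evolving, so names should be checked against the current specification; the helper function and the model name are hypothetical.

```python
def genai_span_attributes(operation, model, input_tokens, output_tokens):
    """Build span attributes keyed by OpenTelemetry GenAI attribute names."""
    return {
        "gen_ai.operation.name": operation,          # e.g. "chat"
        "gen_ai.request.model": model,               # model requested by the caller
        "gen_ai.usage.input_tokens": input_tokens,   # prompt tokens
        "gen_ai.usage.output_tokens": output_tokens, # completion tokens
    }


# "example-model" is a placeholder, not a real model identifier.
attributes = genai_span_attributes("chat", "example-model", 1200, 300)
```

Because every provider's telemetry is mapped onto the same keys, a dashboard or alert written against gen_ai.usage.input_tokens keeps working when the underlying model changes.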

Option 1: Exporting LLM Metrics to Prometheus

Prometheus is an open-source monitoring system that collects and stores time-series data. It operates on a pull model, where the Prometheus server periodically scrapes a /metrics endpoint exposed by the monitored service. Because Prometheus can find scrape targets through service discovery, this model is well-suited for dynamic environments like Kubernetes, where instances come and go. The typical stack includes Prometheus for data collection, Alertmanager for notifications, and Grafana for visualization.

For services that do not natively expose a Prometheus endpoint, the common pattern is to use an exporter: a sidecar or standalone service that queries the target application and presents the data in the Prometheus text exposition format.
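The exporter pattern can be sketched with the Python standard library alone. The example below renders a counter in the Prometheus text exposition format and serves it at /metrics. The metric name app_llm_requests_total, its labels, and the in-memory SAMPLES dictionary are assumptions made for illustration; a production exporter would more likely use an official Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler


def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    samples maps a tuple of (label_name, label_value) pairs to a count.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter state; a real exporter would query the target application.
SAMPLES = {(("model", "example-model"), ("provider", "example-provider")): 42}


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics at /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "app_llm_requests_total", "Total LLM requests.", SAMPLES
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it, pass MetricsHandler to http.server.HTTPServer on a port of your choosing and call serve_forever(), then point a scrape job at that address.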

How Bifrost Integrates with Prometheus

AI gateways can simplify this process significantly. Bifrost, for instance, has a built-in Prometheus integration that exposes a /metrics endpoint out of the box. Teams do not need to build or maintain a separate exporter.

The gateway automatically tracks and exposes key metrics, including:

http_requests_total: Total number of HTTP requests to the gateway.
bifrost_upstream_requests_total: Total requests forwarded to LLM providers.
bifrost_prompt_tokens_total: Counter for all prompt tokens processed.
bifrost_completion_tokens_total: Counter for all completion tokens generated.
bifrost_request_duration_seconds: A histogram of request latency.

These metrics can be labeled with dimensions like provider, model, and status_code, allowing for detailed analysis in Grafana. The Bifrost telemetry system operates asynchronously to ensure that metrics collection does not add latency to the actual LLM requests.
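A latency histogram such as bifrost_request_duration_seconds stores cumulative bucket counts rather than raw samples, so percentiles are estimated by interpolating within a bucket. The sketch below is a simplified Python approximation of that idea, similar in spirit to PromQL's histogram_quantile function but not a copy of its implementation. The bucket boundaries and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper_bound and ending with the +Inf bucket, like a histogram's `le` series.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation, assuming samples are spread evenly in the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts for upper bounds of 0.5 s, 1 s, 2.5 s, and +Inf.
BUCKETS = [(0.5, 50), (1.0, 90), (2.5, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, BUCKETS)
```

The estimate is only as precise as the bucket layout. With these buckets, anything above the 99th percentile is reported as 2.5 seconds, the highest finite bound, so bucket boundaries should extend far enough to cover realistic LLM tail latencies.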

A typical setup involves:

Deploying Bifrost: Run the gateway as a container or binary, routing all application traffic through it.
Configuring Prometheus: Add a scrape configuration to the prometheus.yml file to target the /metrics endpoint of each Bifrost instance.
Visualizing in Grafana: Connect Grafana to the Prometheus data source and build dashboards to monitor key performance indicators.

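# prometheus.yml
scrape_configs:
  - job_name: 'bifrost'
    scrape_interval: 15s
    metrics_path: /metrics  # the default, shown here for clarity
    static_configs:
      # Use the port each Bifrost instance actually listens on. 9090 is a
      # placeholder; it is also Prometheus's own default port, so avoid a clash.
      - targets: ['bifrost-instance-1:9090', 'bifrost-instance-2:9090']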

This approach centralizes all LLM-related metrics at the gateway, providing a single source of truth without requiring per-service instrumentation.

Option 2: Exporting LLM Metrics to Datadog

Datadog is a comprehensive, SaaS-based observability platform that unifies metrics, traces, and logs in a single interface. Unlike Prometheus's pull model, Datadog primarily relies on an agent-based push model, where an agent installed on the host collects and forwards telemetry to the Datadog service.

Datadog offers a dedicated product, LLM Observability, which provides specialized dashboards for monitoring AI applications. This product automatically tracks prompts, responses, token usage, costs, and latency with minimal configuration.

How Bifrost Integrates with Datadog

To streamline data export, tools like Bifrost offer a native Datadog connector. This integration uses Datadog's SDKs to send rich data directly to the platform, covering three main areas:

APM Traces: Distributed traces provide end-to-end visibility into request flows.
LLM Observability: Spans are tagged with GenAI-specific metadata, populating the dedicated LLM Observability dashboards automatically.
Metrics: Operational metrics are sent via DogStatsD for real-time monitoring.
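For context on the metrics path, DogStatsD accepts plain-text datagrams, by default over UDP on port 8125 of the local Agent. The sketch below formats and sends such a datagram with the Python standard library. It illustrates the wire format only and is not how the Bifrost connector is implemented; the metric names and tags are hypothetical.

```python
import socket


def format_dogstatsd(name, value, metric_type, tags=None, sample_rate=None):
    """Build a DogStatsD datagram: name:value|type[|@sample_rate][|#tag:value,...]."""
    datagram = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        datagram += f"|@{sample_rate}"
    if tags:
        datagram += "|#" + ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send one datagram over UDP to a DogStatsD listener (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical counter: prompt tokens for one request, tagged by provider and model.
token_datagram = format_dogstatsd(
    "llm.tokens.prompt", 1200, "c",
    tags={"provider": "example-provider", "model": "example-model"},
)
```

Calling send_metric(token_datagram) would hand the counter to a local Agent, which aggregates and forwards it. UDP delivery is not acknowledged, which is part of why metric emission adds little overhead to the request path.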

The Bifrost Datadog connector can operate in two modes:

Agent Mode (Default): The connector sends data to a local Datadog Agent, which handles batching and retries. This is the recommended approach for production environments.
Agentless Mode: Data is sent directly to Datadog's API endpoints. This simplifies deployment in serverless or containerized environments where running a full agent is not feasible.

Because the gateway handles the integration, application teams do not need to add the Datadog SDK to their own code. They simply route traffic through Bifrost, and the connector takes care of exporting detailed telemetry. This is especially valuable as more organizations standardize their observability pipelines on OpenTelemetry, which Datadog's LLM Observability product now natively supports.

Prometheus vs. Datadog: Which to Choose?

The choice between Prometheus and Datadog often comes down to a trade-off between control and convenience.

Prometheus is ideal for teams that prefer an open-source, self-hosted solution and have the expertise to manage the full stack (Prometheus, Grafana, Alertmanager). It offers immense flexibility and cost control but requires more operational overhead.
Datadog is better for teams that want a managed, all-in-one platform with powerful built-in analytics, alerting, and a polished user experience. It has a lower setup cost but higher ongoing subscription fees.

An AI gateway provides a strategic control point that decouples application logic from observability concerns. By centralizing traffic, a gateway like Bifrost can instrument every LLM call consistently and export standardized metrics to either Prometheus or Datadog. This allows platform teams to own the observability pipeline while letting application developers focus on building features. Teams evaluating their options can start a trial of Bifrost Enterprise to test the native connectors.