Debby McKinney
How to Monitor LLM Costs in Real-Time with an AI Gateway

LLM costs scale unpredictably; a misconfigured agent can burn $10K in hours. Without real-time monitoring, teams discover budget overruns days or weeks later through provider bills.

This guide shows how to implement real-time LLM cost monitoring with instant alerts using the Bifrost AI Gateway.

GitHub: maximhq / bifrost

Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost AI Gateway


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration…


The Cost Monitoring Problem

Without Real-Time Monitoring:

  • Discover cost overruns via monthly bill
  • No per-team or per-user attribution
  • Cannot identify expensive queries
  • No alerts before budget is exhausted

With Real-Time Monitoring:

  • Live cost tracking per request
  • Per-team / per-user / per-project attribution
  • Query-level cost analysis
  • Alerts at 80% / 90% budget thresholds
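
The 80% / 90% tiers above boil down to a simple utilization check. Here's a minimal sketch in Python (an illustrative helper, not part of Bifrost's API):

```python
from typing import Optional

def budget_alert_level(usage: float, limit: float) -> Optional[str]:
    """Map budget utilization to the 80% / 90% alert tiers described above.
    Illustrative helper; Bifrost evaluates these thresholds server-side."""
    if limit <= 0:
        raise ValueError("budget limit must be positive")
    utilization = usage / limit
    if utilization >= 0.9:
        return "critical"
    if utilization >= 0.8:
        return "warning"
    return None

print(budget_alert_level(400.0, 500.0))  # 400/500 = 80% -> "warning"
```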

Bifrost’s Cost Monitoring Architecture

Bifrost ships with built-in observability and metrics so you don’t have to build cost tracking from scratch.

Built-in Dashboard (Bifrost UI at http://localhost:8080):

  • Real-time request logs with costs
  • Cost tracking per virtual key / team / customer
  • Token usage visualization
  • Budget utilization graphs

You get this UI as part of the Bifrost AI Gateway once the gateway is running.

Prometheus Metrics (at http://localhost:8080/metrics):

  • Cost aggregation by model / provider / team
  • Budget utilization percentages
  • Token usage trends
  • Request-level cost distribution

Prometheus compatibility is a core part of Bifrost’s Telemetry and Prometheus Metrics features.
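
A scrape of `/metrics` returns Prometheus text-exposition format: one sample per line, with labels in braces. Here's a minimal parser sketch so you can see what downstream tooling works with; the metric and label names in the sample are assumptions for illustration, not Bifrost's exact schema:

```python
import re

# Sample text-exposition output. Metric/label names here are illustrative,
# not Bifrost's exact schema.
SCRAPE = """\
bifrost_cost_total{team="engineering",model="gpt-4o-mini"} 1200.0
bifrost_cost_total{team="data-science",model="gpt-4o"} 3500.0
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for raw in text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # skip comments, HELP/TYPE lines, and blanks
        name, label_blob, value = m.groups()
        labels = dict(kv.split("=", 1) for kv in label_blob.split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
        yield name, labels, float(value)

total = sum(v for _, _, v in parse_metrics(SCRAPE))
print(total)  # combined cost across both samples
```

(This sketch ignores edge cases like commas inside label values; real consumers should use a Prometheus client library.)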


Setup: Real-Time Cost Monitoring

Step 1: Install Bifrost

You can run Bifrost locally in seconds:

npx -y @maximhq/bifrost

This starts the Bifrost AI Gateway with the HTTP API and built-in UI.

Step 2: Configure Hierarchical Budgets

Bifrost’s governance model lets you define budgets at customer, team, and user (virtual key) levels using Virtual Keys and Budget & Rate Limits.

Customer Budget ($10K/month):

curl -X POST http://localhost:8080/api/governance/customers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Corp",
    "budget": {
      "max_limit": 10000.00,
      "reset_duration": "1M"
    }
  }'

Team Budgets:

# Engineering: $5K
curl -X POST http://localhost:8080/api/governance/teams \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Engineering",
    "customer_id": "customer-acme",
    "budget": {"max_limit": 5000.00, "reset_duration": "1M"}
  }'

# Data Science: $3K
curl -X POST http://localhost:8080/api/governance/teams \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Data Science",
    "customer_id": "customer-acme",
    "budget": {"max_limit": 3000.00, "reset_duration": "1M"}
  }'

User Budgets (per virtual key):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-dev-alice \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-engineering",
    "budget": {"max_limit": 500.00, "reset_duration": "1M"}
  }'

This is how you get granular governance at every layer: customer → team → user.
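
Conceptually, enforcement walks the hierarchy on every request: a call is rejected if it would push the user, the team, *or* the customer over its limit. A toy sketch of that check (the data model here is illustrative; Bifrost tracks usage server-side):

```python
# Conceptual sketch of hierarchical budget enforcement (user -> team -> customer).
# Limits mirror the curl examples above; the in-memory model is illustrative.
LIMITS = {
    "customer-acme": 10_000.00,
    "team-engineering": 5_000.00,
    "vk-dev-alice": 500.00,
}
HIERARCHY = {"vk-dev-alice": ["vk-dev-alice", "team-engineering", "customer-acme"]}

def request_allowed(vk, usage, cost):
    """Reject a request if it would push any layer over its budget.
    Returns (allowed, offending_layer_or_None)."""
    for layer in HIERARCHY[vk]:
        if usage.get(layer, 0.0) + cost > LIMITS[layer]:
            return False, layer
    return True, None

usage = {"vk-dev-alice": 499.0, "team-engineering": 1200.0, "customer-acme": 4000.0}
print(request_allowed("vk-dev-alice", usage, 2.00))  # a $2 call would exceed Alice's $500 cap
```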

Step 3: Configure Prometheus Alerts

Point Prometheus at Bifrost’s metrics endpoint:

prometheus.yml:

scrape_configs:
  - job_name: 'bifrost'
    static_configs:
      - targets: ['localhost:8080']

Then define budget and cost alerts using Bifrost’s cost and budget metrics.

alerts.yml:

groups:
  - name: llm_costs
    rules:
      # User budget warning
      - alert: UserBudgetWarning
        expr: (budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) > 0.8
        labels:
          severity: warning
        annotations:
          summary: "User {{ $labels.vk }} at 80% budget"

      # User budget critical
      - alert: UserBudgetCritical
        expr: (budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "User {{ $labels.vk }} at 90% budget"

      # Team budget critical
      - alert: TeamBudgetCritical
        expr: (team_budget_usage / team_budget_limit) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "Team {{ $labels.team }} at 90% budget"

      # Customer budget critical
      - alert: CustomerBudgetCritical
        expr: (customer_budget_usage / customer_budget_limit) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "Customer {{ $labels.customer }} at 90% budget"

      # Expensive query alert
      - alert: ExpensiveQuery
        expr: bifrost_request_cost_dollars > 10
        labels:
          severity: warning
        annotations:
          summary: "Request cost ${{ $value }} from {{ $labels.vk }}"

Step 4: Configure Alertmanager

Wire Prometheus alerts into Slack (or any other incident channel).

alertmanager.yml:

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llm-alerts'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Now, as soon as a user, team, or customer crosses a budget threshold, you get a ping.
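
For reference, the `text` template above just concatenates each alert's summary. The same payload construction in Python (the alert structure is a simplified stand-in for Alertmanager's webhook format):

```python
import json

def slack_payload(alerts, channel="#llm-alerts"):
    """Build the JSON body a Slack incoming webhook expects, mirroring the
    Alertmanager template above (concatenated alert summaries)."""
    text = "".join(a["annotations"]["summary"] for a in alerts)
    return json.dumps({"channel": channel, "text": text})

body = slack_payload([{"annotations": {"summary": "User vk-dev-alice at 80% budget"}}])
print(body)
```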


Monitoring Dashboards

Built-in Bifrost Dashboard

Access: http://localhost:8080

Bifrost’s gateway UI (see Bifrost AI Gateway) gives you real-time visibility:

Real-Time Visibility:

  • Request Logs: Every LLM request with:
    • Timestamp
    • Virtual key (user / team)
    • Model used
    • Tokens (input + output)
    • Cost calculated
    • Latency
  • Cost Tracking: Real-time aggregation by:
    • Virtual key (user)
    • Team
    • Customer
    • Model
    • Provider
  • Budget Utilization: Visual progress bars showing:
    • Current usage vs limit
    • Remaining budget
    • Reset date
  • Token Usage: Graphs showing:
    • Input vs output tokens
    • Token trends over time
    • Per-model token distribution

This all rides on Bifrost’s built-in Telemetry and Semantic Caching support so you can keep both performance and cost under control.

Grafana Dashboard

Layer Grafana on top of Prometheus to build richer cost dashboards.

PromQL Queries:

Total Cost by Team:

sum(bifrost_cost_total) by (team)

Budget Utilization by User:

(budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) * 100

Cost per Model:

sum(bifrost_cost_total) by (model)

Most Expensive Users:

topk(10, sum(bifrost_cost_total) by (vk))

Daily Cost Trend:

sum(increase(bifrost_cost_total[24h]))

Token Usage by Type:

sum(bifrost_tokens_total) by (token_type)
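
You can also run any of these queries programmatically against Prometheus's instant-query HTTP API (`GET /api/v1/query?query=...`). A small sketch that builds the request URL, assuming Prometheus on its default port:

```python
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # default Prometheus address

def instant_query_url(promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"

url = instant_query_url('sum(bifrost_cost_total) by (team)')
print(url)
```

From there, `requests.get(url).json()["data"]["result"]` gives you the samples to feed into your own reporting.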

Cost Attribution

Bifrost’s Virtual Keys and governance layer give you clean attribution across users, teams, and customers.

Per-User Attribution

Query:

sum(bifrost_cost_total{vk="vk-dev-alice"}) 

Result: Total spend for user Alice.

Per-Team Attribution

Query:

sum(bifrost_cost_total{team="engineering"})

Per-Model Attribution

Query:

sum(bifrost_cost_total) by (model)

Example Output:

  • gpt-4o-mini: $1,200
  • gpt-4o: $3,500
  • claude-3-5-haiku: $800

Per-Provider Attribution

Query:

sum(bifrost_cost_total) by (provider)

This works across all configured providers via Bifrost’s Supported Providers layer (OpenAI, Anthropic, Bedrock, Vertex, etc.) without changing your app code.


Real-Time Cost Analysis

Identifying Expensive Queries

Use the Bifrost dashboard to sort requests by cost, then zoom in with PromQL.

Most Expensive Requests:

topk(10, bifrost_request_cost_dollars)

Alert on Expensive Requests:

- alert: ExpensiveQuery
  expr: bifrost_request_cost_dollars > 10

Detecting Cost Anomalies

Sudden Cost Spike:

increase(bifrost_cost_total[5m]) > 100

Unusual Token Usage:

rate(bifrost_tokens_total[5m]) > 100000

These signals are especially useful when you’re running agents or tools through Bifrost’s MCP Gateway and want to catch runaway behavior quickly.
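
The spike rule above compares a short window against a fixed dollar amount; a slightly more adaptive variant compares each window against the recent average. A simplified sketch of that idea (not Bifrost's implementation):

```python
from collections import deque

class SpikeDetector:
    """Flag a cost window that exceeds a multiple of the recent average.
    Simplified sketch of the anomaly rules above, not Bifrost's implementation."""
    def __init__(self, window: int = 12, factor: float = 5.0):
        self.history = deque(maxlen=window)  # last N 5-minute cost readings
        self.factor = factor

    def observe(self, cost_5m: float) -> bool:
        """Record one 5-minute cost reading; return True if it is a spike."""
        spike = bool(self.history) and cost_5m > self.factor * (
            sum(self.history) / len(self.history)
        )
        self.history.append(cost_5m)
        return spike

d = SpikeDetector()
readings = [10, 11, 9, 10, 150]  # dollars spent per 5-minute window
flags = [d.observe(r) for r in readings]
print(flags)  # only the $150 window trips the detector
```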


Cost Optimization Insights

Model Efficiency Analysis

Cost per Request by Model:

avg(bifrost_request_cost_dollars) by (model)

Token Efficiency:

sum(bifrost_cost_total) / sum(bifrost_tokens_total)

This helps you decide when to move workloads from expensive models (e.g., GPT-4 class) to cheaper ones while maintaining quality.
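
To make the ratio concrete, here is the same cost-per-token arithmetic on hypothetical per-model totals (the numbers are made up for illustration):

```python
# Hypothetical per-model totals (cost in dollars, total tokens consumed),
# used only to illustrate the efficiency ratio above.
totals = {
    "gpt-4o":      {"cost": 3500.0, "tokens": 180_000_000},
    "gpt-4o-mini": {"cost": 1200.0, "tokens": 900_000_000},
}

def dollars_per_million_tokens(model: str) -> float:
    """Blended cost per million tokens for one model."""
    t = totals[model]
    return t["cost"] / t["tokens"] * 1_000_000

for model in totals:
    print(model, round(dollars_per_million_tokens(model), 2))
```

With these (made-up) numbers, gpt-4o works out to about $19.44 per million tokens versus $1.33 for gpt-4o-mini, which is the kind of gap that justifies routing simpler workloads to the cheaper model.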

Provider Cost Comparison

Compare providers on blended cost per request:

sum(bifrost_cost_total) by (provider) / sum(bifrost_requests_total) by (provider)

Because Bifrost handles Routing and Load Balancing across multiple Supported Providers, you can experiment with cheaper backends without rewriting your app.


Alert Examples

Budget Alerts

User at 80% Budget (Slack):

⚠️ Warning: Alice (vk-dev-alice) at 80% budget
Current: $400 / $500
Time remaining: 15 days

Team Budget Critical (PagerDuty):

🚨 Critical: Engineering team at 90% budget
Current: $4,500 / $5,000
Action required: Review usage or increase budget

Cost Spike Alerts

Unusual Spending (email):

📊 Cost spike detected
Last 5 min: $150 (avg: $10)
Team: Data Science
Top user: Bob (vk-dev-bob) - $120
Action: Investigate recent queries

Complete Monitoring Stack

Putting it all together:

# 1. Start Bifrost
npx -y @maximhq/bifrost

# 2. Start Prometheus
prometheus --config.file=prometheus.yml

# 3. Start Alertmanager
alertmanager --config.file=alertmanager.yml

# 4. Start Grafana
grafana-server

Access:

  • Bifrost Dashboard: http://localhost:8080
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • Alertmanager: http://localhost:9093
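
If you'd rather run everything as containers, one way to wire up the same stack is a Compose file like this (image names and mounted config paths are illustrative; adjust to your setup). Note that inside the Compose network, Prometheus should scrape `bifrost:8080` rather than `localhost:8080`:

```yaml
# docker-compose.yml -- one way to run the full monitoring stack locally.
services:
  bifrost:
    image: maximhq/bifrost
    ports: ["8080:8080"]
  prometheus:
    image: prom/prometheus
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
    ports: ["9090:9090"]
  alertmanager:
    image: prom/alertmanager
    volumes: ["./alertmanager.yml:/etc/alertmanager/alertmanager.yml"]
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
```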

Get Started

To try this locally:

npx -y @maximhq/bifrost

Then follow the governance and metrics guides in the Bifrost AI Gateway documentation and observability stack under Telemetry and Prometheus Metrics.

Key Takeaway: Real-time LLM cost monitoring requires built-in dashboards (request logs, cost tracking), Prometheus metrics (cost / token / budget aggregation), automated alerts (80% / 90% thresholds), and granular attribution (per-user / team / customer). Bifrost provides native cost calculation, hierarchical budget tracking, Prometheus integration, and real-time observability—enabling instant cost visibility and proactive budget management across all your LLM providers.
