Debby McKinney
How to Monitor LLM Costs in Real-Time with an AI Gateway

LLM costs scale unpredictably; a misconfigured agent can burn $10K in hours. Without real-time monitoring, teams discover budget overruns days or weeks later through provider bills.

This guide shows how to implement real-time LLM cost monitoring with instant alerts using the Bifrost AI Gateway.

GitHub: maximhq / bifrost

Fastest enterprise AI gateway (50x faster than LiteLLM) with adaptive load balancer, cluster mode, guardrails, 1000+ models support & <100 µs overhead at 5k RPS.

Bifrost AI Gateway


The fastest way to build AI applications that never go down

Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.

Quick Start

Go from zero to production-ready AI gateway in under a minute.

Step 1: Start Bifrost Gateway

# Install and run locally
npx -y @maximhq/bifrost

# Or use Docker
docker run -p 8080:8080 maximhq/bifrost

Step 2: Configure via Web UI

# Open the built-in web interface
open http://localhost:8080

Step 3: Make your first API call

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-4o-mini",
    "messages": [{"role": "user", "content": "Hello, Bifrost!"}]
  }'

That's it! Your AI gateway is running with a web interface for visual configuration…


The Cost Monitoring Problem

Without Real-Time Monitoring:

  • Discover cost overruns via monthly bill
  • No per-team or per-user attribution
  • Cannot identify expensive queries
  • No alerts before budget is exhausted

With Real-Time Monitoring:

  • Live cost tracking per request
  • Per-team / per-user / per-project attribution
  • Query-level cost analysis
  • Alerts at 80% / 90% budget thresholds
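
The 80% / 90% tiers above boil down to a simple utilization check. Here's a minimal sketch in Python (an illustrative helper, not part of Bifrost's API):

```python
from typing import Optional

def budget_alert_level(usage: float, limit: float) -> Optional[str]:
    """Map budget utilization to the 80% / 90% alert tiers described above.
    Illustrative helper; Bifrost evaluates these thresholds server-side."""
    if limit <= 0:
        raise ValueError("budget limit must be positive")
    utilization = usage / limit
    if utilization >= 0.9:
        return "critical"
    if utilization >= 0.8:
        return "warning"
    return None

print(budget_alert_level(400.0, 500.0))  # 400/500 = 80% -> "warning"
```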

Bifrost’s Cost Monitoring Architecture

Bifrost ships with built-in observability and metrics so you don’t have to build cost tracking from scratch.

Built-in Dashboard (Bifrost UI at http://localhost:8080):

  • Real-time request logs with costs
  • Cost tracking per virtual key / team / customer
  • Token usage visualization
  • Budget utilization graphs

You get this UI as part of the Bifrost AI Gateway once the gateway is running.

Prometheus Metrics (at http://localhost:8080/metrics):

  • Cost aggregation by model / provider / team
  • Budget utilization percentages
  • Token usage trends
  • Request-level cost distribution

Prometheus compatibility is a core part of Bifrost’s Telemetry and Prometheus Metrics features.
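
A scrape of `/metrics` returns Prometheus text-exposition format: one sample per line, with labels in braces. Here's a minimal parser sketch so you can see what downstream tooling works with; the metric and label names in the sample are assumptions for illustration, not Bifrost's exact schema:

```python
import re

# Sample text-exposition output. Metric/label names here are illustrative,
# not Bifrost's exact schema.
SCRAPE = """\
bifrost_cost_total{team="engineering",model="gpt-4o-mini"} 1200.0
bifrost_cost_total{team="data-science",model="gpt-4o"} 3500.0
"""

LINE = re.compile(r'^(\w+)\{(.*)\}\s+([0-9.eE+-]+)$')

def parse_metrics(text):
    """Yield (metric_name, labels_dict, value) for each sample line."""
    for raw in text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue  # skip comments, HELP/TYPE lines, and blanks
        name, label_blob, value = m.groups()
        labels = dict(kv.split("=", 1) for kv in label_blob.split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
        yield name, labels, float(value)

total = sum(v for _, _, v in parse_metrics(SCRAPE))
print(total)  # combined cost across both samples
```

(This sketch ignores edge cases like commas inside label values; real consumers should use a Prometheus client library.)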


Setup: Real-Time Cost Monitoring

Step 1: Install Bifrost

You can run Bifrost locally in seconds:

npx -y @maximhq/bifrost

This starts the Bifrost AI Gateway with the HTTP API and built-in UI.

Step 2: Configure Hierarchical Budgets

Bifrost’s governance model lets you define budgets at customer, team, and user (virtual key) levels using Virtual Keys and Budget & Rate Limits.

Customer Budget ($10K/month):

curl -X POST http://localhost:8080/api/governance/customers \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Acme Corp",
    "budget": {
      "max_limit": 10000.00,
      "reset_duration": "1M"
    }
  }'

Team Budgets:

# Engineering: $5K
curl -X POST http://localhost:8080/api/governance/teams \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Engineering",
    "customer_id": "customer-acme",
    "budget": {"max_limit": 5000.00, "reset_duration": "1M"}
  }'

# Data Science: $3K
curl -X POST http://localhost:8080/api/governance/teams \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Data Science",
    "customer_id": "customer-acme",
    "budget": {"max_limit": 3000.00, "reset_duration": "1M"}
  }'

User Budgets (per virtual key):

curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-dev-alice \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team-engineering",
    "budget": {"max_limit": 500.00, "reset_duration": "1M"}
  }'

This is how you get granular governance at every layer: customer → team → user.
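
Conceptually, enforcement walks the hierarchy on every request: a call is rejected if it would push the user, the team, *or* the customer over its limit. A toy sketch of that check (the data model here is illustrative; Bifrost tracks usage server-side):

```python
# Conceptual sketch of hierarchical budget enforcement (user -> team -> customer).
# Limits mirror the curl examples above; the in-memory model is illustrative.
LIMITS = {
    "customer-acme": 10_000.00,
    "team-engineering": 5_000.00,
    "vk-dev-alice": 500.00,
}
HIERARCHY = {"vk-dev-alice": ["vk-dev-alice", "team-engineering", "customer-acme"]}

def request_allowed(vk, usage, cost):
    """Reject a request if it would push any layer over its budget.
    Returns (allowed, offending_layer_or_None)."""
    for layer in HIERARCHY[vk]:
        if usage.get(layer, 0.0) + cost > LIMITS[layer]:
            return False, layer
    return True, None

usage = {"vk-dev-alice": 499.0, "team-engineering": 1200.0, "customer-acme": 4000.0}
print(request_allowed("vk-dev-alice", usage, 2.00))  # a $2 call would exceed Alice's $500 cap
```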

Step 3: Configure Prometheus Alerts

Point Prometheus at Bifrost’s metrics endpoint:

prometheus.yml:

scrape_configs:
  - job_name: 'bifrost'
    static_configs:
      - targets: ['localhost:8080']

Then define budget and cost alerts using Bifrost’s cost and budget metrics.

alerts.yml:

groups:
  - name: llm_costs
    rules:
      # User budget warning
      - alert: UserBudgetWarning
        expr: (budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) > 0.8
        labels:
          severity: warning
        annotations:
          summary: "User {{ $labels.vk }} at 80% budget"

      # User budget critical
      - alert: UserBudgetCritical
        expr: (budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "User {{ $labels.vk }} at 90% budget"

      # Team budget critical
      - alert: TeamBudgetCritical
        expr: (team_budget_usage / team_budget_limit) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "Team {{ $labels.team }} at 90% budget"

      # Customer budget critical
      - alert: CustomerBudgetCritical
        expr: (customer_budget_usage / customer_budget_limit) > 0.9
        labels:
          severity: critical
        annotations:
          summary: "Customer {{ $labels.customer }} at 90% budget"

      # Expensive query alert
      - alert: ExpensiveQuery
        expr: bifrost_request_cost_dollars > 10
        labels:
          severity: warning
        annotations:
          summary: "Request cost ${{ $value }} from {{ $labels.vk }}"

Step 4: Configure Alertmanager

Wire Prometheus alerts into Slack (or any other incident channel).

alertmanager.yml:

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack'

receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
        channel: '#llm-alerts'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

Now, as soon as a user, team, or customer crosses a budget threshold, you get a ping.
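
For reference, the `text` template above just concatenates each alert's summary. The same payload construction in Python (the alert structure is a simplified stand-in for Alertmanager's webhook format):

```python
import json

def slack_payload(alerts, channel="#llm-alerts"):
    """Build the JSON body a Slack incoming webhook expects, mirroring the
    Alertmanager template above (concatenated alert summaries)."""
    text = "".join(a["annotations"]["summary"] for a in alerts)
    return json.dumps({"channel": channel, "text": text})

body = slack_payload([{"annotations": {"summary": "User vk-dev-alice at 80% budget"}}])
print(body)
```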


Monitoring Dashboards

Built-in Bifrost Dashboard

Access: http://localhost:8080

Bifrost’s gateway UI (see Bifrost AI Gateway) gives you real-time visibility:

Real-Time Visibility:

  • Request Logs: Every LLM request with:
    • Timestamp
    • Virtual key (user / team)
    • Model used
    • Tokens (input + output)
    • Cost calculated
    • Latency
  • Cost Tracking: Real-time aggregation by:
    • Virtual key (user)
    • Team
    • Customer
    • Model
    • Provider
  • Budget Utilization: Visual progress bars showing:
    • Current usage vs limit
    • Remaining budget
    • Reset date
  • Token Usage: Graphs showing:
    • Input vs output tokens
    • Token trends over time
    • Per-model token distribution

This all rides on Bifrost’s built-in Telemetry and Semantic Caching support so you can keep both performance and cost under control.

Grafana Dashboard

Layer Grafana on top of Prometheus to build richer cost dashboards.

PromQL Queries:

Total Cost by Team:

sum(bifrost_cost_total) by (team)

Budget Utilization by User:

(budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) * 100

Cost per Model:

sum(bifrost_cost_total) by (model)

Most Expensive Users:

topk(10, sum(bifrost_cost_total) by (vk))

Daily Cost Trend:

sum(increase(bifrost_cost_total[24h]))

Token Usage by Type:

sum(bifrost_tokens_total) by (token_type)
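
You can also run any of these queries programmatically against Prometheus's instant-query HTTP API (`GET /api/v1/query?query=...`). A small sketch that builds the request URL, assuming Prometheus on its default port:

```python
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # default Prometheus address

def instant_query_url(promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"

url = instant_query_url('sum(bifrost_cost_total) by (team)')
print(url)
```

From there, `requests.get(url).json()["data"]["result"]` gives you the samples to feed into your own reporting.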

Cost Attribution

Bifrost’s Virtual Keys and governance layer give you clean attribution across users, teams, and customers.

Per-User Attribution

Query:

sum(bifrost_cost_total{vk="vk-dev-alice"}) 

Result: Total spend for user Alice.

Per-Team Attribution

Query:

sum(bifrost_cost_total{team="engineering"})

Per-Model Attribution

Query:

sum(bifrost_cost_total) by (model)

Example Output:

  • gpt-4o-mini: $1,200
  • gpt-4o: $3,500
  • claude-3-5-haiku: $800

Per-Provider Attribution

Query:

sum(bifrost_cost_total) by (provider)

This works across all configured providers via Bifrost’s Supported Providers layer (OpenAI, Anthropic, Bedrock, Vertex, etc.) without changing your app code.


Real-Time Cost Analysis

Identifying Expensive Queries

Use the Bifrost dashboard to sort requests by cost, then zoom in with PromQL.

Most Expensive Requests:

topk(10, bifrost_request_cost_dollars)

Alert on Expensive Requests:

- alert: ExpensiveQuery
  expr: bifrost_request_cost_dollars > 10

Detecting Cost Anomalies

Sudden Cost Spike:

increase(bifrost_cost_total[5m]) > 100

Unusual Token Usage:

rate(bifrost_tokens_total[5m]) > 100000

These signals are especially useful when you’re running agents or tools through Bifrost’s MCP Gateway and want to catch runaway behavior quickly.
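
The spike rule above compares a short window against a fixed dollar amount; a slightly more adaptive variant compares each window against the recent average. A simplified sketch of that idea (not Bifrost's implementation):

```python
from collections import deque

class SpikeDetector:
    """Flag a cost window that exceeds a multiple of the recent average.
    Simplified sketch of the anomaly rules above, not Bifrost's implementation."""
    def __init__(self, window: int = 12, factor: float = 5.0):
        self.history = deque(maxlen=window)  # last N 5-minute cost readings
        self.factor = factor

    def observe(self, cost_5m: float) -> bool:
        """Record one 5-minute cost reading; return True if it is a spike."""
        spike = bool(self.history) and cost_5m > self.factor * (
            sum(self.history) / len(self.history)
        )
        self.history.append(cost_5m)
        return spike

d = SpikeDetector()
readings = [10, 11, 9, 10, 150]  # dollars spent per 5-minute window
flags = [d.observe(r) for r in readings]
print(flags)  # only the $150 window trips the detector
```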


Cost Optimization Insights

Model Efficiency Analysis

Cost per Request by Model:

avg(bifrost_request_cost_dollars) by (model)

Token Efficiency:

sum(bifrost_cost_total) / sum(bifrost_tokens_total)

This helps you decide when to move workloads from expensive models (e.g., GPT-4 class) to cheaper ones while maintaining quality.
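
To make the ratio concrete, here is the same cost-per-token arithmetic on hypothetical per-model totals (the numbers are made up for illustration):

```python
# Hypothetical per-model totals (cost in dollars, total tokens consumed),
# used only to illustrate the efficiency ratio above.
totals = {
    "gpt-4o":      {"cost": 3500.0, "tokens": 180_000_000},
    "gpt-4o-mini": {"cost": 1200.0, "tokens": 900_000_000},
}

def dollars_per_million_tokens(model: str) -> float:
    """Blended cost per million tokens for one model."""
    t = totals[model]
    return t["cost"] / t["tokens"] * 1_000_000

for model in totals:
    print(model, round(dollars_per_million_tokens(model), 2))
```

With these (made-up) numbers, gpt-4o works out to about $19.44 per million tokens versus $1.33 for gpt-4o-mini, which is the kind of gap that justifies routing simpler workloads to the cheaper model.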

Provider Cost Comparison

Compare providers on blended cost per request:

sum(bifrost_cost_total) by (provider) / sum(bifrost_requests_total) by (provider)

Because Bifrost handles Routing and Load Balancing across multiple Supported Providers, you can experiment with cheaper backends without rewriting your app.


Alert Examples

Budget Alerts

User at 80% Budget (Slack):

⚠️ Warning: Alice (vk-dev-alice) at 80% budget
Current: $400 / $500
Time remaining: 15 days

Team Budget Critical (PagerDuty):

🚨 Critical: Engineering team at 90% budget
Current: $4,500 / $5,000
Action required: Review usage or increase budget

Cost Spike Alerts

Unusual Spending (email):

📊 Cost spike detected
Last 5 min: $150 (avg: $10)
Team: Data Science
Top user: Bob (vk-dev-bob) - $120
Action: Investigate recent queries

Complete Monitoring Stack

Putting it all together:

# 1. Start Bifrost
npx -y @maximhq/bifrost

# 2. Start Prometheus
prometheus --config.file=prometheus.yml

# 3. Start Alertmanager
alertmanager --config.file=alertmanager.yml

# 4. Start Grafana
grafana-server

Access:

  • Bifrost Dashboard: http://localhost:8080
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • Alertmanager: http://localhost:9093
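
If you'd rather run everything as containers, one way to wire up the same stack is a Compose file like this (image names and mounted config paths are illustrative; adjust to your setup). Note that inside the Compose network, Prometheus should scrape `bifrost:8080` rather than `localhost:8080`:

```yaml
# docker-compose.yml -- one way to run the full monitoring stack locally.
services:
  bifrost:
    image: maximhq/bifrost
    ports: ["8080:8080"]
  prometheus:
    image: prom/prometheus
    volumes: ["./prometheus.yml:/etc/prometheus/prometheus.yml"]
    ports: ["9090:9090"]
  alertmanager:
    image: prom/alertmanager
    volumes: ["./alertmanager.yml:/etc/alertmanager/alertmanager.yml"]
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
```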

Get Started

To try this locally:

npx -y @maximhq/bifrost

Then follow the governance and metrics guides in the Bifrost AI Gateway documentation and observability stack under Telemetry and Prometheus Metrics.

Key Takeaway: Real-time LLM cost monitoring requires built-in dashboards (request logs, cost tracking), Prometheus metrics (cost / token / budget aggregation), automated alerts (80% / 90% thresholds), and granular attribution (per-user / team / customer). Bifrost provides native cost calculation, hierarchical budget tracking, Prometheus integration, and real-time observability—enabling instant cost visibility and proactive budget management across all your LLM providers.
