LLM costs scale unpredictably; a misconfigured agent can burn $10K in hours. Without real-time monitoring, teams discover budget overruns days or weeks later through provider bills.
This guide shows how to implement real-time LLM cost monitoring with instant alerts using the Bifrost AI Gateway.
Bifrost is a high-performance AI gateway that unifies access to 15+ providers (OpenAI, Anthropic, AWS Bedrock, Google Vertex, and more) through a single OpenAI-compatible API. Deploy in seconds with zero configuration and get automatic failover, load balancing, semantic caching, and enterprise-grade features.
Quick Start
Go from zero to production-ready AI gateway in under a minute.
Step 1: Start Bifrost Gateway
# Install and run locally
npx -y @maximhq/bifrost
# Or use Docker
docker run -p 8080:8080 maximhq/bifrost
Step 2: Configure via Web UI
# Open the built-in web interface
open http://localhost:8080
Step 3: Make your first API call
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-4o-mini",
"messages": [{"role": "user", "content": "Hello, Bifrost!"}]
}'
That's it! Your AI gateway is running with a web interface for visual configuration…
The Cost Monitoring Problem
Without Real-Time Monitoring:
- Discover cost overruns via monthly bill
- No per-team or per-user attribution
- Cannot identify expensive queries
- No alerts before budget is exhausted
With Real-Time Monitoring:
- Live cost tracking per request
- Per-team / per-user / per-project attribution
- Query-level cost analysis
- Alerts at 80% / 90% budget thresholds
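The 80% / 90% threshold logic above can be sketched in a few lines. This is an illustrative classifier, not Bifrost's internal code; the function name and return values are made up for the example:

```python
def budget_severity(spent, limit):
    """Classify budget utilization the way the 80%/90% alert
    thresholds above do. Returns None below the warning line."""
    if limit <= 0:
        raise ValueError("budget limit must be positive")
    utilization = spent / limit
    if utilization > 0.9:
        return "critical"
    if utilization > 0.8:
        return "warning"
    return None

# A $500/month user budget at $460 spent is past the 90% line:
print(budget_severity(460.00, 500.00))  # -> critical
print(budget_severity(410.00, 500.00))  # -> warning (82%)
```

The same two thresholds appear later in the Prometheus alert rules, so the dashboard, the alerts, and any custom tooling stay in agreement.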
Bifrost’s Cost Monitoring Architecture
Bifrost ships with built-in observability and metrics so you don’t have to build cost tracking from scratch.
Built-in Dashboard (Bifrost UI at http://localhost:8080):
- Real-time request logs with costs
- Cost tracking per virtual key / team / customer
- Token usage visualization
- Budget utilization graphs
The UI ships with the Bifrost AI Gateway and is available as soon as the gateway is running.
Prometheus Metrics (at http://localhost:8080/metrics):
- Cost aggregation by model / provider / team
- Budget utilization percentages
- Token usage trends
- Request-level cost distribution
Prometheus compatibility is a core part of Bifrost’s Telemetry and Prometheus Metrics features.
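To make the metrics concrete, here is a minimal sketch of computing budget utilization from a Prometheus scrape. The metric names and labels (budget_usage, budget_limit, vk=...) are assumptions mirroring the alert rules later in this guide, not a verbatim dump of Bifrost's /metrics output:

```python
# Sample exposition-format text as Prometheus would scrape it.
SCRAPE = """\
budget_usage{type="virtual_key",vk="vk-dev-alice"} 400.0
budget_limit{type="virtual_key",vk="vk-dev-alice"} 500.0
"""

def parse_samples(text):
    """Parse exposition-format lines into {metric+labels: value}."""
    samples = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name_and_labels, value = line.rsplit(" ", 1)
        samples[name_and_labels] = float(value)
    return samples

samples = parse_samples(SCRAPE)
usage = samples['budget_usage{type="virtual_key",vk="vk-dev-alice"}']
limit = samples['budget_limit{type="virtual_key",vk="vk-dev-alice"}']
utilization = usage / limit  # 0.8 -> right at the warning threshold
```

In practice Prometheus does this parsing and aggregation for you; the sketch only shows what the alert expressions operate on.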
Setup: Real-Time Cost Monitoring
Step 1: Install Bifrost
You can run Bifrost locally in seconds:
npx -y @maximhq/bifrost
This starts the Bifrost AI Gateway with the HTTP API and built-in UI.
Step 2: Configure Hierarchical Budgets
Bifrost’s governance model lets you define budgets at customer, team, and user (virtual key) levels using Virtual Keys and Budget & Rate Limits.
Customer Budget ($10K/month):
curl -X POST http://localhost:8080/api/governance/customers \
-H "Content-Type: application/json" \
-d '{
"name": "Acme Corp",
"budget": {
"max_limit": 10000.00,
"reset_duration": "1M"
}
}'
Team Budgets:
# Engineering: $5K
curl -X POST http://localhost:8080/api/governance/teams \
-H "Content-Type: application/json" \
-d '{
"name": "Engineering",
"customer_id": "customer-acme",
"budget": {"max_limit": 5000.00, "reset_duration": "1M"}
}'
# Data Science: $3K
curl -X POST http://localhost:8080/api/governance/teams \
-H "Content-Type: application/json" \
-d '{
"name": "Data Science",
"customer_id": "customer-acme",
"budget": {"max_limit": 3000.00, "reset_duration": "1M"}
}'
User Budgets (per virtual key):
curl -X PUT http://localhost:8080/api/governance/virtual-keys/vk-dev-alice \
-H "Content-Type: application/json" \
-d '{
"team_id": "team-engineering",
"budget": {"max_limit": 500.00, "reset_duration": "1M"}
}'
This is how you get granular governance at every layer: customer → team → user.
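The hierarchy configured above can be sketched as a single admission check: a request is allowed only if every level from customer down to user still has headroom. The dict layout below is illustrative, not Bifrost's internal schema:

```python
# Illustrative spend state for the budgets configured above.
budgets = {
    "customer-acme":    {"limit": 10000.00, "spent": 6200.00},
    "team-engineering": {"limit": 5000.00,  "spent": 4100.00},
    "vk-dev-alice":     {"limit": 500.00,   "spent": 480.00},
}

def admit(request_cost,
          chain=("customer-acme", "team-engineering", "vk-dev-alice")):
    """Allow a request only if every level in the chain has headroom."""
    return all(
        budgets[level]["spent"] + request_cost <= budgets[level]["limit"]
        for level in chain
    )

print(admit(10.00))   # Alice has $20 left -> True
print(admit(25.00))   # would push Alice past $500 -> False
```

The point of the hierarchy is that the tightest level wins: here the user budget blocks the request long before the team or customer limits are threatened.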
Step 3: Configure Prometheus Alerts
Point Prometheus at Bifrost’s metrics endpoint:
prometheus.yml:
scrape_configs:
- job_name: 'bifrost'
static_configs:
- targets: ['localhost:8080']
Then define budget and cost alerts using Bifrost’s cost and budget metrics.
alerts.yml:
groups:
- name: llm_costs
rules:
# User budget warning
- alert: UserBudgetWarning
expr: (budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) > 0.8
labels:
severity: warning
annotations:
summary: "User {{ $labels.vk }} at 80% budget"
# User budget critical
- alert: UserBudgetCritical
expr: (budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) > 0.9
labels:
severity: critical
annotations:
summary: "User {{ $labels.vk }} at 90% budget"
# Team budget critical
- alert: TeamBudgetCritical
expr: (team_budget_usage / team_budget_limit) > 0.9
labels:
severity: critical
annotations:
summary: "Team {{ $labels.team }} at 90% budget"
# Customer budget critical
- alert: CustomerBudgetCritical
expr: (customer_budget_usage / customer_budget_limit) > 0.9
labels:
severity: critical
annotations:
summary: "Customer {{ $labels.customer }} at 90% budget"
# Expensive query alert
- alert: ExpensiveQuery
expr: bifrost_request_cost_dollars > 10
labels:
severity: warning
annotations:
summary: "Request cost ${{ $value }} from {{ $labels.vk }}"
Step 4: Configure Alertmanager
Wire Prometheus alerts into Slack (or any other incident channel).
alertmanager.yml:
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'slack'
receivers:
- name: 'slack'
slack_configs:
- api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
channel: '#llm-alerts'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
Now, as soon as a user, team, or customer crosses the threshold, you get a ping.
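What the Slack template above actually produces can be sketched as follows. The alert dicts mimic Prometheus alert annotations, and the payload shape follows Slack's standard incoming-webhook format; the summaries are made up:

```python
# Hypothetical firing alerts, shaped like Prometheus alert annotations.
alerts = [
    {"annotations": {"summary": "User vk-dev-alice at 80% budget"}},
    {"annotations": {"summary": "Team engineering at 90% budget"}},
]

def render_slack_text(alerts):
    # Equivalent of '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
    return "".join(a["annotations"]["summary"] for a in alerts)

payload = {"channel": "#llm-alerts", "text": render_slack_text(alerts)}
print(payload["text"])
```

Grouped alerts are concatenated into one message, which is why the Alertmanager config sets group_by and group_wait above.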
Monitoring Dashboards
Built-in Bifrost Dashboard
Access: http://localhost:8080
Bifrost’s gateway UI (see Bifrost AI Gateway) gives you real-time visibility:
Real-Time Visibility:
- Request Logs: every LLM request with:
  - Timestamp
  - Virtual key (user / team)
  - Model used
  - Tokens (input + output)
  - Calculated cost
  - Latency
- Cost Tracking: real-time aggregation by:
  - Virtual key (user)
  - Team
  - Customer
  - Model
  - Provider
- Budget Utilization: visual progress bars showing:
  - Current usage vs. limit
  - Remaining budget
  - Reset date
- Token Usage: graphs showing:
  - Input vs. output tokens
  - Token trends over time
  - Per-model token distribution
This all rides on Bifrost’s built-in Telemetry and Semantic Caching support so you can keep both performance and cost under control.
Grafana Dashboard
Layer Grafana on top of Prometheus to build richer cost dashboards.
PromQL Queries:
Total Cost by Team:
sum(bifrost_cost_total) by (team)
Budget Utilization by User:
(budget_usage{type="virtual_key"} / budget_limit{type="virtual_key"}) * 100
Cost per Model:
sum(bifrost_cost_total) by (model)
Most Expensive Users:
topk(10, sum(bifrost_cost_total) by (vk))
Daily Cost Trend:
sum(increase(bifrost_cost_total[24h]))
Token Usage by Type:
sum(bifrost_tokens_total) by (token_type)
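The `sum(...) by (label)` aggregations above are easy to reason about if you sketch them over an in-memory request log. The record fields mirror the labels used in this guide (team, model, vk); the numbers are made up:

```python
from collections import defaultdict

# Illustrative per-request cost records.
requests = [
    {"team": "engineering",  "model": "gpt-4o-mini", "cost": 0.02},
    {"team": "engineering",  "model": "gpt-4o",      "cost": 0.45},
    {"team": "data-science", "model": "gpt-4o",      "cost": 0.60},
]

def sum_by(records, label):
    """Equivalent of sum(bifrost_cost_total) by (<label>)."""
    totals = defaultdict(float)
    for r in records:
        totals[r[label]] += r["cost"]
    return dict(totals)

print(sum_by(requests, "team"))   # totals per team
print(sum_by(requests, "model"))  # totals per model
```

Prometheus evaluates the same grouping continuously over the live counter, so the Grafana panels stay current without any batch jobs.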
Cost Attribution
Bifrost’s Virtual Keys and governance layer give you clean attribution across users, teams, and customers.
Per-User Attribution
Query:
sum(bifrost_cost_total{vk="vk-dev-alice"})
Result: Total spend for user Alice.
Per-Team Attribution
Query:
sum(bifrost_cost_total{team="engineering"})
Per-Model Attribution
Query:
sum(bifrost_cost_total) by (model)
Example Output:
- gpt-4o-mini: $1,200
- gpt-4o: $3,500
- claude-3-5-haiku: $800
Per-Provider Attribution
Query:
sum(bifrost_cost_total) by (provider)
This works across all configured providers via Bifrost’s Supported Providers layer (OpenAI, Anthropic, Bedrock, Vertex, etc.) without changing your app code.
Real-Time Cost Analysis
Identifying Expensive Queries
Use the Bifrost dashboard to sort requests by cost, then zoom in with PromQL.
Most Expensive Requests:
topk(10, bifrost_request_cost_dollars)
Alert on Expensive Requests:
- alert: ExpensiveQuery
expr: bifrost_request_cost_dollars > 10
Detecting Cost Anomalies
Sudden Cost Spike:
increase(bifrost_cost_total[5m]) > 100
Unusual Token Usage:
rate(bifrost_tokens_total[5m]) > 100000
These signals are especially useful when you’re running agents or tools through Bifrost’s MCP Gateway and want to catch runaway behavior quickly.
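The spike check `increase(bifrost_cost_total[5m]) > 100` can be sketched over per-minute cost buckets instead of a live Prometheus counter. The numbers below are made up to show the shape of the signal:

```python
def cost_spike(per_minute_costs, window=5, threshold=100.0):
    """True if the last `window` minutes of spend exceed `threshold`,
    mirroring increase(bifrost_cost_total[5m]) > 100."""
    return sum(per_minute_costs[-window:]) > threshold

quiet = [2.0] * 60                 # steady ~$2/min baseline
spiky = [2.0] * 55 + [30.0] * 5    # runaway agent: $30/min for 5 min

print(cost_spike(quiet))  # False ($10 in the last 5 minutes)
print(cost_spike(spiky))  # True ($150 in the last 5 minutes)
```

A fixed threshold is the simplest version; in production you would typically compare the recent window against a rolling baseline so the alert adapts to normal traffic growth.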
Cost Optimization Insights
Model Efficiency Analysis
Cost per Request by Model:
avg(bifrost_request_cost_dollars) by (model)
Token Efficiency:
sum(bifrost_cost_total) / sum(bifrost_tokens_total)
This helps you decide when to move workloads from expensive models (e.g., GPT-4 class) to cheaper ones while maintaining quality.
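The efficiency comparison can be sketched with two derived numbers per model: average cost per request and cost per 1K tokens. The totals below are illustrative, not real Bifrost metrics:

```python
# Illustrative aggregates per model.
models = {
    "gpt-4o":      {"cost": 3500.00, "requests": 50_000,  "tokens": 120_000_000},
    "gpt-4o-mini": {"cost": 1200.00, "requests": 400_000, "tokens": 900_000_000},
}

def cost_per_request(m):
    """Equivalent of avg(bifrost_request_cost_dollars) by (model)."""
    return m["cost"] / m["requests"]

def cost_per_1k_tokens(m):
    """Blended $ per 1K tokens for a model."""
    return m["cost"] / (m["tokens"] / 1000)

for name, m in models.items():
    print(f"{name}: ${cost_per_request(m):.4f}/req, "
          f"${cost_per_1k_tokens(m):.5f}/1K tokens")
```

If the cheaper model's quality holds up on your evals, a large per-request gap like this is a direct signal to shift traffic.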
Provider Cost Comparison
Compare providers on blended cost per request:
sum(bifrost_cost_total) by (provider) / sum(bifrost_requests_total) by (provider)
Because Bifrost handles Routing and Load Balancing across multiple Supported Providers, you can experiment with cheaper backends without rewriting your app.
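The blended-cost query above reduces to one division per provider. Provider totals here are illustrative:

```python
# Illustrative provider-level aggregates.
cost_total = {"openai": 4700.00, "anthropic": 800.00}
requests_total = {"openai": 450_000, "anthropic": 40_000}

def blended_cost(provider):
    """sum(bifrost_cost_total) / sum(bifrost_requests_total), per provider."""
    return cost_total[provider] / requests_total[provider]

print(f"openai:    ${blended_cost('openai'):.5f}/request")
print(f"anthropic: ${blended_cost('anthropic'):.5f}/request")
```

Note that blended cost per request is only comparable when the workload mix is similar; a provider handling mostly short requests will look artificially cheap.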
Alert Examples
Budget Alerts
User at 80% Budget (Slack):
⚠️ Warning: Alice (vk-dev-alice) at 80% budget
Current: $400 / $500
Time remaining: 15 days
Team Budget Critical (PagerDuty):
🚨 Critical: Engineering team at 90% budget
Current: $4,500 / $5,000
Action required: Review usage or increase budget
Cost Spike Alerts
Unusual Spending (email):
📊 Cost spike detected
Last 5 min: $150 (avg: $10)
Team: Data Science
Top user: Bob (vk-dev-bob) - $120
Action: Investigate recent queries
Complete Monitoring Stack
Putting it all together:
# 1. Start Bifrost
npx -y @maximhq/bifrost
# 2. Start Prometheus
prometheus --config.file=prometheus.yml
# 3. Start Alertmanager
alertmanager --config.file=alertmanager.yml
# 4. Start Grafana
grafana-server
Access:
- Bifrost Dashboard: http://localhost:8080
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Alertmanager: http://localhost:9093
Get Started
To try this locally:
npx -y @maximhq/bifrost
Then follow the governance and metrics guides in the Bifrost AI Gateway documentation and observability stack under Telemetry and Prometheus Metrics.
Key Takeaway: Real-time LLM cost monitoring requires built-in dashboards (request logs, cost tracking), Prometheus metrics (cost / token / budget aggregation), automated alerts (80% / 90% thresholds), and granular attribution (per-user / team / customer). Bifrost provides native cost calculation, hierarchical budget tracking, Prometheus integration, and real-time observability—enabling instant cost visibility and proactive budget management across all your LLM providers.
