Building Enterprise-Level Monitoring: From Prometheus to Grafana Dashboards

Cover Image: Main Grafana dashboard with HTTP and performance metrics

Introduction

Once your web application hits production, the most critical question becomes: how is it performing right now? Logs tell you what happened, but you want to spot problems before users start complaining.

In this article, I'll share how I built a complete monitoring system for Peakline — a FastAPI application for Strava data analysis that processes thousands of requests daily from athletes worldwide.

What's Inside:

  • Metrics architecture (HTTP, API, business metrics)
  • Prometheus + Grafana setup from scratch
  • 50+ production-ready metrics
  • Advanced PromQL queries
  • Interactive dashboards and alerting
  • Best practices and pitfalls

Architecture: Three Monitoring Levels

Modern monitoring isn't just "set up Grafana and look at graphs." It's a well-thought-out architecture with several layers:

┌─────────────────────────────────────────────────┐
│   FastAPI Application                           │
│   ├── HTTP Middleware (auto-collect metrics)    │
│   ├── Business Logic (business metrics)         │
│   └── /metrics endpoint (Prometheus format)     │
└──────────────────┬──────────────────────────────┘
                   │ scrape every 5s
┌──────────────────▼──────────────────────────────┐
│   Prometheus                                    │
│   ├── Time Series Database (TSDB)               │
│   ├── Storage retention: 200h                   │
│   └── PromQL Engine                             │
└──────────────────┬──────────────────────────────┘
                   │ query data
┌──────────────────▼──────────────────────────────┐
│   Grafana                                       │
│   ├── Dashboards                                │
│   ├── Alerting                                  │
│   └── Visualization                             │
└─────────────────────────────────────────────────┘

Why This Stack?

Prometheus — the de-facto standard for metrics. Pull model, powerful PromQL query language, excellent Kubernetes integration.

Grafana — the best visualization tool. Beautiful dashboards, alerting, templating, rich UI.

FastAPI — async Python framework with native metrics support via prometheus_client.

Basic Infrastructure Setup

Docker Compose: 5-Minute Quick Start

First, let's spin up Prometheus and Grafana in Docker:

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=200h'  # 8+ days of history
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    extra_hosts:
      - "host.docker.internal:host-gateway"  # Access host machine

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}  # Use .env!
      - GF_SERVER_ROOT_URL=/grafana  # For nginx reverse proxy
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Key Points:

  • storage.tsdb.retention.time=200h — keep metrics for 8+ days (for weekly analysis)
  • extra_hosts: host.docker.internal — allows Prometheus to reach the app on the host
  • Volumes for data persistence

Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s      # How often to collect metrics
  evaluation_interval: 15s  # How often to check alerts

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webapp'
    static_configs:
      - targets: ['host.docker.internal:8000']  # Your app port
    scrape_interval: 5s      # More frequent for web apps
    metrics_path: /metrics

Important: scrape_interval: 5s for web apps is a balance between data freshness and system load. In production, typically 15-30s.

Grafana Datasource Provisioning

To avoid configuring the Prometheus datasource in Grafana by hand, use provisioning:

# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Now Grafana automatically connects to Prometheus on startup.
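
Dashboards can be provisioned the same way. Here's a minimal sketch of a dashboard provider (the file layout is an assumption — mount your exported dashboard JSON wherever path points):

# monitoring/grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards

With the datasource (and optionally dashboards) provisioned, bring the whole stack up: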

docker-compose up -d

Level 1: HTTP Metrics

The most basic but critically important layer is monitoring HTTP requests. A middleware automatically collects metrics for every request.

Metrics Initialization

# webapp/main.py
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse
import time

app = FastAPI(title="Peakline", version="2.0.0")

# Create separate registry for metrics isolation
registry = CollectorRegistry()

# Counter: monotonically increasing value (request count)
http_requests_total = Counter(
    'http_requests_total',
    'Total number of HTTP requests',
    ['method', 'endpoint', 'status_code'],  # Labels for grouping
    registry=registry
)

# Histogram: distribution of values (execution time)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    registry=registry
)

# API call counters
api_calls_total = Counter(
    'api_calls_total',
    'Total number of API calls by type',
    ['api_type'],
    registry=registry
)

# Separate error counters
http_errors_4xx_total = Counter(
    'http_errors_4xx_total',
    'Total number of 4xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

http_errors_5xx_total = Counter(
    'http_errors_5xx_total',
    'Total number of 5xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

Middleware for Automatic Collection

The magic happens in middleware — it wraps every request:

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    # Execute request
    response = await call_next(request)

    duration = time.time() - start_time

    # Path normalization: /api/activities/12345 → /api/activities/{id}
    path = request.url.path
    if path.startswith('/api/'):
        parts = path.split('/')
        if len(parts) > 3 and parts[3].isdigit():
            parts[3] = '{id}'
            path = '/'.join(parts)

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=path,
        status_code=str(response.status_code)
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=path
    ).observe(duration)

    # Track API calls
    if path.startswith('/api/'):
        api_type = path.split('/')[2] if len(path.split('/')) > 2 else 'unknown'
        api_calls_total.labels(api_type=api_type).inc()

    # Track errors separately
    status_code = response.status_code
    if 400 <= status_code < 500:
        http_errors_4xx_total.labels(endpoint=path, status_code=str(status_code)).inc()
    elif status_code >= 500:
        http_errors_5xx_total.labels(endpoint=path, status_code=str(status_code)).inc()

    return response

Key Techniques:

  1. Path normalization — critically important! Without this, you'll get thousands of unique metrics for /api/activities/1, /api/activities/2, etc. (a reusable helper sketch follows this list)

  2. Labels — allow filtering and grouping metrics in PromQL

  3. Separate error counters — simplifies alert writing
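
As mentioned in point 1, here's a sketch of a reusable normalization helper (the exact rules are assumptions tuned to this app's URL scheme; the best-practices section below refers to it as normalize_path):

import re

_ID_SEGMENT = re.compile(r"^\d+$")

def normalize_path(path: str) -> str:
    """Collapse numeric path segments: /api/activities/12345 -> /api/activities/{id}."""
    if not path.startswith('/api/'):
        return path
    return '/'.join('{id}' if _ID_SEGMENT.match(part) else part for part in path.split('/'))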

Metrics Endpoint

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint"""
    return PlainTextResponse(
        generate_latest(registry),
        media_type=CONTENT_TYPE_LATEST
    )

Now Prometheus can collect metrics from http://localhost:8000/metrics.

What We Get in Prometheus

# Metrics format in /metrics endpoint:
http_requests_total{method="GET",endpoint="/api/activities",status_code="200"} 1543
http_requests_total{method="POST",endpoint="/api/activities",status_code="201"} 89
http_request_duration_seconds_bucket{method="GET",endpoint="/api/activities",le="0.1"} 1234

Level 2: External API Metrics

Web applications often integrate with external APIs (Stripe, AWS, etc.). It's important to track not only your own requests but also dependencies.

External API Metrics

# External API metrics
external_api_calls_total = Counter(
    'external_api_calls_total',
    'Total number of external API calls by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_errors_total = Counter(
    'external_api_errors_total',
    'Total number of external API errors by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_latency_seconds = Histogram(
    'external_api_latency_seconds',
    'External API call latency in seconds',
    ['endpoint_type'],
    registry=registry
)

API Call Tracking Helper

Instead of duplicating code everywhere you call the API, create a universal wrapper:

async def track_external_api_call(endpoint_type: str, api_call_func, *args, **kwargs):
    """
    Universal wrapper for tracking API calls

    Usage:
        result = await track_external_api_call(
            'athlete_activities',
            client.get_athlete_activities,
            athlete_id=123
        )
    """
    start_time = time.time()

    try:
        # Increment call counter
        external_api_calls_total.labels(endpoint_type=endpoint_type).inc()

        # Execute API call
        result = await api_call_func(*args, **kwargs)

        # Record latency
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)

        # Check for API errors (status >= 400)
        if isinstance(result, Exception) or (hasattr(result, 'status') and result.status >= 400):
            external_api_errors_total.labels(endpoint_type=endpoint_type).inc()

        return result

    except Exception as e:
        # Record latency and error
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)
        external_api_errors_total.labels(endpoint_type=endpoint_type).inc()
        raise e

Usage in Code

@app.get("/api/activities")
async def get_activities(athlete_id: int):
    # Instead of direct API call:
    # activities = await external_client.get_athlete_activities(athlete_id)

    # Use wrapper with tracking:
    activities = await track_external_api_call(
        'athlete_activities',
        external_client.get_athlete_activities,
        athlete_id=athlete_id
    )

    return activities
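
Optionally, the same tracking can be packaged as a decorator so call sites stay clean. This is just a sketch on top of the wrapper above, not part of the original code:

import functools

def track_external_api(endpoint_type: str):
    """Decorator form of track_external_api_call."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            return await track_external_api_call(endpoint_type, func, *args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical client method):
# @track_external_api('athlete_activities')
# async def get_athlete_activities(athlete_id: int): ...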

Now we can see:

  • How many calls to each external API endpoint
  • How many returned errors
  • Latency for each call type

Level 3: Business Metrics

This is the most valuable part of monitoring — metrics that reflect actual application usage.

Business Metrics Types

# === Authentication ===
user_logins_total = Counter(
    'user_logins_total',
    'Total number of user logins',
    registry=registry
)

user_registrations_total = Counter(
    'user_registrations_total',
    'Total number of new user registrations',
    registry=registry
)

user_deletions_total = Counter(
    'user_deletions_total',
    'Total number of user deletions',
    registry=registry
)

# === File Operations ===
fit_downloads_total = Counter(
    'fit_downloads_total',
    'Total number of FIT file downloads',
    registry=registry
)

gpx_downloads_total = Counter(
    'gpx_downloads_total',
    'Total number of GPX file downloads',
    registry=registry
)

gpx_uploads_total = Counter(
    'gpx_uploads_total',
    'Total number of GPX file uploads',
    registry=registry
)

# === User Actions ===
settings_updates_total = Counter(
    'settings_updates_total',
    'Total number of user settings updates',
    registry=registry
)

feature_requests_total = Counter(
    'feature_requests_total',
    'Total number of feature requests',
    registry=registry
)

feature_votes_total = Counter(
    'feature_votes_total',
    'Total number of votes for features',
    registry=registry
)

# === Reports ===
manual_reports_total = Counter(
    'manual_reports_total',
    'Total number of manually created reports',
    registry=registry
)

auto_reports_total = Counter(
    'auto_reports_total',
    'Total number of automatically created reports',
    registry=registry
)

failed_reports_total = Counter(
    'failed_reports_total',
    'Total number of failed report creation attempts',
    registry=registry
)

Incrementing in Code

@app.post("/api/auth/login")
async def login(credentials: LoginCredentials):
    user = await authenticate_user(credentials)

    if user:
        # Increment successful login counter
        user_logins_total.inc()
        return {"token": generate_token(user)}

    return {"error": "Invalid credentials"}

@app.post("/api/activities/report")
async def create_report(activity_id: int, is_auto: bool = False):
    try:
        report = await generate_activity_report(activity_id)

        # Different counters for manual and automatic reports
        if is_auto:
            auto_reports_total.inc()
        else:
            manual_reports_total.inc()

        return report
    except Exception as e:
        failed_reports_total.inc()
        raise e

Level 4: Performance and Caching

Cache Metrics

Caching is a critical part of performance, so you need to track the hit rate:

cache_hits_total = Counter(
    'cache_hits_total',
    'Total number of cache hits',
    ['cache_type'],
    registry=registry
)

cache_misses_total = Counter(
    'cache_misses_total',
    'Total number of cache misses',
    ['cache_type'],
    registry=registry
)

# In caching code:
async def get_from_cache(key: str, cache_type: str = 'generic'):
    value = await cache.get(key)

    if value is not None:
        cache_hits_total.labels(cache_type=cache_type).inc()
        return value
    else:
        cache_misses_total.labels(cache_type=cache_type).inc()
        return None

Background Task Metrics

If you have background tasks (Celery, APScheduler), track them:

background_task_duration_seconds = Histogram(
    'background_task_duration_seconds',
    'Background task execution time',
    ['task_type'],
    registry=registry
)

async def run_background_task(task_type: str, task_func, *args, **kwargs):
    start_time = time.time()

    try:
        result = await task_func(*args, **kwargs)
        return result
    finally:
        duration = time.time() - start_time
        background_task_duration_seconds.labels(task_type=task_type).observe(duration)
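
For example, a periodic job wired into APScheduler could be wrapped like this (a sketch — the scheduler setup and the job itself are assumptions, not code from the project):

from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

async def sync_recent_activities():
    ...  # hypothetical periodic job

async def tracked_sync():
    await run_background_task('activity_sync', sync_recent_activities)

@app.on_event("startup")
async def start_scheduler():
    scheduler.add_job(tracked_sync, 'interval', minutes=15)
    scheduler.start()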

PromQL: Metrics Query Language

Prometheus uses its own query language — PromQL. Not SQL, but very powerful.

Basic Queries

# 1. Just get metric (instant vector)
http_requests_total

# 2. Filter by labels
http_requests_total{method="GET"}
http_requests_total{status_code="200"}
http_requests_total{method="GET", endpoint="/api/activities"}

# 3. Regular expressions in labels
http_requests_total{status_code=~"5.."}  # All 5xx errors
http_requests_total{endpoint=~"/api/.*"}  # All API endpoints

# 4. Time interval (range vector)
http_requests_total[5m]  # Data for last 5 minutes

Rate and irate: Rate of Change

A Counter only ever grows, but what we usually want is its rate of change — RPS (requests per second):

# Rate - average rate over interval
rate(http_requests_total[5m])

# irate - instantaneous rate (between last two points)
irate(http_requests_total[5m])

When to use what:

  • rate() — for alerts and trend graphs (smooths spikes)
  • irate() — for detailed analysis (shows peaks)

Aggregation with sum, avg, max

# Total app RPS
sum(rate(http_requests_total[5m]))

# RPS by method
sum(rate(http_requests_total[5m])) by (method)

# RPS by endpoint, sorted
sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))

# Average latency
avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))

Histogram and Percentiles

For Histogram metrics (latency, duration) use histogram_quantile:

# P50 (median) latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency (99% of requests faster than this)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P95 per endpoint
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

Complex Queries

1. Success Rate (percentage of successful requests)

(
  sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

2. Error Rate (percentage of errors)

(
  sum(rate(http_requests_total{status_code=~"4..|5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

3. Cache Hit Rate

(
  sum(rate(cache_hits_total[5m]))
  /
  (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
) * 100

4. Top-5 Slowest Endpoints

topk(5,
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
  )
)

5. API Health Score (0-100)

(
  (
    sum(rate(external_api_calls_total[5m])) 
    - 
    sum(rate(external_api_errors_total[5m]))
  ) 
  / 
  sum(rate(external_api_calls_total[5m]))
) * 100

Grafana Dashboards: Visualization

Now the fun part — turning raw metrics into beautiful and informative dashboards.


Dashboard 1: HTTP & Performance

Panel 1: Request Rate

sum(rate(http_requests_total[5m]))
  • Type: Time series
  • Color: Blue gradient
  • Unit: requests/sec
  • Legend: Total RPS

Panel 2: Success Rate

(
  sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
  • Type: Stat
  • Color: Green if > 95%, yellow if > 90%, red if < 90%
  • Unit: percent (0-100)
  • Value: Current (last)

Panel 3: Response Time (P50, P95, P99)

# P50
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
  • Type: Time series
  • Unit: seconds (s)
  • Legend: P50, P95, P99

Panel 4: Errors by Type

sum(rate(http_requests_total{status_code=~"4.."}[5m])) by (status_code)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (status_code)
  • Type: Bar chart
  • Colors: Yellow (4xx), Red (5xx)

Panel 5: Request Rate by Endpoint

sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))
  • Type: Bar chart
  • Limit: Top 10

Dashboard 2: Business Metrics

This dashboard shows real product usage — what users do and how often.

Authentication and Users

Panel 1: User Activity (24h)

# Logins
increase(user_logins_total[24h])

# Registrations
increase(user_registrations_total[24h])

# Deletions
increase(user_deletions_total[24h])
  • Type: Stat
  • Layout: Horizontal

Panel 2: Downloads by Type

# rate() drops the metric name, so query each download counter explicitly
sum(rate(fit_downloads_total[5m]))
sum(rate(gpx_downloads_total[5m]))
  • Type: Pie chart
  • Legend: Right side

Panel 3: Feature Usage Timeline

rate(gpx_fixer_usage_total[5m])
rate(search_usage_total[5m])
rate(manual_reports_total[5m])
  • Type: Time series
  • Stacking: Normal

Dashboard 3: External API

It's critical to monitor dependencies on external services — they can become bottlenecks.


Panel 1: API Health Score

(
  sum(rate(external_api_calls_total[5m])) - sum(rate(external_api_errors_total[5m]))
) / sum(rate(external_api_calls_total[5m])) * 100
  • Type: Gauge
  • Min: 0, Max: 100
  • Thresholds: 95 (green), 90 (yellow), 0 (red)

Panel 2: API Latency by Endpoint

histogram_quantile(0.95, sum(rate(external_api_latency_seconds_bucket[5m])) by (le, endpoint_type))
  • Type: Bar chart
  • Sort: Descending

Panel 3: Error Rate by Endpoint

sum(rate(external_api_errors_total[5m])) by (endpoint_type)
  • Type: Bar chart
  • Color: Red

Variables: Dynamic Dashboards

Grafana supports variables for interactive dashboards:

Creating a Variable

  1. Dashboard Settings → Variables → Add variable
  2. Name: endpoint
  3. Type: Query
  4. Query:
label_values(http_requests_total, endpoint)

Using in Panels

# Filter by selected endpoint
sum(rate(http_requests_total{endpoint="$endpoint"}[5m]))

# Multi-select
sum(rate(http_requests_total{endpoint=~"$endpoint"}[5m])) by (endpoint)

Useful Variables

# Time interval
Variable: interval
Type: Interval
Values: 1m,5m,10m,30m,1h

# HTTP method
Variable: method
Query: label_values(http_requests_total, method)

# Status code
Variable: status_code
Query: label_values(http_requests_total, status_code)

Alerting: System Reactivity

Monitoring without alerts is like a car without brakes. Let's set up smart alerts.

Grafana Alerting

Alert 1: High Error Rate

(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100 > 1
  • Condition: > 1 (more than 1% errors)
  • For: 5m (for 5 minutes)
  • Severity: Critical
  • Notification: Slack, Email, Telegram
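
If you prefer to keep alert definitions in version control next to Prometheus instead of (or in addition to) Grafana alerting, the same rule can be expressed as a Prometheus alerting rule. A sketch — you'd also need to list the file under rule_files: in prometheus.yml and route notifications through Alertmanager:

# monitoring/alerts.yml
groups:
  - name: webapp-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate is above 1% for 5 minutes"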

Alert 2: High Latency

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  • Condition: P95 > 2 seconds
  • For: 10m
  • Severity: Warning

Alert 3: External API Down

sum(rate(external_api_errors_total[5m])) / sum(rate(external_api_calls_total[5m])) > 0.5
  • Condition: More than 50% API errors
  • For: 2m
  • Severity: Critical

Alert 4: No Data

absent_over_time(http_requests_total[10m])
  • Condition: No metrics for 10 minutes
  • Severity: Critical
  • Means: app crashed or Prometheus can't collect metrics

Best Practices: Battle-Tested Experience

1. Labels: Don't Overdo It

Bad:

# Too detailed labels = cardinality explosion
http_requests_total.labels(
    method=request.method,
    endpoint=request.url.path,  # Every unique URL!
    user_id=str(user.id),       # Thousands of users!
    timestamp=str(time.time())  # Infinite values!
).inc()

Good:

# Normalized endpoints + limited label set
http_requests_total.labels(
    method=request.method,
    endpoint=normalize_path(request.url.path),  # /api/users/{id}
    status_code=str(response.status_code)
).inc()

Rule: High-cardinality data (user_id, timestamps, unique IDs) should NOT be labels.

2. Naming Convention

Follow Prometheus naming conventions:

# Good names:
http_requests_total          # <namespace>_<name>_<unit>
external_api_latency_seconds # Unit in name
cache_hits_total             # Clear it's a Counter

# Bad names:
RequestCount                 # Don't use CamelCase
api-latency                  # Don't use dashes
request_time                 # Unit not specified

3. Rate() Interval

The rate() window should be at least 4x the scrape_interval:

# If scrape_interval = 15s
rate(http_requests_total[1m])   # 4x = 60s ✅
rate(http_requests_total[30s])  # 2x = poor accuracy ❌

4. Histogram Buckets

Proper buckets are critical for accurate percentiles:

# Default buckets (may not match your latency profile):
Histogram('latency_seconds', 'Latency')  # [.005, .01, .025, .05, .1, ...]

# Custom buckets for web latency:
Histogram(
    'http_request_duration_seconds',
    'Request latency',
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

Principle: Buckets should cover the typical range of values.

5. Metrics Cost

Every metric costs memory. Let's calculate:

Memory = Series count × (~3KB per series)

Series = Metric × Label combinations

Example:

# 1 metric × 5 methods × 20 endpoints × 15 status codes = 1,500 series
http_requests_total{method, endpoint, status_code}

# 1,500 × 3KB = ~4.5MB for one metric!

Tip: Regularly check cardinality:

# Top metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))

Production Checklist

Before launching in production, check:

  • [ ] Retention policy configured (storage.tsdb.retention.time)
  • [ ] Disk space monitored (Prometheus can take a lot of space)
  • [ ] Backups configured for Grafana dashboards
  • [ ] Alerts tested (create artificial error)
  • [ ] Notification channels work (send test alert)
  • [ ] Access control configured (don't leave Grafana with admin/admin!)
  • [ ] HTTPS configured for Grafana (via nginx reverse proxy)
  • [ ] Cardinality checked (topk(10, count by (__name__)({__name__=~".+"})))
  • [ ] Documentation created (what metric is responsible for what)
  • [ ] On-call process defined (who gets alerts and what to do)

Real Case: Finding a Problem

Imagine: users complain about slow performance. Here's how monitoring helped find and fix the problem in minutes.

Step 1: Open Grafana → HTTP Performance Dashboard

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

We see: P95 latency jumped from 0.2s to 3s.

Step 2: Check latency by endpoint

topk(5, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)))

Found: /api/activities — 5 seconds!

Step 3: Check external APIs

histogram_quantile(0.95, sum(rate(external_api_latency_seconds_bucket[5m])) by (le, endpoint_type))

External API athlete_activities — 4.8 seconds. There's the problem!

Step 4: Check error rate

rate(external_api_errors_total{endpoint_type="athlete_activities"}[5m])

No errors, just slow. So the problem isn't on our side — the external service is lagging.

Solution:

  • Add aggressive caching for external API (TTL 5 minutes)
  • Set up alert for latency > 2s
  • Add timeout to requests
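
A minimal sketch of the caching and timeout pieces, reusing get_from_cache and track_external_api_call from earlier (the cache client's set() signature and the 2-second timeout are assumptions):

import asyncio

CACHE_TTL_SECONDS = 300  # 5 minutes

async def get_activities_cached(athlete_id: int):
    cache_key = f"activities:{athlete_id}"

    # Serve from cache when possible (this also feeds the cache hit/miss metrics)
    cached = await get_from_cache(cache_key, cache_type='external_api')
    if cached is not None:
        return cached

    # Fail fast instead of letting a slow upstream drag our P95 up
    activities = await asyncio.wait_for(
        track_external_api_call(
            'athlete_activities',
            external_client.get_athlete_activities,
            athlete_id=athlete_id,
        ),
        timeout=2.0,
    )

    await cache.set(cache_key, activities, ttl=CACHE_TTL_SECONDS)  # ttl= is an assumption
    return activities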

Step 5: After deploy, verify

# Cache hit rate
(cache_hits_total / (cache_hits_total + cache_misses_total)) * 100

Hit rate 85% → latency dropped to 0.3s. Victory! 🎉

What's Next?

You've built a production-ready monitoring system. But this is just the beginning:

Next Steps:

  1. Distributed Tracing — add Jaeger/Tempo for request tracing
  2. Logging — integrate Loki for centralized logs
  3. Custom Dashboards — create dashboards for business (not just DevOps)
  4. SLO/SLI — define Service Level Objectives
  5. Anomaly Detection — use machine learning for anomaly detection
  6. Cost Monitoring — add cost metrics (AWS CloudWatch, etc.)


Conclusion

A monitoring system isn't "set it and forget it." It's a living organism that needs to evolve with your application. But the basic architecture we've built scales from startup to enterprise.

Key Takeaways:

  1. Three metric levels: HTTP (infrastructure) → API (dependencies) → Business (product)
  2. Middleware automates basic metrics collection
  3. PromQL is powerful — learn gradually
  4. Labels matter — but don't overdo cardinality
  5. Alerts are critical — monitoring without alerts is useless
  6. Document — in six months you'll forget what foo_bar_total means

Monitoring is a culture, not a tool. Start simple, iterate, improve. And your application will run stably while you sleep peacefully 😴


About Peakline

This monitoring system was built for Peakline — a web application for Strava activity analysis. Peakline provides athletes with:

  • Detailed segment analysis with interactive maps
  • Historical weather data for every activity
  • Advanced FIT file generation for virtual races
  • Automatic GPX track error correction
  • Route planner

All these features require reliable monitoring to ensure quality user experience.


Questions? Leave them in the comments!

P.S. If you found this helpful — share with colleagues who might benefit!


About the Author

Solo developer building Peakline — tools for athletes. An athlete and enthusiast myself, I believe in automation, observability, and quality code. I'm continuing to develop the project and share my experience with the community in 2025.


Connect

  • 🌐 Peakline Website
  • 💬 Share your monitoring setup in comments
  • 📧 Questions? Drop a comment below!

Tags: #prometheus #grafana #monitoring #python #fastapi #devops #observability #sre #metrics #production
