Building Enterprise-Level Monitoring: From Prometheus to Grafana Dashboards

Cover Image: Main Grafana dashboard with HTTP and performance metrics

Introduction

Once your web application hits production, the most critical question becomes: how is it performing right now? Logs tell you what happened, but you want to spot problems before users start complaining.

In this article, I'll share how I built a complete monitoring system for Peakline — a FastAPI application for Strava data analysis that processes thousands of requests daily from athletes worldwide.

What's Inside:

  • Metrics architecture (HTTP, API, business metrics)
  • Prometheus + Grafana setup from scratch
  • 50+ production-ready metrics
  • Advanced PromQL queries
  • Interactive dashboards and alerting
  • Best practices and pitfalls

Architecture: Three Monitoring Levels

Modern monitoring isn't just "set up Grafana and look at graphs." It's a well-thought-out architecture with several layers:

┌─────────────────────────────────────────────────┐
│   FastAPI Application                           │
│   ├── HTTP Middleware (auto-collect metrics)    │
│   ├── Business Logic (business metrics)         │
│   └── /metrics endpoint (Prometheus format)     │
└──────────────────┬──────────────────────────────┘
                   │ scrape every 5s
┌──────────────────▼──────────────────────────────┐
│   Prometheus                                    │
│   ├── Time Series Database (TSDB)               │
│   ├── Storage retention: 200h                   │
│   └── PromQL Engine                             │
└──────────────────┬──────────────────────────────┘
                   │ query data
┌──────────────────▼──────────────────────────────┐
│   Grafana                                       │
│   ├── Dashboards                                │
│   ├── Alerting                                  │
│   └── Visualization                             │
└─────────────────────────────────────────────────┘

Why This Stack?

Prometheus — the de-facto standard for metrics. Pull model, powerful PromQL query language, excellent Kubernetes integration.

Grafana — the best visualization tool. Beautiful dashboards, alerting, templating, rich UI.

FastAPI — async Python framework with native metrics support via prometheus_client.

Basic Infrastructure Setup

Docker Compose: 5-Minute Quick Start

First, let's spin up Prometheus and Grafana in Docker:

# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=200h'  # 8+ days of history
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    extra_hosts:
      - "host.docker.internal:host-gateway"  # Access host machine

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}  # Use .env!
      - GF_SERVER_ROOT_URL=/grafana  # For nginx reverse proxy
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge

Key Points:

  • storage.tsdb.retention.time=200h — keep metrics for 8+ days (for weekly analysis)
  • extra_hosts: host.docker.internal — allows Prometheus to reach the app on the host
  • Volumes for data persistence

Prometheus Configuration

# monitoring/prometheus.yml
global:
  scrape_interval: 15s      # How often to collect metrics
  evaluation_interval: 15s  # How often to check alerts

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webapp'
    static_configs:
      - targets: ['host.docker.internal:8000']  # Your app port
    scrape_interval: 5s      # More frequent for web apps
    metrics_path: /metrics

Important: scrape_interval: 5s for web apps is a balance between data freshness and system load. In production, typically 15-30s.

Grafana Datasource Provisioning

To avoid configuring the Prometheus datasource in Grafana by hand, use provisioning:

# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

Now Grafana automatically connects to Prometheus on startup.
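
Dashboards can be provisioned the same way. Here's a minimal sketch of a dashboard provider (the file layout is an assumption — mount your exported dashboard JSON wherever path points):

# monitoring/grafana/provisioning/dashboards/default.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards

With the datasource (and optionally dashboards) provisioned, bring the whole stack up: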

docker-compose up -d

Level 1: HTTP Metrics

The most basic but critically important layer is monitoring HTTP requests. A middleware automatically collects metrics for every request.

Metrics Initialization

# webapp/main.py
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse
import time

app = FastAPI(title="Peakline", version="2.0.0")

# Create separate registry for metrics isolation
registry = CollectorRegistry()

# Counter: monotonically increasing value (request count)
http_requests_total = Counter(
    'http_requests_total',
    'Total number of HTTP requests',
    ['method', 'endpoint', 'status_code'],  # Labels for grouping
    registry=registry
)

# Histogram: distribution of values (execution time)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    registry=registry
)

# API call counters
api_calls_total = Counter(
    'api_calls_total',
    'Total number of API calls by type',
    ['api_type'],
    registry=registry
)

# Separate error counters
http_errors_4xx_total = Counter(
    'http_errors_4xx_total',
    'Total number of 4xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

http_errors_5xx_total = Counter(
    'http_errors_5xx_total',
    'Total number of 5xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

Middleware for Automatic Collection

The magic happens in middleware — it wraps every request:

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    # Execute request
    response = await call_next(request)

    duration = time.time() - start_time

    # Path normalization: /api/activities/12345 → /api/activities/{id}
    path = request.url.path
    if path.startswith('/api/'):
        parts = path.split('/')
        if len(parts) > 3 and parts[3].isdigit():
            parts[3] = '{id}'
            path = '/'.join(parts)

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=path,
        status_code=str(response.status_code)
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=path
    ).observe(duration)

    # Track API calls
    if path.startswith('/api/'):
        api_type = path.split('/')[2] if len(path.split('/')) > 2 else 'unknown'
        api_calls_total.labels(api_type=api_type).inc()

    # Track errors separately
    status_code = response.status_code
    if 400 <= status_code < 500:
        http_errors_4xx_total.labels(endpoint=path, status_code=str(status_code)).inc()
    elif status_code >= 500:
        http_errors_5xx_total.labels(endpoint=path, status_code=str(status_code)).inc()

    return response

Key Techniques:

  1. Path normalization — critically important! Without this, you'll get thousands of unique metrics for /api/activities/1, /api/activities/2, etc. (a reusable helper sketch follows this list)

  2. Labels — allow filtering and grouping metrics in PromQL

  3. Separate error counters — simplifies alert writing
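
As mentioned in point 1, here's a sketch of a reusable normalization helper (the exact rules are assumptions tuned to this app's URL scheme; the best-practices section below refers to it as normalize_path):

import re

_ID_SEGMENT = re.compile(r"^\d+$")

def normalize_path(path: str) -> str:
    """Collapse numeric path segments: /api/activities/12345 -> /api/activities/{id}."""
    if not path.startswith('/api/'):
        return path
    return '/'.join('{id}' if _ID_SEGMENT.match(part) else part for part in path.split('/'))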

Metrics Endpoint

@app.get("/metrics")
async def metrics():
    """Prometheus metrics endpoint"""
    return PlainTextResponse(
        generate_latest(registry),
        media_type=CONTENT_TYPE_LATEST
    )

Now Prometheus can collect metrics from http://localhost:8000/metrics.

What We Get in Prometheus

# Metrics format in /metrics endpoint:
http_requests_total{method="GET",endpoint="/api/activities",status_code="200"} 1543
http_requests_total{method="POST",endpoint="/api/activities",status_code="201"} 89
http_request_duration_seconds_bucket{method="GET",endpoint="/api/activities",le="0.1"} 1234

Level 2: External API Metrics

Web applications often integrate with external APIs (Stripe, AWS, etc.). It's important to track not only your own requests but also dependencies.

External API Metrics

# External API metrics
external_api_calls_total = Counter(
    'external_api_calls_total',
    'Total number of external API calls by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_errors_total = Counter(
    'external_api_errors_total',
    'Total number of external API errors by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_latency_seconds = Histogram(
    'external_api_latency_seconds',
    'External API call latency in seconds',
    ['endpoint_type'],
    registry=registry
)

API Call Tracking Helper

Instead of duplicating code everywhere you call the API, create a universal wrapper:

async def track_external_api_call(endpoint_type: str, api_call_func, *args, **kwargs):
    """
    Universal wrapper for tracking API calls

    Usage:
        result = await track_external_api_call(
            'athlete_activities',
            client.get_athlete_activities,
            athlete_id=123
        )
    """
    start_time = time.time()

    try:
        # Increment call counter
        external_api_calls_total.labels(endpoint_type=endpoint_type).inc()

        # Execute API call
        result = await api_call_func(*args, **kwargs)

        # Record latency
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)

        # Check for API errors (status >= 400)
        if isinstance(result, Exception) or (hasattr(result, 'status') and result.status >= 400):
            external_api_errors_total.labels(endpoint_type=endpoint_type).inc()

        return result

    except Exception as e:
        # Record latency and error
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)
        external_api_errors_total.labels(endpoint_type=endpoint_type).inc()
        raise e

Usage in Code

@app.get("/api/activities")
async def get_activities(athlete_id: int):
    # Instead of direct API call:
    # activities = await external_client.get_athlete_activities(athlete_id)

    # Use wrapper with tracking:
    activities = await track_external_api_call(
        'athlete_activities',
        external_client.get_athlete_activities,
        athlete_id=athlete_id
    )

    return activities
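
Optionally, the same tracking can be packaged as a decorator so call sites stay clean. This is just a sketch on top of the wrapper above, not part of the original code:

import functools

def track_external_api(endpoint_type: str):
    """Decorator form of track_external_api_call."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            return await track_external_api_call(endpoint_type, func, *args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical client method):
# @track_external_api('athlete_activities')
# async def get_athlete_activities(athlete_id: int): ...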

Now we can see:

  • How many calls to each external API endpoint
  • How many returned errors
  • Latency for each call type

Level 3: Business Metrics

This is the most valuable part of monitoring — metrics that reflect actual application usage.

Business Metrics Types

# === Authentication ===
user_logins_total = Counter(
    'user_logins_total',
    'Total number of user logins',
    registry=registry
)

user_registrations_total = Counter(
    'user_registrations_total',
    'Total number of new user registrations',
    registry=registry
)

user_deletions_total = Counter(
    'user_deletions_total',
    'Total number of user deletions',
    registry=registry
)

# === File Operations ===
fit_downloads_total = Counter(
    'fit_downloads_total',
    'Total number of FIT file downloads',
    registry=registry
)

gpx_downloads_total = Counter(
    'gpx_downloads_total',
    'Total number of GPX file downloads',
    registry=registry
)

gpx_uploads_total = Counter(
    'gpx_uploads_total',
    'Total number of GPX file uploads',
    registry=registry
)

# === User Actions ===
settings_updates_total = Counter(
    'settings_updates_total',
    'Total number of user settings updates',
    registry=registry
)

feature_requests_total = Counter(
    'feature_requests_total',
    'Total number of feature requests',
    registry=registry
)

feature_votes_total = Counter(
    'feature_votes_total',
    'Total number of votes for features',
    registry=registry
)

# === Reports ===
manual_reports_total = Counter(
    'manual_reports_total',
    'Total number of manually created reports',
    registry=registry
)

auto_reports_total = Counter(
    'auto_reports_total',
    'Total number of automatically created reports',
    registry=registry
)

failed_reports_total = Counter(
    'failed_reports_total',
    'Total number of failed report creation attempts',
    registry=registry
)

Incrementing in Code

@app.post("/api/auth/login")
async def login(credentials: LoginCredentials):
    user = await authenticate_user(credentials)

    if user:
        # Increment successful login counter
        user_logins_total.inc()
        return {"token": generate_token(user)}

    return {"error": "Invalid credentials"}

@app.post("/api/activities/report")
async def create_report(activity_id: int, is_auto: bool = False):
    try:
        report = await generate_activity_report(activity_id)

        # Different counters for manual and automatic reports
        if is_auto:
            auto_reports_total.inc()
        else:
            manual_reports_total.inc()

        return report
    except Exception as e:
        failed_reports_total.inc()
        raise e

Level 4: Performance and Caching

Cache Metrics

Caching is a critical part of performance, so you need to track the hit rate:

cache_hits_total = Counter(
    'cache_hits_total',
    'Total number of cache hits',
    ['cache_type'],
    registry=registry
)

cache_misses_total = Counter(
    'cache_misses_total',
    'Total number of cache misses',
    ['cache_type'],
    registry=registry
)

# In caching code:
async def get_from_cache(key: str, cache_type: str = 'generic'):
    value = await cache.get(key)

    if value is not None:
        cache_hits_total.labels(cache_type=cache_type).inc()
        return value
    else:
        cache_misses_total.labels(cache_type=cache_type).inc()
        return None

Background Task Metrics

If you have background tasks (Celery, APScheduler), track them:

background_task_duration_seconds = Histogram(
    'background_task_duration_seconds',
    'Background task execution time',
    ['task_type'],
    registry=registry
)

async def run_background_task(task_type: str, task_func, *args, **kwargs):
    start_time = time.time()

    try:
        result = await task_func(*args, **kwargs)
        return result
    finally:
        duration = time.time() - start_time
        background_task_duration_seconds.labels(task_type=task_type).observe(duration)
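
For example, a periodic job wired into APScheduler could be wrapped like this (a sketch — the scheduler setup and the job itself are assumptions, not code from the project):

from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

async def sync_recent_activities():
    ...  # hypothetical periodic job

async def tracked_sync():
    await run_background_task('activity_sync', sync_recent_activities)

@app.on_event("startup")
async def start_scheduler():
    scheduler.add_job(tracked_sync, 'interval', minutes=15)
    scheduler.start()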

PromQL: Metrics Query Language

Prometheus uses its own query language — PromQL. Not SQL, but very powerful.

Basic Queries

# 1. Just get metric (instant vector)
http_requests_total

# 2. Filter by labels
http_requests_total{method="GET"}
http_requests_total{status_code="200"}
http_requests_total{method="GET", endpoint="/api/activities"}

# 3. Regular expressions in labels
http_requests_total{status_code=~"5.."}  # All 5xx errors
http_requests_total{endpoint=~"/api/.*"}  # All API endpoints

# 4. Time interval (range vector)
http_requests_total[5m]  # Data for last 5 minutes

Rate and irate: Rate of Change

A Counter only ever grows, but what we usually want is its rate of change — RPS (requests per second):

# Rate - average rate over interval
rate(http_requests_total[5m])

# irate - instantaneous rate (between last two points)
irate(http_requests_total[5m])

When to use what:

  • rate() — for alerts and trend graphs (smooths spikes)
  • irate() — for detailed analysis (shows peaks)

Aggregation with sum, avg, max

# Total app RPS
sum(rate(http_requests_total[5m]))

# RPS by method
sum(rate(http_requests_total[5m])) by (method)

# RPS by endpoint, sorted
sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))

# Average latency
avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))

Histogram and Percentiles

For Histogram metrics (latency, duration) use histogram_quantile:

# P50 (median) latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99 latency (99% of requests faster than this)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# P95 per endpoint
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

Complex Queries

1. Success Rate (percentage of successful requests)

(
  sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

2. Error Rate (percentage of errors)

(
  sum(rate(http_requests_total{status_code=~"4..|5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100

3. Cache Hit Rate

(
  sum(rate(cache_hits_total[5m]))
  /
  (sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
) * 100

4. Top-5 Slowest Endpoints

topk(5,
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
  )
)

5. API Health Score (0-100)

(
  (
    sum(rate(external_api_calls_total[5m])) 
    - 
    sum(rate(external_api_errors_total[5m]))
  ) 
  / 
  sum(rate(external_api_calls_total[5m]))
) * 100

Grafana Dashboards: Visualization

Now the fun part — turning raw metrics into beautiful and informative dashboards.


Dashboard 1: HTTP & Performance

Panel 1: Request Rate

sum(rate(http_requests_total[5m]))
  • Type: Time series
  • Color: Blue gradient
  • Unit: requests/sec
  • Legend: Total RPS

Panel 2: Success Rate

(
  sum(rate(http_requests_total{status_code=~"2.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100
  • Type: Stat
  • Color: Green if > 95%, yellow if > 90%, red if < 90%
  • Unit: percent (0-100)
  • Value: Current (last)

Panel 3: Response Time (P50, P95, P99)

# P50
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))

# P95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# P99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
  • Type: Time series
  • Unit: seconds (s)
  • Legend: P50, P95, P99

Panel 4: Errors by Type

sum(rate(http_requests_total{status_code=~"4.."}[5m])) by (status_code)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (status_code)
  • Type: Bar chart
  • Colors: Yellow (4xx), Red (5xx)

Panel 5: Request Rate by Endpoint

sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))
  • Type: Bar chart
  • Limit: Top 10

Dashboard 2: Business Metrics

This dashboard shows real product usage — what users do and how often.

Authentication and Users

Panel 1: User Activity (24h)

# Logins
increase(user_logins_total[24h])

# Registrations
increase(user_registrations_total[24h])

# Deletions
increase(user_deletions_total[24h])
  • Type: Stat
  • Layout: Horizontal

Panel 2: Downloads by Type

# rate() drops the metric name, so query each download counter explicitly
sum(rate(fit_downloads_total[5m]))
sum(rate(gpx_downloads_total[5m]))
  • Type: Pie chart
  • Legend: Right side

Panel 3: Feature Usage Timeline

rate(gpx_fixer_usage_total[5m])
rate(search_usage_total[5m])
rate(manual_reports_total[5m])
  • Type: Time series
  • Stacking: Normal

Dashboard 3: External API

It's critical to monitor dependencies on external services — they can become bottlenecks.


Panel 1: API Health Score

(
  sum(rate(external_api_calls_total[5m])) - sum(rate(external_api_errors_total[5m]))
) / sum(rate(external_api_calls_total[5m])) * 100
  • Type: Gauge
  • Min: 0, Max: 100
  • Thresholds: 95 (green), 90 (yellow), 0 (red)

Panel 2: API Latency by Endpoint

histogram_quantile(0.95, sum(rate(external_api_latency_seconds_bucket[5m])) by (le, endpoint_type))
  • Type: Bar chart
  • Sort: Descending

Panel 3: Error Rate by Endpoint

sum(rate(external_api_errors_total[5m])) by (endpoint_type)
  • Type: Bar chart
  • Color: Red

Variables: Dynamic Dashboards

Grafana supports variables for interactive dashboards:

Creating a Variable

  1. Dashboard Settings → Variables → Add variable
  2. Name: endpoint
  3. Type: Query
  4. Query:
label_values(http_requests_total, endpoint)

Using in Panels

# Filter by selected endpoint
sum(rate(http_requests_total{endpoint="$endpoint"}[5m]))

# Multi-select
sum(rate(http_requests_total{endpoint=~"$endpoint"}[5m])) by (endpoint)

Useful Variables

# Time interval
Variable: interval
Type: Interval
Values: 1m,5m,10m,30m,1h

# HTTP method
Variable: method
Query: label_values(http_requests_total, method)

# Status code
Variable: status_code
Query: label_values(http_requests_total, status_code)

Alerting: System Reactivity

Monitoring without alerts is like a car without brakes. Let's set up smart alerts.

Grafana Alerting

Alert 1: High Error Rate

(
  sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
) * 100 > 1
  • Condition: > 1 (more than 1% errors)
  • For: 5m (for 5 minutes)
  • Severity: Critical
  • Notification: Slack, Email, Telegram
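
If you prefer to keep alert definitions in version control next to Prometheus instead of (or in addition to) Grafana alerting, the same rule can be expressed as a Prometheus alerting rule. A sketch — you'd also need to list the file under rule_files: in prometheus.yml and route notifications through Alertmanager:

# monitoring/alerts.yml
groups:
  - name: webapp-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          ) * 100 > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "5xx error rate is above 1% for 5 minutes"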

Alert 2: High Latency

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
  • Condition: P95 > 2 seconds
  • For: 10m
  • Severity: Warning

Alert 3: External API Down

sum(rate(external_api_errors_total[5m])) / sum(rate(external_api_calls_total[5m])) > 0.5
  • Condition: More than 50% API errors
  • For: 2m
  • Severity: Critical

Alert 4: No Data

absent_over_time(http_requests_total[10m])
  • Condition: No metrics for 10 minutes
  • Severity: Critical
  • Means: app crashed or Prometheus can't collect metrics

Best Practices: Battle-Tested Experience

1. Labels: Don't Overdo It

Bad:

# Too detailed labels = cardinality explosion
http_requests_total.labels(
    method=request.method,
    endpoint=request.url.path,  # Every unique URL!
    user_id=str(user.id),       # Thousands of users!
    timestamp=str(time.time())  # Infinite values!
).inc()

Good:

# Normalized endpoints + limited label set
http_requests_total.labels(
    method=request.method,
    endpoint=normalize_path(request.url.path),  # /api/users/{id}
    status_code=str(response.status_code)
).inc()

Rule: High-cardinality data (user_id, timestamps, unique IDs) should NOT be labels.

2. Naming Convention

Follow Prometheus naming conventions:

# Good names:
http_requests_total          # <namespace>_<name>_<unit>
external_api_latency_seconds # Unit in name
cache_hits_total             # Clear it's a Counter

# Bad names:
RequestCount                 # Don't use CamelCase
api-latency                  # Don't use dashes
request_time                 # Unit not specified

3. Rate() Interval

The rate() window should be at least 4x the scrape_interval:

# If scrape_interval = 15s
rate(http_requests_total[1m])   # 4x = 60s ✅
rate(http_requests_total[30s])  # 2x = poor accuracy ❌

4. Histogram Buckets

Proper buckets are critical for accurate percentiles:

# Default buckets (may not match your latency profile):
Histogram('latency_seconds', 'Latency')  # [.005, .01, .025, .05, .1, ...]

# Custom buckets for web latency:
Histogram(
    'http_request_duration_seconds',
    'Request latency',
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

Principle: Buckets should cover the typical range of values.

5. Metrics Cost

Every metric costs memory. Let's calculate:

Memory = Series count × (~3KB per series)

Series = Metric × Label combinations

Example:

# 1 metric × 5 methods × 20 endpoints × 15 status codes = 1,500 series
http_requests_total{method, endpoint, status_code}

# 1,500 × 3KB = ~4.5MB for one metric!

Tip: Regularly check cardinality:

# Top metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))

Production Checklist

Before launching in production, check:

  • [ ] Retention policy configured (storage.tsdb.retention.time)
  • [ ] Disk space monitored (Prometheus can take a lot of space)
  • [ ] Backups configured for Grafana dashboards
  • [ ] Alerts tested (create artificial error)
  • [ ] Notification channels work (send test alert)
  • [ ] Access control configured (don't leave Grafana with admin/admin!)
  • [ ] HTTPS configured for Grafana (via nginx reverse proxy)
  • [ ] Cardinality checked (topk(10, count by (__name__)({__name__=~".+"})))
  • [ ] Documentation created (what metric is responsible for what)
  • [ ] On-call process defined (who gets alerts and what to do)

Real Case: Finding a Problem

Imagine: users complain about slow performance. Here's how monitoring helped find and fix the problem in minutes.

Step 1: Open Grafana → HTTP Performance Dashboard

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

We see: P95 latency jumped from 0.2s to 3s.

Step 2: Check latency by endpoint

topk(5, histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)))

Found: /api/activities — 5 seconds!

Step 3: Check external APIs

histogram_quantile(0.95, sum(rate(external_api_latency_seconds_bucket[5m])) by (le, endpoint_type))

External API athlete_activities — 4.8 seconds. There's the problem!

Step 4: Check error rate

rate(external_api_errors_total{endpoint_type="athlete_activities"}[5m])

No errors, just slow. So the problem isn't on our side — the external service is lagging.

Solution:

  • Add aggressive caching for external API (TTL 5 minutes)
  • Set up alert for latency > 2s
  • Add timeout to requests
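
A minimal sketch of the caching and timeout pieces, reusing get_from_cache and track_external_api_call from earlier (the cache client's set() signature and the 2-second timeout are assumptions):

import asyncio

CACHE_TTL_SECONDS = 300  # 5 minutes

async def get_activities_cached(athlete_id: int):
    cache_key = f"activities:{athlete_id}"

    # Serve from cache when possible (this also feeds the cache hit/miss metrics)
    cached = await get_from_cache(cache_key, cache_type='external_api')
    if cached is not None:
        return cached

    # Fail fast instead of letting a slow upstream drag our P95 up
    activities = await asyncio.wait_for(
        track_external_api_call(
            'athlete_activities',
            external_client.get_athlete_activities,
            athlete_id=athlete_id,
        ),
        timeout=2.0,
    )

    await cache.set(cache_key, activities, ttl=CACHE_TTL_SECONDS)  # ttl= is an assumption
    return activities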

Step 5: After deploy, verify

# Cache hit rate
(cache_hits_total / (cache_hits_total + cache_misses_total)) * 100

Hit rate 85% → latency dropped to 0.3s. Victory! 🎉

What's Next?

You've built a production-ready monitoring system. But this is just the beginning:

Next Steps:

  1. Distributed Tracing — add Jaeger/Tempo for request tracing
  2. Logging — integrate Loki for centralized logs
  3. Custom Dashboards — create dashboards for business (not just DevOps)
  4. SLO/SLI — define Service Level Objectives
  5. Anomaly Detection — use machine learning for anomaly detection
  6. Cost Monitoring — add cost metrics (AWS CloudWatch, etc.)


Conclusion

A monitoring system isn't "set it and forget it." It's a living organism that needs to evolve with your application. But the basic architecture we've built scales from startup to enterprise.

Key Takeaways:

  1. Three metric levels: HTTP (infrastructure) → API (dependencies) → Business (product)
  2. Middleware automates basic metrics collection
  3. PromQL is powerful — learn gradually
  4. Labels matter — but don't overdo cardinality
  5. Alerts are critical — monitoring without alerts is useless
  6. Document — in six months you'll forget what foo_bar_total means

Monitoring is a culture, not a tool. Start simple, iterate, improve. And your application will run stably while you sleep peacefully 😴


About Peakline

This monitoring system was built for Peakline — a web application for Strava activity analysis. Peakline provides athletes with:

  • Detailed segment analysis with interactive maps
  • Historical weather data for every activity
  • Advanced FIT file generation for virtual races
  • Automatic GPX track error correction
  • Route planner

All these features require reliable monitoring to ensure quality user experience.


Questions? Leave them in the comments!

P.S. If you found this helpful — share with colleagues who might benefit!


About the Author

Solo developer building Peakline — tools for athletes. An athlete and enthusiast myself, I believe in automation, observability, and quality code. I'm continuing to develop the project and share my experience with the community in 2025.


Connect

  • 🌐 Peakline Website
  • 💬 Share your monitoring setup in comments
  • 📧 Questions? Drop a comment below!

Tags: #prometheus #grafana #monitoring #python #fastapi #devops #observability #sre #metrics #production
