Introduction
Once your web application hits production, the most critical question becomes: how is it performing right now? Logs tell you what happened, but you want to spot problems before users start complaining.
In this article, I'll share how I built a complete monitoring system for Peakline — a FastAPI application for Strava data analysis that processes thousands of requests daily from athletes worldwide.
What's Inside:
- Metrics architecture (HTTP, API, business metrics)
- Prometheus + Grafana setup from scratch
- 50+ production-ready metrics
- Advanced PromQL queries
- Reactive dashboards
- Best practices and pitfalls
Architecture: Three Monitoring Levels
Modern monitoring isn't just "set up Grafana and look at graphs." It's a well-thought-out architecture with several layers:
┌─────────────────────────────────────────────────┐
│              FastAPI Application                │
│  ├── HTTP Middleware (auto-collect metrics)     │
│  ├── Business Logic (business metrics)          │
│  └── /metrics endpoint (Prometheus format)      │
└──────────────────┬──────────────────────────────┘
                   │ scrape every 5s
┌──────────────────▼──────────────────────────────┐
│                  Prometheus                     │
│  ├── Time Series Database (TSDB)                │
│  ├── Storage retention: 200h                    │
│  └── PromQL Engine                              │
└──────────────────┬──────────────────────────────┘
                   │ query data
┌──────────────────▼──────────────────────────────┐
│                   Grafana                       │
│  ├── Dashboards                                 │
│  ├── Alerting                                   │
│  └── Visualization                              │
└─────────────────────────────────────────────────┘
Why This Stack?
Prometheus — the de facto standard for metrics. Pull model, powerful PromQL query language, excellent Kubernetes integration.
Grafana — the best visualization tool. Beautiful dashboards, alerting, templating, rich UI.
FastAPI — an async Python framework that integrates easily with metrics via the prometheus_client library.
Basic Infrastructure Setup
Docker Compose: 5-Minute Quick Start
First, let's spin up Prometheus and Grafana in Docker:
# docker-compose.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=200h'  # 8+ days of history
      - '--web.enable-lifecycle'
    networks:
      - monitoring
    extra_hosts:
      - "host.docker.internal:host-gateway"  # Access host machine

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}  # Use .env!
      - GF_SERVER_ROOT_URL=/grafana  # For nginx reverse proxy
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
Key Points:
- storage.tsdb.retention.time=200h — keep metrics for 8+ days (for weekly analysis)
- extra_hosts: host.docker.internal — allows Prometheus to reach the app on the host
- Volumes for data persistence
Prometheus Configuration
# monitoring/prometheus.yml
global:
  scrape_interval: 15s       # How often to collect metrics
  evaluation_interval: 15s   # How often to check alerts

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'webapp'
    static_configs:
      - targets: ['host.docker.internal:8000']  # Your app port
    scrape_interval: 5s      # More frequent for web apps
    metrics_path: /metrics
Important: a 5s scrape_interval for web apps is a balance between data freshness and system load; in production, 15-30s is more typical.
Grafana Datasource Provisioning
To avoid manual Prometheus setup in Grafana, use provisioning:
# monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
Now Grafana automatically connects to Prometheus on startup. Start the whole stack:
docker-compose up -d
Level 1: HTTP Metrics
The most basic but critically important layer is HTTP request monitoring: a middleware automatically collects metrics for every incoming request.
Metrics Initialization
# webapp/main.py
from prometheus_client import Counter, Histogram, CollectorRegistry, generate_latest, CONTENT_TYPE_LATEST
from fastapi import FastAPI, Request
from fastapi.responses import PlainTextResponse
import time

app = FastAPI(title="Peakline", version="2.0.0")

# Create separate registry for metrics isolation
registry = CollectorRegistry()

# Counter: monotonically increasing value (request count)
http_requests_total = Counter(
    'http_requests_total',
    'Total number of HTTP requests',
    ['method', 'endpoint', 'status_code'],  # Labels for grouping
    registry=registry
)

# Histogram: distribution of values (execution time)
http_request_duration_seconds = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    ['method', 'endpoint'],
    registry=registry
)

# API call counters
api_calls_total = Counter(
    'api_calls_total',
    'Total number of API calls by type',
    ['api_type'],
    registry=registry
)

# Separate error counters
http_errors_4xx_total = Counter(
    'http_errors_4xx_total',
    'Total number of 4xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)

http_errors_5xx_total = Counter(
    'http_errors_5xx_total',
    'Total number of 5xx HTTP errors',
    ['endpoint', 'status_code'],
    registry=registry
)
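One side effect of using a separate CollectorRegistry: the default process and runtime collectors that prometheus_client normally registers on its global registry (process CPU, open file descriptors, Python GC stats) won't appear at /metrics. If you want them, here's a minimal sketch of re-registering them on the custom registry (process metrics are only available on Linux):

from prometheus_client import ProcessCollector, PlatformCollector, GCCollector

# Re-register the default runtime collectors on our isolated registry
ProcessCollector(registry=registry)   # process_cpu_seconds_total, RSS, open fds, ...
PlatformCollector(registry=registry)  # python_info (interpreter version)
GCCollector(registry=registry)        # python_gc_* collection stats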
Middleware for Automatic Collection
The magic happens in middleware — it wraps every request:
@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()

    # Execute request
    response = await call_next(request)
    duration = time.time() - start_time

    # Path normalization: /api/activities/12345 → /api/activities/{id}
    path = request.url.path
    if path.startswith('/api/'):
        parts = path.split('/')
        if len(parts) > 3 and parts[3].isdigit():
            parts[3] = '{id}'
            path = '/'.join(parts)

    # Record metrics
    http_requests_total.labels(
        method=request.method,
        endpoint=path,
        status_code=str(response.status_code)
    ).inc()

    http_request_duration_seconds.labels(
        method=request.method,
        endpoint=path
    ).observe(duration)

    # Track API calls
    if path.startswith('/api/'):
        api_type = path.split('/')[2] if len(path.split('/')) > 2 else 'unknown'
        api_calls_total.labels(api_type=api_type).inc()

    # Track errors separately
    status_code = response.status_code
    if 400 <= status_code < 500:
        http_errors_4xx_total.labels(endpoint=path, status_code=str(status_code)).inc()
    elif status_code >= 500:
        http_errors_5xx_total.labels(endpoint=path, status_code=str(status_code)).inc()

    return response
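One gap worth knowing about: if call_next raises an unhandled exception, no response object ever comes back, so that request is never counted. Here's a minimal sketch of an exception-safe variant of the same middleware (path normalization, API-call and 4xx tracking omitted for brevity; they stay exactly as above):

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start_time = time.time()
    path = request.url.path  # apply the same /api/.../{id} normalization as above
    status_code = 500        # assume the worst until we know better
    try:
        response = await call_next(request)
        status_code = response.status_code
        return response
    finally:
        # Runs for both normal responses and raised exceptions
        duration = time.time() - start_time
        http_requests_total.labels(
            method=request.method,
            endpoint=path,
            status_code=str(status_code)
        ).inc()
        http_request_duration_seconds.labels(
            method=request.method,
            endpoint=path
        ).observe(duration)
        if status_code >= 500:
            http_errors_5xx_total.labels(endpoint=path, status_code=str(status_code)).inc()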
Key Techniques:
- Path normalization — critically important! Without it, you'll get thousands of unique time series for /api/activities/1, /api/activities/2, etc.
- Labels — allow filtering and grouping metrics in PromQL
- Separate error counters — simplify alert writing
Metrics Endpoint
@app.get("/metrics")
async def metrics():
"""Prometheus metrics endpoint"""
return PlainTextResponse(
generate_latest(registry),
media_type=CONTENT_TYPE_LATEST
)
Now Prometheus can collect metrics from http://localhost:8000/metrics.
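Before pointing Prometheus at it, a quick local sanity check doesn't hurt. Here's a minimal sketch using FastAPI's TestClient, assuming the module path from the file comment above (webapp/main.py):

# test_metrics.py (hypothetical test module)
from fastapi.testclient import TestClient
from webapp.main import app

client = TestClient(app)

def test_metrics_endpoint():
    client.get("/metrics")  # warm-up request so at least one sample gets recorded
    response = client.get("/metrics")
    assert response.status_code == 200
    assert "http_requests_total" in response.text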
What We Get in Prometheus
# Metrics format in /metrics endpoint:
http_requests_total{method="GET",endpoint="/api/activities",status_code="200"} 1543
http_requests_total{method="POST",endpoint="/api/activities",status_code="201"} 89
http_request_duration_seconds_bucket{method="GET",endpoint="/api/activities",le="0.1"} 1234
Level 2: External API Metrics
Web applications often integrate with external APIs (Stripe, AWS, etc.). It's important to track not only your own requests but also dependencies.
External API Metrics
# External API metrics
external_api_calls_total = Counter(
    'external_api_calls_total',
    'Total number of external API calls by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_errors_total = Counter(
    'external_api_errors_total',
    'Total number of external API errors by endpoint type',
    ['endpoint_type'],
    registry=registry
)

external_api_latency_seconds = Histogram(
    'external_api_latency_seconds',
    'External API call latency in seconds',
    ['endpoint_type'],
    registry=registry
)
API Call Tracking Helper
Instead of duplicating code everywhere you call the API, create a universal wrapper:
async def track_external_api_call(endpoint_type: str, api_call_func, *args, **kwargs):
    """
    Universal wrapper for tracking API calls

    Usage:
        result = await track_external_api_call(
            'athlete_activities',
            client.get_athlete_activities,
            athlete_id=123
        )
    """
    start_time = time.time()

    try:
        # Increment call counter
        external_api_calls_total.labels(endpoint_type=endpoint_type).inc()

        # Execute API call
        result = await api_call_func(*args, **kwargs)

        # Record latency
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)

        # Check for API errors (status >= 400)
        if isinstance(result, Exception) or (hasattr(result, 'status') and result.status >= 400):
            external_api_errors_total.labels(endpoint_type=endpoint_type).inc()

        return result
    except Exception as e:
        # Record latency and error
        duration = time.time() - start_time
        external_api_latency_seconds.labels(endpoint_type=endpoint_type).observe(duration)
        external_api_errors_total.labels(endpoint_type=endpoint_type).inc()
        raise e
Usage in Code
@app.get("/api/activities")
async def get_activities(athlete_id: int):
# Instead of direct API call:
# activities = await external_client.get_athlete_activities(athlete_id)
# Use wrapper with tracking:
activities = await track_external_api_call(
'athlete_activities',
external_client.get_athlete_activities,
athlete_id=athlete_id
)
return activities
Now we can see:
- How many calls to each external API endpoint
- How many returned errors
- Latency for each call type
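If you'd rather not touch every call site, the same wrapper can be packaged as a decorator and applied where the client methods are defined. A small sketch under the same assumptions (async callables only; the decorated method in the usage comment is hypothetical):

import functools

def tracked_external_api(endpoint_type: str):
    """Decorator variant of track_external_api_call for async client methods."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            return await track_external_api_call(endpoint_type, func, *args, **kwargs)
        return wrapper
    return decorator

# Usage (hypothetical client method):
# @tracked_external_api('athlete_activities')
# async def get_athlete_activities(self, athlete_id: int): ...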
Level 3: Business Metrics
This is the most valuable part of monitoring — metrics that reflect actual application usage.
Business Metrics Types
# === Authentication ===
user_logins_total = Counter(
    'user_logins_total',
    'Total number of user logins',
    registry=registry
)

user_registrations_total = Counter(
    'user_registrations_total',
    'Total number of new user registrations',
    registry=registry
)

user_deletions_total = Counter(
    'user_deletions_total',
    'Total number of user deletions',
    registry=registry
)

# === File Operations ===
fit_downloads_total = Counter(
    'fit_downloads_total',
    'Total number of FIT file downloads',
    registry=registry
)

gpx_downloads_total = Counter(
    'gpx_downloads_total',
    'Total number of GPX file downloads',
    registry=registry
)

gpx_uploads_total = Counter(
    'gpx_uploads_total',
    'Total number of GPX file uploads',
    registry=registry
)

# === User Actions ===
settings_updates_total = Counter(
    'settings_updates_total',
    'Total number of user settings updates',
    registry=registry
)

feature_requests_total = Counter(
    'feature_requests_total',
    'Total number of feature requests',
    registry=registry
)

feature_votes_total = Counter(
    'feature_votes_total',
    'Total number of votes for features',
    registry=registry
)

# === Reports ===
manual_reports_total = Counter(
    'manual_reports_total',
    'Total number of manually created reports',
    registry=registry
)

auto_reports_total = Counter(
    'auto_reports_total',
    'Total number of automatically created reports',
    registry=registry
)

failed_reports_total = Counter(
    'failed_reports_total',
    'Total number of failed report creation attempts',
    registry=registry
)
Incrementing in Code
@app.post("/api/auth/login")
async def login(credentials: LoginCredentials):
user = await authenticate_user(credentials)
if user:
# Increment successful login counter
user_logins_total.inc()
return {"token": generate_token(user)}
return {"error": "Invalid credentials"}
@app.post("/api/activities/report")
async def create_report(activity_id: int, is_auto: bool = False):
try:
report = await generate_activity_report(activity_id)
# Different counters for manual and automatic reports
if is_auto:
auto_reports_total.inc()
else:
manual_reports_total.inc()
return report
except Exception as e:
failed_reports_total.inc()
raise e
Level 4: Performance and Caching
Cache Metrics
Caching is a critical part of performance, so track the hit rate:
cache_hits_total = Counter(
    'cache_hits_total',
    'Total number of cache hits',
    ['cache_type'],
    registry=registry
)

cache_misses_total = Counter(
    'cache_misses_total',
    'Total number of cache misses',
    ['cache_type'],
    registry=registry
)

# In caching code:
async def get_from_cache(key: str, cache_type: str = 'generic'):
    value = await cache.get(key)
    if value is not None:
        cache_hits_total.labels(cache_type=cache_type).inc()
        return value
    else:
        cache_misses_total.labels(cache_type=cache_type).inc()
        return None
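To guarantee that every read actually goes through those counters, it helps to funnel reads into one get-or-compute helper. A sketch that reuses get_from_cache above; the cache.set(key, value, ttl) signature is an assumption about your cache client:

async def get_or_compute(key: str, compute_func, cache_type: str = 'generic', ttl: int = 300):
    """Read-through cache: counts the hit/miss, computes and stores the value on a miss."""
    value = await get_from_cache(key, cache_type=cache_type)  # increments hit/miss counters
    if value is not None:
        return value

    value = await compute_func()
    await cache.set(key, value, ttl)  # assumed cache client API
    return value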
Background Task Metrics
If you have background tasks (Celery, APScheduler), track them:
background_task_duration_seconds = Histogram(
    'background_task_duration_seconds',
    'Background task execution time',
    ['task_type'],
    registry=registry
)

async def run_background_task(task_type: str, task_func, *args, **kwargs):
    start_time = time.time()
    try:
        result = await task_func(*args, **kwargs)
        return result
    finally:
        duration = time.time() - start_time
        background_task_duration_seconds.labels(task_type=task_type).observe(duration)
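For example, with APScheduler the wrapper slots in at job registration time, so every scheduled run gets timed. A sketch assuming an asyncio scheduler and a hypothetical sync_activities coroutine defined elsewhere in the app:

from apscheduler.schedulers.asyncio import AsyncIOScheduler

scheduler = AsyncIOScheduler()

# sync_activities is a hypothetical async job defined elsewhere
scheduler.add_job(
    run_background_task, 'interval', minutes=30,
    args=['activity_sync', sync_activities],
)

@app.on_event("startup")
async def start_scheduler():
    scheduler.start()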
PromQL: Metrics Query Language
Prometheus uses its own query language — PromQL. It's not SQL, but it's very powerful.
Basic Queries
# 1. Just get metric (instant vector)
http_requests_total
# 2. Filter by labels
http_requests_total{method="GET"}
http_requests_total{status_code="200"}
http_requests_total{method="GET", endpoint="/api/activities"}
# 3. Regular expressions in labels
http_requests_total{status_code=~"5.."} # All 5xx errors
http_requests_total{endpoint=~"/api/.*"} # All API endpoints
# 4. Time interval (range vector)
http_requests_total[5m] # Data for last 5 minutes
Rate and irate: Rate of Change
Counter constantly grows, but we need rate of change — RPS (requests per second):
# Rate - average rate over interval
rate(http_requests_total[5m])
# irate - instantaneous rate (between last two points)
irate(http_requests_total[5m])
When to use what:
- rate() — for alerts and trend graphs (smooths spikes)
- irate() — for detailed analysis (shows peaks)
Aggregation with sum, avg, max
# Total app RPS
sum(rate(http_requests_total[5m]))
# RPS by method
sum(rate(http_requests_total[5m])) by (method)
# RPS by endpoint, sorted
sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))
# Average latency
avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m]))
Histogram and Percentiles
For Histogram metrics (latency, duration) use histogram_quantile:
# P50 (median) latency
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# P99 latency (99% of requests faster than this)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# P95 per endpoint (keep the "le" label so histogram_quantile still works)
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
Complex Queries
1. Success Rate (percentage of successful requests)
(
sum(rate(http_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
2. Error Rate (percentage of errors)
(
sum(rate(http_requests_total{status_code=~"4..|5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
3. Cache Hit Rate
(
sum(rate(cache_hits_total[5m]))
/
(sum(rate(cache_hits_total[5m])) + sum(rate(cache_misses_total[5m])))
) * 100
4. Top-5 Slowest Endpoints
topk(5,
histogram_quantile(0.95,
rate(http_request_duration_seconds_bucket[5m])
) by (endpoint)
)
5. API Health Score (0-100)
(
(
sum(rate(external_api_calls_total[5m]))
-
sum(rate(external_api_errors_total[5m]))
)
/
sum(rate(external_api_calls_total[5m]))
) * 100
Grafana Dashboards: Visualization
Now the fun part — turning raw metrics into beautiful and informative dashboards.
Dashboard 1: HTTP & Performance
Panel 1: Request Rate
sum(rate(http_requests_total[5m]))
- Type: Time series
- Color: Blue gradient
- Unit: requests/sec
- Legend: Total RPS
Panel 2: Success Rate
(
sum(rate(http_requests_total{status_code=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100
- Type: Stat
- Color: Green above 95%, yellow between 90% and 95%, red below 90%
- Unit: percent (0-100)
- Value: Current (last)
Panel 3: Response Time (P50, P95, P99)
# P50
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))
# P95
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# P99
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
- Type: Time series
- Unit: seconds (s)
- Legend: P50, P95, P99
Panel 4: Errors by Type
sum(rate(http_requests_total{status_code=~"4.."}[5m])) by (status_code)
sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (status_code)
- Type: Bar chart
- Colors: Yellow (4xx), Red (5xx)
Panel 5: Request Rate by Endpoint
sort_desc(sum(rate(http_requests_total[5m])) by (endpoint))
- Type: Bar chart
- Limit: Top 10
Dashboard 2: Business Metrics
This dashboard shows real product usage — what users do and how often.
Panel 1: User Activity (24h)
# Logins
increase(user_logins_total[24h])
# Registrations
increase(user_registrations_total[24h])
# Deletions
increase(user_deletions_total[24h])
- Type: Stat
- Layout: Horizontal
Panel 2: Downloads by Type
# rate() drops the metric name, so query each download counter explicitly instead of grouping by __name__
sum(rate(fit_downloads_total[5m]))
sum(rate(gpx_downloads_total[5m]))
- Type: Pie chart
- Legend: Right side
Panel 3: Feature Usage Timeline
rate(gpx_fixer_usage_total[5m])
rate(search_usage_total[5m])
rate(manual_reports_total[5m])
- Type: Time series
- Stacking: Normal
Dashboard 3: External API
Critical to monitor dependencies on external services — they can become bottlenecks.
Panel 1: API Health Score
(
sum(rate(external_api_calls_total[5m])) - sum(rate(external_api_errors_total[5m]))
) / sum(rate(external_api_calls_total[5m])) * 100
- Type: Gauge
- Min: 0, Max: 100
- Thresholds: 95 (green), 90 (yellow), 0 (red)
Panel 2: API Latency by Endpoint
histogram_quantile(0.95, sum by (le, endpoint_type) (rate(external_api_latency_seconds_bucket[5m])))
- Type: Bar chart
- Sort: Descending
Panel 3: Error Rate by Endpoint
sum(rate(external_api_errors_total[5m])) by (endpoint_type)
- Type: Bar chart
- Color: Red
Variables: Dynamic Dashboards
Grafana supports variables for interactive dashboards:
Creating a Variable
- Dashboard Settings → Variables → Add variable
- Name: endpoint
- Type: Query
- Query: label_values(http_requests_total, endpoint)
Using in Panels
# Filter by selected endpoint
sum(rate(http_requests_total{endpoint="$endpoint"}[5m]))
# Multi-select
sum(rate(http_requests_total{endpoint=~"$endpoint"}[5m])) by (endpoint)
Useful Variables
# Time interval
Variable: interval
Type: Interval
Values: 1m,5m,10m,30m,1h
# HTTP method
Variable: method
Query: label_values(http_requests_total, method)
# Status code
Variable: status_code
Query: label_values(http_requests_total, status_code)
Alerting: System Reactivity
Monitoring without alerts is like a car without brakes. Let's set up smart alerts.
Grafana Alerting
Alert 1: High Error Rate
(
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) * 100 > 1
- Condition: > 1 (more than 1% errors)
- For: 5m (for 5 minutes)
- Severity: Critical
- Notification: Slack, Email, Telegram
Alert 2: High Latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
- Condition: P95 > 2 seconds
- For: 10m
- Severity: Warning
Alert 3: External API Down
sum(rate(external_api_errors_total[5m])) / sum(rate(external_api_calls_total[5m])) > 0.5
- Condition: More than 50% API errors
- For: 2m
- Severity: Critical
Alert 4: No Data
absent_over_time(http_requests_total[10m])
- Condition: No metrics for 10 minutes
- Severity: Critical
- Means: app crashed or Prometheus can't collect metrics
Best Practices: Battle-Tested Experience
1. Labels: Don't Overdo It
❌ Bad:
# Too detailed labels = cardinality explosion
http_requests_total.labels(
    method=request.method,
    endpoint=request.url.path,   # Every unique URL!
    user_id=str(user.id),        # Thousands of users!
    timestamp=str(time.time())   # Infinite values!
).inc()
✅ Good:
# Normalized endpoints + limited label set
http_requests_total.labels(
    method=request.method,
    endpoint=normalize_path(request.url.path),  # /api/users/{id}
    status_code=str(response.status_code)
).inc()
Rule: High-cardinality data (user_id, timestamps, unique IDs) should NOT be labels.
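The normalize_path helper used above is just a reusable version of the inline normalization from the middleware section. Here's a sketch that also collapses UUIDs and long hex tokens; the regexes are assumptions about your URL scheme, so adjust them to your routes:

import re

_UUID_RE = re.compile(r'^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$', re.IGNORECASE)
_HEX_RE = re.compile(r'^[0-9a-f]{16,}$', re.IGNORECASE)

def normalize_path(path: str) -> str:
    """Collapse high-cardinality path segments into fixed placeholders."""
    parts = []
    for segment in path.split('/'):
        if segment.isdigit():
            parts.append('{id}')
        elif _UUID_RE.match(segment) or _HEX_RE.match(segment):
            parts.append('{uuid}')
        else:
            parts.append(segment)
    return '/'.join(parts)

# normalize_path('/api/activities/12345') -> '/api/activities/{id}'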
2. Naming Convention
Follow Prometheus naming conventions:
# Good names:
http_requests_total # <namespace>_<name>_<unit>
external_api_latency_seconds # Unit in name
cache_hits_total # Clear it's a Counter
# Bad names:
RequestCount # CamelCase instead of snake_case
api-latency # Don't use dashes
request_time # Unit not specified
3. Rate() Interval
The rate() window should be at least 4x the scrape_interval:
# If scrape_interval = 15s
rate(http_requests_total[1m]) # 4x = 60s ✅
rate(http_requests_total[30s]) # 2x = poor accuracy ❌
4. Histogram Buckets
Proper buckets are critical for accurate percentiles:
# Default (bad for latency):
Histogram('latency_seconds', 'Latency') # [.005, .01, .025, .05, .1, ...]
# Custom buckets for web latency:
Histogram(
    'http_request_duration_seconds',
    'Request latency',
    buckets=[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
Principle: Buckets should cover the typical range of values.
5. Metrics Cost
Every metric costs memory. Let's calculate:
Memory = Series count × (~3KB per series)
Series = Metric × Label combinations
Example:
# 1 metric × 5 methods × 20 endpoints × 15 status codes = 1,500 series
http_requests_total{method, endpoint, status_code}
# 1,500 × 3KB = ~4.5MB for one metric!
Tip: Regularly check cardinality:
# Top metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))
Production Checklist
Before launching in production, check:
- [ ] Retention policy configured (storage.tsdb.retention.time)
- [ ] Disk space monitored (Prometheus can take a lot of space)
- [ ] Backups configured for Grafana dashboards
- [ ] Alerts tested (create an artificial error)
- [ ] Notification channels work (send a test alert)
- [ ] Access control configured (don't leave Grafana with admin/admin!)
- [ ] HTTPS configured for Grafana (via nginx reverse proxy)
- [ ] Cardinality checked (topk(10, count by (__name__)({__name__=~".+"})))
- [ ] Documentation created (what each metric is responsible for)
- [ ] On-call process defined (who gets alerts and what to do)
Real Case: Finding a Problem
Imagine: users complain about slow performance. Here's how monitoring helps find and fix the problem in minutes.
Step 1: Open Grafana → HTTP Performance Dashboard
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
We see: P95 latency jumped from 0.2s to 3s.
Step 2: Check latency by endpoint
topk(5, histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))))
Found: /api/activities — 5 seconds!
Step 3: Check external APIs
histogram_quantile(0.95, sum by (le, endpoint_type) (rate(external_api_latency_seconds_bucket[5m])))
External API athlete_activities — 4.8 seconds. There's the problem!
Step 4: Check error rate
rate(external_api_errors_total{endpoint_type="athlete_activities"}[5m])
No errors, just slow. So the problem isn't on our side; the external service is lagging.
Solution:
- Add aggressive caching for the external API (TTL 5 minutes; see the sketch after this list)
- Set up alert for latency > 2s
- Add timeout to requests
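A minimal sketch of the caching and timeout pieces, assuming a simple in-process dict is enough for one worker (the key format, TTL and timeout values are illustrative, not the production values):

import asyncio

_activities_cache = {}   # key -> (stored_at, value); swap for your real cache layer in production
CACHE_TTL = 300          # 5 minutes
API_TIMEOUT = 2.0        # seconds

async def get_activities_cached(athlete_id: int):
    key = f"athlete_activities:{athlete_id}"
    cached = _activities_cache.get(key)
    if cached and time.time() - cached[0] < CACHE_TTL:
        cache_hits_total.labels(cache_type='athlete_activities').inc()
        return cached[1]

    cache_misses_total.labels(cache_type='athlete_activities').inc()
    # Bound the external call so one slow dependency can't stall the endpoint
    activities = await asyncio.wait_for(
        track_external_api_call(
            'athlete_activities',
            external_client.get_athlete_activities,
            athlete_id=athlete_id,
        ),
        timeout=API_TIMEOUT,
    )
    _activities_cache[key] = (time.time(), activities)
    return activities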
Step 5: After deploy, verify
# Cache hit rate
(cache_hits_total / (cache_hits_total + cache_misses_total)) * 100
Hit rate 85% → latency dropped to 0.3s. Victory! 🎉
What's Next?
You've built a production-ready monitoring system. But this is just the beginning:
Next Steps:
- Distributed Tracing — add Jaeger/Tempo for request tracing
- Logging — integrate Loki for centralized logs
- Custom Dashboards — create dashboards for business (not just DevOps)
- SLO/SLI — define Service Level Objectives
- Anomaly Detection — use machine learning to flag unusual metric behavior
- Cost Monitoring — add cost metrics (AWS CloudWatch, etc.)
Conclusion
A monitoring system isn't "set it and forget it." It's a living organism that needs to evolve with your application. But the basic architecture we've built scales from startup to enterprise.
Key Takeaways:
- Three metric levels: HTTP (infrastructure) → API (dependencies) → Business (product)
- Middleware automates basic metrics collection
- PromQL is powerful — learn gradually
- Labels matter — but don't overdo cardinality
- Alerts are critical — monitoring without alerts is useless
- Document — in six months you'll forget what foo_bar_total means
Monitoring is a culture, not a tool. Start simple, iterate, improve. And your application will run stably while you sleep peacefully 😴
About Peakline
This monitoring system was built for Peakline — a web application for Strava activity analysis. Peakline provides athletes with:
- Detailed segment analysis with interactive maps
- Historical weather data for every activity
- Advanced FIT file generation for virtual races
- Automatic GPX track error correction
- Route planner
All these features require reliable monitoring to ensure quality user experience.
Questions? Leave them in the comments!
P.S. If you found this helpful — share with colleagues who might benefit!
About the Author
Solo developer building Peakline — tools for athletes. An athlete and enthusiast myself, I believe in automation, observability, and quality code. I'm continuing to develop the project and share my experience with the community in 2025.
Connect
- 🌐 Peakline Website
- 💬 Share your monitoring setup in comments
- 📧 Questions? Drop a comment below!
Tags: #prometheus #grafana #monitoring #python #fastapi #devops #observability #sre #metrics #production