From Zero to Observability: Building a Production-Grade Monitoring Stack with Prometheus & Grafana
Introduction
In today’s cloud-native world, monitoring isn’t optional — it’s essential. Whether you’re running a small side project or managing enterprise infrastructure, you need visibility into your systems. But setting up monitoring shouldn’t require weeks of configuration and a PhD in DevOps.
In this comprehensive guide, I’ll walk you through building a production-ready monitoring stack using three powerful open-source tools:
- Docker for containerization
- Prometheus for metrics collection
- Grafana for visualization
By the end of this tutorial, you’ll have:
- A fully functional monitoring stack running in containers
- Real-time system metrics from your infrastructure
- Beautiful, interactive dashboards
- Knowledge to extend and customize for your needs
Time to complete: 30 minutes
Skill level: Beginner to Intermediate
Prerequisites: Basic command-line knowledge, Docker installed
Why This Stack?
The Problem
Traditional monitoring setups are often:
- Complex: multiple services, complicated configurations
- Expensive: enterprise solutions cost thousands per month
- Inflexible: vendor lock-in limits customization
- Hard to scale: difficult to add new metrics or exporters
The Solution
Our stack solves these problems:
- Simple: deploy everything with one command
- Free & open source: no licensing costs
- Highly customizable: full control over metrics and dashboards
- Scalable: easy to add exporters and federate Prometheus
Architecture Overview
Here's what we're building:

**Components:**

- Prometheus: collects and stores time-series metrics
- Grafana: creates beautiful dashboards and visualizations
- Node Exporter: exposes system-level metrics (CPU, RAM, disk)
- Application Exporter: custom metrics from your applications
Part 1: Setting Up the Foundation
Step 1: Prepare Your Environment
First, ensure you have Docker and Docker Compose installed:
```bash
# Check Docker version
docker --version
# Docker version 20.10.0 or higher required

# Check Docker Compose version
docker-compose --version
# Docker Compose version 2.20.0 or higher recommended
```
Step 2: Create Project Structure
```bash
# Create project directory
mkdir monitoring-stack && cd monitoring-stack

# Create necessary directories
mkdir -p prometheus grafana src
```
Step 3: Configure Prometheus
Create prometheus/prometheus.yml:
```yaml
global:
  scrape_interval: 15s      # Scrape targets every 15 seconds
  evaluation_interval: 15s  # Evaluate rules every 15 seconds

scrape_configs:
  # Prometheus monitors itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter: system metrics
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['host.docker.internal:9100']
    scrape_interval: 15s

  # Custom application metrics
  - job_name: 'application'
    static_configs:
      - targets: ['host.docker.internal:8000']
    metrics_path: '/metrics'
    scrape_interval: 5s
```
## What’s happening here?
- `scrape_interval`: how often Prometheus collects metrics
- `job_name`: logical grouping for targets
- `targets`: where to find metrics endpoints
- `host.docker.internal`: lets containers reach the host machine
## Part 2: Docker Compose Configuration
Create `docker-compose.yml` in your project root:
```yaml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
    restart: unless-stopped
    networks:
      - monitoring
    depends_on:
      - prometheus

volumes:
  prometheus_data:
  grafana_data:

networks:
  monitoring:
    driver: bridge
```
## Key Configuration Details
1. Ports: Prometheus on 9091, Grafana on 3000
2. Volumes: Persist data even if containers restart
3. Networks: Isolated bridge network for service communication
4. Retention: Keep metrics for 30 days
5. Restart Policy: Automatically restart on failure
## Part 3: Installing Node Exporter
Node Exporter provides system-level metrics. Install it on your host machine:
```bash
# Create a dedicated user
sudo useradd --no-create-home --shell /bin/false node_exporter

# Download Node Exporter
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

# Extract and install
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
```
Create a systemd service at `/etc/systemd/system/node_exporter.service`:
```ini
[Unit]
Description=Node Exporter
After=network.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
```
Start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter

# Verify it's running
curl http://localhost:9100/metrics | head -20
```
You should see metrics output like:
```
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_cpu_seconds_total{cpu="0",mode="user"} 890.12
```
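The exposition format above is plain text, which makes it easy to sanity-check with a few lines of code. As an illustration, here is a minimal, simplified parser for a single sample line; it ignores label-value escaping, timestamps, and the `HELP`/`TYPE` metadata:

```python
import re

def parse_metric_line(line):
    """Parse one sample line of the Prometheus text exposition format
    into (name, labels, value). Simplified: skips comments and blank
    lines, and ignores escaping and optional timestamps.
    """
    line = line.strip()
    if not line or line.startswith('#'):
        return None
    m = re.match(r'([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)', line)
    if m is None:
        return None
    name, labelstr, value = m.groups()
    labels = dict(re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"', labelstr or ''))
    return name, labels, float(value)

name, labels, value = parse_metric_line(
    'node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67')
```

This is handy for quick smoke tests of your own exporters before wiring them into Prometheus.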
Part 4: Creating a Custom Metrics Exporter
Let’s create a simple Python application that exposes custom metrics.
Create src/metrics_exporter.py:
```python
#!/usr/bin/env python3
"""
Simple Prometheus Metrics Exporter

Demonstrates how to instrument your applications.
"""
from prometheus_client import start_http_server, Counter, Gauge, Histogram
import psutil
import random
import time

# Define application metrics
request_count = Counter(
    'app_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

active_users = Gauge(
    'app_active_users',
    'Number of active users'
)

response_time = Histogram(
    'app_response_time_seconds',
    'Response time in seconds',
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0]
)

# System metrics
cpu_gauge = Gauge('system_cpu_percent', 'CPU usage percentage')
memory_gauge = Gauge('system_memory_percent', 'Memory usage percentage')
disk_gauge = Gauge('system_disk_percent', 'Disk usage percentage')


def collect_system_metrics():
    """Collect system metrics using psutil."""
    cpu_gauge.set(psutil.cpu_percent(interval=1))
    memory_gauge.set(psutil.virtual_memory().percent)
    disk_gauge.set(psutil.disk_usage('/').percent)


def simulate_application_activity():
    """Simulate application metrics for demo purposes."""
    methods = ['GET', 'POST', 'PUT', 'DELETE']
    endpoints = ['/api/users', '/api/orders', '/api/products']
    statuses = [200, 201, 400, 404, 500]

    # Simulate a request
    method = random.choice(methods)
    endpoint = random.choice(endpoints)
    status = random.choices(statuses, weights=[85, 10, 3, 1, 1])[0]
    request_count.labels(method=method, endpoint=endpoint, status=status).inc()

    # Simulate response time
    response_time.observe(random.uniform(0.05, 2.0))

    # Update active users
    active_users.set(random.randint(10, 100))


def main():
    """Main exporter loop."""
    # Start metrics server on port 8000
    PORT = 8000
    start_http_server(PORT)
    print(f"Metrics server started on port {PORT}")
    print(f"Metrics available at http://localhost:{PORT}/metrics")

    # Refresh metrics until interrupted
    while True:
        collect_system_metrics()
        simulate_application_activity()
        time.sleep(5)


if __name__ == '__main__':
    main()
```
Create `requirements.txt`:
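The original file is shown only as an image; from the script's imports, it needs at least these two packages (pin versions as appropriate for your environment):

```
prometheus-client
psutil
```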
Create `start_exporter.sh`:

```bash
#!/usr/bin/env bash

# Check if Python is installed
if ! command -v python3 &> /dev/null; then
    echo "Python 3 is not installed"
    exit 1
fi

# Install dependencies
pip3 install -r requirements.txt

# Start the exporter
python3 src/metrics_exporter.py
```
Part 5: Launching the Stack
Now we're ready to start everything:

```bash
# Start Prometheus and Grafana
docker-compose up -d

# Check if containers are running
docker-compose ps
```

Expected output:

```
NAME         IMAGE                     STATUS
grafana      grafana/grafana:latest    Up
prometheus   prom/prometheus:latest    Up
```
## Access your services
- Prometheus: http://localhost:9091
- Grafana: http://localhost:3000 (admin / admin)
- Node Exporter metrics: http://localhost:9100/metrics
- Application metrics: http://localhost:8000/metrics
## Part 6: Configuring Grafana
**Step 1: Add Prometheus as a Data Source**
1. Open Grafana at http://localhost:3000
2. Log in with `admin` / `admin` (change the password when prompted)
3. Go to Configuration → Data Sources
4. Click Add data source
5. Select Prometheus
6. Set URL: `http://prometheus:9090`
7. Click Save & Test

You should see: "Data source is working"
**Step 2: Import a Dashboard**
1. Go to Dashboards → Import
2. Enter dashboard ID: 1860 (Node Exporter Full)
3. Click Load
4. Select Prometheus as the data source
5. Click Import
You now have a beautiful dashboard showing:
- CPU usage across all cores
- Memory utilization
- Disk space and I/O
- Network traffic
- System load
## Part 7: Creating Custom Dashboards
Let’s create a custom dashboard for our application metrics.
**Step 1: Create a New Dashboard**
Click **+** → **Create Dashboard**
Click **Add new panel**
Step 2: Add a Request Rate Panel
Query:

```promql
rate(app_requests_total[5m])
```

Panel settings:
- Title: "HTTP Request Rate"
- Visualization: Time series
- Legend: `{{method}} {{endpoint}}`
Step 3: Add Active Users Panel
Query:

```promql
app_active_users
```

Panel settings:
- Title: "Active Users"
- Visualization: Stat
- Color: based on thresholds (green < 50, yellow < 80, red >= 80)
Step 4: Add Response Time Panel
Query:

```promql
histogram_quantile(0.95, rate(app_response_time_seconds_bucket[5m]))
```

Panel settings:
- Title: "95th Percentile Response Time"
- Visualization: Gauge
- Unit: seconds
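To build intuition for what `histogram_quantile` computes, here is a small Python sketch of the same idea: find the first cumulative bucket that covers the target rank, then interpolate linearly inside it. The real PromQL function handles edge cases (such as the `+Inf` bucket) more carefully, so treat this as a simplified model:

```python
def histogram_quantile(q, buckets):
    """Approximate quantile q from cumulative histogram buckets.

    buckets: sorted list of (upper_bound, cumulative_count) pairs,
    conventionally ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    target = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= target:
            if bound == float('inf'):
                return prev_bound  # quantile falls in the open-ended bucket
            # Linear interpolation within the bucket
            return prev_bound + (bound - prev_bound) * (target - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Buckets matching the exporter above: 95 of 100 requests finish within 1.0 s
buckets = [(0.1, 50), (0.5, 80), (1.0, 95), (2.0, 99), (5.0, 100), (float('inf'), 100)]
p95 = histogram_quantile(0.95, buckets)
```

This also explains why bucket boundaries matter: the estimate can only be as precise as the buckets you configured in the exporter.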
Step 5: Add CPU Usage Panel
Query:

```promql
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```

Panel settings:
- Title: "CPU Usage %"
- Visualization: Graph
- Thresholds: yellow at 60%, red at 80%
Click Save dashboard and give it a name like “Application Monitoring”.
Part 8: Understanding PromQL
Prometheus Query Language (PromQL) is powerful. Here are essential queries:
Basic Queries

```promql
# Get current value
node_memory_MemTotal_bytes

# Rate of change over 5 minutes
rate(node_cpu_seconds_total[5m])

# Average across all instances
avg(node_load1)
```
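`rate()` is worth understanding precisely: it computes the per-second increase of a counter over the window, treating any decrease as a counter reset. A simplified Python model of that logic (the real implementation also extrapolates to the window boundaries):

```python
def simple_rate(samples, window_seconds):
    """Per-second increase of a counter from (timestamp, value) samples,
    handling counter resets the way rate() does: a drop in value means
    the counter restarted from zero, so the new value is pure increase.
    """
    increase = 0.0
    for (_, v0), (_, v1) in zip(samples, samples[1:]):
        increase += (v1 - v0) if v1 >= v0 else v1
    return increase / window_seconds

# A counter growing by 30 every 15 s yields a rate of 2 per second
two_per_sec = simple_rate([(0, 0), (15, 30), (30, 60)], 30)
```

This is also why you apply `rate()` to counters but never to gauges: a gauge going down is real data, not a reset.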
Part 9: Setting Up Alerts
Alerts notify you when things go wrong. Let’s configure some.
Create prometheus/alerts.yml:
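The original rules file is shown only as an image; here is a representative example (alert names and thresholds are illustrative, adjust them to your environment):

```yaml
groups:
  - name: system_alerts
    rules:
      # Fire when average CPU usage stays above 80% for 5 minutes
      - alert: HighCPUUsage
        expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

      # Fire when a scrape target has been down for 2 minutes
      - alert: TargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.job }} is down"
```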
Update prometheus/prometheus.yml to include alerts:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load alert rules
rule_files:
  - '/etc/prometheus/alerts.yml'

scrape_configs:
  # ... (existing scrape configs)
```
Update docker-compose.yml to mount the alerts file:
```yaml
prometheus:
  # ... (existing config)
  volumes:
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    - ./prometheus/alerts.yml:/etc/prometheus/alerts.yml
    - prometheus_data:/prometheus
```
Restart Prometheus:
```bash
docker-compose restart prometheus
```
Check alerts at http://localhost:9091/alerts
## Part 10: Production Best Practices
**Security**
**1. Change Default Passwords**
Update `docker-compose.yml`:
```yaml
grafana:
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
```

Create a `.env` file:

```
GRAFANA_PASSWORD=your_secure_password_here
```
**2. Use Read-Only Volumes**
```yaml
volumes:
  - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
```
**3. Run as Non-Root User**
Resource Limits
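The original limits snippet is not shown; recent Docker Compose versions honor the `deploy.resources` section even outside Swarm, so a sketch might look like this (the limit values are illustrative):

```yaml
services:
  prometheus:
    deploy:
      resources:
        limits:
          cpus: '1.0'
          memory: 1G
```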
Backup Strategy
```bash
# Backup Prometheus data
docker run --rm \
  -v prometheus_data:/data \
  -v $(pwd)/backups:/backup \
  alpine tar czf /backup/prometheus-$(date +%Y%m%d).tar.gz /data

# Backup Grafana data
docker run --rm \
  -v grafana_data:/data \
  -v $(pwd)/backups:/backup \
  alpine tar czf /backup/grafana-$(date +%Y%m%d).tar.gz /data
```
High Availability
For production, consider:
Prometheus Federation — Multiple Prometheus instances
Thanos — Long-term storage and global view
Grafana HA — Multiple Grafana instances behind load balancer
Part 11: Troubleshooting Common Issues
**Issue 1: Container Won't Start**

```bash
# Check logs
docker-compose logs prometheus
docker-compose logs grafana
```

Common causes:
- Port already in use
- Configuration file syntax error
- Insufficient permissions
**Issue 2: Grafana Can't Connect to Prometheus**

Problem: the data source test fails.

Solution: from inside the Docker network, use the container name, not localhost:

```
# Correct
URL: http://prometheus:9090

# Wrong
URL: http://localhost:9091
```
**Issue 3: No Metrics Showing**

```bash
# Check Prometheus targets
curl http://localhost:9091/api/v1/targets | jq

# Verify exporters are reachable
curl http://localhost:9100/metrics
curl http://localhost:8000/metrics
```
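You can also check targets programmatically through the Prometheus HTTP API. A small stdlib-only sketch: `query_prometheus` hits the standard `/api/v1/query` endpoint, and `healthy_targets` (a helper name of my own) counts series where `up` equals 1:

```python
import json
import urllib.parse
import urllib.request

def query_prometheus(expr, base_url='http://localhost:9091'):
    """Run an instant query against the Prometheus HTTP API."""
    url = f'{base_url}/api/v1/query?query={urllib.parse.quote(expr)}'
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def healthy_targets(api_response):
    """Count series in an 'up' query result that report the value 1."""
    return sum(1 for series in api_response['data']['result']
               if series['value'][1] == '1')

# With the stack running: healthy_targets(query_prometheus('up'))
```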
**Issue 4: Data Not Persisting**

```bash
# Check volume mounts
docker inspect prometheus | grep -A 10 Mounts

# Fix permissions (Prometheus runs as UID 65534)
sudo chown -R 65534:65534 prometheus_data/
```
## Part 12: Extending Your Stack
Add MySQL Monitoring
Add Nginx Monitoring
Add Redis Monitoring
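Each of these follows the same two-step pattern: run the matching exporter and add a scrape job. For example, Redis monitoring with the community `redis_exporter` image might look like this (the image name and port 9121 are that exporter's conventional defaults; verify against its documentation):

```yaml
# docker-compose.yml addition
redis_exporter:
  image: oliver006/redis_exporter:latest
  environment:
    - REDIS_ADDR=redis://redis:6379
  ports:
    - "9121:9121"
  networks:
    - monitoring
```

```yaml
# prometheus.yml addition
- job_name: 'redis'
  static_configs:
    - targets: ['redis_exporter:9121']
```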
## Part 13: Real-World Use Cases
**Use Case 1: E-commerce Platform**
Metrics to track:
1. Order processing rate
2. Payment gateway latency
3. Inventory stock levels
4. User cart abandonment rate
**Sample custom metrics:**
```python
from prometheus_client import Counter, Gauge, Histogram

orders_total = Counter('orders_total', 'Total orders', ['status'])
payment_duration = Histogram('payment_duration_seconds', 'Payment processing time')
inventory_stock = Gauge('inventory_stock', 'Product stock level', ['product_id'])
```
**Use Case 2: API Service**

Metrics to track:
- Request rate per endpoint
- Response time percentiles
- Error rates by status code
- Rate limiting hits
PromQL Queries:
```promql
# Requests per second by endpoint
sum by (endpoint) (rate(api_requests_total[1m]))

# 99th percentile latency
histogram_quantile(0.99, rate(api_duration_seconds_bucket[5m]))

# Error rate
sum(rate(api_requests_total{status=~"5.."}[5m])) / sum(rate(api_requests_total[5m]))
```
**Use Case 3: Batch Processing Pipeline**

Metrics to track:
- Job completion time
- Records processed per minute
- Failed jobs count
- Queue depth
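Instrumenting such a pipeline with `prometheus_client` takes only a few lines; a sketch (the metric names here are hypothetical, not from the original):

```python
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram

registry = CollectorRegistry()

records_processed = Counter(
    'batch_records_processed_total', 'Records processed', registry=registry)
failed_jobs = Counter(
    'batch_failed_jobs_total', 'Jobs that raised an error', registry=registry)
queue_depth = Gauge(
    'batch_queue_depth', 'Jobs waiting in the queue', registry=registry)
job_duration = Histogram(
    'batch_job_duration_seconds', 'Job completion time', registry=registry)

def run_job(records):
    """Process one batch job, recording duration, throughput, and failures."""
    try:
        with job_duration.time():  # observes elapsed seconds on exit
            for record in records:
                records_processed.inc()
    except Exception:
        failed_jobs.inc()
        raise
```

Expose these with `start_http_server` as in Part 4 and add a scrape job pointing at the pipeline host.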
Part 14: Performance Optimization
Optimize Prometheus Storage
Optimize Scrape Intervals
Use Recording Rules for Expensive Queries
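The recording-rule file is shown only as an image; a minimal example that precomputes the request rate under a new metric name:

```yaml
groups:
  - name: api_recording_rules
    interval: 30s
    rules:
      - record: job:api_request_rate:5m
        expr: rate(api_requests_total[5m])
```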
Then use the pre-computed metrics:
```promql
# Instead of this expensive query:
rate(api_requests_total[5m])

# Use this:
job:api_request_rate:5m
```
Conclusion
You’ve built a complete monitoring stack from scratch. Here’s what you’ve accomplished:
- Deployed a containerized monitoring infrastructure
- Configured Prometheus to collect metrics
- Created beautiful Grafana dashboards
- Instrumented a custom application
- Set up alerts for critical issues
- Learned PromQL for advanced queries
- Applied production best practices
Key Takeaways
- Docker makes deployment simple: one command starts everything
- Prometheus is powerful: time-series data with flexible querying
- Grafana is beautiful: create stunning, informative dashboards
- Monitoring is essential: know what's happening in your systems
- Start simple, extend gradually: add exporters as you need them
Next Steps
- Deploy to production: use Docker Swarm or Kubernetes
- Add more exporters: monitor databases, message queues, etc.
- Implement alerting: connect to Slack, PagerDuty, or email
- Long-term storage: integrate Thanos for infinite retention
- Advanced dashboards: create business-specific metrics
Resources
- GitHub repository: https://github.com/abidaslam892/Grafana-Prometheus-Monitoring-Deployment-
- Prometheus docs: https://prometheus.io/docs/
- Grafana dashboards: https://grafana.com/grafana/dashboards/
- PromQL guide: https://prometheus.io/docs/prometheus/latest/querying/basics/
- Docker docs: https://docs.docker.com/
Questions?
Feel free to reach out in the comments below! I’d love to hear:
What are you monitoring?
What challenges did you face?
What metrics matter most to your business?
If this guide helped you, please:
- ⭐ Star the GitHub repository
- 👏 Clap for this article
- 🔗 Share with your team
- 💬 Leave a comment
Happy monitoring! 📊
#Docker #Prometheus #Grafana #DevOps #Monitoring #Kubernetes #CloudNative #SRE #Infrastructure #Tutorial