Installing Pushgateway on Windows and macOS (Mac)
1. Quick Review: What Is Pushgateway?
Normally, Prometheus works by scraping metrics from targets (servers, applications, exporters).
However, scraping is not always possible.
When Pushgateway Is Used
Pushgateway is designed for scenarios such as:
- Short-lived jobs (batch jobs, cron jobs)
- Serverless workloads (e.g., AWS Lambda)
- Jobs behind load balancers
- Jobs that terminate before Prometheus can scrape them
In these cases:
- The application pushes metrics instead of being scraped
- You use a Prometheus client library (Python, Go, Java, etc.)
- Metrics are sent to Pushgateway
- Prometheus scrapes Pushgateway
Important Architecture Note
- Pushgateway does NOT replace Prometheus
- It is a component that works with Prometheus
- It does not need to be on the same server as Prometheus
- If you have only one server, installing them together is fine
2. Downloading Pushgateway
Step 1: Go to Prometheus Website
- Open: 👉 https://prometheus.io
- Navigate to Downloads
- Scroll down to Pushgateway
You will see platform-specific packages.
3. Installing Pushgateway on Windows
Step 1: Download
- Choose: Windows → windows-amd64.zip
Step 2: Extract
- Unzip the file
- You will get a folder containing:
pushgateway.exe
Step 3: Run Pushgateway
Open Command Prompt or PowerShell, navigate to the folder, and run:
pushgateway.exe --help
This confirms the binary works and shows available options.
4. Installing Pushgateway on macOS (Mac)
Important Note About macOS
At the time of this lecture:
- Pushgateway is NOT available via Homebrew
- It is NOT available via MacPorts
- You must install it manually
Step 1: Download
- Choose: Darwin → darwin-amd64.tar.gz
Step 2: Extract
tar -xvzf pushgateway-*.tar.gz
cd pushgateway-*
You will see the binary named:
pushgateway
Step 3: Run Pushgateway
./pushgateway --help
5. Running Pushgateway
Default Behavior
- Pushgateway listens on port 9091
- The same port is used for:
- Pushing metrics
- Scraping metrics
Default metrics endpoint:
/metrics
Example: Start Pushgateway on a Custom Port
To start Pushgateway on port 9092:
./pushgateway --web.listen-address=":9092"
Verify Pushgateway Is Running
Open your browser:
http://localhost:9092/metrics
You should see internal Pushgateway metrics, which means it’s working.
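Optional: Push a Test Metric with curl
Before wiring up any client library, you can sanity-check the push path with a throwaway metric. This is a minimal sketch; the metric name and job name are arbitrary placeholders:
echo "demo_metric 42" | curl --data-binary @- http://localhost:9092/metrics/job/demo_job
Reload http://localhost:9092/metrics and you should see demo_metric listed under the demo_job group.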
6. Pushgateway Configuration Options (Brief Overview)
Run:
./pushgateway --help
You will notice:
- Configuration file support (experimental)
- Admin API
- Used for deleting metrics
- In production, this should usually be disabled for security
Example (keep the admin API disabled, which is also the default):
./pushgateway --web.enable-admin-api=false
7. Connecting Pushgateway to Prometheus
Pushgateway must be added as a scrape target in Prometheus.
Step 1: Edit prometheus.yml
Add a new scrape job:
scrape_configs:
  - job_name: "pushgateway"
    static_configs:
      - targets:
          - "localhost:9092"
Step 2: Restart Prometheus
After restarting Prometheus:
- Go to Prometheus UI
- Navigate to Status → Targets
- You should see pushgateway listed as UP
8. Key Concept to Remember
- Pushgateway behaves like an exporter
- Prometheus scrapes Pushgateway
- Applications push metrics to Pushgateway
- Prometheus never scrapes the application directly in this model
Installing Pushgateway on Ubuntu and Sending Metrics Using Python
1. What We Will Do in This Lecture
In this lecture, we will:
- Install Pushgateway on an Ubuntu server
- Run Pushgateway as a systemd service
- Send custom metrics to Pushgateway using Python
- Verify those metrics in Prometheus
In the previous lecture, we already:
- Installed Pushgateway on Mac/Windows
- Added Pushgateway as a scrape target in Prometheus
Now we move to real metric pushing.
2. Downloading Pushgateway for Ubuntu (Linux)
Step 1: Go to Prometheus Downloads
- Open: 👉 https://prometheus.io/download
- Scroll down to Pushgateway
- You will see three packages:
- Windows
- Linux (middle option)
- macOS (Darwin)
We need the Linux AMD64 package.
Step 2: Copy the Download URL
Right-click the Linux Pushgateway link and copy the full URL.
3. Installing Pushgateway on Ubuntu
Step 1: Connect to the Ubuntu Server
ssh ubuntu@<SERVER_IP>
(You can also connect as another user if applicable.)
Step 2: Download Pushgateway
wget <PASTE_PUSHGATEWAY_LINUX_URL_HERE>
Example:
wget https://github.com/prometheus/pushgateway/releases/download/v1.7.0/pushgateway-1.7.0.linux-amd64.tar.gz
Step 3: Extract the Package
tar -xvzf pushgateway-*.tar.gz
cd pushgateway-*
Inside the directory, you will see a binary named:
pushgateway
Step 4: Verify the Binary
./pushgateway --help
Why this is important:
- Confirms you downloaded the correct binary
- Shows configuration options
- Confirms default port 9091
Important defaults:
- Port: 9091
- Metrics endpoint: /metrics
- The same port is used for push and scrape
4. Installing Pushgateway as a systemd Service
Step 1: Move Binary to /usr/local/bin
sudo cp pushgateway /usr/local/bin/
Step 2: Set Ownership (Recommended)
If Prometheus is running as user prometheus:
sudo chown prometheus:prometheus /usr/local/bin/pushgateway
If Prometheus is not installed on this server, create a prometheus user and group first.
Step 3: Create systemd Service File
sudo nano /etc/systemd/system/pushgateway.service
Paste the following:
[Unit]
Description=Prometheus Pushgateway
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/pushgateway \
--web.listen-address=":9091" \
--web.enable-admin-api=false
Restart=always
[Install]
WantedBy=multi-user.target
Save and exit.
Step 4: Reload systemd and Start Service
sudo systemctl daemon-reload
sudo systemctl start pushgateway
sudo systemctl enable pushgateway
Step 5: Verify Service Status
systemctl status pushgateway
You should see:
Active: active (running)
Step 6: Verify in Browser
Open:
http://<SERVER_IP>:9091/metrics
You should see Pushgateway internal metrics.
5. Sending Metrics to Pushgateway Using Python
Now we will push custom metrics.
6. Installing Prometheus Python Client
Make sure Python 3 and pip are installed.
pip3 install prometheus-client
7. Why We Need a Custom Registry
The Prometheus client library has a default registry.
When pushing metrics:
- We must NOT use the default registry
- We must create a new CollectorRegistry
- This avoids metric name collisions
8. Python Code: Push Metrics to Pushgateway
Create a file:
nano push_metrics.py
Paste the following:
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
import time
# Create a new registry (NOT default)
registry = CollectorRegistry()
# Create a Gauge metric
job_runtime = Gauge(
'batch_job_runtime_seconds',
'Runtime of batch job',
registry=registry
)
# Set a value (example: current time)
job_runtime.set(time.time())
# Push metric to Pushgateway
push_to_gateway(
'localhost:9091',
job='demo_batch_job',
registry=registry
)
Save and exit.
9. Run the Python Script
python3 push_metrics.py
This sends the metric to Pushgateway.
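If you want to confirm the push before checking Prometheus, you can inspect Pushgateway directly (assuming it is listening on localhost:9091):
curl -s http://localhost:9091/metrics | grep batch_job_runtime_seconds
The metric should appear together with its job label.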
10. Verify Metric in Prometheus
- Open Prometheus UI
- Go to Graph
- Enter metric name:
batch_job_runtime_seconds
- Click Execute
You should see your metric value.
11. Key Takeaways (Very Important)
- Pushgateway is used when scraping is impossible
- Applications push metrics
- Prometheus scrapes Pushgateway
- Always use a custom CollectorRegistry
- Pushgateway should be treated as temporary storage
- Do not use Pushgateway for long-running services
Sending Metrics to Pushgateway from Jobs (Java & .NET)
In this lecture, we will learn how to send metrics to Pushgateway from jobs, instead of letting Prometheus scrape them directly.
We will cover:
- Sending metrics from a Java job
- Sending metrics from a .NET console application
- Understanding Collector Registry / Registry
- Verifying metrics in Prometheus
1. Why Pushgateway for Jobs?
Pushgateway is used when:
- Jobs are short-lived
- Jobs start and exit before Prometheus can scrape them
- Examples:
- Batch jobs
- CI/CD jobs
- One-time scripts
- Serverless executions
In these cases:
- The job pushes metrics
- Prometheus scrapes Pushgateway
- Pushgateway acts as temporary storage
2. Important Concept: Collector Registry (Very Important)
Prometheus client libraries store metrics in a structure called a Collector Registry.
Default Collector Registry
- Exists automatically
- Every metric you create is registered here by default
Why Default Registry Cannot Be Used with Pushgateway
When pushing metrics:
- You must not use the default registry
- Otherwise:
- Metrics may conflict
- Old metrics may mix with new ones
- Duplicate names cause problems
Correct Approach
- Create a new Collector Registry
- Register your metrics only in that registry
- Push that registry to Pushgateway
This rule applies to:
- Java
- Python
- .NET
- Any Prometheus client
PART 1: Sending Metrics from Java to Pushgateway
3. Java Project Setup (Quick Recap)
- Java project created
- Prometheus client already installed using Maven
- (Covered in earlier lecture, not repeated here)
4. Required Java Imports
import io.prometheus.client.Gauge;
import io.prometheus.client.CollectorRegistry;
import io.prometheus.client.exporter.PushGateway;
These give us:
- Gauge → metric type
- CollectorRegistry → custom registry
- PushGateway → pushing mechanism
5. Java Code: Push Metric to Pushgateway
public class PushGatewayJob {
    public static void main(String[] args) throws Exception {
        // Create Pushgateway instance
        PushGateway pushGateway = new PushGateway("localhost:9091");

        // Create custom registry (NOT default)
        CollectorRegistry registry = new CollectorRegistry();

        // Create Gauge and register it to custom registry
        Gauge jobGauge = Gauge.build()
                .name("java_pushgateway_job_metric")
                .help("Sample metric pushed from Java job")
                .register(registry);

        // Set value (example: current time)
        jobGauge.set(System.currentTimeMillis());

        // Push metrics
        pushGateway.push(registry, "java_batch_job");
    }
}
6. Key Points (Java)
- CollectorRegistry is mandatory
- The metric is registered using .register(registry)
- The job name is used for grouping
- A job can push once or repeatedly (loop if needed)
7. Verify in Prometheus
In Prometheus UI → Graph:
java_pushgateway_job_metric
Click Execute → metric appears.
PART 2: Sending Metrics from .NET to Pushgateway
Now let’s do the same thing using .NET.
8. Create .NET Console Application
- Create a new Console App
- Name it something like:
PushGatewayDotNetSample
9. Install Prometheus .NET Client
Add NuGet package:
prometheus-net
10. Registry Concept in .NET
In .NET:
- Registry is created using Metrics.NewCustomRegistry()
- Metrics are created using a metric factory
- Factory is bound to that registry
This ensures:
- Metrics do NOT go to default registry
- Only pushed metrics are included
11. .NET Code: Push Metrics to Pushgateway
using Prometheus;
using System;
using System.Threading;
class Program
{
    static void Main(string[] args)
    {
        // Create custom registry
        var registry = Metrics.NewCustomRegistry();

        // Create metric factory bound to registry
        var factory = Metrics.WithCustomRegistry(registry);

        // Create Pushgateway pusher
        var pusher = new MetricPusher(
            endpoint: "http://localhost:9091/metrics",
            job: "dotnet_pushgateway_job",
            instance: "instance-1",
            registry: registry
        );

        pusher.Start();

        // Create Gauge metric
        var gauge = factory.CreateGauge(
            "dotnet_pushgateway_metric",
            "Sample metric pushed from .NET job"
        );

        // Push values in a loop
        while (true)
        {
            gauge.Set(DateTimeOffset.UtcNow.ToUnixTimeMilliseconds());
            Thread.Sleep(1000);
        }

        // pusher.Stop(); (not reached in infinite loop)
    }
}
12. Important .NET Notes
- MetricPusher must be:
- Started → Start()
- Stopped → Stop()
- Metrics must be created after Start()
- Metrics must use the factory, not the static Metrics.CreateX() methods
13. Verify .NET Metrics in Prometheus
In Prometheus UI → Graph:
dotnet_pushgateway_metric
You will see:
- Job label → dotnet_pushgateway_job
- Instance label → instance-1
- Metric values updating
14. Why Graph Looks “Simple”
- Values are random or timestamps
- Purpose is data flow demonstration
- Real use cases:
- Job duration
- Success/failure count
- Records processed
- Execution time
15. Key Takeaways (Very Important)
- Pushgateway is for jobs, not services
- Never push to default registry
- Always use a custom registry and a custom metric factory
- Job & instance labels matter
- Pushgateway stores metrics until overwritten or deleted
Securing Prometheus and Its Components (Authentication & HTTPS)
One of the most crucial aspects of any software system is authentication and security, and Prometheus is no exception.
Prometheus exposes:
- A Web UI
- HTTP APIs
- Exporters
- Pushgateway
If these endpoints are not protected, anyone with network access can:
- View metrics
- Query APIs
- Scrape exporters
- Push fake metrics
In this section, we will learn how to secure Prometheus and its surrounding components.
1. Security Mechanisms in Prometheus
Prometheus supports multiple security mechanisms:
1. Basic Authentication
- Username + password
- Used for:
- Prometheus Web UI
- Prometheus HTTP APIs
2. OAuth 2.0 / OIDC
- Used mostly for exporters or reverse proxies
- Integrates with identity providers
3. TLS / mTLS (Mutual TLS)
- Encrypts traffic
- Authenticates servers and/or clients
- Used for:
- Prometheus
- Exporters
- Pushgateway
In this lecture, we focus on:
- Basic Authentication
- HTTPS (TLS)
- Securing exporters (Node Exporter example)
PART 1: Securing Prometheus with Basic Authentication
2. What Basic Authentication Protects
Basic authentication protects:
- Prometheus Web UI (/graph, /targets, etc.)
- Prometheus HTTP APIs (/api/v1/...)
After enabling it:
- Browser prompts for username + password
- API clients must send credentials
3. Steps to Enable Basic Authentication
- Choose a strong username and password
- Hash the password using bcrypt
- Create a web configuration file
- Start Prometheus with --web.config.file
4. Generating a bcrypt Password Hash
Prometheus requires bcrypt hashes.
Option 1: Using htpasswd (Linux / macOS)
Check if Apache tools are installed:
htpasswd
If available, generate bcrypt hash:
htpasswd -nBC 10 admin
- -B → bcrypt
- -C 10 → cost factor of 10
- admin → username
You will be prompted to enter the password twice.
Output example:
admin:$2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Option 2: Online Tool (Non-Production Only)
You can use:
- bcrypt hash generators (e.g., bcrypt-generator)
⚠️ Only for learning / testing, never production.
Make sure:
- Cost factor = 10
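Option 3: Python Script (If htpasswd Is Not Available)
A minimal sketch using the third-party bcrypt package (pip3 install bcrypt); it prints only the hash, which you then paste as the value in the web config file:
import getpass
import bcrypt

# Read the password without echoing it, then print a bcrypt hash with cost factor 10
password = getpass.getpass("Password: ")
print(bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt(rounds=10)).decode())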
5. Creating Prometheus Web Config File
Create a file called:
web.yml
Example content:
basic_auth_users:
  admin: $2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Notes:
- Indentation matters
- Username → bcrypt hash
- You can add multiple users
Example:
basic_auth_users:
  admin: $2y$10$xxxx
  readonly: $2y$10$yyyy
6. Starting Prometheus with Basic Auth
Prometheus supports a web configuration file via:
--web.config.file
Example (manual start)
prometheus \
--config.file=prometheus.yml \
--web.config.file=web.yml
If Prometheus Runs as a Service
- Linux (systemd) Edit Prometheus service file and add:
--web.config.file=/path/to/web.yml
- macOS (Homebrew) Edit:
/usr/local/etc/prometheus.args
Add:
--web.config.file=/usr/local/etc/web.yml
Restart:
brew services restart prometheus
7. Verifying Basic Authentication
Open browser:
http://localhost:9090
Result:
- Browser prompts for username + password
- Prometheus UI loads after authentication
PART 2: Enabling HTTPS (TLS) for Prometheus
By default, Prometheus uses HTTP only.
This means:
- Credentials are sent in plain text
- APIs are unencrypted
We now enable HTTPS.
8. Why HTTPS Is Required
Without HTTPS:
- Browsers reject secure integrations
- Tools like Grafana cannot securely connect
- Credentials are exposed
9. TLS Certificates Options
Production
- Buy certificate from a trusted Certificate Authority (CA)
Practice / Internal Use
- Generate self-signed certificates
10. Generating TLS Certificates Using OpenSSL
On macOS / Linux:
openssl req -x509 -newkey rsa:2048 -days 365 -nodes \
-keyout prometheus.key \
-out prometheus.crt \
-subj "/CN=localhost"
This creates:
- prometheus.key → private key
- prometheus.crt → certificate
11. Updating Prometheus Web Config for HTTPS
Edit web.yml:
tls_server_config:
  cert_file: prometheus.crt
  key_file: prometheus.key
basic_auth_users:
  admin: $2y$10$xxxxxxxxxxxxxxxx
Important:
- Certificate files must be readable by Prometheus user
- Use absolute paths if files are elsewhere
12. Restart Prometheus
After restart:
- HTTP no longer works
- HTTPS is required
https://localhost:9090
Browser may warn about:
Self-signed certificate
This is expected.
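You can also verify from the command line. Assuming the prometheus.crt file and admin user from the earlier steps, the --cacert flag tells curl to trust the self-signed certificate (curl prompts for the password):
curl --cacert prometheus.crt -u admin https://localhost:9090/-/healthy
A healthy, authenticated instance responds with a short health message.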
PART 3: Securing Exporters (Node Exporter Example)
Now we secure exporters, using Node Exporter as an example.
Goal:
- Prometheus → Exporter communication via HTTPS
13. Create Web Config for Node Exporter
Create:
node-web.yml
Content:
tls_server_config:
  cert_file: /full/path/prometheus.crt
  key_file: /full/path/prometheus.key
14. Start Node Exporter with Web Config
Windows
node_exporter.exe --web.config.file=node-web.yml
Linux (systemd)
Edit node exporter service:
--web.config.file=/path/node-web.yml
Restart service.
macOS (Homebrew)
Edit:
/usr/local/etc/node_exporter.args
Add:
--web.config.file=/full/path/node-web.yml
Restart:
brew services restart node_exporter
15. Verify Exporter HTTPS
Open browser:
https://localhost:9100/metrics
You should see metrics after accepting the certificate warning.
16. Updating Prometheus to Scrape HTTPS Exporter
Edit prometheus.yml:
scrape_configs:
  - job_name: "node"
    scheme: https
    tls_config:
      ca_file: /full/path/prometheus.crt
      server_name: localhost
    static_configs:
      - targets: ["localhost:9100"]
Notes:
- scheme: https is mandatory
- server_name must match the certificate CN
- ca_file is required for self-signed certificates
17. Restart Prometheus and Verify
- Restart Prometheus
- Open Targets page
- Node exporter should be UP
Test metric:
node_cpu_seconds_total
Securing Pushgateway and Alertmanager (Authentication & HTTPS)
In this lecture, we will complete the security setup of the Prometheus ecosystem by protecting:
- Pushgateway
- Alertmanager
The goal is to ensure that:
- No unauthorized user can push fake metrics
- No malicious user can trigger or delete alerts
- All communication is authenticated and encrypted
1. Why Pushgateway Must Be Secured
If Pushgateway is not protected:
- Anyone who can reach its endpoint can:
- Push fake metrics
- Corrupt dashboards
- Trigger false alerts
Pushgateway supports the same security model as:
- Prometheus
- Node Exporter
This includes:
- Basic authentication
- HTTPS (TLS)
- Web configuration files
2. Pushgateway Supports --web.config.file
Run:
pushgateway --help
You will see:
--web.config.file
This is the same option used by:
- Prometheus
- Node Exporter
- Alertmanager
👉 All Prometheus components share the same web config format
3. Creating Web Config for Pushgateway
You do not need to create a new file from scratch.
If you already created a web config for Node Exporter, you can reuse or duplicate it.
Example: pushgateway-web.yml
tls_server_config:
  cert_file: /usr/local/etc/prometheus/prom.crt
  key_file: /usr/local/etc/prometheus/prom.key
basic_auth_users:
  admin: $2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxx
Notes:
- Structure is identical across components
- Username/password are bcrypt-hashed
- TLS and Basic Auth are combined
4. Starting Pushgateway with Security Enabled
Example (manual start):
pushgateway \
--web.config.file=/usr/local/etc/prometheus/pushgateway-web.yml
Pushgateway will now:
- Listen on port 9091
- Require HTTPS
- Require username + password
5. Verifying Pushgateway Security
Open a private browser window:
https://localhost:9091
Result:
- Browser prompts for credentials
- Lock icon appears
- Connection is encrypted
✅ Pushgateway is now secured
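You can double-check from the command line as well (paths and credentials follow the examples above):
curl --cacert /usr/local/etc/prometheus/prom.crt -u admin https://localhost:9091/metrics
Without credentials, the same request should be rejected with 401 Unauthorized.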
6. Updating Python Code to Authenticate with Pushgateway
Now that Pushgateway is protected, clients must authenticate.
7. Pushgateway Python Client: Authentication Support
The push_to_gateway() function supports a custom handler.
We use:
basic_auth_handler
8. Updated Python Code with Basic Authentication
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
from prometheus_client.exposition import basic_auth_handler
import time
def auth_handler(url, method, timeout, headers, data):
    return basic_auth_handler(
        url, method, timeout, headers, data,
        username="admin",
        password="password"
    )
registry = CollectorRegistry()
gauge = Gauge(
'python_pushgateway_metric',
'Metric pushed securely',
registry=registry
)
gauge.set(time.time())
push_to_gateway(
'https://localhost:9091',
job='python_secure_job',
registry=registry,
handler=auth_handler
)
9. Handling Self-Signed Certificates in Python
If you see SSL errors due to self-signed certs:
export SSL_CERT_FILE=/usr/local/etc/prometheus/prom.crt
This allows Python to trust the certificate.
10. Updating Prometheus to Scrape Secured Pushgateway
Prometheus must also:
- Use HTTPS
- Authenticate with Pushgateway
Update prometheus.yml
scrape_configs:
  - job_name: "pushgateway"
    scheme: https
    basic_auth:
      username: admin
      password: password
    tls_config:
      ca_file: /usr/local/etc/prometheus/prom.crt
      server_name: localhost
    static_configs:
      - targets: ["localhost:9091"]
Restart Prometheus.
11. Verifying Pushgateway Target
In Prometheus UI → Targets:
- Pushgateway should now be UP
- Previously red targets turn green once HTTPS + auth are configured
Check metric:
python_pushgateway_metric
PART 2: Securing Alertmanager
12. Why Alertmanager Must Be Secured
If Alertmanager is not protected:
- Anyone can:
- Trigger fake alerts
- Delete active alerts
- Abuse the Admin API
Alertmanager supports:
- HTTPS
- Basic authentication
- Web config file (same format)
13. Alertmanager Web Config File
You can reuse an existing web config.
Example: alertmanager-web.yml
tls_server_config:
  cert_file: /usr/local/etc/prometheus/prom.crt
  key_file: /usr/local/etc/prometheus/prom.key
basic_auth_users:
  admin: $2y$10$xxxxxxxxxxxxxxxxxxxxxxxxxxxx
14. Starting Alertmanager Securely
Manual start
alertmanager \
--config.file=alertmanager.yml \
--web.config.file=alertmanager-web.yml
Alertmanager listens on 9093 by default.
Ubuntu (systemd)
Edit Alertmanager service:
--web.config.file=/path/alertmanager-web.yml
Reload and restart service.
macOS (MacPorts)
- Copy web config to:
/opt/local/etc/prometheus/alertmanager/
- Edit plist file:
/opt/local/etc/launchd/alertmanager.plist
- Add:
--web.config.file=/opt/local/etc/prometheus/alertmanager/alertmanager-web.yml
Reload service:
sudo port unload alertmanager
sudo port load alertmanager
15. Updating Prometheus to Talk to Secured Alertmanager
Edit prometheus.yml:
alerting:
  alertmanagers:
    - scheme: https
      basic_auth:
        username: admin
        password: password
      tls_config:
        ca_file: /usr/local/etc/prometheus/prom.crt
        server_name: localhost
      static_configs:
        - targets:
            - localhost:9093
Restart Prometheus.
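Before restarting, it is worth validating the edited file with promtool, which ships with Prometheus (adjust the path to your prometheus.yml):
promtool check config /etc/prometheus/prometheus.yml
It reports YAML or schema errors before you restart.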
16. Verification
- Prometheus starts without errors
- Alerts continue to work
- Alertmanager UI prompts for credentials
- Communication is encrypted
Introduction to Grafana and Installing Grafana on Windows
Up to this point, we have learned a lot about Prometheus:
- How to scrape metrics
- How to push metrics (Pushgateway)
- How PromQL functions work
- How to create rules and alerts
- How to secure Prometheus and its components
This is a good time to introduce Grafana.
After this section, we will come back to advanced Prometheus topics, but first we need better visualization.
1. Why Do We Need Grafana?
Prometheus does have graphs, but they are:
- Basic
- Limited in customization
- Not suitable for complex dashboards
Example:
- You can graph node_cpu_seconds_total
- You can see time-based data
- But dashboards are not flexible or advanced
Grafana solves this problem.
2. What Is Grafana?
Grafana is an open-source visualization and dashboarding tool.
It is designed to:
- Visualize time-series data
- Build advanced dashboards
- Combine data from multiple sources
Time-Series Reminder
A time-series is:
- A metric
- With a timestamp
- Stored over time
Prometheus is a time-series database, and Grafana is one of the best visualization tools for it.
3. Grafana Data Sources (Very Important)
A single Grafana dashboard can pull data from multiple sources:
- Prometheus
- MySQL / PostgreSQL
- SQL Server
- Amazon CloudWatch
- Elasticsearch
- Loki (logs)
- Tempo (traces)
👉 This allows you to correlate data from different systems on one dashboard.
4. Alerts in Grafana vs Prometheus
Grafana can also:
- Create alerts
- Send notifications (email, Slack, PagerDuty, etc.)
So a common design question is:
Should alerts live in Prometheus or Grafana?
Typical approach:
- Prometheus → service health & infrastructure alerts
- Grafana → visualization-driven or cross-datasource alerts
Both approaches are valid.
5. Organizations, Users, and Access Control
Grafana supports:
- Multiple organizations
- Multiple teams
- Fine-grained RBAC
- Read-only users
- Admin users
This makes Grafana suitable for large organizations.
6. Grafana Deployment Options
Before installing Grafana, you must choose how to run it.
Option 1: Grafana Cloud
Grafana Cloud is a fully managed observability platform.
Advantages
- No installation or maintenance
- Always up to date
- Managed scalability
- Free tier is enough for learning
You can sign up at:
https://grafana.com/products/cloud
Disadvantages
- Can be expensive at scale
- Vendor lock-in
- Data stored outside your infrastructure
- Possible compliance issues (GDPR, regulations)
Option 2: Self-Hosted Grafana (On-Prem / VM / EC2)
Advantages
- Full control over data
- Better security and compliance
- Highly customizable
- Open-source version is free
Disadvantages
- Maintenance overhead
- You must handle upgrades
- You must design scalability
- Requires operational knowledge
7. How to Decide Between Cloud and Self-Hosted
Ask yourself:
- Do we have engineers to maintain it?
- Do we need customization?
- Are there compliance restrictions?
- What is our budget?
- Do we want full control over data?
There is no single correct answer.
Installing Grafana on Windows
Now let’s install Grafana on Windows.
8. Download Grafana for Windows
- Go to:
https://grafana.com
- Click Get Grafana
- Navigate to Download
- Select Windows
Download the Windows Installer (.exe).
9. Install Grafana
- Run the installer
- Choose the installation directory 👉 Remember this location
- Complete the installation
Grafana will be installed as a Windows service.
10. Verify Grafana Service
- Open Services
- Look for:
Grafana
- Ensure it is:
- Running
- Startup type = Automatic (recommended)
If it is stopped, start it manually.
11. Grafana Configuration File (Windows)
Navigate to the installation directory, usually:
C:\Program Files\GrafanaLabs\grafana\
Inside, you will find:
conf\
└── defaults.ini
This is Grafana’s default configuration file.
⚠️ Best practice:
- Do NOT edit defaults.ini directly
- Copy it and override settings later if needed
For now, just be aware of it.
12. Default Grafana Port
Grafana listens on:
http://localhost:3000
You can change this later if:
- Port is occupied
- Firewall blocks it
13. First Login to Grafana
Open browser:
http://localhost:3000
Default credentials:
- Username: admin
- Password: admin
On first login:
- Grafana forces password change
- Choose a strong password
Installing Grafana on macOS, Linux, and Docker
In this section, we will learn multiple ways to install Grafana, depending on your operating system and use case.
You can install Grafana on:
- macOS (Homebrew)
- Ubuntu
- Amazon Linux / Red Hat
- Docker (standalone or docker-compose)
The core concepts and configuration files are the same across all installations.
Part 1: Installing Grafana on macOS (Homebrew – Recommended)
1. Verify Homebrew Is Installed
Open Terminal and run:
brew --version
If you see a version number, Homebrew is installed.
If not:
- Go to https://brew.sh
- Follow the installation instructions
2. Install Grafana Using Homebrew
brew install grafana
This installs Grafana Open Source, which is perfect for learning and most real-world use cases.
3. Grafana Configuration Location (macOS)
After installation, navigate to:
/usr/local/etc/grafana/
You will see:
grafana.ini
This is Grafana’s main configuration file.
4. Best Practice: Use custom.ini
Do not edit grafana.ini directly.
Instead:
cp grafana.ini custom.ini
Always make changes in:
custom.ini
This protects you from accidental misconfiguration and upgrades overwriting your settings.
5. Important Settings to Review (macOS)
Open custom.ini in an editor (nano or VS Code).
a) Server Port
[server]
http_port = 3000
- Default port: 3000
- If you change the port:
- Remove the semicolon (;) at the start of the line
- Otherwise the line is ignored
b) Database Configuration
By default:
[database]
type = sqlite3
Other supported options:
- MySQL
- PostgreSQL
You may switch databases by:
- Removing the semicolon
- Changing type
- Providing host, user, password
SQLite is fine for:
- Single instance
- Learning
- Local setups
External DB is recommended when:
- Running Grafana in Docker
- Running multiple Grafana instances
- You need persistence across restarts
c) Logs Location
[paths]
logs = /var/log/grafana
Knowing this path is critical for:
- Debugging startup issues
- Plugin failures
- Authentication problems
6. Start / Restart Grafana (macOS)
Check service status:
brew services info grafana
Restart after config changes:
brew services restart grafana
7. Access Grafana (macOS)
Open browser:
http://localhost:3000
Default credentials:
- Username: admin
- Password: admin
You will be forced to change the password on first login.
Part 2: Installing Grafana on Ubuntu
8. Update Package Index (Important)
sudo apt update
This step is mandatory. Skipping it causes dependency failures.
9. Install Required Dependencies
sudo apt install -y adduser libfontconfig1 musl
musl is a critical C library required by Grafana.
10. Download Grafana Debian Package
Get the latest .deb package from Grafana documentation.
Example (amd64):
wget https://dl.grafana.com/oss/release/grafana_<version>_amd64.deb
If your system is ARM:
- Use arm64 instead of amd64
11. Install Grafana
sudo dpkg -i grafana_*.deb
12. Enable and Start Grafana Service
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Verify:
sudo systemctl status grafana-server
You should see active (running).
13. Access Grafana (Ubuntu)
Open browser:
http://<SERVER_PUBLIC_IP>:3000
Make sure:
- Port 3000 is allowed in security groups / firewall
Login:
- admin / admin
- Change password on first login
Part 3: Installing Grafana on Amazon Linux / Red Hat
The process is identical for Amazon Linux and Red Hat.
14. Install Grafana Using RPM
sudo yum install -y <GRAFANA_RPM_URL>
(The RPM link is provided in Grafana documentation.)
15. Enable and Start Service
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
Verify:
sudo systemctl status grafana-server
16. Access Grafana
http://<PUBLIC_IP>:3000
Ensure port 3000 is open.
Part 4: Installing Grafana Using Docker
17. Prerequisite: Docker Desktop
Install Docker Desktop from:
https://www.docker.com/products/docker-desktop
Make sure Docker Desktop is running.
18. Grafana Docker Images
Two official images exist:
- Open Source:
grafana/grafana-oss
- Enterprise (requires license):
grafana/grafana-enterprise
We use OSS.
19. Run Grafana with Docker
docker run -d \
--name grafana \
-p 3001:3000 \
grafana/grafana-oss
- Host port: 3001
- Container port: 3000
Access:
http://localhost:3001
Login:
- admin / admin
20. Important Docker Limitation
If Grafana runs in Docker:
- It cannot access Prometheus on localhost
- Unless Prometheus also runs in Docker
👉 Solution: Docker Compose
21. Docker Compose (Recommended for Labs)
Use a docker-compose.yml that includes:
- Prometheus
- Grafana
- Loki
- Shared Docker network (e.g.,
monitoring)
Example command:
docker compose up -d
All services:
- Share the same network
- Can communicate via container names
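As a rough sketch (the course provides its own compose file), a minimal docker-compose.yml for just Prometheus and Grafana on a shared network could look like this; image tags, ports, and the mounted prometheus.yml path are assumptions to adapt:
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - monitoring
  grafana:
    image: grafana/grafana-oss
    ports:
      - "3000:3000"
    networks:
      - monitoring
networks:
  monitoring:
Inside this network, Grafana reaches Prometheus at http://prometheus:9090 (the service name), not localhost.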
Part 5: Grafana Configuration (All Platforms)
22. Configuration File Location (Same Everywhere)
Inside container or host:
/etc/grafana/grafana.ini
Best practice:
cp grafana.ini custom.ini
23. Common Configuration Changes
a) Instance Name
Used when multiple Grafana instances exist.
b) Logs
[paths]
logs = /var/log/grafana
(Remove semicolon!)
c) Database (Critical for Docker / HA)
Use MySQL or PostgreSQL when:
- Running multiple instances
- Using Docker
- You need persistence
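For example, a custom.ini override pointing Grafana at PostgreSQL might look like this (host, database name, and credentials are placeholders):
[database]
type = postgres
host = db.example.com:5432
name = grafana
user = grafana
password = secret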
24. Restart Grafana After Changes
sudo systemctl restart grafana-server
Or stop/start for reliability:
sudo systemctl stop grafana-server
sudo systemctl start grafana-server
25. Key Takeaways
- Grafana installation varies, but configuration is consistent
- custom.ini is preferred over grafana.ini
- SQLite is fine for a single instance
- External DB is required for HA and Docker
- Docker Compose is best for local observability stacks
Grafana Dashboard Design Best Practices & Getting Ready to Build Dashboards
Before we start creating dashboards and working with different Grafana panels, it is very important to understand how dashboards should be designed and what layouts work best for different use cases.
A well-designed dashboard:
- Tells a story
- Highlights what matters first
- Avoids clutter
- Helps humans make decisions quickly
1. Why Grafana Dashboards Matter
In Prometheus, we can:
- Query metrics
- Draw simple graphs
- Explore time-series data
However:
- Prometheus graphs are basic
- They are not ideal for large-scale observability
- They lack layout, grouping, and advanced UX
This is where Grafana shines.
Grafana allows us to:
- Build structured dashboards
- Combine multiple data sources
- Visualize metrics in meaningful ways
- Serve different audiences (engineers, SREs, business teams)
2. Types of Dashboards You Can Create
There is no single dashboard design that fits all needs.
Dashboards should be designed based on purpose and audience.
Common Dashboard Categories
- Browser / Frontend Dashboards
- Angular, React, Vue apps
- User-experience focused
- Application Performance Monitoring (APM) Dashboards
- Backend services
- APIs and microservices
- Infrastructure Dashboards
- Hosts, VMs, containers
- CPU, memory, disk, network
- Synthetic Monitoring Dashboards
- External checks
- Availability and uptime
- Business / Operational Dashboards
- Sales
- Revenue
- Refunds
- Conversion rates
Each category has different priorities.
3. Recommended Layout: Browser / Frontend Dashboards
What Matters Most?
- Errors
- Performance
- Traffic
- User experience
Suggested Layout
Top section (most important):
- Error rate
- Number of errors
- Top N errors
Middle section:
- Page load time
- Throughput (page views per minute)
Bottom section:
- Web Vitals:
- LCP (Largest Contentful Paint)
- FID (First Input Delay)
- CLS (Cumulative Layout Shift)
Design Principle
If users are seeing errors or slow pages, that should be visible immediately.
4. Recommended Layout: APM / Backend Services
Key Metrics
- API calls per minute
- Error rate
- Latency
- Logs volume
- Resource usage
Suggested Layout
- API calls per minute
- Error rate
- Log volume
- CPU & memory usage
- Hosts / containers running the service
This layout helps answer:
Is the service healthy, fast, and scalable?
5. Recommended Layout: Infrastructure Dashboards
Top Summary Section
- Number of hosts
- Applications
- Events
- Alerts / warnings
Core Metrics
- CPU usage
- Memory usage
- Disk usage
- Disk utilization
Detail Section
- List of all hosts / VMs
- Container details
- Databases (MySQL, Redis, etc.)
Infrastructure dashboards are usually used by:
- SREs
- DevOps engineers
- Platform teams
6. Recommended Layout: Synthetic Monitoring Dashboards
Synthetic monitoring means:
Monitoring without instrumenting applications or infrastructure.
Examples
- HTTP checks
- Ping checks
- API health endpoints
Suggested Panels
- Website availability (up/down)
- API health checks
- Page load time
- External dependencies (Redis, Kafka, RabbitMQ, cloud services)
Color matters here:
- Green → healthy
- Red → broken
This dashboard answers:
Can users reach us right now?
7. Recommended Layout: Business Dashboards
Business dashboards are not technical dashboards.
Typical Metrics
- Total sales count
- Total refund count
- Sales value
- Refund value
- Conversion rate
- Customer acquisition
- Abandoned checkouts
- Payment methods
- Average basket value
Recommended Visuals
- Comparison with last week / last month
- Region-based breakdown
- Trends over time
These dashboards are often viewed by:
- Managers
- Executives
- Operations teams
8. Test Data for This Course: ShoeHub
To make dashboards realistic, we will use test metrics from an imaginary company:
Company: ShoeHub
- Products:
- Loafers
- High heels
- Boots
- Payment methods:
- Credit card
- PayPal
- Cash
- Countries:
- US
- India
- Australia
9. ShoeHub Metrics Generator
I have created a sample application that generates random metrics.
Options to Run It
Option 1: Binary (Releases)
- Download from GitHub
- Choose your OS (Windows / Linux / macOS)
- Run the executable
Metrics endpoint:
http://localhost:5000/metrics
Option 2: Docker (Recommended)
docker pull asrf/shoehub
docker run -p 8030:8080 asrf/shoehub
Scrape:
http://localhost:8030/metrics
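To have Prometheus scrape it, add a job to prometheus.yml and restart or reload Prometheus. A minimal sketch (the job name is arbitrary; the target matches the Docker port mapping above):
scrape_configs:
  - job_name: "shoehub"
    static_configs:
      - targets: ["localhost:8030"]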
10. Verifying Metrics in Prometheus
Once scraped, in Prometheus:
- Go to Targets → target is UP
- Search for metrics starting with:
shoehub_
You will see:
- Country-based metrics
- Payment method metrics
- Product sales metrics
These are intentionally designed to support different dashboard types.
11. Connecting Grafana to Prometheus
Add Prometheus as Data Source
- Open Grafana
- Hover over Configuration (gear icon)
- Click Data Sources
- Click Add data source
- Select Prometheus
Configuration
- URL:
http://localhost:9090
(or HTTPS if secured)
Optional:
- Basic authentication
- Custom headers
- TLS certificates
Click Save & Test
✅ Green checkmark means success.
12. Creating a Dashboard in Grafana
Step 1: Create Folder (Optional)
- Dashboards → New Folder
- Example:
Tech Team
Step 2: Create Dashboard
- Dashboards → New Dashboard
- Save immediately (important!)
Name example:
ShoeHub
13. Dashboard Settings Best Practices
Open Dashboard Settings (⚙️)
- Title & description
- Tags (e.g. shoehub, training, demo)
- Time zone:
- Recommended: Default or Browser time
- Read-only mode (for TV dashboards)
Save settings.
14. Using Rows for Layout
Rows help structure dashboards.
Example:
- Row 1: Technical Charts
- Row 2: Business Charts
Rows:
- Are collapsible
- Can have titles
- Can be repeated later using variables
Working with Grafana Panels: From Basic to Advanced
Once you have:
- Created a dashboard
- Connected Grafana to Prometheus
…the next step is to add visualizations, which in Grafana are called panels.
👉 Any chart, graph, or visualization you add to a dashboard is called a panel.
1. Adding Your First Panel
To add a panel:
- Hover over Add
- Click Add visualization
You will see three main sections on the screen:
- Panel preview (center)
- Panel properties (right side)
- Query editor & data source (bottom)
2. Panel Types (Visualization Types)
In the panel type dropdown, you will see many visualization options.
Some are used very frequently, others less so.
Most Common Panel Types
- Time series (default and most used)
- Stat
- Gauge
- Bar chart
- Pie chart
- Table
Time Series Panel
- Best for showing trends over time
- This is the default panel type
- Ideal for metrics like:
- Response time
- Throughput
- CPU usage
- Memory usage
3. Writing Queries for Panels
Each panel can have multiple queries:
- Query A
- Query B
- Query C
- …
Each query:
- Pulls data from Prometheus
- Can be visualized together in one panel
Two Ways to Build Queries
Option 1: Query Builder
- UI-based
- Beginner-friendly
- Prometheus functions appear as operations
Option 2: Code (PromQL)
- Direct PromQL
- Faster
- Easier for complex logic
- Preferred by experienced users
Both approaches are valid and interchangeable.
4. Example: Simple Time Series Panel
Let’s create a response time panel.
- Panel type: Time series
- Data source: Prometheus
- Paste a PromQL query (or build it)
- Click Run query
The graph will appear immediately.
💡 Prometheus functions (like rate) appear as operations in the builder.
5. Saving and Organizing Panels
Important rules:
- Always save early
- Always apply changes
You can:
- Drag panels into rows
- Collapse and expand rows
- Move panels between rows
This helps keep dashboards clean and readable.
6. Improving Legends (Display Names)
By default, Grafana generates very long legends based on labels.
This is often too noisy.
For Single-Query Panels
- Edit panel
- Go to Standard options
- Set Display name
Example:
Response Time (ms)
7. Multi-Query Panels (Sales Example)
Now let’s build a panel with multiple queries.
Example: Shoe Sales
We have metrics for:
- Boots
- High heels
- Loafers
Each metric:
- Uses rate()
- Uses the same time range (important!)
⚠️ If ranges differ (1m vs 24h), comparisons are meaningless.
8. Calculating Totals with PromQL
We can calculate total sales directly in PromQL.
Example (Code Mode)
rate(shoehub_sales_boots[1m])
+ rate(shoehub_sales_high_heels[1m])
+ rate(shoehub_sales_loafers[1m])
This creates a derived metric without storing it in Prometheus.
9. Setting Legends for Multi-Query Panels
When a panel has multiple queries, you must set legends per query.
Steps:
- Expand query options
- Set Legend → Custom
- Provide a name
Example:
- Query A → Boots
- Query B → High Heels
- Query C → Loafers
- Query D → Total
This makes the chart readable and professional.
10. Using Data Transformations (Very Important)
Instead of writing a new PromQL query, Grafana can calculate values locally.
Why Use Transformations?
- Reduce query complexity
- Improve readability
- Reuse existing data
- Faster iteration
Example: Total Sales via Transformation
- Edit panel
- Go to Transformations
- Click Add transformation
- Choose Add field from calculation
- Mode: Reduce row
- Operation: Sum
- Alias:
Total Sales
Result:
- Grafana calculates the total
- No new PromQL query required
11. Time Series vs Pie Charts
Time Series Panels
Best for:
- Trends
- Changes over time
- Rate analysis
Pie Charts
Best for:
- Distribution
- Percentages
- Contribution to total
12. Creating a Pie Chart Panel
Example: Card Payments by Country
- Add visualization
- Panel type: Pie chart
- Title:
Card Payments by Country
Queries (One per Country)
Each query:
- Filters by country label
- Uses
rate()
Example logic:
- Australia
- India
- United States
Now each slice represents one country’s share.
13. Pie Chart Customization Options
You can configure:
- Pie vs donut
- Labels:
- Name
- Value
- Percentage
- Legend position
- Tooltip behavior
- Links (to dashboards or external URLs)
Pie charts are excellent for business dashboards.
14. Saving and Organizing Business Panels
Once created:
- Save & apply
- Move panel into Business row
- Keep technical and business metrics separate
This improves clarity and usability.
15. Key Takeaways
- Panels are the building blocks of Grafana dashboards
- Use Time series for trends
- Use Pie charts for proportions
- Prefer code mode for complex queries
- Use transformations to avoid unnecessary PromQL
- Always clean up legends and titles
Comparing Metrics Across Time & Using Grafana Variables
In many real-world scenarios, we don’t just want to see current values — we want to compare the same metric across different time periods.
Examples:
- Network errors today vs last week
- Sales this month vs last month
- Latency before vs after a deployment
Grafana + Prometheus give us two powerful tools for this:
- PromQL time offsets
- Grafana variables
1. Comparing the Same Metric Across Time (Offset)
Goal
Compare:
- Sales now
- Sales in the past (e.g., last week)
Step 1: Create a New Panel
- Panel type: Time series
- Title:
Sales Today vs Sales Last Week
Step 2: First Query – Current Sales
Example metric (simplified):
shoehub_sales_loafers
Apply a rate:
rate(shoehub_sales_loafers[$__interval])
Why $__interval?
- It automatically adapts to the dashboard time range
- Makes the panel reusable and scalable
Step 3: Second Query – Past Sales (Offset)
Duplicate the query and add an offset:
rate(shoehub_sales_loafers[$__interval]) offset 1m
In production you would normally use:
- offset 7d (last week)
- offset 30d (last month)
Here we use 1m only because historical data is limited.
Result
- Two time series on the same panel
- One shifted backward in time
- Easy visual comparison of trends
You can now:
- Set custom legends (e.g. Today, Last Week)
- Move this panel into the Business section
2. Practice Review: Payment Method Percentage by Country
Business Question
Do any payment methods contribute less than 5% of total payments in the US?
If yes, the business may decide to remove them due to maintenance overhead.
Step 1: Create a New Panel
- Panel type: Time series
- Title:
Percentage of Payment Methods in the United States
Step 2: Card Payments (Percentage)
sum(shoehub_payments{country_code="us", payment_method="card"})
/
sum(shoehub_payments{country_code="us"})
* 100
Step 3: Duplicate for Other Methods
Repeat the same query for:
- cash
- paypal
Set custom legends:
- Card
- Cash
- PayPal
Step 4: Improve Visualization
Under Graph styles:
- Line interpolation → Smooth
Step 5: Add Thresholds
Under Thresholds:
- Mode: Percentage
- Threshold:
5 - Display: Filled region
- Color: Red
✅ Now you can visually see:
- Which payment methods dip below 5%
- For how long they stay there
3. Introducing Grafana Variables
So far, our panels are hard-coded:
- Country =
us
This is not scalable.
Grafana variables allow us to:
- Parameterize dashboards
- Reuse panels
- Avoid duplication
4. Creating a Dashboard Variable
Step 1: Open Dashboard Settings
- Go to Dashboard settings
- Click Variables
- Click Add variable
Step 2: Simple (Custom) Variable
Example:
- Name: country
- Type: Custom
- Values: au, in, us
This works — but it’s not dynamic.
5. Dynamic Variable Using Prometheus Labels (Recommended)
Variable Configuration
- Type: Query
- Data source: Prometheus
- Query type: Classic
- Query:
label_values(shoehub_payments{payment_method="card"}, country_code)
Result:
- Automatically detects all countries
- New countries appear without manual changes
Optional settings:
- Enable Multi-value
- Enable Include All
- Sort alphabetically
Save the variable.
6. Using Variables in Panels
Update Panel Title
Change:
Percentage of Payment Methods in the US
To:
Percentage of Payment Methods in $country
Update Queries
Replace hardcoded values:
country_code="us"
With:
country_code="$country"
Now:
- One panel
- Multiple countries
- No duplication
7. Repeating Panels (Auto-Scaling Dashboards)
Sometimes dashboards are displayed on TVs or NOC screens, and no one will manually change variables.
Grafana allows panel repetition.
How to Repeat Panels
- Edit panel
- Go to Panel options
- Enable Repeat
- Choose variable: country
- Direction:
- Vertical
- Horizontal
- Set max panels per row
Now:
- One panel per country
- Automatically generated
- Fully dynamic
8. Practice Review: Payment Method Variable
Goal
Show payment amount by method across all countries.
Step 1: Create Variable
- Name: payment_method
- Type: Query
- Query:
label_values(shoehub_payments, payment_method)
- Enable Multi-value
Step 2: Panel Query
sum(shoehub_payments{payment_method="$payment_method"})
Step 3: Dynamic Legend
Instead of hardcoding legend names, use labels:
{{payment_method}}
Grafana automatically substitutes the label value.
Step 4: Repeat Panel by Payment Method
- Panel options → Repeat
- Repeat by:
payment_method
Result:
- One panel per payment method
- Fully dynamic
- No hardcoding
Key Takeaways
- Offset lets you compare metrics across time
- $__interval makes queries adaptive
- Variables eliminate hardcoding
- Query-based variables scale automatically
- Repeating panels create powerful dashboards with minimal effort
Grafana Loki – Log Aggregation & Analysis
1. What Is Grafana Loki?
Grafana Loki is an open-source log aggregation system designed to work seamlessly with Grafana.
Observability is not just about metrics — it includes:
- Metrics (Prometheus)
- Logs (Loki)
- Traces (Tempo / OpenTelemetry)
Loki focuses specifically on logs.
Important note:
- Loki has no UI of its own
- Logs are viewed and analyzed inside Grafana
- Loki acts as a backend log store
2. Key Features of Grafana Loki
🔹 Log Aggregation
- Collects logs from multiple sources
- Stores and indexes them efficiently
🔹 Fast Queries at Scale
- Designed to query huge volumes of logs quickly
- Optimized for label-based filtering
🔹 Prometheus-Inspired Design
- Similar concepts to Prometheus
- Uses labels, not full-text indexing
- Query language: LogQL
🔹 Native Grafana Integration
- Loki is added as a Grafana data source
-
Logs can be correlated with:
- Metrics
- Dashboards
- Alerts
🔹 Distributed & Scalable
- Horizontally scalable
- Suitable for large environments
🔹 Cost-Effective Storage
- Uses chunk-based storage
- Logs are compressed into chunks
- Much cheaper than traditional log systems (ELK)
3. Loki Architecture – How It Works
Typical Flow
- Application / Backend Service
- Written in Python, Java, or .NET
- Writes logs to disk (example: /var/log/myapp.log)
- Log Shipping Agent
- Runs on the same machine as the application
- Discovers and reads log files
- Sends logs to Loki
- Grafana Loki
- Receives logs
- Stores them
- Makes them queryable
- Grafana
- Uses Loki as a data source
- Displays logs in dashboards & Explore view
4. Promtail vs Grafana Alloy
Promtail
- Official Loki log shipping agent
- Discovers log files using config
- Lightweight and simple
- Ideal for:
- Metrics + logs only
- Small to medium setups
Grafana Alloy
- Next-generation agent
- Can collect:
- Logs
- Metrics
- Traces (OpenTelemetry)
- More powerful and scalable
- Better for:
- Large environments
- Future-proof observability platforms
Which One Should You Learn?
👉 Both
In this section:
- We use Promtail
Next section:
- Dedicated to Grafana Alloy
5. Ways to Use Grafana Loki
Option 1: Grafana Cloud (SaaS)
- Fully managed Loki
- No backend maintenance
- Still requires Promtail or Alloy on your servers
Steps:
- Go to grafana.com
- Sign up / sign in
- Products → Logs → Loki
- Configure Promtail to ship logs to cloud Loki
Good for:
- Learning
- Fast setup
- Small teams
Option 2: Self-Managed Loki (Local or Server)
You can install Loki:
- Locally
- On a VM
- Using Docker
- On Kubernetes
In this course:
- Docker (local learning)
- Linux (production-like setup)
6. Installing Loki with Docker (Recommended for Learning)
Why Docker?
- Works on Mac, Windows, Linux
- Fast and clean setup
- Ideal for local labs
Docker Architecture
Docker Compose stack includes:
- Loki (log store)
- Promtail (log shipper)
- Grafana (visualization)
All containers share a Docker network.
Step 1: Download Docker Compose File
Use curl, wget, or browser:
curl -O https://raw.githubusercontent.com/grafana/loki/main/production/docker-compose.yaml
This file:
- Creates a Docker network
- Runs Loki on port 3100
- Runs Grafana on port 3000
- Mounts /var/log into Promtail
Step 2: Start the Stack
docker compose up -d
Verify:
- Containers are running
- Docker network exists
- Logs appear in Docker Desktop
Step 3: Access Grafana
Open browser:
http://localhost:3000
Login:
- Username:
admin - Password:
admin
Step 4: Verify Loki Data Source
Grafana → Connections → Data Sources
You should already see:
- Loki
- URL:
http://loki:3100
If Grafana is outside Docker:
http://localhost:3100
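If you ever need to add the data source yourself (for example on a non-Docker Grafana), a provisioning file is one option. A minimal sketch, assuming a file such as /etc/grafana/provisioning/datasources/loki.yaml:
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://localhost:3100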
7. Promtail Log Discovery (Docker)
Default Behavior
- Promtail reads: /var/log/*.log
- Files must:
- End with .log
- Be plain text or JSON
volumes:
- /var/log:/var/log
So:
- Logs written on your machine → visible inside Promtail container
8. Viewing Logs in Grafana
- Open Grafana
- Go to Explore
- Select Loki as data source
- Filter by label:
-
filename- Choose your
.logfile - Set time range (e.g. last 15 minutes)
- Run query
- Choose your
You will see:
- Log lines
- Timestamps
- Metadata
9. Installing Loki & Promtail on Linux (Production-Like Setup)
Architecture
- Loki server → stores logs
- Application server → runs Promtail
Step 1: Install Loki (Loki Server)
sudo apt update
sudo apt install loki
(This assumes the Grafana APT repository has already been added to the server.)
Open port 3100 in security group.
Use private IP whenever possible.
Step 2: Install Promtail (App Server)
Promtail is not installed via apt.
- Go to Grafana Loki GitHub releases
- Download Promtail binary
- Unzip and move binary:
sudo mv promtail-linux-amd64 /usr/local/bin/promtail
sudo chmod +x /usr/local/bin/promtail
Step 3: Promtail Configuration
Create config:
/etc/promtail/config.yaml
Key section:
clients:
- url: http://<LOKI_PRIVATE_IP>:3100/loki/api/v1/push
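A complete minimal config.yaml built around that clients section might look like the following; the ports, positions file location, and the varlogs job are common defaults and assumptions you can adjust:
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://<LOKI_PRIVATE_IP>:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log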
Step 4: Create Promtail Service
/etc/systemd/system/promtail.service
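A minimal unit file could look like this (running as root for simplicity; in practice you may prefer a dedicated promtail user with read access to /var/log):
[Unit]
Description=Promtail log shipper
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/promtail -config.file=/etc/promtail/config.yaml
Restart=always

[Install]
WantedBy=multi-user.target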
Start & enable:
sudo systemctl start promtail
sudo systemctl enable promtail
Verify:
sudo systemctl status promtail
10. Generating Logs for Testing
Test app:
- Writes logs to:
/var/log/loki_udemy.log
- Log levels:
- INFO
- WARNING
- ERROR
- Components:
- backend
- database
Example log format:
2024-02-01T10:15:32Z ERROR backend Database connection failed
11. Verifying Logs in Loki
In Grafana:
- Explore → Loki
- Filter by filename
- Select loki_udemy.log
- Adjust time range
- Run query
You should see logs streaming in.
12. Important Observation
Right now:
- Only default labels exist (filename, job)
👉 No structured labels yet
In the next lecture, we will:
- Parse logs
- Add labels (level, component)
- Filter logs efficiently
- Write LogQL queries
Grafana Loki – Static Labels, Dynamic Labels & Log Visualizations
Up to this point, we have successfully ingested logs into Loki and verified that they appear in Grafana → Explore.
However, if you open any log entry, you will notice a limitation:
- Available labels: filename, job
- Missing labels:
- environment
- team
- component
- log level (info / warning / error)
Without labels, logs are hard to filter, slow to query, and difficult to analyze at scale.
This lecture focuses on:
- Static labels
- Dynamic labels (extracted from logs)
- Using Loki logs in Grafana dashboards
1. Static Labels in Promtail
What Are Static Labels?
Static labels are manually assigned labels that do not come from the log content itself.
They are useful for:
- Environment (prod, staging, dev)
- Team ownership (devops, backend)
- Cluster or region metadata
These labels apply to all logs collected by a job.
Where Are Static Labels Defined?
In Promtail configuration, under:
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/*.log
Adding Static Labels
Extend the labels section:
labels:
  job: varlogs
  team: devops
  environment: prod
  __path__: /var/log/*.log
Apply the Changes
- Save the config
- Restart Promtail:
- Docker: restart container
- Linux: restart service
Result in Grafana
Now in Grafana → Explore:
- You can filter by environment="prod" or team="devops"
- You can query logs across all files, ignoring filenames entirely
Static labels are extremely powerful for environment-wide filtering.
2. Searching Unstructured Logs (Without Labels)
At this stage, our logs are unstructured text, not JSON.
Example log line:
2024-01-20T10:15:22 ERROR component=database Connection failed
If you want to search for:
component=database
You have two options:
- Text search (inefficient)
- Logfmt parsing (recommended)
Text Search (Not Recommended)
You can search by text:
|= "database"
⚠️ Downsides:
- CPU intensive
- Not indexed
- Slow at scale
Use only for small datasets.
3. Logfmt – Client-Side Parsing in Grafana
Grafana provides logfmt parsing directly in queries.
Example:
| logfmt
| component="database"
This:
- Parses key=value pairs
- Extracts fields temporarily (client-side)
- Allows filtering without changing Promtail
✔ Useful for quick exploration
✖ Not indexed
✖ Not scalable for production
4. Dynamic Labels (Best Practice)
Now we move to the correct, production-grade solution:
👉 Extract labels at ingestion time
This is done using Promtail pipelines.
5. Promtail Pipelines – Core Concept
A pipeline is a sequence of stages applied to each log line before it reaches Loki.
Each stage:
- Modifies the log entry
- Extracts data
- Adds labels
Pipeline structure:
pipeline_stages:
- stage1
- stage2
- stage3
6. Extracting Labels Using logfmt
Step 1: Add Pipeline Stages
Under the same scrape_configs job:
pipeline_stages:
  - logfmt:
      mapping:
        component:
        level:
This:
- Parses
component=... - Parses
level=...
Step 2: Allow Labels in Static Config
Promtail requires labels to exist in the label allowlist (label map).
Add empty labels:
labels:
  job: varlogs
  component:
  level:
  __path__: /var/log/*.log
This is mandatory.
Step 3: Attach Extracted Fields as Labels
Add a labels pipeline stage:
pipeline_stages:
  - logfmt:
      mapping:
        component:
        level:
  - labels:
      component:
      level:
Step 4: Restart Promtail
After restart:
- New logs will contain extracted labels
- Old logs will not (labels are not retroactive)
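Putting the pieces together, the full Promtail job could look roughly like this (a sketch assembled from the snippets above; paths and label names are the ones used in this lecture):
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          team: devops
          environment: prod
          component:
          level:
          __path__: /var/log/*.log
    pipeline_stages:
      - logfmt:
          mapping:
            component:
            level:
      - labels:
          component:
          level: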
7. Verifying Dynamic Labels
In Grafana → Explore:
- Filter by filename
- Expand a log entry
You should now see labels such as:
- component=backend
- component=database
- level=info
- level=error
Filtering now works efficiently:
{component="database", level="error"}
This is indexed, fast, and scalable.
8. Using Loki Logs in Grafana Dashboards
Logs are not just for Explore — they can be visualized.
8.1 Logs Panel
- Add new panel
- Data source: Loki
- Visualization: Logs
- Query:
{filename="loki_udemy.log"}
- Limit rows (e.g. 10)
This panel:
- Shows latest logs
- Expandable
- Perfect for dashboards
8.2 Turning Logs into Metrics
To visualize logs in time series, bar charts, or pie charts, logs must be converted into numbers.
Why?
Charts require time series vectors, not raw log lines.
Example: Error Count per Minute
Query:
{level="error", component="backend"}
❌ This returns logs, not metrics
Convert using rate:
rate({level="error", component="backend"}[1m])
Now:
- Each bar = number of errors per minute
- Fully compatible with charts
Alternative Functions
- rate()
- count_over_time()
Both are valid.
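For example, the same panel could use count_over_time instead of rate (a sketch assuming the labels created earlier):
count_over_time({level="error", component="backend"}[1m])
rate() returns a per-second rate of log lines, while count_over_time() returns the raw number of lines in each interval.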
9. Comparing Error Sources (Backend vs Database)
Backend Errors
rate({level="error", component="backend"}[5m])
Database Errors
rate({level="error", component="database"}[5m])
Ensure:
- Same time window
- Same function
This allows accurate comparison.
10. Pie Chart – Error Distribution
Goal
Compare:
- Backend errors vs Database errors
Queries
Backend:
rate({level="error", component="backend"}[1h])
Database:
rate({level="error", component="database"}[1h])
Visualization:
- Panel type: Pie chart
Result:
- Immediate insight into where errors originate
- Helps prioritize engineering effort
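Alternatively, a single aggregated query can drive the pie chart by grouping on the component label (a sketch assuming the dynamic labels created earlier):
sum by (component) (count_over_time({level="error"}[1h]))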
11. Key Takeaways
Static Labels
- Added in static_configs
- Best for environment-wide metadata
Dynamic Labels
- Extracted via pipelines
- Indexed
- Fast queries
- Production-ready
Log Visualizations
- Logs panel → raw inspection
- Rate/count → metrics
- Charts → trends & comparisons
OpenTelemetry (OTel): What It Is and Why It Matters
In this lecture, we introduce OpenTelemetry and explain why it is critical to modern observability and how it fits into everything you’ve learned so far in this course.
What Is OpenTelemetry?
OpenTelemetry is a vendor-neutral, open-source observability framework.
It is designed to help teams:
- Avoid vendor lock-in
- Standardize how telemetry data is produced
- Switch observability backends without rewriting applications
OpenTelemetry is:
- Open source
- Sponsored by the Cloud Native Computing Foundation
- Actively adopted across cloud-native ecosystems
What Does OpenTelemetry Collect?
OpenTelemetry supports all three pillars of observability:
- Metrics
- Logs
- Traces
⚠️ Important distinction:
OpenTelemetry is NOT a backend.
It does not store data like:
- Prometheus
- Jaeger
- New Relic
- Splunk
Instead, OpenTelemetry focuses on:
- Generating
- Collecting
- Exporting
Telemetry data to actual backends.
Two Perspectives: Developer vs DevOps
1. Developer Perspective (Code-Level Observability)
As a developer:
- You instrument code
- You explicitly generate:
  - Metrics
  - Traces
  - Logs
Example:
- Increment an order_count metric whenever an order is created (see the Python sketch at the end of this subsection)
- Attach trace IDs to incoming HTTP requests
This is done using OpenTelemetry SDKs, available for:
- C++
- .NET
- Go
- Java
- JavaScript
- PHP
- Python
- Ruby
- Rust
- Swift
(Some community SDKs exist beyond this list.)
This approach gives:
- Maximum flexibility
- Custom business metrics
- Fine-grained tracing
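As a minimal illustration of manual instrumentation, here is a Python sketch of the order-count example (hypothetical service; assumes the opentelemetry-sdk and opentelemetry-exporter-otlp-proto-http packages and an OTLP endpoint such as the Alloy receiver configured later in this course):
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# Export metrics over OTLP HTTP/Protobuf (port 4318 by convention)
exporter = OTLPMetricExporter(endpoint="http://localhost:4318/v1/metrics")
metrics.set_meter_provider(MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)]))

meter = metrics.get_meter("shop")
order_counter = meter.create_counter("order_count", description="Orders created")

def create_order():
    # ... business logic ...
    order_counter.add(1, {"component": "backend"})  # increment on every new order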
2. DevOps / Platform Perspective (Zero-Code Observability)
As a DevOps engineer:
- You often cannot modify application code
- You still need:
  - Metrics
  - Logs
  - Traces
OpenTelemetry supports auto-instrumentation:
- Uses runtime profilers (Java, .NET, etc.)
- Extracts telemetry automatically
- No code changes required
⚠️ Trade-off:
- Less flexible than manual instrumentation
- Still extremely powerful for infrastructure and platforms
Exporters and the OTLP Protocol
After telemetry is generated, it must be exported.
OpenTelemetry supports multiple exporters:
- Prometheus exporter
- New Relic exporter
- Splunk exporter
- Jaeger exporter
OTLP (OpenTelemetry Protocol)
OTLP is the standard protocol for OpenTelemetry.
Key points:
- Unified format for metrics, logs, and traces
- Increasingly adopted by observability backends
- Preferred protocol going forward
OpenTelemetry Collector: When and Why
If you have:
- A few services → exporters may be enough
- Hundreds of services → you need collectors
What Does a Collector Do?
An OpenTelemetry Collector:
- Receives telemetry
- Processes data (filter, batch, enrich)
- Sends data to one or more backends
Collectors are essential for:
- Scalability
- Centralized control
- Multi-backend pipelines
Introducing Grafana Alloy
In this course, the OpenTelemetry collector we use is Grafana Alloy.
Grafana Alloy is:
- Grafana’s distribution of the OpenTelemetry Collector
- Introduced at GrafanaCON 2024
- Designed as a single, unified telemetry agent
Built by Grafana
What Makes Grafana Alloy Special?
Grafana Alloy:
- Fully compatible with OpenTelemetry (OTLP)
- Includes built-in Prometheus optimization
- Supports:
  - Metrics
  - Logs
  - Traces
  - Profiles
It can receive telemetry from:
- OpenTelemetry SDKs
- Prometheus exporters
- Linux / Windows
- Kubernetes
- Java / .NET
- Databases (Postgres, etc.)
- Cloud providers
It can send data to:
- Prometheus
- Grafana Loki (logs)
- Grafana Tempo (traces)
- Other OTLP backends
Alloy replaces Promtail, exporters, and multiple agents with one tool.
Push vs Pull: Key Concept Shift
So far in this course, we mostly used Prometheus, which is:
- Pull-based
- Scrapes targets on intervals
OpenTelemetry is different:
- Push-based
- Telemetry is sent outward
This changes how we design the pipeline.
Why We Start From the Backend
Because OpenTelemetry pushes, we must configure destinations first.
In our case:
- Backend = Prometheus
Prometheus must accept incoming data via Remote Write.
Prometheus Remote Write (Critical Concept)
Remote Write allows Prometheus to:
- Send metrics to:
  - Another Prometheus
  - Long-term storage
  - OpenTelemetry collectors
How It Works Internally
- Prometheus scrapes metrics
- Samples are appended to the write-ahead log (WAL) on disk
- Samples are also stored in Prometheus's local TSDB
- The remote-write process reads (tails) the WAL
- Metrics are pushed to the remote system
Remote Write Endpoint
http://<prometheus-host>:<port>/api/v1/write
Same host and port as the Prometheus UI, just a different path.
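Note that a plain Prometheus server only accepts data on this endpoint if the remote write receiver is enabled. On recent versions this is a startup flag (a sketch; the exact flag depends on your Prometheus version, and older releases used --enable-feature=remote-write-receiver instead):
./prometheus --config.file=prometheus.yml --web.enable-remote-write-receiver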
How This Fits with OpenTelemetry Collectors
Think of the OpenTelemetry Collector (or Alloy) as:
- A smart receiver
- A processor
- A forwarder
Internally, it can:
- Receive OTLP
- Receive Prometheus metrics
- Forward via remote write
- Fan-out to multiple backends
Mental Model (Very Important)
- Prometheus → Pull-based
- OpenTelemetry → Push-based
- Grafana Alloy → Bridge between worlds
Prometheus scraping still works
OpenTelemetry push pipelines still work
Alloy ties everything together
Installing Grafana Alloy on macOS (Step by Step)
In this lecture, we’ll install Grafana Alloy on a Mac computer and understand how its configuration works.
Most engineers use macOS for local development, so this is a very common setup.
1. Prerequisite: Homebrew
Grafana Alloy is installed using Homebrew, the macOS package manager.
Check if Homebrew is Installed
Open Terminal and run:
brew --version
- If you see Homebrew with a version number → you're good
- If not → install Homebrew
Install Homebrew (if missing)
Go to:
https://brew.sh
Follow the instructions shown on the website.
2. Install Grafana Alloy
Once Homebrew is installed, run:
brew tap grafana/grafana
brew install grafana/grafana/alloy
This will:
- Download Grafana Alloy
- Install it as a system binary
The installation may take a few minutes.
3. Start Grafana Alloy as a Service
After installation:
brew services start alloy
This runs Alloy as a background service, which is how we want it for observability components.
You can verify it’s running with:
brew services list
4. Grafana Alloy Configuration File Location
Grafana Alloy reads its configuration from:
/usr/local/etc/alloy/config.alloy
(On Apple Silicon Macs, Homebrew uses the /opt/homebrew prefix, so the file may instead be at /opt/homebrew/etc/alloy/config.alloy.)
Open it with:
nano /usr/local/etc/alloy/config.alloy
Default State
After installation, the config file:
- Contains only logging configuration
- Does not collect or export anything yet
That’s expected.
We now need to define:
- Receivers
- Processors
- Exporters
5. Grafana Alloy Architecture Refresher
Every OpenTelemetry collector (including Alloy) follows this model:
1. Receivers
- Receive signals (metrics, logs, traces)
2. Processors
- Transform data (batching, filtering, aggregation)
3. Exporters
- Send data to backends (Prometheus, Loki, Tempo, etc.)
Grafana Alloy uses components, and each component has:
- A type
- A name
- An input/output connection
You can chain components together like a pipeline.
6. Example: Alloy Configuration for Metrics (OTLP → Prometheus)
Below is a minimal, working example to receive metrics via OpenTelemetry and send them to Prometheus using remote_write.
Receiver: OTLP (metrics)
otelcol.receiver.otlp "default" {
protocols {
http {}
}
output {
metrics = [otelcol.processor.batch.default.input]
}
}
- Listens on port 4318 (HTTP/Protobuf)
- Receives metrics only
Processor: Batch
otelcol.processor.batch "default" {
output {
metrics = [otelcol.exporter.prometheus.default.input]
}
}
- Groups metrics for efficiency
- Almost always recommended
Exporter: Prometheus Remote Write
otelcol.exporter.prometheus "default" {
forward_to = [prometheus.remote_write.default.receiver]
}
Prometheus Remote Write Target
prometheus.remote_write "default" {
endpoint {
url = "http://localhost:9090/api/v1/write"
basic_auth {
username = "admin"
password = "admin"
}
}
}
Important:
- Uses Prometheus remote write
- Endpoint is always:
/api/v1/write
7. Restart Grafana Alloy
After editing the config file:
brew services restart alloy
Always restart Alloy after config changes.
8. Grafana Alloy Web UI
Grafana Alloy exposes a web interface on:
http://localhost:12345
What You’ll See
- List of configured components
- Health status of each component
- Graph view showing data flow
If everything is configured correctly:
- All components show Healthy
- The graph shows signal flow from receiver → processor → exporter
9. Sending Metrics from a Microservice (OTLP)
To demonstrate OTLP ingestion, we use a simple .NET microservice.
Key Points
- Uses OpenTelemetry .NET SDK
- Sends metrics to Alloy via:
http://localhost:4318/v1/metrics
- Uses HTTP/Protobuf
- Only metrics (to keep it simple)
OTLP Ports Reminder
| Protocol | Port |
|---|---|
| gRPC | 4317 |
| HTTP/Protobuf | 4318 |
Example Metric Behavior
- Counter name:
otel_order - Appears in Prometheus as:
otel_order_total
You can verify in Prometheus:
rate(otel_order_total[5m])
10. Ingesting Logs with Grafana Alloy (Two Approaches)
Grafana Alloy supports two ways to ingest logs into Loki.
Option 1: Native Loki Components (Stable)
Loki Writer
loki.write "local" {
endpoint {
url = "http://loki:3100/loki/api/v1/push"
}
}
File Log Source
loki.source.file "default" {
targets = [{
__path__ = "/var/log/shoehub/*.log",
app = "shoehub"
}]
forward_to = [loki.write.local.receiver]
}
✅ Recommended for production
✅ Stable
✅ Simple
Option 2: OpenTelemetry-Based Log Ingestion (Preview)
⚠️ Important
At the time of recording, this method is experimental / preview.
File Log Receiver (OTel)
otelcol.receiver.filelog "default" {
include = ["/var/log/shoehub/*.log"]
output {
logs = [otelcol.exporter.loki.default.input]
}
}
Loki Exporter (OTel)
otelcol.exporter.loki "default" {
forward_to = [loki.write.local.receiver]
}
When to Use Each
| Method | Use Case |
|---|---|
| Loki native components | Production, simplicity |
| OpenTelemetry logs | Unified OTLP pipelines, experimentation |
11. Key Takeaways
- Grafana Alloy is installed easily on macOS via Homebrew
- Configuration is component-based
- Alloy supports metrics, logs, and traces
- Metrics → Prometheus via remote_write
- Logs → Loki via:
  - Native Loki components (recommended)
  - OpenTelemetry pipelines (preview)
Installing Grafana Alloy on Ubuntu (Easy & Safe Method)
In this lecture, we’ll install Grafana Alloy on an Ubuntu server.
The steps themselves are simple, but because Ubuntu requires:
- updating keyrings
- updating APT repositories
- adding Grafana’s package source
there’s always a risk of mistakes due to typos.
To avoid that, I’ve provided a ready-made install script in the course GitHub repository.
1. Use the Provided Installation Script (Recommended)
In the GitHub repository for this course:
Grafana-Udemy/
└── alloy/
└── install.sh
Steps
- Open the file on GitHub
- Click Raw
- Copy the URL from your browser
Now, on your Ubuntu server:
wget <PASTE_RAW_FILE_URL_HERE>
Verify it downloaded:
ls
Now run it:
sudo sh install.sh
This script:
- Adds Grafana’s APT repository
- Updates keyrings
- Installs Grafana Alloy
- Enables and starts Alloy as a service
You can run these commands manually, but the script does everything safely for you.
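For reference, the script is roughly equivalent to the following manual steps (a sketch based on Grafana's public APT repository; check the script in the course repo for the exact commands):
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y alloy
sudo systemctl enable --now alloy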
2. Access the Alloy Configuration Directory
After installation, Alloy is installed under:
/etc/alloy
Check it:
ls /etc
You should see the alloy directory.
Because Alloy was installed with sudo, your current user may not own this directory.
Fix ownership:
sudo chown -R $(whoami):$(whoami) /etc/alloy
Now enter the directory:
cd /etc/alloy
You’ll find:
config.alloy
Open it:
nano config.alloy
This file contains a basic template.
We will extend it with receivers, processors, and exporters.
3. Why Grafana Mimir Exists (Important Context)
Large companies generate millions or billions of metrics per day.
Prometheus:
- Is single-node
- Is not horizontally scalable
- Is not suitable for long-term storage
That’s why Grafana Mimir exists.
What Grafana Mimir Provides
- High availability
- Horizontal scalability
- Long-term storage (S3, GCS, Azure Blob, etc.)
- Extremely fast queries
- Multi-tenancy
- 100% PromQL compatibility
- Remote Write API
Mimir extends Prometheus; it does not replace it.
4. Grafana Mimir – High-Level Architecture
Write Path (Ingestion)
- Prometheus scrapes metrics
- Prometheus remote_write pushes metrics to:
/api/v1/push
- Mimir Distributor receives data
- Data flows to Ingesters
- Data is written to object storage (S3 / filesystem)
- Compactor deduplicates and optimizes blocks
Read Path (Querying)
- Query → Query Frontend
- Cache lookup
- Query Scheduler (optional)
- Querier reads:
- Object storage
- Ingesters (for recent data)
- Results returned to Grafana
5. Installing Grafana Mimir Locally (Monolithic Mode)
This setup is for learning only, not production.
Supported Platforms
- macOS (Intel or Apple Silicon)
- Linux
- ❌ Windows not supported
6. Download Grafana Mimir Binary
Go to:
https://github.com/grafana/mimir/releases
Under Assets, download the correct binary:
| Platform | Binary |
|---|---|
| macOS Intel | darwin-amd64 |
| macOS Apple Silicon | darwin-arm64 |
| Ubuntu | linux-amd64 |
| Debian | .deb |
7. Download Using curl (Recommended)
Create a directory:
mkdir mimir
cd mimir
Download:
curl -L <MIMIR_BINARY_URL> -o mimir
Make it executable:
chmod +x mimir
Test it:
./mimir
Stop it with Ctrl+C.
⚠ macOS users:
If macOS blocks execution, go to:
Settings → Privacy & Security → Allow Anyway
8. Create Mimir Configuration File (Monolithic Mode)
Create a file:
nano config.yaml
Minimal Working Config (Single-Tenant)
multitenancy_enabled: false
server:
http_listen_port: 9000
common:
storage:
backend: filesystem
filesystem:
dir: ./data/common
blocks_storage:
backend: filesystem
filesystem:
dir: ./data/blocks
ingester:
ring:
replication_factor: 1
Why These Settings Matter
- multitenancy_enabled: false → no org headers required
- replication_factor: 1 → single-node mode
- filesystem backend → local learning setup
9. Start Mimir with Config
./mimir -config.file=config.yaml
You should see logs indicating:
- HTTP server started
- Listening on port 9000
10. Configure Prometheus Remote Write → Mimir
Edit prometheus.yml:
remote_write:
- url: http://localhost:9000/api/v1/push
Restart Prometheus.
Now:
- Metrics scraped by Prometheus
- Automatically pushed to Mimir
11. Using Grafana with Mimir
Important: Mimir Uses Prometheus API
In Grafana:
- Do NOT create a “Mimir” data source
- Create a Prometheus data source
Data Source URL
http://localhost:9000/prometheus
Grafana uses the Prometheus API compatibility layer.
12. Docker-Based Local Setup (Optional)
The course repository contains:
Grafana-Udemy/
└── docker/
Inside is a wrapper script that:
- Creates directories
- Sets permissions
- Runs Docker Compose safely
Run:
./run.sh
This will start:
- Grafana
- Mimir
- Supporting services
13. Verify Metrics in Grafana
Go to Explore → Data Source → Prometheus (Mimir)
Search for the application metrics, for example with a query such as:
{__name__=~"shoehub_.*"}
You should see all metrics previously scraped by Prometheus.
Key Takeaways
- Grafana Alloy on Ubuntu is easiest via the provided script
- Grafana Mimir solves Prometheus scalability limitations
- Mimir is PromQL compatible
- Prometheus pushes metrics to Mimir via remote_write
- Grafana queries Mimir using the Prometheus API
Grafana Mimir Multi-Tenancy Explained (Step by Step)
Now that we have correctly installed and set up Grafana Mimir, let’s learn how multi-tenancy works and how to configure it properly.
1. What Is Multi-Tenancy in Grafana Mimir?
Multi-tenancy allows multiple isolated organizations (tenants) to store metrics in the same Mimir cluster, while keeping their data fully separated.
Each tenant:
- Has its own metric namespace
- Cannot see data from other tenants
- Is identified by an HTTP header
2. High-Level Architecture Example
Imagine a company with two departments:
- IT Department
- Sales Department
Each department:
- Uses a different application
- Produces different metrics
- Must not see each other’s data
Important Rule
One Prometheus instance can write to only ONE tenant ID
So to separate tenants:
- IT → Prometheus #1
- Sales → Prometheus #2
Both Prometheus instances push metrics to the same Mimir cluster, but with different tenant IDs.
3. How Tenant Identification Works
Grafana Mimir identifies tenants using an HTTP header:
X-Scope-OrgID
⚠️ Case-sensitive
⚠️ Must be identical everywhere it’s used
Examples:
X-Scope-OrgID: it
X-Scope-OrgID: sales
4. Enable Multi-Tenancy in Mimir
By default, in our earlier setup, multi-tenancy was disabled.
Update config.yaml
multitenancy_enabled: true
⚠️ If this remains false:
- Tenant headers are ignored
- All data goes into the anonymous tenant
After changing this:
restart mimir
5. Configure Prometheus Remote Write (Per Tenant)
Each Prometheus instance must include its own tenant header.
Example: IT Prometheus
remote_write:
- url: http://localhost:9000/api/v1/push
headers:
X-Scope-OrgID: it
Example: Sales Prometheus
remote_write:
- url: http://localhost:9000/api/v1/push
headers:
X-Scope-OrgID: sales
Restart Prometheus after changes.
6. Configure Grafana Data Sources (Per Tenant)
Grafana does not auto-detect tenants.
You must create one Prometheus data source per tenant.
Add Data Source → Prometheus
URL:
http://localhost:9000/prometheus
HTTP Headers:
X-Scope-OrgID = it
Save.
Repeat for sales.
7. Result
- IT dashboards → IT metrics only
- Sales dashboards → Sales metrics only
- Same Mimir backend
- Full isolation
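You can also verify the isolation directly against Mimir's Prometheus-compatible API by passing the tenant header (a sketch assuming the single-node setup on port 9000):
curl -H "X-Scope-OrgID: it" http://localhost:9000/prometheus/api/v1/label/__name__/values
curl -H "X-Scope-OrgID: sales" http://localhost:9000/prometheus/api/v1/label/__name__/values
Each call should return only the metric names pushed by that tenant's Prometheus.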
Common Storage vs Block Storage in Grafana Mimir
Understanding storage types is critical for production setups.
8. Common Storage (Metadata & Internal State)
Used for:
- Ruler
- Alertmanager
- Compactor metadata
- Admin APIs
- Internal coordination
⚠️ Not metric data
9. Block Storage (Metrics Data)
Used for:
- Time-series metrics
- Long-term retention
- Compaction
- Querying
This is where actual Prometheus metrics live.
10. Storage Backends
You can use:
- Local filesystem (learning only)
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
👉 In production, S3/GCS/Azure is mandatory
Configuring AWS S3 for Grafana Mimir
11. Required AWS Resources
You need:
- 2 S3 buckets minimum
- Common storage
- Block storage
- IAM user OR IAM role
- IAM policy
Optional:
- Third bucket for Alertmanager
12. IAM Policy
A ready-made policy is provided in the course GitHub repo:
mimir/
└── iam-policy.json
⚠️ Remove comments (#) before pasting into AWS
Create policy:
- IAM → Policies → Create Policy → JSON
- Paste policy
- Save
13. Create IAM User
- IAM → Users → Create User
- Attach policy
- Generate Access Key + Secret Key
If using EC2 or EKS IAM roles, do NOT use access keys.
14. Create S3 Buckets
Example:
- grafana-mimir-common
- grafana-mimir-blocks
Keep:
- Private access
- No public access
15. Mimir S3 Configuration Example
common:
storage:
backend: s3
s3:
bucket_name: grafana-mimir-common
endpoint: s3.amazonaws.com
region: ap-southeast-2
access_key_id: <KEY>
secret_access_key: <SECRET>
blocks_storage:
backend: s3
s3:
bucket_name: grafana-mimir-blocks
endpoint: s3.amazonaws.com
region: ap-southeast-2
access_key_id: <KEY>
secret_access_key: <SECRET>
16. Optional: S3 Bucket Policy (Extra Security)
You can restrict bucket access only to the IAM user created for Mimir.
Templates are provided in GitHub:
mimir/
└── s3-bucket-policy.json
Grafana Mimir Microservices Architecture (Production)
17. Why Microservices Mode?
Benefits:
- Scale reads and writes independently
- High availability
- Fault isolation
- Production-grade reliability
Core components:
- Distributor
- Ingester
- Querier
- Query Frontend
- Ruler
- Compactor
- Store Gateway
18. Service Discovery Options
Option 1: Load Balancers
- Each service behind an LB
- Simpler but expensive
Option 2: KV Store (Recommended)
- Consul
- etcd
- Memberlist (testing only)
Memberlist = labs only
Consul/etcd = production
19. Kubernetes Is the Recommended Platform
Why?
- Built-in service discovery
- Load balancing
- Scaling
- Secrets
- Security
- Observability
In this course:
- We use Kubernetes
- We deploy Mimir via Helm
- We use memberlist for labs
Preparing Kubernetes Locally (macOS)
20. Required Tools
Homebrew
brew --version
Install if missing.
Docker Desktop
Required for Minikube.
Minikube
brew install minikube
minikube start --driver=docker
If HyperKit error:
minikube delete
Enable Add-ons
minikube addons enable ingress
minikube addons enable dashboard
Verify Cluster
kubectl get nodes
Helm
brew install helm
21. Kubernetes Dashboard (Optional)
minikube dashboard
1. Add Grafana Helm Repository
Helm needs access to Grafana’s official charts.
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
If the repo already exists, Helm will skip adding it.
2. Create a Custom values.yaml
Grafana Mimir must not be deployed with default values.
We must provide a custom values file that includes:
- Mimir configuration (structuredConfig)
- Storage configuration
- Replicas for microservices
File name (example)
custom-values.yaml
3. Add Mimir Configuration to Helm Values
Inside custom-values.yaml:
mimir:
structuredConfig:
# Paste your existing Mimir config here
This config is the same one you used earlier when running Mimir locally, but now embedded into Helm.
4. Configure Microservice Replicas
Because we are deploying distributed Mimir, we must define replicas at the root level of the values file.
Example:
ingester:
replicas: 2
querier:
replicas: 2
distributor:
replicas: 2
store_gateway:
replicas: 2
compactor:
replicas: 1
These are Kubernetes Pods—each component scales independently.
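Combining the two pieces, a minimal custom-values.yaml might look like this (a sketch; the structuredConfig body is abbreviated and should contain your full Mimir configuration):
mimir:
  structuredConfig:
    multitenancy_enabled: true
    # ... storage and other settings from your config.yaml ...
ingester:
  replicas: 2
querier:
  replicas: 2
distributor:
  replicas: 2
store_gateway:
  replicas: 2
compactor:
  replicas: 1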
5. Install Mimir Using Helm
Navigate to the folder containing custom-values.yaml and run:
helm install mimir grafana/mimir-distributed \
--namespace mimir \
--create-namespace \
-f custom-values.yaml
If successful, Helm will print:
Welcome to Grafana Mimir
6. Verify Kubernetes Resources
Check Pods
kubectl get pods -n mimir
You should see:
- distributor
- ingester
- querier
- query-frontend
- compactor
- store-gateway
- alertmanager
- ruler
Each pod is one Mimir microservice.
Check Services
kubectl get svc -n mimir
All services are ClusterIP because:
- They are internal Kubernetes services
- External access requires port-forwarding or ingress
7. Access Mimir Services
Option 1: Port Forward (Recommended for Labs)
Example: Expose distributor
kubectl port-forward svc/mimir-distributor 9009:9009 -n mimir
Prometheus remote_write must point to:
http://localhost:9009/api/v1/push
Option 2: Minikube Service URL
minikube service mimir-distributor -n mimir
Minikube will generate a temporary URL.
8. Update Prometheus Remote Write
In prometheus.yml:
remote_write:
- url: http://localhost:9009/api/v1/push
Restart Prometheus.
Now Prometheus writes metrics directly to Grafana Mimir.
Why Use Grafana Mimir Alerting Instead of Prometheus Alertmanager?
Prometheus cannot scale alerting properly at enterprise scale and does not support multi-tenancy.
Grafana Mimir:
- Scales alert evaluation
- Supports multi-tenant alert isolation
- Avoids duplicate alerts
- Centralizes alert routing
9. Alerting Architecture in Mimir
Components
| Component | Responsibility |
|---|---|
| Ruler | Evaluates alert rules |
| Alertmanager | Deduplicates and routes alerts |
Flow:
Metrics → Ruler → Alertmanager → Slack / Email / PagerDuty
10. Alert Rules & Alertmanager Files
Mimir uses standard Prometheus files:
- Rule files (rules.yaml)
- Alertmanager config (alertmanager.yaml)
Nothing new to learn—same syntax.
11. Storage Requirements for Alerting
Ruler and Alertmanager must persist state.
They need storage:
- Filesystem (labs only)
- S3 / GCS / Azure (production)
12. Required Configuration Sections (Critical)
These are not clearly documented by Grafana, but mandatory.
You must configure four sections:
ruler:
ruler_storage:
alertmanager:
alertmanager_storage:
13. Example: Ruler Configuration
ruler:
enable_api: true
ruler_storage:
backend: filesystem
filesystem:
dir: /data/ruler
enable_api: true is required to push rules using mimirtool.
14. Example: Alertmanager Configuration
alertmanager:
enable_api: true
fallback_config_file: /configs/alertmanager.yaml
alertmanager_storage:
backend: filesystem
filesystem:
dir: /data/alertmanager
15. Rule File Structure (Per Tenant)
Example: tenant-one-rules.yaml
groups:
- name: traffic-alerts
interval: 30s
rules:
- alert: HighErrorRate
expr: |
rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 1m
labels:
severity: critical
annotations:
summary: "High HTTP error rate detected"
Each tenant has its own rule file.
16. Alertmanager Configuration Example (Slack)
global:
slack_api_url: https://hooks.slack.com/services/XXXX
route:
receiver: main
receivers:
- name: main
slack_configs:
- channel: "#alerts"
Loading Rules & Alertmanager Using mimirtool
17. Download mimirtool
From:
https://github.com/grafana/mimir/releases
Choose:
- macOS → Darwin
- Intel → amd64
- Apple Silicon → arm64
Make executable:
chmod +x mimirtool
18. Start Mimir with All Components (Monolithic Mode)
mimir \
-target=alertmanager,distributor,ingester,querier,ruler,store-gateway,compactor \
-config.file=config.yaml
19. Load Rules into Ruler
mimirtool rules sync \
--address=http://localhost:9009 \
--id=tenant-one \
tenant-one-rules.yaml
Output example:
1 group created, 0 updated, 0 deleted
20. Verify Rules
mimirtool rules list \
--address=http://localhost:9009 \
--id=tenant-one
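The per-tenant Alertmanager configuration from section 16 can be uploaded in a similar way (a sketch; assumes the alertmanager load subcommand available in current mimirtool releases):
mimirtool alertmanager load alertmanager.yaml \
  --address=http://localhost:9009 \
  --id=tenant-one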
21. Query Alerts via API (Current Workaround)
Due to a known API issue:
curl http://localhost:9009/alertmanager/api/v2/alerts \
-H "X-Scope-OrgID: tenant-one"
States:
- inactive
- pending
- firing
22. Alerts Delivered (Slack Example)
Slack receives alerts as expected once conditions are met.
Final Summary
You now know how to:
- Deploy Grafana Mimir on Kubernetes
- Configure multi-tenant metrics
- Enable enterprise-grade alerting
- Load rules and alertmanager configs via API
- Use GitOps-style automation
Creating Alert Rules in Grafana
To work with alert rules in Grafana:
- Open the Grafana menu
- Go to Alerting
- Click Alert rules
You will notice that Grafana already has several alert rules.
These are internal alerts used by Grafana to monitor the health of its own components.
Creating a New Alert Rule
To create your own alert rule:
- Click New alert rule
- Select a data source
- Write a query manually or use the Query Builder
You can also reuse existing dashboard queries, which is often the best approach.
Creating an Alert from an Existing Panel
If you already have a panel with the query you want:
- Open the dashboard
- Click the panel
- Choose Edit
- Go to the Alert tab
- Click Create alert rule
Grafana will automatically:
- Copy the query into the alert rule
- Use the panel name as the alert name (you can change it)
This ensures the alert logic stays consistent with the visualization.
Query Evaluation Window & Frequency
When you click Run queries, Grafana shows how the query behaves.
Key concepts:
- Evaluation window (default: last 5 minutes) Grafana evaluates data over a time range to avoid false alerts.
- Evaluation interval (default: 15 seconds) How often Grafana checks the condition.
You can change these:
- 5 minutes → 15 min / 30 min
- 15s → 30s / 1m
Why We Need Expressions
Alerts need a single value to evaluate.
Raw queries usually return multiple data points, so we must reduce them.
Grafana alert rules typically require:
- Reduce expression
- Threshold or Math expression
Adding a Reduce Expression
- Click Add expression
- Select Reduce
- Input: A (your query)
- Reducer:
  - Last (default)
  - Mean, Sum, Count, etc.
- Mode: Strict
This converts many data points into one value.
Grafana assigns this expression the name B.
Adding a Threshold Expression
- Click Add expression
- Select Threshold
- Input: B (the reduced value)
- Example:
- Condition: Below
- Value: 400
Grafana now draws a red threshold line.
- Expression evaluates to 1 → alert fires
- Expression evaluates to 0 → no alert
Previewing Alert Behavior
Click Preview to see:
- Reduced values
- Threshold evaluation results
This helps verify that the alert logic works before saving.
Evaluation Group & Pending Period
Alerts must belong to an Evaluation Group.
- Create a new group (example: card-payments)
- Evaluation interval: e.g. 20s
- Pending period: e.g. 1m
Why this matters:
- Prevents alerts caused by short spikes
- Condition must be violated continuously before firing
Labels, Summary, and Runbooks
You should always add labels:
team = tech
Labels are critical because:
- Notification policies use them for routing
- Silences match against labels
You can also add:
- Summary
- Description
- Runbook URL (Confluence, SharePoint, GitHub)
Saving the Alert Rule
After saving:
- Green heart → healthy
- Orange → pending evaluation
- Red broken heart → alert firing
Once the condition remains violated long enough, the alert becomes firing.
Sending Alert Notifications
Alerts do not send notifications automatically.
You must configure:
- Contact points
- Notification policies
Email Notifications (Using Mailtrap)
If you don’t have a real SMTP server, use Mailtrap.
Configure Grafana SMTP
Edit grafana.ini (macOS example):
/usr/local/etc/grafana/grafana.ini
Under [smtp]:
- Remove the leading ; (uncomment the lines)
- Set enabled = true
- Provide the host, username, and password from Mailtrap
Restart Grafana.
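For reference, a minimal [smtp] section might look like this (a sketch; replace the host and credentials with the values Mailtrap shows for your inbox):
[smtp]
enabled = true
host = <MAILTRAP_SMTP_HOST>:<PORT>
user = <MAILTRAP_USERNAME>
password = <MAILTRAP_PASSWORD>
from_address = grafana-alerts@example.com
from_name = Grafana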
Creating a Contact Point
- Go to Alerting → Contact points
- Create new contact point
- Type: Email
- Name: L2-support
- Add email addresses
- Save
Status will show Unused until a policy references it.
Creating a Notification Policy
- Go to Alerting → Notification policies
- Do not modify the default policy
- Create a nested policy
Example matchers:
team = tech
Choose contact point:
L2-support
Save the policy.
Alerts now route correctly.
Slack Notifications (Recommended)
Email is slow. Teams usually use Slack, Teams, PagerDuty, or OpsGenie.
Creating a Slack Webhook
- Open Slack workspace
- Go to Apps
- Search Incoming Webhooks
- Install
- Select channel
- Copy Webhook URL
Create Slack Contact Point in Grafana
- Go to Contact points
- Create new
- Type: Slack
- Channel name
- Paste Webhook URL
- Send test message
- Save
Update Notification Policy
Edit your policy:
- Replace email contact point
- Select Slack
Now all firing alerts appear in Slack.
Silences (Suppress Notifications Temporarily)
Silences:
- Do not stop alert evaluation
- Only suppress notifications
You can silence by:
- Time range
- Labels
- Alert name
- Team
- Country code
Example:
team = tech
alertname = LowCardPayment
Annotations in Grafana
Annotations do not trigger notifications.
They are used to mark events on graphs.
Creating Annotations
Single Point Annotation
- Windows: Ctrl + Click
- Mac: Cmd + Click
Example:
Bad deployment
Tags: api, deployment
Range Annotation
- Hold Ctrl (Windows) / Cmd (Mac)
- Drag across graph
- Example:
Marketing campaign
Managing Annotations
Annotations are stored in Grafana's own database.
You can:
- Disable annotations in Dashboard settings
- Re-enable anytime
Important:
- Save the dashboard first
- Unsaved dashboards cannot store annotations
Key Takeaways
- Alert rules evaluate queries → expressions → thresholds
- Reduce expressions are mandatory
- Notifications require contact points + policies
- Labels control routing and silencing
- Slack is preferred over email
- Silences suppress notifications, not alerts
- Annotations document events, not incidents
AI in Observability and Grafana
One of the biggest technological breakthroughs of recent years, if not decades, is the rise of Large Language Models (LLMs) in machine learning, commonly referred to as AI.
Tools such as ChatGPT, Google Gemini, and Microsoft Copilot are interfaces that allow us to interact with these models.
Many companies are actively exploring how to integrate AI into their products to make them smarter, more efficient, and easier to operate—and Grafana is no exception.
In this section, we explore:
- What AI features exist in Grafana
- What is available in Grafana Cloud vs Open Source
- How we can use AI even without Grafana Cloud
Challenges in Traditional Observability
Large, distributed systems—especially microservice architectures—face major observability challenges:
1. High Signal Volume
- Massive amounts of metrics, logs, and traces
- Difficult to manually analyze or correlate
2. Alert Fatigue
- Static thresholds generate too many alerts
- Engineers start ignoring alerts
3. Root Cause Analysis Is Hard
- Alerts tell what failed, not why
- Correlating signals across systems is time-consuming
4. Static Thresholds Are Inaccurate
- Workloads are non-linear
- Example:
  - E-commerce traffic spikes during holidays
  - Quiet periods during off-season
- Fixed thresholds cause:
  - False positives
  - Missed incidents
How AI Improves Observability
AI brings capabilities beyond traditional threshold-based monitoring:
- Real-time anomaly detection
- Pattern recognition over historical data
- Correlation across metrics, logs, and traces
- Predictive incident detection
- Log summarization
- Actionable insights and recommendations
Large Language Models are especially good at prediction and pattern recognition, which makes them well-suited for observability use cases.
AI Capabilities in Grafana
Grafana Cloud (Managed / Enterprise)
Out-of-the-box AI features include:
- Adaptive alerts using ML models
- Automatic anomaly detection
- Alert tuning to reduce false positives
- Intelligent alert correlation
- Incident noise reduction
- ML-powered insights across metrics, logs, and traces
These features are not available in Open Source Grafana.
Grafana Open Source (OSS) Limitations
Grafana OSS:
- ❌ No built-in machine learning
- ❌ No adaptive alerts
- ❌ No automatic anomaly detection
- ❌ No native log summarization
However…
Grafana OSS is extensible, and this is where AI can still be leveraged.
Using AI with Grafana Open Source
There are two practical approaches:
1. External AI Tools (No Coding)
You can use:
- ChatGPT
- Google Gemini
- Microsoft Copilot
Use them to:
- Analyze metrics
- Explain PromQL queries
- Interpret logs
- Generate alert rules
- Suggest root causes
This approach requires good prompt engineering.
2. Grafana Plugin + AI APIs (Advanced)
If you know coding:
- Build a Grafana plugin
- Integrate with OpenAI, Gemini, etc.
- Display AI insights directly in Grafana
Grafana provides the Grafana LLM plugin for secure AI access.
Prompt Engineering for Observability
To get useful results from AI, structure your prompts properly.
Key Prompting Techniques
1. Contextual Framing
Provide system context:
Metric: http_request_duration_seconds
Source: Node Exporter
99th percentile is very high
What could cause this?
2. Few-Shot Prompting
Provide examples:
This is a valid PromQL:
rate(http_requests_total[5m])
Now create one filtering status=500 and method=POST
3. Chain-of-Thought
Ask for step-by-step reasoning:
Explain step by step how to debug CPU usage flatlining in Grafana
4. Output Format Control
Specify format:
Output must be PromQL only
5. Persona Prompting
Assign a role:
You are a Site Reliability Engineer
How would you configure disk IO alerts?
6. Scoped Prompting
Limit responses:
Give only one PromQL query to detect CPU > 80% per pod
Practical Example: Metric Analysis with ChatGPT
You expose metrics at:
/metrics
Instead of manually reading hundreds of metrics, you can:
Prompt:
You are a Prometheus and .NET expert.
Analyze these metrics and list the most important ones with explanations.
Result:
- Metric name
- Type (counter, gauge, histogram)
- Meaning
- Usage recommendations
This can save 30–60 minutes of manual analysis.
AI via Grafana Plugin (Advanced)
Grafana plugins:
- Frontend: React
- Backend: Go
Grafana provides the Grafana LLM plugin:
- Stores API keys securely
- Plugins never access AI APIs directly
- Supports OpenAI, Gemini, etc.
Example Plugin: Alert AI Assistant
What the plugin does:
- Uses Grafana APIs to:
- Fetch firing alerts
- Extract alert queries and thresholds
- Builds a structured AI prompt:
You are a senior SRE.
An alert fired with these details...
- Sends request through Grafana LM
- Displays:
- Severity classification
- Root cause analysis
- Remediation steps
Result:
Faster MTTR and better on-call experience.
Grafana Administration Overview
Grafana is organized around Organizations.
Core Concepts
- Organizations: isolation boundary
- Users: belong to organizations
- Teams: group users
- Dashboards: organization-scoped
- Data Sources: organization-scoped
- Service Accounts: replace API keys for automation
Managing Organizations
- Default org: Main Org
- You can create additional orgs (e.g., DevOps)
Each organization has:
- Separate dashboards
- Separate data sources
- Separate users
Switch orgs using the dropdown.
Managing Users
Two ways:
- Create users manually
- Invite users via email
Roles:
- Viewer
- Editor
- Admin
- No basic role
Users can belong to multiple organizations with different roles.
Teams
Teams:
- Group users
- Simplify permissions
- Can override preferences (theme, timezone)
Admins can:
- Add/remove users
- Delete teams
Organization Isolation
Dashboards and data sources are not shared across orgs.
This is critical for:
- Multi-team environments
- Security boundaries
LDAP / Active Directory Authentication
Grafana supports LDAP authentication, commonly used with:
- Active Directory
- Apache Directory Services
- Other LDAP-compatible directories
LDAP Authentication Flow
- User enters credentials in Grafana
- Grafana binds to LDAP using bind user
- LDAP validates credentials
- Grafana logs user in
Required LDAP Components
- Domain Controller (or LDAP server)
- Bind user (non-admin)
- User accounts
- Optional: group-based role mapping
Grafana LDAP Configuration
- Enable LDAP in grafana.ini
- Configure ldap.toml
- Restart Grafana
Key fields:
- Host
- Port (389 default)
- Bind DN
- Bind password
- Search filter
- Base DN
Group-Based Role Mapping
Example:
CN=grafana-admins,DC=grafana,DC=local → Admin
This allows:
- Automatic role assignment
- Organization mapping
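A minimal configuration might look like this (a sketch assuming an Active Directory-style setup; adjust hosts, DNs, and filters for your directory). In grafana.ini:
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
And in ldap.toml:
[[servers]]
host = "dc1.grafana.local"
port = 389
bind_dn = "CN=grafana-bind,CN=Users,DC=grafana,DC=local"
bind_password = "<BIND_PASSWORD>"
search_filter = "(sAMAccountName=%s)"
search_base_dns = ["DC=grafana,DC=local"]

[[servers.group_mappings]]
group_dn = "CN=grafana-admins,DC=grafana,DC=local"
org_role = "Admin"

[[servers.group_mappings]]
group_dn = "*"
org_role = "Viewer"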
Final Outcome
After configuration:
- Users authenticate via Active Directory
- Roles assigned automatically
- Grafana access controlled centrally
- No local password management needed
Summary
- AI enhances observability beyond static thresholds
- Grafana Cloud offers built-in ML
- Grafana OSS can still leverage AI via:
  - External tools
  - Custom plugins
- Prompt engineering is critical
- Grafana supports strong enterprise administration:
  - Orgs
  - Teams
  - LDAP
  - Service accounts
External Authentication, High Availability, Scalability, and Grafana Playground
External Authentication in Grafana (Google OAuth)
Grafana supports external authentication, which means you do not have to create local Grafana users manually.
Instead, users can authenticate using external identity providers such as:
- Google (Google Workspace / Gmail)
- GitHub
- LDAP / Active Directory
- Other OAuth providers
In this lecture, we focus on Google authentication.
This is especially useful if your company uses Google Workspace, because administrators do not need to manage Grafana credentials manually.
Step 1: Create OAuth Credentials in Google
Make sure you have sufficient permissions.
From the left menu, select Credentials
Click Create Credentials → OAuth Client ID
Choose Web application
Set:
-
Name:
Grafana -
Authorized JavaScript origins
Example:
http://localhost:3000 -
Authorized redirect URIs
http://localhost:3000/login/google
- Click Create
You will receive:
- Client ID
- Client Secret
Step 2: Configure Grafana for Google Authentication
Edit your Grafana configuration file:
- grafana.ini, or
- custom.ini
Find the Google auth section:
[auth.google]
enabled = true
allow_sign_up = true
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
Restrict Access to Your Organization (Recommended)
Without restrictions, any Google user could log in.
To limit access to your Google Workspace domain:
allowed_domains = mycompany.com
Only users with emails like:
user@mycompany.com
will be allowed.
Step 3: Restart Grafana
After making changes, restart Grafana.
When you open Grafana again, you will see:
- Sign in with Google
Users logging in via Google will:
- Be created automatically
- Default to Viewer role
- Require admin approval for editor/admin privileges
Benefits of External Authentication
- No manual user creation
- Centralized identity management
- Improved security
- Less administrative overhead
High Availability (HA) in Grafana
When running Grafana in production, high availability is critical.
High availability means:
Grafana continues working even if one instance fails.
Grafana HA Architecture
A standard HA setup includes:
- 2 or more Grafana instances
- Load balancer in front
- Shared database (PostgreSQL or MySQL)
- Identical configuration across all instances
Flow:
Browser → Load Balancer → Grafana Instance
Shared Database Requirement
Grafana stores:
- Dashboards
- Alert rules
- Users
- Preferences
All Grafana instances must use the same database.
⚠️ The database itself must also be highly available
(e.g., primary + replica / failover).
Shared Configuration Requirement
Each Grafana instance reads grafana.ini.
If instances have different configs, behavior will be inconsistent.
Solution:
- Store
grafana.inion a shared network location - All instances read from the same file
Unified Alerting in HA Mode
Grafana includes an internal Alert Engine (Alertmanager).
Problem in HA:
- Each Grafana instance evaluates alerts
- Multiple instances may send duplicate notifications
Enable Alertmanager Clustering
In grafana.ini, enable unified alerting:
[unified_alerting]
enabled = true
Configure HA peers:
ha_peers = grafana1:9094,grafana2:9094
⚠️ Port 9094 is the Alertmanager port, not Grafana UI.
This allows Alertmanagers to:
- Communicate
- Elect a leader
- Send only one notification per alert
Limitation of Static HA Peers
If Grafana instances:
- Scale dynamically
- Use auto-scaling
- Have changing IPs
Hardcoding peers becomes impractical.
Scalability with Redis (Dynamic HA)
For dynamic environments, use Redis.
Redis acts as:
- Shared state store
- Peer discovery mechanism
Redis-Based HA Configuration
Instead of hardcoding peers, configure Redis:
[unified_alerting]
ha_redis_address = redis:6379
ha_redis_username = grafana
ha_redis_password = password
ha_redis_db = 0
ha_redis_prefix = grafana
Benefits:
- Supports auto-scaling
- Dynamic Grafana instances
- No manual peer management
Scaling Grafana
VM-Based Scaling
- AWS Auto Scaling Groups
- Azure VM Scale Sets
Container-Based Scaling
- Kubernetes
- Amazon ECS
- Azure Container Apps
Grafana Docker images scale well in container platforms.
Grafana Playground (Hands-On Lab)
To practice without installing anything locally, use the Grafana Playground hosted on Killercoda.
Key Characteristics
- Temporary environment (≈ 60 minutes)
- Ubuntu-based
- Docker-powered
-
Preconfigured stack:
- Grafana
- Prometheus
- Loki
- Tempo
- Example dashboards
How the Playground Works
- Open the playground URL
- Select Setup Grafana Stack
- Read instructions on the left
- Execute commands using clickable tooltips
- Validate each step
- Open Grafana and Prometheus via provided links
Login Credentials
- Username: admin
- Password: admin
(No need to change password – environment is temporary)
What You Can Explore
- Dashboards with real data
- Prometheus metrics
- Loki logs
- Tempo traces
- Service graph visualization
Everything works end-to-end.
Editing Configuration in the Playground
Use nano editor:
- Ctrl + W → Search
- Ctrl + O → Save
- Ctrl + X → Exit
After editing:
docker restart <container-name>
Example:
docker restart grafana
docker restart prometheus
Final Thoughts
Prometheus + Grafana form an excellent open-source observability stack, but they require:
- Careful deployment
- Ongoing maintenance
- Scaling considerations
They are ideal for:
- Backend services
- Infrastructure monitoring
However, frontend observability and zero-maintenance setups can be complex.