# What You Already Built
You already built a REAL observability platform:
```text
Nginx
↓
Access/Error Logs
↓
Grafana Alloy
↓
Grafana Loki
↓
Grafana
↓
SRE Engineer
```
This is extremely close to what companies use in production.
You now have:
* centralized logs
* log querying
* observability
* troubleshooting platform
* real SRE workflow
---
# What Is Observability?
Observability means:
```text
Understanding what is happening inside systems
by using:
- metrics
- logs
- traces
```
Three pillars:
| Pillar | Tool |
|---|---|
| Metrics | Prometheus |
| Logs | Loki |
| Traces | Tempo/OpenTelemetry |
Without observability:
- engineers guess
- outages take hours
- root cause unknown
With observability:
- engineers detect incidents fast
- correlate failures
- reduce downtime
---
# What Is Loki?
Grafana Loki is a centralized log storage system.
It stores:
- application logs
- nginx logs
- Kubernetes logs
- Docker logs
- Linux logs
- authentication logs
- API logs
Instead of SSHing into 100 servers and running:
```bash
cat /var/log/nginx/access.log
```
you centralize everything in Loki and search it from one place.
---
# Why Companies Use Loki
Before centralized logging:
```text
Server1 logs
Server2 logs
Server3 logs
Kubernetes pod logs
Docker logs
```
Every source has to be checked separately, which makes it impossible to troubleshoot quickly.
With Loki:
```text
All logs centralized
```
Engineers search:
```logql
{job="nginx"} |= "500"
```
Meaning:
- show nginx log lines that contain the text `500` (typically HTTP 500 errors)
- across all infrastructure
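The same search can also be run from a terminal through Loki's HTTP API, which is handy for scripts. The function below is only a sketch. It assumes Loki is reachable at `http://localhost:3100` (Loki's default port; this guide does not state the address) and that `curl` is installed. `loki_query`, `LOKI_URL`, and `CURL_BIN` are hypothetical names chosen for this example. Calling `loki_query '{job="nginx"} |= "500"' 10` should return JSON with up to 10 matching lines.

```bash
# Minimal sketch: run a LogQL query through Loki's HTTP API.
# Assumptions (not from this guide): Loki listens on http://localhost:3100.
# LOKI_URL and CURL_BIN are illustrative overrides.
loki_query() {
  # $1: LogQL expression, $2: maximum number of lines (default 20)
  # -G turns the url-encoded data into a GET query string.
  "${CURL_BIN:-curl}" -G -s \
    "${LOKI_URL:-http://localhost:3100}/loki/api/v1/query_range" \
    --data-urlencode "query=$1" \
    --data-urlencode "limit=${2:-20}"
}
```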
---
# What Is Alloy?
Grafana Alloy is the collector.
VERY IMPORTANT:
```text
Alloy DOES NOT STORE LOGS
```
It:
* reads
* collects
* forwards
Think of Alloy like:
```text
Log shipping agent
```
Alloy:
- reads files
- watches logs
- sends logs to Loki
---
# Real Production Pipeline
```text
Application
↓
Log file created
↓
Alloy watches file
↓
Alloy ships logs
↓
Loki stores logs
↓
Grafana visualizes logs
```
---
# Why SRE Engineers Need Loki + Alloy
As an SRE engineer, these are your responsibilities:
| Responsibility | Why Logs Matter |
| ----------------------- | --------------------- |
| Incident response | identify failures |
| Root cause analysis | determine WHY |
| Security investigations | detect attacks |
| Performance debugging | trace slow systems |
| Compliance | audit logs |
| Kubernetes debugging | inspect pod failures |
| API debugging | analyze requests |
| Authentication issues | detect login failures |
---
# Difference Between Metrics And Logs
## Metrics
Metrics answer:
```text
WHAT is wrong?
```
Example:
```text
CPU = 95%
```
## Logs
Logs answer:
```text
WHY is it wrong?
```
Example:
```text
database connection timeout
```
---
# Example Real Incident
Users say:
```text
Website slow
```
Prometheus shows:
```text
CPU high
```
Loki logs show:
```text
database timeout
```
Root cause found.
THIS is real SRE workflow.
---
# Why Loki Became Popular
Compared to ELK stack:
| ELK | Loki |
|---|---|
| Heavy | Lightweight |
| Expensive indexing | Label-based |
| High RAM | Lower RAM |
| Complex | Simpler |
| Expensive storage | Cheaper |
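Loki indexes only the labels, not the full text of each log line.
This is VERY important: it is the main reason Loki needs less RAM and storage than ELK.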
---
# MOST IMPORTANT LOKI CONCEPT: Labels
Labels organize logs.
Example:
```text
job="nginx"
env="prod"
instance="server1"
```
Labels make logs searchable.
Example query:
```logql
{job="nginx"}
```
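---
# VERY IMPORTANT SRE KNOWLEDGE
BAD labels destroy Loki performance.
NEVER use dynamic labels like:
- request_id
- session_id
- timestamp
Why?
Because every unique combination of label values becomes its own stream, so values that change on every request create a huge number of tiny streams and a massive index.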
This is a VERY common interview question.
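To make the cost of a dynamic label concrete, the sketch below counts how many distinct label sets (streams) a few made-up log lines would produce, first with static labels and then with a per-request `request_id` label. `count_streams` is a hypothetical helper for illustration and does not talk to Loki. With these samples it should report 1 stream for the static labels and 3 for the set with `request_id`. Values like a request ID belong in the log line itself, where a line filter such as `|= "a1"` can still find them.

```bash
# Minimal sketch: count how many streams a set of log lines would create.
# Each input line is the label set attached to one log line (made-up samples).
count_streams() {
  # One stream per distinct label set.
  sort -u | wc -l | tr -d ' '
}

# Static labels: every line lands in the same stream.
good_labels='job="nginx" env="prod"
job="nginx" env="prod"
job="nginx" env="prod"'

# A per-request label: every line creates a new stream.
bad_labels='job="nginx" env="prod" request_id="a1"
job="nginx" env="prod" request_id="b2"
job="nginx" env="prod" request_id="c3"'

echo "static labels:   $(printf '%s\n' "$good_labels" | count_streams) stream(s)"
echo "with request_id: $(printf '%s\n' "$bad_labels" | count_streams) stream(s)"
```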
---
# Understanding Your Current Environment
Your Alloy config:
```text
Reads nginx access.log
Reads nginx error.log
Reads syslog
Reads auth.log
```
Then forwards to Loki.
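The actual config file is not shown in this guide, so the following is only a sketch of what the nginx access log part might look like. It prints an Alloy configuration built from three components: `local.file_match`, `loki.source.file`, and `loki.write`. The file path, the `job` label value, and the Loki URL are assumptions, and `print_alloy_config` is a hypothetical helper name.

```bash
# Minimal sketch: print an Alloy configuration that tails the nginx access
# log and forwards it to Loki. Paths, labels, and the Loki URL are assumptions.
print_alloy_config() {
  cat <<'EOF'
// Find the file to tail and attach a static "job" label.
local.file_match "nginx_access" {
  path_targets = [{"__path__" = "/var/log/nginx/access.log", "job" = "nginx_access"}]
}

// Tail the matched file and hand each line to the writer below.
loki.source.file "nginx_access" {
  targets    = local.file_match.nginx_access.targets
  forward_to = [loki.write.default.receiver]
}

// Push log lines to Loki (assumed to listen on localhost:3100).
loki.write "default" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}
EOF
}

print_alloy_config
```

The error log, syslog, and auth.log would follow the same pattern with their own paths and `job` values.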
Your Grafana query:
```logql
{job="nginx_access"}
```
returns nginx traffic logs.
That means:
- pipeline works
- observability works
---
# Understanding Nginx Logs
Example:
```text
"GET / HTTP/1.1" 200
```
| Part | Meaning |
| ----------- | -------------- |
| GET | HTTP method |
| / | requested page |
| HTTP/1.1 | protocol |
| 200 | success |
| curl/8.18.0 | client (user agent, logged at the end of the full line) |
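The parts in the table can be pulled out of a full log line with `awk`. This is a sketch that assumes nginx's default "combined" log format. The sample line is made up, and `parse_nginx_line` is a hypothetical helper name. For the sample it should print the method, path, protocol, status, and client as `key=value` pairs.

```bash
# Minimal sketch: split one nginx access-log line into the parts named above.
# Assumes the default "combined" log format; the sample line is made up.
parse_nginx_line() {
  # Splitting on double quotes puts the request in $2, the status and byte
  # count in $3, and the user agent in $6.
  awk -F'"' '{
    split($2, req, " ")   # request: method, path, protocol
    split($3, res, " ")   # response: status code, bytes sent
    printf "method=%s path=%s protocol=%s status=%s client=%s\n",
           req[1], req[2], req[3], res[1], $6
  }'
}

sample='127.0.0.1 - - [10/Oct/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.18.0"'
printf '%s\n' "$sample" | parse_nginx_line
```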
---
# HTTP Status Codes SRE Must Know
| Code | Meaning |
| ---- | --------------------- |
| 200 | success |
| 301 | redirect |
| 403 | forbidden |
| 404 | not found |
| 500 | internal server error |
| 502 | bad gateway |
| 503 | service unavailable |
---
# What SRE Engineers Watch In Logs
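## 1. 5xx Errors
```logql
{job="nginx_access"} |~ "5[0-9][0-9]"
```
This regex matches a `5` followed by two digits anywhere in the line, so byte counts can produce false positives.
Detect:
- backend crashes
- application failures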
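How loosely the pattern above matches is easy to check offline. The sketch below uses `grep -E` as a stand-in for the LogQL regex filter and compares the loose pattern with one anchored right after the quoted request. The three sample lines are made up and assume the "combined" format. With them, the loose pattern should hit 2 lines (one only because of a `5120` byte count) and the anchored pattern 1 line. In LogQL, the anchored form could be written as `{job="nginx_access"} |~ "\" 5[0-9][0-9] "`, under the same format assumption.

```bash
# Minimal sketch: loose vs anchored 5xx matching on made-up sample lines.
# grep -E stands in for the LogQL regex filter (|~).
samples='127.0.0.1 - - [10/Oct/2025:13:55:36 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/8.18.0"
127.0.0.1 - - [10/Oct/2025:13:55:37 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.18.0"
127.0.0.1 - - [10/Oct/2025:13:55:38 +0000] "GET /x HTTP/1.1" 404 153 "-" "curl/8.18.0"'

# Loose: a 5 followed by two digits, anywhere in the line.
loose=$(printf '%s\n' "$samples" | grep -cE '5[0-9][0-9]')

# Anchored: three digits starting with 5, right after the quoted request.
strict=$(printf '%s\n' "$samples" | grep -cE '" 5[0-9][0-9] ')

echo "loose pattern matched:    $loose line(s)"
echo "anchored pattern matched: $strict line(s)"
```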
---
## 2. 404 Errors
```logql
{job="nginx_access"} |= "404"
```
Detect:
* broken pages
* scanners
* attacks
---
## 3. Authentication Failures
```logql
{job="auth"} |= "Failed password"
```
Detect:
- brute force attacks
- credential failures
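---
## 4. Kubernetes Pod Crashes
```logql
{kubernetes_namespace="prod"} |= "CrashLoopBackOff"
```
`CrashLoopBackOff` is a pod status reported by Kubernetes, so this query only returns results if Kubernetes events or kubelet logs are shipped to Loki with this label.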
---
## 5. Database Failures
```logql
{job="backend"} |= "connection timeout"
```
---
# What Alloy Can Collect
Alloy can collect:
| Source | Example |
|---|---|
| Linux logs | syslog |
| Auth logs | auth.log |
| Nginx logs | access.log |
| Docker logs | containers |
| Kubernetes logs | pods |
| Journald | systemd |
| CloudWatch | AWS |
| APIs | telemetry |
| OpenTelemetry | traces |
---
# Why Alloy Is Important
Before Alloy:
- Promtail for logs
- Grafana Agent for metrics
- OpenTelemetry Collector for traces
Now:
- Alloy combines all three in a single agent
This is a very important modern observability concept.
---
# Difference Between Loki And Prometheus
| Loki | Prometheus |
|---|---|
| logs | metrics |
| text | numeric |
| debugging | monitoring |
| WHY | WHAT |
---
# Understanding Your Real Pipeline
You successfully built:
```text
curl request
↓
Nginx access.log
↓
Alloy reads file
↓
Alloy forwards logs
↓
Loki stores logs
↓
Grafana queries logs
```
This is REAL observability engineering.
---
# Real Production SRE Workflow
## Incident Starts
Users complain:
* website slow
* login failing
* API timeout
---
## SRE Workflow
### Step 1: Metrics
Check:
* CPU
* memory
* disk
* latency
---
### Step 2: Logs
Search:
```logql
{job="nginx_access"} |= "500"
```
---
### Step 3: Correlate
Search backend:
```logql
{job="backend"} |= "database timeout"
```
---
### Step 4: Root Cause
The database is overloaded.
---
# MOST IMPORTANT SRE SKILL
Correlation.
Example:
```text
High CPU
+
Timeout logs
+
Restart events
=
Root cause
```
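Correlation usually comes down to lining signals up in time. The sketch below is a minimal example with two made-up logs (nginx and backend), each line starting with an ISO timestamp. It prints the minutes in which both an HTTP 500 and a database timeout were logged. `minutes_with` and `correlate` are hypothetical helpers, and in practice the lines would come from Loki queries rather than shell variables. For these samples the only shared minute is `2025-10-10T13:55`.

```bash
# Minimal sketch: find the minutes in which two signals occur together.
# Both logs are made-up samples with an ISO timestamp at the start of a line.
nginx_log='2025-10-10T13:55:36Z "GET /api HTTP/1.1" 500
2025-10-10T13:55:41Z "GET /api HTTP/1.1" 500
2025-10-10T13:58:02Z "GET / HTTP/1.1" 200'

backend_log='2025-10-10T13:52:10Z cache refreshed
2025-10-10T13:55:35Z database timeout
2025-10-10T13:55:40Z database timeout'

# Print the unique minutes (first 16 characters) of lines containing $1.
minutes_with() {
  grep -F -- "$1" | cut -c1-16 | sort -u
}

# A minute present in both lists appears twice after merging them.
correlate() {
  {
    printf '%s\n' "$nginx_log"   | minutes_with '" 500'
    printf '%s\n' "$backend_log" | minutes_with 'database timeout'
  } | sort | uniq -d
}

correlate
```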
---
# FULL HANDS-ON LABS
---
# LAB 1: Generate Traffic
Run:
```bash
# Send 100 requests to the local nginx so access.log fills up
for i in {1..100}
do
  curl localhost
done
```
Observe:
* access logs
* Loki queries
* Grafana live logs
---
# LAB 2: Find Traffic
Query:
```logql
{job="nginx_access"}
```
Understand:
- requests
- clients
- timestamps
---
# LAB 3: Generate 404 Errors
Run:
```bash
# Request pages that do not exist to produce 404 entries in access.log
curl localhost/fakepage
curl localhost/admin
curl localhost/test
```
Now query:
```logql
{job="nginx_access"} |= "404"
```
Understand:
- broken URLs
- scanners
- attack attempts
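Before switching to Grafana, it can help to see which status codes the requests actually returned. The sketch below requests a list of paths and prints one line per status code with its count. It assumes nginx answers on `http://localhost`. `status_summary`, `BASE_URL`, and `CURL_BIN` are hypothetical names for this example. A call such as `status_summary / /fakepage /admin /test` should list each code that came back. Which codes you see depends on your nginx configuration.

```bash
# Minimal sketch: request a list of paths and summarize the status codes.
# BASE_URL and CURL_BIN are illustrative overrides, not part of the lab.
status_summary() {
  for p in "$@"; do
    # -s: silent, -o /dev/null: drop the body, -w: print only the status code
    "${CURL_BIN:-curl}" -s -o /dev/null -w '%{http_code}\n' \
      "${BASE_URL:-http://localhost}${p}"
  done | sort | uniq -c | awk '{ print $2 " x" $1 }'
}
```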
---
# LAB 4: Simulate Attack Traffic
Run:
```bash
# Hit the login URL 500 times to imitate a brute-force burst
for i in {1..500}
do
  curl localhost/login
done
```
Query:
```logql
{job="nginx_access"} |= "/login"
```
Understand:
- brute force detection
- traffic spikes
---
# LAB 5: Live Log Streaming
In Grafana:
- click LIVE
Run:
```bash
curl localhost
```
Watch logs appear live.
Production use:
* deployment monitoring
* live incident debugging
---
# LAB 6: Break Nginx
Stop nginx:
```bash
sudo systemctl stop nginx
```
Now:
- website unavailable
- curl fails
- logs stop
Observe:
- metrics
- logs
- alerts
Restart:
```bash
sudo systemctl start nginx
```
---
# LAB 7: Watch Authentication Logs
Attempt a few SSH logins with a wrong password.
Then query:
```logql
{job="auth"}
```
Search:
```logql
{job="auth"} |= "Failed password"
```
Understand:
* security monitoring
* intrusion attempts
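To turn raw failures into something you can act on, count them per source address. This is a sketch that assumes OpenSSH-style messages, where the address follows the word `from`. The sample lines are made up, the IP addresses come from documentation ranges, and `failed_by_ip` is a hypothetical helper. For the sample it should list `203.0.113.5` with 2 failures ahead of `192.0.2.9` with 1.

```bash
# Minimal sketch: count failed SSH password attempts per source address.
# The sample lines imitate OpenSSH messages in auth.log and are made up.
failed_by_ip() {
  awk '/Failed password/ {
    for (i = 1; i < NF; i++)
      if ($i == "from") count[$(i + 1)]++   # the address follows "from"
  }
  END { for (ip in count) print count[ip], ip }' | sort -rn
}

auth_sample='Oct 10 13:55:36 server1 sshd[1201]: Failed password for root from 203.0.113.5 port 51234 ssh2
Oct 10 13:55:38 server1 sshd[1201]: Failed password for invalid user admin from 203.0.113.5 port 51240 ssh2
Oct 10 13:56:02 server1 sshd[1207]: Accepted password for deploy from 198.51.100.7 port 40022 ssh2
Oct 10 13:56:10 server1 sshd[1210]: Failed password for root from 192.0.2.9 port 60100 ssh2'

printf '%s\n' "$auth_sample" | failed_by_ip
```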
---
# LAB 8: Monitor System Logs
Query:
```logql
{job="syslog"}
```
Observe:
- services
- system events
- daemon activity
---
# LAB 9: Create Dashboard
Create panels for:
- request count
- 404 count
- login failures
- live logs
This is real observability dashboarding.
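As a starting point, here are LogQL queries that could back each of those panels, wrapped in a small lookup function so they are easy to copy. This is a sketch. The `job` labels follow this guide's setup, the 5 minute window is an assumption, "login failures" is interpreted here as failed SSH passwords, and `panel_query` is a hypothetical helper name.

```bash
# Minimal sketch: LogQL queries that could back the dashboard panels.
# Label names follow this guide's setup; the 5m window is an assumption.
panel_query() {
  case "$1" in
    requests)       echo 'sum(count_over_time({job="nginx_access"}[5m]))' ;;
    not_found)      echo 'sum(count_over_time({job="nginx_access"} |= " 404 " [5m]))' ;;
    login_failures) echo 'sum(count_over_time({job="auth"} |= "Failed password" [5m]))' ;;
    live_logs)      echo '{job="nginx_access"}' ;;
    *)              echo "unknown panel: $1" >&2; return 1 ;;
  esac
}

# Print every panel name next to its query.
for panel in requests not_found login_failures live_logs; do
  printf '%-15s %s\n' "$panel" "$(panel_query "$panel")"
done
```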
---
# LAB 10: Correlate Metrics + Logs
Run stress:
```bash
# Load 2 CPU workers for 300 seconds (requires the stress tool to be installed)
stress --cpu 2 --timeout 300
```
Check:
* Prometheus CPU metrics
* Loki logs
Understand:
* metric/log correlation
---
# Common Production Problems
| Problem | Cause |
| ------------------ | ------------------- |
| No logs | Alloy stopped |
| Missing labels | bad config |
| Empty Grafana | wrong query |
| High Loki storage | too many logs |
| Slow queries | bad labels |
| Missing nginx logs | wrong file path |
| Duplicate logs | multiple collectors |
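For the "wrong file path" and "no logs" rows, a quick first check is whether the file Alloy is supposed to tail exists, is readable, and has content. This is a sketch. `check_log_file` is a hypothetical helper, and the two paths are the conventional nginx defaults, which may differ on your system. Run it as the user Alloy runs as, since permissions are a common cause of silent gaps.

```bash
# Minimal sketch: first checks when a log source is missing in Loki.
# check_log_file is a hypothetical helper; the paths below are assumptions.
check_log_file() {
  if [ ! -e "$1" ]; then
    echo "missing: $1"
  elif [ ! -r "$1" ]; then
    echo "unreadable: $1"
  elif [ ! -s "$1" ]; then
    echo "empty: $1"
  else
    echo "ok: $1"
  fi
}

for f in /var/log/nginx/access.log /var/log/nginx/error.log; do
  check_log_file "$f"
done
```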
---
# What You Must Know For Interviews
## Loki
* labels
* LogQL
* centralized logging
* storage
* retention
* troubleshooting
## Alloy
* collectors
* pipelines
* log shipping
* OpenTelemetry
* forwarding
## SRE Concepts
* observability
* root cause analysis
* metrics vs logs
* incident response
* correlation
---
# VERY IMPORTANT INTERVIEW QUESTIONS
## Why Loki instead of ELK?
Answer:
* cheaper
* lightweight
* label indexing
* easier scaling
---
## What does Alloy do?
Answer:
```text
Collects and forwards telemetry:
- logs
- metrics
- traces
```
---
## Difference between Alloy and Loki?
| Alloy | Loki |
|---|---|
| collector | storage |
| ships logs | stores logs |
| reads files | indexes labels |
---
# FINAL UNDERSTANDING
You successfully built a REAL observability system used by:
- DevOps engineers
- SRE engineers
- platform engineers
- cloud engineers
This is NOT beginner work anymore.
You are now doing:
- centralized logging
- observability engineering
- incident investigation
- production troubleshooting
- log analytics
- root cause analysis