Aisalkyn Aidarova

Posted on May 22

Full Lecture — Grafana Loki + Grafana Alloy for DevOps & SRE Engineers

# What You Already Built

You already built a REAL observability platform:

```text
Nginx
↓
Access/Error Logs
↓
Grafana Alloy
↓
Grafana Loki
↓
Grafana
↓
SRE Engineer
```

This is extremely close to what companies use in production.

You now have:

* centralized logs
* log querying
* observability
* troubleshooting platform
* real SRE workflow

---

# What Is Observability?

Observability means:

```text
Understanding what is happening inside systems
by using:
- metrics
- logs
- traces
```

Three pillars:

| Pillar  | Tool                |
| ------- | ------------------- |
| Metrics | Prometheus          |
| Logs    | Loki                |
| Traces  | Tempo/OpenTelemetry |

Without observability:

* engineers guess
* outages take hours
* root cause stays unknown

With observability:

* engineers detect incidents fast
* correlate failures
* reduce downtime

---

# What Is Loki?

Grafana Loki is a centralized log aggregation and storage system.

It stores:

* application logs
* nginx logs
* Kubernetes logs
* Docker logs
* Linux logs
* authentication logs
* API logs

Instead of SSHing into 100 servers and running:

```bash
cat /var/log/nginx/access.log
```

you centralize everything in Loki.

---

# Why Companies Use Loki

Before centralized logging:

```text
Server1 logs
Server2 logs
Server3 logs
Kubernetes pod logs
Docker logs
```

Logs are scattered across machines, which makes quick troubleshooting impossible.

With Loki:

```text
All logs centralized
```

Engineers search:

```logql
{job="nginx"} |= "500"
```

Meaning: show every nginx log line that contains `500`, across all infrastructure.

---

# What Is Alloy?

Grafana Alloy is the collector.

VERY IMPORTANT:

```text
Alloy DOES NOT STORE LOGS
```

It:

* reads
* collects
* forwards

Think of Alloy like:

```text
Log shipping agent
```

Alloy:

* reads files
* watches logs
* sends logs to Loki

---

# Real Production Pipeline

```text
Application
↓
Log file created
↓
Alloy watches file
↓
Alloy ships logs
↓
Loki stores logs
↓
Grafana visualizes logs
```

---

# Why SRE Engineers Need Loki + Alloy

As an SRE engineer, your responsibilities include:

| Responsibility          | Why Logs Matter       |
| ----------------------- | --------------------- |
| Incident response       | identify failures     |
| Root cause analysis     | determine WHY         |
| Security investigations | detect attacks        |
| Performance debugging   | trace slow systems    |
| Compliance              | audit logs            |
| Kubernetes debugging    | inspect pod failures  |
| API debugging           | analyze requests      |
| Authentication issues   | detect login failures |

---

# Difference Between Metrics And Logs

## Metrics

Metrics answer:

```text
WHAT is wrong?
```

Example:

```text
CPU = 95%
```

## Logs

Logs answer:

```text
WHY is it wrong?
```

Example:

```text
database connection timeout
```

---

# Example Real Incident

Users say:

```text
Website slow
```

Prometheus shows:

```text
CPU high
```

Loki logs show:

```text
database timeout
```

Root cause found.

THIS is a real SRE workflow.

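---

# Why Loki Became Popular

Compared to ELK stack:

| ELK                          | Loki                 |
| ---------------------------- | -------------------- |
| Heavy                        | Lightweight          |
| Expensive full-text indexing | Label-based indexing |
| High RAM                     | Lower RAM            |
| Complex                      | Simpler              |
| Expensive storage            | Cheaper              |

Loki indexes only the labels, not the full text of each log line.

VERY important.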

---

# MOST IMPORTANT LOKI CONCEPT — Labels

Labels organize logs.

Example:

```text
job="nginx"
env="prod"
instance="server1"
```

Labels make logs searchable.

Example query:

```logql
{job="nginx"}
```

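---

# VERY IMPORTANT SRE KNOWLEDGE

BAD labels destroy Loki performance.

NEVER use dynamic, high-cardinality values as labels, such as:

* request_id
* session_id
* timestamp

Why?

Because every unique combination of label values becomes its own stream, so Loki ends up with a massive index and a huge number of tiny streams.

This is a VERY common interview question.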
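
To make the label problem concrete, here is a rough back-of-the-envelope sketch in shell. It assumes the worst case, where every combination of label values actually occurs. The function name `estimate_streams` and all the numbers are made up for illustration.

```bash
#!/usr/bin/env bash
# Rough sketch: the worst-case number of streams is the product of the
# number of distinct values of every label. All numbers are hypothetical.
estimate_streams() {
  local total=1
  local n
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

# job (2 values) x env (3 values) x instance (10 values)
estimate_streams 2 3 10          # prints 60

# Same labels plus a request_id label with 100000 distinct values
estimate_streams 2 3 10 100000   # prints 6000000
```

Sixty streams are easy to index. Six million are not, which is why values like `request_id` belong in the log line, where a filter can still find them, and not in a label.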

---

# Understanding Your Current Environment

Your Alloy config:

```text
Reads nginx access.log
Reads nginx error.log
Reads syslog
Reads auth.log
```

Then forwards to Loki.
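
A config with that shape might look roughly like the output of the shell sketch below. This is not your actual config, only a minimal illustration. The components `local.file_match`, `loki.source.file`, and `loki.write` are standard Alloy components, but the file paths, the `nginx_error` job name, the block names `logs` and `local`, and the Loki URL are assumptions.

```bash
#!/usr/bin/env bash
# Sketch only: prints a minimal Alloy config to stdout.
# Paths, job names, and the Loki URL are assumptions, not the lab's real values.
print_alloy_config() {
  cat <<'EOF'
// Find the log files and attach a "job" label to each one.
local.file_match "logs" {
  path_targets = [
    {"__path__" = "/var/log/nginx/access.log", "job" = "nginx_access"},
    {"__path__" = "/var/log/nginx/error.log",  "job" = "nginx_error"},
    {"__path__" = "/var/log/syslog",           "job" = "syslog"},
    {"__path__" = "/var/log/auth.log",         "job" = "auth"},
  ]
}

// Tail the matched files and hand each line to the writer below.
loki.source.file "logs" {
  targets    = local.file_match.logs.targets
  forward_to = [loki.write.local.receiver]
}

// Push the collected lines to Loki over HTTP.
loki.write "local" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}
EOF
}

print_alloy_config
```

Note that the only label set here is `job`, which has a handful of fixed values. Redirect the output to a scratch file and compare it with your real config before changing anything.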

Your Grafana query:

```logql
{job="nginx_access"}
```

returns nginx traffic logs.

That means:

* the pipeline works
* observability works

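---

# Understanding Nginx Logs

Example (part of one access log line):

```text
"GET / HTTP/1.1" 200
```

| Part        | Meaning                                            |
| ----------- | -------------------------------------------------- |
| GET         | HTTP method                                        |
| /           | requested page                                     |
| HTTP/1.1    | protocol                                           |
| 200         | status code (success)                              |
| curl/8.18.0 | client (user agent, logged later on the same line) |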
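
The same breakdown can be done mechanically. The sketch below assumes nginx writes its default "combined" log format. The function name `parse_nginx_line` and the sample line (IP address, timestamp, response size) are made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: pull the main fields out of one nginx "combined" format log line.
# Splitting on the double quote gives these awk fields:
#   $2 = request string, $3 = status and bytes sent, $6 = user agent
parse_nginx_line() {
  awk -F'"' '{
    split($2, req, " ")      # req[1]=method, req[2]=path, req[3]=protocol
    split($3, rest, " ")     # rest[1]=status, rest[2]=bytes sent
    printf "method=%s path=%s protocol=%s status=%s client=%s\n",
           req[1], req[2], req[3], rest[1], $6
  }'
}

# Made-up sample line; the IP, timestamp, and size are placeholders.
sample='127.0.0.1 - - [22/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.18.0"'
printf '%s\n' "$sample" | parse_nginx_line
# prints: method=GET path=/ protocol=HTTP/1.1 status=200 client=curl/8.18.0
```

If your nginx uses a custom `log_format`, the field positions will differ and the function needs adjusting.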

---

# HTTP Status Codes SRE Must Know

| Code | Meaning               |
| ---- | --------------------- |
| 200  | success               |
| 301  | redirect              |
| 403  | forbidden             |
| 404  | not found             |
| 500  | internal server error |
| 502  | bad gateway           |
| 503  | service unavailable   |

---

# What SRE Engineers Watch In Logs

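## 1. 5xx Errors

```logql
{job="nginx_access"} |~ `" 5[0-9][0-9] `
```

Assuming the default nginx log format, the leading quote and the spaces anchor the pattern to the status field that follows the request string, so a response size such as `512` does not count as an error.

Detect:

* backend crashes
* application failures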

---

## 2. 404 Errors

```logql
{job="nginx_access"} |= "404"
```

Detect:

* broken pages
* scanners
* attacks
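
A plain substring or regex filter can also match the same digits elsewhere in the line, for example in the response size. When you want exact numbers from a local file, one option is to read the status field itself. The sketch below assumes the default nginx "combined" format. The function name `count_status_classes` and the sample lines are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count requests per status class (2xx, 4xx, 5xx, ...) by reading the
# status FIELD instead of searching the whole line for digits.
count_status_classes() {
  awk -F'"' '{
    split($3, rest, " ")                 # rest[1] is the status code
    class = substr(rest[1], 1, 1) "xx"   # 404 -> 4xx, 502 -> 5xx
    count[class]++
  }
  END {
    for (c in count) printf "%s %d\n", c, count[c]
  }' | sort
}

# Made-up sample lines. The first one is a 200 with a body size of 512,
# which a naive 5[0-9][0-9] search would wrongly flag.
count_status_classes <<'EOF'
10.0.0.1 - - [22/May/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.18.0"
10.0.0.1 - - [22/May/2025:10:00:01 +0000] "GET /fakepage HTTP/1.1" 404 153 "-" "curl/8.18.0"
10.0.0.1 - - [22/May/2025:10:00:02 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.18.0"
EOF
# prints:
# 2xx 1
# 4xx 1
# 5xx 1
```

To run it against a real file, feed the log on stdin, for example `count_status_classes < /var/log/nginx/access.log` (the path is an assumption).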

---

## 3. Authentication Failures



```logql
{job="auth"} |= "Failed password"
```

Detect:

* brute force attacks
* credential failures
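
A brute force attack usually shows up as many failures from the same address. The sketch below counts "Failed password" lines per source IP. It assumes the usual OpenSSH wording (`... from <ip> port <n> ssh2`). The function name `failed_logins_by_ip` and the sample lines are hypothetical, and the IPs come from the reserved documentation ranges.

```bash
#!/usr/bin/env bash
# Sketch: count "Failed password" lines per source IP from sshd log lines.
# The IP is taken as the word right after the first "from" on the line.
failed_logins_by_ip() {
  awk '/Failed password/ {
    for (i = 1; i < NF; i++) {
      if ($i == "from") { hits[$(i + 1)]++; break }
    }
  }
  END {
    for (ip in hits) printf "%d %s\n", hits[ip], ip
  }' | sort -rn
}

# Made-up sample lines for illustration.
failed_logins_by_ip <<'EOF'
May 22 10:00:01 server1 sshd[1001]: Failed password for root from 203.0.113.5 port 50122 ssh2
May 22 10:00:03 server1 sshd[1001]: Failed password for invalid user admin from 203.0.113.5 port 50124 ssh2
May 22 10:00:09 server1 sshd[1002]: Failed password for root from 198.51.100.7 port 40022 ssh2
May 22 10:00:12 server1 sshd[1003]: Accepted password for ubuntu from 192.0.2.10 port 40100 ssh2
EOF
# prints:
# 2 203.0.113.5
# 1 198.51.100.7
```

The noisiest address ends up at the top of the list, which is the first thing to check during a security investigation.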

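---

## 4. Kubernetes Pod Crashes

```logql
{kubernetes_namespace="prod"} |= "CrashLoopBackOff"
```

The label name depends on how your collector attaches Kubernetes metadata; `kubernetes_namespace` is one possible choice. `CrashLoopBackOff` is a pod status reported in Kubernetes events and kubelet logs, so this filter only finds it if those are being collected.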

---

## 5. Database Failures



```logql
{job="backend"} |= "connection timeout"
```

---

# What Alloy Can Collect

Alloy can collect:

| Source          | Example    |
| --------------- | ---------- |
| Linux logs      | syslog     |
| Auth logs       | auth.log   |
| Nginx logs      | access.log |
| Docker logs     | containers |
| Kubernetes logs | pods       |
| Journald        | systemd    |
| CloudWatch      | AWS        |
| APIs            | telemetry  |
| OpenTelemetry   | traces     |

---

# Why Alloy Is Important

Before Alloy:

* Promtail for logs
* Grafana Agent for metrics
* OpenTelemetry Collector for traces

Now:

* Alloy combines all three in one collector

Very important modern observability concept.

---

# Difference Between Loki And Prometheus

| Loki      | Prometheus |
| --------- | ---------- |
| logs      | metrics    |
| text      | numeric    |
| debugging | monitoring |
| WHY       | WHAT       |

---

# Understanding Your Real Pipeline

You successfully built:

```text
curl request
↓
Nginx access.log
↓
Alloy reads file
↓
Alloy forwards logs
↓
Loki stores logs
↓
Grafana queries logs
```

This is REAL observability engineering.

---

# Real Production SRE Workflow

## Incident Starts

Users complain:

* website slow
* login failing
* API timeout

---

## SRE Workflow

### Step 1 — Metrics

Check:

* CPU
* memory
* disk
* latency

---

### Step 2 — Logs

Search:

```logql
{job="nginx_access"} |= "500"
```

---

### Step 3 — Correlate

Search backend:

```logql
{job="backend"} |= "database timeout"
```
---

### Step 4 — Root Cause

Database overloaded.

---

# MOST IMPORTANT SRE SKILL

Correlation.

Example:

```text
High CPU
+
Timeout logs
+
Restart events
=
Root cause
```

---

# FULL HANDS-ON LABS

# LAB 1 — Generate Traffic

Run:

```bash
# Send 100 requests to the local nginx so the access log fills up
for i in {1..100}
do
  curl localhost
done
```

Observe:

* access logs
* Loki queries
* Grafana live logs

---

# LAB 2 — Find Traffic

Query:

```logql
{job="nginx_access"}
```

Understand:

* requests
* clients
* timestamps

---

# LAB 3 — Generate 404 Errors

Run:

```bash
# Request pages that do not exist to produce 404 responses
curl localhost/fakepage
curl localhost/admin
curl localhost/test
```

Now query:

```logql
{job="nginx_access"} |= "404"
```

Understand:

* broken URLs
* scanners
* attack attempts

---

# LAB 4 — Simulate Attack Traffic

Run:

```bash
# Hit the login URL 500 times to imitate a brute force burst
for i in {1..500}
do
  curl localhost/login
done
```

Query:

```logql
{job="nginx_access"} |= "/login"
```

Understand:

* brute force detection
* traffic spikes

---

# LAB 5 — Live Log Streaming

In Grafana:

* click LIVE

Run:

```bash
curl localhost
```

Watch logs appear live.

Production use:

* deployment monitoring
* live incident debugging

---

# LAB 6 — Break Nginx

Stop nginx:

```bash
sudo systemctl stop nginx
```

Now:

* the website is unavailable
* curl fails
* new access log entries stop

Observe:

* metrics
* logs
* alerts

Restart:

```bash
sudo systemctl start nginx
```

---

# LAB 7 — Watch Authentication Logs

Attempt SSH login failures.

Then query:

```logql
{job="auth"}
```

Search:

```logql
{job="auth"} |= "Failed password"
```

Understand:

* security monitoring
* intrusion attempts

---

# LAB 8 — Monitor System Logs

Query:

```logql
{job="syslog"}
```

Observe:

* services
* system events
* daemon activity

---

# LAB 9 — Create Dashboard

Create panels for:

* request count
* 404 count
* login failures
* live logs

This is real observability dashboarding.

---

# LAB 10 — Correlate Metrics + Logs

Run stress:

```bash
# Load 2 CPU cores for 300 seconds (requires the stress package)
stress --cpu 2 --timeout 300
```

Check:

* Prometheus CPU metrics
* Loki logs

Understand:

* metric/log correlation

---

# Common Production Problems

| Problem            | Cause               |
| ------------------ | ------------------- |
| No logs            | Alloy stopped       |
| Missing labels     | bad config          |
| Empty Grafana      | wrong query         |
| High Loki storage  | too many logs       |
| Slow queries       | bad labels          |
| Missing nginx logs | wrong file path     |
| Duplicate logs     | multiple collectors |

---

# What You Must Know For Interviews

## Loki

* labels
* LogQL
* centralized logging
* storage
* retention
* troubleshooting

## Alloy

* collectors
* pipelines
* log shipping
* OpenTelemetry
* forwarding

## SRE Concepts

* observability
* root cause analysis
* metrics vs logs
* incident response
* correlation

---

# VERY IMPORTANT INTERVIEW QUESTIONS

## Why Loki instead of ELK?

Answer:

* cheaper
* lightweight
* label indexing
* easier scaling

---

## What does Alloy do?

Answer:

```text
Collects and forwards telemetry:
- logs
- metrics
- traces
```

---

## Difference between Alloy and Loki?

| Alloy       | Loki           |
| ----------- | -------------- |
| collector   | storage        |
| ships logs  | stores logs    |
| reads files | indexes labels |

---

# FINAL UNDERSTANDING

You successfully built a REAL observability system of the kind used by:

* DevOps engineers
* SRE engineers
* platform engineers
* cloud engineers

This is NOT beginner work anymore.

You are now doing:

* centralized logging
* observability engineering
* incident investigation
* production troubleshooting
* log analytics
* root cause analysis
