DEV Community

Aisalkyn Aidarova

Full Lecture — Grafana Loki + Grafana Alloy for DevOps & SRE Engineers

# What You Already Built

You already built a REAL observability platform:

```text
Nginx
  ↓
Access/Error Logs
  ↓
Grafana Alloy
  ↓
Grafana Loki
  ↓
Grafana
  ↓
SRE Engineer
```

This is extremely close to what companies use in production.

You now have:

* centralized logs
* log querying
* observability
* troubleshooting platform
* real SRE workflow

---

# What Is Observability?

Observability means:



```text
Understanding what is happening inside systems
by using:
- metrics
- logs
- traces
```

Three pillars:

| Pillar  | Tool                |
| ------- | ------------------- |
| Metrics | Prometheus          |
| Logs    | Loki                |
| Traces  | Tempo/OpenTelemetry |

Without observability:

* engineers guess
* outages take hours
* root cause unknown

With observability:

* engineers detect incidents fast
* correlate failures
* reduce downtime

---

# What Is Loki?

Grafana Loki is a centralized log storage system.

It stores:

* application logs
* nginx logs
* Kubernetes logs
* Docker logs
* Linux logs
* authentication logs
* API logs

Instead of SSHing into 100 servers to run:

```bash
cat /var/log/nginx/access.log
```

You centralize everything into Loki and query it from one place.

---

# Why Companies Use Loki

Before centralized logging:



```text
Server1 logs
Server2 logs
Server3 logs
Kubernetes pod logs
Docker logs
```

Logs are scattered across machines, so troubleshooting quickly is impossible.

With Loki:

```text
All logs centralized
```

Engineers search:

```logql
{job="nginx"} |= "500"
```

Meaning:

* show nginx log lines that contain the text `500`
* across all infrastructure
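
One detail worth seeing in action: `|=` is a plain substring match, so `"500"` also matches lines where 500 appears in a byte count or a URL. The sketch below imitates that locally with `grep -F` and compares it with checking only the status field. The three log lines are made up for illustration, and the field position assumes nginx's default `combined` log format.

```bash
#!/usr/bin/env bash
# Hypothetical sample lines in nginx's default "combined" format.
sample_logs() {
  cat <<'EOF'
203.0.113.5 - - [10/May/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 1500 "-" "curl/8.18.0"
203.0.113.5 - - [10/May/2025:10:00:02 +0000] "GET /api HTTP/1.1" 500 42 "-" "curl/8.18.0"
203.0.113.9 - - [10/May/2025:10:00:03 +0000] "GET /item/5001 HTTP/1.1" 404 153 "-" "curl/8.18.0"
EOF
}

# Substring match, like LogQL's |= "500": counts every line containing "500".
substring_hits=$(sample_logs | grep -c -F '500')

# Field match: in the combined format the status code is the 9th
# whitespace-separated field.
status_hits=$(sample_logs | awk '$9 == 500 { n++ } END { print n + 0 }')

echo "substring matches: $substring_hits"
echo "status-500 lines:  $status_hits"
```

Only one of the three lines is a real 500 response, yet all three contain the text. In LogQL, a tighter line filter could include the surrounding quote and spaces, for example `|= "\" 500 "`, assuming the same log format.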

---

# What Is Alloy?

Grafana Alloy is the collector.

VERY IMPORTANT:

```text
Alloy DOES NOT STORE LOGS
```

It:

* reads
* collects
* forwards

Think of Alloy like:

```text
Log shipping agent
```

Alloy:

* reads files
* watches logs
* sends logs to Loki

---

# Real Production Pipeline

```text
Application
  ↓
Log file created
  ↓
Alloy watches file
  ↓
Alloy ships logs
  ↓
Loki stores logs
  ↓
Grafana visualizes logs
```

---

# Why SRE Engineers Need Loki + Alloy

As an SRE engineer, your job includes:

| Responsibility          | Why Logs Matter       |
| ----------------------- | --------------------- |
| Incident response       | identify failures     |
| Root cause analysis     | determine WHY         |
| Security investigations | detect attacks        |
| Performance debugging   | trace slow systems    |
| Compliance              | audit logs            |
| Kubernetes debugging    | inspect pod failures  |
| API debugging           | analyze requests      |
| Authentication issues   | detect login failures |

---

# Difference Between Metrics And Logs

## Metrics

Metrics answer:



```text
WHAT is wrong?
```

Example:

```text
CPU = 95%
```

## Logs

Logs answer:



```text
WHY is it wrong?
```

Example:

```text
database connection timeout
```

---

# Example Real Incident

Users say:



```text
Website slow
```

Prometheus shows:

```text
CPU high
```

Loki logs show:

```text
database timeout
```

Root cause found: the metric told you WHAT was wrong, the log line told you WHY.

THIS is real SRE workflow.


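---

# Why Loki Became Popular

Compared to the ELK stack:

| ELK                | Loki        |
| ------------------ | ----------- |
| Heavy              | Lightweight |
| Expensive indexing | Label-based |
| High RAM           | Lower RAM   |
| Complex            | Simpler     |
| Expensive storage  | Cheaper     |

Loki indexes labels only, not the full text of each log line.

VERY important.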


---

# MOST IMPORTANT LOKI CONCEPT: Labels

Labels organize logs.

Example:

```text
job="nginx"
env="prod"
instance="server1"
```

Labels make logs searchable.

Example query:

```logql
{job="nginx"}
```

## VERY IMPORTANT SRE KNOWLEDGE

BAD labels destroy Loki performance.

NEVER use dynamic labels like:

* request_id
* session_id
* timestamp

Why?

Because every unique combination of label values becomes its own stream. High-cardinality labels create a huge number of tiny streams and a massive index.

This is a VERY common interview question.
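
To make the cardinality problem concrete: Loki treats every unique combination of label values as a separate stream. The sketch below counts distinct label sets for a few made-up log entries, first with static labels only, then with a hypothetical `request_id` label added. The entries and label values are illustrative, not taken from a real Loki instance.

```bash
#!/usr/bin/env bash
# Hypothetical entries: job, env, instance, request_id (one per log line).
entries() {
  cat <<'EOF'
nginx prod server1 req-001
nginx prod server1 req-002
nginx prod server2 req-003
nginx prod server2 req-004
nginx prod server1 req-005
nginx prod server2 req-006
EOF
}

# Streams with static labels only: distinct (job, env, instance) tuples.
static_streams=$(entries | awk '{ print $1, $2, $3 }' | sort -u | wc -l | tr -d ' ')

# Streams if request_id were a label: distinct full tuples.
dynamic_streams=$(entries | sort -u | wc -l | tr -d ' ')

echo "streams with static labels: $static_streams"
echo "streams with request_id:    $dynamic_streams"
```

With static labels the stream count stays at the number of servers. With `request_id` as a label it grows with every single request.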


---

# Understanding Your Current Environment

Your Alloy config:

```text
Reads nginx access.log
Reads nginx error.log
Reads syslog
Reads auth.log
```

It then forwards everything to Loki.
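
For reference, a configuration along these lines might look like the sketch below, written to a scratch file with a shell heredoc. Treat it as an illustration, not your actual file: the component labels (`logs`, `default`), the `nginx_error` job name, the log paths other than `access.log` (typical Debian/Ubuntu locations), and the Loki URL `http://localhost:3100` are all assumptions to adapt.

```bash
#!/usr/bin/env bash
# Write an illustrative Alloy config to a scratch file (the path is arbitrary).
config_file="$(mktemp)"

cat > "$config_file" <<'EOF'
// Files to tail, each tagged with a static "job" label.
local.file_match "logs" {
  path_targets = [
    {"__path__" = "/var/log/nginx/access.log", "job" = "nginx_access"},
    {"__path__" = "/var/log/nginx/error.log",  "job" = "nginx_error"},
    {"__path__" = "/var/log/syslog",           "job" = "syslog"},
    {"__path__" = "/var/log/auth.log",         "job" = "auth"},
  ]
}

// Tail the matched files and hand each line to the writer below.
loki.source.file "logs" {
  targets    = local.file_match.logs.targets
  forward_to = [loki.write.default.receiver]
}

// Push log lines to Loki (the URL is an assumption for a local install).
loki.write "default" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}
EOF

echo "wrote $(grep -c '__path__' "$config_file") file targets to $config_file"
```

The three components mirror the pipeline described here: find the files, read them, forward to Loki. The `job` values are what you later select on in queries such as `{job="nginx_access"}`.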

Your Grafana query:



```logql
{job="nginx_access"}
```

returns nginx traffic logs.

That means:

* pipeline works
* observability works

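---

# Understanding Nginx Logs

Example (an excerpt of one access-log line; the full line also carries the client IP, the timestamp, the response size, and the client's user agent, such as `curl/8.18.0`):

```text
"GET / HTTP/1.1" 200
```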

| Part        | Meaning        |
| ----------- | -------------- |
| GET         | HTTP method    |
| /           | requested page |
| HTTP/1.1    | protocol       |
| 200         | success        |
| curl/8.18.0 | client         |
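
To make the field breakdown concrete, here is a small sketch that splits one access-log line into the parts from the table using `awk`. It assumes nginx's default `combined` log format. The sample line (IP, timestamp, byte count) is made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical access-log line in nginx's default "combined" format.
line='127.0.0.1 - - [10/May/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.18.0"'

# Splitting on double quotes gives: $2 = request, $3 = status and size,
# $4 = referrer, $6 = user agent.
parse_line() {
  awk -F'"' '{
    split($2, req, " ")   # method, path, protocol
    split($3, res, " ")   # status, bytes sent
    printf "method=%s path=%s protocol=%s status=%s client=%s\n",
           req[1], req[2], req[3], res[1], $6
  }'
}

parsed=$(printf '%s\n' "$line" | parse_line)
echo "$parsed"
```

Splitting on the quote character first keeps the request and the user agent intact even though both contain spaces.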

---

# HTTP Status Codes SRE Must Know

| Code | Meaning               |
| ---- | --------------------- |
| 200  | success               |
| 301  | redirect              |
| 403  | forbidden             |
| 404  | not found             |
| 500  | internal server error |
| 502  | bad gateway           |
| 503  | service unavailable   |

---

# What SRE Engineers Watch In Logs

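## 1. 500 Errors

```logql
{job="nginx_access"} |~ "5[0-9][0-9]"
```

This regex matches any three-digit number starting with 5 anywhere in the line (a byte count such as `512` included), so treat the results as candidates, not an exact list.

Detect:

* backend crashes
* application failures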
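
For an exact count, it can help to look only at the status field. The sketch below does that locally with `awk` and prints the share of 5xx responses. The log lines are made up, and the field position assumes nginx's default `combined` log format.

```bash
#!/usr/bin/env bash
# Hypothetical access-log lines (nginx default "combined" format).
sample_logs() {
  cat <<'EOF'
198.51.100.7 - - [10/May/2025:10:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.18.0"
198.51.100.7 - - [10/May/2025:10:00:02 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.18.0"
198.51.100.7 - - [10/May/2025:10:00:03 +0000] "GET /api HTTP/1.1" 500 42 "-" "curl/8.18.0"
198.51.100.7 - - [10/May/2025:10:00:04 +0000] "GET /img HTTP/1.1" 200 530 "-" "curl/8.18.0"
EOF
}

# Field 9 is the status code; count the lines where it is 500-599.
error_ratio() {
  awk '
    { total++ }
    $9 ~ /^5[0-9][0-9]$/ { errors++ }
    END { printf "%d/%d (%.0f%%)\n", errors, total, 100 * errors / total }
  '
}

ratio=$(sample_logs | error_ratio)
echo "5xx responses: $ratio"
```

The two successful requests have byte counts of 512 and 530, which a loose pattern would also pick up. The field check ignores them.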

---

## 2. 404 Errors

```logql
{job="nginx_access"} |= "404"
```

Detect:

* broken pages
* scanners
* attacks

---

## 3. Authentication Failures



```logql
{job="auth"} |= "Failed password"
```

Detect:

* brute force attacks
* credential failures
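
Once those lines are found, the next question is usually which source is responsible. The sketch below shows the idea locally by counting `Failed password` lines per source IP. The entries follow the usual OpenSSH wording, but the host, users, ports, and addresses are invented.

```bash
#!/usr/bin/env bash
# Hypothetical auth.log entries in the usual OpenSSH format.
auth_logs() {
  cat <<'EOF'
May 10 10:00:01 server1 sshd[1201]: Failed password for root from 203.0.113.5 port 51234 ssh2
May 10 10:00:03 server1 sshd[1202]: Failed password for invalid user admin from 203.0.113.5 port 51240 ssh2
May 10 10:00:05 server1 sshd[1203]: Accepted password for deploy from 198.51.100.7 port 40022 ssh2
May 10 10:00:09 server1 sshd[1204]: Failed password for root from 192.0.2.44 port 60111 ssh2
May 10 10:00:11 server1 sshd[1205]: Failed password for root from 203.0.113.5 port 51255 ssh2
EOF
}

# For each "Failed password" line, take the word after "from" as the source IP,
# then print "count ip" sorted by count, highest first.
failed_by_ip() {
  awk '
    /Failed password/ {
      for (i = 1; i < NF; i++) if ($i == "from") counts[$(i + 1)]++
    }
    END { for (ip in counts) print counts[ip], ip }
  ' | sort -rn
}

top=$(auth_logs | failed_by_ip | head -n 1)
echo "top offender: $top"
```

Many failures from one address in a short time is the classic brute-force pattern. A handful spread across many users is more often people mistyping passwords.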

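---

## 4. Kubernetes Pod Crashes

```logql
{kubernetes_namespace="prod"} |= "CrashLoopBackOff"
```

Note: the `kubernetes_namespace` label exists only if your collector attaches it, and `CrashLoopBackOff` is a pod status that typically shows up in Kubernetes events and kubelet logs, not in the application's own output. This query finds matches only if those sources are shipped to Loki too.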

---

## 5. Database Failures



```logql
{job="backend"} |= "connection timeout"
```

---

# What Alloy Can Collect

Alloy can collect:

| Source          | Example    |
| --------------- | ---------- |
| Linux logs      | syslog     |
| Auth logs       | auth.log   |
| Nginx logs      | access.log |
| Docker logs     | containers |
| Kubernetes logs | pods       |
| Journald        | systemd    |
| CloudWatch      | AWS        |
| APIs            | telemetry  |
| OpenTelemetry   | traces     |

---

# Why Alloy Is Important

Before Alloy:

* Promtail for logs
* Grafana Agent for metrics
* OpenTelemetry Collector for traces

Now:

* Alloy combines all three in one collector

Very important modern observability concept.


---

# Difference Between Loki And Prometheus

| Loki      | Prometheus |
| --------- | ---------- |
| logs      | metrics    |
| text      | numeric    |
| debugging | monitoring |
| WHY       | WHAT       |

---

# Understanding Your Real Pipeline

You successfully built:

```text
curl request
  ↓
Nginx access.log
  ↓
Alloy reads file
  ↓
Alloy forwards logs
  ↓
Loki stores logs
  ↓
Grafana queries logs
```

This is REAL observability engineering.

---

# Real Production SRE Workflow

## Incident Starts

Users complain:

* website slow
* login failing
* API timeout

---

## SRE Workflow

### Step 1: Metrics

Check:

* CPU
* memory
* disk
* latency

---

### Step 2: Logs

Search:

```logql
{job="nginx_access"} |= "500"
```

---

### Step 3: Correlate

Search backend:

```logql
{job="backend"} |= "database timeout"
```

---

### Step 4: Root Cause

Database overloaded.

---

# MOST IMPORTANT SRE SKILL

Correlation.

Example:



```text
High CPU
+
Timeout logs
+
Restart events
=
Root cause
```
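
In practice, correlating mostly means lining signals up in time. The sketch below takes a made-up spike time (as a Unix timestamp) and keeps only the events within 60 seconds of it. The timestamps, sources, messages, and the 60-second window are all illustrative assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical events: "<unix_timestamp> <source> <message>".
events() {
  cat <<'EOF'
1746871200 backend request served in 120ms
1746871500 backend database connection timeout
1746871530 kubelet container backend restarted
1746873000 backend request served in 95ms
EOF
}

# Time at which the CPU metric spiked (assumed, e.g. read off a Grafana panel).
spike_at=1746871510
window=60

# Keep events whose timestamp is within +/- window seconds of the spike.
near_spike() {
  awk -v t="$spike_at" -v w="$window" '$1 >= t - w && $1 <= t + w'
}

events | near_spike
correlated=$(events | near_spike | wc -l | tr -d ' ')
echo "events near spike: $correlated"
```

Grafana does the same thing visually when you set the metrics panel and the logs panel to the same time range. The point is that the timeout and the restart land next to the CPU spike, while the healthy requests do not.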

---

# FULL HANDS-ON LABS

# LAB 1: Generate Traffic

Run:

```bash
# Send 100 requests to the local nginx; discard the response bodies
for i in {1..100}
do
  curl -s -o /dev/null localhost
done
```

Observe:

* access logs
* Loki queries
* Grafana live logs

---

# LAB 2: Find Traffic

Query:

```logql
{job="nginx_access"}
```

Understand:

* requests
* clients
* timestamps

---

# LAB 3: Generate 404 Errors

Run:

```bash
# Request paths that should not exist, so nginx answers 404
curl localhost/fakepage
curl localhost/admin
curl localhost/test
```

Now query:

```logql
{job="nginx_access"} |= "404"
```

Understand:

* broken URLs
* scanners
* attack attempts

---

# LAB 4: Simulate Attack Traffic

Run:

```bash
# Hit /login 500 times to imitate a brute-force burst
for i in {1..500}
do
  curl -s -o /dev/null localhost/login
done
```

Query:

```logql
{job="nginx_access"} |= "/login"
```

Understand:

* brute force detection
* traffic spikes
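
A spike is easier to see as a count per minute than as a wall of log lines. The sketch below buckets `/login` requests by minute on a few made-up access-log lines. It assumes nginx's default `combined` log format, and the IPs, timestamps, and status codes are invented.

```bash
#!/usr/bin/env bash
# Hypothetical access-log lines (nginx default "combined" format).
sample_logs() {
  cat <<'EOF'
203.0.113.5 - - [10/May/2025:10:00:01 +0000] "GET /login HTTP/1.1" 404 153 "-" "curl/8.18.0"
203.0.113.5 - - [10/May/2025:10:00:02 +0000] "GET /login HTTP/1.1" 404 153 "-" "curl/8.18.0"
198.51.100.7 - - [10/May/2025:10:00:03 +0000] "GET / HTTP/1.1" 200 615 "-" "curl/8.18.0"
203.0.113.5 - - [10/May/2025:10:00:40 +0000] "GET /login HTTP/1.1" 404 153 "-" "curl/8.18.0"
203.0.113.5 - - [10/May/2025:10:01:05 +0000] "GET /login HTTP/1.1" 404 153 "-" "curl/8.18.0"
EOF
}

# Count /login requests per minute. Field 7 is the request path; field 4 is
# "[dd/Mon/yyyy:HH:MM:SS", so characters 2-18 identify the minute.
login_per_minute() {
  awk '
    $7 == "/login" { hits[substr($4, 2, 17)]++ }
    END { for (m in hits) print m, hits[m] }
  ' | sort
}

sample_logs | login_per_minute
busiest=$(sample_logs | login_per_minute | sort -k2,2nr | head -n 1)
echo "busiest minute: $busiest"
```

In Grafana, a comparable view would be a metric query along the lines of `count_over_time({job="nginx_access"} |= "/login" [1m])`, which counts matching lines per one-minute window.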

---

# LAB 5: Live Log Streaming

In Grafana:

* click LIVE

Run:

```bash
curl localhost
```

Watch logs appear live.

Production use:

* deployment monitoring
* live incident debugging

---

# LAB 6: Break Nginx

Stop nginx:

```bash
sudo systemctl stop nginx
```

Now:

* website unavailable
* curl fails
* logs stop

Observe:

* metrics
* logs
* alerts

Restart:

```bash
sudo systemctl start nginx
```

---

# LAB 7: Watch Authentication Logs

Make a few failed SSH login attempts.

Then query:

```logql
{job="auth"}
```

Search:

```logql
{job="auth"} |= "Failed password"
```

Understand:

* security monitoring
* intrusion attempts

---

# LAB 8: Monitor System Logs

Query:

```logql
{job="syslog"}
```

Observe:

* services
* system events
* daemon activity

---

# LAB 9: Create Dashboard

Create panels for:

* request count
* 404 count
* login failures
* live logs

This is real observability dashboarding.

---

# LAB 10: Correlate Metrics + Logs

Run stress:

```bash
# Load 2 CPU workers for 300 seconds (requires the stress tool to be installed)
stress --cpu 2 --timeout 300
```

Check:

* Prometheus CPU metrics
* Loki logs

Understand:

* metric/log correlation

---

# Common Production Problems

| Problem            | Cause               |
| ------------------ | ------------------- |
| No logs            | Alloy stopped       |
| Missing labels     | bad config          |
| Empty Grafana      | wrong query         |
| High Loki storage  | too many logs       |
| Slow queries       | bad labels          |
| Missing nginx logs | wrong file path     |
| Duplicate logs     | multiple collectors |
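
The "wrong file path" row is quick to rule out on the host itself. The sketch below is a small helper (the function name `check_log_paths` is made up for this example) that reports whether each given file exists and is readable by the current user. Alloy may run under a different account, so a file that is readable for you is not guaranteed to be readable for the collector.

```bash
#!/usr/bin/env bash
# Report whether each given log file exists and is readable by the current user.
check_log_paths() {
  for path in "$@"; do
    if [ -r "$path" ]; then
      echo "readable     $path"
    else
      echo "NOT readable $path"
    fi
  done
}

# Example paths (the error.log location is an assumption); adjust them to
# match the paths in your Alloy config.
check_log_paths /var/log/nginx/access.log /var/log/nginx/error.log
```

If a path shows up as not readable, compare it character by character with the path in the Alloy config before looking anywhere else.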

---

# What You Must Know For Interviews

## Loki

* labels
* LogQL
* centralized logging
* storage
* retention
* troubleshooting

## Alloy

* collectors
* pipelines
* log shipping
* OpenTelemetry
* forwarding

## SRE Concepts

* observability
* root cause analysis
* metrics vs logs
* incident response
* correlation

---

# VERY IMPORTANT INTERVIEW QUESTIONS

## Why Loki instead of ELK?

Answer:

* cheaper
* lightweight
* label indexing
* easier scaling

---

## What does Alloy do?

Answer:



```text
Collects and forwards telemetry:
- logs
- metrics
- traces
```

---

## Difference between Alloy and Loki?

| Alloy       | Loki           |
| ----------- | -------------- |
| collector   | storage        |
| ships logs  | stores logs    |
| reads files | indexes labels |

---

# FINAL UNDERSTANDING

You successfully built a REAL observability system used by:

* DevOps engineers
* SRE engineers
* platform engineers
* cloud engineers

This is NOT beginner work anymore.

You are now doing:

* centralized logging
* observability engineering
* incident investigation
* production troubleshooting
* log analytics
* root cause analysis
