DEV Community

Aisalkyn Aidarova


Observability, Reliability, and Incident Management (Production-Level)

1. What SRE Actually Does (Real World)

After networking (VPC, subnets, routing), your system is running.

Now SRE responsibility starts:

  • Is the system working?
  • Is it fast?
  • Is it reliable?
  • Can we detect problems early?
  • Can we recover quickly?

This is called reliability engineering.

A simple way to think:

DevOps builds the system.
SRE keeps it alive under stress.


2. Observability — Deep Explanation

Observability is not just “monitoring.”
Monitoring tells you that something is wrong.
Observability tells you why it is wrong.

AWS documentation and the Google SRE books define observability as:

the ability to understand a system’s internal state from its external outputs

These outputs are:

  • metrics
  • logs
  • traces

2.1 Metrics (Deep)

Metrics are numerical time-series data

Examples:

  • CPU usage = 70%
  • requests/sec = 200
  • error rate = 5%

Why we use metrics

  • detect anomalies
  • trigger alerts
  • track performance trends
  • capacity planning

Tool: Amazon CloudWatch

What it does

  • collects metrics from AWS services (EC2, ALB, RDS)
  • stores time-series data
  • creates alarms

How we use it (real scenario)

Example:

You deploy an application on EC2

CloudWatch automatically gives:

  • CPUUtilization
  • NetworkIn/Out
  • DiskReadOps

Then you create an alarm:

IF CPU > 80% for 5 minutes → trigger alert
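
The alarm rule above can be sketched in plain Python (a simplified model of CloudWatch's evaluation, assuming one datapoint per minute and five evaluation periods; real alarms are created via the console, the CLI, or `PutMetricAlarm`):

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=5):
    """Return 'ALARM' if the last `evaluation_periods` datapoints all
    breach the threshold, else 'OK' — i.e. CPU > 80% for 5 minutes."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(v > threshold for v in recent):
        return "ALARM"
    return "OK"

# five straight minutes above 80% → ALARM
print(alarm_state([85, 90, 88, 82, 95]))   # → ALARM
# one dip below the threshold resets the streak → OK
print(alarm_state([85, 90, 70, 82, 95]))   # → OK
```

Note the "for 5 minutes" part: requiring every datapoint in the window to breach is what keeps a single CPU spike from paging anyone.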


When to use CloudWatch

  • AWS native monitoring
  • quick setup
  • infrastructure-level metrics

Limitations (SRE thinking)

  • not very strong for custom application metrics
  • limited visualization compared to Prometheus + Grafana

2.2 Metrics Tool (Advanced): Prometheus

What it does

  • pulls metrics from applications
  • stores time-series data
  • supports powerful queries (PromQL)

Why SRE prefers Prometheus

  • better for microservices
  • supports custom metrics
  • integrates with Kubernetes

How we use it

  1. Application exposes metrics endpoint:
    /metrics

  2. Prometheus scrapes it

  3. You query:

  • request latency
  • error rates
  • DB connections
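
The scrape flow can be sketched with the standard library alone (real services normally use the official `prometheus_client` library; this only shows the text format Prometheus expects at `/metrics`, and the metric name is an assumption):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # the app would increment this on every request it serves

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus text exposition format: HELP, TYPE, then samples
            body = (
                "# HELP http_requests_total Total HTTP requests.\n"
                "# TYPE http_requests_total counter\n"
                f"http_requests_total {REQUEST_COUNT}\n"
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# port 0 = pick any free port; Prometheus would be configured to scrape it
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

Prometheus then scrapes this endpoint on its own schedule (the pull model) and stores each sample as time-series data.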

Example

You detect:

  • high latency
  • normal CPU

→ issue is NOT infrastructure
→ issue is application


Troubleshooting using metrics

Case:

Website slow

Check:

  • CPU high → scaling issue
  • latency high → app issue
  • error rate high → bug or DB problem
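
The checks above map to PromQL queries (the metric names `http_requests_total` and `http_request_duration_seconds` are assumptions; your instrumentation may name them differently):

```promql
# error rate over the last 5 minutes (fraction of requests returning 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 95th-percentile request latency, assuming a histogram metric
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```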

2.3 Logs (Deep)

Logs are detailed events

Example:

  • user login failed
  • DB connection error
  • API returned 500

Tool: ELK Stack

Components:

  • Elasticsearch → storage
  • Logstash → processing
  • Kibana → visualization

Why logs are critical

Metrics say:
“Error rate increased”

Logs say:
“Database connection timeout”


How we use logs

Good logs must include:

  • timestamp
  • service name
  • log level (INFO, ERROR)
  • request ID
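
A structured (JSON) log line carrying those four fields can be emitted with the standard library alone (the service name and message here are placeholders; production code would route this through a logging framework rather than `print`):

```python
import json
import time
import uuid

def log_event(level, service, message, request_id=None):
    """Emit one JSON log line with timestamp, service, level, request ID."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "level": level,
        "request_id": request_id or str(uuid.uuid4()),
        "message": message,
    }
    print(json.dumps(record))  # one line per event → easy for Logstash to parse
    return record

entry = log_event("ERROR", "checkout-api", "DB connection timeout",
                  request_id="req-123")
```

The request ID is what lets you pull every log line for one user request across services — which is why it belongs in every entry.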

AWS logging: CloudWatch Logs

  • collects logs from EC2, Lambda
  • integrates with CloudWatch metrics

Troubleshooting using logs

Case:

App returns 500

Steps:

  1. check logs
  2. find error message
  3. identify root cause

Example:

  • “connection refused” → DB issue
  • “timeout” → network issue

2.4 Tracing (Deep)

Tracing tracks a request across services

Example:

User request path:

User → ALB → API → Service → DB


Tool: AWS X-Ray


Why tracing matters

In microservices:

You don’t know where latency happens

Tracing shows:

  • which service is slow
  • where failure occurs

Example

Request takes 3 seconds

Tracing shows:

  • API: 50ms
  • Service: 100ms
  • DB: 2.8s

→ problem is DB
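
X-Ray presents this breakdown visually; the underlying analysis is just "which span dominates the total latency," which can be sketched as (span names and durations taken from the example above):

```python
# per-hop durations in milliseconds for one traced request
spans = {"API": 50, "Service": 100, "DB": 2800}

def slowest_span(spans):
    """Return (name, duration_ms) of the hop contributing the most latency."""
    return max(spans.items(), key=lambda kv: kv[1])

name, ms = slowest_span(spans)
print(f"bottleneck: {name} ({ms} ms)")  # → bottleneck: DB (2800 ms)
```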


3. SLI, SLO, SLA (Deep Understanding)


SLI (Indicator)

What you measure:

  • uptime
  • latency
  • error rate

SLO (Objective)

Target:

  • 99.9% uptime

SLA (Agreement)

Legal commitment:

  • if broken → compensation

Why SRE uses this

Because:

Without SLO → no reliability target
Without SLI → no measurement


4. Alerting (Real SRE Thinking)

Alerting is where most teams fail


Bad alerts

  • CPU 80%
  • disk usage 70%

These create noise


Good alerts

  • user cannot login
  • API error rate > 5%
  • latency > threshold

Tool: Prometheus Alertmanager


How an alert works

  1. metric collected
  2. condition evaluated
  3. alert fired
  4. notification sent (Slack, email)
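
Steps 1–3 are typically encoded as a Prometheus alerting rule like the following (the metric name and the 5% threshold mirror the "good alert" example above; the label/annotation values are placeholders):

```yaml
groups:
  - name: user-impact
    rules:
      - alert: HighErrorRate
        # fires only if the condition holds for 5 minutes straight,
        # which filters out transient blips
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5%"
```

Alertmanager then handles step 4: routing the fired alert to Slack, email, or a paging service, with grouping and silencing on top.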

SRE rule

Alert on user impact, not infrastructure


5. Incident Management (Production Flow)

Incident = service disruption


Real steps

  1. detection (monitoring)
  2. alert triggered
  3. engineer responds
  4. mitigation (temporary fix)
  5. resolution (root cause fix)
  6. postmortem

Example

Issue:
Website down

Actions:

  • restart service (mitigation)
  • fix DB connection (resolution)

6. Postmortem (Critical SRE Practice)

After incident:

You must answer:

  • what happened?
  • why?
  • how to prevent?

Rule

No blame

Focus on system failure, not people


7. Error Budget (Advanced Concept)

If SLO = 99.9%

Allowed downtime:

≈ 43 minutes/month
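
The arithmetic behind that number is simple enough to sketch (assuming a 30-day month):

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime per `days`-day window at a given availability SLO."""
    return (1 - slo) * days * 24 * 60  # minutes in the window × failure budget

print(error_budget_minutes(0.999))  # 43.2 minutes per 30-day month
print(error_budget_minutes(0.99))   # 432 minutes — one extra nine is 10x
```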


Why important

Balance:

  • innovation (deploy fast)
  • stability (avoid downtime)

8. High Availability (HA)

System must survive failure


AWS tools

  • Elastic Load Balancer
  • Multi-AZ deployment

Example

If one AZ fails:

Traffic shifts to another AZ


9. Auto Scaling (Reliability + Cost)

Automatically adjust capacity


AWS service

Auto Scaling Group


Example

Traffic spike:

  • add EC2 instances

Traffic drop:

  • remove instances

10. Health Checks

Check system status


In Kubernetes

  • readiness probe → ready to serve
  • liveness probe → alive

Tool: Kubernetes
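
In a pod spec, the two probes look like this (the container name, image, paths, ports, and timings are all placeholders to adjust for your app):

```yaml
containers:
  - name: app                # placeholder
    image: my-app:1.0        # placeholder
    readinessProbe:          # not ready → removed from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:           # repeated failures → container is restarted
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```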


Why important

Without health checks:

the load balancer keeps sending traffic to a broken app


11. Caching (Performance)

Store frequently used data


Tool: Redis


Why use caching

  • reduce DB load
  • faster response
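
The usual pattern with Redis is cache-aside: try the cache, fall back to the database, then populate the cache. A minimal sketch, using a tiny in-memory stand-in for Redis's `GET`/`SETEX` so it runs anywhere (with the real `redis` client the calls are the same shape):

```python
import time

class Cache:
    """In-memory stand-in for Redis GET/SETEX (illustration only)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        return value if time.time() < expires else None
    def setex(self, key, ttl, value):
        self._store[key] = (value, time.time() + ttl)

def get_user(cache, db_fetch, user_id):
    """Cache-aside: cheap cache hit first, slow DB fetch only on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                # fast path: no DB load
    value = db_fetch(user_id)        # slow path: hit the database
    cache.setex(key, 60, value)      # keep for 60 seconds
    return value
```

The TTL is the trade-off knob: longer means less DB load but staler data.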

12. Disaster Recovery

Plan for failure


Strategies

  • backup restore
  • multi-region
  • active-active

13. Troubleshooting Mindset (MOST IMPORTANT)

When something breaks:

DO NOT GUESS

Follow layers:

  1. DNS
  2. Network
  3. Load balancer
  4. App
  5. DB

Example

App not working

Check:

  • DNS resolves?
  • ALB healthy?
  • EC2 running?
  • logs show error?
  • DB reachable?
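
The layered checklist above can be encoded as an ordered walk that stops at the first failing layer (each lambda here stands in for a real probe — a DNS lookup, a `curl` to the ALB, a DB ping, and so on):

```python
def diagnose(checks):
    """Run ordered (layer, check) pairs; return the first failing layer,
    or None if every layer passes. Encodes 'do not guess, follow layers'."""
    for layer, check in checks:
        if not check():
            return layer
    return None

checks = [
    ("DNS",           lambda: True),
    ("Network",       lambda: True),
    ("Load balancer", lambda: True),
    ("App",           lambda: False),  # simulate an app-layer failure
    ("DB",            lambda: True),
]

print(diagnose(checks))  # → App
```

Stopping at the first failure matters: if DNS is broken, every layer below it looks broken too, and checking them first wastes incident time.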

FINAL SRE INTERVIEW ANSWER

You say:

As an SRE, I focus on system reliability by implementing observability using metrics, logs, and traces, defining SLOs, setting up meaningful alerts, ensuring high availability with load balancing and auto scaling, and handling incidents with structured troubleshooting and postmortems.
