DEV Community

Aisalkyn Aidarova


Observability, Reliability, and Incident Management (Production-Level)

1. What SRE Actually Does (Real World)

After networking (VPC, subnets, routing), your system is running.

Now SRE responsibility starts:

  • Is the system working?
  • Is it fast?
  • Is it reliable?
  • Can we detect problems early?
  • Can we recover quickly?

This is called reliability engineering.

A simple way to think:

DevOps builds the system.
SRE keeps it alive under stress.


2. Observability — Deep Explanation

Observability is not just “monitoring.”
Monitoring tells you that something is wrong.
Observability tells you why it is wrong.

AWS documentation and the Google SRE books define observability as:

the ability to understand a system’s internal state from its external outputs

These outputs are:

  • metrics
  • logs
  • traces

2.1 Metrics (Deep)

Metrics are numerical time-series data

Examples:

  • CPU usage = 70%
  • requests/sec = 200
  • error rate = 5%

Why we use metrics

  • detect anomalies
  • trigger alerts
  • track performance trends
  • capacity planning

Tool: Amazon CloudWatch

What it does

  • collects metrics from AWS services (EC2, ALB, RDS)
  • stores time-series data
  • creates alarms

How we use it (real scenario)

Example:

You deploy an application on EC2

CloudWatch automatically gives:

  • CPUUtilization
  • NetworkIn/Out
  • DiskReadOps

Then you create an alarm:

IF CPU > 80% for 5 minutes → trigger alert
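
The alarm rule above can be sketched in plain Python (a simplified model of CloudWatch's evaluation, assuming one datapoint per minute and five evaluation periods; real alarms are created via the console, the CLI, or `PutMetricAlarm`):

```python
def alarm_state(datapoints, threshold=80.0, evaluation_periods=5):
    """Return 'ALARM' if the last `evaluation_periods` datapoints all
    breach the threshold, else 'OK' — i.e. CPU > 80% for 5 minutes."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) == evaluation_periods and all(v > threshold for v in recent):
        return "ALARM"
    return "OK"

# five straight minutes above 80% → ALARM
print(alarm_state([85, 90, 88, 82, 95]))   # → ALARM
# one dip below the threshold resets the streak → OK
print(alarm_state([85, 90, 70, 82, 95]))   # → OK
```

Note the "for 5 minutes" part: requiring every datapoint in the window to breach is what keeps a single CPU spike from paging anyone.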


When to use CloudWatch

  • AWS native monitoring
  • quick setup
  • infrastructure-level metrics

Limitations (SRE thinking)

  • not very strong for custom application metrics
  • limited visualization compared to Prometheus + Grafana

2.2 Metrics Tool (Advanced): Prometheus

What it does

  • pulls metrics from applications
  • stores time-series data
  • supports powerful queries (PromQL)

Why SRE prefers Prometheus

  • better for microservices
  • supports custom metrics
  • integrates with Kubernetes

How we use it

  1. Application exposes metrics endpoint:
    /metrics

  2. Prometheus scrapes it

  3. You query:

  • request latency
  • error rates
  • DB connections
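
The scrape flow can be sketched with the standard library alone (real services normally use the official `prometheus_client` library; this only shows the text format Prometheus expects at `/metrics`, and the metric name is an assumption):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUEST_COUNT = 0  # the app would increment this on every request it serves

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            # Prometheus text exposition format: HELP, TYPE, then samples
            body = (
                "# HELP http_requests_total Total HTTP requests.\n"
                "# TYPE http_requests_total counter\n"
                f"http_requests_total {REQUEST_COUNT}\n"
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# port 0 = pick any free port; Prometheus would be configured to scrape it
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

Prometheus then scrapes this endpoint on its own schedule (the pull model) and stores each sample as time-series data.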

Example

You detect:

  • high latency
  • normal CPU

→ issue is NOT infrastructure
→ issue is application


Troubleshooting using metrics

Case:

Website slow

Check:

  • CPU high → scaling issue
  • latency high → app issue
  • error rate high → bug or DB problem
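
The checks above map to PromQL queries (the metric names `http_requests_total` and `http_request_duration_seconds` are assumptions; your instrumentation may name them differently):

```promql
# error rate over the last 5 minutes (fraction of requests returning 5xx)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# 95th-percentile request latency, assuming a histogram metric
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```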

2.3 Logs (Deep)

Logs are detailed events

Example:

  • user login failed
  • DB connection error
  • API returned 500

Tool: ELK Stack

Components:

  • Elasticsearch → storage
  • Logstash → processing
  • Kibana → visualization

Why logs are critical

Metrics say:
“Error rate increased”

Logs say:
“Database connection timeout”


How we use logs

Good logs must include:

  • timestamp
  • service name
  • log level (INFO, ERROR)
  • request ID
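
A structured (JSON) log line carrying those four fields can be emitted with the standard library alone (the service name and message here are placeholders; production code would route this through a logging framework rather than `print`):

```python
import json
import time
import uuid

def log_event(level, service, message, request_id=None):
    """Emit one JSON log line with timestamp, service, level, request ID."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "level": level,
        "request_id": request_id or str(uuid.uuid4()),
        "message": message,
    }
    print(json.dumps(record))  # one line per event → easy for Logstash to parse
    return record

entry = log_event("ERROR", "checkout-api", "DB connection timeout",
                  request_id="req-123")
```

The request ID is what lets you pull every log line for one user request across services — which is why it belongs in every entry.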

AWS logging: CloudWatch Logs

  • collects logs from EC2, Lambda
  • integrates with CloudWatch metrics

Troubleshooting using logs

Case:

App returns 500

Steps:

  1. check logs
  2. find error message
  3. identify root cause

Example:

  • “connection refused” → DB issue
  • “timeout” → network issue

2.4 Tracing (Deep)

Tracing tracks a request across services

Example:

User request path:

User → ALB → API → Service → DB


Tool: AWS X-Ray


Why tracing matters

In microservices:

You don’t know where latency happens

Tracing shows:

  • which service is slow
  • where failure occurs

Example

Request takes 3 seconds

Tracing shows:

  • API: 50ms
  • Service: 100ms
  • DB: 2.8s

→ problem is DB
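
X-Ray presents this breakdown visually; the underlying analysis is just "which span dominates the total latency," which can be sketched as (span names and durations taken from the example above):

```python
# per-hop durations in milliseconds for one traced request
spans = {"API": 50, "Service": 100, "DB": 2800}

def slowest_span(spans):
    """Return (name, duration_ms) of the hop contributing the most latency."""
    return max(spans.items(), key=lambda kv: kv[1])

name, ms = slowest_span(spans)
print(f"bottleneck: {name} ({ms} ms)")  # → bottleneck: DB (2800 ms)
```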


3. SLI, SLO, SLA (Deep Understanding)


SLI (Indicator)

What you measure:

  • uptime
  • latency
  • error rate

SLO (Objective)

Target:

  • 99.9% uptime

SLA (Agreement)

Legal commitment:

  • if broken → compensation

Why SRE uses this

Because:

Without SLO → no reliability target
Without SLI → no measurement


4. Alerting (Real SRE Thinking)

Alerting is where most teams fail


Bad alerts

  • CPU 80%
  • disk usage 70%

These create noise


Good alerts

  • user cannot login
  • API error rate > 5%
  • latency > threshold

Tool: Prometheus Alertmanager


How an alert works

  1. metric collected
  2. condition evaluated
  3. alert fired
  4. notification sent (Slack, email)
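
Steps 1–3 are typically encoded as a Prometheus alerting rule like the following (the metric name and the 5% threshold mirror the "good alert" example above; the label/annotation values are placeholders):

```yaml
groups:
  - name: user-impact
    rules:
      - alert: HighErrorRate
        # fires only if the condition holds for 5 minutes straight,
        # which filters out transient blips
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "API error rate above 5%"
```

Alertmanager then handles step 4: routing the fired alert to Slack, email, or a paging service, with grouping and silencing on top.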

SRE rule

Alert on user impact, not infrastructure


5. Incident Management (Production Flow)

Incident = service disruption


Real steps

  1. detection (monitoring)
  2. alert triggered
  3. engineer responds
  4. mitigation (temporary fix)
  5. resolution (root cause fix)
  6. postmortem

Example

Issue:
Website down

Actions:

  • restart service (mitigation)
  • fix DB connection (resolution)

6. Postmortem (Critical SRE Practice)

After incident:

You must answer:

  • what happened?
  • why?
  • how to prevent?

Rule

No blame

Focus on system failure, not people


7. Error Budget (Advanced Concept)

If SLO = 99.9%

Allowed downtime:

≈ 43 minutes/month
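
The arithmetic behind that number is simple enough to sketch (assuming a 30-day month):

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime per `days`-day window at a given availability SLO."""
    return (1 - slo) * days * 24 * 60  # minutes in the window × failure budget

print(error_budget_minutes(0.999))  # 43.2 minutes per 30-day month
print(error_budget_minutes(0.99))   # 432 minutes — one extra nine is 10x
```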


Why important

Balance:

  • innovation (deploy fast)
  • stability (avoid downtime)

8. High Availability (HA)

System must survive failure


AWS tools

  • Elastic Load Balancer
  • Multi-AZ deployment

Example

If one AZ fails:

Traffic shifts to another AZ


9. Auto Scaling (Reliability + Cost)

Automatically adjust capacity


AWS service

Auto Scaling Group


Example

Traffic spike:

  • add EC2 instances

Traffic drop:

  • remove instances

10. Health Checks

Check system status


In Kubernetes

  • readiness probe → ready to serve
  • liveness probe → alive

Tool: Kubernetes
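
In a pod spec, the two probes look like this (the container name, image, paths, ports, and timings are all placeholders to adjust for your app):

```yaml
containers:
  - name: app                # placeholder
    image: my-app:1.0        # placeholder
    readinessProbe:          # not ready → removed from Service endpoints
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:           # repeated failures → container is restarted
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```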


Why important

Without health checks:

the load balancer keeps sending traffic to a broken app


11. Caching (Performance)

Store frequently used data


Tool: Redis


Why use caching

  • reduce DB load
  • faster response
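
The usual pattern with Redis is cache-aside: try the cache, fall back to the database, then populate the cache. A minimal sketch, using a tiny in-memory stand-in for Redis's `GET`/`SETEX` so it runs anywhere (with the real `redis` client the calls are the same shape):

```python
import time

class Cache:
    """In-memory stand-in for Redis GET/SETEX (illustration only)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        value, expires = self._store.get(key, (None, 0.0))
        return value if time.time() < expires else None
    def setex(self, key, ttl, value):
        self._store[key] = (value, time.time() + ttl)

def get_user(cache, db_fetch, user_id):
    """Cache-aside: cheap cache hit first, slow DB fetch only on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached                # fast path: no DB load
    value = db_fetch(user_id)        # slow path: hit the database
    cache.setex(key, 60, value)      # keep for 60 seconds
    return value
```

The TTL is the trade-off knob: longer means less DB load but staler data.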

12. Disaster Recovery

Plan for failure


Strategies

  • backup restore
  • multi-region
  • active-active

13. Troubleshooting Mindset (MOST IMPORTANT)

When something breaks:

DO NOT GUESS

Follow layers:

  1. DNS
  2. Network
  3. Load balancer
  4. App
  5. DB

Example

App not working

Check:

  • DNS resolves?
  • ALB healthy?
  • EC2 running?
  • logs show error?
  • DB reachable?
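
The layered checklist above can be encoded as an ordered walk that stops at the first failing layer (each lambda here stands in for a real probe — a DNS lookup, a `curl` to the ALB, a DB ping, and so on):

```python
def diagnose(checks):
    """Run ordered (layer, check) pairs; return the first failing layer,
    or None if every layer passes. Encodes 'do not guess, follow layers'."""
    for layer, check in checks:
        if not check():
            return layer
    return None

checks = [
    ("DNS",           lambda: True),
    ("Network",       lambda: True),
    ("Load balancer", lambda: True),
    ("App",           lambda: False),  # simulate an app-layer failure
    ("DB",            lambda: True),
]

print(diagnose(checks))  # → App
```

Stopping at the first failure matters: if DNS is broken, every layer below it looks broken too, and checking them first wastes incident time.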

FINAL SRE INTERVIEW ANSWER

You say:

As an SRE, I focus on system reliability by implementing observability using metrics, logs, and traces, defining SLOs, setting up meaningful alerts, ensuring high availability with load balancing and auto scaling, and handling incidents with structured troubleshooting and postmortems.
