1. What SRE Actually Does (Real World)
After networking (VPC, subnets, routing), your system is running.
Now SRE responsibility starts:
- Is the system working?
- Is it fast?
- Is it reliable?
- Can we detect problems early?
- Can we recover quickly?
This is called reliability engineering.
A simple way to think:
DevOps builds system
SRE keeps it alive under stress
2. Observability — Deep Explanation
Observability is not just “monitoring.”
Monitoring tells you something is wrong
Observability tells you why it is wrong
AWS documentation and the Google SRE books define observability as:
The ability to understand internal system state from external outputs
These outputs are:
- metrics
- logs
- traces
2.1 Metrics (Deep)
Metrics are numerical time-series data
Examples:
- CPU usage = 70%
- requests/sec = 200
- error rate = 5%
Why we use metrics
- detect anomalies
- trigger alerts
- track performance trends
- capacity planning
Tool: Amazon CloudWatch
What it does
- collects metrics from AWS services (EC2, ALB, RDS)
- stores time-series data
- creates alarms
How we use it (real scenario)
Example:
You deploy application on EC2
CloudWatch automatically gives:
- CPUUtilization
- NetworkIn/Out
- DiskReadOps
Then you create alarm:
IF CPU > 80% for 5 minutes → trigger alert
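The alarm rule above can be sketched as the arguments you would pass to boto3's `put_metric_alarm`. The alarm name and the commented-out SNS topic are placeholders, not from the original text; a 300-second period with one evaluation period expresses "for 5 minutes".

```python
# CloudWatch alarm sketch: CPU > 80% for 5 minutes -> alert.
alarm = {
    "AlarmName": "high-cpu",            # placeholder name
    "Namespace": "AWS/EC2",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Period": 300,                      # one 5-minute window
    "EvaluationPeriods": 1,             # one breaching window fires it
    "Threshold": 80.0,
    "ComparisonOperator": "GreaterThanThreshold",
    # "AlarmActions": ["arn:aws:sns:..."],  # notification target (placeholder)
}
# With credentials configured you would apply it with:
# boto3.client("cloudwatch").put_metric_alarm(**alarm)
print(alarm["MetricName"], alarm["Threshold"])
```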
When to use CloudWatch
- AWS native monitoring
- quick setup
- infrastructure-level metrics
Limitations (SRE thinking)
- custom application metrics need manual instrumentation (PutMetricData)
- weaker visualization and querying compared to Prometheus + Grafana
2.2 Metrics Tool (Advanced): Prometheus
What it does
- pulls metrics from applications
- stores time-series data
- supports powerful queries (PromQL)
Why SRE prefers Prometheus
- better for microservices
- supports custom metrics
- integrates with Kubernetes
How we use it
Application exposes a metrics endpoint:
/metrics
Prometheus scrapes it
You query:
- request latency
- error rates
- DB connections
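What Prometheus scrapes from /metrics is plain text. A minimal sketch of parsing one such exposition and deriving an error rate (the metric name and values are made up for illustration):

```python
# A tiny Prometheus text-format exposition, like an app serves on /metrics.
exposition = """\
http_requests_total{status="200"} 950
http_requests_total{status="500"} 50
"""

metrics = {}
for line in exposition.splitlines():
    name_labels, value = line.rsplit(" ", 1)   # split off the sample value
    metrics[name_labels] = float(value)

total = sum(metrics.values())
errors = metrics['http_requests_total{status="500"}']
print(f"error rate: {errors / total:.0%}")  # 5%
```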
Example
You detect:
- high latency
- normal CPU
→ issue is NOT infrastructure
→ issue is application
Troubleshooting using metrics
Case:
Website slow
Check:
- CPU high → scaling issue
- latency high → app issue
- error rate high → bug or DB problem
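The checklist above is a decision tree; a minimal sketch, with illustrative thresholds not taken from the original text:

```python
def diagnose(cpu_pct, latency_ms, error_rate_pct):
    """Map metric readings to a likely cause, per the checklist above.
    Thresholds (80%, 500 ms, 5%) are illustrative."""
    if cpu_pct > 80:
        return "scaling issue"
    if latency_ms > 500:
        return "app issue"
    if error_rate_pct > 5:
        return "bug or DB problem"
    return "healthy"

# High latency but normal CPU: not infrastructure, it is the application.
print(diagnose(cpu_pct=30, latency_ms=900, error_rate_pct=0.1))
```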
2.3 Logs (Deep)
Logs are detailed events
Example:
- user login failed
- DB connection error
- API returned 500
Tool: ELK Stack
Components:
- Elasticsearch → storage
- Logstash → processing
- Kibana → visualization
Why logs are critical
Metrics say:
“Error rate increased”
Logs say:
“Database connection timeout”
How we use logs
Good logs must include:
- timestamp
- service name
- log level (INFO, ERROR)
- request ID
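A minimal sketch of emitting one structured log line with exactly those fields. The field names follow a common JSON-logging convention, not a fixed standard:

```python
import json
import time
import uuid

def log(level, service, message, request_id=None):
    """Emit one structured log line: timestamp, service, level, request ID."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "level": level,
        "request_id": request_id or str(uuid.uuid4()),  # correlate across services
        "message": message,
    }
    print(json.dumps(entry))
    return entry

entry = log("ERROR", "auth-service", "user login failed")
```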
AWS logging: CloudWatch Logs
- collects logs from EC2, Lambda
- integrates with CloudWatch metrics
Troubleshooting using logs
Case:
App returns 500
Steps:
- check logs
- find error message
- identify root cause
Example:
- “connection refused” → DB issue
- “timeout” → network issue
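The mapping above is rough triage, not a diagnosis; as a sketch:

```python
def classify_error(log_line):
    """Rough first-pass mapping from error text to likely layer,
    per the examples above. Real triage reads the full stack trace."""
    line = log_line.lower()
    if "connection refused" in line:
        return "DB issue"
    if "timeout" in line:
        return "network issue"
    return "unknown"

print(classify_error("ERROR: connection refused by db-primary:5432"))
```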
2.4 Tracing (Deep)
Tracing tracks a request across services
Example:
User request path:
User → ALB → API → Service → DB
Tool: AWS X-Ray
Why tracing matters
In microservices:
You don’t know where latency happens
Tracing shows:
- which service is slow
- where failure occurs
Example
Request takes 3 seconds
Tracing shows:
- API: 50ms
- Service: 100ms
- DB: 2.8s
→ problem is DB
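The trace analysis above is just "find the dominant span":

```python
# Per-hop timings (seconds) from the trace above.
spans = {"API": 0.05, "Service": 0.10, "DB": 2.8}

slowest = max(spans, key=spans.get)            # span with the most time
share = spans[slowest] / sum(spans.values())   # its share of the request
print(f"{slowest} accounts for {share:.0%} of the request")
```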
3. SLI, SLO, SLA (Deep Understanding)
SLI (Indicator)
What you measure:
- uptime
- latency
- error rate
SLO (Objective)
Target:
- 99.9% uptime
SLA (Agreement)
Legal commitment:
- if broken → compensation
Why SRE uses this
Because:
Without SLO → no reliability target
Without SLI → no measurement
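A minimal sketch tying the three together: the SLI is what you compute from traffic, the SLO is the target you compare it against (the request counts are made up):

```python
def availability_sli(total, failed):
    """SLI: fraction of requests that succeeded."""
    return (total - failed) / total

SLO = 0.999                                    # objective: 99.9%
sli = availability_sli(total=1_000_000, failed=800)
print(sli, "meets SLO" if sli >= SLO else "SLO breached")
```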
4. Alerting (Real SRE Thinking)
Alerting is where most teams fail
Bad alerts
- CPU 80%
- disk usage 70%
These create noise
Good alerts
- user cannot login
- API error rate > 5%
- latency > threshold
Tool: Prometheus Alertmanager
How alert works
- metric collected
- condition evaluated
- alert fired
- notification sent (Slack, email)
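The flow above can be sketched as a condition over recent samples. Requiring every recent sample to breach (like a `for` duration in a Prometheus alerting rule) keeps a single spike from paging anyone:

```python
def evaluate_alert(error_rates, threshold=5.0):
    """Fire only if every recent sample breaches the threshold,
    so one transient spike does not create noise."""
    return all(rate > threshold for rate in error_rates)

recent = [6.2, 7.1, 5.5]   # last three scrapes, percent
if evaluate_alert(recent):
    print("ALERT: API error rate > 5% - notify Slack/email")
```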
SRE rule
Alert on user impact, not infrastructure
5. Incident Management (Production Flow)
Incident = service disruption
Real steps
- detection (monitoring)
- alert triggered
- engineer responds
- mitigation (temporary fix)
- resolution (root cause fix)
- postmortem
Example
Issue:
Website down
Actions:
- restart service (mitigation)
- fix DB connection (resolution)
6. Postmortem (Critical SRE Practice)
After incident:
You must answer:
- what happened?
- why?
- how to prevent?
Rule
No blame
Focus on system failure, not people
7. Error Budget (Advanced Concept)
If SLO = 99.9%
Allowed downtime:
≈ 43 minutes/month
Why important
Balance:
- innovation (deploy fast)
- stability (avoid downtime)
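The budget arithmetic above, as a one-liner you can check:

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime per period for a given SLO."""
    return (1 - slo) * days * 24 * 60

budget = error_budget_minutes(0.999)
print(f"{budget:.1f} minutes/month")  # ~43.2
```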
8. High Availability (HA)
System must survive failure
AWS tools
- Elastic Load Balancer
- Multi-AZ deployment
Example
If one AZ fails:
Traffic shifts to another AZ
9. Auto Scaling (Reliability + Cost)
Automatically adjust capacity
AWS service
Auto Scaling Group
Example
Traffic spike:
- add EC2 instances
Traffic drop:
- remove instances
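A minimal sketch of how a target-tracking policy reasons about capacity. The target of 50% CPU and the size bounds are illustrative, not from the original text:

```python
def desired_capacity(current, cpu_pct, target=50, min_size=2, max_size=10):
    """Scale so average CPU approaches `target`, clamped to group bounds.
    Mirrors the idea behind an Auto Scaling Group's target tracking."""
    desired = round(current * cpu_pct / target)
    return max(min_size, min(max_size, desired))

print(desired_capacity(current=4, cpu_pct=90))  # spike: add instances
print(desired_capacity(current=4, cpu_pct=20))  # quiet: remove instances
```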
10. Health Checks
Check system status
Tool: Kubernetes probes
- readiness probe → ready to serve traffic
- liveness probe → process is alive (restarted if not)
Why important
Without health checks:
Load balancer sends traffic to broken app
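A minimal sketch of the two probes; the endpoint names and the DB dependency are illustrative:

```python
class App:
    """Liveness = the process responds at all.
    Readiness = dependencies (here, a DB) are reachable."""
    def __init__(self):
        self.db_connected = False

    def livez(self):
        return 200  # running, even if not yet ready

    def readyz(self):
        # Until the DB is reachable, tell the load balancer
        # to keep traffic away instead of sending it to a broken app.
        return 200 if self.db_connected else 503

app = App()
print(app.readyz())        # 503: starting up, no traffic yet
app.db_connected = True
print(app.readyz())        # 200: ready to serve
```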
11. Caching (Performance)
Store frequently used data
Tool: Redis
Why use caching
- reduce DB load
- faster response
12. Disaster Recovery
Plan for failure
Strategies
- backup restore
- multi-region
- active-active
13. Troubleshooting Mindset (MOST IMPORTANT)
When something breaks:
DO NOT GUESS
Follow layers:
- DNS
- Network
- Load balancer
- App
- DB
Example
App not working
Check:
- DNS resolves?
- ALB healthy?
- EC2 running?
- logs show error?
- DB reachable?
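The layered checklist above, as an ordered walk that stops at the first failure instead of guessing. Each check is a stub boolean here; real checks would run dig, AWS CLI calls, log greps, and so on:

```python
def troubleshoot(checks):
    """Walk the layers in order; return the first failing one."""
    for layer, ok in checks:
        if not ok:
            return layer
    return None  # every layer passed

checks = [
    ("DNS resolves", True),
    ("ALB healthy", True),
    ("EC2 running", True),
    ("app logs clean", False),   # first failing layer: start here
    ("DB reachable", True),
]
print(troubleshoot(checks))
```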
FINAL SRE INTERVIEW ANSWER
You say:
As an SRE, I focus on system reliability by implementing observability using metrics, logs, and traces, defining SLOs, setting up meaningful alerts, ensuring high availability with load balancing and auto scaling, and handling incidents with structured troubleshooting and postmortems.