My SRE Starter Pack: Tools and Practices I Wish I Knew Sooner

Why did nobody warn me that CloudWatch dashboards would become my second home?

Being an SRE isn’t just about uptime. It’s about building systems that can tell you what’s wrong, where, and why, long before your customers notice.

When I started in SRE, I knew Linux and AWS and had a vague idea of “monitoring.” But it wasn’t until I got thrown into a few 5 AM incidents that I realized just how critical some tools and habits are.

Here’s a look into the toolkit I wish I had mastered earlier, especially if you’re working with AWS-native infrastructure.

🟢 1. CloudWatch: The Silent Sentinel
CloudWatch is the first place I look when things go sideways. But let’s be honest, it’s not the most intuitive tool to start with. What I rely on:

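CloudWatch Alarms for thresholds on CPU, disk, memory, and latency (on EC2, disk and memory metrics only show up once the CloudWatch agent is installed).
Metric Math to combine several metrics into one derived signal, such as an error rate.
Dashboards with saved filters per service or environment.
Anomaly Detection for alerts that follow a metric’s normal pattern instead of a fixed threshold.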
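
To make the Metric Math idea concrete, here is a minimal sketch of the parameters for an alarm on a derived error rate. It assumes the service sits behind an Application Load Balancer; the alarm name, the load balancer value, and the 5% threshold are placeholders I made up for illustration. The function only builds a dictionary, so it runs without AWS credentials.

```python
from typing import Any


def build_error_rate_alarm(
    alarm_name: str, load_balancer: str, threshold_pct: float = 5.0
) -> dict[str, Any]:
    """Build keyword arguments for CloudWatch's PutMetricAlarm call."""

    def alb_sum(metric_id: str, metric_name: str) -> dict[str, Any]:
        # One raw ALB metric, summed per minute. ReturnData is False so
        # the alarm evaluates only the derived expression below.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApplicationELB",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "LoadBalancer", "Value": load_balancer}
                    ],
                },
                "Period": 60,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": alarm_name,
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": threshold_pct,
        # Alarm when 3 of the last 5 one-minute datapoints breach.
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        # Quiet periods with no traffic should not page anyone.
        "TreatMissingData": "notBreaching",
        "Metrics": [
            alb_sum("errors", "HTTPCode_Target_5XX_Count"),
            alb_sum("requests", "RequestCount"),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / requests",
                "Label": "5xx error rate (%)",
                "ReturnData": True,
            },
        ],
    }
```

With boto3 installed and credentials configured, the result could be passed along as boto3.client("cloudwatch").put_metric_alarm(**params). Alarming on a ratio rather than a raw 5xx count keeps the alarm meaningful at both low and high traffic.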

🚨 2. PagerDuty: Alert Me, But Nicely
PagerDuty is like that colleague who yells your name when something’s broken, except it can escalate, snooze, and tell the right person.
🔔 What I set up:
Routing by environment or service type (dev vs prod, app vs infra).
Escalation policies so critical issues don’t go unnoticed.
Event rules to suppress flappy alerts.
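
The routing and suppression logic above can be sketched in plain Python. This is not PagerDuty's API or rule syntax, just the decision logic I'd want those rules to express; the queue names, the Alert fields, and the limits are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    service: str       # e.g. "checkout-api"
    environment: str   # "prod" or "dev"
    kind: str          # "app" or "infra"
    timestamp: float   # seconds since epoch


def route(alert: Alert) -> str:
    """Pick a destination queue (names are made up for this sketch)."""
    if alert.environment != "prod":
        return "low-urgency-digest"  # dev noise never pages anyone
    return "infra-oncall" if alert.kind == "infra" else "app-oncall"


@dataclass
class FlapSuppressor:
    max_events: int = 3           # pages allowed per sliding window
    window_seconds: float = 600   # sliding window length
    _seen: dict = field(default_factory=dict, init=False, repr=False)

    def should_page(self, alert: Alert) -> bool:
        key = (alert.service, alert.environment)
        recent = self._seen.setdefault(key, deque())
        # Forget events that have fallen out of the sliding window.
        while recent and alert.timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        # Suppressed events are recorded too, so an alert that keeps
        # flapping stays quiet until it has actually calmed down.
        recent.append(alert.timestamp)
        return len(recent) <= self.max_events
```

For example, with max_events=2 and a 60-second window, alerts at t=0 and t=10 page, one at t=20 is suppressed, and one at t=200 pages again because the earlier events have expired.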

🌐 3. StatusPage: Letting the World Know (Calmly)
When things break, customers aren’t looking for excuses — just clarity.

StatusPage can help us:
Communicate incident timelines publicly.
Track uptime history per system.
Build trust with transparency.
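
The uptime number on a status page is simple arithmetic, and it helps to know how it moves. A small sketch, assuming a 30-day window and outages that don't overlap (the function name and the sample outage lengths are mine, not anything StatusPage exposes):

```python
def uptime_percent(outage_minutes: list[float], window_days: int = 30) -> float:
    """Share of the window a system was up, as a percentage."""
    total = window_days * 24 * 60
    # Clamp so bad input can never produce a negative uptime.
    down = min(sum(outage_minutes), total)
    return round(100 * (total - down) / total, 3)


# Two incidents of 12 and 31 minutes in a 30-day month:
print(uptime_percent([12, 31]))  # 99.9
```

Those 43 minutes are almost exactly the whole monthly budget for "three nines," which is why short, honest incident updates matter so much.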

💡 Pro Tip: Ask your users to subscribe to your StatusPage. They’ll get timely alerts and can follow an issue as it unfolds.

🛠 4. Terraform (and CloudFormation): Infra As You Code It
I started with the AWS Console. Then someone deleted an S3 bucket manually. Never again.

📦 My stack:
Terraform for new infra (version-controlled, modular).
CloudFormation for AWS-native services or legacy templates.
Drift detection to catch untracked changes.

Tools like tfsec, checkov, and pre-commit for validation.
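
For drift detection on the Terraform side, one simple approach is a scheduled job that runs terraform plan with -detailed-exitcode and looks at the exit status: 0 means no changes, 1 means the plan failed, 2 means there is a diff. The sketch below assumes the terraform binary is on PATH and that the job runs against the same code that was last applied, so a non-empty plan really does mean drift. The function names and status labels are my own.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_STATUS = {0: "in-sync", 1: "error", 2: "drift"}


def classify_plan_exit(code: int) -> str:
    """Map a plan exit code to a status; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report its status."""
    result = subprocess.run(
        [
            "terraform", "plan",
            "-detailed-exitcode",  # makes "changes present" exit with 2
            "-input=false",        # never block waiting for a prompt
            "-lock=false",         # a report job should not hold the state lock
            "-no-color",
        ],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Anything other than "in-sync" is worth a ticket or a low-urgency alert: either someone changed infrastructure by hand, or the plan itself is broken.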

🧑‍💻 5. Linux & SSH: Still the Last Resort
Even with great observability, you’ll sometimes need to jump onto the box.

What I keep in my toolbox:
htop, iftop, iotop for system resource inspection.
journalctl -xe, access logs, and tail -f for logs.
SSH bastion hosts + IP whitelisting + key-only login.
🔐 And yes, disable root login. Always.
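
The SSH hardening rules are easy to check mechanically. Here is a rough sketch of an sshd_config audit for the two settings above. It is deliberately simplified: it stops at the first Match block, ignores Include files, and relies on sshd using the first value it finds for each keyword. The function names and the REQUIRED table are mine.

```python
# Settings this post recommends. Missing keywords are reported too,
# because sshd's defaults do not satisfy either requirement.
REQUIRED = {
    "permitrootlogin": "no",
    "passwordauthentication": "no",
}


def effective_settings(config_text: str) -> dict[str, str]:
    """Parse global sshd_config keywords (keywords are case-insensitive)."""
    settings: dict[str, str] = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts[0].lower(), parts[1].strip().lower()
        if keyword == "match":
            break  # conditional overrides are out of scope for this sketch
        settings.setdefault(keyword, value)  # first occurrence wins
    return settings


def audit(config_text: str) -> list[str]:
    """Return one finding per setting that is missing or wrong."""
    settings = effective_settings(config_text)
    findings = []
    for keyword, wanted in REQUIRED.items():
        actual = settings.get(keyword, "<unset>")
        if actual != wanted:
            findings.append(f"{keyword}: expected {wanted}, found {actual}")
    return findings
```

Running audit(open("/etc/ssh/sshd_config").read()) on a host returns an empty list when both settings are in place. On a real box, sshd -T prints the effective configuration and is the more authoritative check.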

🎯 Wrapping It Up
If you’re starting out in SRE (or even DevOps), you’ll figure things out as you go, but I hope this list gives you a few shortcuts.

You don’t need a huge team to be reliable — you just need to be intentional about visibility, ownership, and communication.

💬 What’s in Your Starter Pack?
I’d love to know what tools or lessons made the biggest difference in your SRE journey.
Drop them in the comments — let’s compare toolboxes!

DEV Community
