Samson Tanimawo

Hiring SREs: What I Look For After Interviewing 100+ Candidates

The SRE Hiring Problem

SRE roles are notoriously hard to fill. The intersection of software engineering, systems administration, and operational wisdom is narrow. After interviewing over 100 candidates across three companies, here's what I've learned.

What I Don't Care About

  • Years of experience with a specific tool
  • Certifications (CKA, AWS SA, etc.)
  • Memorized answers about CAP theorem
  • Whether they've used Terraform or Pulumi

Tools change. Fundamentals don't.

What I Actually Look For

1. Systematic Debugging (The #1 Skill)

I give every candidate the same scenario:

"Users report the app is slow. What do you do?"

Bad answer: "I'd check the database."

Good answer: "First, I'd define 'slow' — which endpoints, which users, since when? Then I'd check the golden signals: is latency up? Error rate up? Traffic unusual? Saturation? I'd compare current metrics to baseline and narrow down from there."

The best candidates think in systems, not components.
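
The "compare current metrics to baseline" step in the good answer can be sketched in a few lines of Python. This is only an illustration: the `GoldenSignals` type, the `deviations` helper, the 50% threshold, and all the numbers are assumptions made up for the example, not something from a real monitoring stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenSignals:
    """One snapshot of the four golden signals (hypothetical shape)."""
    latency_p99_ms: float
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    requests_per_sec: float
    saturation: float        # utilization of the busiest resource, 0.0 to 1.0


SIGNAL_NAMES = ("latency_p99_ms", "error_rate", "requests_per_sec", "saturation")


def deviations(current, baseline, threshold=0.5):
    """Return {signal: relative change} for every signal that moved more
    than `threshold` (0.5 means 50%) away from baseline, up or down."""
    flagged = {}
    for name in SIGNAL_NAMES:
        base = getattr(baseline, name)
        now = getattr(current, name)
        if base == 0:
            # Relative change is undefined; flag any nonzero reading.
            if now != 0:
                flagged[name] = float("inf")
            continue
        change = (now - base) / base
        if abs(change) > threshold:
            flagged[name] = round(change, 2)
    return flagged


# Made-up numbers for illustration only.
baseline = GoldenSignals(latency_p99_ms=200.0, error_rate=0.01,
                         requests_per_sec=1000.0, saturation=0.40)
current = GoldenSignals(latency_p99_ms=900.0, error_rate=0.01,
                        requests_per_sec=1050.0, saturation=0.90)

print(deviations(current, baseline))
```

With these sample values, latency and saturation are flagged while error rate and traffic are not. That is the kind of narrowing the good answer describes: the system is slow and something is saturated, but it is not failing outright and it is not a traffic spike.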

2. Comfort With Ambiguity

SRE work is fundamentally ambiguous. The page fires, you don't know why, and people are waiting.

I ask: "Tell me about a time you had to make a decision with incomplete information."

I want to hear about:

  • How they scoped the unknowns
  • What heuristics they used
  • How they communicated uncertainty
  • Whether they revised their approach as they learned more

3. Writing Ability

SREs write post-mortems, runbooks, design docs, and incident updates. If you can't write clearly under pressure, you'll struggle.

I include a writing exercise: "Write a 3-sentence status update for a customer-facing outage."

Good example:

"We detected an issue affecting login for approximately 15% of users, starting at 2:30 PM UTC. Our team has identified the root cause and deployed a fix. We're monitoring to confirm full resolution, which we expect within the next 10 minutes."

4. Automation Mindset

I ask: "What's the last thing you automated, and why?"

The "why" matters more than the "what." I want to hear about:

  • Identifying repetitive toil
  • Calculating the ROI of automation
  • Choosing the right level of automation
  • Knowing when NOT to automate
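
The ROI point above comes down to simple arithmetic. Here is a rough sketch, assuming a simplified model where the only costs are the hours to build the automation and the hours per week to maintain it. The function name and every number are hypothetical.

```python
def automation_break_even_weeks(build_hours, manual_minutes_per_run,
                                runs_per_week, maintenance_hours_per_week=0.0):
    """Weeks until the time saved pays back the time spent building.

    Returns None when upkeep costs at least as much as the manual work,
    meaning the automation never pays for itself under this model.
    """
    manual_hours_per_week = manual_minutes_per_run * runs_per_week / 60.0
    net_saved_per_week = manual_hours_per_week - maintenance_hours_per_week
    if net_saved_per_week <= 0:
        return None
    return build_hours / net_saved_per_week


# A 15-minute task done 20 times a week, 16 hours to automate,
# 1 hour a week of upkeep: pays back in 4 weeks.
print(automation_break_even_weeks(16, 15, 20, maintenance_hours_per_week=1.0))

# A 10-minute task done once a week, 40 hours to automate,
# 30 minutes a week of upkeep: never pays back, so don't automate.
print(automation_break_even_weeks(40, 10, 1, maintenance_hours_per_week=0.5))
```

The second case is the "when NOT to automate" answer in miniature. A real estimate would also weigh things this model ignores, such as error reduction and on-call interruptions, so treat the output as a conversation starter, not a verdict.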

5. Empathy

Surprising for a technical role? SREs are the bridge between development, operations, and business. Empathy means:

  • Understanding why a developer shipped that sketchy PR (deadline pressure)
  • Knowing that a product manager asking "when will it be fixed?" is scared, not annoying
  • Recognizing that the junior engineer who caused the outage feels terrible already

My Interview Structure

Round 1 (45 min): Technical screen
  - Systems design: "Design monitoring for a checkout service"
  - Debugging scenario: "Walk me through investigating latency"

Round 2 (60 min): Incident simulation
  - Live scenario: simulated page with fake dashboards
  - Evaluate: systematic thinking, communication, tool usage

Round 3 (45 min): Culture and collaboration
  - Post-mortem discussion: review a real (anonymized) incident
  - Conflict resolution: "Dev team pushes back on your SLO proposal"

Round 4 (30 min): Writing exercise
  - Write a runbook for a given scenario
  - Write a status update for a given incident

Red Flags

  • Blames people instead of systems
  • Can't explain things simply
  • Never says "I don't know"
  • Only talks about tools, never principles
  • No interest in learning from incidents

If you're building an SRE team and want AI to handle the routine work so your engineers focus on what matters, check out what we're building at Nova AI Ops.


Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com
