Security Monitoring for SRE Teams

#devops #sre #monitoring #security

Security used to be a separate team. Increasingly, SRE teams are being asked to own the monitoring side of it. Here's a practical framework that doesn't turn you into a SOC analyst.

What you already have

Your existing observability stack is half of a security monitoring solution. You already collect logs, metrics, and traces. You already have alerts. The missing piece is usually knowing what to look for.

What to add

1. Authentication anomalies. Alert on impossible-travel logins (a user logs in from Tokyo, then from Paris 10 minutes later), brute-force patterns, and unusual session durations.

2. Privilege escalation patterns. New admin role granted. Service account added to a sensitive group. Kubernetes role binding changed.

3. Unusual data access. A user or service reading 10x the normal volume of records. Downloads from sensitive S3 buckets. Queries against PII tables by accounts that don't normally touch them.

4. Outbound traffic anomalies. A process that has never called an external IP suddenly connecting to one. Large egress volumes during off-hours.

5. Failed auth spikes. Not just at the login endpoint. Watch internal auth, API keys, and mTLS: anywhere authentication happens.
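The impossible-travel check from item 1 can be sketched in a few lines. This is a minimal illustration, not a production detector: the `Login` record, the 1,000 km/h speed ceiling, and the 50 km noise floor are all assumptions made for the example, and the coordinates are presumed to come from whatever geo-IP lookup you already run.

```python
import math
from dataclasses import dataclass
from datetime import datetime

# Assumptions for this sketch: tune both values to your own user base.
MAX_PLAUSIBLE_KMH = 1000.0   # roughly faster than a commercial flight
MIN_DISTANCE_KM = 50.0       # ignore small jumps caused by geo-IP imprecision


@dataclass(frozen=True)
class Login:
    """Hypothetical login event with coordinates from a geo-IP lookup."""
    user: str
    at: datetime
    lat: float
    lon: float


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def is_impossible_travel(prev: Login, curr: Login,
                         max_kmh: float = MAX_PLAUSIBLE_KMH) -> bool:
    """True if the implied speed between two logins exceeds max_kmh."""
    distance = haversine_km(prev.lat, prev.lon, curr.lat, curr.lon)
    if distance < MIN_DISTANCE_KM:
        return False  # same area: treat as geo-IP noise
    hours = abs((curr.at - prev.at).total_seconds()) / 3600.0
    if hours == 0:
        return True   # two distant logins at the same instant
    return distance / hours > max_kmh
```

Compare each login against the same user's previous one. Expect false positives from VPNs and corporate proxies, so treat a hit as a reason to investigate rather than proof of compromise.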

What not to do

Don't try to build a SIEM yourself. You'll burn out chasing alerts. Either use a managed SIEM or stick to a small set of high-signal alerts.

Don't treat security alerts like reliability alerts. Security alerts often need investigation, not an immediate fix. The triage workflow is different.

The handoff

If you're on the SRE side owning security monitoring, you need a clear escalation path to someone who does security full-time. Your job is detection; theirs is response.

The worst pattern: the SRE team gets a suspicious alert, doesn't know what to do with it, and shelves it. Weeks later, a real incident traces back to that alert. Define the handoff up front.

The quick win

If you do nothing else, implement #1 and #5. Login anomalies and failed auth spikes catch roughly 70% of opportunistic attacks; everything else covers the remaining 30%.
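For #5, a failed-auth spike check can be as simple as comparing the current window's failure count against a rolling baseline. The sketch below is one way to do it, under stated assumptions: the class name, the three-sigma threshold, the 20-failure floor, and the 10-window warm-up are all illustrative choices, not recommended values.

```python
import statistics
from collections import deque


class FailedAuthSpikeDetector:
    """Flags a window whose failure count is far above the recent baseline.

    All defaults are assumptions for this sketch; tune them per auth source.
    """

    def __init__(self, history: int = 60, sigmas: float = 3.0,
                 min_failures: int = 20, min_history: int = 10) -> None:
        self.history = deque(maxlen=history)  # recent non-spike window counts
        self.sigmas = sigmas                  # how far above baseline counts
        self.min_failures = min_failures      # floor so quiet systems stay quiet
        self.min_history = min_history        # windows needed before alerting

    def observe(self, failures: int) -> bool:
        """Record one window's failure count; return True if it is a spike."""
        spike = False
        if len(self.history) >= self.min_history:
            mean = statistics.fmean(self.history)
            stdev = statistics.pstdev(self.history)
            threshold = max(self.min_failures, mean + self.sigmas * stdev)
            spike = failures > threshold
        if not spike:
            # Keep spikes out of the baseline so an ongoing attack
            # does not become the new normal.
            self.history.append(failures)
        return spike
```

Run one detector per auth source (login endpoint, internal auth, API keys, mTLS) and feed it a count per fixed window, for example failures per minute. The absolute floor matters: without it, a service that normally sees zero failures would alert on a single typo.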

Security monitoring is reliability monitoring with a different threat model. Build it the same way: start with high-signal basics, prune noise aggressively, and escalate when you're unsure.

Written by Dr. Samson Tanimawo
BSc · MSc · MBA · PhD
Founder & CEO, Nova AI Ops. https://novaaiops.com