How Python Automation Supercharged Our SRE Workflow: Real Use Cases & Lessons Learned

Introduction

As Site Reliability Engineers, we often find ourselves repeating the same tasks: restarting pods, cleaning up disk space, verifying service health and parsing logs. While tools like Ansible, Terraform and Kubernetes CLIs help, nothing beats Python when it comes to custom automation and fast scripting.

In this post, I’ll walk through how we use Python automation in our SRE toolkit to save hours of manual effort, catch issues early and keep systems reliable.

Why Python for DevOps/SRE?

1) Simple syntax and huge community
2) Excellent libraries (requests, paramiko, boto3, subprocess, etc.)
3) Easy to integrate with APIs, cloud services and shell tools
4) Ideal for fast POCs and production-grade workflows

Use Case 1: Auto-Restart Kubernetes Pods Stuck in CrashLoopBackOff

import subprocess
import json

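def get_crashing_pods(namespace="default"):
    # check=True raises CalledProcessError if kubectl fails, instead of parsing empty output
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True
    )
    pods = json.loads(result.stdout)["items"]
    # CrashLoopBackOff is reported under each container's state.waiting.reason.
    # The pod phase usually stays "Running" while a container crash-loops,
    # so we do not filter on phase.
    crashing_pods = [
        pod["metadata"]["name"]
        for pod in pods
        if any(
            c.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
            for c in pod["status"].get("containerStatuses", [])
        )
    ]
    return crashing_pods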

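def restart_pods(pods, namespace="default"):
    # Deleting a pod only "restarts" it if a controller (Deployment, StatefulSet, etc.) recreates it
    for pod in pods:
        result = subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace])
        if result.returncode == 0:
            print(f"Restarted pod: {pod}")
        else:
            print(f"Failed to delete pod: {pod}")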

if __name__ == "__main__":
    pods = get_crashing_pods("app-namespace")
    if pods:
        restart_pods(pods, "app-namespace")
    else:
        print("No crashing pods found.")

This script helped us cut mean time to recovery (MTTR) on recurring pod issues by 80%.

Use Case 2: Daily EC2 Health Check in AWS

import boto3

def check_ec2_health(region='us-west-1'):
    ec2 = boto3.client('ec2', region_name=region)
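    # Paginate so that accounts with many instances are fully covered
    paginator = ec2.get_paginator('describe_instance_status')
    statuses = [
        status
        for page in paginator.paginate(IncludeAllInstances=True)
        for status in page['InstanceStatuses']
    ]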
    for status in statuses:
        instance_id = status['InstanceId']
        system_status = status['SystemStatus']['Status']
        instance_status = status['InstanceStatus']['Status']
        print(f"{instance_id}: System={system_status}, Instance={instance_status}")

if __name__ == "__main__":
    check_ec2_health()

We run this via cron and send a Slack alert if any instance is impaired.
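
The alerting step is not part of the script above. A minimal sketch of how it might look, assuming the response shape returned by describe_instance_status; find_impaired and format_alert are hypothetical helper names for illustration, not part of boto3:

```python
def find_impaired(statuses):
    """Return (instance_id, system_status, instance_status) for impaired instances.

    `statuses` is the 'InstanceStatuses' list from describe_instance_status.
    Only the 'impaired' value triggers an alert here; other values such as
    'initializing' or 'not-applicable' are ignored (an assumption, adjust as needed).
    """
    impaired = []
    for status in statuses:
        system_status = status["SystemStatus"]["Status"]
        instance_status = status["InstanceStatus"]["Status"]
        if "impaired" in (system_status, instance_status):
            impaired.append((status["InstanceId"], system_status, instance_status))
    return impaired


def format_alert(impaired):
    """Build one Slack-friendly message from the list returned by find_impaired."""
    lines = [
        f"{instance_id}: System={system_status}, Instance={instance_status}"
        for instance_id, system_status, instance_status in impaired
    ]
    return "EC2 health check found impaired instances:\n" + "\n".join(lines)
```

Keeping the filtering separate from the boto3 call makes it easy to test with canned data, and the resulting message can be handed to whatever Slack helper you already use.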

Use Case 3: Slack Notification on Service Downtime

import requests

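def send_slack_alert(message, webhook_url):
    payload = {"text": message}
    # Time out rather than hang the caller, and raise if Slack rejects the request
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()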

# Example usage
send_slack_alert("Production Service is Down!", "https://hooks.slack.com/services/...")

Works well when paired with custom monitoring scripts or Jenkins jobs.
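
As an illustration of that pairing, here is a rough sketch of a small monitoring script. It uses only the standard library so it runs without extra dependencies; is_up and check_services are hypothetical names, and the alert argument is any callable that takes a message (for example a thin wrapper around a Slack helper):

```python
import urllib.error
import urllib.request


def is_up(url, timeout=5):
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, DNS failure, timeout, HTTP 4xx/5xx, or malformed URL
        return False


def check_services(services, alert, probe=is_up):
    """Probe each service and call alert(message) for every one that is down.

    services: dict mapping a service name to its health-check URL
    alert:    callable taking a single message string
    probe:    callable taking a URL and returning True when healthy
    """
    down = [name for name, url in services.items() if not probe(url)]
    for name in down:
        alert(f"{name} is down!")
    return down
```

Passing the probe and the alert function in as arguments is a design choice for testability: the same loop can be exercised with fakes before it is pointed at real endpoints.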

Tips for Effective Python Automation

Keep secrets and configuration out of the code: load them from a .env file or config.yaml that is not committed to version control
Modularize your scripts so they can be reused
Add logging and error handling from day one
Use argparse to accept CLI arguments
Test on staging before letting automation touch production
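
Several of these tips can be combined into one small script skeleton. The following is only a sketch: the SLACK_WEBHOOK_URL variable name, the --namespace and --dry-run flags, and the "sre-tool" logger name are assumptions chosen for illustration.

```python
import argparse
import logging
import os

log = logging.getLogger("sre-tool")


def parse_args(argv=None):
    """Parse CLI arguments; argv=None means 'use sys.argv'."""
    parser = argparse.ArgumentParser(description="Example SRE automation script")
    parser.add_argument("--namespace", default="default", help="Kubernetes namespace to act on")
    parser.add_argument("--dry-run", action="store_true", help="Log actions without performing them")
    return parser.parse_args(argv)


def main(argv=None):
    """Return a process exit code: 0 on success, 1 on any failure."""
    args = parse_args(argv)
    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

    # The secret comes from the environment, never from the source file
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if not webhook_url:
        log.error("SLACK_WEBHOOK_URL is not set")
        return 1

    try:
        log.info("Checking namespace %s (dry_run=%s)", args.namespace, args.dry_run)
        # Call your automation here, skipping destructive steps when args.dry_run is set
    except Exception:
        log.exception("Automation failed")
        return 1
    return 0
```

In a real script you would call sys.exit(main()) under the usual if __name__ == "__main__" guard, so that cron or Jenkins sees a non-zero exit code on failure.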

How to Get Started

Learn the basics of subprocess, requests, os, and argparse
Explore APIs you frequently use (Kubernetes, AWS, GitHub, Datadog, etc.)
Start with internal tools like:
1. Log fetcher
2. Disk cleanup
3. Alert summary report generator
4. On-call helper bot
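
For instance, a disk cleanup tool could start with a read-only check that reports which mount points are nearly full, before anything gets deleted. A minimal sketch using only the standard library; the function names and the 90% threshold are assumptions for illustration:

```python
import shutil


def disk_usage_percent(path="/"):
    """Return used space on the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def paths_over_threshold(paths, threshold=90.0, measure=disk_usage_percent):
    """Return {path: percent_used} for every path at or above the threshold."""
    report = {}
    for path in paths:
        percent = measure(path)
        if percent >= threshold:
            report[path] = round(percent, 1)
    return report
```

Calling paths_over_threshold(["/", "/var"]) returns an empty dict when everything is below the threshold, which makes it easy to decide whether a cleanup or an alert is needed at all.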

Conclusion

Python is a DevOps engineer’s best friend, especially for the unique, repetitive and often tedious tasks that come with maintaining infrastructure. By building small but impactful automation, you can transform your SRE workflow from reactive to proactive.

DEV Community
