Introduction
As Site Reliability Engineers, we often find ourselves repeating the same tasks: restarting pods, cleaning up disk space, verifying service health and parsing logs. While tools like Ansible, Terraform and Kubernetes CLIs help, nothing beats Python when it comes to custom automation and fast scripting.
In this post, I’ll walk you through how we use Python in our SRE toolkit to save hours of manual effort, catch issues early, and keep systems reliable.
Why Python for DevOps/SRE?
1) Simple syntax and huge community
2) Excellent libraries (requests, paramiko, boto3, subprocess, etc.)
3) Easy to integrate with APIs, cloud services, shell tools
4) Ideal for fast POCs and production-grade workflows
Use Case 1: Auto-Restart Kubernetes Pods with CrashLoopBackOff
import json
import subprocess


def get_crashing_pods(namespace="default"):
    """Return names of pods with a container waiting in CrashLoopBackOff."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    pods = json.loads(result.stdout)["items"]
    # A crash-looping pod usually still reports phase "Running", so inspect
    # each container's waiting state rather than the pod phase.
    return [
        pod["metadata"]["name"]
        for pod in pods
        if any(
            c.get("state", {}).get("waiting", {}).get("reason") == "CrashLoopBackOff"
            for c in pod["status"].get("containerStatuses", [])
        )
    ]


def restart_pods(pods, namespace="default"):
    """Delete each pod so its controller (e.g. a Deployment) recreates it."""
    for pod in pods:
        subprocess.run(["kubectl", "delete", "pod", pod, "-n", namespace], check=True)
        print(f"Restarted pod: {pod}")


if __name__ == "__main__":
    pods = get_crashing_pods("app-namespace")
    if pods:
        restart_pods(pods, "app-namespace")
    else:
        print("No crashing pods found.")
This script helped us cut down MTTR on recurring pod issues by 80%.
Use Case 2: Daily EC2 Health Check in AWS
import boto3


def check_ec2_health(region='us-west-1'):
    """Print the system and instance status checks for every instance in the region."""
    ec2 = boto3.client('ec2', region_name=region)
    # Paginate so accounts with many instances are not cut off at the first page.
    paginator = ec2.get_paginator('describe_instance_status')
    for page in paginator.paginate(IncludeAllInstances=True):
        for status in page['InstanceStatuses']:
            instance_id = status['InstanceId']
            system_status = status['SystemStatus']['Status']
            instance_status = status['InstanceStatus']['Status']
            print(f"{instance_id}: System={system_status}, Instance={instance_status}")


if __name__ == "__main__":
    check_ec2_health()
We run this via cron and send a Slack alert if any instance is impaired.
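The "alert if impaired" step could look something like the sketch below. It assumes you already have the `InstanceStatuses` list returned by `describe_instance_status`. The helper names `find_impaired` and `build_alert` are made up for this post, not part of boto3.

```python
def find_impaired(statuses):
    """Return (instance_id, system_status, instance_status) for impaired instances."""
    impaired = []
    for status in statuses:
        system = status["SystemStatus"]["Status"]
        instance = status["InstanceStatus"]["Status"]
        # Only "impaired" triggers an alert; values such as "initializing" or
        # "not-applicable" (e.g. stopped instances) are ignored here.
        if "impaired" in (system, instance):
            impaired.append((status["InstanceId"], system, instance))
    return impaired


def build_alert(impaired):
    """Build a Slack-friendly message, or return None when nothing is impaired."""
    if not impaired:
        return None
    lines = [f"{iid}: System={s}, Instance={i}" for iid, s, i in impaired]
    return "Impaired EC2 instances:\n" + "\n".join(lines)
```

If `build_alert` returns a message, the cron job can pass it to whatever notifier you use. If it returns `None`, the run stays quiet.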
Use Case 3: Slack Notification on Service Downtime
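import requests


def send_slack_alert(message, webhook_url):
    """Post a plain-text message to a Slack incoming webhook."""
    payload = {"text": message}
    # Set a timeout so a slow Slack endpoint cannot hang the calling script.
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()


# Example usage
send_slack_alert("Production Service is Down!", "https://hooks.slack.com/services/...")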
This works well when paired with custom monitoring scripts or Jenkins jobs.
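Here is a rough sketch of what such a monitoring script might look like. It is one possible shape, not our exact setup. The names `is_healthy` and `check_and_alert` are hypothetical, and the probe uses the standard library's `urllib` so the example runs without extra dependencies.

```python
import urllib.request


def is_healthy(url, timeout=5):
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        return False


def check_and_alert(url, notify, probe=is_healthy):
    """Probe the URL and call notify(message) only when the probe fails."""
    if probe(url):
        return True
    notify(f"Health check failed for {url}")
    return False
```

The `notify` argument is any callable that takes a message, for example `lambda msg: send_slack_alert(msg, webhook_url)`. Passing the probe in as a parameter also makes the logic easy to test without touching the network.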
Tips for Effective Python Automation
- Use .env or config.yaml for secrets and configs
- Modularize your scripts so they can be reused
- Add logging and error handling from day one
- Use argparse to accept CLI arguments
- Test on staging before letting automation touch production
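To make a few of these tips concrete, here is a minimal skeleton that combines argparse, logging, and configuration read from the environment. Treat it as a starting point under some assumptions: the variable name `SLACK_WEBHOOK_URL` and the flags `--namespace`, `--dry-run`, and `--log-level` are hypothetical, and the sketch assumes your .env values have already been exported into the process environment.

```python
import argparse
import logging
import os

logger = logging.getLogger("sre-tool")


def parse_args(argv=None):
    """Parse CLI arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Example SRE automation entry point")
    parser.add_argument("--namespace", default="default", help="Namespace to operate on")
    parser.add_argument("--dry-run", action="store_true", help="Log actions without performing them")
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    return parser.parse_args(argv)


def load_webhook_url(environ=os.environ):
    """Read the webhook from the environment instead of hardcoding it."""
    url = environ.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if not url:
        raise RuntimeError("SLACK_WEBHOOK_URL is not set")
    return url


def main(argv=None):
    args = parse_args(argv)
    logging.basicConfig(level=args.log_level,
                        format="%(asctime)s %(levelname)s %(message)s")
    try:
        load_webhook_url()  # in a real script, hand the result to your alerting code
    except RuntimeError as exc:
        logger.error("Configuration error: %s", exc)
        return 1
    action = "Would act on" if args.dry_run else "Acting on"
    logger.info("%s namespace %s", action, args.namespace)
    return 0
```

Calling `main()` under an `if __name__ == "__main__":` guard turns this into a CLI. A `--dry-run` flag is a cheap way to rehearse on staging before the script is allowed to change anything.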
How to Get Started
- Learn the basics of subprocess, requests, os, and argparse
- Explore APIs you frequently use (Kubernetes, AWS, GitHub, Datadog, etc.)
- Start with internal tools like:
  - Log fetcher
  - Disk cleanup
  - Alert summary report generator
  - On-call helper bot
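As an illustration of how small a first tool can be, here is a sketch of a disk cleanup helper. It is only an example under assumptions: the function names `find_stale_files` and `cleanup` are hypothetical, "stale" means "not modified in the last N days", and it defaults to a dry run so nothing is deleted by accident.

```python
import time
from pathlib import Path


def find_stale_files(directory, max_age_days=7, now=None):
    """Return files under directory whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(directory).rglob("*")
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def cleanup(directory, max_age_days=7, dry_run=True):
    """Delete stale files (or only list them when dry_run is True)."""
    affected = []
    for path in find_stale_files(directory, max_age_days):
        if not dry_run:
            path.unlink()
        affected.append(path)
    return affected
```

Run it with `dry_run=True` first, review the returned paths, and only then switch to `dry_run=False`.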
Conclusion
Python is a DevOps engineer’s best friend, especially for the unique, repetitive, and often tedious tasks that come with maintaining infrastructure. By building small but impactful automations, you can shift your SRE workflow from reactive to proactive.