Author: Meena Nukala
Tags: #sre #python #prometheus #ai #tutorial
As an owner and engineer, I’ve realized that the biggest bottleneck to scaling isn't the hardware—it's the manual intervention required to keep that hardware healthy.
In my previous post, I discussed why AI is the "force multiplier" for SRE. Today, I want to get hands-on. We are going to build a simple but effective AI Anomaly Detector that scrapes metrics from Prometheus and uses the Isolation Forest algorithm to flag system health issues before they trigger a critical alert.
The Stack
We’ll be using:
- Prometheus: Our time-series data source.
- Python (Pandas & Scikit-learn): For data processing and the ML model.
- Isolation Forest: An unsupervised learning algorithm perfect for detecting anomalies in system metrics (like CPU spikes or memory leaks).
Setting Up the Python Environment
First, let’s get our dependencies ready. You'll need the Prometheus client and our ML libraries.
```bash
pip install prometheus-api-client pandas scikit-learn matplotlib
```

Fetching Data from Prometheus
We need to pull historical data so our model knows what "normal" looks like. In SRE, context is everything.
```python
from prometheus_api_client import PrometheusConnect
from prometheus_api_client.utils import parse_datetime
import pandas as pd

# Connect to your Prometheus instance
prom = PrometheusConnect(url="http://localhost:9090", disable_ssl=True)

# Fetch CPU usage for the last 1 hour.
# custom_query_range handles full PromQL expressions like this one.
query = '100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
data = prom.custom_query_range(
    query=query,
    start_time=parse_datetime("1h"),  # 1 hour ago
    end_time=parse_datetime("now"),
    step="30s",
)

# Convert to a DataFrame
df = pd.DataFrame(data[0]['values'], columns=['timestamp', 'value'])
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
df['value'] = df['value'].astype(float)
```
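One caveat: `data[0]` assumes the query matched at least one series; if Prometheus returns nothing, you'll get an IndexError. It's worth a quick, optional look at what you pulled before training on it:

```python
# Eyeball the pull before training on it
print(f"Fetched {len(df)} samples from {df['timestamp'].min()} to {df['timestamp'].max()}")
print(df['value'].describe())  # count, mean, std, min/max of the CPU values
```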
Training the "Isolation Forest"

Isolation Forest works by "isolating" observations. Anomalies are easier to isolate and therefore have shorter paths in the tree structure.

```python
from sklearn.ensemble import IsolationForest

# Initialize the model.
# 'contamination' is the expected % of anomalies (e.g., 5%)
model = IsolationForest(contamination=0.05, random_state=42)

# Fit the model on our CPU values
df['anomaly'] = model.fit_predict(df[['value']])

# IsolationForest returns -1 for anomalies and 1 for normal data
anomalies = df[df['anomaly'] == -1]
print(f"Detected {len(anomalies)} anomalies in the last hour.")
```
Bridging the Gap: Making it Actionable

In a true SRE environment, detecting the anomaly is only half the battle. You want this script to run as a sidecar or a CronJob that pushes a "Silence" request to Alertmanager, or sends a high-priority Slack notification if the anomaly score exceeds a certain threshold (a minimal sketch of this pattern is in the appendix below).

> Pro-Tip from Meena: Don't just alert on every anomaly. Use a "cooldown" period. AI models can be sensitive; you want to ensure the anomaly persists for at least 3-5 minutes before waking up an engineer.

Why this matters for your Business

By implementing even a basic script like this, you move your team away from static thresholds (which are often wrong) and toward dynamic baselining. This reduces Mean Time to Detection (MTTD) and, more importantly, protects your team from the burnout of false-positive alerts.

What's Next?

This is a "V1" approach. In a production environment, you'd want to wrap this in a Flask API and use OpenTelemetry to trace the anomaly back to a specific microservice.

Would you like me to share the specialized Prometheus recording rules I use to optimize these queries for high-scale environments? Let me know in the comments!
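Appendix: A Cooldown + Slack Sketch

As referenced above, here is a minimal sketch of the cooldown-plus-notification flow. The webhook URL is a placeholder, and the persistence window (8 samples at a 30s step, roughly 4 minutes) is an assumption; tune both for your environment.

```python
import requests  # not in the pip line above; pip install requests

# Placeholder: swap in your own Slack incoming webhook URL
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

# Cooldown: with a 30s step, 8 consecutive flagged samples is ~4 minutes
PERSISTENCE_SAMPLES = 8

def sustained_anomaly(flags, window=PERSISTENCE_SAMPLES):
    """True only if the last `window` samples were all flagged (-1)."""
    recent = flags.tail(window)
    return len(recent) == window and (recent == -1).all()

if sustained_anomaly(df['anomaly']):
    latest = df.iloc[-1]
    requests.post(SLACK_WEBHOOK_URL, json={
        "text": (f":rotating_light: Sustained CPU anomaly: "
                 f"{latest['value']:.1f}% at {latest['timestamp']}")
    })
```

The same trigger could instead POST a silence to Alertmanager's /api/v2/silences endpoint; check the payload shape against your Alertmanager version before relying on it.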