Your team raised a P1. The dev postmortem is done. But where's the testing perspective?
Most incident postmortems answer: what broke and how do we fix it?
They rarely answer: what should have caught this? What test coverage was missing? What signals did we have that we ignored?
That gap is where this tool lives.
Prod Incident Test Analyzer takes raw incident data — logs, alerts, Slack threads, error dumps — and generates a structured postmortem from a tester's perspective, then narrates it as audio using a free neural TTS engine. No API key, no paid services. The analysis runs entirely on your machine with LLaMA 3 via Ollama.
Here's exactly how it works under the hood.
The Problem With Standard Postmortems
A production incident happens. The dev team writes the RCA. It covers infrastructure failures, deployment mistakes, config drift. The testing section, if it exists at all, says something like: "Add more tests."
That's not useful. What tests? Covering what? At which layer?
The tool simulates a senior Test Engineer independently investigating the same incident — one who wasn't in the room when it happened, has no ego invested in the decisions, and is specifically looking for what the testing and observability layer missed.
Architecture at a Glance
Incident Text
      │
      ▼
build_prompt() ──► Ollama (LLaMA 3, local)
      │
      ▼
Structured Markdown Report
┌────────────────────────┐
│ # Incident Summary     │
│ # Investigation        │
│ # Root Cause           │
│ # Prevention Plan      │
│ # Recommended Tests    │
│ # Voice Summary        │
└────────────────────────┘
      │
extract_voice_summary()
      │
      ▼
edge-tts (free, no API key)
      │
      ▼
Audio Playback
Three components: a prompt, a local LLM, and a TTS engine. No API keys, no paid services.
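Stitched together, the whole pipeline is three calls. This glue is a sketch rather than code lifted from the repo — generate_report is a hypothetical wrapper around the Ollama call shown in Step 2:
report = generate_report(incident_text)                        # Step 2: LLaMA 3 via Ollama
summary = get_voice_summary(report, model="llama3", temp=0.3)  # Step 3: extract or fall back
audio_bytes = generate_audio(summary, "en-US-GuyNeural")       # Step 4: edge-tts narration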
Step 1: The Prompt — Where the Testing Perspective Comes From
The most important piece of the whole tool isn't the LLM — it's the prompt. This is what makes the output useful to a tester rather than generic.
def build_prompt(incident_text: str) -> str:
    return f"""
You are a senior Test Engineer leading a production incident postmortem.
Turn the input into a structured engineering conversation.
Rules:
- Two engineers discussing the incident
- Focus on debugging steps, investigation, and root cause analysis
- Mention what tests or signals should have caught this earlier
- Include assumptions, mistakes, and validation steps
- Be technical and realistic
- Keep conversation natural like an engineering meeting
- YOU MUST identify a single most likely root cause
- If multiple causes exist, rank them and pick the primary one
- Include at least one concrete technical failure
- Avoid filler phrases
You MUST end your response with this section using exactly this heading:
# Voice Summary
Write 150-200 words summarising the incident, root cause, and prevention steps
in a natural, conversational tone with no markdown.
Structure:
# Incident Summary
# Investigation Discussion
# Root Cause
# Prevention Plan
# Recommended Tests
# Voice Summary
Incident:
{incident_text}
"""
A few deliberate decisions here:
"Two engineers discussing the incident" — a conversation surfaces assumptions and disagreements that a single-voice summary would flatten. You get "wait, did anyone check the connection pool exhaustion alerts?" rather than "connection pool exhaustion was observed."
"YOU MUST identify a single most likely root cause" — without this constraint, LLMs hedge. They give you five equally-weighted causes and call it analysis. Forcing a single primary cause mirrors how real RCAs work.
The Voice Summary section — this is intentionally separate from the rest of the report. The full report is for reading. The Voice Summary is written in plain prose specifically for audio narration — no markdown, no bullet points, no headings that would sound bizarre when spoken aloud.
The system prompt does the heavy lifting on domain coverage:
"content": ("You are an expert Test Engineer, production incident investigator, "
"performance bottleneck analyst, and distributed systems debugging assistant. "
"You specialize in root cause analysis, scalability failures, observability gaps, "
"resource exhaustion, race conditions, retry storms, caching failures, "
"database bottlenecks, thread starvation, concurrency bugs, "
"Kubernetes incidents, and CI/CD failures. "
"Prioritize evidence-based reasoning, rank possible causes, "
"identify the single most likely root cause, "
"and prefer concrete technical explanations over vague summaries.")
This matters. A generic "you are a helpful assistant" system prompt gets you generic output. Naming the failure modes explicitly — retry storms, thread starvation, database bottlenecks — primes the model to look for these patterns in your incident text.
Step 2: Connecting to Ollama — Local LLM, OpenAI-Compatible API
Ollama exposes an OpenAI-compatible API at localhost:11434. This means you can use the standard OpenAI Python client with zero code changes — just swap the base URL:
from openai import OpenAI

ollama_client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # dummy value, Ollama doesn't use auth
)
Calling the model is then identical to any OpenAI call:
response = ollama_client.chat.completions.create(
    model=model_name,  # "llama3", "mistral", "qwen2.5"
    temperature=temperature,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": prompt}
    ]
)
output = response.choices[0].message.content
The model runs entirely on your machine. No tokens consumed, and your raw logs never leave your network. For incident analysis — where logs often contain sensitive infrastructure details, credentials in error messages, internal service names — this matters. (The one outbound call is the TTS step, which sends only the short voice summary, not your raw logs, to Microsoft's voice service.)
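One practical caveat: if the Ollama daemon isn't running, the client call fails with a connection error. A cheap preflight check against Ollama's /api/tags endpoint — which lists your pulled models — catches that early. This helper is my addition, not part of the tool:
import requests

def ollama_is_up(base_url: str = "http://localhost:11434") -> bool:
    # Any 200 from /api/tags means the Ollama daemon is alive
    try:
        return requests.get(f"{base_url}/api/tags", timeout=2).ok
    except requests.RequestException:
        return False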
Step 3: Extracting the Voice Summary Reliably
LLMs are inconsistent with heading formatting. Sometimes you get # Voice Summary, sometimes ## Voice Summary, sometimes **Voice Summary**. A naive split("# Voice Summary") fails silently on half your runs.
The solution is a regex that handles all common variants:
import re

def extract_voice_summary(text: str) -> str:
    match = re.search(
        r"#{1,3}\s*\*{0,2}Voice Summary\*{0,2}\s*\n+(.*?)(\n#{1,3}\s|\Z)",
        text,
        re.DOTALL | re.IGNORECASE
    )
    if match:
        return match.group(1).strip()
    return ""
This handles #, ##, ###, with or without surrounding **, case-insensitive, and stops at the next heading or end of string.
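A quick way to convince yourself — not part of the tool, just paste it under the function:
samples = [
    "# Voice Summary\nPlain prose here.",
    "## Voice Summary\nPlain prose here.\n# Next Heading\nmore report",
    "### **Voice Summary**\nPlain prose here.",
]
for sample in samples:
    # Every heading variant should yield the same clean summary text
    assert extract_voice_summary(sample) == "Plain prose here."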
But extraction can still fail — especially with smaller models or unusual output. Rather than crashing silently, there's a fallback that makes a second LLM call asking specifically for a 150-200 word plain-prose summary:
def get_voice_summary(output: str, model: str, temp: float) -> str:
    summary = extract_voice_summary(output)
    if summary:
        return summary
    st.info("Voice Summary section not found — running fallback summarisation...")
    fallback = ollama_client.chat.completions.create(
        model=model,
        temperature=temp,
        messages=[
            {
                "role": "user",
                "content": (
                    "Summarise this incident report in 150-200 words. "
                    "Use a conversational tone as if explaining to a colleague. "
                    "No markdown, no bullet points, plain prose only:\n\n"
                    + output
                )
            }
        ]
    )
    return fallback.choices[0].message.content.strip()
Two-stage extraction: try the structured section first, fall back to a targeted summarisation call. You always get audio.
Step 4: Text-to-Speech With edge-tts — No API Key, Neural Quality
Most TTS integrations in hobby projects hit a paid API. edge-tts is completely free — no API key, no account — and uses Microsoft's neural voices, the same ones behind the Edge browser's Read Aloud feature.
edge-tts is async, and its simplest output path writes to a file, so the generation runs through a temp file rather than streaming chunks:
import asyncio
import os
import tempfile

import edge_tts

def generate_audio(text: str, voice_name: str) -> bytes:
    async def _run() -> bytes:
        communicate = edge_tts.Communicate(text, voice=voice_name)
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            tmp_path = f.name
        try:
            await communicate.save(tmp_path)
            with open(tmp_path, "rb") as f:
                return f.read()
        finally:
            os.unlink(tmp_path)
    return asyncio.run(_run())
Save to a temp .mp3, read back the bytes, clean up. The bytes go straight into Streamlit's st.audio(). The temp file keeps things simple — communicate.save() writes to a file path, so an intermediate file is the most direct route to bytes.
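That said, if you ever want to drop the temp file, edge-tts does expose a chunk-streaming API. A sketch using its documented stream() generator, which yields both audio chunks and word-boundary metadata:
async def stream_audio_bytes(text: str, voice_name: str) -> bytes:
    communicate = edge_tts.Communicate(text, voice=voice_name)
    chunks = []
    async for chunk in communicate.stream():
        if chunk["type"] == "audio":  # skip word-boundary metadata events
            chunks.append(chunk["data"])
    return b"".join(chunks)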
Available voices include US, British, and Australian accents, both male and female. For postmortem review on a commute, the British voices tend to sound most natural for technical content.
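Wiring the result into the UI is a single Streamlit call. en-GB-RyanNeural is one of edge-tts's British neural voices:
import streamlit as st

audio_bytes = generate_audio(voice_summary, "en-GB-RyanNeural")
st.audio(audio_bytes, format="audio/mp3")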
What the Output Looks Like
Paste in a real incident — even a rough description:
Payments API went down at 02:13 UTC after a Kubernetes deployment.
HTTP 500s, DB CPU spike, Kafka consumer lag grew, HikariPool timeouts,
Redis cache miss rate jumped from 5% to 68%.
Async reconciliation workers retried failed jobs aggressively.
K6 load tests only covered steady-state traffic.
Playwright checkout tests were failing intermittently but marked as flaky.
You get back five structured sections plus audio:
- Incident Summary — what happened, timeline, blast radius
- Investigation Discussion — two engineers walking through observations, questioning assumptions
- Root Cause — a single primary cause with evidence, secondary causes ranked
- Prevention Plan — concrete steps, not generic advice
- Recommended Tests — specific test cases: load tests with burst scenarios, connection pool exhaustion alerts, retry backoff tests, synthetic checkout monitoring with failure thresholds
The Recommended Tests section is the one that justifies building this. It's the section no standard postmortem includes.
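To make "specific" concrete: a retry backoff recommendation from that section might translate into a test like this. Hypothetical code — the tool generates descriptions, not test files:
def backoff_delays(base: float = 0.5, factor: float = 2.0, retries: int = 5) -> list[float]:
    # Hypothetical helper mirroring a worker's retry policy
    return [base * factor ** attempt for attempt in range(retries)]

def test_retries_back_off_exponentially():
    delays = backoff_delays()
    # Each retry must wait longer than the last — no retry storms
    assert all(later > earlier for earlier, later in zip(delays, delays[1:]))

def test_total_retry_budget_is_bounded():
    # Aggregate wait stays under a ceiling so failed jobs can't pile up
    assert sum(backoff_delays()) < 30.0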
Running It Locally
# Prerequisites: Python 3.9+, Ollama installed and running
git clone https://github.com/hbkandhi12/prod-incident-test-analyzer.git
cd prod-incident-test-analyzer
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
ollama pull llama3
streamlit run prod_incident_to_podcast_agent.py
Open localhost:8501. Paste your incident. Hit Generate.
The sidebar lets you swap models — llama3:70b gives noticeably better root cause analysis if you have the VRAM; mistral is faster for quick iterations.
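The sidebar wiring is standard Streamlit — roughly this shape, with labels and defaults as my guesses rather than copied from the repo:
model_name = st.sidebar.selectbox(
    "Model", ["llama3", "llama3:70b", "mistral", "qwen2.5"]
)
temperature = st.sidebar.slider("Temperature", 0.0, 1.0, 0.3)
voice_name = st.sidebar.selectbox(
    "Voice", ["en-US-GuyNeural", "en-GB-RyanNeural", "en-AU-NatashaNeural"]
)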
What This Catches That Standard Postmortems Miss
The value isn't in the audio — that's just a delivery mechanism. The value is in forcing a structured testing perspective on every incident:
- Load tests that only covered steady state, not burst traffic
- Synthetic tests marked as flaky that were actually signalling real failures
- Missing alerts for resource exhaustion that would have given earlier warning
- Aggressive retry mechanisms that nobody stress-tested under failure conditions
These patterns repeat across incidents at different companies. The root cause changes. The testing gaps are always the same.
Source
Full code on GitHub: github.com/hbkandhi12/prod-incident-test-analyzer
If you're a tester who's ever read an incident report and thought "we had signals for this" — this is for you.