Every developer has been there. It's 2am, your CI pipeline is red, and you're staring at a wall of error logs trying to figure out which of the 47 things that could be wrong is actually wrong.
That pain is what made me build FailSense — an AI debugging assistant that ingests error logs and returns ranked, actionable fixes using Llama 3.3. Here's an honest breakdown of what I built, the mistakes I made, and what I'd do differently.
~40% reduction in debugging time · ~99% uptime · 2 services, one pipeline
The problem with debugging + LLMs
The naive approach is obvious: dump the error into ChatGPT and hope for the best. It kind of works. But it breaks down quickly when:
- Your error spans multiple files and stack frames
- The root cause is buried 3 levels deep in a dependency
- You need ranked fixes, not a monologue
- You want this in your own pipeline, not a chat UI

So I decided to build something purpose-built for error log analysis — with structured output, confidence-ranked fixes, and a real deployment.
Architecture: keep it boring
The stack is deliberately simple. Two services. One job each.
Next.js (Frontend) → FastAPI (Backend) → Llama 3.3 via Groq
The Next.js frontend handles log input and renders ranked fixes. The FastAPI backend owns all the prompt logic, output parsing, and error handling. Llama 3.3 runs on Groq for low-latency inference — this matters more than you'd think when users are already frustrated.
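For concreteness, here's roughly what that single backend hop looks like. This is a sketch, not the actual FailSense code: the endpoint name, request shape, and model id are my assumptions, and it reuses the `system_prompt` and `parse_fixes` shown later in this post.

```python
# Sketch of the FastAPI -> Groq hop (names and model id are assumptions).
import os

from fastapi import FastAPI
from groq import Groq
from pydantic import BaseModel

app = FastAPI()
client = Groq(api_key=os.environ["GROQ_API_KEY"])

class AnalyzeRequest(BaseModel):
    log: str  # raw error log pasted by the user

@app.post("/analyze")
def analyze(req: AnalyzeRequest):
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # Groq's Llama 3.3 model id
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": req.log},
        ],
        temperature=0.2,  # low temperature keeps the JSON output more stable
    )
    raw = completion.choices[0].message.content
    return {"fixes": parse_fixes(raw)}  # defensive parser, defined below
```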
Lesson learned: Don't add a third service just because you can. Every hop between services is a new failure point, a new auth layer, and a new thing to monitor at 2am.
The prompt that actually works
This took the most iteration. The first version just said "here's an error, fix it." The output was verbose, unstructured, and hard to parse programmatically. Here's the version that works:
```python
system_prompt = """
You are a senior software engineer debugging production errors.
Given an error log, return ONLY a JSON array of fixes, ranked by likelihood.
Each fix must have:
- rank (int): 1 = most likely cause
- cause (str): one sentence root cause
- fix (str): exact steps to resolve
- confidence (float): 0.0 to 1.0
Return nothing else. No preamble. No markdown. Raw JSON only.
"""
```
Three things made this work:
1. Explicit output format — telling the model to return raw JSON (not markdown-wrapped JSON) saved me a ton of parsing headaches
2. Role framing — "senior software engineer" shifts the model toward precise, opinionated output over safe hedging
3. Ranked by likelihood — forcing a ranking means the most actionable fix is always first, which is what a tired developer actually wants
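For a typical missing-dependency error, the shape the prompt pushes for looks like this (illustrative values I've made up, not real model output):

```json
[
  {
    "rank": 1,
    "cause": "The 'requests' package is imported but not installed in the CI environment.",
    "fix": "Add 'requests' to requirements.txt and reinstall with 'pip install -r requirements.txt'.",
    "confidence": 0.85
  },
  {
    "rank": 2,
    "cause": "The CI cache is restoring a stale virtualenv.",
    "fix": "Bust the dependency cache key and rebuild.",
    "confidence": 0.4
  }
]
```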
Parsing LLM output without going insane
LLMs are not deterministic JSON machines. Sometimes Llama 3.3 returns perfect JSON. Sometimes it adds a sentence before it. Sometimes the confidence is a string instead of a float. Here's the defensive parsing layer I built:
```python
import json
import re

def parse_fixes(raw: str) -> list:
    # Strip markdown fences if present
    clean = re.sub(r"```(?:json)?", "", raw).strip()
    try:
        fixes = json.loads(clean)
    except json.JSONDecodeError:
        # Try to extract the JSON array from within a larger string
        match = re.search(r"\[.*\]", clean, re.DOTALL)
        fixes = json.loads(match.group()) if match else []
    # Normalize confidence to float (models sometimes return it as a string)
    for f in fixes:
        f["confidence"] = float(f.get("confidence", 0.5))
    # Most likely cause first
    return sorted(fixes, key=lambda x: x["rank"])
```
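To see why the fallback path matters, here's a quick smoke test with the kind of messy output these models actually produce. The input string is fabricated for illustration:

```python
# Model output with a chatty preamble and a string-typed confidence —
# two of the most common failure modes.
messy = (
    "Sure! Here are the most likely fixes:\n"
    '[{"rank": 2, "cause": "Stale lockfile", "fix": "Delete node_modules and run npm ci", "confidence": "0.3"},\n'
    ' {"rank": 1, "cause": "DATABASE_URL is unset in CI", "fix": "Add it to the pipeline secrets", "confidence": 0.8}]'
)

fixes = parse_fixes(messy)
print(fixes[0]["cause"])             # DATABASE_URL is unset in CI (re-sorted by rank)
print(type(fixes[0]["confidence"]))  # <class 'float'>
```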
Hot take: If you're not writing a fallback parser for LLM output, you're writing a bug. Models drift, prompts drift, and what works today breaks next month.
Deployment: boring is good
Next.js on Vercel. FastAPI on Railway. Both wired up with GitHub Actions for CI/CD. Every push to main triggers a deploy. The whole thing costs under $5/month to run.
The ~99% uptime wasn't magic — it was just not doing anything clever. No custom load balancers, no exotic infra. Just two managed services that restart themselves when they crash.
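The CI/CD wiring is equally unexciting. Vercel redeploys the frontend on push by itself; for the backend, a workflow along these lines does the job. This is a sketch, not the exact FailSense workflow — the service name and secret name are assumptions:

```yaml
# .github/workflows/deploy-backend.yml (sketch; service and secret names assumed)
name: Deploy backend
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm i -g @railway/cli
      - run: railway up --service failsense-api
        env:
          RAILWAY_TOKEN: ${{ secrets.RAILWAY_TOKEN }}
```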
What I'd do differently
- Add evals from day one. I had no systematic way to know if a prompt change made things better or worse. I was eyeballing it. Don't eyeball it.
- Stream the response. Waiting 3-4 seconds for the full JSON response feels slow. Streaming partial results — even just a loading state with intermediate tokens — makes it feel snappy (see the sketch after this list).
- Log everything. What errors are users pasting in? What fixes are they ignoring? This data is gold for improving the prompt and I threw it away by not logging it.
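Groq's API is OpenAI-compatible, so streaming is a one-flag change on the request. This sketch reuses the `client` and `system_prompt` from the earlier backend example, with the same assumed model id; it buffers the full text so `parse_fixes` stays unchanged:

```python
# Sketch: stream tokens as they arrive instead of waiting for the full response.
def analyze_streaming(log: str) -> list:
    stream = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": log},
        ],
        stream=True,  # yields chunks as tokens arrive
    )
    buffer = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            buffer.append(delta)  # forward these to the UI for a live feel
    return parse_fixes("".join(buffer))
```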
The takeaway
Building production AI tools is less about the model and more about the scaffolding around it. The prompt, the output parser, the fallback handling, the latency — that's where the real engineering happens.
FailSense isn't magic. It's a well-prompted LLM with a defensive parser and a boring deployment. That's enough to cut debugging time by ~40% and actually ship something people use.
Check out the full source on GitHub · Built with Next.js, FastAPI, Groq, and Llama 3.3