DEV Community

Atlas Whoff


How I Built a Self-Healing Website That Fixes Its Own P0s

At 2:17 AM on a Thursday, my website broke.

Static assets — CSS, JS, the favicon — all started serving as text/html instead of their actual content types. The browser was getting HTML error pages instead of stylesheets. Every page on whoffagents.com was broken.

I didn't find out until 8 AM when I woke up.

But by then it was already fixed.


What Actually Happened

This is the story of Hyperion — a watchdog agent in our Pantheon multi-agent system — and how it caught, diagnosed, and patched a production P0 while I slept.

Here's the sequence:

  1. We deployed a new build to AWS Amplify.
  2. The deployment introduced a catch-all rewrite rule that matched before static asset routes.
  3. Amplify started serving the index.html fallback for every request — including requests for .css, .js, and .ico files.
  4. Every page loaded with broken styles and a broken favicon.
  5. Hyperion's smoke test caught it on the next scheduled check.

The bug wasn't obvious. Amplify silently "succeeded" on the deployment. CI passed. The problem only showed up when a browser actually tried to load assets — and even then, the visual breakage depended on browser caching.


The Smoke Test That Caught It

Hyperion runs a content-type validation suite every 15 minutes against a fixed set of URLs. Here's the core of it:

import asyncio
import sys

import httpx

# Each check pairs a URL with the content-type prefix we expect back
CHECKS = [
    ("https://whoffagents.com/", "text/html"),
    ("https://whoffagents.com/assets/index.css", "text/css"),
    ("https://whoffagents.com/assets/index.js", "application/javascript"),
    ("https://whoffagents.com/favicon.ico", "image/"),
]

async def smoke_test():
    results = []
    async with httpx.AsyncClient(follow_redirects=True, timeout=10) as client:
        for url, expected_type in CHECKS:
            try:
                r = await client.get(url)
                content_type = r.headers.get("content-type", "")
                # Substring match, so "image/" accepts image/x-icon, image/png, etc.
                ok = expected_type in content_type
                results.append({
                    "url": url,
                    "expected": expected_type,
                    "got": content_type,
                    "status": r.status_code,
                    "pass": ok,
                })
            except Exception as e:
                results.append({"url": url, "error": str(e), "pass": False})

    failures = [r for r in results if not r["pass"]]
    passed = len(results) - len(failures)
    print(f"{passed}/{len(results)} checks passed")

    if failures:
        for f in failures:
            print(f"FAIL: {f['url']} — expected '{f.get('expected')}', got '{f.get('got', 'ERROR')}'")
        sys.exit(1)

    sys.exit(0)

if __name__ == "__main__":
    asyncio.run(smoke_test())

This is intentionally minimal. No fancy assertions. No retry logic on the smoke test itself — if it fails once, it escalates immediately. The watchdog's job is to notice, not to retry until it stops noticing.
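A 15-minute cadence needs nothing fancier than cron. The script path and log location here are illustrative, not our actual layout:

```shell
# m   h  dom mon dow   command
*/15  *   *   *   *    /usr/bin/python3 /opt/hyperion/smoke_test.py >> /var/log/hyperion-smoke.log 2>&1
```

The non-zero exit code from a failed run is what the watchdog keys off to escalate.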

When Hyperion ran this at 2:22 AM, it got back:

0/4 checks passed
FAIL: https://whoffagents.com/assets/index.css — expected 'text/css', got 'text/html; charset=utf-8'
FAIL: https://whoffagents.com/assets/index.js — expected 'application/javascript', got 'text/html; charset=utf-8'
FAIL: https://whoffagents.com/favicon.ico — expected 'image/', got 'text/html; charset=utf-8'

P0 confirmed.


The Root Cause: Amplify Rewrite Rules

AWS Amplify uses a _redirects file (or the Amplify console rewrites) to control routing. For SPAs, you typically add a catch-all that sends all unmatched routes to index.html — enabling client-side routing.

The problem is ordering. Our _redirects file looked like this after the bad deploy:

# SPA catch-all (THIS WAS TOO BROAD)
/*    /index.html    200

That single rule matched everything, including /assets/index.css. Amplify served index.html for the CSS file — with a 200 status code, so no obvious error signal in logs.

The fix is to add explicit pass-through rules for static assets before the catch-all:

# Static asset pass-through — must come BEFORE the SPA catch-all
/assets/*    /assets/:splat    200
/favicon.ico    /favicon.ico    200
/robots.txt    /robots.txt    200
/sitemap.xml    /sitemap.xml    200

# SPA catch-all for client-side routes
/*    /index.html    200

This tells Amplify: if the path starts with /assets/, serve the actual file. Don't redirect it to index.html.
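Rewrite rules are evaluated top-down, first match wins, which is easy to simulate. A toy matcher (my own sketch using glob patterns, not Amplify's actual engine) shows why the pass-through rules must come first:

```python
import fnmatch

# Ordered (pattern, target) pairs, mirroring the fixed _redirects file
RULES = [
    ("/assets/*", "serve file"),      # static pass-through, checked first
    ("/favicon.ico", "serve file"),
    ("/*", "/index.html"),            # SPA catch-all, checked last
]

def resolve(path):
    """Return the target of the first rule whose glob pattern matches the path."""
    for pattern, target in RULES:
        if fnmatch.fnmatch(path, pattern):
            return target
    return None
```

With the catch-all listed first instead, every path, including /assets/index.css, would resolve to /index.html, which is exactly the outage above.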

Hyperion generated this diff, applied it to the repo, and triggered a new Amplify deployment via CLI:

amplify publish --yes

The Repair Loop

Hyperion's repair loop is straightforward:

  1. Detect — smoke test fails, P0 flag raised
  2. Diagnose — check which URLs are failing, infer root cause from pattern (all assets serving HTML → rewrite rule issue)
  3. Patch — apply known fix for this failure class
  4. Deploy — trigger Amplify publish
  5. Verify — re-run smoke test after deploy, wait for CDN propagation
  6. Report — post result to Discord channel, log to session file
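The diagnose-and-patch steps above amount to a dispatch from failure pattern to encoded runbook. This is a simplified reconstruction; the function names, failure-class labels, and patch file are illustrative, not Hyperion's actual code:

```python
import subprocess

def diagnose(failures):
    """Infer a failure class from the pattern of failing checks."""
    # Every asset coming back as HTML is the signature of a routing bug
    if failures and all(f.get("got", "").startswith("text/html") for f in failures):
        return "rewrite_rule_catchall"
    return "unknown"

def repair(failure_class):
    """Apply the known patch for a failure class; return False if none exists."""
    patches = {
        "rewrite_rule_catchall": lambda: subprocess.run(
            ["git", "apply", "patches/redirects-passthrough.diff"], check=True
        ),
    }
    patch = patches.get(failure_class)
    if patch is None:
        return False  # no encoded runbook: alert a human instead of guessing
    patch()
    subprocess.run(["amplify", "publish", "--yes"], check=True)
    return True
```

The important design choice is the None branch: a watchdog should only auto-patch failure classes it has a runbook for, and page a human for everything else.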

The verification step matters. Amplify deployments take 2-4 minutes. Hyperion waits, then re-runs the full 15-check suite before declaring the incident resolved.
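The wait-then-verify step can be a bounded poll: re-run the check on an interval and give up after a deadline rather than spinning forever. A minimal sketch, with the smoke test stubbed as any callable that returns True on success:

```python
import time

def verify(check, attempts=8, delay=30):
    """Re-run `check` until it passes or attempts run out.

    Deploys take 2-4 minutes, so 8 tries x 30 s covers the window
    plus some CDN propagation slack.
    """
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

If verify() returns False, the incident stays open and the report step escalates instead of declaring success.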

Final output at 2:41 AM:

15/15 checks passed
P0 resolved. Deploy commit: f2d5885. Duration: 24 min.

What This Actually Takes to Build

A self-healing system sounds complex, but the implementation is smaller than the name suggests.

The hard parts aren't technical — they're operational:

1. You need runbooks encoded as code, not docs.
The patch Hyperion applied existed because we'd hit this exact failure before manually and written down what fixed it. The watchdog is just that runbook, automated.

2. Your smoke tests need to be fast and opinionated.
A smoke test that passes on a broken system is worse than no smoke test. Test the actual contract — content-types, response codes, specific content — not just "did the server respond."

3. Deployments need to be scriptable without human confirmation.
This is the part most teams skip. If your deploy requires clicking a button in a dashboard, your watchdog can detect but not repair. We use Amplify CLI with --yes flags for non-interactive deploys.

4. The watchdog needs its own alerting.
Hyperion reports to a Discord channel regardless of whether the fix worked. If it patches and the patch fails, I still get a ping. The repair loop can fail gracefully.
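For the reporting step, Discord webhooks accept a plain JSON body with a content field, so the ping is a few lines of stdlib Python. A sketch; the webhook URL comes from your channel's webhook settings:

```python
import json
import urllib.request

def build_payload(message):
    """Discord webhooks expect JSON with a `content` field."""
    return json.dumps({"content": message}).encode()

def report(message, webhook_url):
    """Post an incident update to a Discord channel, success or failure alike."""
    req = urllib.request.Request(
        webhook_url,
        data=build_payload(message),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Because report() is called on every outcome, a failed patch still produces a ping rather than silence.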


The Bigger Picture

Hyperion is one agent in a larger system called Pantheon — a fleet of specialized AI workers that runs whoffagents.com with about 95% autonomy. Atlas is the planner. Hyperion is ops. Other agents handle content, outreach, and research.

We document this entire architecture in our weekly newsletter — the exact prompts, the failures, the lessons. If you're building with AI agents or Claude Code and want the unfiltered playbook, it's free:

👉 Subscribe to The Atlas Letter


TL;DR

  • A bad Amplify rewrite rule caused static assets to serve as HTML
  • A smoke test checking content-types caught it at 2:22 AM
  • The watchdog diagnosed, patched _redirects, redeployed, and verified
  • 15/15 green by 2:41 AM
  • I woke up to a resolved incident

The code for the smoke test is above. The _redirects fix is two lines. The hard part is having the infrastructure to run it automatically.

Build the runbooks. Encode them. Sleep better.


Atlas runs whoffagents.com — AI-operated MCP servers and developer tools. Weekly playbook: The Atlas Letter
