Building FailureDNA: an agent memory that knows when not to trust itself

#qwen #alibabacloud #ai #hackathon

Submitted for the Global AI Hackathon Series with Qwen Cloud — Track 1: MemoryAgent.

The bug that started it

Give an incident-response agent a vector database of past incidents and it will do something that looks smart and is quietly dangerous: when a new outage resembles an old one, it retrieves the most similar past incident and reuses whatever action it finds there.

The problem is that similarity is not applicability. The most similar past incident might be the one where restart_service failed. Or where increase_connection_pool worked — but only because the database driver was psycopg2 and the topology was single-region, both of which have since changed. A cosine score of 1.0 tells you the symptoms rhyme. It tells you nothing about whether the fix still holds.

In incident response, that gap is expensive. Repeating a remediation that already failed burns the most costly minutes of an outage; reusing a fix whose preconditions have drifted can make the incident worse. So I built FailureDNA: a persistent memory that accumulates real outcomes and reasons about whether past experience should be used, inspected, or avoided — before the model is allowed to act on it.

The core idea: an LLM over a deterministic gate

The architecture has one opinionated rule: the model selects; it never decides what's valid.

Incident
  -> embed symptoms (Qwen text-embedding-v3)
  -> pgvector semantic search on Alibaba Cloud RDS
  -> fuse semantic + keyword scores
  -> DETERMINISTIC validity gate  <- the important part
  -> Qwen picks one allowlisted action (validated JSON)
  -> execute -> persist the real outcome back to memory

The validity gate is deliberately boring and deterministic:

Prior outcome	Environment match	Disposition
failure	any	avoid
success	full match	use
success	driver / topology / config hash changed	inspect

No model decides whether a memory is trustworthy. And critically, avoid is enforced, not advised: an action with a symptom-matching prior failure is removed from the candidate list before the model sees it. The agent cannot repeat a known failure even if it wanted to — which matters, because a live LLM handed the same memories as a "hint" will sometimes ignore the hint. The creative part (which action, given the evidence) goes to Qwen; the part that must never hallucinate (is this memory valid? did this action succeed?) stays in deterministic code.
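The gate is small enough to sketch in full. This is a minimal illustration, not the project's actual code: the `Fingerprint` and `Memory` types, their field names, and the `gate` helper are assumptions made for the example, and the memories passed in are assumed to be the symptom-matching ones that retrieval already returned.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(str, Enum):
    USE = "use"
    INSPECT = "inspect"
    AVOID = "avoid"


@dataclass(frozen=True)
class Fingerprint:
    # Hypothetical environment fingerprint; fields mirror the table above.
    driver: str
    topology: str
    config_hash: str


@dataclass(frozen=True)
class Memory:
    # Hypothetical memory record: one past action and its real outcome.
    action: str
    succeeded: bool
    env: Fingerprint


def disposition(memory: Memory, current: Fingerprint) -> Disposition:
    """Apply the table: failure -> avoid, success + full match -> use, else inspect."""
    if not memory.succeeded:
        return Disposition.AVOID
    return Disposition.USE if memory.env == current else Disposition.INSPECT


def gate(
    candidates: list[str], memories: list[Memory], current: Fingerprint
) -> tuple[list[str], dict[str, str]]:
    """Return (allowed actions, warnings). Avoided actions are removed outright."""
    avoided = {m.action for m in memories if disposition(m, current) is Disposition.AVOID}
    warnings = {
        m.action: "prior success under a different environment"
        for m in memories
        if disposition(m, current) is Disposition.INSPECT and m.action not in avoided
    }
    # Enforcement, not advice: the model only ever receives `allowed`.
    allowed = [a for a in candidates if a not in avoided]
    return allowed, warnings
```

Only `allowed` and `warnings` would be passed to the model, so a previously failed action is absent from the prompt rather than merely discouraged.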

Why Qwen Cloud fit

I used Qwen Cloud through its OpenAI-compatible DashScope endpoint, which made two things nearly free:

Embeddings for recall. text-embedding-v3 turns incident symptoms into 1024-d vectors for pgvector search. Hybrid retrieval fuses semantic similarity (weight 0.70) with keyword overlap (0.30), so it catches both paraphrased and exact-token symptoms.
Validated action selection. A Qwen chat model receives the symptoms, the environment fingerprint, the (gated) candidate actions, and the memories, and returns a single JSON decision at temperature=0 with thinking disabled — fast, deterministic-ish output that I validate before anything executes.
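To make the fusion concrete, here is a minimal sketch of the scoring step. The 0.70 and 0.30 weights come from the description above; everything else is an assumption for illustration. In particular, the keyword metric shown (Jaccard overlap of lowercase tokens) and the `rank` helper are hypothetical, and the semantic similarity is assumed to arrive already computed by the pgvector query.

```python
import re

SEMANTIC_WEIGHT = 0.70  # weights quoted in the post
KEYWORD_WEIGHT = 0.30


def keyword_overlap(query: str, document: str) -> float:
    """Jaccard overlap of lowercase tokens (an assumed keyword metric)."""
    q = set(re.findall(r"[a-z0-9_]+", query.lower()))
    d = set(re.findall(r"[a-z0-9_]+", document.lower()))
    return len(q & d) / len(q | d) if q | d else 0.0


def fused_score(semantic: float, keyword: float) -> float:
    """Weighted sum of the two signals, both expected in [0, 1]."""
    return SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * keyword


def rank(query: str, hits: list[tuple[str, float, str]]) -> list[tuple[str, float]]:
    """Rank (memory_id, semantic_similarity, symptom_text) rows by fused score."""
    scored = [
        (memory_id, fused_score(semantic, keyword_overlap(query, text)))
        for memory_id, semantic, text in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With these weights, an exact-token match can lift a memory above one that is slightly closer in embedding space, which is the behavior the hybrid setup is after.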

Because it's OpenAI-compatible, the whole client is a thin, well-typed wrapper with explicit timeouts and one retry — no exotic SDK to fight.
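The "validate before anything executes" step can be sketched without any SDK at all. This is an assumed shape, not the project's schema: the `action` field name, the `InvalidDecision` exception, and the `parse_decision` helper are hypothetical. The point is that the model's reply is treated as untrusted input and checked against the gated allowlist.

```python
import json


class InvalidDecision(ValueError):
    """Raised when the model's reply must not be executed."""


def parse_decision(raw: str, allowed_actions: list[str]) -> str:
    """Return the chosen action, or raise if the reply is not a valid choice."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidDecision(f"reply is not JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise InvalidDecision("reply must be a JSON object")
    action = payload.get("action")  # "action" is an assumed field name
    if not isinstance(action, str) or action not in allowed_actions:
        raise InvalidDecision(f"action {action!r} is not in the gated allowlist")
    return action
```

Because `allowed_actions` is the list the gate produced, a reply naming a removed action fails validation even if the model somehow produces it.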

Proving it helps (without fooling myself)

A demo where the new thing wins is easy to fake, so FailureDNA ships a benchmark designed to be hard on itself: three modes (no_memory, naive, failuredna) on identical seeded history, hidden simulator outcomes, evaluator-only safe/unsafe labels, isolated memory per mode, and static shortcut baselines (always_inspect_downstream, …) to check it isn't just rediscovering that one action is usually right.
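For readers who want to reproduce this kind of comparison, here is one way the per-mode tallies might be computed. It is a sketch under assumptions: the `Episode` record and its field names are hypothetical, and the real benchmark may track more than these three metrics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    # Hypothetical per-incident record produced by the evaluator.
    mode: str                    # e.g. "no_memory", "naive", "failuredna"
    first_action_resolved: bool
    first_action_unsafe: bool    # evaluator-only label, never shown to the agent
    actions_taken: int


def summarize(episodes: list[Episode]) -> dict[str, dict[str, float]]:
    """Group episodes by mode and compute the headline rates for each."""
    by_mode: dict[str, list[Episode]] = {}
    for episode in episodes:
        by_mode.setdefault(episode.mode, []).append(episode)
    return {
        mode: {
            "first_action_resolution": sum(e.first_action_resolved for e in eps) / len(eps),
            "unsafe_first_action_rate": sum(e.first_action_unsafe for e in eps) / len(eps),
            "mean_actions": sum(e.actions_taken for e in eps) / len(eps),
        }
        for mode, eps in by_mode.items()
    }
```

Running the static shortcut baselines through the same `summarize` call, as extra modes, is what lets them be compared on equal terms.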

FailureDNA lifts first-action resolution well above the naive agent, cuts unsafe first actions sharply, resolves in fewer actions, and repeats zero historical failures and zero stale successes. The honest caveat I left in the open: in this small scenario set, a static always-inspect policy also scores well — which is exactly why the shortcut audit exists. FailureDNA's value isn't a magic action; it's that it never repeats a known failure and never blindly reuses a stale fix as environments change — the behavior that generalizes beyond a fixed benchmark.

Running it on Alibaba Cloud (and the sharp edges)

The backend runs as a custom container on Alibaba Cloud Function Compute (FastAPI, port 9000), memory persists in ApsaraDB RDS for PostgreSQL + pgvector (HNSW), and the image lives in ACR Personal Edition. A few things bit, and are worth writing down for the next person:

ACR tiers matter. Enterprise Economy can't back an FC custom-container function (no image processing). Personal Edition (free) works.
The default *.fcapp.run domain forces downloads. It adds Content-Disposition: attachment to HTML and JSON responses, so a browser downloads your dashboard or health JSON instead of rendering it. I serve the UI from GitHub Pages and added a small API status page that fetches and displays /health/ready.
CORS is the gateway's job, not the app's. The FC gateway already injects Access-Control-* headers (it even reflects the request origin). The app's only CORS responsibility is to return 200 on OPTIONS: if it answers the preflight with 405, the browser blocks every request that needs a preflight, which includes JSON POSTs. Adding app-level CORS headers on top just produces duplicate header values, which the browser also rejects.
Watch for stale images. Pushing a tag to ACR doesn't update a running function; you must repoint it, and FC can serve a cached image. I added a /health/cors-debug endpoint and a build marker so "is my new code actually live?" is a one-glance check.
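The preflight rule from the CORS note can be expressed as a tiny ASGI middleware. This is a sketch of the idea rather than the code running in the project: the `PreflightOK` class name is made up for the example, and it assumes the gateway in front of the app adds the Access-Control-* headers.

```python
class PreflightOK:
    """ASGI middleware: answer OPTIONS with a bare 200, pass everything else through."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http" and scope["method"] == "OPTIONS":
            # No Access-Control-* headers here: the gateway is assumed to add them,
            # and emitting them twice is what causes the duplicate-header rejection.
            await send({"type": "http.response.start", "status": 200, "headers": []})
            await send({"type": "http.response.body", "body": b""})
            return
        await self.app(scope, receive, send)
```

Since FastAPI apps are ASGI apps, a class of this shape can be attached with `app.add_middleware(PreflightOK)`.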

What I'd do next

The most interesting open problem is the inspect disposition. Today the deterministic gate hard-removes avoid actions but leaves inspect ones available with a warning. The right next step is a real verification tool behind inspect — so a stale success is checked against the current environment, not just flagged. That keeps the thesis intact: let the model be creative where creativity helps, and let deterministic code (and real checks) hold the line where being wrong is expensive.

Try it: Live dashboard · API status · GitHub (MIT)

Built with Qwen Cloud + Alibaba Cloud Function Compute and RDS pgvector.