Avnish Raj

Posted on Jun 5

How I Made Deployment Reviews Remember Incidents With Hindsight

#ai #agents #nextjs #devops

Most deployment review agents have the same weakness: they can inspect the current change, but they forget what the organization already learned the hard way.

That is the problem I wanted SCAR to solve. SCAR is a deployment-risk agent that reviews proposed production changes against persistent incident memory. If a team previously caused an outage, diagnosed it, rolled it back, and wrote down the real root cause, SCAR should not treat the next similar deployment as a clean slate.

The core idea is simple: production incidents should become active guardrails, not passive documents.

What SCAR Does

SCAR reviews a proposed deployment and decides whether to approve or block it. The example I built the demo around is a retry-policy change.

At first glance, the change looks reasonable:

- backoff: fixed(1)
+ concurrency: 800

The tests pass. The diff is small. A normal review agent might approve it.

But the same shape of change can be dangerous if the organization has already seen fixed retries synchronize thousands of workers, exhaust a shared dependency, and turn a local provider issue into a wider outage.

That is where Hindsight persistent memory comes in. SCAR stores production incident corrections in Hindsight, recalls relevant lessons during future deployment reviews, and only changes the decision when the recalled evidence is causally related to the proposed change.

The user-facing flow is intentionally strict:

1. Cold start: SCAR has no relevant memory and approves the deployment.
2. Correction: an engineer provides the real root cause and successful resolution.
3. Retention: Hindsight stores that lesson as persistent memory.
4. Future review: SCAR recalls the lesson and blocks the same failure mechanism.

The important part is that the fourth step must be earned. Memory existing somewhere is not enough. The remembered incident has to match the risk mechanism.

The Memory Loop

SCAR creates an isolated Hindsight memory bank for each run. That keeps the proof clean: no old state, no hidden global memory, no accidental leakage between sessions.
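One way to get that isolation is to derive a fresh bank ID for every run. This is a minimal sketch, not the SCAR source: the `newRunBankId` helper and the `scar-run` prefix are hypothetical names chosen for illustration.

```typescript
// Hypothetical helper (not from the SCAR repository): builds a bank ID that
// is unique per review run, so no two sessions share a memory bank.
function newRunBankId(prefix: string = "scar-run"): string {
  // Base-36 timestamp keeps the ID short and roughly sortable by start time.
  const timestamp = Date.now().toString(36);
  // Random base-36 suffix makes a collision unlikely when two runs start in
  // the same millisecond. padEnd covers the rare short random string.
  const suffix = Math.random().toString(36).slice(2, 10).padEnd(8, "0");
  return `${prefix}-${timestamp}-${suffix}`;
}

// Each call yields a new ID that could be passed as bankId when creating the bank.
const bankIdForThisRun = newRunBankId();
```

Generating the ID inside the run, rather than reading it from shared configuration, is what keeps one session from picking up another session's lessons.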

The Hindsight client setup lives in src/lib/hindsight.ts:

await client.createBank(bankId, {
  name: "SCAR Production Safety Memory",
  reflectMission:
    "Protect production by learning causal lessons from incidents, failed fixes, rollbacks, and engineer corrections.",
  retainMission:
    "Extract root causes, incorrect diagnoses, successful resolutions, environmental blind spots, causal chains, and future deployment guardrails.",
  enableObservations: true,
});

That bank is not just a log bucket. The missions describe what the agent should retain and generalize: root causes, incorrect diagnoses, environmental blind spots, causal chains, and future guardrails.

When an engineer provides a correction, SCAR retains a structured incident record:

const content = [
  `Incident ${incident.id}: ${incident.title}`,
  `Service: ${incident.service}`,
  `Incorrect diagnosis: ${incident.incorrectDiagnosis}`,
  `Root cause: ${incident.rootCause}`,
  `Resolution: ${incident.resolution}`,
  `Causal chain: ${incident.causalChain.join(" -> ")}`,
  `Future guardrails: ${incident.futureGuardrails.join("; ")}`,
].join("\n");

Then it asks Hindsight to reflect on the reusable lesson:

await client.reflect(
  bankId,
  `Generalize the reusable production safety lesson from ${incident.id} so it can prevent a similar failure in a different service.`,
);

That reflection step is what makes the interaction feel different from storing a row in a database. The agent is not just remembering that "payment-api broke." It is learning that synchronized retry waves against constrained dependencies can exhaust shared resources.
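The structured record that SCAR retains before reflecting reads several fields off an `incident` object. Here is a minimal sketch of a shape that would satisfy that template. The field names come from the snippet itself, but the `IncidentCorrection` interface, the `formatIncident` helper, and the example values (including the ID and the incorrect diagnosis) are assumptions made up for illustration.

```typescript
// Assumed shape of an engineer's correction; field names mirror the template
// used when building the retained content.
interface IncidentCorrection {
  id: string;
  title: string;
  service: string;
  incorrectDiagnosis: string;
  rootCause: string;
  resolution: string;
  causalChain: string[];
  futureGuardrails: string[];
}

// Hypothetical helper that renders the correction as one line per field.
function formatIncident(incident: IncidentCorrection): string {
  return [
    `Incident ${incident.id}: ${incident.title}`,
    `Service: ${incident.service}`,
    `Incorrect diagnosis: ${incident.incorrectDiagnosis}`,
    `Root cause: ${incident.rootCause}`,
    `Resolution: ${incident.resolution}`,
    `Causal chain: ${incident.causalChain.join(" -> ")}`,
    `Future guardrails: ${incident.futureGuardrails.join("; ")}`,
  ].join("\n");
}

// Illustrative values only, loosely paraphrasing the retry incident.
const exampleIncident: IncidentCorrection = {
  id: "INC-EXAMPLE",
  title: "Retry wave exhausted the shared database connection pool",
  service: "payment-api",
  incorrectDiagnosis: "Provider outage alone caused the failures",
  rootCause: "Fixed-interval retries synchronized after HTTP 429 responses",
  resolution: "Rolled back and restored randomized exponential backoff with jitter",
  causalChain: ["provider returns 429", "synchronized retries", "connection pool exhausted"],
  futureGuardrails: ["cap concurrent retries", "run a rate-limited canary test"],
};
```

Keeping the incorrect diagnosis next to the root cause is deliberate: the contrast between the two is the part of the correction that a later review can reuse.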

Why I Added an Evidence Gate

In the first version of an agent like this, it is easy to fake learning by accident.

If the agent blocks every future deployment after memory exists, it looks like learning in a shallow demo, but it is not useful. If it relies only on an LLM to explain the decision, it can cite memory loosely or hallucinate a connection. If it matches keywords, it misses paraphrases and overreacts to unrelated text.

So I added a Hindsight Evidence Gate.

The evidence gate checks for causal overlap between the proposed deployment and the recalled memory. In src/lib/risk-engine.ts, SCAR groups signals into families:

const causalSignalFamilies = {
  "retry synchronization": ["backoff", "fixed interval", "retry", "thundering herd"],
  "resource exhaustion": ["capacity", "connection", "concurrency", "exhaust", "pool"],
  "dependency throttling": ["429", "rate limit", "throttle"],
  "safe retry guardrail": ["canary", "exponential", "jitter", "randomized"],
};

This makes the agent more robust than simple keyword matching. A remembered incident can say "thundering herd saturated database sessions" while the deployment says "fixed one-second retries and higher concurrency." The exact words differ, but the mechanism is the same.
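To make the overlap check concrete, here is a minimal sketch of how two texts could be compared by signal family. It assumes case-insensitive substring matching, which may not be what `src/lib/risk-engine.ts` actually does, and the `matchedFamilies` and `sharedFamilies` function names are hypothetical.

```typescript
const causalSignalFamilies: Record<string, string[]> = {
  "retry synchronization": ["backoff", "fixed interval", "retry", "thundering herd"],
  "resource exhaustion": ["capacity", "connection", "concurrency", "exhaust", "pool"],
  "dependency throttling": ["429", "rate limit", "throttle"],
  "safe retry guardrail": ["canary", "exponential", "jitter", "randomized"],
};

// Returns the families with at least one term appearing in the text,
// sorted alphabetically. Assumption: plain lowercase substring matching.
function matchedFamilies(text: string): string[] {
  const haystack = text.toLowerCase();
  return Object.keys(causalSignalFamilies)
    .filter((family) => causalSignalFamilies[family].some((term) => haystack.includes(term)))
    .sort();
}

// Families present in BOTH the proposed deployment and the recalled memory.
function sharedFamilies(deployment: string, memory: string): string[] {
  const inMemory = new Set(matchedFamilies(memory));
  return matchedFamilies(deployment).filter((family) => inMemory.has(family));
}

const deploymentText =
  "Replace randomized backoff with fixed(1) retries and raise concurrency to 800";
const retryLesson =
  "Fixed-interval retries synchronized thousands of payment requests after the provider " +
  "returned HTTP 429 responses. The retry wave exhausted the shared database connection pool. " +
  "The team restored randomized exponential backoff with jitter and added a canary test.";
const designLesson =
  "A missing design token caused low contrast in a settings button. " +
  "The team restored the design token and added a visual regression test.";

const overlapWithRetryLesson = sharedFamilies(deploymentText, retryLesson);
const overlapWithDesignLesson = sharedFamilies(deploymentText, designLesson);
```

With these inputs, the retry lesson shares three families with the deployment and the design-token lesson shares none. Plain substring matching has clear limits: "retries" does not contain "retry", and "rate-limited" does not contain "rate limit". A real implementation would likely need stemming, normalization, or a broader term list.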

The verdict logic is deliberately conservative:

if (!hasRelevantMemory) {
  return {
    verdict: "APPROVE",
    decisionBasis: memories.length ? "insufficient-evidence" : "empty-memory",
    citedMemoryIds: [],
  };
}

return {
  verdict: "BLOCK",
  decisionBasis: "causal-evidence",
  citedMemoryIds: relevantMemories.map((memory) => memory.id),
};

That gives SCAR a useful failure mode. If it recalls memory but cannot connect it causally, it does not block. It explains that the evidence is insufficient.
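The three decision bases can be put together in one small function. This is a sketch under stated assumptions rather than the actual risk engine: the `RecalledMemory` shape, the `decide` name, and the `minSharedFamilies` threshold are hypothetical, and relevance is assumed to mean "shares at least that many signal families with the deployment."

```typescript
type Verdict = "APPROVE" | "BLOCK";
type DecisionBasis = "empty-memory" | "insufficient-evidence" | "causal-evidence";

// Assumed shape: each recalled memory arrives with the signal families it
// shares with the proposed deployment already computed.
interface RecalledMemory {
  id: string;
  sharedFamilies: string[];
}

interface Decision {
  verdict: Verdict;
  decisionBasis: DecisionBasis;
  citedMemoryIds: string[];
}

function decide(memories: RecalledMemory[], minSharedFamilies: number = 1): Decision {
  // Only memories with enough causal overlap count as evidence.
  const relevantMemories = memories.filter(
    (memory) => memory.sharedFamilies.length >= minSharedFamilies,
  );

  if (relevantMemories.length === 0) {
    return {
      verdict: "APPROVE",
      // Distinguish "nothing recalled" from "recalled, but unrelated".
      decisionBasis: memories.length ? "insufficient-evidence" : "empty-memory",
      citedMemoryIds: [],
    };
  }

  return {
    verdict: "BLOCK",
    decisionBasis: "causal-evidence",
    // Cited IDs are drawn only from recalled memories, never invented.
    citedMemoryIds: relevantMemories.map((memory) => memory.id),
  };
}
```

Because the cited IDs are derived from the filtered list, a block can never cite a memory that was not recalled or that failed the overlap check.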

The Before And After Behavior

The cleanest example is the proof lab flow.

In the first run, SCAR sees a deployment that replaces randomized backoff with fixed retries and increases concurrency. The memory bank is empty. The result is:

Before memory: APPROVE
0 memories cited
Decision basis: empty-memory

Then the engineer provides the incident lesson:

Fixed-interval retries synchronized thousands of payment requests after the provider returned HTTP 429 responses.
The retry wave exhausted the shared database connection pool.
The team rolled back the change, restored randomized exponential backoff with jitter, capped concurrent retries, and added a rate-limited canary test.

SCAR stores that correction in Hindsight. When the exact same deployment is analyzed again, the result changes:

After memory: BLOCK
Matched signals: resource exhaustion, retry synchronization, safe retry guardrail
Recalled memory IDs passed the citation gate

This is the behavior I wanted: not "the model feels worried now," but "the organization has already seen this failure mode, and here is the evidence."

The Negative Control Matters

I added a negative-control path because it is the easiest way to catch fake learning.

In that path, SCAR stores an unrelated incident:

A missing design token caused low contrast in a settings button.
The team restored the design token and added a visual regression test.

Then it reviews the same risky retry deployment again.

The correct answer is not BLOCK. The correct answer is:

After memory: APPROVE
BLOCK rejected: recalled memory is unrelated
0 recalled memory IDs passed the citation gate

That is the point. Persistent memory should make an agent more careful, not more paranoid. If unrelated memory changes the decision, the agent is not learning. It is just accumulating noise.
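A negative control is easy to express as a table of cases that any reviewer has to pass. The sketch below is illustrative only: `runControls`, the `Reviewer` type, and both stand-in reviewers are hypothetical and much simpler than SCAR's real review path.

```typescript
type Verdict = "APPROVE" | "BLOCK";
type Reviewer = (deployment: string, memories: string[]) => Verdict;

interface ControlCase {
  name: string;
  memories: string[];
  expected: Verdict;
}

// Runs every case and returns the names of the cases the reviewer got wrong.
function runControls(review: Reviewer, deployment: string, cases: ControlCase[]): string[] {
  return cases
    .filter((controlCase) => review(deployment, controlCase.memories) !== controlCase.expected)
    .map((controlCase) => controlCase.name);
}

// Stand-in for "fake learning": blocks as soon as any memory exists.
const paranoidReviewer: Reviewer = (_deployment, memories) =>
  memories.length > 0 ? "BLOCK" : "APPROVE";

// Stand-in for a gated reviewer: blocks only when a memory mentions retries.
const gatedReviewer: Reviewer = (_deployment, memories) =>
  memories.some((memory) => /retry|backoff/i.test(memory)) ? "BLOCK" : "APPROVE";

const riskyDeployment = "Replace randomized backoff with fixed retries and raise concurrency";
const controlCases: ControlCase[] = [
  { name: "cold start", memories: [], expected: "APPROVE" },
  {
    name: "related incident",
    memories: ["The retry wave exhausted the shared database connection pool."],
    expected: "BLOCK",
  },
  {
    name: "negative control",
    memories: ["A missing design token caused low contrast in a settings button."],
    expected: "APPROVE",
  },
];

const paranoidFailures = runControls(paranoidReviewer, riskyDeployment, controlCases);
const gatedFailures = runControls(gatedReviewer, riskyDeployment, controlCases);
```

The reviewer that blocks whenever memory exists passes the first two cases and fails only the negative control, which is why that case is the one that exposes fake learning.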

What I Learned

The main lesson is that agent memory needs a decision contract.

It is not enough to retain information. The system has to define when memory is allowed to affect behavior. For SCAR, the rule is strict: a block requires causally relevant recalled memory and verified cited IDs.

Second, memory is more useful when it stores corrections, not just events. The raw outage is important, but the engineer's corrected root cause and resolution are what make the future review better.

Third, negative controls should be part of the product experience. They make the claim falsifiable. In SCAR, a user can store an unrelated lesson and see that the agent does not falsely block.

Finally, agent memory feels most valuable when it changes a future decision at the exact moment the user can verify why. Hindsight gives SCAR the retain, recall, and reflect loop. The evidence gate makes that loop safe enough to trust.

The result is a deployment review agent that does not start every review from zero. Past incidents become reusable production knowledge, and future risky changes are judged against what the system has already survived.

The implementation is available on GitHub in the SCAR production immune system repository. The memory layer is built on Hindsight, whose source lives in the Hindsight agent memory GitHub repository, and the Hindsight documentation is the best place to understand the retain, recall, and reflect workflow.