Subashini

My cache fix was fine—until it wasn’t


The day I realized we were shipping amnesia

My cache fix passed tests, code review, and staging. It still caused production pain two weeks later, for a reason someone had already explained in a meeting six months earlier. That was the day I stopped thinking our biggest engineering problem was bad code and started treating it as missing memory.

Most teams don’t fail because they can’t write software. We fail because we can’t consistently remember why we made decisions in the first place. In our stack, the “why” lived in Slack scrollback, incident calls, and the heads of whoever happened to be awake during a postmortem. The result was predictable: repeat incidents, repeat arguments, and repeat “how did we miss this again?” moments.

So I built a system that sits in the developer workflow, continuously recalls prior decisions, and intervenes while code is being written. It doesn’t wait for CI. It doesn’t wait for another outage. It tells you, in real time, when you are about to repeat known failure patterns.

What this system does and how it hangs together

At a high level, the architecture follows a simple loop: capture, structure, retrieve, intervene. The project’s README calls out that exact sequence—capture events, extract decisions, store memory plus embeddings, retrieve context, and warn during development (project memory capture workflow).

In production form, we wired this loop around a persistent memory layer backed by Hindsight’s agent memory runtime, indexed through retrieval pipelines and fed by operational sensors (meeting transcripts, incident notes, commit metadata, and runtime signals). I started from Hindsight docs for persistent memory design patterns and then adapted the retrieval strategy around our own engineering artifacts.

If you’ve ever had an LLM assistant forget a decision made ten prompts ago, you already understand why this matters. I knew I needed a durable memory substrate, and this practical overview of agent memory patterns aligned with what we were seeing: context loss is not an edge case; it’s the default.

In our system, the runtime has four concrete subsystems:

  1. Signal ingestion from engineering events.
  2. Decision extraction into memory atoms (rule, context, source, severity, temporal scope).
  3. Retrieval and ranking by current code intent plus repository context.
  4. Inline intervention in the editor stream, with suggested corrective actions.
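The four subsystems above can be sketched as one loop. Everything here is hypothetical scaffolding (the `ingest`, `retrieve`, and `intervene` names and the atom fields are mine for illustration, not the repo's actual API); it only shows how a memory atom flows from event to warning:

```js
// In-memory store of extracted "memory atoms" (illustrative shape only).
const atoms = [];

// 2. Decision extraction: turn an engineering event into a memory atom.
function extractDecision(event) {
  if (!event.decision) return null;
  return {
    rule: event.decision,
    context: event.context,
    source: event.source,
    severity: event.severity || "WARNING",
    temporalScope: { from: event.date, until: null },
  };
}

// 1. Signal ingestion from meetings, incidents, commits.
function ingest(event) {
  const atom = extractDecision(event);
  if (atom) atoms.push(atom);
}

// 3. Retrieval and ranking by current code intent (naive substring match here).
function retrieve(codeIntent) {
  return atoms.filter(a => codeIntent.includes(a.context));
}

// 4. Inline intervention: format matches as editor-stream warnings.
function intervene(codeIntent) {
  return retrieve(codeIntent).map(
    a => `[${a.severity}] ${a.rule} (source: ${a.source})`
  );
}

ingest({
  decision: "Do not use pickle for inter-service serialization",
  context: "serialization",
  source: "all-hands transcript",
  severity: "CRITICAL",
  date: "2025-02-20",
});
const warnings = intervene("adding serialization between services");
```

A real implementation replaces the substring match with embedding retrieval, but the loop shape is the same.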

The repository includes a compact front-end simulation of that behavior in code.html, where a knowledge graph of prior incidents is checked continuously against active code input (knowledge graph and rule schema).

The core technical story: hindsight is only useful if it is actionable in 800ms

The hardest part was not collecting memory. The hardest part was turning memory into interruption-quality feedback without becoming noise.

I learned this quickly: if a warning arrives after a developer mentally committed to an approach, they ignore it. If a warning is vague, they ignore it. If a warning has no provenance, they especially ignore it.

So the core design constraint became this: every intervention must be fast, specific, and attributable.

1) Fast: analysis cadence tied to typing behavior

In the prototype, the editor waits briefly before analyzing (setTimeout with an 800ms delay) to avoid alert spam on every keystroke. That tiny debounce is a bigger design choice than it looks; it defines the user experience of “assistant” vs “annoyance.”

```js
codeIn.addEventListener('input', () => {
  updateGutter();
  clearTimeout(timeout);
  scanStatus.className = 'scan-status scanning';
  timeout = setTimeout(async () => {
    const vs = check(codeIn.value);
    // Only surface net-new violations (lastLibs holds the previously seen set,
    // updated inside analyse) so a pause-and-resume doesn't re-announce everything.
    const newVs = vs.filter(v => !lastLibs.includes(v.lib));
    if (JSON.stringify(vs.map(v => v.lib)) !== JSON.stringify(lastLibs)) {
      await analyse(vs, newVs);
    }
  }, 800); // debounce: wait 800ms after the last keystroke before analyzing
});
```

In production we kept the same behavioral principle: defer just long enough to observe intent, not so long that the intervention feels delayed.

2) Specific: rules carry concrete failure semantics

A memory record is not just “don’t use X.” It carries incident semantics: what failed, under which context, and who documented it. In the repository model, each entry captures rule, reason, context, source, severity, and explanatory reasoning (incident-backed rule object structure).

```js
{
  lib: "pickle",
  rule: "Do not use pickle for inter-service serialization",
  reason: "Caused silent data corruption between Python 3.10 and 3.12 services.",
  flaggedBy: "Sana",
  flaggedDate: "2025-02-20",
  context: "Microservices communication layer",
  source: "Engineering all-hands transcript",
  severity: "CRITICAL"
}
```

That metadata is what moves a warning from opinion to engineering artifact.

3) Attributable: every warning points back to tribal evidence

The right pane in code.html intentionally renders source provenance next to the violation—who flagged it, when, where, and why. This turns “AI says no” into “here is the historical failure you are about to replay” (violation rendering with source provenance).

```js
h += `<div class="fr"><span class="fk">flagged</span><span class="fv p">${v.flaggedBy} · ${v.flaggedDate}</span></div>
<div class="fr"><span class="fk">source</span><span class="fv">${v.source}</span></div>`;
```

When we shipped this pattern, rebuttals changed from “the tool is wrong” to “is this source still valid?” That is a healthier debate.

Why hindsight-based learning changed the system

The memory layer is not static policy. It evolves by ingesting outcomes.

After each significant decision point—incident mitigation, architecture review, unusual rollback—we attach a hindsight pass that asks: What should future us have been warned about earlier? That extracted lesson becomes a retrieval target for future coding sessions.

This is where Hindsight memory workflows on GitHub were useful: the model of writing durable, queryable memory from operational events maps directly to engineering reality. Postmortems are already written; the missing step is transforming them into low-latency intervention rules.

The key design decision was to store lessons as atomic norms instead of long narrative blobs. The UI literally reflects this with “3 atomic norms loaded,” and that language matters (atomic norms loaded status in reasoning stream). A norm is small enough to retrieve quickly, composable enough to rank, and interpretable enough for a human to validate.

Long-form context is still preserved, but retrieval first returns the compact norm and then drills into linked evidence if needed.
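A minimal sketch of that norm-first retrieval, with hypothetical `norms` and `evidence` stores: the compact norm is matched first, and the long-form record is only fetched through its link when a reader drills in:

```js
// Compact, retrievable norms; long-form evidence linked by id (illustrative shapes).
const norms = [
  { id: "n1", rule: "No pickle across services", evidenceId: "e1" },
];
const evidence = {
  e1: {
    narrative: "Full postmortem narrative for the serialization incident",
    source: "incident #142",
  },
};

// First hop: retrieve only the small norm.
function retrieveNorm(query) {
  return norms.find(n => n.rule.toLowerCase().includes(query));
}

// Second hop: drill into linked evidence on demand.
function drillDown(norm) {
  return evidence[norm.evidenceId];
}

const norm = retrieveNorm("pickle");
const detail = drillDown(norm);
```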

Example behavior in real developer flow

Here’s a representative interaction path, mirrored by the repo’s flow and then expanded with the production feedback loop.

  1. I start modifying backend HTTP code.
  2. The analyzer detects a library import matching a prior incident pattern.
  3. The system posts a violation event with severity and source provenance.
  4. It suggests a minimally invasive alternative.
  5. If I accept the change, that acceptance signal is logged as reinforcement for ranking.
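Step 5 can be sketched as a tiny reinforcement update. The multiplicative nudge and the `rankingWeights` store are assumptions for illustration, not the production ranking logic:

```js
// Per-norm ranking weights, nudged by developer feedback (illustrative only).
const rankingWeights = { n1: 1.0 };

function logAcceptance(normId, accepted) {
  // Assumed update rule: small multiplicative nudge up on accept, down on dismiss.
  const factor = accepted ? 1.1 : 0.9;
  rankingWeights[normId] = (rankingWeights[normId] || 1.0) * factor;
  return rankingWeights[normId];
}

logAcceptance("n1", true);   // developer accepted the suggested fix
logAcceptance("n2", false);  // suggestion dismissed
```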

In the simulation, if code includes FastHTTP, the system marks it as a critical violation and references the prior security incident context. If code is clean, it explicitly reports “No violations — safe to commit” (clear-state and violation-state transitions).

That clear-state behavior is not cosmetic. It gives developers closure and confidence. Warnings without clear dismiss/resolve semantics create chronic distrust.

Code-backed design details that mattered more than I expected

Diff-aware messaging

The analyzer tracks previously seen violations and only announces net-new ones. This reduces repetitive noise when a developer pauses and resumes typing (new violation detection with lastLibs).

Severity-aware UI contract

Critical and warning paths are visually and behaviorally distinct (CRITICAL vs WARNING classes). In production we tied this to escalation policies: critical blocks commit unless overridden with justification; warning remains advisory (severity-specific rendering paths).
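A sketch of that escalation contract as a commit gate (the `gateCommit` name and the override shape are invented for illustration): critical violations block unless a justification is supplied; warnings pass through as advisories.

```js
// CRITICAL blocks the commit unless overridden with a justification;
// WARNING stays advisory. Shapes here are hypothetical.
function gateCommit(violations, override) {
  const critical = violations.filter(v => v.severity === "CRITICAL");
  if (critical.length > 0 && !override?.justification) {
    return { allowed: false, reason: "critical violations require justification" };
  }
  return {
    allowed: true,
    advisories: violations.filter(v => v.severity === "WARNING"),
  };
}

const r1 = gateCommit([{ lib: "pickle", severity: "CRITICAL" }]);
const r2 = gateCommit(
  [{ lib: "pickle", severity: "CRITICAL" }],
  { justification: "one-off migration script" }
);
```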

Temporal framing

The initial reasoning stream includes explicit temporal context (“April 2026”). That sounds minor, but it prevents one of the easiest failure modes in memory systems: stale advice presented as timeless truth (temporal context in initialization message).

What I learned building this

1) Tribal knowledge is not “soft”; it is an unindexed dependency

We treat package dependencies as first-class, but we treat institutional memory as folklore. That is backwards. The memory gap is often the dominant failure driver once a team exceeds a handful of services and people.

2) Retrieval quality beats model cleverness

I got better outcomes by improving memory atom shape, source tagging, and ranking heuristics than by changing model prompts. Most “AI quality” complaints were actually retrieval and provenance issues.

3) Engineers will accept guardrails if they can inspect the evidence

Opaque policy engines get bypassed. Systems that show source, incident date, and context get adopted.

4) Latency is part of correctness

A perfect warning delivered too late is operationally wrong. Intervention systems should be designed with interaction timing as a core correctness criterion.

5) Hindsight loops need ownership

If no one curates extracted lessons, memory degrades into stale cargo cult. We assigned explicit ownership for validating and expiring norms based on changed dependencies, patched CVEs, and architecture shifts.

The uncomfortable conclusion

I started this project thinking we needed a better coding assistant. We actually needed a system that treats past engineering pain as a runtime input.

Without persistent memory, teams repeatedly pay tuition on the same mistakes. With a hindsight-driven memory layer, we can force our software process to remember what our org keeps forgetting.

And yes, my cache fix was fine—until it replayed a decision we had already learned not to make.

The whole point of this system is to make that sentence impossible to say twice.

Lessons Worth Keeping
• Sensor placement is the whole game. The retrieval, the LLM evaluation, the interrupt delivery — none of it matters if the memory corpus is thin or poorly structured. We spent three times as long on audio transcription quality, meeting segmentation, and decision extraction as we did on the RAG pipeline. Garbage in is amplified by a working retrieval system, not filtered by it.
• Importance scoring is not optional. Without it, retrieval scales with corpus volume rather than with operational consequence. Every query returns a mixture of load-bearing constraints and irrelevant historical noise, weighted identically. Engineers stop trusting the system within a week. The incident-calibrated importance classifier was the most painful component to build — six weeks of labelling postmortems — and the most important.
• Recency weighting needs domain-specific calibration. Default retrieval logic assumes new information supersedes old. Organizational knowledge does not work that way. Some constraints age badly and should be deprioritised. Others age not at all — an incident-derived constraint from three years ago may be more operationally critical than anything written last month. Getting the recency weight right required studying which episodes were most consequential in the corpus and working backward.
• Autocorrection scope must be uncomfortably narrow. The correct instinct is to start with a tiny correctable domain and expand it only after engineers have developed trust through repeated correct warnings. We corrected only numeric thresholds and timeout values for the first six months. One wrong auto-correction at the wrong moment costs you weeks of credibility. Start narrow and earn expansion.
• The cold start problem is real and requires active mitigation. A memory that learns from your processes needs to have observed enough of your processes before it is useful. Deploying sensors and waiting is too slow for adoption. We ran a retrospective corpus sweep against three years of Confluence, two years of incident postmortems, and the full git history before sensors went live. That gave the system a functional memory from day one and dramatically shortened the window between deployment and usefulness.
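One way to combine importance and recency, purely illustrative: importance dominates, recency decays slowly, and constraints tagged as architectural constants never decay. The exponential decay constant is an assumption, not our calibrated value:

```js
// Assumed scoring: importance weighted by a slow recency decay,
// with "Architectural Constant" norms exempt from decay entirely.
function score(norm, nowMs) {
  const ageDays = (nowMs - norm.createdMs) / 86_400_000;
  const recency = norm.architecturalConstant
    ? 1.0
    : Math.exp(-ageDays / 365); // ~1-year decay scale, an illustrative choice
  return norm.importance * recency;
}

const now = Date.parse("2026-04-01");
const oldConstant = {
  importance: 0.9,
  createdMs: Date.parse("2023-04-01"),
  architecturalConstant: true,
};
const freshNote = { importance: 0.3, createdMs: Date.parse("2026-03-01") };
```

Under this weighting a three-year-old incident-derived constraint still outranks a low-importance note from last month, which is exactly the behavior the bullet above argues for.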

What We Are Actually Building

The system is not a search engine over your documentation. It is not a chatbot that answers questions about past decisions. Both of those require an engineer to know they should ask a question, know what to ask, and take the time to do it — conditions that reliably fail under deadline pressure.

What we built is an ambient interrupt system with persistent agent memory that fires at the moment a decision is being made, without being asked. The timing is not incidental — it is the point. Post-hoc knowledge retrieval is useful. Pre-hoc interruption is what actually prevents incidents.

The Hindsight agent memory architecture made the real-time retrieval path feasible. The episodic structure means organizational knowledge accumulates as lived experience rather than as a document index. The importance weighting means retrieval stays signal-dense as the corpus grows. The retrospective re-scoring means the agent is always revising its understanding of what mattered, not just appending to a static store.

Senior engineers carry in their heads what amounts to an organizational immune system — a pattern library built from years of watching what breaks and remembering why. When they leave, that immune system leaves with them. Every team I have worked with has accepted this as a cost of doing business. We externalized the immune system instead. It runs in the commit hook.

An overview

Three diagrams accompany this section: memory updating, data filtering, and the alternate path.

Limitations and Pain Points

This system is not magic, and it is certainly not perfect.
• Memory Bloat: The biggest limitation is that the agent can "remember" too much. If we keep every single comment from every Slack thread, the retrieval noise becomes unbearable. We’ve had to implement an automated pruning pipeline that aggressively de-prioritizes rules older than 12 months unless they are explicitly tagged as "Architectural Constant."
• Trust Calibration: Engineers are rightfully sceptical of "automatic changes." If the agent can't cite the source—a specific meeting, a PR comment, or a bug report—the engineers tend to override the agent's interventions. We have learned that explainability is non-negotiable. If the agent doesn't have a clear citation, it must default to "Alert" rather than "Act."
• Sensor Noise: Our meeting transcript parser often misinterprets sarcasm or brainstorming as definitive technical requirements. We are currently implementing a "human-in-the-loop" step where a senior engineer must verify any rule flagged by the agent before it becomes an immutable constraint.
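The "cite or downgrade" rule from the trust-calibration point is easy to sketch (field names here are hypothetical): without provenance, the agent may only alert, never act.

```js
// A rule without a citation is capped at "ALERT"; only cited,
// explicitly auto-correctable rules may "ACT". Illustrative shapes.
function decideAction(rule) {
  const hasCitation = Boolean(rule.source && rule.flaggedBy);
  if (!hasCitation) return "ALERT";
  return rule.autoCorrectable ? "ACT" : "ALERT";
}

const cited = {
  source: "incident postmortem",
  flaggedBy: "Sana",
  autoCorrectable: true,
};
const uncited = { autoCorrectable: true };
```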
