Anu Alleshwaram

Posted on Jun 29

I Reworked AI Memory Using Hindsight and Cascadeflow

#ai #hindsight #cascadeflow #memory

When I first started wiring this system together, the hardest problem wasn’t the dashboard or the React animations. It was making the incident story feel useful.

This repo, SentinelAI, is an operations platform built around incident intelligence, document memory, and an enterprise AI copilot. The backend is a FastAPI service that keeps incidents, KB articles, documents, connector state, and chat history in a Mongo-like memory layer. The frontend is a React app that lets you triage incidents, generate a root cause analysis (RCA), search knowledge, upload documents, and ask a copilot questions.

What the system does and how it hangs together

At a high level, SentinelAI is an incident triage workspace with three major flows:

Incident management: create and update incidents, track timelines, surface AI confidence and risk scores.
Enterprise memory: fetch similar incidents, related knowledge base articles, documents, and a recommended resolution from the same incident context.
Copilot and document AI: stream chat responses, summarize documents, answer document questions, and generate structured RCA.

The backend is a single FastAPI app in backend/server.py. It also has a lightweight in-memory Mongo API shim for local dev, but the feature logic is in routes such as /incidents/{iid}/memory, /incidents/{iid}/rca-structured, /copilot/chat/stream, and /documents/{did}/insights.
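To make the "in-memory Mongo API shim" idea concrete, here is a minimal sketch of what such a shim could look like. It is not the repo's implementation: the class names (MemoryCollection, MemoryCursor) and helpers are hypothetical, and it only supports equality, `$ne`, and exclusion-only projections, which is the call shape used by the route excerpts later in this post (`find_one(query, projection)` and `find(query, projection).to_list(n)`).

```python
import asyncio
from typing import Any, Dict, List, Optional


def _matches(doc: Dict[str, Any], query: Dict[str, Any]) -> bool:
    """Return True if doc satisfies every condition in query."""
    for field, cond in query.items():
        value = doc.get(field)
        if isinstance(cond, dict) and "$ne" in cond:
            if value == cond["$ne"]:
                return False
        elif value != cond:
            return False
    return True


def _project(doc: Dict[str, Any], projection: Optional[Dict[str, int]]) -> Dict[str, Any]:
    """Apply an exclusion-only projection such as {"_id": 0}."""
    if not projection:
        return dict(doc)
    excluded = {key for key, flag in projection.items() if flag == 0}
    return {key: val for key, val in doc.items() if key not in excluded}


class MemoryCursor:
    """Hypothetical stand-in for a driver cursor; only to_list() is supported."""

    def __init__(self, docs: List[Dict[str, Any]]):
        self._docs = docs

    async def to_list(self, length: Optional[int] = None) -> List[Dict[str, Any]]:
        return self._docs if length is None else self._docs[:length]


class MemoryCollection:
    """Hypothetical in-memory collection backed by a plain list."""

    def __init__(self) -> None:
        self._docs: List[Dict[str, Any]] = []

    async def insert_one(self, doc: Dict[str, Any]) -> None:
        self._docs.append(dict(doc))

    async def find_one(self, query, projection=None) -> Optional[Dict[str, Any]]:
        for doc in self._docs:
            if _matches(doc, query):
                return _project(doc, projection)
        return None

    def find(self, query, projection=None) -> MemoryCursor:
        # Not a coroutine: the caller awaits the cursor's to_list() instead.
        return MemoryCursor([_project(d, projection) for d in self._docs if _matches(d, query)])


async def demo():
    incidents = MemoryCollection()
    await incidents.insert_one({"_id": 1, "id": "inc-1", "service": "payments-api"})
    await incidents.insert_one({"_id": 2, "id": "inc-2", "service": "auth"})
    current = await incidents.find_one({"id": "inc-1"}, {"_id": 0})
    others = await incidents.find({"id": {"$ne": "inc-1"}}, {"_id": 0}).to_list(400)
    return current, others


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The point of a shim like this is that route code does not need to know whether it is talking to a real database or a list in memory, as long as the few calls it relies on behave the same way.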

The frontend has an Incidents page, a Copilot page, and a Documents page. The copilot UI is more than a chat window: it lets you attach documents, stream responses, and see suggested follow-up actions. The incident detail screen combines an overview, timeline, AI confidence, and a structured RCA component.

The core technical story: incident context as memory

The thing I ended up focusing on was the incident memory model. In an operations system, the value of an AI assistant drops quickly if it can’t answer “what happened before, and what has solved this before?”

In this repo, that story is split into two pieces:

A historical incident similarity engine.
An enterprise memory endpoint that blends incidents, KB, and documents.

The endpoint /incidents/{iid}/memory is the core of that story. It is an aggregated feature designed to answer a simple question: for this active incident, what past context matters?

From backend/server.py:

@api.get("/incidents/{iid}/memory")
async def enterprise_memory(iid: str, user=Depends(current_user)):
    inc = await db.incidents.find_one({"id": iid}, {"_id": 0})
    # ...
    pool = await db.incidents.find({"id": {"$ne": iid}}, {"_id": 0}).to_list(400)
    target_tags = set(inc.get("tags", []))
    scored = []
    for p in pool:
        score = 0
        if p.get("service") == inc.get("service"): score += 35
        if p.get("severity") == inc.get("severity"): score += 18
        if p.get("department") == inc.get("department"): score += 10
        overlap = len(set(p.get("tags", [])) & target_tags)
        score += min(35, overlap * 12)
        tt = set(inc.get("title", "").lower().split())
        pt = set(p.get("title", "").lower().split())
        score += min(10, len(tt & pt))
        if score >= 30:
            scored.append({**p, "match_score": min(99, score)})
    scored.sort(key=lambda x: x["match_score"], reverse=True)
    similar = scored[:6]

That code is intentionally simple. It is not a production semantic search engine. It is a deliberate tradeoff: use deterministic matching on service, severity, department, tags, and shared title tokens, then surface the best candidates. I leaned into this because it is easy to explain, easy to debug, and it keeps the “memory” signal from being a black box.
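To see how the weights combine, here is the same scoring logic pulled out into a standalone function with a worked example. The weights and caps come from the excerpt above; the function name, the sample incidents, and the "fintech" department value are made up for illustration.

```python
from typing import Any, Dict


def match_score(target: Dict[str, Any], candidate: Dict[str, Any]) -> int:
    """Score a past incident against the active one using the weights from the post."""
    score = 0
    if candidate.get("service") == target.get("service"):
        score += 35
    if candidate.get("severity") == target.get("severity"):
        score += 18
    if candidate.get("department") == target.get("department"):
        score += 10
    # Each shared tag is worth 12 points, capped at 35.
    tag_overlap = len(set(candidate.get("tags", [])) & set(target.get("tags", [])))
    score += min(35, tag_overlap * 12)
    # Each shared lowercase title token is worth 1 point, capped at 10.
    target_tokens = set(target.get("title", "").lower().split())
    candidate_tokens = set(candidate.get("title", "").lower().split())
    score += min(10, len(target_tokens & candidate_tokens))
    return min(99, score)


# Illustrative data only.
active = {
    "service": "payments-api", "severity": "critical", "department": "fintech",
    "tags": ["latency", "db"], "title": "Payments API latency spike",
}
past = {
    "service": "payments-api", "severity": "high", "department": "fintech",
    "tags": ["db", "timeout"], "title": "Payments API timeout errors",
}

# 35 (service) + 0 (severity) + 10 (department) + 12 (one shared tag) + 2 (two shared title tokens)
print(match_score(active, past))  # 59
```

A score of 59 clears the threshold of 30 used in the endpoint, so this past incident would be listed as similar. The maximum raw total is 108 (35 + 18 + 10 + 35 + 10), which is why the result is capped at 99.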

Once the similar incidents are available, the endpoint derives an estimated time to resolve and a confidence score. The ETA part looks like this:

resolved_similar = [s for s in similar if s.get("status") == "resolved"]
if resolved_similar:
    eta_min = int(sum(max(15, 90 - (s["match_score"] // 2)) for s in resolved_similar) / len(resolved_similar))
else:
    eta_min = {"critical": 60, "high": 45, "medium": 30, "low": 20}.get(inc.get("severity", "medium"), 30)

This is not a real SLA model, but it is useful because it turns similarity into an operational prediction: the closer the match to a previously resolved incident, the shorter the estimate. It is the kind of pragmatic signal that incident responders actually care about.
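A worked example makes the ETA arithmetic easier to check. This sketch wraps the same formula in a function; the function name and the sample match scores are illustrative, not taken from the repo.

```python
from typing import Any, Dict, List

# Fallback minutes per severity, copied from the excerpt above.
SEVERITY_DEFAULT_ETA = {"critical": 60, "high": 45, "medium": 30, "low": 20}


def estimate_eta_minutes(similar: List[Dict[str, Any]], severity: str = "medium") -> int:
    """Average a per-incident estimate over resolved matches, else fall back to severity."""
    resolved = [s for s in similar if s.get("status") == "resolved"]
    if resolved:
        per_incident = [max(15, 90 - (s["match_score"] // 2)) for s in resolved]
        return int(sum(per_incident) / len(resolved))
    return SEVERITY_DEFAULT_ETA.get(severity, 30)


similar = [
    {"status": "resolved", "match_score": 80},  # 90 - 40 = 50
    {"status": "resolved", "match_score": 59},  # 90 - 29 = 61
    {"status": "open", "match_score": 95},      # ignored: not resolved
]

print(estimate_eta_minutes(similar, "critical"))  # (50 + 61) / 2 = 55.5 -> 55
print(estimate_eta_minutes([], "critical"))       # no resolved matches -> 60
```

One thing the example surfaces: because match scores are confined to the range 30 to 99, each per-incident term lands between 41 and 75 minutes, so the 15-minute floor in the formula never actually applies.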

Why Hindsight and Cascadeflow matter in this design

I couldn’t build this without thinking in terms of two ideas: memory as context and flow as a decision path.

Hindsight is useful here because it frames incident recovery as a retrieval problem. A good memory system should answer: which past incidents, documents, and KB articles are relevant to this current failure? That is exactly what the /incidents/{iid}/memory endpoint is doing. It is a small, operational memory layer.

Cascadeflow is relevant because the UI and the agent need to feel like a flow, not a random chat. The Copilot page is not just sending text to an LLM; it is managing session state and wiring suggested actions back into the interface. Here’s the action suggestion route:

@api.post("/copilot/suggest-actions")
async def suggest_actions(payload: Dict[str, str], user=Depends(current_user)):
    # ... (elided in this excerpt: `text` is taken from the request payload)
    prompt = (
        "Return STRICT JSON array of 3-4 short actionable next-step chip labels (max 6 words each).\n"
        "Schema: [{\"label\":\"...\",\"intent\":\"summarize|investigate|notify|automate|escalate\"}]\n\n"
        f"Topic: {text}\n"
    )
    raw = await llm_chat("You are SentinelAI. Always return strict JSON.", prompt, session_id=f"sa-{user['id']}")

That is a kind of flow orchestration: the system reads the user’s last query and returns structured next steps, rather than just a paragraph of text. You can imagine this as a cascade of decisions: user asks a question, the agent replies, the system suggests a next action, and the UI turns that into a button.
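The excerpt stops at the LLM call, so how the reply becomes buttons is not shown. As a hedged sketch of that step: the function below extracts the JSON array from the raw reply, validates each chip against the schema in the prompt, and returns an empty list if anything is off. The name parse_action_chips and the exact validation rules are assumptions for illustration, not the repo's code.

```python
import json
from typing import Dict, List

# Intent values listed in the prompt's schema.
ALLOWED_INTENTS = {"summarize", "investigate", "notify", "automate", "escalate"}


def parse_action_chips(raw: str, limit: int = 4) -> List[Dict[str, str]]:
    """Extract validated {label, intent} chips from an LLM reply; [] on any failure."""
    # Models often wrap JSON in prose or markdown fences, so slice out the array.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []

    chips = []
    for item in data:
        if not isinstance(item, dict):
            continue
        label = str(item.get("label", "")).strip()
        intent = str(item.get("intent", "")).strip().lower()
        # Enforce the prompt's constraints: non-empty, max 6 words, known intent.
        if label and len(label.split()) <= 6 and intent in ALLOWED_INTENTS:
            chips.append({"label": label, "intent": intent})
    return chips[:limit]


reply = (
    "Sure!\n```json\n"
    '[{"label": "Check DB latency", "intent": "investigate"},'
    ' {"label": "Dance", "intent": "dance"}]\n```'
)
print(parse_action_chips(reply))
```

Returning an empty list rather than raising means the UI can simply render no chips when the model misbehaves, which keeps the chat itself usable.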

Code-backed behavior: how it plays out in the UI

There are three concrete interactions I’d point to.

1. Incident triage with memory

The incident list page is labeled “Triage with AI memory.” It loads incidents from /incidents, and when you open one, the incident detail page gives you the structured RCA button and the enterprise memory summary.

The code in frontend/src/pages/Incidents.jsx is very explicit about that expectation:

<SectionLabel>Incident Intelligence</SectionLabel>
<h1 className="mt-1 text-3xl font-bold tracking-tight">Triage with <span className="gradient-text">AI memory</span></h1>

Then on the detail page, the same incident object is enhanced with AI-generated analysis and timeline events.

2. Copilot with streaming responses

The Copilot page uses SSE from /copilot/chat/stream. It parses events, accumulates text, and updates the UI in near real time. That gives the experience of a live assistant.

The frontend implementation is here:

const resp = await fetch(url, {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: `Bearer ${token}` },
  body: JSON.stringify({ session_id: sid, message: uploadedExcerpt ? `${msg}\n\nAttached document:\n\n${uploadedExcerpt}` : msg }),
});
const reader = resp.body.getReader();
const decoder = new TextDecoder();
let buf = "";
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += decoder.decode(value, { stream: true });
  // parse SSE events
}

That flow is important because it keeps the assistant from feeling like a static form submission.
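The `// parse SSE events` step is elided in the snippet above. The frontend does this in JavaScript, but the buffering logic is the same in any language, so here is a sketch in Python to match the backend excerpts. It assumes the server emits `data: <json>` events separated by a blank line and that each payload carries a `text` field; the real event shape is not shown in this post, so treat both as assumptions.

```python
import json
from typing import List, Tuple


def drain_sse_buffer(buf: str) -> Tuple[List[dict], str]:
    """Split complete SSE events off the front of buf.

    Returns (parsed_events, remaining_buffer). An event only counts as
    complete once its terminating blank line has arrived, so a partial
    event stays in the buffer until the next chunk.
    """
    events: List[dict] = []
    while "\n\n" in buf:
        block, buf = buf.split("\n\n", 1)
        # Keep only data lines, dropping the field name and one optional space.
        data_lines = [
            line[5:].removeprefix(" ")
            for line in block.split("\n")
            if line.startswith("data:")
        ]
        if not data_lines:
            continue
        try:
            events.append(json.loads("\n".join(data_lines)))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than break the stream
    return events, buf


# Network chunks rarely align with event boundaries, so feed them in pieces.
chunks = ['data: {"text": "Hel', 'lo"}\n\ndata: {"te', 'xt": " world"}\n\n']
buf, text = "", ""
for chunk in chunks:
    buf += chunk
    events, buf = drain_sse_buffer(buf)
    text += "".join(event.get("text", "") for event in events)

print(text)  # Hello world
```

The key design point is returning the leftover buffer: the read loop appends each decoded chunk, drains whatever is complete, and carries the remainder forward.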

3. Document intelligence and related incidents

Uploaded documents can be summarized and queried. The backend uses the same llm_chat wrapper for document insights and document Q&A, which keeps the experience consistent.

In backend/server.py:

@api.post("/documents/{did}/insights")
async def document_insights(did: str, user=Depends(current_user)):
    # ... (elided in this excerpt: `doc` is loaded for the given `did`)
    excerpt = (doc.get("excerpt") or "")[:3000]
    prompt = (
        "Return STRICT JSON (no markdown) with this shape:\n"
        "{\"keywords\":[\"...\"],\"entities\":[{\"text\":\"...\",\"type\":\"PERSON|ORG|SERVICE|TECH|DATE|LOCATION\"}],\"auto_tags\":[\"...\"],\"sentiment\":\"positive|neutral|negative\",\"confidence\":0.0_to_1.0}\n\n"
        f"Document: {doc.get('name')}\nContent excerpt:\n{excerpt}\n\nReturn JSON only."
    )

Then the frontend shows related incidents linked from the document insights panel.
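Because the prompt demands strict JSON, the backend needs a plan for replies that are not valid JSON. A minimal sketch of one such plan, assuming a frequency-based keyword fallback: the function names, the stopword list, and the shape of the fallback payload (including the 0.0 confidence) are illustrative assumptions, not the repo's actual code.

```python
import json
import re
from collections import Counter
from typing import Any, Dict, List

# Small illustrative stopword list; a real one would be longer.
STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "was", "are", "has", "have", "not"}


def naive_keywords(text: str, k: int = 5) -> List[str]:
    """Return the k most frequent non-stopword tokens of at least 3 characters."""
    words = re.findall(r"[a-z][a-z0-9_-]{2,}", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


def parse_insights(raw: str, excerpt: str) -> Dict[str, Any]:
    """Use the LLM's JSON when it parses and has keywords; otherwise degrade gracefully."""
    try:
        data = json.loads(raw)
        if isinstance(data, dict) and isinstance(data.get("keywords"), list):
            return data
    except json.JSONDecodeError:
        pass
    # Fallback: same keys as the prompt's schema, filled with conservative defaults.
    return {
        "keywords": naive_keywords(excerpt),
        "entities": [],
        "auto_tags": [],
        "sentiment": "neutral",
        "confidence": 0.0,
    }


excerpt = "Database timeout on payments database. Payments database pool exhausted."
print(parse_insights("Sorry, I cannot help with that.", excerpt)["keywords"])
```

Keeping the fallback in the same shape as the requested schema means the frontend panel does not need a separate code path for degraded results.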

Concrete example interactions

If I were using this system in a real incident, the workflow would look like this:

A new incident is created for payments-api with critical severity and tags like latency and db.
I open the incident detail and see “AI Confidence” plus a timeline of events.
I hit “Generate RCA.” The backend calls Gemini with a prompt that asks for “Root Cause, Contributing Factors, Suggested Resolution, Preventive Actions.”
I use /incidents/{iid}/memory and see similar episodes, related KB, and a recommended resolution pulled from a past resolved incident.
I switch to Copilot, paste the incident details or attach a support document, and get a streaming assistant response with suggested follow-up actions.

That is the pattern I want: incident, memory, and a guided AI flow.

Why this matters to me as an engineer

There are a few design decisions here that I think are worth calling out.

Don’t overcomplicate similarity. The memory engine is not pretending to be semantic search. It uses structured fields and token overlap to make the result predictable.
Keep the prompt layer consistent. Every AI feature shares the same llm_chat guardrail and the same “strict JSON” style when it needs structured output.
Treat documents as first-class context. Document upload is not just a sidebar feature; it is actually available in copilot sessions and incident memory.
Make the assistant actionable. That /copilot/suggest-actions endpoint is the smallest thing that turns chat from “answer this” into “what should I do next?”

Limitations I would fix first

This repo is promising, but it also makes a clear tradeoff: much of the realism is synthetic. A lot of the backend is intentionally simple, and the “enterprise memory” endpoint blends actual incident data with fallback heuristics and synthetic confidence.

For example:

The similarity model uses only direct field matches and title tokens.
The ETA calculation is a deterministic function of match score, not a real historical distribution.
The document extraction layer can fall back to naive keywords if the LLM output is malformed.

Those are not bad decisions for a minimal system, but they are the things I would harden next if I wanted this to run in a real SRE workflow.

Lessons learned

Incident memory is more useful when it is narrow. A small feature like “similar incidents + related KB + documents” is easier to ship than a full search engine.
Structured LLM output reduces brittleness. When prompts ask for strict JSON, the backend has a known shape to validate, and it can fall back to a simpler parser when the output is malformed instead of failing entirely.
Streaming chat changes the user expectation. If the assistant responds token-by-token and the UI shows suggested next steps, the product feels like a flow instead of a form.
Don’t guess at high confidence. The backend exposes a confidence score and keeps it conservative, which is a better posture than pretending the AI is always right.
Keep the UI language aligned with the architecture. The frontend calls the page “AI memory” because the strongest story here is context retrieval, not just “AI assistance.”

Final thought

I built this repo around one idea: AI should make incident context visible, not replace the incident process.

That meant accepting a few deliberate simplifications in the memory engine, and building the UI so the copilot behaves like a guided flow. If I were to continue this project, I would keep that same tension: use memory for context, use flows for decisions, and keep the system explainable.

This is our work - https://sentinel-ai-kq5i.onrender.com

GitHub link - https://github.com/sahil-0404/Sentinel-AI