"I Made Hindsight Ask Before It Remembered"

Rohini Adibatti — Tue, 19 May 2026 17:36:56 +0000

## I Made Hindsight Ask Before It Remembered

The hard part of building a clinical memory system was not storing what a doctor said. The hard part was deciding when the system should ask one more question before committing that memory.

Shift handoff software usually treats the handoff as a document problem: capture a note, file it, maybe make it searchable. I found that framing too weak. The clinically interesting data often lives in phrases that look imprecise to software but meaningful to a doctor: "something feels off," "guarded on paper," "not fully comfortable," "oxygen dipped but recovered with support." Those are not clean fields. They are uncertainty signals.

ShiftBrain is my attempt to build around that reality. It combines voice capture, structured handoff extraction, patient-specific memory retrieval, cross-shift pattern detection, and runtime model routing. The two ideas that shaped the architecture were Hindsight agent memory and cascadeflow model routing. Hindsight gave me the mental model for persistent patient memory. cascadeflow gave me the runtime pattern: do the cheap, fast thing first, then escalate when the input deserves it.

The opinionated part is this: I do not want an agent that only answers questions. For handoffs, I want an agent that interrupts politely before it stores a dangerously vague memory.

What ShiftBrain Does

ShiftBrain has two primary workflows.

The outgoing doctor uses a voice bot to create a handoff. They select a patient, click "Start Handoff Conversation," and speak naturally. The browser records audio with MediaRecorder, the backend transcribes it with Groq Whisper, then /api/handoffs/extract-draft turns the transcript into structured fields:

formal note
gut concern
things not in the chart
watch-outs
shift
follow-up questions

The incoming doctor uses a separate voice bot or patient Q&A page. They select the same patient and ask a question like, "Anything unusual I should watch for tonight?" The backend fetches the patient row, saved handoffs, extracted memories, and cross-shift patterns, then asks Groq for a patient-grounded answer through /api/memory/ask.

The architecture is deliberately boring in the best way:

FastAPI backend
Next.js frontend
Supabase for relational data and pgvector memory lookup
Groq for LLM calls and Whisper transcription
browser-native speech synthesis for TTS
Hindsight-style memory storage
cascadeflow-style runtime routing and audit logging

I did not want a pile of agent abstractions hiding the control flow. In clinical handoff, the boring path is often the path you can debug at 3 AM.

The Real Problem Was Ambiguity

My first version had a predictable flaw. It could fill the fields, then confidently say:

Thanks. I've structured the handoff. Please review and save.

That looked good until I tried a handoff like this:

45 year old male with pulmonary tuberculosis, currently guarded on paper, somewhat stable, but I am not fully comfortable with how he looks. His breathing pattern seemed more labored in the last hour. Oxygen saturation dipped but recovered with support.

The structured output had values. formal_note was non-empty. gut_concern was non-empty. things_not_in_chart had the oxygen dip. watch_outs mentioned respiratory monitoring.

But clinically, the handoff was not done.

"Oxygen saturation dipped" is not enough. What was the low? What is it now? What support helped? Oxygen? nebulization? positioning? Should the incoming doctor check respiratory rate and saturation in the first 30 minutes?

The mistake was treating "field is non-empty" as "field is adequate." That is a classic software bug when the domain has tacit meaning. Clinical ambiguity is not null. It is often a phrase with just enough information to be dangerous.

So I added an explicit risk clarification layer.

Hindsight Is Memory, Not Just Storage

The Hindsight layer is small, but it carries a lot of the system's intent. When a handoff is saved, extraction produces tacit memories and stores them against the patient. Retrieval is scoped by patient and department, and it tries semantic search first with a recent-memory fallback.

class HindsightService:
    def store_memory(
        self,
        patient_id: str,
        author_id: str,
        department: str,
        memory_type: str,
        content: str,
        importance: float = 0.5,
        confidence: float = 0.5,
        tags: list[str] = None,
        handoff_id: Optional[str] = None,
        metadata: dict = None,
    ) -> dict:
        try:
            vec = embed(content)
        except Exception as exc:
            logger.warning("Embedding failed while storing memory; using zero fallback vector: %s", exc)
            vec = [0.0] * EMBEDDING_DIM

        row = {
            "patient_id": patient_id,
            "author_id": author_id,
            "department": department,
            "memory_type": memory_type,
            "content": content,
            "embedding": vec,
            "importance": importance,
            "confidence": confidence,
            "tags": tags or [],
            "handoff_id": handoff_id,
            "metadata": metadata or {},
        }

This is intentionally not just "chat history." A memory has type, confidence, importance, tags, author, patient, department, and source handoff. That matters because incoming questions need provenance. "Dr. Patel had a gut concern last night" is more useful than "the model thinks there may be a concern."

The Hindsight documentation helped clarify that memory should be a retrieval substrate, not a transcript dump. The broader concept is described well in Vectorize's write-up on what agent memory is and why retrieval matters. In ShiftBrain, I used that idea to keep each handoff connected to future bedside questions.

cascadeflow Is Where Runtime Judgment Lives

For extraction, I wanted a runtime that could start with a fast model and escalate when the input looked risky, long, or hard to parse. That is the cascadeflow-shaped part of the system.

MODELS = {
    "fast": {
        "provider": "groq",
        "name": "llama-3.1-8b-instant",
        "in": 0.05,
        "out": 0.08,
    },
    "balanced": {
        "provider": "groq",
        "name": "llama-3.3-70b-versatile",
        "in": 0.59,
        "out": 0.79,
    },
    "premium": {
        "provider": "openrouter",
        "name": "anthropic/claude-3.5-sonnet",
        "in": 3.00,
        "out": 15.00,
    },
}

CRITICAL_KEYWORDS = [
    "cardiac arrest", "stroke", "hemorrhage", "sepsis", "code blue",
    "anaphylaxis", "respiratory failure", "unstable", "deteriorating",
]

The router is not magic. It checks the prompt, picks a tier, requires JSON when the caller needs a schema, and writes audit logs with tokens, latency, model, provider, and estimated cost.

def _route_tier(self, prompt: str) -> str:
    lower = prompt.lower()
    if any(k in lower for k in CRITICAL_KEYWORDS):
        return "premium"
    if len(prompt) > 2000:
        return "balanced"
    return "fast"

That style maps closely to the ideas in cascadeflow docs: runtime decisions belong close to the work, not buried in prompts. The cascadeflow GitHub project is useful because it treats model choice as an engineering control surface. I wanted that same property here. If a handoff contains "respiratory failure" or "deteriorating," the system should not pretend every input has the same risk profile.

The Follow-Up Loop That Broke My Assumptions

The outgoing voice bot exposed the most interesting failure.

At first, I had a simple loop:

transcribe audio
extract draft
ask first follow-up question if one exists
listen again
re-run extraction

Then I hit a repeat bug. The bot kept asking, "Is patient's oxygen saturation stable?" even after the doctor answered. The backend did not know which question had already been asked, and the frontend did not attach the doctor's next utterance to that question.

The fix was not a bigger prompt. It was state.

const [askedQuestions, setAskedQuestions] = useState<string[]>([])
const [answeredFollowups, setAnsweredFollowups] = useState<AnsweredFollowup[]>([])

const askedQuestionsRef = useRef<string[]>([])
const answeredFollowupsRef = useRef<AnsweredFollowup[]>([])
const currentFollowupRef = useRef<string | null>(null)

function captureFollowupAnswer(answer: string) {
  const question = currentFollowupRef.current
  if (!question) return answeredFollowupsRef.current
  const next = [...answeredFollowupsRef.current, { question, answer }]
  answeredFollowupsRef.current = next
  setAnsweredFollowups(next)
  currentFollowupRef.current = null
  return next
}

The frontend now sends the full conversation context:

const data = await api.post<Draft>('/api/handoffs/extract-draft', {
  patient_id: patientId,
  transcript: nextTranscript,
  previous_draft: draftRef.current,
  asked_followup_questions: askedQuestionsRef.current,
  answered_followup_questions: answeredOverride,
})

That one change made the bot feel less like a form generator and more like an interview loop. It also made the backend responsible for a stricter contract: never repeat a question, cap follow-ups at two, and decide whether the handoff is ready to save.

The backend response now includes risk_clarifications_needed and ready_to_save. That is a small schema change with a big behavioral effect.

Return JSON only with:
{
  "formal_note": "",
  "gut_concern": "",
  "things_not_in_chart": "",
  "watch_outs": "",
  "shift": "day|night|swing",
  "missing_fields": ["formal_note", "gut_concern", "watch_outs"],
  "risk_clarifications_needed": true,
  "followup_questions": ["one or two concise questions"],
  "ready_to_save": false
}

This is where the system became more opinionated. It no longer asks follow-ups only when fields are blank. It also asks when risk phrases are ambiguous:

oxygen saturation dipped
recovered with support
breathing more labored
guarded
something feels off
not comfortable
pain is different
unusual quietness

For the TB handoff, the system asks:

What was his lowest oxygen saturation and what is it now?

Then, if it has room for a second follow-up:

What support helped him recover - oxygen, nebulization, positioning, or something else?

After those answers are captured, the structured preview updates, the bot stops asking, and it says:

Thanks. I've structured the handoff. Please review and save.

The max-two rule matters. An agent that asks infinite clarifying questions is not careful. It is unusable.

Incoming Questions Need Patterns, Not Just Recall

The other major design decision was cross-shift pattern detection. If two doctors on consecutive nights say a patient seemed unusually quiet, that should not be buried as two independent notes.

I started with rule-based detection rather than an LLM. That was deliberate. Pattern detection should not fail because a model is unavailable, and it should be easy to inspect.

PATTERN_RULES = [
    {
        "key": "gut",
        "pattern": "Repeated gut concern",
        "terms": ["gut concern", "something feels off", "feels off", "not right", "worried", "concern"],
        "risk_level": "medium",
        "suggested_action": "Incoming doctor should manually reassess patient early in shift.",
    },
    {
        "key": "quiet_anxious",
        "pattern": "Repeated behavior change",
        "terms": ["quiet", "unusually quiet", "anxious", "anxiety", "withdrawn", "not herself", "not himself"],
        "risk_level": "medium",
        "suggested_action": "Incoming doctor should compare current behavior with baseline and family report.",
    },
]

When an incoming doctor asks a question, /api/memory/ask fetches handoffs, memories, and patterns for the selected patient. The prompt explicitly includes all three.

return f"""PATIENT:
{patient_block}

SAVED HANDOFF HISTORY FOR THIS PATIENT:
{handoff_block}

EXTRACTED MEMORIES FOR THIS PATIENT:
{memory_block}

CROSS-SHIFT PATTERNS DETECTED:
{pattern_block}

INCOMING DOCTOR QUESTION:
{question}
"""

That gives the answerer enough context to say something concrete:

Two different doctors flagged gut concern across recent handoffs. Dr. Patel noted the patient seemed unusually quiet, and Dr. Osei also wrote that something felt off despite stable charted vitals. I would reassess respiratory status and mental status early, then compare against the last two shifts before relying on the chart alone.

That is the behavior I wanted: not "here is a summary," but "here is the thing that repeated across shifts."

What I Learned

First, memory systems need write-time judgment. Retrieval quality starts before retrieval. If the system stores vague handoffs without asking about ambiguous risk, no vector search trick will recover the missing detail later.

Second, prompts are not enough state. The repeated oxygen follow-up bug was not solved by saying "do not repeat yourself" more loudly. It was solved by representing asked_followup_questions and answered_followup_questions as data.

Third, cascade routing belongs in normal application code. I want model choice, latency, token usage, and cost visible in audit logs. That makes the system easier to operate than a black-box agent loop.

Fourth, deterministic rules still matter. The cross-shift detector is intentionally rule-based before LLM-based. For repeated gut concerns, missing-chart context, behavior changes, and pain changes, simple rules are inspectable and good enough to trigger attention.

Fifth, voice UX is mostly about boundaries. Stop listening while the bot speaks. Resume only when speech ends. Cap follow-ups. Provide typed fallback. Never silently return a fake answer when Groq fails. These details are not glamorous, but they are the difference between a useful voice bot and a demo that talks over itself.

The System I Trust More

The version of ShiftBrain I trust more is not the one that answers fastest. It is the one that knows when a handoff is too ambiguous to store as-is.

Hindsight gives the system a patient-specific memory layer. cascadeflow gives it runtime discipline around model choice and observability. But the most important behavior lives between them: the moment an outgoing doctor says, "oxygen dipped but recovered with support," and the bot asks, "What support helped?"

That question is small. It is also the difference between remembering a sentence and preserving a clinical signal.

DEV Community: Rohini Adibatti