DEV Community

Lars Winstand

Posted on • Originally published at standardcompute.com

My fix for hallucinating case notes was weirdly boring: stop stuffing context and split the job in two

I keep seeing the same failure mode in note-to-action workflows:

  • therapy notes -> action plan
  • incident report -> follow-up steps
  • HR case log -> recommendation
  • intake notes -> classification + priority

The model output sounds polished.

And then you trace it back to the source notes and realize half the reasoning is mush.

Not always fully fabricated. Just... not grounded enough to trust.

I got pulled back into this after reading a thread on r/openclaw from someone trying to turn therapy notes into action plans with OpenClaw. The problem was painfully familiar: the agent could read the notes, but the recommendations drifted away from the actual evidence.

One reply had the best fix in the whole thread:

Hallucination in data extraction usually happens when the prompt is too open-ended or the context window is crowded. Try implementing a two-step verification process: first, have the agent extract raw quotes from the notes that support the action item, and then have a second pass generate the action plan based only on those quotes.

That answer is boring.

Which is exactly why I trust it.

The real fix usually is not:

  • a smarter system prompt
  • a bigger model
  • a 1M-token context window
  • a more elaborate agent loop

The real fix is architectural.

Split the job in two.

The actual bug: one prompt doing too many jobs

A single-pass note-to-action prompt usually asks one model call to do all of this at once:

  • read messy notes
  • decide what matters
  • classify the case
  • infer missing context
  • prioritize risk
  • generate recommendations
  • explain why

That is not one task.

That is a committee meeting inside a single prompt.

For sensitive workflows, that’s where things go sideways.

If you’re using OpenClaw, n8n, Make, Zapier, or a custom agent stack to process notes and trigger downstream actions, you do not want the model improvising across all of those steps in one shot.

You want a chain of custody for the evidence.

The two-pass pattern

This is the version I’d ship first.

| Approach | What actually happens |
| --- | --- |
| Single-pass note-to-action prompt | One model call does extraction, classification, and recommendations together. Fast to prototype, but recommendations are generated from the full noisy context, so drift is harder to catch. |
| Two-pass grounded workflow | Pass 1 extracts evidence or quotes with source references; Pass 2 generates recommendations from only approved evidence. More auditable, easier to debug, and much safer for sensitive workflows. |

Pass 1: extract evidence only

Do not ask for recommendations.

Do not ask for synthesis.

Ask for:

  • verbatim quotes
  • structured facts
  • source references
  • confidence or ambiguity flags if needed

Example schema:

{
  "evidence": [
    {
      "id": "ev_001",
      "quote": "Client reported missing two medication doses this week.",
      "source": "note_2026_06_05",
      "line_range": "14-14",
      "type": "adherence_issue"
    }
  ]
}
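Asking for verbatim quotes only helps if something actually checks them. Here is a minimal sketch of that check in plain JavaScript. The function names and the `notesById` lookup are my own assumptions for illustration, not part of OpenClaw or any SDK.

```javascript
// Hypothetical helper: keep only evidence whose quote appears verbatim in its source note.
// `notesById` maps a source id (e.g. "note_2026_06_05") to the raw note text.
function normalize(text) {
  // Collapse whitespace so line-wrapping differences do not cause false rejections.
  return text.replace(/\s+/g, " ").trim();
}

function verifyQuotes(evidence, notesById) {
  const grounded = [];
  const rejected = [];
  for (const item of evidence) {
    const note = notesById[item.source];
    const ok =
      typeof note === "string" &&
      typeof item.quote === "string" &&
      item.quote.trim().length > 0 &&
      normalize(note).includes(normalize(item.quote));
    (ok ? grounded : rejected).push(item);
  }
  return { grounded, rejected };
}

const notesById = {
  note_2026_06_05:
    "Client reported missing two medication doses this week.\nClient denied suicidal ideation.",
};

const result = verifyQuotes(
  [
    { id: "ev_001", quote: "Client reported missing two medication doses this week.", source: "note_2026_06_05" },
    { id: "ev_002", quote: "Client stopped taking medication entirely.", source: "note_2026_06_05" },
  ],
  notesById
);

console.log(result.grounded.map((e) => e.id)); // ev_001 is kept
console.log(result.rejected.map((e) => e.id)); // ev_002 is rejected: the quote is not in the note
```

Anything in `rejected` is a paraphrase or an invention, so it should never reach the recommendation step.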

Pass 2: generate recommendations from evidence only

Now the model gets a much smaller input.

No giant note blob. No hidden distractions. Just the extracted evidence.

Every recommendation should cite evidence IDs.

{
  "recommendations": [
    {
      "action": "Schedule medication adherence follow-up within 48 hours.",
      "priority": "high",
      "evidence_ids": ["ev_001"]
    }
  ]
}
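The citation rule is easy to enforce in code instead of trusting the prompt. A rough sketch, assuming the recommendation and evidence shapes shown above; `checkCitations` is a hypothetical helper name and the second recommendation is a made-up failing case.

```javascript
// Hypothetical check: every recommendation must cite at least one known evidence id.
function checkCitations(recommendations, evidence) {
  const knownIds = new Set(evidence.map((e) => e.id));
  const problems = [];
  for (const rec of recommendations) {
    const ids = Array.isArray(rec.evidence_ids) ? rec.evidence_ids : [];
    if (ids.length === 0) {
      problems.push({ action: rec.action, reason: "no evidence cited" });
      continue;
    }
    const unknown = ids.filter((id) => !knownIds.has(id));
    if (unknown.length > 0) {
      problems.push({ action: rec.action, reason: `unknown evidence ids: ${unknown.join(", ")}` });
    }
  }
  return problems;
}

const evidence = [{ id: "ev_001" }];
const recommendations = [
  { action: "Schedule medication adherence follow-up within 48 hours.", priority: "high", evidence_ids: ["ev_001"] },
  { action: "Refer to housing services.", priority: "medium", evidence_ids: ["ev_009"] }, // made-up bad citation
];

const problems = checkCitations(recommendations, evidence);
console.log(problems); // one problem: the second action cites an id that was never extracted
```

An empty `problems` array is the only result I would let trigger a downstream action automatically.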

Pass 3: abstain or escalate

This is the step teams skip.

If the evidence is weak, conflicting, or incomplete, the workflow should return:

  • insufficient evidence
  • needs human review
  • unable to classify

That is not failure.

That is a working safety mechanism.
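One way to make abstention concrete is a small gate that runs before the recommendation call. This is a sketch under my own assumptions: the `minItems` threshold and the boolean `ambiguous` flag on evidence items are hypothetical, so adapt them to whatever your extraction pass actually returns.

```javascript
// Hypothetical gate that runs before the recommendation pass.
// `minItems` and the `ambiguous` flag are assumptions, not part of the schema above.
function gate(evidence, { minItems = 1 } = {}) {
  if (!Array.isArray(evidence) || evidence.length < minItems) {
    return { status: "insufficient_evidence", evidence: [] };
  }
  const ambiguous = evidence.filter((e) => e.ambiguous === true);
  if (ambiguous.length > 0) {
    // Hand only the unclear items to a person instead of guessing.
    return { status: "needs_human_review", evidence: ambiguous };
  }
  return { status: "ok", evidence };
}

console.log(gate([]).status);                                   // insufficient_evidence
console.log(gate([{ id: "ev_001", ambiguous: true }]).status);  // needs_human_review
console.log(gate([{ id: "ev_001" }]).status);                   // ok
```

Only the `ok` branch should ever call the model for recommendations.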

Why stuffing more context often makes outputs worse

People still talk about long context like it automatically solves grounding.

It doesn’t.

A big context window means the model can receive more text. It does not mean the model will reliably use the right text.

That’s the important distinction.

The "Lost in the Middle" result (Liu et al., 2023) is still one of the clearest explanations for why giant prompts underperform in practice: models tended to use information best when it sat near the beginning or end of a long context, and accuracy dropped when the relevant passage was buried in the middle.

That matches what a lot of us see in production.

You stuff in every note, all prior history, metadata, policy text, and a giant instruction block because it feels safer than leaving anything out.

But now the important sentence is buried on page 8 between irrelevant details.

The model has more text.

It does not have better grounding.

That is why retrieval and scoped evidence extraction keep beating context stuffing in real systems.

Long context is useful. Just not for every job.

I’m not anti-long-context.

Long context is great for:

  • policy Q&A
  • chat-with-documents
  • repeated analysis over a stable corpus
  • cached prompts over large reference material

If I want Claude or GPT-5 to answer questions about a handbook, a giant cached prompt can be elegant.

If I want a model to turn sensitive notes into recommendations, I want evidence extraction first.

Different job. Different failure mode.

Practical implementation with an OpenAI-compatible client

This is where things get useful for devs.

If your stack already talks to the OpenAI API, you can usually implement the two-pass pattern without rewriting your app architecture.

That matters because the best workflow changes are the ones you can actually ship.

Here’s a minimal example using the OpenAI Node SDK shape.

Pass 1: evidence extraction

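// Requires an ES module context (top-level await is used below).
import OpenAI from "openai";

// baseURL is optional; set it to target any OpenAI-compatible endpoint.
const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: process.env.OPENAI_BASE_URL
});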

const notes = `
Case note 1: Client reported missing two medication doses this week.
Case note 2: Client denied suicidal ideation.
Case note 3: Client requested transportation help for next appointment.
`;

const extraction = await client.responses.create({
  model: "gpt-5.4",
  input: [
    {
      role: "system",
      content: `Extract only verbatim evidence from the notes.
Return JSON with this schema:
{
  "evidence": [
    {
      "id": "string",
      "quote": "string",
      "source": "string",
      "type": "string"
    }
  ]
}
Do not generate recommendations.`
    },
    {
      role: "user",
      content: notes
    }
  ]
});

console.log(extraction.output_text);
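If you want the `line_range` field from the earlier schema to mean something, the model needs to see line numbers and you need to check them afterwards. A minimal sketch, assuming you add `line_range` to the prompt schema yourself; both helper names are hypothetical.

```javascript
// Hypothetical helpers: number note lines before Pass 1 so the model can return a
// line_range like "14-14", then check the quote against exactly those lines.
function numberLines(noteText) {
  return noteText
    .split("\n")
    .map((line, i) => `${i + 1}: ${line}`)
    .join("\n");
}

function quoteInRange(noteText, lineRange, quote) {
  const match = /^(\d+)-(\d+)$/.exec(lineRange);
  if (!match || quote.trim().length === 0) return false;
  const start = Number(match[1]);
  const end = Number(match[2]);
  const lines = noteText.split("\n");
  if (start < 1 || end < start || end > lines.length) return false;
  // Join the cited lines so a quote spanning several lines can still match.
  const span = lines.slice(start - 1, end).join(" ");
  return span.includes(quote.trim());
}

const note = [
  "Client reported missing two medication doses this week.",
  "Client denied suicidal ideation.",
  "Client requested transportation help for next appointment.",
].join("\n");

console.log(numberLines(note));
console.log(quoteInRange(note, "2-2", "Client denied suicidal ideation.")); // true
console.log(quoteInRange(note, "1-1", "Client denied suicidal ideation.")); // false
```

Send `numberLines(note)` to the model and keep the unnumbered text for the check, so the prefixes never leak into the quotes you compare against.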

Pass 2: recommendations from extracted evidence only

// Pass only the Pass 1 output forward; the raw notes never reach this call.
const extractedEvidence = extraction.output_text;

const plan = await client.responses.create({
  model: "claude-opus-4.6",
  input: [
    {
      role: "system",
      content: `Generate recommendations using only the provided evidence.
Return JSON with this schema:
{
  "recommendations": [
    {
      "action": "string",
      "priority": "low|medium|high",
      "evidence_ids": ["string"]
    }
  ],
  "status": "ok|insufficient_evidence"
}
Every recommendation must cite evidence_ids.
If evidence is insufficient, return status=insufficient_evidence.`
    },
    {
      role: "user",
      content: extractedEvidence
    }
  ]
});

console.log(plan.output_text);

A few practical notes:

  • You can use the same model for both passes.
  • You can route extraction to a cheaper model and recommendation to a stronger one.
  • You can validate JSON between steps.
  • You can insert human review after pass 1.

That last one is huge for sensitive workflows.
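Here is roughly what JSON validation plus a human approval step can look like between the two calls. It is a sketch, not a full schema validator: `parseEvidence` and `keepApproved` are hypothetical names, and I am assuming the reviewer hands back a list of approved evidence ids.

```javascript
// Hypothetical validator for the text returned by Pass 1.
function parseEvidence(outputText) {
  let data;
  try {
    data = JSON.parse(outputText);
  } catch (err) {
    throw new Error(`Pass 1 did not return valid JSON: ${err.message}`);
  }
  if (data === null || typeof data !== "object" || !Array.isArray(data.evidence)) {
    throw new Error('Pass 1 output is missing an "evidence" array');
  }
  const required = ["id", "quote", "source", "type"];
  for (const item of data.evidence) {
    if (item === null || typeof item !== "object") {
      throw new Error("Evidence item is not an object");
    }
    for (const key of required) {
      if (typeof item[key] !== "string" || item[key].length === 0) {
        throw new Error(`Evidence item is missing a non-empty string "${key}"`);
      }
    }
  }
  return data.evidence;
}

// Human review step: only ids a reviewer approved are passed to Pass 2.
function keepApproved(evidence, approvedIds) {
  const approved = new Set(approvedIds);
  return evidence.filter((item) => approved.has(item.id));
}

const sample = JSON.stringify({
  evidence: [
    { id: "ev_001", quote: "Client reported missing two medication doses this week.", source: "note_1", type: "adherence_issue" },
    { id: "ev_002", quote: "Client denied suicidal ideation.", source: "note_1", type: "risk_screen" },
  ],
});

const forPass2 = keepApproved(parseEvidence(sample), ["ev_001"]);
console.log(forPass2.map((e) => e.id)); // only the approved item moves on
```

Note that this treats anything that is not bare JSON as an error, including output wrapped in Markdown fences. I would rather fail loudly there than pass half-parsed evidence downstream.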

Why this pattern fits Standard Compute really well

This is exactly the kind of workflow where per-token pricing gets annoying fast.

Once you split one giant prompt into two or three smaller calls, you usually get better reliability.

But you also increase call volume.

That’s the right tradeoff technically. It’s just annoying financially on standard per-token billing.

This is where Standard Compute makes a lot of sense for agent and automation teams.

You get:

  • unlimited AI compute for a flat monthly price
  • an OpenAI-compatible API
  • drop-in use with existing SDKs and HTTP clients
  • dynamic routing across GPT-5.4, Claude Opus 4.6, and Grok 4.20

So instead of trying to cram everything into one giant call to save tokens, you can design the workflow the way it should be designed:

  • one pass for extraction
  • one pass for recommendation
  • maybe one pass for validation or abstention

That’s especially useful if you’re building in:

  • n8n
  • Make
  • Zapier
  • OpenClaw
  • custom agent frameworks

Flat-rate compute changes behavior.

People stop asking, “Can we afford one more verification step?”

And start asking the better question:

“Does one more verification step make this workflow safer and easier to debug?”

That is a much healthier way to build.

OpenClaw setup for local or agent-based workflows

If you’re using OpenClaw as the orchestration layer, splitting responsibilities between agents is straightforward.

Basic setup:

npm install -g openclaw@latest
openclaw onboard --install-daemon
openclaw dashboard

Health checks:

openclaw status
openclaw status --all
openclaw health --json

A nice pattern here is:

  • Agent 1: retrieve relevant notes/chunks
  • Agent 2: extract evidence only
  • Agent 3: generate recommendations from approved evidence
  • Agent 4: escalate if evidence is incomplete

That beats one overloaded agent trying to do everything in one breath.

Retrieval vs stuffing

For tiny corpora, stuffing can be fine.

For growing corpora, retrieval usually wins.

| Strategy | Tradeoff |
| --- | --- |
| Context stuffing | Simpler for small corpora. But as prompts grow, relevant facts can get buried in the middle and become harder for the model to use correctly. |
| Retrieval + reranking | More moving parts, but it scales better and is stronger when the right evidence would otherwise be lost inside long context. |

If your recommendation step starts from the wrong chunks, it does not matter whether you picked GPT-5.4, Claude Opus 4.6, Grok 4.20, Qwen, or Llama.

The output will still drift because the evidence was wrong or incomplete upstream.

That’s why model shopping is often a distraction.

If the architecture is wrong, a better model just gives you more fluent mistakes.
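To make "retrieve instead of stuff" less abstract, here is a deliberately naive retrieval step: score each note chunk by term overlap with a query and keep the top few. This is a toy stand-in I am assuming for illustration. A real system would more likely use embeddings plus a reranker, and the chunk shape and function names here are hypothetical.

```javascript
// Hypothetical retrieval step: score note chunks by term overlap with a query
// and keep the top k. No stemming, so "missed" does not match "missing".
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

function retrieve(chunks, query, k = 2) {
  const queryTerms = new Set(tokenize(query));
  return chunks
    .map((chunk) => {
      const terms = new Set(tokenize(chunk.text));
      let score = 0;
      for (const term of queryTerms) {
        if (terms.has(term)) score += 1;
      }
      return { ...chunk, score };
    })
    .filter((chunk) => chunk.score > 0) // drop chunks with no overlap at all
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

const chunks = [
  { id: "c1", text: "Client reported missing two medication doses this week." },
  { id: "c2", text: "Client denied suicidal ideation." },
  { id: "c3", text: "Client requested transportation help for next appointment." },
];

const top = retrieve(chunks, "missed medication doses", 1);
console.log(top.map((c) => c.id)); // c1 is the only chunk sharing terms with the query
```

Even something this crude changes what the extraction pass sees: one relevant chunk instead of the whole case history.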

The default workflow I’d recommend

If you’re building anything that turns notes into decisions, this is the default I’d start with:

  1. Retrieve the smallest useful evidence set
  2. Extract verbatim quotes and structured facts first
  3. Attach source references to every extracted item
  4. Generate recommendations only from extracted evidence
  5. Allow abstention when evidence is weak or missing
  6. Add human review where the cost of error is high
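Those six steps can be wired together in a few lines. This is a sketch under my own assumptions: `runPipeline` and the injected `steps` object are hypothetical, and each step could be a model call, a rule, or a human review queue.

```javascript
// Hypothetical orchestration of the six steps. The `steps` functions are injected.
async function runPipeline(caseNotes, steps) {
  const chunks = await steps.retrieve(caseNotes);   // 1. smallest useful evidence set
  const evidence = await steps.extract(chunks);     // 2. verbatim quotes and facts
  const sourced = evidence.filter((e) => e.quote && e.source); // 3. drop items without a source
  const approved = await steps.review(sourced);     // 6. human review where errors are costly
  if (approved.length === 0) {
    return { status: "insufficient_evidence", recommendations: [] }; // 5. abstain
  }
  // 4. The recommendation step sees approved evidence only, never the raw notes.
  const recommendations = await steps.recommend(approved);
  return { status: "ok", recommendations };
}

// Stub steps so the flow can be exercised without any model calls.
const stubSteps = {
  retrieve: async (notes) => notes.split("\n").filter((line) => line.trim() !== ""),
  extract: async (chunks) =>
    chunks.map((quote, i) => ({ id: `ev_${i + 1}`, quote, source: "note_1" })),
  review: async (evidence) => evidence, // auto-approve; replace with a real review queue
  recommend: async (evidence) =>
    evidence.map((e) => ({ action: `Follow up on: ${e.quote}`, priority: "medium", evidence_ids: [e.id] })),
};

runPipeline("Client requested transportation help for next appointment.", stubSteps)
  .then((result) => console.log(JSON.stringify(result, null, 2)));
```

The detail that matters is the signature of `recommend`: it cannot drift back to the raw notes because it never receives them.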

It’s not glamorous.

It won’t win any prompt engineering beauty contests.

But it survives contact with reality.

Final take

When a model hallucinates on case notes, my first question is no longer:

  • Should we switch models?
  • Should we increase the context window?
  • Should we write a smarter prompt?

My first question is:

Why did we ask one model call to do three jobs while hiding the evidence in the middle of a giant prompt?

If you fix that, a lot of the "hallucination" problem gets much less mysterious.

And if you’re running these workflows at scale, this is also where flat-rate infrastructure becomes practical, not just cheaper. The more your architecture improves, the more multi-step validation you’ll want. Standard Compute is built for exactly that kind of agent workload.

If your stack already uses an OpenAI-compatible client, it’s a very small implementation change to test this pattern.

That’s probably the most useful part of this whole idea:

The fix is boring.

Which means you can ship it this week.
