Paul SANTUS for AWS Community Builders

Posted on May 29

LLMs suck at generating large, structured data. Tips on how to get your AI agent to do it reliably

#ai #agents #llm

Accumulating state outside the token window

LLMs are great at generating text. They're terrible at generating structured data reliably. If you've ever tried to get an agent to produce a JSON object with a specific schema, you know the pain: missing fields, hallucinated keys, inconsistent types, and outputs that break your downstream pipeline.

As I moved past toy examples and labs to work on real, production-grade AI apps, I ran into this problem and found an approach that works remarkably well for an AI app I'm building: use tools the way you would use the object-oriented Builder pattern. Instead of asking the model to produce a final JSON blob, you give it tools that incrementally build the output - like calling methods on an object. The model never sees or produces the final structure directly. It just calls functions, and the structured output emerges as a side effect.

This matters especially when your agent processes large documents (like insurance forms, legal filings, medical records) that eat up most of the context window. When the input is big and the task is multi-step, you can't afford to also reserve space for a massive structured output at the end. The accumulator pattern lets you compress the conversation mid-flight without losing any of the structured data you've already collected, because that data lives outside the token window entirely.

Challenges

The "generate JSON" problem

The naive approach - asking a model to output a complete JSON structure - fails in predictable ways:

Schema drift. The model forgets required fields, invents new ones, or changes types between runs. A date field might be a string one time and an object the next.
All-or-nothing failure. If the model makes one mistake in a 200-line JSON output, the entire thing is unparseable. You either retry the whole generation or write brittle fixup code.
No incremental progress. If the model hits a context limit or stops mid-generation, you lose everything. There's no partial result to recover from.
Hallucination in structure. Models are more likely to hallucinate when producing structured output in one shot. They fill in fields they're uncertain about rather than leaving them empty, because the structure demands completeness.
Coupling research and output. When an agent needs to gather information and produce structured output, asking it to do both in one pass means it can't iterate. It commits to a structure before it has all the facts.
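
To make the all-or-nothing point concrete, here is a minimal sketch using only the standard library. The truncated payload and the `parse_or_none` helper are made up for illustration; they stand in for a generation that stops partway through.

```python
import json

# Hypothetical model output that was cut off mid-generation (illustrative data).
truncated = '{"parties": [{"name": "Jane Doe", "role": "claimant"}], "events": [{"descr'


def parse_or_none(raw: str):
    """Return the parsed object, or None if the payload is not valid JSON."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None


# One-shot generation: a single truncation discards the party that was already complete.
one_shot_result = parse_or_none(truncated)

# Incremental accumulation: each completed step is already stored when the run stops.
accumulated = {"parties": [], "events": []}
accumulated["parties"].append({"name": "Jane Doe", "role": "claimant"})
# ... the run stops here, before any event is recorded ...

print(one_shot_result)  # None
print(accumulated)      # the party recorded before the interruption is still there
```

The same interruption costs everything in the first case and nothing already recorded in the second.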

Why `response_format` and function-calling schemas aren't enough

Structured output modes (like OpenAI's response_format: json_schema or tool input schemas on Bedrock) help with syntax - you'll get valid JSON. But they don't solve the semantic problem. The model still has to produce the entire structure in one shot, and it still hallucinates content to fill required fields.
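
A small sketch of the gap between syntax and semantics, in plain Python. The payload and both helper functions are invented for illustration: the first check looks only at shape, roughly what a schema can promise, and the second looks at whether references actually resolve.

```python
# Illustrative payload: every field is present and correctly typed.
payload = {
    "evidence": [{"id": "EV-1", "kind": "police_report"}],
    "damages": [
        {"item": "Roof repair", "amount": 5000.0, "category": "property", "evidence_ref": "EV-7"},
    ],
}


def structurally_valid(doc: dict) -> bool:
    """Shape check only: the kind of guarantee a schema gives you."""
    return all(
        isinstance(d.get("item"), str)
        and isinstance(d.get("amount"), float)
        and isinstance(d.get("evidence_ref"), str)
        for d in doc["damages"]
    )


def dangling_refs(doc: dict) -> list[str]:
    """Semantic check: evidence references that point at nothing."""
    known = {e["id"] for e in doc["evidence"]}
    return [
        d["evidence_ref"]
        for d in doc["damages"]
        if d["evidence_ref"] and d["evidence_ref"] not in known
    ]


print(structurally_valid(payload))  # True
print(dangling_refs(payload))       # ['EV-7']
```

The document passes the shape check while citing evidence that was never registered, which is exactly the kind of error that only shows up downstream.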

A widespread issue

Any team building autonomous or semi-autonomous agents faces this, not just me. Kiro CLI, AWS's agentic dev companion, for instance, struggled hard with large data structures when first launched.

Since then, its maintainers have equipped its harness with JSON capabilities (jq manipulations, for instance) and multiple strategies (extensive use of grep, glob, tail...) to avoid filling the context window.

Still, happy to know I'm not alone in facing this :)

My solutions

Here are a few tricks I have used successfully to control both agent output and context window. I don't claim to have all the recipes, so don't hesitate to share your own in the comments or tag me in your own posts :)

Tools as Builder methods

The core idea: define tools that act like OOP builder methods. Each tool call adds one well-typed element to an accumulator. The model's job shifts from "produce this structure" to "call these functions in the right order."

Here's the pattern - imagine an agent that processes insurance claims by reading documents and building a structured claim assessment:

from strands import tool

# The accumulator - this is your structured output
claim_output = {
    "parties": [],
    "events": [],
    "damages": [],
    "evidence": [],
    "assessment": None,
}

def reset_output():
    claim_output["assessment"] = None
    for k in ["parties", "events", "damages", "evidence"]:
        claim_output[k] = []


@tool
def add_party(name: str, role: str, policy_id: str = "") -> str:
    """Register a party involved in the claim.

    Args:
        name: Full name of the person or organization.
        role: One of: claimant, insured, witness, adjuster, third_party
        policy_id: Policy number if applicable.

    Returns:
        Confirmation with party details.
    """
    if role not in ("claimant", "insured", "witness", "adjuster", "third_party"):
        return f"Error: invalid role '{role}'. Must be one of: claimant, insured, witness, adjuster, third_party"

    claim_output["parties"].append({
        "name": name,
        "role": role,
        "policy_id": policy_id,
    })
    return f"Added {role}: {name}"


@tool
def add_event(description: str, date: str, location: str = "") -> str:
    """Record a chronological event relevant to the claim.

    Args:
        description: What happened (1-3 sentences).
        date: ISO date string (YYYY-MM-DD).
        location: Where it happened (optional).
    """
    claim_output["events"].append({
        "description": description,
        "date": date,
        "location": location,
    })
    return f"Recorded event on {date} ({len(claim_output['events'])} events total)"


@tool
def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    """Register a damage item with estimated cost.

    Args:
        item: Description of the damaged item or cost.
        amount: Estimated cost in dollars.
        category: One of: property, medical, liability, lost_income
        evidence_ref: Reference to supporting evidence (optional).
    """
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."

    claim_output["damages"].append({
        "item": item,
        "amount": amount,
        "category": category,
        "evidence_ref": evidence_ref,
    })
    total = sum(d["amount"] for d in claim_output["damages"])
    return f"Added damage: {item} (${amount:.2f}). Running total: ${total:.2f}"

The agent is given these tools and a system prompt that tells it to process a claim. As it reads documents and discovers information, it calls add_party, add_event, and add_damage. The structured output builds up incrementally.
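
To see what that looks like from the outside, here is a sketch that replays a tool-call trace by hand. It uses plain-Python stand-ins for two of the tools (no @tool decorator, no SDK), and the trace itself is made-up data, not output from a real run.

```python
# Plain-Python stand-ins for two of the tools above.
claim_output = {"parties": [], "events": []}

VALID_ROLES = ("claimant", "insured", "witness", "adjuster", "third_party")


def add_party(name: str, role: str, policy_id: str = "") -> str:
    if role not in VALID_ROLES:
        return f"Error: invalid role '{role}'. Must be one of: {', '.join(VALID_ROLES)}"
    claim_output["parties"].append({"name": name, "role": role, "policy_id": policy_id})
    return f"Added {role}: {name}"


def add_event(description: str, date: str, location: str = "") -> str:
    claim_output["events"].append(
        {"description": description, "date": date, "location": location}
    )
    return f"Recorded event on {date} ({len(claim_output['events'])} events total)"


TOOLS = {"add_party": add_party, "add_event": add_event}

# Hypothetical sequence of tool calls, as a model might emit them.
trace = [
    ("add_party", {"name": "Jane Doe", "role": "claimant", "policy_id": "P-001"}),
    ("add_party", {"name": "John Roe", "role": "driver"}),       # invalid role, rejected
    ("add_party", {"name": "John Roe", "role": "third_party"}),  # corrected retry
    ("add_event", {"description": "Rear-end collision.", "date": "2024-03-02"}),
]

# The model only ever sees these short replies, never the accumulated structure.
replies = [TOOLS[name](**args) for name, args in trace]

for reply in replies:
    print(reply)
```

Four short replies go back to the model, while the accumulator ends up with two parties and one event.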

Validation at the boundary

Each tool call is a validation checkpoint. You can reject bad input immediately:

@tool
def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."
    if amount <= 0:
        return f"Error: amount must be positive, got {amount}."
    if evidence_ref and evidence_ref not in [e["id"] for e in claim_output["evidence"]]:
        return f"Error: evidence '{evidence_ref}' not registered. Call add_evidence first."
    # ...

The model gets instant feedback. If it tries to reference evidence it hasn't registered yet, the tool tells it. The model self-corrects on the next turn. Compare this to validating a 500-line JSON blob after the fact - by then, the model has moved on and can't fix its mistakes in context.
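
The check above relies on an add_evidence tool that isn't shown. Here is one way it could look, as a framework-free sketch: the parameter names (`evidence_id`, `kind`, `summary`) and the duplicate-id rule are my assumptions for illustration, and the functions are plain Python rather than decorated tools.

```python
claim_output = {"damages": [], "evidence": []}


def add_evidence(evidence_id: str, kind: str, summary: str) -> str:
    """Register a piece of evidence so damages can reference it (hypothetical tool)."""
    if any(e["id"] == evidence_id for e in claim_output["evidence"]):
        return f"Error: evidence '{evidence_id}' is already registered."
    claim_output["evidence"].append({"id": evidence_id, "kind": kind, "summary": summary})
    return f"Registered evidence {evidence_id}"


def add_damage(item: str, amount: float, category: str, evidence_ref: str = "") -> str:
    """Same checks as in the snippet above, followed by the append."""
    if category not in ("property", "medical", "liability", "lost_income"):
        return f"Error: invalid category '{category}'."
    if amount <= 0:
        return f"Error: amount must be positive, got {amount}."
    known_ids = {e["id"] for e in claim_output["evidence"]}
    if evidence_ref and evidence_ref not in known_ids:
        return f"Error: evidence '{evidence_ref}' not registered. Call add_evidence first."
    claim_output["damages"].append(
        {"item": item, "amount": amount, "category": category, "evidence_ref": evidence_ref}
    )
    return f"Added damage: {item} (${amount:.2f})"


# The order a model might stumble through: reference first, register, then retry.
first_try = add_damage("Roof repair", 5000.0, "property", evidence_ref="EV-1")
registered = add_evidence("EV-1", "invoice", "Contractor quote for roof repair")
second_try = add_damage("Roof repair", 5000.0, "property", evidence_ref="EV-1")

print(first_try)
print(registered)
print(second_try)
```

The error message names the tool to call next, which is what lets the model recover without any extra prompting.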

Separating research from output construction

A key benefit: the same agent can have reading tools and writing tools. Reading tools fetch and explore data. Writing tools construct the output. The model interleaves them naturally:

from strands import Agent

agent = Agent(
    model=model,
    system_prompt=prompt,
    tools=[
        # Reading tools
        read_document,
        search_policy,
        get_weather_report,
        # Writing tools (builder methods)
        add_party,
        add_event,
        add_damage,
        add_evidence,
        set_assessment,
        # Progress tracking
        mark_step_done,
    ],
)

# One call - the agent reads documents AND builds structured output
agent("Process this claim: " + claim_text)

# Output is ready
print(claim_output)

The model reads a police report, extracts a party, reads a medical bill, registers a damage item, cross-references the policy, and so on. Research and output construction are interleaved rather than sequential.

Progress tracking and recovery

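Because output accumulates incrementally, recovery from an interrupted run comes almost for free (as long as the Python process itself survives, or you persist the accumulator):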

STEPS = [
    "1. Identify all parties",
    "2. Establish timeline of events",
    "3. Catalog damages with evidence",
    "4. Cross-reference policy coverage",
    "5. Produce assessment",
]
completed_steps: list[int] = []

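@tool
def mark_step_done(step_number: int) -> str:
    """Mark a processing step as completed."""
    # Reject step numbers that don't exist, so the model can correct itself
    if not 1 <= step_number <= len(STEPS):
        return f"Error: unknown step {step_number}. Valid steps are 1 to {len(STEPS)}."
    # Marking the same step twice should not create duplicates
    if step_number not in completed_steps:
        completed_steps.append(step_number)
    remaining = [s for i, s in enumerate(STEPS, 1) if i not in completed_steps]
    return f"Step {step_number} done. Remaining: {', '.join(remaining) or 'none'}"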

If the agent hits a context window limit or errors out, you already have partial results - every party identified, every event recorded, every damage item cataloged up to that point. You can resume or use what you have.
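
If you also want to survive a process crash, the accumulator has to be written somewhere. Below is a minimal sketch of checkpointing it to a JSON file with the standard library. The helper names (`save_checkpoint`, `load_checkpoint`, `empty_output`) and the file layout are assumptions for illustration, not part of any SDK.

```python
import json
import tempfile
from pathlib import Path


def empty_output() -> dict:
    return {"parties": [], "events": [], "damages": [], "evidence": [], "assessment": None}


def save_checkpoint(path: Path, output: dict, completed_steps: list[int]) -> None:
    """Write accumulator and completed steps to a temp file, then rename it into place."""
    payload = {"output": output, "completed_steps": completed_steps}
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    tmp.replace(path)  # the previous checkpoint stays intact until the new one is complete


def load_checkpoint(path: Path) -> tuple[dict, list[int]]:
    """Return the saved state, or a fresh accumulator if nothing was saved yet."""
    if not path.exists():
        return empty_output(), []
    payload = json.loads(path.read_text(encoding="utf-8"))
    return payload["output"], payload["completed_steps"]


with tempfile.TemporaryDirectory() as workdir:
    checkpoint = Path(workdir) / "claim-checkpoint.json"

    # First run: one party recorded and step 1 done before the run is interrupted.
    output = empty_output()
    output["parties"].append({"name": "Jane Doe", "role": "claimant", "policy_id": "P-001"})
    save_checkpoint(checkpoint, output, [1])

    # Second run: start from what was saved instead of from scratch.
    restored_output, restored_steps = load_checkpoint(checkpoint)

print(restored_steps)
print(restored_output["parties"])
```

In a real agent you would call the save function at the end of each writing tool, and on restart tell the model which steps are already done.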

Context management with state injection

Here's where this pattern really pays off. When your agent ingests a 30-page document and then makes dozens of tool calls to fetch additional sources, the context window fills up fast. In a traditional approach, you'd lose your structured output along with the conversation when you hit the limit. But because the accumulator lives in Python memory - not in the message history - you can aggressively compress the conversation without losing a single data point.

A custom conversation manager (a possibility offered, for instance, by the Strands Agents SDK) replaces old messages with a compact state summary derived from the accumulator:

from strands.agent.conversation_manager import ConversationManager

class ClaimConversationManager(ConversationManager):
    def apply_management(self, agent, **kwargs):
        messages = agent.messages
        if len(messages) <= 2:
            return

        # Keep first message + last 2 messages
        # Replace everything in between with a state summary
        first_msg = messages[0]
        recent = messages[-2:]

        state = self._build_state_summary()
        state_msg = {
            "role": "user",
            "content": [{"text": f"[STATE]\n{state}\n\nContinue."}],
        }
        messages[:] = [first_msg, state_msg] + recent

    def _build_state_summary(self) -> str:
        """Summarize what's been done using the accumulator state."""
        lines = []
        if claim_output["parties"]:
            parties = [f"{p['name']} ({p['role']})" for p in claim_output["parties"]]
            lines.append(f"Parties: {', '.join(parties)}")
        if claim_output["damages"]:
            total = sum(d["amount"] for d in claim_output["damages"])
            lines.append(f"Damages: {len(claim_output['damages'])} items, ${total:.2f} total")
        if claim_output["events"]:
            lines.append(f"Events: {len(claim_output['events'])} recorded")
        return "\n".join(lines)

Because the structured output lives in Python (not in the conversation), context compression doesn't lose any data. The model can always see what it's already produced by reading the state summary.
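
Here is the same idea stripped of any SDK, as a pure function over a list of message dicts, so the behavior is easy to check in isolation. The function names and the `keep_recent` parameter are mine, and the message shape simply mirrors the one used above.

```python
claim_output = {
    "parties": [{"name": "Jane Doe", "role": "claimant"}],
    "events": [],
    "damages": [],
}


def build_state_summary(output: dict) -> str:
    """Derive a compact summary from the accumulator, not from the conversation."""
    parties = ", ".join(f"{p['name']} ({p['role']})" for p in output["parties"])
    return f"Parties: {parties}\nEvents: {len(output['events'])} recorded"


def compress_messages(messages: list[dict], state_summary: str, keep_recent: int = 2) -> list[dict]:
    """Return first message + state message + the most recent messages."""
    if len(messages) <= keep_recent + 1:
        return list(messages)  # nothing worth compressing yet
    state_msg = {
        "role": "user",
        "content": [{"text": f"[STATE]\n{state_summary}\n\nContinue."}],
    }
    return [messages[0], state_msg] + messages[-keep_recent:]


# Made-up history: the initial request followed by nine more messages.
history = [{"role": "user", "content": [{"text": "Process this claim"}]}]
for i in range(1, 10):
    role = "assistant" if i % 2 else "user"
    history.append({"role": role, "content": [{"text": f"message {i}"}]})

compressed = compress_messages(history, build_state_summary(claim_output))
print(len(history), "->", len(compressed))
```

One caveat: depending on the model provider, a cut that separates a tool call from its result may be rejected, so a real implementation may need to choose the cut point more carefully than "last two messages".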

Benefits

Type safety without type coercion

Each tool has typed parameters, and the framework derives the tool's input schema from them. The model must provide a category that's one of property, medical, liability, lost_income - not because you're parsing JSON and checking after the fact, but because the tool checks the value at call time. Invalid calls get rejected with clear error messages.
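
One way to keep the signature and the runtime check in sync is to declare the allowed values once with typing.Literal and read them back with get_args. This is a sketch in plain Python: the `DamageCategory` alias is my own name, and whether a given framework turns the Literal into an enum in the tool schema is something to verify for your SDK version.

```python
from typing import Literal, get_args

# Single source of truth for the allowed categories (alias name is illustrative).
DamageCategory = Literal["property", "medical", "liability", "lost_income"]

damages: list[dict] = []


def add_damage(item: str, amount: float, category: DamageCategory) -> str:
    """Append a damage item if the category is one of the allowed values."""
    allowed = get_args(DamageCategory)
    # Type hints are not enforced at runtime, so the tool still checks explicitly.
    if category not in allowed:
        return f"Error: invalid category '{category}'. Must be one of: {', '.join(allowed)}"
    damages.append({"item": item, "amount": amount, "category": category})
    return f"Added damage: {item} (${amount:.2f})"


rejected = add_damage("Paint touch-up", 120.0, "cosmetic")
accepted = add_damage("Roof repair", 5000.0, "property")

print(rejected)
print(accepted)
```

Adding a category then means editing one line, and the error message updates itself.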

Composability

Tools compose naturally. You can add new output fields by adding new tools without changing existing ones. Want to track evidence attachments? Add an add_evidence tool. Want a final recommendation? Add a set_assessment tool. The model discovers new capabilities through its tool list.
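
As an illustration, a set_assessment tool could look roughly like this. Everything here is hypothetical: the decision values, the field names, and the rule that an assessment requires at least one party and one damage item are choices I made for the sketch, written as a plain function rather than a decorated tool.

```python
claim_output = {"parties": [], "damages": [], "assessment": None}

VALID_DECISIONS = ("approve", "deny", "escalate")  # illustrative values


def set_assessment(decision: str, rationale: str) -> str:
    """Set the final assessment once the rest of the claim has been built (hypothetical tool)."""
    if decision not in VALID_DECISIONS:
        return f"Error: invalid decision '{decision}'. Must be one of: {', '.join(VALID_DECISIONS)}"
    if not claim_output["parties"] or not claim_output["damages"]:
        return "Error: register at least one party and one damage item before assessing."
    total = sum(d["amount"] for d in claim_output["damages"])
    claim_output["assessment"] = {
        "decision": decision,
        "rationale": rationale,
        "total_claimed": total,
    }
    return f"Assessment set: {decision} (total claimed ${total:.2f})"


too_early = set_assessment("approve", "Looks fine.")

claim_output["parties"].append({"name": "Jane Doe", "role": "claimant"})
claim_output["damages"].append({"item": "Roof repair", "amount": 5000.0, "category": "property"})
final = set_assessment("approve", "Damage is documented and covered by the policy.")

print(too_early)
print(final)
```

None of the existing tools had to change; the new one only reads what they have already accumulated.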

Testability

Each tool is a small function whose only side effect is appending to the accumulator. You can unit test them independently:

def test_add_damage_rejects_invalid_category():
    reset_output()
    result = add_damage(item="Roof repair", amount=5000, category="cosmetic")
    assert "Error" in result
    assert len(claim_output["damages"]) == 0

def test_add_damage_tracks_total():
    reset_output()
    add_damage(item="Roof repair", amount=5000, category="property")
    add_damage(item="Water damage", amount=2000, category="property")
    assert len(claim_output["damages"]) == 2
    assert sum(d["amount"] for d in claim_output["damages"]) == 7000

Deterministic output schema

The output schema is defined by your Python code, not by the model's interpretation of a prompt. claim_output always has the same keys with the same types. Downstream consumers can rely on the structure unconditionally.

Graceful degradation

If the model runs out of context or hits an error, you have everything it produced up to that point. You can even detect empty output and retry with a nudge:

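import logging

try:
    agent(claim_text)
except Exception:
    # Keep whatever the accumulator already holds, but don't hide the failure
    logging.exception("Agent run failed; continuing with partial output")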

if not claim_output["parties"] and not claim_output["events"]:
    agent("You haven't started processing. Begin by identifying the parties involved.")
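
If you nudge, it's worth bounding the number of attempts so a stuck agent can't loop forever. Here is a sketch with a fake agent so it runs without any SDK. The `run_with_nudges` helper, its `max_attempts` parameter, and the simulated failure are assumptions for illustration.

```python
import logging

logger = logging.getLogger("claims")

NUDGE = "You haven't started processing. Begin by identifying the parties involved."


def run_with_nudges(agent, prompt: str, output: dict, max_attempts: int = 3) -> int:
    """Call the agent, nudging it while the accumulator is empty. Returns attempts used."""
    message = prompt
    for attempt in range(1, max_attempts + 1):
        try:
            agent(message)
        except Exception:
            logger.exception("Agent run %d failed; keeping partial output", attempt)
        if output["parties"] or output["events"]:
            return attempt
        message = NUDGE
    return max_attempts


# Fake agent for the demo: fails on the first call, records a party on the second.
claim_output = {"parties": [], "events": []}
calls: list[str] = []


def fake_agent(message: str) -> None:
    calls.append(message)
    if len(calls) == 1:
        raise RuntimeError("simulated context overflow")
    claim_output["parties"].append({"name": "Jane Doe", "role": "claimant"})


attempts_used = run_with_nudges(fake_agent, "Process this claim: (claim text)", claim_output)
print(attempts_used, len(claim_output["parties"]))
```

The failed first attempt is logged rather than swallowed, and the second attempt starts from the nudge instead of the original prompt.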

Natural agent behavior

The model doesn't have to context-switch between "thinking" and "formatting." It thinks by calling tools. The structured output is a byproduct of the agent doing its job, not an additional formatting burden layered on top.

This pattern - tools as Builder, accumulator as output, validation at the boundary - has been the most reliable way I've found to get structured data out of an agentic workflow. It works because it aligns with how tool-calling models already behave: they reason, they act, they observe results, and they act again. You're just making "act" mean "build one piece of the output."

Top comments (9)

Leo Pessoa • May 30

The insight about keeping the accumulator outside the token window is the one most developers get wrong first — once partial JSON leaks back into context, the model starts describing what the JSON should look like instead of calling the builder tool. The separation of concerns you've outlined here (tools as typed constructors, state as an external accumulator, validation at each boundary) is essentially a hand-rolled version of what schema-first frameworks try to enforce automatically: the schema defines the output contract, and the model's only job is to satisfy typed method calls, never to serialize the final structure directly. I'm building something in this space (exomodel.ai) that applies this pattern at the library level — your article articulated the why better than most framework documentation does.

Paul SANTUS AWS Community Builders • Jun 1

Great! Glad I could help!

Harjot Singh • May 31

The Builder-pattern approach is the right instinct and it generalizes into a principle worth stating: don't ask the model for the whole artifact, ask it for the next valid step. A 200-field JSON blob fails because every token is a chance to drop a field, hallucinate a key, or break a type, and one bad token corrupts the whole output. Incremental tool calls turn one fragile high-stakes generation into many small ones, each individually validatable, and crucially the schema does the enforcing instead of the model's good intentions, an invalid step just can't be taken. That's the same reason structured-output/constrained-decoding helps: you move correctness from "the model remembered the schema" to "the schema is mechanically enforced." The deeper lesson is the recurring one with LLMs, decompose until each unit is verifiable, then verify each unit, rather than trusting a big atomic output. That's core to how I build generation in Moonshift. Did you pair the Builder tools with hard schema validation on each call, or does the model still occasionally call a builder step with bad args and you catch it downstream?

Paul SANTUS AWS Community Builders • Jun 1

Each "builder method" validates its own schema and prompts the LLM to fix their mistake when one occurs. So far it's working quite well 😀

AudioProducer.ai • May 29

This maps onto a different AI-output class very cleanly. I work on AudioProducer.ai (long-form text-to-audio for books); early on we tried to get the LLM to return the whole annotated chapter as one structured JSON blob (per-line speaker map, per-paragraph soundscape annotation, per-line emotion tag), and on a 5,000-word chapter it would silently drop the last third of lines or invent characters that were never registered. We moved to builder-style tools (add_speaker_assignment, add_soundscape, add_emotion_tag) with validation-at-the-boundary identical to your example: character must already be in the registered list, soundscape mood must be in a fixed set, emotion tag must match the line position. The state-injection conversation manager turned out to be the load-bearing move for book-scale work, because chapter ingestion otherwise blows the context window on the LLM side before annotation finishes. The accumulator-outside-the-conversation framing also gives us crash recovery for free on long chapters, same as your claims example. Curious whether your accumulator design supports a model-proposed vs writer-confirmed state per item, so downstream tooling can read the pair as a review queue rather than committing the proposal as final. That distinction has been the most-edited UX surface on our side.

Paul SANTUS AWS Community Builders • Jun 1

Hi,

Thanks for sharing your experience!
Could you please explain a little bit more your question on "model-proposed vs writer-confirmed state per item" so I can try to provide an accurate answer. I'm not sure I get it.

Xidao • Jun 2

The builder-style accumulator is a much better abstraction than asking the model to emit one giant JSON blob. Once the model's job becomes "add one typed fact at a time" instead of "serialize the final world state perfectly," retries become local and you stop paying the all-or-nothing penalty on every malformed field.

The part that made the biggest difference for me in similar systems was adding idempotency and validation at the tool layer, not just type hints. For example, tool calls like add_party/add_event often need dedupe keys, confidence thresholds, and merge rules, otherwise a model that revisits evidence will happily append near-duplicates forever. The accumulator pattern fixes syntax fragility, but you still need a data-quality contract around each mutation.

Another nice side effect is that it makes human review and replay much cleaner. If you persist the sequence of tool calls, you can diff two runs, re-run validation on just the mutated fields, and inspect where the agent's reasoning drifted without re-tokenizing the whole document. That observability piece is usually what turns a clever pattern into something operational.

NRNST Tech Pvt. Ltd. • Jun 3

This is one of those posts I wish I'd found six months ago — would have saved me a lot of frustrated debugging sessions.
I hit the exact same wall with the "generate the whole JSON in one shot" approach. The schema drift issue you described is painfully real — fields randomly changing types between runs is not something you can easily catch until it breaks something downstream. Your framing of it as an "all-or-nothing failure" is spot on, and I think that's the part that doesn't get talked about enough.
The Builder pattern idea genuinely clicked for me. It's such a natural fit for how tool-calling models already work — they're used to calling things one step at a time anyway. Forcing them to also serialize a full structure at the end is honestly just asking for trouble.
What really stood out for me personally was the context management piece. The idea of keeping the accumulator outside the token window entirely and injecting a state summary when context fills up — that's the kind of practical, production-thinking insight that most tutorials just skip over completely.
The separation between reading tools and writing tools is also a pattern I'll be borrowing immediately. Mixing research and output construction in one pass was something I never questioned, and now I can't unsee why it's a problem.
Quick question: when you have nested or dependent structures (e.g., a damage item referencing an evidence item that may or may not exist yet), do you generally let the model retry on its own after the validation error, or do you build in a fallback prompt nudge after N failed attempts? Curious how that plays out in practice on longer runs.
Great writeup — this is the kind of real-world pattern post that the AI dev space badly needs more of.

Paul SANTUS AWS Community Builders • Jun 3

Thanks.
So far the model is quite good at fixing its mistakes if the error message is clear.

Nevertheless, I still have guardrails on the number of turns and tool calls to avoid infinite loops.
