Most structured-extraction tutorials look the same. Take a document, write one big prompt that says "extract A, B, C, D, E, F", get JSON back. Done.
This works on short inputs.
It quietly breaks on long ones.
After running this in production for a while, I stopped doing it. Here's what I switched to and why.
## The fat prompt problem
Say you have a 50-page report and you want a structured summary out of it. The natural first move is something like:
```
Extract:
- title
- sections (with headings)
- purpose
- mentioned services
- acceptance criteria
- ...

Return JSON in this shape: { ... }
```
You hand the whole document to the model. It returns JSON. It looks fine on the first try.
Then you scale it up and three things happen:
- Quality drifts. The model "forgets" mid-document. Later sections are summarized worse than earlier ones, or fields go missing.
- One bad field poisons the whole call. If "acceptance criteria" hallucinates, you don't just lose that field — the whole record gets quarantined for review.
- Latency goes up, parallelism goes down. A single 30k-token call takes what it takes. You can't shard it.
You can fight this with longer prompts, more examples, stricter formatting rules. I did. It buys you maybe 10% more reliability and costs you a lot of prompt-engineering time.
The structural problem doesn't go away.
## What I do now: split it
The pattern I use looks like an accordion that expands:
```
[ document ]
      │
      ▼
[ Stage 1: segment ]      ← one prompt, one job: produce a list
      │
      ▼
[ array of segments ]
      │
      ▼  (fan out)
[ Stage 2: extract ]      ← one prompt, runs per segment
      │
      ▼
[ structured records ]
```
Stage 1 reads the whole document and returns a clean array of segments — sections, paragraphs, line items, whatever the right unit is for the task.
Stage 2 takes one segment at a time and extracts the structured fields you actually want.
Two prompts, each doing one thing.
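In code, the whole pattern is two small functions. A minimal sketch in Python, assuming a generic `call_llm(system_prompt, text)` helper that wraps whatever model API you're using and returns parsed JSON (the helper, the prompt wording, and the file name are placeholders, not a specific client):

```python
def call_llm(system_prompt: str, text: str) -> dict:
    """Placeholder: wrap your real model client here and parse its JSON reply."""
    raise NotImplementedError

def segment(document: str) -> list[dict]:
    # Stage 1 has one job: find the boundaries and return them as a list.
    result = call_llm(
        "Split this document into logical sections. Return JSON: "
        '{"sections": [{"section_title": "...", "section_text": "..."}]}',
        document,
    )
    return result["sections"]

def extract(section: dict) -> dict:
    # Stage 2 has one job: turn a single segment into the target schema.
    return call_llm(
        "From this section, extract: purpose, mentioned services, acceptance criteria. "
        'Return JSON: {"purpose": "...", "mentioned_services": [], "acceptance_criteria": []}',
        section["section_text"],
    )

records = [extract(s) for s in segment(open("report.txt").read())]
```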
## Why this works better
**Each prompt has a single job.**
Stage 1 is "find the boundaries". Stage 2 is "extract the schema". Neither prompt has to hold both ideas at once. You can write each one tightly. Examples are shorter and more on-point.
**Errors localize.**
If Stage 2 fails on segment 7, you re-run segment 7. You don't redo the whole document. Bad fields get isolated to one record instead of contaminating the whole batch.
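The retry loop only ever touches the failing segment. A sketch, building on the hypothetical `extract` function above, with a deliberately minimal schema check standing in for whatever validation your pipeline actually does:

```python
def validate(record: dict) -> None:
    # Minimal schema check; swap in your real validation.
    for field in ("purpose", "mentioned_services", "acceptance_criteria"):
        if field not in record:
            raise ValueError(f"missing field: {field}")

def extract_with_retry(section: dict, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            record = extract(section)  # the Stage 2 call from the sketch above
            validate(record)
            return record
        except Exception:
            if attempt == attempts - 1:
                raise  # quarantine this one record, not the whole batch
```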
**Stage 2 parallelizes naturally.**
The output of Stage 1 is an array. Fan it out. Run 50 small extractions in parallel instead of one big one. Total wall-clock time drops, and so does the variance.
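Since the calls are I/O-bound, a plain thread pool is enough. Continuing the same hypothetical sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_all(sections: list[dict], workers: int = 50) -> list[dict]:
    # One small call per segment, up to `workers` in flight at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_with_retry, sections))
```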
**Cache hits go up.**
If the same segment shows up twice (templates, standard headers, repeated forms), Stage 2 sees the same input and you can cache. The fat-prompt version sees the entire document as one unique input every time.
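A content-addressed cache is enough to exploit that: key on a hash of the segment text, and repeated boilerplate gets extracted once. A sketch, again reusing the hypothetical helpers above:

```python
import hashlib

_cache: dict[str, dict] = {}

def extract_cached(section: dict) -> dict:
    # Identical segment text -> identical key -> no second LLM call.
    key = hashlib.sha256(section["section_text"].encode()).hexdigest()
    if key not in _cache:
        _cache[key] = extract_with_retry(section)
    return _cache[key]
```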
**Long documents stop being scary.**
The hard limit on a fat prompt is the model's context window. The accordion pattern doesn't have that ceiling. Stage 1 still has to read the whole document, but its output is small. Stage 2 only ever sees one segment.
## What it costs
It's not free.
You're making more LLM calls: N + 1 (one Stage 1 call plus one Stage 2 call per segment) instead of a single call. On short inputs that's wasteful. The accordion pattern is for documents long enough that fat prompts start failing, not for two-paragraph emails.
You also need to think a little harder about what a "segment" is for your task. Sometimes it's a section heading. Sometimes it's a row in a table. Sometimes it's a logical unit that doesn't map to any visible boundary. That's a design decision and it matters.
## When to use it
Reach for the accordion when:
- The document is long enough that you've seen the model lose the thread mid-way.
- The output schema has more than ~5 fields and they don't all care about the same context.
- You need to retry failed records without redoing successful ones.
- You want parallelism.
Stick with one fat prompt when:
- The input is short and the schema is small.
- The fields are tightly coupled (extracting one needs context from another).
- You're prototyping and don't care yet.
## A small concrete example
I run this on a service called StructFlow. The shape of the calls is roughly:
```bash
# Stage 1: segment
curl -X POST https://gw.ldxhub.io/structflow/jobs \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "system_prompt": "Split this document into logical sections. Return one JSON record per section.",
    "example_output": { "section_title": "...", "section_text": "..." },
    "inputs": [{ "id": "doc1", "data": { "text": "..." } }]
  }'
```
The response gives you back an array. Then Stage 2:
```bash
# Stage 2: extract (one call per segment, run in parallel)
curl -X POST https://gw.ldxhub.io/structflow/jobs \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemini-3-flash-preview",
    "system_prompt": "From this section, extract: purpose, mentioned services, acceptance criteria.",
    "example_output": { "purpose": "...", "mentioned_services": [], "acceptance_criteria": [] },
    "inputs": [{ "id": "sec1", "data": { "section_text": "..." } }]
  }'
```
Two calls, each focused. One returns segments. The other turns each segment into structured fields.
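If you'd rather not shell out to curl, here's rough Python glue for the same two calls. Two loud caveats: the service runs jobs asynchronously, so a real client would poll for completion (I've collapsed that into a plain POST to keep the shape visible), and the `output` key on the response is my placeholder, so check the real response shape before copying this:

```python
import os
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://gw.ldxhub.io/structflow/jobs"
HEADERS = {"Authorization": f"Bearer {os.environ['KEY']}",
           "Content-Type": "application/json"}

def run_job(system_prompt: str, example_output: dict, inputs: list[dict]) -> list[dict]:
    resp = requests.post(URL, headers=HEADERS, json={
        "model": "google/gemini-3-flash-preview",
        "system_prompt": system_prompt,
        "example_output": example_output,
        "inputs": inputs,
    })
    resp.raise_for_status()
    return resp.json()["output"]  # placeholder key; check the real response shape

document_text = open("report.txt").read()  # placeholder input

# Stage 1: whole document in, array of sections out.
sections = run_job(
    "Split this document into logical sections. Return one JSON record per section.",
    {"section_title": "...", "section_text": "..."},
    [{"id": "doc1", "data": {"text": document_text}}],
)

# Stage 2: fan out, one small extraction per section, in parallel.
with ThreadPoolExecutor(max_workers=20) as pool:
    records = list(pool.map(
        lambda s: run_job(
            "From this section, extract: purpose, mentioned services, acceptance criteria.",
            {"purpose": "...", "mentioned_services": [], "acceptance_criteria": []},
            [{"id": "sec", "data": {"section_text": s["section_text"]}}],
        )[0],  # one input in, one record out
        sections,
    ))
```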
That's the whole pattern.
## Why I'm posting this
I built LDX hub partly to make this pattern easy to run — one API, async jobs, file-based input/output so Stage 1's output is directly usable as Stage 2's input. But the pattern itself doesn't depend on any specific tool. You can do it with raw OpenAI calls, Anthropic calls, anything that takes a prompt and returns text.
The takeaway isn't "use my API". It's: if your structured extraction is getting flaky on long inputs, the answer probably isn't a longer prompt. It's two prompts.
If you've tried something similar — or if you've got a case where this falls apart — I'd be curious to hear it.