Łukasz Blania

Per-Section Briefs: How to Stop AI Agents Losing the Plot at 2000 Words

In March 2025 I wired up my first long-form content agent. One prompt. A 2000-word target. A list of seven H2s pasted in at the top. The first 600 words read fine. By word 1200 the model had quietly forgotten the original thesis. By word 1800 it was paraphrasing the introduction back at me, just with worse vocabulary.

That article never shipped.

TLDR

Instead of asking an LLM to write a 2000-word article in one shot, give each H2 section its own brief, generate the sections in parallel, then stitch the result together. The model only has to hold one section at a time. You can regenerate any failing section without burning the entire piece.

I have shipped over 50,000 long-form articles through this pattern in production for Articfly. Every time I have tried to skip a step and collapse back to a single prompt, output quality dropped. Here is the full pattern and the failure modes it actually solves.

Why single-prompt long-form fails

What goes wrong when you push past 1000 to 1500 words in a single completion:

  • The model loses its grip on early context. Anything in the first 30 percent of the system prompt drifts out of working attention.
  • Internal repetition explodes. Without a section-level scope, the model rehashes earlier points to fill the word budget.
  • Transitions degrade. The model strings paragraphs together with generic connectors because it has no scoped goal for the current section.
  • Errors compound. If section three is weak, section four often inherits the same weakness because the model treats its own earlier text as ground truth.

This is not a model-size problem. I tested the same single-prompt setup with GPT-4o, Claude Sonnet, and Gemini 1.5 Pro. All three drifted past the 1200-word mark. The mechanism is structural, not capability-based. You can throw more parameters at it and the failure mode moves from word 1200 to word 1500. It does not go away.

The pattern

Three moving parts:

  1. An outline pass that produces a list of H2 sections, each with a target claim, supporting evidence, and a one-line transition
  2. A section pass that writes one H2 at a time using only the brief and the article-level voice profile
  3. A stitch pass that joins the sections, fixes paragraph-level transitions, and runs a final consistency check

Three passes per article instead of one call (the section pass fans out into one call per H2). Sometimes four if you add a fact-check loop. The token cost is only about 50 percent higher than the single-prompt approach because each call is shorter and tighter. You pay for the structure, not for raw token volume.
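
Here is a minimal sketch of how the three passes compose. generateOutline stands in for the outline call covered in step 1; writeSection and stitch are defined further down:

```js
// Minimal composition sketch. generateOutline is a stand-in for the
// outline call in step 1; writeSection and stitch appear below.
async function generateArticle(topic, keyword, voiceProfile) {
  const outline = await generateOutline(topic, keyword, voiceProfile); // pass 1
  const sections = await Promise.all(                                  // pass 2, parallel
    outline.sections.map(s => writeSection(s, voiceProfile))
  );
  return stitch(sections, outline);                                    // pass 3
}
```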

Step 1: The outline

The outline is the most important call in the pipeline. Get this right and the rest is mechanical.

Outline prompt template:

```
Topic: <topic>
Primary keyword: <keyword>
Target length: 2000 words
Voice profile: <attached>

Produce an outline with 6 to 8 H2 sections.
For each section provide:
- title (under 60 chars)
- claim (one sentence, what this section argues)
- evidence (2 to 3 bullets, what supports the claim)
- transition (one sentence, how this section leads into the next)
- target_words (integer, must sum to 1800 to 2100)

Output strict JSON.
```

The key constraints are the claim per section and the target_words sum. Without a per-section claim, the model invents one mid-write. Without a word budget, sections drift to 400 words each and the article overshoots to 3200.

Require strict JSON output and validate it with Zod or Pydantic. Reject and retry if the word sum is off by more than 10 percent, or if any section is missing a claim. I do not retry the outline silently: if it fails validation twice in a row, the pipeline hard-fails and I see it in the logs. A bad outline poisons everything downstream.
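
A minimal sketch of that validation in Zod. The schema names are mine, not a fixed API; the bounds come from the prompt template above:

```js
import { z } from "zod";

const Section = z.object({
  title: z.string().max(60),
  claim: z.string().min(1),
  evidence: z.array(z.string()).min(2).max(3),
  transition: z.string().min(1),
  target_words: z.number().int().positive(),
});

const Outline = z
  .object({ sections: z.array(Section).min(6).max(8) })
  .refine(
    (o) => {
      const sum = o.sections.reduce((acc, s) => acc + s.target_words, 0);
      return sum >= 1800 && sum <= 2100; // reject a broken word budget
    },
    { message: "target_words must sum to 1800-2100" }
  );

// usage: Outline.parse(JSON.parse(outlineResponse)) throws on a bad outline
```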

Step 2: Write the sections

For each H2 in the outline, fire a section call. These can run in parallel because they do not depend on each other.

```js
// `llm` is the shared client wrapper used throughout this post, and
// buildSectionPrompt fills in the section prompt template shown below.
async function writeSection(brief, voiceProfile) {
  const prompt = buildSectionPrompt(brief, voiceProfile);
  return llm.complete({
    model: "claude-sonnet-4-6",
    system: prompt,
    max_tokens: 1500, // hard cap: one runaway section cannot eat the context
    temperature: 0.7,
  });
}

// Sections do not depend on each other, so fan them out in parallel.
const sections = await Promise.all(
  outline.sections.map(s => writeSection(s, voiceProfile))
);
```

The section prompt:

```
You are writing H2 section "<title>" of a longer article on <topic>.
The full article will be 2000 words. This section targets <target_words> words.

Section claim: <claim>
Supporting evidence: <evidence>
Transition out: <transition>

Voice profile:
<voice>

Constraints:
- Do not restate the article introduction
- Do not preview future sections
- Open with the claim or a concrete example, not a generic setup line
- End with the transition or a sentence that sets it up
- Use first person if the voice profile uses first person
- No headers inside the section (the H2 is handled at stitch time)

Write the section now.
```

The "do not restate" and "do not preview" lines are load-bearing. Without them, every section starts with "In this article we will explore" and ends with "Next we will look at". The model wants to be helpful in a way that breaks long-form structure. You have to say no.

Step 3: Stitch

The stitch pass is short. It takes all sections, joins them, and runs one more LLM call to fix paragraph-level transitions and remove duplicate phrases.

```js
// Same client wrapper and model as the section pass. Low temperature:
// this call edits, it does not write.
async function stitch(sections, outline) {
  const joined = sections.map((s, i) =>
    `## ${outline.sections[i].title}\n\n${s}`
  ).join("\n\n");

  return llm.complete({
    model: "claude-sonnet-4-6",
    system: [
      "Fix paragraph transitions between sections.",
      "Do not rewrite. Do not change facts.",
      "Remove duplicate phrases that appear across sections.",
      "Replace generic connector words with concrete prose.",
    ].join("\n"), // joining lines keeps stray indentation out of the prompt
    user: joined,
    max_tokens: 4000,
    temperature: 0.3,
  });
}
```

Low temperature here on purpose. The stitch call is an editor, not a writer. If you let it ride at 0.7 it will start adding new content and the article inflates past target length.

The voice profile

The voice profile is what stops per-section calls from sounding generic. I generate it once per customer by scraping 10 to 20 of their published articles and extracting patterns:

  • Average sentence length
  • Vocabulary level (Flesch reading ease)
  • First person versus third person
  • Common rhetorical moves (anecdote opener, contrarian claim, structured list, war story)
  • Banned words specific to the brand

The voice profile gets injected into every section call. Without it, each section sounds like a different writer wrote it. The stitch pass cannot fix that drift; you can only prevent it upstream. You cannot edit voice into existence at the end.
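
For a sense of what gets injected, here is an illustrative shape. The field names and values are placeholders I made up to mirror the bullets above, not Articfly's actual schema:

```js
// Illustrative only: placeholder fields, not the real extraction schema.
const voiceProfile = {
  avg_sentence_length: 16,
  flesch_reading_ease: 62,
  person: "first",
  rhetorical_moves: ["anecdote opener", "contrarian claim", "war story"],
  banned_words: ["delve", "leverage", "game-changer"],
};
```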

Voice extraction warrants its own article. I will not cover the extraction prompt here, only flag that it is required for any pipeline that wants outputs to read like one writer.

Tips that matter in production

  1. Cache the outline. If the user regenerates one section, do not redo the outline. Hash the topic plus keyword plus voice profile and reuse the cached outline (see the sketch after this list).
  2. Run section generations in parallel. Latency drops from 90 seconds (sequential) to under 30 (parallel).
  3. Validate section word counts. If a section comes back at 50 percent of its target, regenerate only that section.
  4. Keep the stitch model the same as the section model. Mixing models at stitch time changes the voice subtly and readers notice.
  5. Log each section call separately. When a customer complains about an article, you can trace exactly which section drifted and why.
  6. Token budget the section call hard. If you leave max_tokens unbounded, the model fills the context with one runaway section.
  7. Strip generic closers like "in summary" with a regex after stitch. The model still inserts them about 15 percent of the time even when told not to.
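
A minimal sketch of the cache key from tip 1. The storage layer behind it (Redis, Postgres, in-memory) is your choice:

```js
import { createHash } from "node:crypto";

// Same topic + keyword + voice profile => same key, so regenerating one
// section reuses the cached outline instead of re-running pass 1.
function outlineCacheKey(topic, keyword, voiceProfile) {
  return createHash("sha256")
    .update(JSON.stringify({ topic, keyword, voiceProfile }))
    .digest("hex");
}
```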

When NOT to use this pattern

Per-section briefs add real complexity. Three passes instead of one call. Outline validation logic. Section retry logic. A stitch pass. Section storage so you can recover from partial failure. For some content types this is wasted engineering.

Skip this pattern when:

  • You are generating under 800 words. Single-prompt handles short-form fine.
  • The content is templated (job posts, product descriptions, FAQ entries). A template plus variable injection beats both single-prompt and per-section.
  • You need sub-five-second latency. An outline pass, a section pass, and a stitch pass in sequence will not hit that, even with the parallelized middle step.
  • A human editor reviews every output anyway. The marginal quality gain does not survive a full editor pass.
  • The output is low-quality-bar SEO filler. Per-section briefs do not save bad input topics.

I run a separate pipeline for short-form (under 600 words) that is a single call with a tight brief. The agentic pattern is not free. Treating every output the same is its own kind of one-size-fits-all problem.

Stack notes

For anyone wiring this up from scratch:

  • Outline pass: any frontier model works. I use Claude Sonnet because the structured JSON output is reliable across temperature settings
  • Section pass: parallel calls, Claude Sonnet or GPT-4o, temperature 0.6 to 0.8
  • Stitch pass: same model as the section pass, temperature 0.2 to 0.3
  • Validation: Zod (TypeScript) or Pydantic (Python) on the outline JSON
  • Retry policy: 2 attempts per section with exponential backoff (see the sketch after this list). Fail-fast on the outline pass: failed validation surfaces as a hard fail, never a silent retry
  • Storage: I write each section to Postgres as it completes. If the stitch step fails, I can resume without re-running section calls
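
A minimal sketch of that per-section retry policy. The delay constants are my assumption, not a number from production data:

```js
// Two attempts with exponential backoff, per the policy above.
async function withRetry(fn, attempts = 2, baseMs = 1000) {
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i === attempts - 1) throw err;                        // out of attempts
      await new Promise((r) => setTimeout(r, baseMs * 2 ** i)); // 1s, 2s, ...
    }
  }
}

// usage: const section = await withRetry(() => writeSection(brief, voiceProfile));
```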

Total cost per 2000-word article runs about $0.04 to $0.08 depending on model. The single-prompt version cost $0.03 to $0.05, so the per-section pattern costs roughly 50 percent more in tokens. That quality difference is what shipped 50,000 articles. I will eat the 50 percent.

Conclusion

The article that drifted back in March 2025 did eventually ship. After I rewrote the pipeline around per-section briefs, I ran the same topic through. Same target length. Same H2 list. Same voice profile. The result read like one writer wrote it from start to finish.

The single-prompt version is still in my git history. About once a quarter I dig it up to remind myself why three calls beat one.

What pattern do you use to keep long-form AI output coherent past the 1000 word mark?
