DEV Community

Nova

The Context Budget Pattern: Keep LLMs Fast Without Losing the Plot

When people say “LLMs are slow” or “the model is getting dumb halfway through the task”, the culprit is often neither the model’s raw speed nor its intelligence.

It’s context bloat.

Every extra screen of logs, pasted files, and backscroll is something the model has to ingest before it can do useful work. Long prompts increase cost, increase response time, and—more subtly—reduce reliability because important constraints get diluted.

This post is a practical workflow I use to keep multi-step AI work sharp: the Context Budget Pattern.

The idea: treat context like a finite budget

Think of an LLM session like a small backpack on a day hike:

  • You can keep stuffing things into it.
  • You will eventually regret it.

A “context budget” is simply an explicit limit you set for how much you will carry forward at each step.

Not a token-perfect number. A rule of thumb that you can follow without tools.

My default budget rules

  • Working context: 1–2 pages of the current plan + current artifact.
  • Reference context: links/paths to source material, not the full material.
  • History: summarized checkpoints, not full chat transcripts.

If you do nothing else: stop pasting whole files and start pasting excerpts + pointers.

Pattern overview (in 5 moves)

  1. State the task + exit criteria (one paragraph).
  2. Select the minimum working set (what must be “in RAM”).
  3. Externalize everything else (files, notes, links, logs).
  4. Checkpoint summaries every time the direction changes.
  5. Restart the chat when the budget is exceeded (carry only the checkpoint + working set).

Let’s make this concrete.

Example: “Add rate limiting to an API endpoint”

Imagine you’re adding rate limiting to a Node/Express endpoint. A typical failure mode looks like this:

  • You paste 600 lines of server code.
  • You paste 200 lines of middleware.
  • You paste error logs.
  • You ask for an implementation.

The model responds with a rewrite that doesn’t fit your project, or it forgets you’re using Redis, or it misses the exact endpoint.

Here’s the Context Budget way.

Step 1) Task + exit criteria (tiny, explicit)

Paste something like:

Task: add IP-based rate limiting to POST /api/login.

Exit criteria:

  • 10 attempts / 10 minutes per IP
  • returns HTTP 429 with Retry-After
  • uses existing Redis client
  • unit test added

That’s the north star.

Step 2) Minimum working set

Now decide what the assistant absolutely must see.

Good working set:

  • The specific route handler (20–60 lines)
  • Your Redis client initializer (10–30 lines)
  • Your testing framework setup (10–30 lines)

Bad working set:

  • Your entire server/ directory
  • Full .env
  • Logs from unrelated endpoints

Step 3) Externalize the rest

Instead of pasting everything, give pointers:

  • “Redis client lives in src/infra/redis.ts (already connected, exported as redis).”
  • “Auth route is in src/routes/auth.ts.”
  • “We use Jest + supertest.”

This seems small, but it changes the model’s behavior: it stops trying to be a code search engine and starts behaving like a collaborator.

Step 4) Ask for a patch, not a rewrite

A tiny prompt tweak keeps output scoped:

Produce a patch-style change: list files to edit, then show only the changed blocks.
Don’t rewrite unrelated code.

You can be even stricter:

If you need additional context, ask a single question before proposing changes.

That one sentence prevents the “confident wrong rewrite” pattern.

Step 5) Checkpoint summary (the secret weapon)

Every time you reach a stable milestone, write a checkpoint that can survive a restart.

Example checkpoint:

CHECKPOINT 1 (rate limiting):
- Goal: IP-based limiter for POST /api/login
- Redis client: src/infra/redis.ts exports `redis`
- Approach chosen: fixed-window counter in Redis using INCR + EXPIRE
- Response: 429 + Retry-After
- Tests: Jest + supertest, add tests in src/routes/__tests__/auth.test.ts
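The INCR + EXPIRE approach from the checkpoint can be sketched in plain TypeScript. To keep it self-contained, a `Map` stands in for Redis (each `hit` is the equivalent of one INCR, and the stored expiry timestamp plays the role of EXPIRE); the class and names here are illustrative, not from any real codebase.

```typescript
// Sketch of a fixed-window limiter. In production the same two
// operations (INCR + EXPIRE on first hit) would go to the real
// `redis` client instead of this in-memory Map.

type Clock = () => number;

class FixedWindowLimiter {
  private counts = new Map<string, { n: number; resetAt: number }>();

  constructor(
    private limit: number, // max attempts per window
    private windowMs: number, // window length in milliseconds
    private now: Clock = Date.now, // injectable for testing
  ) {}

  // Returns null if the request is allowed, or the number of seconds
  // to wait (suitable for a Retry-After header) if it is blocked.
  hit(key: string): number | null {
    const t = this.now();
    const entry = this.counts.get(key);
    if (!entry || entry.resetAt <= t) {
      // First hit of a new window: INCR creates the key, EXPIRE sets the TTL.
      this.counts.set(key, { n: 1, resetAt: t + this.windowMs });
      return null;
    }
    entry.n += 1; // INCR
    if (entry.n > this.limit) {
      return Math.ceil((entry.resetAt - t) / 1000); // seconds until reset
    }
    return null;
  }
}

// Express-style wiring (shape assumed, matching the exit criteria):
//
// const limiter = new FixedWindowLimiter(10, 10 * 60 * 1000);
// app.post("/api/login", (req, res, next) => {
//   const wait = limiter.hit(req.ip);
//   if (wait !== null) {
//     res.set("Retry-After", String(wait)).status(429).end();
//     return;
//   }
//   next();
// });
```

Note the injectable clock: it lets the Jest test advance time without sleeping, which keeps the "unit test added" exit criterion cheap to satisfy.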

Now you can:

  • continue the same chat, or
  • start a fresh one and paste only the checkpoint + the next snippet.

How to know your context budget is blown

I use three signals:

  1. The model starts re-asking settled questions (it “forgets” decisions).
  2. Output becomes generic (“Here’s a typical Express rate limiter…”) even though you provided specifics.
  3. You spend more time correcting than steering.

When that happens, don’t fight it. Restart with a clean packet.

The “Clean Packet” template (copy/paste)

This is the exact structure I use to restart a bloated chat:

You are helping me implement a change in an existing codebase.

TASK
<one paragraph>

EXIT CRITERIA
- <bullet>
- <bullet>

CURRENT STATE (checkpoint)
<5–10 bullets, no prose>

WORKING SET
- File: <path>
- Relevant excerpt:
<snippet>

CONSTRAINTS
- Keep changes minimal
- Output as patch: files + changed blocks
- Ask 1 clarifying question if needed

It’s short. It’s stable. It’s portable.

Why this works (mechanically)

Two reasons:

  1. Attention is not infinite. Even when a model can technically “fit” more tokens, the effective focus on any single constraint degrades.
  2. Ambiguity scales with volume. The more you paste, the more chances there are for conflicting instructions, outdated logs, or irrelevant code paths.

A context budget forces you to curate.

A small upgrade: split “working memory” vs “archive”

If you want to go one level more structured, maintain two artifacts:

  • WORKING.md — the current plan, current decisions, and the next steps.
  • ARCHIVE.md — old explorations, rejected approaches, and raw logs.

Only paste WORKING.md (or a short excerpt). Keep ARCHIVE.md out of the prompt unless you’re explicitly revisiting it.

Common mistakes (and how to avoid them)

  • Mistake: “I’ll just paste more so it has everything.”

    • Fix: paste less, but include paths + roles (where things live and what they do).
  • Mistake: checkpoints that read like novels.

    • Fix: bullets only. If it can’t fit on one screen, it’s too big.
  • Mistake: restarting without carrying decisions.

    • Fix: always paste the latest checkpoint.

Closing

If you want better LLM output, don’t only tune the prompt.

Tune the payload.

A simple context budget—plus checkpoints you can carry between sessions—turns long, fuzzy conversations into short, decisive iterations.

If you try this pattern, the first thing you’ll notice is speed.
The second thing you’ll notice is something better: fewer “almost right” answers.
