A "leaky" prompt is one that burns tokens without contributing to the output. The model gets bigger inputs, the bill goes up, and the quality gets worse because the signal drowns in noise. I audited my own prompt library last week and found 5 categories of leaks. Here they are, with the fixes.
Sign 1: You're pasting the whole file when the model only needs 20 lines
Symptom: Your prompts look like "Here's my project: [3000 lines of code]. Now fix the bug in the login function."
Why it leaks: The model has to scan the entire context to find the relevant section. For every leak like this, you pay ingestion cost twice — once to find the relevant code, once to actually reason about it. And the larger the haystack, the more likely the model pulls irrelevant context into its answer.
The fix: Extract the minimal slice. For a bug in login(), you need:
- The `login()` function itself
- Its direct callers (one or two at most)
- The types/interfaces it uses
- NOT the other few thousand unrelated lines
A 30-second grep and sed gets you there. Most IDEs have "copy symbol with references" built in.
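If you want to script the slice instead of copy-pasting, here's a rough sketch using Python's `re` module. It grabs a single top-level function from source text; the function name and the sample code are illustrative, and a real tool would lean on the IDE or an AST parser rather than a regex.

```python
import re

def extract_function(source: str, name: str) -> str:
    """Pull one top-level Python function out of a source string.

    Crude slice: match from the `def` line until the next top-level
    statement. Good enough for building a minimal prompt context.
    """
    pattern = re.compile(
        rf"^def {name}\(.*?(?=^\S|\Z)", re.MULTILINE | re.DOTALL
    )
    match = pattern.search(source)
    return match.group(0).rstrip() if match else ""

# Toy source file standing in for the 3,000-line project.
code = '''
def helper():
    return 1

def login(user, password):
    token = helper()
    return token

def unrelated():
    pass
'''

snippet = extract_function(code, "login")
```

Paste `snippet` (plus callers and types) into the prompt instead of the whole file.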
Sign 2: You keep re-explaining the same context in every prompt
Symptom: Every prompt in your session starts with "I'm building a Flask app that uses Postgres and handles user auth via JWT..."
Why it leaks: You're paying ingestion cost for the same context on every single turn. In a 20-turn session, that's 20x the tokens for no benefit.
The fix: Seed files. Put the project context in a single file (CONTEXT.md) and reference it once at the top of the session. Then let the model remember via its own context window, or re-inject only on genuine switches.
I'm working on the project described in CONTEXT.md (attached).
For this session, focus on: <specific task>.
Done. Context explained once, referenced by name after that.
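The seed-file pattern is trivial to script. A minimal sketch, assuming a `CONTEXT.md` in the working directory; the function name is made up for illustration:

```python
from pathlib import Path

def session_opener(context_path: str, task: str) -> str:
    """Build the one-time session opener: the full project context is
    injected once here; later turns only reference the file by name."""
    context = Path(context_path).read_text()
    return (
        f"Project context (from {context_path}):\n{context}\n\n"
        f"For this session, focus on: {task}"
    )

# Illustrative seed file content.
Path("CONTEXT.md").write_text("Flask app, Postgres, JWT auth for users.")

opener = session_opener("CONTEXT.md", "fix the login bug")
```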
Sign 3: Your examples are longer than your instructions
Symptom: You give the model three full-length input/output pairs as "few-shot examples" to teach it a format, and the examples eat 80% of your prompt budget.
Why it leaks: Examples are necessary, but they compound fast. Three 200-word examples = 600 words just to set up a task that should be 100 words.
The fix: Use minimal-pair examples. Instead of full realistic examples, give tiny toy examples that only demonstrate the format, then describe the real inputs separately.
Before (leaky):
Example 1:
Input: [300 words of realistic text]
Output: [250 words of formatted output]
Example 2:
Input: [300 more words]
Output: [250 more words]
Now process this: [real input]
After (sealed):
Format every response as:
{"summary": string, "tags": string[], "urgency": "low"|"med"|"high"}
Example:
Input: "Ship broken"
Output: {"summary": "Ship broken", "tags": ["bug"], "urgency": "high"}
Now process: [real input]
Same signal, a tenth of the tokens.
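You can sanity-check the savings with the common ~4-characters-per-token rule of thumb (an approximation; real tokenizers vary). A quick sketch comparing a padded full-length example against the minimal pair above:

```python
def rough_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (rule of thumb)."""
    return max(1, len(text) // 4)

# Stand-in for one full-length few-shot example (~300 + ~250 words
# in the real leaky version; padded strings here for illustration).
leaky = (
    "Example 1:\nInput: " + "realistic text " * 20
    + "\nOutput: " + "formatted output " * 17
)

sealed = (
    'Format every response as:\n'
    '{"summary": string, "tags": string[], "urgency": "low"|"med"|"high"}\n'
    'Example:\n'
    'Input: "Ship broken"\n'
    'Output: {"summary": "Ship broken", "tags": ["bug"], "urgency": "high"}'
)
```

The minimal pair carries the same format signal at a fraction of the estimated token cost.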
Sign 4: You're including chat history the model doesn't need
Symptom: Your conversation is 15 turns deep and you're still sending all 15 turns to the model for every new message.
Why it leaks: The model re-processes the entire history on every turn. Turn 15 pays for context from turns 1-14 every single time, even if those turns were tool errors, clarifying questions, or false starts that are no longer relevant.
The fix: Compact the history. At turn 5, 10, or 15, ask the model to summarize what's been decided and what state we're in. Then start a new thread with just the summary. (See also: The Handoff Prompt — same idea, different trigger.)
Most chat UIs won't do this automatically. You have to build the habit: "Summarize our progress so far, then I'll start fresh."
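The compaction pattern itself is simple to implement around whatever client you use. A sketch, where `summarize` is a hypothetical stand-in for an actual LLM call that condenses decisions and state:

```python
def summarize(messages: list[dict]) -> str:
    # Placeholder: a real implementation would ask the model to condense
    # what's been decided and the current state into a few sentences.
    decisions = [m["content"] for m in messages if m["role"] == "assistant"]
    return "Summary of progress: " + " | ".join(decisions[-2:])

def compact(history: list[dict], keep_last: int = 2) -> list[dict]:
    """Replace all but the last few turns with one summary message."""
    if len(history) <= keep_last:
        return history
    summary = {"role": "system", "content": summarize(history[:-keep_last])}
    return [summary] + history[-keep_last:]

# Toy 10-turn history alternating user / assistant.
history = [
    {"role": "user", "content": f"turn {i}"} if i % 2 == 0
    else {"role": "assistant", "content": f"decision {i}"}
    for i in range(10)
]
compacted = compact(history, keep_last=2)
```

Turn 11 now pays for one summary plus two recent turns instead of ten turns of history.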
Sign 5: You're using verbose natural language where a schema would do
Symptom: Prompts like "Return the answer as a JSON object with a field called 'summary' which should be a string containing a brief description, and then a field called 'tags' which should be an array of strings, and also include a field called 'urgency' which can be one of 'low', 'medium', or 'high'..."
Why it leaks: You're using 80 tokens to describe what a 20-token schema fragment could say unambiguously.
The fix: Write the schema directly:
Output (strict JSON, no prose):
{"summary": string, "tags": string[], "urgency": "low"|"medium"|"high"}
Fewer tokens, less ambiguity, better compliance from the model. Modern LLMs handle type-like notation better than long English descriptions for format instructions.
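A schema in the prompt also gives you something to validate against on the way back. A minimal dependency-free checker for the schema above; a real project might reach for `jsonschema` or `pydantic` instead:

```python
import json

def validate_reply(raw: str) -> dict:
    """Parse a model reply and check it against the prompt's schema."""
    reply = json.loads(raw)
    assert isinstance(reply.get("summary"), str)
    assert isinstance(reply.get("tags"), list)
    assert all(isinstance(t, str) for t in reply["tags"])
    assert reply.get("urgency") in {"low", "medium", "high"}
    return reply

good = '{"summary": "Ship broken", "tags": ["bug"], "urgency": "high"}'
reply = validate_reply(good)
```

If the model drifts from the schema, `validate_reply` raises and you can retry instead of silently passing bad structure downstream.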
Running the audit on your own prompts
Pick your 10 most-used prompts. For each one, ask:
- Am I pasting more context than the task needs?
- Am I re-explaining things the model already knew?
- Are my examples longer than my instructions?
- Is my chat history a graveyard?
- Am I describing structure in prose?
A single "yes" is a leak. Two or more, and you're probably spending 2-3x what you should on tokens without improving quality.
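If you want to run the audit mechanically, here's a tiny scorer for one prompt. The questions mirror the checklist above; the function and thresholds are illustrative, not from any library:

```python
QUESTIONS = [
    "Am I pasting more context than the task needs?",
    "Am I re-explaining things the model already knew?",
    "Are my examples longer than my instructions?",
    "Is my chat history a graveyard?",
    "Am I describing structure in prose?",
]

def audit(answers: list[bool]) -> str:
    """Score one prompt: a 'yes' per question above counts as a leak."""
    leaks = sum(answers)
    if leaks == 0:
        return "sealed"
    if leaks == 1:
        return "1 leak"
    return f"{leaks} leaks: likely overspending 2-3x on tokens"

verdict = audit([True, False, False, True, True])
```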
I cut my token usage by about 40% after doing this audit on my own stuff. The unexpected bonus: the quality of the outputs also went up, because the signal-to-noise ratio improved in every prompt.
Question for you: Which of these five leaks is your worst offender? Mine is #4 — I let chat histories run way too long before compacting. Curious where everyone else's leaks live.
Top comments (1)
This is a great practical guide for anyone working with LLMs, especially those looking to optimize both their costs and the quality of their model's output. The concept of "leaky" prompts is a very intuitive way to think about token management.
The 5 key signs are:

1. Pasting whole files when only a small slice is needed.
2. Re-explaining the same context every turn (fix: a seed file like CONTEXT.md).
3. Examples longer than the instructions.
4. Sending chat history the model doesn't need.
5. Verbose prose where a schema would do.

I found the "minimal-pair examples" advice particularly useful: it's a great way to reduce the prompt size without losing the structural guidance. The advice to "compact the history" is also a great tip for maintaining focus in long-running sessions.
Which of these do you think is your most frequent "leak"? I'm definitely guilty of letting histories run too long sometimes!