You Can't Prompt Your Way Out of a Hard Constraint

#buildinginpublic #claudecode #promptengineering #n8n

Thursday morning I removed five nodes from my content pipeline. By lunch I understood something about building with language models that eleven failed edits had been trying to teach me all week: when a rule absolutely has to hold, you don't write the rule into the prompt. You enforce it in code.
This is a field report from the inside of the AI content engine I built in n8n. It's not a hot take about prompt engineering. It's the specific, expensive way I learned where prompts stop working — and what to do instead.

The setup

ConnectEngine OS has a module called ContentFlow. You give it a topic, it grounds itself in real sources, and it writes platform-specific posts: a blog draft, a LinkedIn version, an X version, Facebook, Instagram, plus a matching image prompt. One idea in, six shaped outputs out.
For weeks there had been a verifier stage in the middle — a fact-check node that re-read every claim against the cited sources. It was slow and it was noisy, so on Thursday I split it out and removed it from the generation path. The workflow went from 36 nodes to 31. Cleaner. Faster.
Then I regenerated an idea to smoke-test the change, and every platform output came back wrong.
X was over 5,000 characters against a 270 limit. LinkedIn and Instagram had markdown # headers that those platforms explicitly don't render. Everything read like a blog post regardless of which platform it was for. The image prompt field was stuffed with the article body instead of a visual description. When I checked the backlog, 21 of 45 ideas were affected — 47% of everything in the pipeline.
My first reaction was the wrong one: what did removing the verifier break?

The technical heart

It hadn't broken anything. The verifier removal was innocent. What it did was stop hiding a bug that had been there the whole time.
Here's the part that matters. The node that calls the model assembles a system prompt that's roughly 49KB. That's not a typo. It's the platform's format rules, plus the full grounding context — the primary source (~6KB), three separate search-result bodies (~4KB each), the citation-formatting rules, the founder voice profile, and the per-platform instructions. All concatenated into one instruction block.
Inside that 49KB sits a single line that says, in effect, "X posts must be under 270 characters, no markdown headers." And the model ignores it.
Not maliciously. The grounding context is the overwhelming bulk of those tokens, and it's full of concrete, specific article content. A single formatting sentence floating in that ocean doesn't get the model's attention. The signal is swamped.
The actual root cause was even more direct: an upstream node was writing each idea's raw_idea as an imperative instruction ("write a comprehensive guide to..."), and that instruction was passed verbatim into the user message. The model obeyed the imperative it was handed over the format rules buried in the system prompt. Same story for the image prompt — it was told to write an article, so it wrote an article into the image field.

Eleven edits, and a pattern I couldn't ignore

So I did what most people do. I tried to fix it with better instructions.
| Fix attempt | Mechanism | Outcome |
|---|---|---|
| Topic-reframe in the user message | prompt | Partial — stopped the imperative echo, lengths still wrong |
| End-of-prompt "final reminder" with hard char limits | prompt | Partial — LinkedIn 4550 → 2896, Facebook 1789 → 691, but X and LinkedIn still over |
| "Default to a single tweet, not a thread" rule | prompt | Ignored — still produced a 3-tweet thread |
| "Don't write source stories in the first person" rule | prompt | Ignored — still wrote a borrowed "$257/month" story as mine |
| Re-splitter: break long output into ≤270-char tweets at sentence boundaries | code | Works — X is postable no matter what the model emits |
| Character gate with an X exemption | code | Works |
| Brand-aware image fallback (read brand config, build the prompt from a template) | code | Works — images stay on-brand even when generation misfires |
| Image guard: discard anything with # headers or over 400 chars | code | Works — article bodies never reach the image field |
Read that table top to bottom. Every prompt-level fix was partial or flatly ignored. Every code-level fix worked the first time and kept working.
By the eleventh edit I stopped pretending the next instruction would be the one that stuck. The lesson wasn't "write the rule more forcefully." The lesson was that I'd been using the wrong tool for the job.

The numbers

Metric	Value
Grounding context per generation	~49KB
Prompt-level fix attempts (E1–E11)	11
Prompt fixes that fully held	0
Code-level fixes that held	4
Ideas affected by the unmasked bug	21 of 45 (47%)
Platforms posting correctly after the fix	5 of 6

Zero out of eleven on one side. Four out of four on the other. When the data is that lopsided, it isn't telling you to try harder. It's telling you the category is wrong.

The lesson

Here's the rule I walked away with, and it's now how I build every model-backed feature:
Use the prompt for the generative task. Use code for the hard constraints.
The prompt decides what to write about, the voice, the tone, the angle. That's what language models are extraordinary at, and you should let them cook. But the moment a requirement must hold — a character limit, a banned markdown token, a brand color in an image, a field that must never contain an article body — that requirement does not belong in the prompt. It belongs in a post-processor, a re-splitter, a deterministic truncation at a sentence boundary, a validation gate, a template you interpolate into. Something that runs in code, after the model, and cannot be argued with.
This is the same shape as a lesson I keep relearning across the whole product. When I rewrote 16 plan documents from scratch, the takeaway was "plans rot faster than code because plans have no CI." A prompt instruction is a plan. Code is the CI. If the constraint has no enforcement below the layer that can ignore it, it will eventually be ignored.

The honest gaps

I'm not going to pretend it's all solved. Five of six platforms post correctly now and the images came out genuinely good — idea-specific and on-brand. But:

X still wants to write threads. The prompt rule asking for a single tweet is ignored; the re-splitter makes the thread postable, but it's still a thread. That's a product decision I haven't made yet, not a bug I've fixed.
LinkedIn occasionally trips its own publish gate because the Unicode-bold formatting uses surrogate-pair characters that count as two each in a naive length check. The fix is to count code points, not string length — another deterministic code fix, queued.
The real root lever is the 49KB itself. Trimming the grounding context would reduce the dominance that causes all of this. I'm holding off, because shrinking the source bodies re-opens an older bug where the model invents list endings when it's given too little to work with. That's a genuine tradeoff I haven't resolved, and I'd rather say so than pretend the architecture is finished. There's also a sharper edge here than formatting. One of the ignored rules was "don't write a source's story in the first person." The model kept taking a number it read in a source article and presenting it as my own experience. For a founder writing under his own name, that's not a formatting miss — it's an honesty problem. And the durable fix for that one isn't a post-filter at all. It's feeding the pipeline my real stories instead of asking it to rewrite someone else's. Which, transparently, is exactly what this post is. ## The pattern, if you're building with models Three things to take from a week I'd rather have spent shipping features:
Audit where your constraints actually live. If the only thing standing between your output and a hard requirement is a sentence in a prompt, you don't have a constraint — you have a suggestion. Find every "must" in your prompts and ask which ones have code behind them.
Watch for signal drowning in scale. A rule that worked in a 2KB prompt can quietly stop working when that prompt grows to 49KB and fills with concrete content. More context makes generation better and makes instruction-following worse. Budget for that.
When a fix is partial three times, change categories. Partial-partial-partial is the model telling you the lever doesn't reach. Stop adding prompt text and move the requirement into deterministic code. This connects directly to the context-architecture work I did two weeks ago — the whole reason I think in token budgets and signal-to-noise now — and it's the kind of thing the parallel-tab debugging setup was built to chase down quickly. The compounding is real: every time I learn where a model's attention gives out, the next feature gets a deterministic guardrail instead of a hopeful instruction. Prompts are for what to say. Code is for what must be true. --- I build ConnectEngine OS in production, in public, most mornings. The scan tool is free if you want to see what it does.

DEV Community