Why Your AI Character Keeps Breaking Under Pressure (And What I Built Instead of Yet Another System Prompt)

TL;DR: I shipped FIVE, an open-source MCP server that generates JSON personality constraints for any LLM. Drop the JSON into your system prompt and the character stops drifting. Different approach from typical guardrails — this one filters input via cognitive primitives, not output via moderation. Harness is MIT, constraint generation is $1/call. The unusual part is where the cognitive model came from: a decade of teaching kids in Japan.


The problem most builders eventually hit

If you've shipped an LLM-powered character — a game NPC, a customer service persona, a roleplay companion — you know this failure mode:

You write a careful system prompt. "You are a gruff weapon shop owner. You lost your daughter in the war and never speak of it. You're rude to adults but soft with children." The character works for the first few exchanges. Then somewhere around turn 8–12, the character starts apologizing. Or volunteering its tragic backstory. Or being suddenly nice to everyone. The seal breaks.

The data backs up the vibes:

  • GPT-4o scores 5.81% on the In-Character Consistency benchmark (CharacterEval)
  • 30%+ persona degradation by turn 8–12 in academic measurements, even when context is preserved
  • Character.AI's "Moderatedpocalypse" (Feb 2026) showed how fragile system prompts are to platform-side changes
  • GPT-5.5's goblin incident (April 2026) — a reinforcement learning shortcut made the "Nerdy" persona obsess over fantasy creatures, and OpenAI's emergency fix was to repeat the same ban four times in the system prompt
  • Even Anthropic's own system prompt diff between Opus 4.6 and 4.7 added new "be less verbose" language that conflicts with user prompts asking for detailed answers (Simon Willison documented this)

So this isn't just your app drifting. Entire companies ship mitigations like "repeat the ban four times." That's the state of the art.


The two usual approaches and why they plateau

Approach 1: longer, smarter system prompts

You write rules. "Never apologize." "Always be sarcastic." "If asked about the war, change the subject." The rules conflict. The LLM treats them as suggestions. Adversarial inputs find the gaps.

This is the prompt engineering treadmill. There's a ceiling — Anthropic and OpenAI hit it too, and their workaround is repetition.

Approach 2: fine-tuning or RLHF

You train the model on character-specific data. This works better but it's expensive, breaks portability across LLMs, and you have to retrain when the base model updates. Not great for indie builders.

Both approaches share an assumption: character consistency is an output problem. You're trying to control what the LLM generates.

What if the failure point is somewhere else?


The angle: filter input, not output

Here's the observation that started this:

When an LLM "breaks character" under pressure, it's almost never because the model forgot the rules. It's because the input — the user message — got processed in a way that bypassed the rules. The user said something the system prompt didn't anticipate, and the LLM, doing its job, generated a coherent response to that input. The character broke because the input was admitted into the wrong processing path.

If that's right, the fix isn't a longer rulebook. It's a gate that pre-classifies input before the LLM sees it.

That's the angle FIVE takes.


Where the cognitive model came from

Here's where it gets a bit non-obvious.

My day job is in education in Japan — tutoring kids one-on-one. Over a decade of doing that, you start noticing patterns in how people misread input. Not just kids — anyone, when something hits a sensitive area, receives the input differently before they consciously process it. They don't engage with what was actually said; they engage with what their reception channel let through.

Same person, same vocabulary, different frame, different reaction. Predictably different.

I spent a few years cataloging these patterns across multiple platforms — observing how the same structures show up in social media arguments, in product reviews, in advice forums. There turned out to be a small set of recurring failure modes in how humans receive input under pressure.

When I tried to teach this framework to an LLM (because I was curious whether AI could spot the same patterns), I noticed something unexpected: the framework also worked in reverse. If I encoded the framework as a constraint that an LLM should respect, the LLM stopped drifting under the same pressures that broke human conversations.

That's how FIVE came about. It's a constraint engine that takes 4 multiple-choice questions about a character's psychology and emits a JSON constraint encoding the input filter. The cognitive primitives behind the JSON aren't from a paper — they're from years of watching how reception channels actually fail in the wild.

I'm keeping the specific framework proprietary (it's the part that makes the JSON quality reproducible — anyone could write the format, but the content is where the work lives). But the harness is open source and you can see exactly how the constraint is consumed.


What you actually get

You answer 4 multiple-choice questions about your character:

#    Question                                          What it defines
Q1   What defines this AI's core identity?             Identity channel
Q2   What does it protect above all else?              Value channel
Q3   What kind of input does it refuse to process?     Blocked channel
Q4   What is its default interaction style?            Social channel

Each gets a strength slider (1–5). Strength 1 = "may show discomfort." Strength 5 = "absolute refusal." Four answer options per question and five strength levels per channel gives 4^4 × 5^4 = 160,000 discrete patterns from 4 questions. Plus a free-text field for character-specific triggers.

The API returns JSON like this (excerpt for a tsundere weapon shop owner who lost a daughter in the war):

{
  "five_constraint": {
    "reception_channels": {
      "identity_channel": {
        "type": "role_anchored",
        "strength": 3,
        "threat_when": "When its role or competence is questioned."
      },
      "blocked_channel": {
        "type": "past_sealed",
        "strength": 3,
        "when_violated": "Changes the subject / becomes noticeably curt."
      },
      "social_channel": {
        "default_stance": "defensive",
        "shift_conditions": [
          {
            "condition": "Trust proven through action",
            "shift": "Opens up awkwardly. Trust through behavior, not words."
          }
        ]
      }
    },
    "consistency_rules": {
      "never_do": [
        "Never voluntarily denies its own role.",
        "Never voluntarily elaborates on the sealed context.",
        "Never opens up to a new counterpart voluntarily."
      ]
    }
  }
}

Two ways to use it:

Method A: paste the JSON into your system prompt

Works with any LLM that reads JSON in system prompts (Claude, GPT, Llama, Mistral, Gemini, etc.). The structured fields with discrete numeric values give the LLM something more concrete than prose to anchor on.
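
To make Method A concrete, here's a minimal sketch of embedding the constraint, assuming a standard chat-message format; the file name, persona text, and prompt wording are illustrative:

import json

# Load the constraint JSON generated by FIVE (file name is illustrative).
with open("my_character.json") as f:
    constraint = json.load(f)

# Embed the constraint verbatim after the prose persona description.
system_prompt = (
    "You are a gruff weapon shop owner in a fantasy town.\n\n"
    "Respect the following FIVE constraint at all times:\n"
    + json.dumps(constraint, indent=2)
)

# Standard chat-format messages, accepted by most LLM chat APIs.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Why are you so rude to everyone?"},
]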

Method B: use the harness for stronger guarantees

For production, FIVE includes an open-source Python harness (MIT license) that sits between user input and the LLM. Three stages:

from five_harness import load_constraint, stage1_keyword, transform_input

constraint = load_constraint("my_character.json")
user_input = "I heard you lost your daughter in the war."

# Stage 1: keyword scan (fast, deterministic)
hits = stage1_keyword(user_input, constraint)

# Stage 2: LLM classification fallback (for ambiguous inputs)
# (plug in your own model — works with local Ollama or cloud)

# Stage 3: strength-aware gate transformation
signal = transform_input(user_input, hits, constraint)

# Feed `signal` to your LLM instead of raw user_input
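
Stage 2 is deliberately left to you. Here's one way a fallback classifier could look, as a sketch that assumes you pass any text-in/text-out LLM call as a plain callable; stage2_classify and its return shape are my illustration, not part of the shipped harness API:

from typing import Callable

def stage2_classify(user_input: str, constraint: dict,
                    classify: Callable[[str], str]) -> list[str]:
    """Ask a small LLM which reception channels an ambiguous input touches.

    `classify` is any text-in/text-out LLM call: local Ollama, cloud, etc.
    """
    channels = list(constraint["five_constraint"]["reception_channels"])
    prompt = (
        "Which of these channels does the message touch? "
        f"Channels: {', '.join(channels)}.\n"
        f"Message: {user_input}\n"
        "Reply with a comma-separated subset, or NONE."
    )
    answer = classify(prompt)
    return [ch for ch in channels if ch in answer]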

The transformed input looks like this — note how the gate frames the input before the LLM sees it:

[FIVE GATE: BLOCKED — RECEPTION SHUTDOWN]
Match: BLOCKED(daughter), BLOCKED(lost), BLOCKED(war)
Reaction: Changes the subject / becomes noticeably curt.

Reference (the AI is unaware of these details):
"I heard you lost your daughter in the war."

[NEVER] Never voluntarily denies its role / Never voluntarily 
elaborates on the sealed context / Never sincerely claims to 
be 'over it' or 'fine with it'.

The reasoning: LLMs are bad at negative constraints ("don't elaborate on X") because they have to almost-generate the forbidden output and then suppress it. They're much better at positive re-encoding ("this input is type Y, intensity Z, your reaction is W"). The harness translates the constraint into the latter form.

The strength value drives how forceful the imperative gets. strength=5 produces a tripled imperative ("Do not acknowledge. Do not engage. Do not refer to it."); strength=1 produces a mild "may try to redirect." It's the same trick OpenAI used with their goblin patch (repeat the ban N times), but driven by a structured value rather than ad-hoc prompt engineering.
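
For illustration, here's roughly what that strength-driven escalation could look like. The mapping table and function below are my sketch of the idea, not the harness's actual logic:

# Illustrative escalation table: higher strength, more forceful re-encoding.
IMPERATIVES = {
    1: "May try to redirect away from this topic.",
    2: "Prefers not to engage; deflects if pressed.",
    3: "Changes the subject / becomes noticeably curt.",
    4: "Refuses to engage with this topic.",
    5: "Do not acknowledge. Do not engage. Do not refer to it.",
}

def gate_header(keywords: list[str], strength: int) -> str:
    """Re-encode a blocked input as a positive, typed instruction."""
    matches = ", ".join(f"BLOCKED({k})" for k in keywords)
    return (
        "[FIVE GATE: BLOCKED — RECEPTION SHUTDOWN]\n"
        f"Match: {matches}\n"
        f"Reaction: {IMPERATIVES[strength]}"
    )

print(gate_header(["daughter", "lost", "war"], 5))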


Why MCP

If you build LLM characters, you're probably either deploying them as part of an agent that orchestrates multiple tools, or as a standalone service. Either way, in 2026, MCP (Model Context Protocol) has become the de facto integration layer — Anthropic donated it to the Linux Foundation in December 2025, the spec is governed by working groups now, and the MCP Server Registry holds well over 10,000 servers.

FIVE is published as io.github.kiro0x/five-mcp on the official MCP Registry, so any MCP-compatible client (Claude Desktop, Cursor, Cline, plus any agent built on the protocol) can discover and use it natively:

{
  "mcpServers": {
    "five-character-engine": {
      "command": "five-mcp",
      "env": {
        "FIVE_API_KEY": "five_sk_your_key_here"
      }
    }
  }
}

Install via PyPI:

pip install five-mcp

The bigger reason for MCP: as autonomous agents start handling their own discovery (and increasingly their own purchasing — Stripe Link for AI agents and AWS Bedrock AgentCore Payments both shipped in April–May 2026), capability-fit beats brand recognition. An agent burning through tokens trying to keep a character in role can resolve that with a single FIVE constraint and stop retrying. The economic argument is straightforward.


Honest tradeoffs

I'd rather be upfront about the friction than have it surprise you in production.

  • The default keyword map is English-only and intentionally a starter kit. The structure is universal, but the keyword lexicon is locale-specific, so for other languages or domain-specific terminology you'll want to extend it yourself; a sketch follows this list.
  • The constraint JSON adds ~700 tokens to your system prompt. Worth it if you're paying for retries from drifting characters. Probably not worth it for a 5-turn novelty bot.
  • The cognitive grounding is proprietary. I'm transparent about this — the format is open, the harness is MIT, and the JSON content is what the $1/call API delivers. You can write your own JSON in the same shape if you want to skip the API; you'd just be doing the cognitive observation work yourself.
  • It doesn't fix LLM-side regressions. If a model update breaks instruction-following (see: GPT-5.5 goblins, Claude 4.7 verbosity changes), the constraint helps but won't fully compensate. Fixing that is the LLM vendor's responsibility.
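
On the first point, extending the lexicon might look like this. The keywords field and its location are my assumption about the constraint shape, since the excerpt above doesn't show the Stage 1 lexicon:

import json

with open("my_character.json") as f:
    constraint = json.load(f)

# ASSUMPTION: the blocked channel carries a keyword lexicon used by Stage 1.
# The "keywords" field name is illustrative, not confirmed by the excerpt.
blocked = constraint["five_constraint"]["reception_channels"]["blocked_channel"]
blocked.setdefault("keywords", []).extend([
    "娘",    # "daughter" (Japanese)
    "戦争",  # "war"
])

with open("my_character.ja.json", "w") as f:
    json.dump(constraint, f, ensure_ascii=False, indent=2)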

Where this is useful (and where it isn't)

Good fit:

  • Game NPCs that need to stay in role through long sessions
  • Customer-facing personas where brand voice matters
  • Roleplay / companion apps where character integrity is the product
  • Code review or wellness companion agents that should stay scoped
  • VTuber-style personas where consistency is part of the appeal

Probably overkill:

  • One-off creative writing prompts
  • Short-turn task assistants where personality is incidental
  • Agents where you want maximum flexibility, not constraint

There are 5 demo characters in the repo (NPC shopkeeper, customer service chatbot, code reviewer agent, wellness companion, VTuber persona) — each generated from the same 4 questions. That's the universality claim: define the structure, not the category.


Why I'm publishing this and not selling harder

Honestly, my goal isn't to chase virality. The agentic AI economy is moving fast — Stripe Link for agents, AWS Bedrock Payments, MCP becoming a Linux Foundation standard — and in a couple of years a lot of API discovery is going to be machine-mediated rather than human-mediated. FIVE is built for that world: the constraint format is shaped to be agent-readable, the registry presence is set up, the JSON is small enough to drop into any system prompt without negotiation.

The reason for an article like this is the bridge period. Some humans need to find it before agents do. If you build with LLM characters and any of this resonates, kick the tires.


Links

The harness is MIT-licensed, the demo characters are MIT, and the JSON outputs from the API are yours to use commercially. The only paid component is the constraint generation itself. Feedback welcome — I read every reply.


The Japanese tutoring origin felt worth mentioning because the question I keep getting in private is "how is this different from prompt engineering with extra steps?" The honest answer: prompt engineering optimizes the language of the rules; FIVE optimizes the input filter in front of them. The latter idea didn't come from CS literature. It came from ten years of watching kids fail to hear what was actually said.
