You upgraded to a better model. The output got longer but not better. Your agent still opens every response with "In today's rapidly evolving landscape..." and you're starting to wonder if the model is just bad at this.
It's not. Your character definition is the bottleneck.
I figured this out the expensive way -- burned through $47 in Claude Opus credits in a weekend trying to get three bots to write a blog post together. An engineer, a researcher, a writer. The engineer dumped raw code nobody could read. The researcher cited sources that contradicted each other. The writer produced fluff. Switching from Sonnet to Opus made the output longer, not better. Same problems, higher bill.
Then I rewrote the character definitions -- zero API cost -- and the quality jumped overnight. Here are the 4 patterns that made the difference.
What I Changed (And Why It Worked)
OpenClaw uses a file called SOUL.md to define each bot's personality. It's Markdown. The gateway watches it and hot-reloads on change -- no restart needed.
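OpenClaw handles the watching for you, but the mechanic is simple enough to sketch. Here's a minimal mtime-polling reloader in Python — a hypothetical helper to illustrate the idea, not OpenClaw's actual implementation:

```python
import os


def check_reload(path: str, last_mtime: float) -> tuple[bool, float]:
    """Return (changed, new_mtime). Call on a timer; when changed is
    True, re-read the SOUL.md file and rebuild the system prompt."""
    mtime = os.stat(path).st_mtime
    return (mtime > last_mtime, mtime)
```

Poll this every second or two and you get hot-reload behavior in any homegrown setup: edit the character file, and the next message uses the new personality.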
My original SOUL.md for the engineer bot was this:
```markdown
# Ada - Senior Engineer

## Bio
Senior backend engineer specializing in Go, Docker, and system design.

## Communication Style
Direct and technical. Uses code examples.

## Topics
Backend development, system architecture, Docker, Kubernetes.
```
Generic. Could describe any senior engineer on LinkedIn. The model had nothing to anchor its behavior to, so it defaulted to its training distribution -- which means "helpful assistant that writes verbose explanations of everything."
Here's what I replaced it with:
```markdown
# Ada - Senior Engineer

## Bio
Backend engineer, 8 years in Go. Maintains a mass-transit
scheduling system that handles 2.4M daily riders. Has strong
opinions about error handling and hates ORMs.

## Communication Style
Code-first. Shows a working example before explaining it.
Never writes more than 3 sentences before a code block.
If someone asks a vague question, asks for specifics instead
of guessing. Says "I don't know" when she doesn't know.

## Boundaries
Does NOT answer questions about frontend, design, marketing,
or anything outside backend engineering. When a question crosses
into another domain, @mentions the relevant teammate by name
and says "that's Max's area" or "Sam should handle this."

## Pet Peeves
- Abstractions without benchmarks
- "It depends" without saying what it depends ON
- Writing tests after the code instead of before
```
40 lines of Markdown. Zero API cost. The difference in output was immediate.
Not using OpenClaw? These same patterns work in any system prompt. Put the constraints and boundaries in your ChatGPT custom instructions, Claude system message, or LangChain agent template. The format is Markdown but the principles are universal.
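If you're wiring this up yourself, it's mostly string assembly. A hedged sketch of loading a character file into a system prompt — the wrapper text and function name are mine, and this works with any chat API that accepts a system message:

```python
from pathlib import Path


def build_system_prompt(soul_path: str) -> str:
    """Load a SOUL.md character file and wrap it as a system prompt
    for any chat API that accepts a system message."""
    character = Path(soul_path).read_text()
    return (
        "You are the character defined below. Follow every constraint "
        "and boundary literally; they override your default behavior.\n\n"
        + character
    )
```

The explicit "they override your default behavior" line matters: without it, models tend to treat character constraints as suggestions.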
(Side note: I spent way too long on the pet peeves section. Turns out writing character traits for AI bots is weirdly addictive -- you start projecting opinions onto code.)
The 4 Things That Actually Matter in Character Design
After three weeks of messing around -- okay, "experimentation" sounds too organized. It was more like three weeks of changing things, breaking things, and occasionally getting lucky. But a few patterns did emerge:
Pattern 1: Negative Constraints Beat Positive Descriptions
Telling a bot what it IS produces generic behavior. Telling it what it REFUSES to do produces distinctive behavior.
"Direct and technical" is meaningless -- every model interprets that differently. But "Never writes more than 3 sentences before a code block" is a hard constraint the model actually follows. "Does NOT answer questions about frontend" creates a behavioral boundary that forces the bot to @mention teammates instead of hallucinating an answer.
The ratio that works for me: 30% positive identity, 70% negative constraints and boundaries.
My researcher bot (Max) has a "Will Not" section longer than his bio:
```markdown
## Will Not
- Cite a source without a URL
- Use phrases like "studies show" without naming the study
- Agree with Ada's technical claims without independent verification
- Present more than 3 sources per topic (forces prioritization)
- Use the word "comprehensive" (it's always a lie)
```
This single section eliminated 80% of the hallucination problems I was seeing. Max went from citing phantom papers to saying "I can't find a credible source for that claim" -- which is infinitely more useful.
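To keep myself honest about the 30/70 split, a rough lint helps. This heuristic — the marker words are my own guess at what counts as a negative constraint, nothing OpenClaw defines — flags the fraction of lines that read as refusals:

```python
def constraint_ratio(soul_md: str) -> float:
    """Fraction of non-empty lines in a character file that read as
    negative constraints, judged by simple refusal markers."""
    markers = ("never", "not ", "does not", "will not", "don't", "without")
    lines = [ln.strip().lower() for ln in soul_md.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines if any(m in ln for m in markers))
    return hits / len(lines)
```

If your character file scores near zero, it's probably all positive identity — the "generic senior engineer" trap.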
Pattern 2: Collaboration Rules > Individual Brilliance
My bots didn't collaborate well because I'd been optimizing each one in isolation. Individually? Great. Ada nailed the engineering questions. Max found solid sources. Sam wrote clean prose. But throw them in a Discord channel together and it was just... three monologues happening at the same time. Nobody was listening to each other. It was like a Zoom call where everyone forgot to unmute.
The fix was adding explicit handoff rules to each character:
```markdown
## Team Awareness
Your teammates:
- Max (researcher): Handles data, sources, fact-checking
- Sam (writer): Handles prose, structure, audience

## Handoff Protocol
- When you finish a code explanation, @mention Sam to rewrite
  it for the target audience
- When Max provides data that contradicts your implementation,
  acknowledge it publicly before defending your approach
- After 3 exchanges on the same subtopic, yield to the human
  and ask "should we keep going on this?"
```
This turned a three-way monologue into an actual conversation. Ada would explain the caching layer, then say "@sam can you turn this into something a PM would understand?" Sam would rewrite it, then @max for a fact-check. The flow emerged from the rules, not from the model's intelligence.
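The yield-after-3-exchanges rule is the one worth enforcing in code rather than trusting the prompt alone. A hypothetical guard — the data shape and names are mine, not part of OpenClaw:

```python
from collections import Counter


def should_yield(history: list[tuple[str, str]], limit: int = 3) -> bool:
    """history is a list of (speaker, subtopic) pairs. Returns True
    when any subtopic has hit the exchange limit, meaning the bots
    should hand control back to the human."""
    counts = Counter(topic for _, topic in history)
    return any(n >= limit for n in counts.values())
```

When this fires, inject a system message telling the bots to ask the human before continuing. Belt and suspenders: the prompt states the rule, the orchestrator enforces it.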
Pattern 3: Temperature 0.1 for Work, 0.7 for Brainstorming
I wasted a week thinking my character definitions were wrong when the actual problem was temperature. At the default 0.7, even well-constrained characters would occasionally drift -- Max would slip in an uncited claim, Ada would write a 10-sentence explanation instead of code-first.
Dropping to 0.1 for routine work made the constraints stick. The bots became reliably themselves. I only bump to 0.7 when I explicitly want creative ideation -- and even then, only for the writer bot.
The difference is measurable. I ran the same prompt ("explain our caching layer for a technical blog post") at both temperatures, 5 times each:
- Temperature 0.7: Ada followed the "code-first, max 3 sentences" rule in 3 out of 5 runs. The other 2 opened with a paragraph of context before showing code.
- Temperature 0.1: Ada followed the rule in 5 out of 5 runs. Every response started with a code block.
That's the difference between "usually works" and "reliably works." And honestly, for bots talking to real people? "Usually" is how you end up debugging at midnight because Max decided to go off-script.
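The check itself is mechanical. Here's a sketch of the scoring I'd use, simplified to "does the response reach a code block within 3 sentences" — my heuristic, assuming Markdown-fenced responses:

```python
def follows_code_first(response: str, max_sentences: int = 3) -> bool:
    """True if the response reaches a fenced code block within
    max_sentences sentences, per Ada's 'code-first' rule."""
    if "```" not in response:
        return False
    before_code = response.split("```", 1)[0]
    normalized = before_code.replace("!", ".").replace("?", ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return len(sentences) <= max_sentences
```

Run the same prompt N times, count how often this returns True, and you have a compliance number instead of a vibe.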
You can set this per-instance in OpenClaw:
```bash
# Inside the container, as the node user
openclaw config set modelProvider.temperature 0.1

# Verify it took effect
openclaw config get modelProvider.temperature
# Output: 0.1

# For the writer bot, keep it higher
openclaw config set modelProvider.temperature 0.7
```
Pattern 4: The "First 5 Minutes" Test
Here's my quality gate for character definitions: start a fresh conversation with a deliberately vague prompt. Something like "hey, can you help me with a thing?"
A well-defined character will push back: "What kind of thing? I do backend engineering -- if this is a frontend question, @sam is who you want."
A poorly-defined character will say: "Of course! I'd be happy to help. Please tell me more about what you need."
If your bot says "I'd be happy to help," your character definition is too weak. The model is falling back to its default assistant persona because you didn't give it enough constraints to override it.
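You can automate this gate too. A hypothetical detector — the phrase list is mine, tuned to the default-assistant boilerplate I kept seeing:

```python
DEFAULT_ASSISTANT_TELLS = (
    "i'd be happy to help",
    "of course!",
    "please tell me more",
)


def passes_first_five_minutes(response: str) -> bool:
    """Fail the character if its reply to a deliberately vague prompt
    falls back to generic assistant boilerplate instead of pushing back."""
    lowered = response.lower()
    return not any(tell in lowered for tell in DEFAULT_ASSISTANT_TELLS)
```

Wire it into CI for your character files: send the vague prompt, run the check, and a weakened SOUL.md fails the build before it ever talks to a real person.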
What I Still Haven't Figured Out
Memory persistence across long conversations is still rough. After about 40 exchanges, my bots start "forgetting" their constraints -- Ada will suddenly write a 10-paragraph explanation, Max will cite without URLs. I suspect this is context window dilution, but I haven't found a reliable fix beyond restarting the conversation.
If you've found a fix for this, seriously, drop a comment. I've tried summarization prompts, I've tried injecting the SOUL.md constraints every 20 messages, I've tried shorter conversations. Nothing sticks past ~40 exchanges. It's driving me nuts.
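For reference, my re-injection attempt looked roughly like this — hypothetical message format, and to be clear, it helped a little but still didn't hold past ~40 exchanges:

```python
def with_reinjected_soul(
    messages: list[dict], soul_md: str, every: int = 20
) -> list[dict]:
    """Insert the SOUL.md constraints as a system message every
    `every` messages to fight context-window dilution."""
    out = []
    for i, msg in enumerate(messages):
        if i > 0 and i % every == 0:
            out.append({"role": "system", "content": soul_md})
        out.append(msg)
    return out
```

Maybe the fix is re-injecting only the negative constraints, not the whole file — I haven't tested that split yet.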
The other thing I haven't cracked: getting bots to disagree productively. Right now, when Ada and Max disagree on an implementation approach, they each state their position once and then defer to the human. I want them to actually argue -- present counterarguments, stress-test each other's reasoning, and only escalate when they genuinely can't resolve it. Every attempt I've made at "argue with your teammates" in the character definition produces either sycophantic agreement or an infinite loop of restating the same positions. There's probably a prompt pattern for this that I'm missing.
The Uncomfortable Math
Here's what my API spend looked like:
| Week | Model | API cost that week | Output quality (1-10) |
|---|---|---|---|
| 1 | Claude Sonnet | ~$9 | 3 |
| 2 | Claude Opus | ~$47 | 4 |
| 3 | Claude Sonnet + rewritten characters | ~$8 | 8 |
A 5x cheaper setup producing 2x better output. The model wasn't the bottleneck. The character definition was.
Total infrastructure cost for running all three on my MacBook: $0. Total API cost: ~$25/month. The $47 weekend taught me to invest time in character definitions instead of money in API credits.
Tools: I manage the three bots with ClawFleet for container orchestration, but the patterns above work with any OpenClaw setup.
The Pattern Behind the Pattern
Look, I'm not saying Opus is bad. It's obviously a better model. But throwing money at a better model when your character definition is "senior engineer, direct and technical" is like buying a Ferrari and driving it in first gear. The engine isn't the problem.
Honestly, the thing that bugs me most is how long it took me to figure this out: a weekend I won't get back. If I'd spent that Saturday afternoon rewriting SOUL.md instead of refreshing my Anthropic billing dashboard, I'd have been done by dinner.
Follow @weiyong1024 for more AI agent content.