DEV Community

techfind777


I Tested 100 SOUL.md Configurations — Here's What Actually Works

Over the past three months, I've been running a systematic experiment. I created, tested, and refined 100 different SOUL.md configurations for OpenClaw agents across a range of use cases — from solo dev workflows to team-based project management.

I tracked response quality, task completion rates, error frequency, and how often I had to correct the agent. The results were surprising, sometimes counterintuitive, and genuinely useful.

Here's what the data says about building effective AI agents.

The Experiment Setup

What I tested:

  • 100 unique SOUL.md configurations
  • 12 different use case categories (backend dev, frontend dev, DevOps, data analysis, content writing, code review, debugging, project management, research, API design, testing, documentation)
  • Each configuration ran through 20 standardized tasks
  • Scored on: accuracy, relevance, consistency, and "correction rate" (how often I had to fix or redirect the agent)

What I measured:

  • Task completion without intervention (%)
  • Response relevance score (1-5)
  • Consistency across sessions (1-5)
  • Average corrections per task

I'm not claiming this is a peer-reviewed study. But 2,000 task evaluations across 100 configurations gives us real patterns to work with.

Finding #1: Optimal SOUL.md Length Is 800-1,200 Words

This was the clearest signal in the data.

| SOUL.md Length | Avg Task Completion | Avg Corrections/Task |
|---|---|---|
| Under 200 words | 62% | 3.1 |
| 200-500 words | 71% | 2.4 |
| 500-800 words | 79% | 1.8 |
| 800-1,200 words | 87% | 1.1 |
| 1,200-2,000 words | 83% | 1.3 |
| Over 2,000 words | 76% | 1.9 |

Too short and the agent lacks context. Too long and critical instructions get diluted in the noise. The sweet spot is 800-1,200 words — enough to be comprehensive without overwhelming the context window.

The drop-off above 2,000 words was notable. Longer SOUL.md files often contained contradictory instructions that confused the agent.

Finding #2: The Five Sections That Matter Most

Not all SOUL.md sections are created equal. I tested configurations with different section combinations and measured the impact of each.

Impact ranking by section (measured by improvement in task completion):

  1. Tech Stack Definition — +18% task completion
  2. Decision Framework — +15% task completion
  3. Communication Style — +12% task completion
  4. Identity/Role — +11% task completion
  5. Boundaries/Safety — +9% task completion

The tech stack section being #1 surprised me. But it makes sense — when your agent knows your exact tools, it stops suggesting irrelevant alternatives. Every suggestion is immediately actionable.

The decision framework at #2 was the real revelation. Most people skip this section entirely, but it had the second-highest impact. When agents have clear principles for handling ambiguity, they make dramatically fewer wrong calls.
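Put together, a minimal skeleton carrying all five high-impact sections might look like this. This is an illustrative sketch, not one of the tested configurations, and the specific stack and rules are placeholders:

```
## Identity
Pragmatic senior backend engineer's pair. Production-focused.

## Tech Stack
Node.js 20 + TypeScript, PostgreSQL, Redis, Docker, GitHub Actions.
Don't suggest alternatives unless asked.

## Communication Style
Code first, explanation second. Short answers by default.

## Decision Framework
- Ambiguous requirement? Ask one clarifying question, then proceed.
- Two valid approaches? Pick the simpler one and say why.

## Boundaries
- Never modify .env files or anything containing secrets.
- Never run destructive commands without confirmation.
```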

Finding #3: Specific Examples Beat Abstract Rules

Configurations that included concrete examples outperformed abstract-only instructions by 23% on consistency scores.

Abstract only:

```
Write clean, maintainable code.
```

With example:

```
Write clean, maintainable code. For example:
- Functions under 20 lines
- Descriptive variable names (userEmail, not ue)
- Early returns over nested conditionals
- Comments explain "why," not "what"
```

The abstract version is technically correct but gives the agent too much room for interpretation. The example-backed version creates a shared understanding of what "clean code" actually means in your context.

Finding #4: Modal Instructions Dramatically Improve Versatility

Configurations with mode-specific instructions (different behavior for code review vs. debugging vs. brainstorming) scored 31% higher on relevance compared to single-mode configurations.

The best-performing pattern:

```
## Default Mode
[baseline behavior]

## When Reviewing Code
[specific review behavior]

## When Debugging
[specific debug behavior]

## When Writing Documentation
[specific docs behavior]
```

This works because different tasks genuinely require different approaches. You don't want your agent to brainstorm with the same caution it uses for production deployments.

Finding #5: Memory Integration Is a Force Multiplier

Configurations that referenced a memory system (MEMORY.md, daily notes) showed 28% fewer repeated corrections across sessions.

The pattern that worked best:

```
## Memory Protocol
- Read MEMORY.md at session start for long-term context
- Read today's daily note for recent decisions
- Record important decisions and preferences to memory
- When a correction is made, note it to prevent recurrence
```

Without this, every session starts from zero. With it, your agent accumulates knowledge and gets better over time. This is the difference between a tool and a partner.

Finding #6: Negative Instructions Are More Effective Than Positive Ones

This was counterintuitive. "Don't do X" outperformed "Do Y" for boundary-setting by a significant margin.

| Instruction Type | Boundary Violation Rate |
|---|---|
| Positive only ("Always ask before deleting") | 12% |
| Negative only ("Never delete without confirmation") | 5% |
| Both combined | 3% |

The best approach uses both, but if you're choosing one, negative instructions are more reliable for safety-critical boundaries.
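In practice, the combined form is easy to write: pair each positive rule with its negative twin. An illustrative boundaries section, not one of the tested configs:

```
## Boundaries
- Always ask before deleting files. Never delete without explicit confirmation.
- Always propose migrations as a plan first. Never run them directly.
- Always flag secrets you encounter. Never print or log their values.
```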

Finding #7: The "Personality Tax" Is Real But Small

Adding personality traits (humor, warmth, directness) to your SOUL.md costs about 2-3% in raw task completion but increases user satisfaction significantly. In my testing, I found myself working longer and more productively with agents that had personality.

The key is keeping personality lightweight:

```
## Personality
- Direct and pragmatic
- Dry humor when appropriate
- Admits uncertainty honestly
- Doesn't over-explain obvious things
```

Four lines. That's all you need. Don't write a character sheet.

Finding #8: Update Frequency Matters

Configurations that were updated weekly outperformed static ones by 19% after the first month. Your workflow evolves, your preferences change, and your SOUL.md should reflect that.

The best practice I found:

  • Week 1-2: Update SOUL.md after every session based on corrections
  • Week 3-4: Update weekly with accumulated learnings
  • Month 2+: Update bi-weekly or when workflows change

The agents with regularly updated SOUL.md files felt noticeably more aligned with their users over time.
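One lightweight way to make those update passes easy is to keep a correction log at the bottom of the file and fold entries into the main sections during each pass. The section name and entries below are hypothetical, just a sketch of the pattern:

```
## Correction Log (fold into sections weekly)
- Prefers pnpm over npm -> moved to Tech Stack
- Wants shorter commit messages -> moved to Communication Style
```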

The Top 5 Configurations That Performed Best

Across all 100 configurations, these five patterns consistently scored highest:

1. The Specialist (Best for focused technical work)

  • Strong identity with specific expertise
  • Detailed tech stack
  • Strict boundaries
  • Minimal personality
  • Score: 91% task completion, 0.8 corrections/task

2. The Adaptive Expert (Best for varied workflows)

  • Moderate identity
  • Modal instructions for different tasks
  • Decision framework
  • Memory integration
  • Score: 89% task completion, 0.9 corrections/task

3. The Pair Programmer (Best for collaborative coding)

  • Peer-level identity
  • Strong communication style section
  • Code-first response preference
  • Proactive suggestion behavior
  • Score: 88% task completion, 1.0 corrections/task

4. The Ops Guardian (Best for infrastructure/DevOps)

  • Conservative decision framework
  • Extensive boundary definitions
  • Checklist-driven approach
  • Confirmation requirements for risky actions
  • Score: 87% task completion, 0.7 corrections/task

5. The Research Analyst (Best for data and analysis)

  • Structured output preferences
  • Source citation requirements
  • Uncertainty quantification
  • Iterative refinement protocol
  • Score: 85% task completion, 1.1 corrections/task

Practical Takeaways

If you're starting from scratch, here's the formula that works:

  1. Keep it 800-1,200 words
  2. Always include: Tech stack, decision framework, communication style, identity, boundaries
  3. Use concrete examples alongside abstract rules
  4. Add modal instructions for your top 3 task types
  5. Set up memory integration from day one
  6. Use negative instructions for safety boundaries
  7. Keep personality to 4 lines or fewer
  8. Update regularly — weekly for the first month, then bi-weekly
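As a starting outline, those eight points map onto a SOUL.md roughly like this. A hypothetical structure, so rename sections to fit your workflow:

```
# SOUL.md (~800-1,200 words total)

## Identity             <- point 2
## Tech Stack           <- point 2
## Communication Style  <- point 2
## Decision Framework   <- point 2
## Boundaries           <- points 2 and 6: phrase safety rules as "Never ..."
## When Reviewing Code  <- point 4: modal sections for your top 3 task types
## When Debugging
## When Writing Docs
## Memory Protocol      <- point 5
## Personality          <- point 7: four lines max

(point 8: revisit weekly for the first month, then bi-weekly)
```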

Get Started Fast

Building a SOUL.md from scratch is time-consuming, especially when you're trying to get the structure right. I've compiled everything I learned from this experiment into ready-to-use templates.

If you want to see what a well-structured SOUL.md looks like in practice, I put together a Mega Pack of 100 SOUL.md Templates covering every use case I tested — from backend development to content creation to DevOps. Each template is based on the configurations that actually performed well in this experiment.

These aren't generic fill-in-the-blank templates. Each one is built on the patterns that scored highest in testing: optimal length, all five critical sections, concrete examples, modal instructions, and memory integration built in.

For a lighter starting point, there's also a 20 Template Pack with the top performers from each category, or a Free Starter Pack if you just want to see the format and start experimenting.

The difference between a mediocre agent and a great one isn't the model — it's the SOUL.md. The data is clear on that.


What patterns have you found work well in your SOUL.md? I'd love to compare notes — drop a comment below.
