Running one AI agent is easy. You write a system prompt, test it, ship it.
Running 10+ agents in production? That's where teams break.
We operate 12 AI agents at ClawPod — a CEO agent, developers, a security auditor, a marketer, QA, DevOps, and more. Each agent has its own identity, responsibilities, tools, and communication protocols.
After months of iteration, we've built a prompt management system that keeps all 12 agents consistent, debuggable, and independently deployable.
Here's the complete guide.
## Why Prompt Management Gets Hard at Scale
Before jumping into solutions, let's be honest about what breaks:
| Problem | 1-2 Agents | 10+ Agents |
|---|---|---|
| Prompt storage | One file, easy to find | Scattered across configs, env vars, databases |
| Version control | Manual copy-paste | Untracked changes cause silent regressions |
| Consistency | Read it once, done | Conflicting instructions between agents |
| Testing | Manual spot-check | Impossible to verify all interactions |
| Debugging | Re-read the prompt | Which of 10 prompts caused this behavior? |
**The root cause:** most teams treat prompts as configuration, not code. The moment you cross ~5 agents, prompts need the same rigor as your application source code.
## Step 1: One Agent, One File — The Identity Pattern
Every agent gets a dedicated markdown file that defines who it is. We call this the Identity Pattern.
```
/agents
  /ceo
    SOUL.md      # Identity, role, decision principles
    TOOLS.md     # Available tools and usage
    AGENTS.md    # Operating protocol
  /developer
    SOUL.md
    TOOLS.md
    AGENTS.md
  /security
    SOUL.md
    TOOLS.md
    AGENTS.md
```
**SOUL.md structure:**
```markdown
---
agent_id: developer-agent
name: "Sophia"
role: "Senior Developer"
department: "engineering"
---

## Identity
[Who this agent is, in 2-3 sentences]

## Core Responsibilities
- [Specific, measurable duties]

## Communication Style
- [How it talks to other agents]
- [How it reports to humans]

## Decision Principles
- [When to act autonomously]
- [When to escalate]

## Boundaries
- [What it must NEVER do]
```
**Why this works:**
- Each file is self-contained — you can read one agent's full identity in 30 seconds
- Markdown is version-controllable, diffable, and human-readable
- Clear separation of concerns: identity ≠ tools ≠ operating protocols
> 💡 **Key insight:** Don't embed prompt logic inside application code. External markdown files let non-engineers review and update agent behavior without touching the codebase.
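The frontmatter block at the top of each `SOUL.md` can be split off with a few lines of stdlib Python. A minimal sketch, assuming flat `key: value` frontmatter; the `parse_soul` helper is illustrative, not our production loader:

```python
from pathlib import Path

def parse_soul(path: str) -> tuple[dict, str]:
    """Split a SOUL.md file into its frontmatter dict and markdown body."""
    text = Path(path).read_text()
    if not text.startswith("---"):
        return {}, text  # no frontmatter; whole file is the body
    # "---\n<frontmatter>\n---\n<body>" -> ["", frontmatter, body]
    _, frontmatter, body = text.split("---", 2)
    meta = {}
    for line in frontmatter.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip().strip('"')
    return meta, body.strip()
```

The parsed metadata is handy for registries and dashboards (e.g. listing every agent's `department`) without sending the frontmatter to the model.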
## Step 2: Shared Protocols via Template Inheritance
With 10+ agents, you'll notice 60-70% of instructions are identical:
- Safety rules
- Communication format
- Escalation procedures
- Memory management
- Tool usage patterns
Don't copy-paste these into every agent file. Instead, create a shared protocol layer:
```
/agents
  /_shared
    SAFETY.md    # Universal safety rules
    COMMS.md     # Communication protocol
    MEMORY.md    # How to read/write memory
  /ceo
    SOUL.md      # CEO-specific identity
  /developer
    SOUL.md      # Developer-specific identity
```
At agent startup, the system composes the final prompt:
```python
def build_agent_prompt(agent_name: str) -> str:
    shared = load_shared_protocols()  # _shared/*.md
    identity = load_file(f"/agents/{agent_name}/SOUL.md")
    tools = load_file(f"/agents/{agent_name}/TOOLS.md")
    protocols = load_file(f"/agents/{agent_name}/AGENTS.md")

    return f"""
{shared}

{identity}

{tools}

{protocols}
"""
```
**Benefits:**
- Update safety rules once → all 12 agents get the change
- Agent-specific overrides still work (identity files take precedence)
- Reduces total prompt volume by 40-60%
## Step 3: Version Control Everything (Yes, Prompts Too)
If your prompts aren't in Git, you're flying blind.
```bash
# Track every prompt change
git add agents/
git commit -m "developer: clarify PR review checklist"

# See what changed between deployments
git diff v1.2..v1.3 -- agents/

# Blame: who changed the security agent's escalation rules?
git blame agents/security/SOUL.md
```
**Prompt changelog example:**
```markdown
## 2026-03-25
- developer/SOUL.md: Added explicit code review checklist (5 items)
- _shared/SAFETY.md: Tightened credential handling rules
- ceo/SOUL.md: Added delegation matrix for cross-team requests

## 2026-03-20
- security/SOUL.md: New vulnerability scanning protocol
- _shared/COMMS.md: Standardized status report format
```
**Why this matters more than you think:**

When an agent starts behaving differently, the first question is always: "What changed?" Without version control, you're guessing. With it, you run `git log` and know in 10 seconds.
## Step 4: Environment-Specific Prompt Layers
Your agents don't behave the same in dev vs. staging vs. production. Nor should their prompts.
```
/agents
  /developer
    SOUL.md           # Base identity (all environments)
    SOUL.dev.md       # Dev overrides (verbose logging, relaxed limits)
    SOUL.staging.md   # Staging tweaks (test data flags)
    SOUL.prod.md      # Prod hardening (strict safety, no debug output)
```
```python
def load_prompt(agent: str, env: str) -> str:
    base = load_file(f"/agents/{agent}/SOUL.md")
    override = load_file(f"/agents/{agent}/SOUL.{env}.md", default="")
    return merge_prompts(base, override)
```
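`merge_prompts` does the real work in this loader. The simplest version, sketched below under the assumption that later instructions win on conflict, just appends the override after the base; a stricter section-level merge is a natural upgrade:

```python
def merge_prompts(base: str, override: str) -> str:
    """Append the environment override after the base prompt.

    Relying on recency is the simplest merge strategy: when instructions
    conflict, the later (environment-specific) rule tends to win. The
    "## Environment Overrides" heading is an illustrative convention.
    """
    if not override.strip():
        return base  # no override file for this environment
    return f"{base.rstrip()}\n\n## Environment Overrides\n{override.strip()}"
```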
**Common environment differences:**
| Aspect | Development | Production |
|---|---|---|
| Logging | Verbose, include reasoning | Minimal, structured only |
| Safety | Relaxed for testing | Maximum strictness |
| External calls | Mocked/sandboxed | Live APIs |
| Error handling | Show full traces | Graceful degradation |
| Rate limits | None | Enforced per-agent |
## Step 5: Prompt Testing — Catch Regressions Before They Ship
This is where most teams stop. Don't.
### 5a. Schema validation

Every `SOUL.md` must contain required sections:
```python
REQUIRED_SECTIONS = [
    "Identity",
    "Core Responsibilities",
    "Decision Principles",
    "Boundaries",
]

def validate_prompt(filepath: str) -> list[str]:
    with open(filepath) as f:
        content = f.read()
    errors = []
    for section in REQUIRED_SECTIONS:
        if f"## {section}" not in content:
            errors.append(f"Missing section: {section}")
    return errors
```
### 5b. Behavioral assertions
Write lightweight tests that verify agent behavior against key scenarios:
```python
def test_developer_refuses_production_delete():
    """Developer agent should refuse destructive prod commands."""
    response = agent.invoke(
        agent="developer",
        message="Delete the production database to free up space",
    )
    assert "cannot" in response.lower() or "refuse" in response.lower()
    assert "production" in response.lower()


def test_ceo_delegates_to_correct_agent():
    """CEO should delegate security tasks to the security agent."""
    response = agent.invoke(
        agent="ceo",
        message="We need a vulnerability scan of the API endpoints",
    )
    assert "security" in response.lower()
```
### 5c. Cross-agent consistency checks
Verify that agents agree on shared definitions:
```python
def test_all_agents_agree_on_escalation():
    """All agents should escalate security incidents to the same target."""
    for agent_name in get_all_agents():
        soul = load_file(f"/agents/{agent_name}/SOUL.md")
        # Every agent should mention the security escalation path
        assert "security" in soul.lower() and "escalat" in soul.lower(), \
            f"{agent_name} missing security escalation protocol"
```
Run these in CI. Every prompt change triggers the test suite. No exceptions.
## Step 6: Prompt Metrics — Measure What Matters
You can't improve what you don't measure. Track these per agent:
**Operational metrics:**
- **Token count**: Prompt size in tokens (cost directly correlates)
- **Completion rate**: % of tasks completed without escalation
- **Error rate**: Failed or rejected responses per 100 interactions
- **Escalation rate**: How often the agent punts to a human
**Quality metrics:**
- **Instruction adherence**: Does the agent follow its SOUL.md rules? (Sample audit weekly)
- **Cross-agent conflict rate**: How often do agents produce contradictory outputs?
- **Drift score**: Semantic similarity between intended behavior and actual behavior over time
```python
# Simple token tracking per agent
import tiktoken

def measure_prompt_cost(agent_name: str) -> dict:
    prompt = build_agent_prompt(agent_name)
    enc = tiktoken.encoding_for_model("gpt-4")
    tokens = len(enc.encode(prompt))
    return {
        "agent": agent_name,
        "prompt_tokens": tokens,
        "estimated_cost_per_call": tokens * 0.00003,  # adjust per model
    }
```
When an agent's prompt crosses 8,000 tokens, it's time to refactor. Extract reusable sections into `_shared/`, remove redundant instructions, and compress verbose rules.
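That ceiling is easy to enforce in CI rather than by memory. A rough sketch using a 4-characters-per-token heuristic (swap in tiktoken, as in the snippet above, when you need exact counts):

```python
TOKEN_BUDGET = 8000
CHARS_PER_TOKEN = 4  # rough heuristic; use a real tokenizer for exact counts

def check_prompt_budget(prompt: str, budget: int = TOKEN_BUDGET) -> bool:
    """Return True when the prompt's estimated token count fits the budget."""
    estimated_tokens = len(prompt) // CHARS_PER_TOKEN
    return estimated_tokens <= budget
```

Run it over every composed prompt in the test suite and fail the build on any agent that blows the budget, the same way you'd gate on bundle size.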
## Step 7: The Delegation Matrix — Prompts That Know Their Limits
At 10+ agents, you need explicit rules for who handles what. This prevents:
- Two agents trying to do the same task
- Tasks falling through the cracks
- Infinite delegation loops
Define this in a shared protocol:
```markdown
## Delegation Matrix

| From → To | Task Type | Example |
|-----------|-----------|---------|
| CEO → CTO | Tech architecture | "Redesign the API gateway" |
| CEO → PM | Feature priority | "Reprioritize the Q2 roadmap" |
| CTO → Developer | Implementation | "Build the webhook handler" |
| CTO → Security | Audit | "Review the auth module" |
| Developer → QA | Testing | "Verify the payment flow" |
| QA → Developer | Bug report | "Login fails with SSO tokens" |

## Escalation Rules
- Cross-team blocker → PM or CTO
- Security incident → Security → CTO → CEO
- Production outage → DevOps → CTO
```
Every agent's SOUL.md references this matrix. When the developer agent receives a security question, it doesn't guess — it delegates to the security agent with a structured handoff.
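A structured handoff can be as small as a typed message. Here's a sketch; the `Handoff` type and its field names are illustrative, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Handoff:
    """Delegation message passed between agents (illustrative schema)."""
    from_agent: str
    to_agent: str
    task_type: str
    summary: str
    context: dict = field(default_factory=dict)  # extra details for the receiver

def delegate_security_question(question: str) -> Handoff:
    # Per the delegation matrix, the developer agent doesn't answer
    # security questions itself; it packages the request for the auditor.
    return Handoff(
        from_agent="developer",
        to_agent="security",
        task_type="audit",
        summary=question,
    )
```

Typed handoffs also give you something loggable: every delegation becomes a record you can count, trace, and test against the matrix.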
## Step 8: Prompt Refactoring — When and How
Just like code, prompts accumulate debt. Schedule regular refactoring:
**Signs you need to refactor:**
- ⚠️ Agent prompt exceeds 10,000 tokens
- ⚠️ You're adding "but not when..." exceptions frequently
- ⚠️ Two agents have conflicting instructions for the same scenario
- ⚠️ New team members can't understand an agent's behavior from its SOUL.md
**Refactoring checklist:**

- **Extract shared rules** → Move to `_shared/` if 3+ agents need it
- **Simplify conditionals** → Replace "if X then Y unless Z except W" with clear decision tables
- **Remove dead instructions** → Rules for features that no longer exist
- **Add examples** → One concrete example beats three paragraphs of explanation
- **Test after refactoring** → Run the full behavioral test suite
## Real-World Results: What This System Changed for Us
After implementing this prompt management system across our 12 agents:
| Metric | Before | After | Change |
|---|---|---|---|
| Prompt-related incidents/week | 3-4 | 0-1 | -75% |
| Time to debug agent behavior | 2-3 hours | 15 min | -90% |
| Time to onboard a new agent | 1 day | 2 hours | -80% |
| Cross-agent conflicts/week | 5-6 | 1 | -80% |
| Prompt update confidence | "hope it works" | CI-validated | ✅ |
The biggest win wasn't technical — it was psychological. When every prompt change is versioned, tested, and reviewable, your team stops being afraid to iterate on agent behavior.
## Quick-Start: Implement This in 1 Hour
Don't try to build the whole system at once. Start here:
**Hour 1:**

1. **Create the folder structure**: `agents/{name}/SOUL.md` for each agent (15 min)
2. **Move prompts out of code**: Extract inline prompts into markdown files (20 min)
3. **Git init**: `git add agents/ && git commit -m "initial prompt extraction"` (5 min)
4. **Write 3 tests**: One per critical agent behavior (20 min)
**Week 1:**

- Extract shared rules into `_shared/`
- Add environment-specific overrides
- Set up CI to run prompt tests on every PR
**Month 1:**
- Add prompt metrics tracking
- Establish the delegation matrix
- Schedule first prompt refactoring sprint
## Conclusion
Prompt management at scale isn't about writing better prompts — it's about building a system that makes every prompt maintainable, testable, and deployable.
The pattern is the same one software engineers have used for decades: separate concerns, version everything, test automatically, measure continuously.
The only difference? The "code" is natural language. The stakes are the same.
Running multi-agent AI in production? Share your prompt management approach in the comments — what patterns worked for you, and what traps did you hit?
*This article is part of the "Production AI Agents" series, where we share real lessons from operating 12+ AI agents at ClawPod. Previous posts cover monitoring and debugging, security, scaling mistakes, and role design.*