Six months ago, I built a multi-agent customer support system that handles 10,000+ conversations daily. It reduced response time from 4 hours to under 2 minutes. It now resolves 73% of tickets without human intervention.
But here's what the case study won't tell you: it almost failed spectacularly in week two. And the reason reveals everything about how NOT to design multi-agent systems.
The Architecture That Almost Died
Here's what I built initially:
Single Agent Architecture (FAILURE):

```
Customer Message → [Router Agent] → [Single Resolution Agent] → Response
```
Simple, right? One agent receives, one agent resolves.
Within two weeks, we hit three problems:
- The agent couldn't handle different timezones and urgency levels
- Complex issues (refunds + exchanges + account problems) required different knowledge bases
- Peak hours (Mondays, 9 AM) crashed the single agent
The fix was obvious: multiple specialized agents working together.
The Production Architecture
Here's what actually works:
```
Customer Message
        ↓
[ triage_agent ]          ← Fast, stateless, decides where to route
        ↓
  ┌─────┼──────┬──────┐
  ↓     ↓      ↓      ↓
[ billing ] [ shipping ] [ returns ] [ general ]   ← Specialized, stateful
  └─────┼──────┴──────┘
        ↓
[ resolution_agents ]     ← Generate response, check policies
        ↓
[ quality_check_agent ]   ← Final review before sending
        ↓
Response
```
Let me walk through each component.
1. The Triage Agent (Stateless Router)
The first decision point. It should be fast and stateless—no conversation history.
```python
class TriageAgent:
    SYSTEM_PROMPT = """
    You are a customer support triage specialist.
    Your ONLY job: read the incoming message and route it correctly.
    Do NOT try to solve the problem. Just classify and route.

    Categories:
    - BILLING: charges, payments, subscriptions, invoices, refunds
    - SHIPPING: delivery, tracking, addresses, delays
    - RETURNS: return policy, return requests, exchanges
    - GENERAL: account, login, password, other

    Urgency levels:
    - URGENT: money involved, legal keywords, explicit threats
    - HIGH: dissatisfaction markers, complaint patterns
    - NORMAL: standard requests

    Output ONLY this format:
    {
        "category": "BILLING|SHIPPING|RETURNS|GENERAL",
        "urgency": "URGENT|HIGH|NORMAL",
        "confidence": 0.0-1.0,
        "summary": "one sentence summary of the issue"
    }
    """

    def classify(self, message: str) -> dict:
        response = self.llm.chat([
            {"role": "system", "content": self.SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ])
        return self.parse_json(response)
```
This agent has one job and does it well. It's fast because it's stateless.
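The `parse_json` helper isn't shown above. Models sometimes wrap their JSON in code fences or prose, so a defensive version is worth having; here's a sketch (the fallback category values are assumptions based on the triage prompt above):

```python
import json
import re

def parse_json(response: str) -> dict:
    """Extract the first JSON object from an LLM response.

    Tries a direct parse first, then falls back to pulling the
    outermost {...} span out of any surrounding prose or fences.
    """
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match:
            return json.loads(match.group(0))
        # Last resort: route to GENERAL at NORMAL urgency rather than crash triage
        return {"category": "GENERAL", "urgency": "NORMAL",
                "confidence": 0.0, "summary": "unparseable triage output"}
```

The last-resort default matters: a failed parse should degrade to a safe route, not take down the pipeline.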
2. The Specialized Resolution Agents
Each category gets its own agent with specialized knowledge:
```python
class BillingAgent:
    def __init__(self):
        self.tools = [
            self.get_subscription_details,
            self.process_refund,
            self.update_payment_method,
            self.issue_credit,
        ]
        self.policy = load_billing_policy()

    def resolve(self, issue: dict, conversation_history: list) -> dict:
        # Check for edge cases first
        if self.is_high_risk_refund(issue):
            return self.escalate(issue, reason="refund_over_threshold")
        if self.requires_manager_approval(issue):
            return self.flag_for_review(issue)
        # Normal resolution path
        return self.generate_resolution(issue, conversation_history)

    def is_high_risk_refund(self, issue: dict) -> bool:
        refund_amount = issue.get("amount", 0)
        customer_tier = self.get_customer_tier(issue["customer_id"])
        days_since_purchase = self.get_days_since_purchase(issue)
        return (
            refund_amount > 500
            or (customer_tier == "basic" and refund_amount > 100)
            or days_since_purchase > 30
        )
```
Notice: the agent has tool access, not just text generation. It actually does things.
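How the model's tool calls actually get dispatched isn't shown above; one minimal pattern is a name-to-callable registry. This is a sketch, not the production code, and `issue_credit`'s body is a hypothetical stand-in:

```python
from typing import Callable

class ToolRegistry:
    """Map tool names the model emits to real callables."""

    def __init__(self):
        self._tools: dict[str, Callable] = {}

    def register(self, fn: Callable) -> Callable:
        self._tools[fn.__name__] = fn
        return fn  # usable as a decorator

    def dispatch(self, name: str, args: dict):
        # Fail loudly on unknown names: never let the model invent tools
        if name not in self._tools:
            raise ValueError(f"unknown tool: {name}")
        return self._tools[name](**args)

registry = ToolRegistry()

@registry.register
def issue_credit(customer_id: str, amount: float) -> dict:
    # Hypothetical stand-in for a real billing API call
    return {"customer_id": customer_id, "credited": amount}
```

The "fail loudly" choice is deliberate: a hallucinated tool name should surface as an error you can log, not silently no-op.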
3. The Quality Check Agent
Before sending any response to a customer, it goes through review:
```python
class QualityCheckAgent:
    def review(self, response: str, original_issue: dict, customer_tier: str) -> dict:
        checks = {
            "tone_appropriate": self.check_tone(response, customer_tier),
            "policy_compliant": self.check_policy(response, original_issue),
            "no_hallucinations": self.verify_claims(response, original_issue),
            "complete": self.check_completeness(response, original_issue),
        }
        if all(checks.values()):
            return {"approved": True, "response": response}
        return {
            "approved": False,
            "needs_revision": True,
            "issues": [k for k, v in checks.items() if not v],
        }
```
This catches issues before customers see them.
4. The Orchestration Layer
The magic is in how these agents coordinate:
```python
class SupportOrchestrator:
    def __init__(self):
        self.triage = TriageAgent()
        self.resolvers = {
            "BILLING": BillingAgent(),
            "SHIPPING": ShippingAgent(),
            "RETURNS": ReturnsAgent(),
            "GENERAL": GeneralAgent(),
        }
        self.quality = QualityCheckAgent()
        self.human_escalation = HumanEscalationHandler()

    async def handle(self, message: str, customer_id: str) -> str:
        # Step 1: Fast triage
        classification = self.triage.classify(message)
        if classification["urgency"] == "URGENT":
            await self.human_escalation.notify(message, customer_id)

        # Step 2: Get specialized resolver
        resolver = self.resolvers[classification["category"]]

        # Step 3: Resolve with conversation context
        resolution = resolver.resolve(
            issue=classification,
            conversation_history=self.get_history(customer_id),
        )

        # Step 4: Quality check
        quality_result = self.quality.review(
            response=resolution["response"],
            original_issue=classification,
            customer_tier=self.get_customer_tier(customer_id),
        )
        # Check "approved" rather than "needs_revision": the approved
        # result dict doesn't carry a "needs_revision" key
        if not quality_result["approved"]:
            # Loop back with feedback
            resolution = resolver.revise(
                previous_response=resolution,
                quality_feedback=quality_result["issues"],
            )
        return resolution["response"]
```
What I'd Do Differently
Looking back, here's what I'd change:
1. Start with observability from day one
I added logging in week three. Should have been there from the start.
```python
import time

# Add this everywhere from day one
async def handle(self, message: str, customer_id: str) -> str:
    trace_id = generate_trace_id()
    start_time = time.time()
    logger.info({
        "trace_id": trace_id,
        "customer_id": customer_id,
        "message_preview": message[:100],
        "stage": "start",
    })
    try:
        result = await self._handle_impl(message, customer_id)
        logger.info({
            "trace_id": trace_id,
            "duration": time.time() - start_time,
            "success": True,
        })
        return result
    except Exception as e:
        logger.error({
            "trace_id": trace_id,
            "error": str(e),
            "stage": "failure",
        })
        raise
```
2. Plan for agent failures explicitly
```python
# Have a fallback agent ready
FALLBACK_RESOLVER = """
You are a general support agent. The specialized agent was unavailable.
Apologize briefly, then:
1. Acknowledge the customer's issue
2. Promise a human will follow up within 4 hours
3. Create a ticket for manual resolution
"""
```
3. Implement feedback loops
Track which resolutions worked and which didn't:
```python
# After customer interaction ends
def record_outcome(trace_id: str, customer_feedback: str):
    # Did they accept the resolution?
    # Did they escalate?
    # Did they express satisfaction?
    # Store for agent improvement
    ...
```
The Numbers
After 6 months in production:
- 73% of tickets resolved without human intervention
- Average response time: 1 minute 47 seconds
- Customer satisfaction: 4.2/5 (up from 3.1/5)
- Cost per ticket: $0.34 (down from $4.80)
- Peak load handling: 500 concurrent conversations
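On that last number: if you need a hard ceiling on concurrent conversations (recall that peak hours crashed the original single agent), a minimal sketch with `asyncio.Semaphore` does it. The 500 matches the figure above; the orchestrator interface is as defined earlier:

```python
import asyncio

MAX_CONCURRENT = 500  # ceiling matching the peak-load figure above

semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def handle_with_limit(orchestrator, message: str, customer_id: str) -> str:
    """Queue conversations beyond the cap instead of overloading the agents."""
    async with semaphore:
        return await orchestrator.handle(message, customer_id)
```

Callers past the cap simply wait in the semaphore's queue, which degrades latency gracefully instead of failing outright.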
The architecture isn't magic. It's just well-designed coordination between agents that each do one thing well.
If you're building a similar system, I've documented the full architecture, including error handling and deployment setup, in my AI Agent Engineering Playbook, along with the complete prompt templates and code patterns.
Building multi-agent systems is hard. But with the right architecture, it doesn't have to be painful.