AI makes mistakes. It hallucinates. It misses things. It gets stuck in loops.
What if AI could catch its own errors?
That's the Reflection pattern.
The problem
Standard AI agents act and move on. They don't look back.
User: "Write a function to calculate factorial"
AI: [writes code]
def factorial(n):
    return n * factorial(n-1)  # Bug: no base case
AI: "Here's your function!"
# Infinite recursion. AI didn't notice.
The AI completed the task but didn't verify its work. No human would submit code without at least glancing at it.
Reflection: Think twice
Reflection adds a verification step. After acting, AI reviews its own output.
User: "Write a function to calculate factorial"
AI: [writes code]
def factorial(n):
    return n * factorial(n-1)
AI [REFLECT]: "Let me check this...
- Does it handle the base case? No, missing n == 0 or n == 1
- Will it terminate? No, infinite recursion
- This is wrong. Let me fix it."
AI: [rewrites code]
def factorial(n):
    if n <= 1:
        return 1
    return n * factorial(n-1)
AI: "Here's your function!"
Same AI, better result. Just by thinking twice.
How it works
┌─────────┐     ┌─────────┐     ┌─────────┐
│  Task   │────→│  Actor  │────→│ Output  │
└─────────┘     └─────────┘     └────┬────┘
                                     │
                                     ▼
                               ┌───────────┐
                               │ Reflector │
                               └─────┬─────┘
                                     │
                     ┌───────────────┼───────────────┐
                     │               │               │
                     ▼               ▼               ▼
               [Looks good]     [Minor fix]    [Major issue]
                     │               │               │
                     ▼               ▼               ▼
                  Return       Fix & Return      Redo task
- AI does the task (Actor)
- AI reviews its work (Reflector)
- Based on review: accept, fix, or redo
Implementation
Basic reflection
def reflect_and_act(task, tools, max_retries=3):
    # Step 1: Act
    result = actor.run(task, tools)

    # Step 2: Reflect
    reflection = reflector.analyze(
        task=task,
        result=result,
        prompt="""Review this output:
        - Does it correctly solve the task?
        - Are there any errors or issues?
        - What could be improved?
        Rate: GOOD / NEEDS_FIX / REDO"""
    )

    # Step 3: Handle reflection
    if reflection.rating == "GOOD":
        return result
    elif reflection.rating == "NEEDS_FIX":
        return actor.fix(result, reflection.feedback)
    elif max_retries > 0:  # REDO: try again, but don't loop forever
        return reflect_and_act(task, tools, max_retries - 1)
    return result  # out of retries; return the best attempt we have
Reflection prompt
The reflector needs clear criteria:
REFLECTION_PROMPT = """
You are reviewing work done by an AI assistant.
Original task: {task}
Output produced: {output}
Evaluate the output:
1. CORRECTNESS: Does it solve the task correctly?
2. COMPLETENESS: Is anything missing?
3. ERRORS: Are there bugs, mistakes, or issues?
4. QUALITY: Is it well-structured and clear?
Based on your evaluation:
- GOOD: Output is correct and complete
- NEEDS_FIX: Minor issues that can be fixed
- REDO: Fundamentally wrong, needs complete redo
Provide specific feedback on what's wrong and how to fix it.
"""
With MCP tools
Reflection works great with tool use. The reflector can verify tool results.
def execute_with_reflection(task, mcp_client, max_retries=3):
    # Actor uses tools
    result = actor.run(
        task=task,
        tools=mcp_client.get_tools()
    )

    # Reflector checks the work
    reflection = reflector.analyze(
        task=task,
        tool_calls=result.tool_calls,
        output=result.output
    )

    if reflection.needs_verification:
        # Use tools to verify
        verification = mcp_client.call_tool(
            "verify_result",
            {"expected": task, "actual": result.output}
        )
        if not verification.passed and max_retries > 0:
            return execute_with_reflection(task, mcp_client, max_retries - 1)  # Retry

    return result
Example: Code generation
Task: "Write a function to find the second largest number in a list"
Without reflection:
AI: Here's the function:
def second_largest(nums):
    sorted_nums = sorted(nums, reverse=True)
    return sorted_nums[1]

# Bugs:
# - Fails on empty list
# - Fails on single element
# - Doesn't handle duplicates: [5, 5, 3] returns 5, not 3
With reflection:
AI [ACT]: Here's the function:
def second_largest(nums):
    sorted_nums = sorted(nums, reverse=True)
    return sorted_nums[1]
AI [REFLECT]: Let me verify this...
- Empty list: sorted_nums[1] will IndexError ❌
- Single element: same issue ❌
- Duplicates [5,5,3]: returns 5, should return 3 ❌
- Rating: REDO
AI [ACT]: Let me rewrite:
def second_largest(nums):
    if len(nums) < 2:
        raise ValueError("Need at least 2 numbers")
    unique = list(set(nums))
    if len(unique) < 2:
        raise ValueError("Need at least 2 unique numbers")
    unique.sort(reverse=True)
    return unique[1]
AI [REFLECT]: Checking again...
- Empty list: raises ValueError ✓
- Single element: raises ValueError ✓
- Duplicates: handles correctly ✓
- Rating: GOOD
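The reflection above is pure reasoning, but the same edge cases can be turned into checks the reflector actually runs. A sketch, using only the cases from the critique:
def passes_edge_cases(fn):
    # The cases the reflector reasoned about, run for real
    for nums, expected in [([3, 1, 5, 2], 3), ([5, 5, 3], 3)]:
        if fn(nums) != expected:
            return False
    # Inputs with fewer than two unique values should raise, not return garbage
    for bad in ([], [7], [4, 4]):
        try:
            fn(bad)
            return False
        except ValueError:
            pass
    return True
If passes_edge_cases(second_largest) comes back False, the reflector rates the output REDO instead of arguing about it.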
Example: Tool use
Task: "Find all users who signed up last month and email them a survey"
Without reflection:
AI: [calls get_users tool]
AI: [calls send_email tool for each user]
AI: "Done! Sent emails to 150 users."
# Problem: Didn't verify the date filter worked
# Actually sent to ALL users, not just last month
With reflection:
AI [ACT]:
[calls get_users(filter="signup_date > 2024-11-01")]
Got 150 users
AI [REFLECT]: Wait, let me verify this is correct.
[calls count_users(filter="signup_date > 2024-11-01")]
Count: 42
That doesn't match. 150 vs 42. Something's wrong.
Let me check the first query...
[calls get_users with correct filter]
Got 42 users. The first query ignored my filter.
AI [ACT]:
[calls send_email for 42 users]
AI [REFLECT]: Verify emails sent:
[calls get_email_log(campaign="survey")]
42 emails sent. Matches user count. ✓
AI: "Done! Sent survey to 42 users who signed up last month."
Reflection strategies
Self-critique
AI critiques its own output.
reflection = llm.create(
    messages=[
        {"role": "assistant", "content": output},
        {"role": "user", "content": "Critique your response. What's wrong?"}
    ]
)
Verification tools
Use tools to verify results.
tools:
  - name: run_tests
    description: Run tests against code
  - name: validate_json
    description: Check if JSON is valid
  - name: check_sql
    description: Verify SQL query syntax
  - name: count_results
    description: Verify expected count
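A reflector that leans on one of these looks less like a critic and more like a test runner. A sketch of how run_tests might plug into the loop, assuming the tool result carries a passed flag:
def generate_with_tests(task, tests, mcp_client, attempts=3):
    for _ in range(attempts):
        code = actor.run(task, tools=mcp_client.get_tools())
        # Let the run_tests tool do the judging instead of the model
        result = mcp_client.call_tool("run_tests", {"code": code, "tests": tests})
        if result.passed:
            return code
    return code  # best effort after running out of attempts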
Multi-perspective
Different prompts catch different issues.
perspectives = [
    "Check for logical errors",
    "Check for edge cases",
    "Check for security issues",
    "Check for performance issues"
]

for perspective in perspectives:
    reflection = reflector.analyze(output, focus=perspective)
    if reflection.found_issue:
        output = fix(output, reflection)
Constitutional AI
Check against rules.
RULES = [
    "Output must be valid JSON",
    "No sensitive data in response",
    "Must handle empty input",
    "Response under 1000 tokens"
]

for rule in RULES:
    if not check_rule(output, rule):
        output = fix_for_rule(output, rule)
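check_rule and fix_for_rule are left abstract above. A minimal LLM-backed version might look like this, reusing the generic llm.create call from the self-critique example (the call shape and .content attribute are assumptions, not a specific SDK):
def check_rule(output, rule):
    # Ask for a strict yes/no verdict on a single rule
    response = llm.create(messages=[{
        "role": "user",
        "content": f"Rule: {rule}\n\nOutput:\n{output}\n\nDoes the output satisfy the rule? Answer YES or NO."
    }])
    return response.content.strip().upper().startswith("YES")

def fix_for_rule(output, rule):
    # Ask for a minimal rewrite that makes the rule hold
    response = llm.create(messages=[{
        "role": "user",
        "content": f"Rewrite this output so it satisfies the rule '{rule}':\n\n{output}"
    }])
    return response.content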
When to use reflection
Use reflection when:
- Output quality matters
- Mistakes are costly
- Tasks are complex
- AI tends to make specific errors
- Verification is possible
Skip reflection when:
- Simple, low-stakes tasks
- Speed is critical
- Output is obviously correct
- Resources are limited
Costs and trade-offs
More tokens
Reflection means more LLM calls.
Without: 1 call
With: 2-3 calls (act + reflect + maybe fix)
More latency
Extra round trips take time.
Without: 500ms
With: 1000-1500ms
When it's worth it
Cost of reflection: ~2x tokens/latency
Cost of wrong output: Much higher
If mistakes matter, reflect.
Integration with MCP
MCP tools can both cause errors and help catch them.
Tools that help reflection:
tools:
  - name: validate_output
    description: Check if output meets requirements
    parameters:
      - name: output
        type: string
      - name: requirements
        type: string
  - name: run_tests
    description: Execute test cases
    parameters:
      - name: code
        type: string
      - name: tests
        type: array
  - name: diff_check
    description: Compare expected vs actual
    parameters:
      - name: expected
        type: string
      - name: actual
        type: string
  - name: syntax_check
    description: Validate code/data syntax
    parameters:
      - name: content
        type: string
      - name: type
        type: string
Run these with Gantz Run to give your reflector verification capabilities.
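As a sketch of what calling one of them might look like, assuming validate_output returns a result with passed and feedback fields:
def reflect_with_tool(task, output, mcp_client):
    # Ask an external tool to judge the output against the task's requirements
    check = mcp_client.call_tool(
        "validate_output",
        {"output": output, "requirements": task}
    )
    if check.passed:
        return output
    # Feed the tool's feedback back to the actor for one repair pass
    return actor.fix(output, check.feedback)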
Patterns
Reflect-then-act
Reflect on the plan before executing.
Plan → Reflect on plan → Execute → Done
Act-then-reflect
Reflect on the result after executing.
Execute → Reflect on result → Fix if needed → Done
Continuous reflection
Reflect at every step.
Act → Reflect → Act → Reflect → Act → Reflect → Done
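A sketch of that loop, assuming the task is already broken into steps and reusing the actor and reflector objects from earlier:
def continuous_reflection(steps, tools):
    results = []
    for step in steps:
        result = actor.run(step, tools)
        # Reflect after every step, not just at the end
        reflection = reflector.analyze(task=step, result=result)
        if reflection.rating != "GOOD":
            result = actor.fix(result, reflection.feedback)
        results.append(result)
    return results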
Ensemble reflection
Multiple reflectors vote.
Output → Reflector A → Good
       → Reflector B → Good
       → Reflector C → Needs fix
Majority: Good (2/3)
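A voting sketch, assuming each reflector returns a rating string like the ones above:
from collections import Counter

def ensemble_rating(task, output, reflectors):
    # Each reflector judges independently; the most common rating wins
    ratings = [r.analyze(task=task, result=output).rating for r in reflectors]
    rating, votes = Counter(ratings).most_common(1)[0]
    return rating, f"{votes}/{len(ratings)} reflectors agree"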
Summary
Reflection makes AI check its work.
Without reflection:
Task → Act → Output (might be wrong)
With reflection:
Task → Act → Output → Reflect → Fix → Verified Output
Benefits:
- Catches errors before delivery
- Improves output quality
- Handles edge cases
- Builds trust
Costs:
- More tokens
- More latency
- More complexity
For anything that matters, reflection is worth it.
Using reflection in your agents? What patterns work best for you?