The infrastructure layer your agents can't live without
Your agent just billed a user $38 on a single query.
Not because it did something complex.
Because it summarized the same document 47 times in a loop.
No crash. No alert. Just a growing invoice.
You check logs. The model worked exactly as expected.
The failure was everything around the model.
No memory. No state. No stop condition.
The system had no way to say:
"We’ve already done this."
That is the difference between a demo and production.
The gap nobody warns you about
Building a working agent is easy.
- Call an LLM
- Give it tools
- Add a loop
20 lines of Python. Done.
Then you ship it.
Reality hits:
- Tools return empty
- Context overflows
- Agents contradict themselves
- Infinite retries start
Nothing breaks in demos.
Everything breaks in production.
The problem is not the model.
It is the harness.
The real definition
Agent = Model + Harness
- Model → reasoning, decisions
- Harness → execution, control, safety, memory
If you're not building the model, you're building the harness.
And that’s where most failures live.
The 7 components that actually matter
1. Control Loop
This is the heartbeat.
Without it → chatbot
With it → agent
step_count = 0
while agent_is_running:
    step_count += 1
    if step_count > MAX_STEPS:
        return "Task incomplete. Max steps reached."

    response = call_model(context)

    if response.has_tool_calls:
        results = execute_tools(response.tool_calls)
        append_to_context(results)
        continue

    if response.is_final_answer:
        return response.content
Critical rule:
MAX_STEPS is non-negotiable.
No step limit = infinite billing.
2. State Management
Models are stateless.
You must track:
Session state
- conversation history
- tool outputs
- current step
Persistent state
- completed tasks
- progress
- processed files
Example:
{
  "task_id": "refactor-auth-module",
  "completed_files": ["auth.py"],
  "pending_files": ["routes.py"],
  "current_step": 3
}
Without this → agents repeat work endlessly.
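The state object above can be persisted with a few lines of Python. A minimal sketch, assuming a local JSON file as the store (the filename and helper names are illustrative, not from any specific framework):

```python
import json
from pathlib import Path

STATE_FILE = Path("agent_state.json")  # hypothetical location

def load_state() -> dict:
    """Resume from persistent state, or start fresh."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {
        "task_id": "refactor-auth-module",
        "completed_files": [],
        "pending_files": ["auth.py", "routes.py"],
        "current_step": 0,
    }

def mark_done(state: dict, filename: str) -> dict:
    """Move a file from pending to completed so it is never reprocessed."""
    if filename in state["pending_files"]:
        state["pending_files"].remove(filename)
        state["completed_files"].append(filename)
        state["current_step"] += 1
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return state
```

Note that `mark_done` is idempotent: calling it twice with the same file does nothing the second time. That is exactly the property that stops repeated work.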
3. Memory
State = what happened now
Memory = what survives later
Short-term
- conversation history
Long-term
- user preferences
- past outcomes
- domain knowledge
Typical flow:
Start:
Load memory → inject into prompt
During:
Maintain history
End:
Summarize → store
Without memory → users feel your agent is dumb.
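The load → inject → summarize → store flow can be sketched in a few functions. This assumes a JSON file as the long-term store; in a real harness the end-of-session summary would come from the model itself, and the file would be a database:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.json")  # hypothetical store

def load_memory() -> list:
    """Start of session: load long-term memory."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def build_prompt(task: str, memory: list) -> str:
    """Inject memory ahead of the task so the model 'remembers'."""
    notes = "\n".join(f"- {m}" for m in memory)
    return f"Known about this user:\n{notes}\n\nTask: {task}"

def store_summary(memory: list, summary: str) -> None:
    """End of session: append the summary and persist it."""
    memory.append(summary)
    MEMORY_FILE.write_text(json.dumps(memory))
```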
4. Tools (and the Bash Escape Hatch)
Tools convert language into action.
Bad tools:
- vague descriptions
- unclear usage
Good tools:
- clear purpose
- defined inputs/outputs
- explicit usage rules
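What a "good tool" looks like in practice: a JSON-schema-style definition of the kind most function-calling APIs accept. The tool itself (`search_news`) is a made-up example; field names follow common conventions and may need adapting to your provider:

```python
# Clear purpose, typed inputs, explicit usage rules -- all in one place.
search_news = {
    "name": "search_news",
    "description": (
        "Search recent news articles. Use ONLY for current events; "
        "do not use for general knowledge questions. "
        "Returns at most `limit` results."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "limit": {
                "type": "integer",
                "description": "Max results (1-10)",
                "minimum": 1,
                "maximum": 10,
            },
        },
        "required": ["query"],
    },
}
```

The usage rule ("Use ONLY for current events") lives in the description on purpose: the model reads it on every call.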
The real unlock: Bash
Instead of fixed tools → let the agent write its own tools dynamically.
This is powerful.
Also dangerous.
So you need:
Sandboxing
- isolated execution
- no host access
- safe parallel runs
Without sandbox → you are gambling.
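The harness-side contract looks something like this. Important caveat: a timeout is NOT isolation; real sandboxing means a container or microVM (Docker, gVisor, Firecracker) with no network and a restricted filesystem. This sketch only shows the shape of the wrapper, bounded time, captured output, no crash:

```python
import subprocess

def run_sandboxed(command: str, timeout_s: int = 10) -> str:
    """Run an agent-generated shell command with a hard timeout.

    NOTE: this alone is not a sandbox -- in production, run it inside
    an isolated container with no host access. The point here is the
    contract: the harness always gets a string back, never an exception.
    """
    try:
        result = subprocess.run(
            ["bash", "-c", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
        if result.returncode == 0:
            return result.stdout
        return f"ERROR: {result.stderr}"
    except subprocess.TimeoutExpired:
        return "ERROR: command timed out"
```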
5. Context Management
Silent killer.
Everything works... until it doesn’t.
Why?
Context fills up → important instructions get buried.
Solutions
1. Compaction
- summarize old messages
- keep system prompt intact
2. Truncation
- limit tool outputs
- store full data externally
3. Progressive loading
- load tools only when needed
Rule:
Never lose the task definition.
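Compaction and truncation can be sketched together. The thresholds are assumptions to tune per model, and the summary here is a placeholder; a real harness would ask the model to write it:

```python
MAX_TOOL_CHARS = 2000  # assumption: tune to your context window

def truncate_tool_output(output: str, limit: int = MAX_TOOL_CHARS) -> str:
    """Keep the head of a long tool result; park the rest externally."""
    if len(output) <= limit:
        return output
    dropped = len(output) - limit
    return output[:limit] + f"\n[truncated {dropped} chars; full output stored externally]"

def compact(messages: list, keep_recent: int = 6) -> list:
    """Summarize old messages -- but the system prompt (the task
    definition) always survives compaction intact."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return messages
    old, recent = rest[:-keep_recent], rest[-keep_recent:]
    # Placeholder; a real harness would have the model summarize `old`.
    summary = {"role": "user", "content": f"[Summary of {len(old)} earlier messages]"}
    return system + [summary] + recent
```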
6. Planning
Without planning → chaos.
Agent takes steps, but not the right ones.
Plan file pattern
task: Migrate database
steps:
  - Backup DB [x]
  - Run migration [ ]
  - Verify data [ ]
current_step: 2
Each loop:
- inject plan
- update progress
- verify step
Key concept: Self-verification
After each step:
- run tests
- validate output
Agents that verify → reliable
Agents that assume → break
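The verify-don't-assume pattern fits in a small wrapper. Both callables are placeholders for your own step and check logic (run tests, validate a schema, count rows, whatever fits the step):

```python
def execute_with_verification(step_fn, check_fn, max_attempts: int = 2):
    """Run a step, then verify it. Retry on failure instead of
    letting an unverified result flow into the next step."""
    result = None
    for attempt in range(max_attempts):
        result = step_fn()
        if check_fn(result):
            return result
    raise RuntimeError("Step failed verification after retries")
```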
Ralph Loop (important)
When context ends:
- reload goal
- continue from state
This enables long-running agents.
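A minimal sketch of the shape of that loop. All four callables are stand-ins for your own harness functions; the key idea is that the goal and state live outside any single context window:

```python
def ralph_loop(load_goal, load_state, run_session, save_state):
    """Each iteration is a fresh context: reload the goal, resume
    from persisted state, run until the window is exhausted, save.
    Repeat until the task reports it is done."""
    while True:
        goal = load_goal()
        state = load_state()
        state, done = run_session(goal, state)
        save_state(state)
        if done:
            return state
```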
7. Error Handling
Reality is messy.
Things will fail.
You must define behavior for each failure.
Example logic
Tool fails:
Retry (if temporary)
Switch approach (if data issue)
Escalate (if blocked)
Malformed output:
Retry with constraints
Fallback after 3 attempts
Loop detected:
Interrupt execution
Low confidence:
Send to human review
No error handling = hallucination or silent failure.
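The malformed-output branch, retry with constraints, fallback after 3 attempts, looks roughly like this. `call_model` is a placeholder for your LLM client:

```python
import json

def parse_with_retries(call_model, prompt: str, max_attempts: int = 3):
    """Retry with a tightened instruction on malformed output,
    then fall back to an explicit escalation instead of crashing
    or silently passing garbage downstream."""
    constraint = ""
    for attempt in range(max_attempts):
        raw = call_model(prompt + constraint)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            constraint = "\nRespond with VALID JSON only. No prose."
    return {"error": "model_output_unparseable", "escalate": True}
```

The fallback value is deliberately structured: downstream code can branch on `escalate` and route to human review.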
What actually happens in production
Example task:
"Summarize EU AI regulation news"
Flow:
- Plan created
- State initialized
- Search executed
- Articles fetched
- Context managed
- Summary generated
- Verification step
- Final output
- Memory updated
The model writes.
The harness makes it reliable.
Common failure cases
Infinite loops
Fix → step limits + repetition detection
Tool misuse
Fix → better tool descriptions
Context overflow
Fix → compaction strategy
Hallucination with tools
Fix → enforce tool usage rules
Latency explosion
Fix → parallel tool execution
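Repetition detection from the list above can be sketched with a sliding window over recent actions. The thresholds are assumptions to tune; the fingerprint (tool name + arguments) is the simplest choice and catches the "summarized the same document 47 times" case from the opening:

```python
from collections import deque

class LoopDetector:
    """Tracks a fingerprint of recent (tool, args) actions; if the
    same action repeats too often in the window, the harness should
    interrupt execution."""

    def __init__(self, window: int = 10, max_repeats: int = 3):
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def record(self, tool: str, args: str) -> bool:
        """Record an action; return True if a loop is detected."""
        fingerprint = (tool, args)
        self.recent.append(fingerprint)
        return self.recent.count(fingerprint) >= self.max_repeats
```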
Hidden truth: Model ≠ independent
Modern agents are trained with harnesses.
Meaning:
- Models adapt to specific tool patterns
- Changing harness can reduce performance
Same model. Different harness. Different results.
When NOT to use agents
Be honest here.
Use deterministic pipelines when:
- steps are fixed
- outputs are predictable
Use humans when:
- mistakes are costly
Avoid agents when:
- task is structured and rule-based
If every step is predefined, an agent is overkill.
Where to start (practical order)
- Control loop + step limit
- State tracking
- Small toolset
- Error handling
- Context management
- Memory
- Planning
Skip order → debug hell.
Final thought
The real test is not:
What happens when everything works?
It is:
What happens when things break?
Models will improve.
Harness design is what makes them usable.
The model is not your agent.
The harness is.