Diven Rastdus
5 Architecture Patterns for Production AI Agents (That Actually Work)

Most AI agent demos look great in a tweet. Then you deploy them and everything breaks.

I have built six AI agent systems in the last three weeks. Code review agents, research automation, form navigators, interview coaches, and developer briefing tools. Some worked. Some failed badly. Here is what I learned about making agents reliable in production.

1. The Fallback Chain (Never Trust One Model)

Your agent will hit rate limits. The model will go down. Your credits will expire at 2am on a Sunday.

The fix is a fallback chain. Not "retry the same model" but "try a completely different provider."

PROVIDERS = [
    {"name": "primary", "client": anthropic_client, "model": "claude-sonnet-4-6"},
    {"name": "secondary", "client": openai_client, "model": "gpt-4o"},
    {"name": "tertiary", "client": google_client, "model": "gemini-2.5-flash"},
]

async def generate(prompt: str) -> str:
    for provider in PROVIDERS:
        try:
            # Pass the provider's configured model explicitly so each entry
            # in the chain actually uses its own settings.
            return await provider["client"].generate(prompt, model=provider["model"])
        except (RateLimitError, APIError) as e:
            logger.warning(f"{provider['name']} failed: {e}")
            continue
    raise AllProvidersFailedError("Every provider is down")

Every agent I have shipped uses this pattern. The primary model handles 95% of requests. The fallback fires maybe once a week. But that one time is the difference between "it works" and a 3am page.

Key insight: Each provider has different strengths. Claude is better at following complex instructions. GPT-4o is faster for simple tasks. Gemini is cheapest for high volume. Design your fallback order around your specific workload.
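One way to act on that insight is to keep a separate provider ordering per workload and reorder the chain before calling `generate`. A minimal sketch; the task names and priority choices here are illustrative, not from any particular deployment:

```python
# Hypothetical workload-to-ordering map. Adjust to your own latency and
# cost measurements; these orderings are only examples.
WORKLOAD_CHAINS = {
    "complex_reasoning": ["primary", "secondary", "tertiary"],  # Claude first
    "high_volume":       ["tertiary", "secondary", "primary"],  # Gemini first (cost)
    "low_latency":       ["secondary", "tertiary", "primary"],  # GPT-4o first (speed)
}

def providers_for(task_type: str, providers: list[dict]) -> list[dict]:
    """Return the fallback chain reordered for the given workload.

    Falls back to the default registration order for unknown task types.
    """
    order = WORKLOAD_CHAINS.get(task_type, [p["name"] for p in providers])
    by_name = {p["name"]: p for p in providers}
    return [by_name[name] for name in order if name in by_name]
```

The fallback loop itself stays unchanged; only the ordering it iterates over varies by task.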

2. Tool Call Sandboxing (Agents Will Do Stupid Things)

Give an AI agent access to tools and it will eventually call them wrong. I watched an agent try to delete a production database because it misinterpreted "clean up the test data."

Every tool call needs three things:

class ToolExecutor:
    def execute(self, tool_name: str, args: dict) -> dict:
        # 1. Validate inputs BEFORE execution
        schema = self.tool_schemas[tool_name]
        validate(args, schema)  # JSONSchema validation

        # 2. Check permissions
        if tool_name in DESTRUCTIVE_TOOLS:
            if not self.user_approved(tool_name, args):
                return {"error": "Requires explicit approval"}

        # 3. Timeout and resource limits
        # (`timeout` is a custom context manager, not a stdlib builtin)
        with timeout(seconds=30):
            result = self.tools[tool_name](**args)

        return result

The pattern is: validate, authorize, limit. Never skip any of these.

I built a code review agent that runs shell commands to check test results. Without the timeout, it once sat for 40 minutes running an infinite loop a developer accidentally committed. With a 30 second timeout, it fails fast and reports the issue.
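The `timeout(seconds=...)` context manager used above is not a Python builtin. A minimal Unix-only sketch based on `signal.alarm` (only safe in the main thread) could look like this:

```python
import signal
from contextlib import contextmanager

@contextmanager
def timeout(seconds: int):
    """Raise TimeoutError if the wrapped block runs longer than `seconds`.

    Unix-only: relies on SIGALRM, and only works in the main thread.
    """
    def _handler(signum, frame):
        raise TimeoutError(f"tool call exceeded {seconds}s")

    old_handler = signal.signal(signal.SIGALRM, _handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

In a threaded server, swap this for `concurrent.futures` with a timeout, or run the tool in a subprocess you can kill; SIGALRM does not compose with threads.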

3. Context Window Management (The Silent Killer)

Your agent works perfectly with 3 tools. You add a 4th and everything gets worse. Not because the tool is bad but because you are burning context on tool definitions.

The math is brutal. Each tool definition costs 200-500 tokens. Ten tools is 2,000-5,000 tokens of your context window gone before the user says anything. Add conversation history and you are at 50% capacity before doing real work.
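The arithmetic above is trivial but worth making explicit, since it is easy to ignore while adding "just one more tool" (the 200-500 figure is the rough estimate from the paragraph, not a measurement):

```python
def tool_definition_tokens(num_tools: int, low: int = 200, high: int = 500) -> tuple[int, int]:
    """Estimated token range consumed by tool definitions alone,
    before the user has said anything. Per-tool costs of 200-500
    tokens are a rough estimate; measure your own schemas."""
    return num_tools * low, num_tools * high
```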

The fix is lazy tool loading:

class AgentToolRegistry:
    def __init__(self):
        self.all_tools = {}       # Full registry
        self.active_tools = {}    # Currently loaded

    def activate_for_task(self, task_type: str):
        """Only load tools relevant to the current task."""
        relevant = TASK_TOOL_MAP.get(task_type, [])
        self.active_tools = {
            name: self.all_tools[name]
            for name in relevant
        }

    def get_definitions(self) -> list:
        """Return only active tool definitions to the model."""
        return [t.schema for t in self.active_tools.values()]

When a user asks about code review, load the git and testing tools. When they ask about deployment, load the infrastructure tools. Never load everything at once.
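Wired together, the flow looks like this. The tool names and `TASK_TOOL_MAP` entries are made up for illustration, and the registry repeats the class above with a small `register` helper so the sketch is self-contained:

```python
# Hypothetical mapping from task type to the tools it actually needs.
TASK_TOOL_MAP = {
    "code_review": ["git_diff", "run_tests"],
    "deployment":  ["provision", "rollout_status"],
}

class Tool:
    """Stand-in for a real tool; only the schema matters to the model."""
    def __init__(self, name: str):
        self.schema = {"name": name, "input_schema": {"type": "object", "properties": {}}}

class AgentToolRegistry:
    def __init__(self):
        self.all_tools = {}       # Full registry
        self.active_tools = {}    # Currently loaded

    def register(self, name: str):
        self.all_tools[name] = Tool(name)

    def activate_for_task(self, task_type: str):
        """Only load tools relevant to the current task."""
        relevant = TASK_TOOL_MAP.get(task_type, [])
        self.active_tools = {name: self.all_tools[name] for name in relevant}

    def get_definitions(self) -> list:
        """Return only active tool definitions to the model."""
        return [t.schema for t in self.active_tools.values()]

registry = AgentToolRegistry()
for name in ["git_diff", "run_tests", "provision", "rollout_status", "search_docs"]:
    registry.register(name)

registry.activate_for_task("code_review")
# Only the two code-review tools are sent to the model, not all five.
```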

4. Structured Output with Retry (Parse or Die)

Agents return free text. Your application needs structured data. This gap is where half your bugs come from.

Do not regex your way through agent output. Use structured output with validation and retry:

from typing import Literal

from pydantic import BaseModel, ValidationError

class CodeReviewResult(BaseModel):
    issues: list[Issue]
    severity: Literal["critical", "major", "minor", "suggestion"]
    approved: bool
    summary: str

async def review_code(diff: str) -> CodeReviewResult:
    messages = [{"role": "user", "content": f"Review this diff:\n{diff}"}]
    for attempt in range(3):
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            messages=messages,
            # Force JSON output matching our schema (response_format is the
            # OpenAI-style knob; with Anthropic, force a tool call instead)
            response_format={"type": "json_object"}
        )
        try:
            return CodeReviewResult.model_validate_json(response.content)
        except ValidationError as e:
            # Feed the model its own output plus the error, then retry
            messages.append({"role": "assistant", "content": response.content})
            messages.append({
                "role": "user",
                "content": f"Your response failed validation: {e}. Fix and retry."
            })
    raise ParseError("Failed to get valid output after 3 attempts")

Three attempts with error feedback. The model almost always self-corrects on the second try. If it fails three times, the input is probably ambiguous and needs human review anyway.

5. State Persistence (Agents Forget Everything)

Agents have no memory between sessions. Every conversation starts from zero. For a chatbot this is fine. For a production system doing multi-step workflows, it is a disaster.

The solution is explicit state management:

class AgentState:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.store = RedisStore()

    async def checkpoint(self, step: str, data: dict):
        """Save progress after each step."""
        state = await self.store.get(self.session_id) or {}
        state["completed_steps"] = state.get("completed_steps", [])
        state["completed_steps"].append(step)
        state["last_data"] = data
        state["updated_at"] = datetime.utcnow().isoformat()
        await self.store.set(self.session_id, state)

    async def resume(self) -> dict:
        """Pick up where we left off."""
        state = await self.store.get(self.session_id)
        if not state:
            return {"completed_steps": [], "last_data": {}}
        return state

Every step the agent completes gets checkpointed. If the process crashes (and it will), you restart from the last checkpoint instead of from scratch. This is especially important for agents that make API calls with side effects. You do not want to send the same email twice because your agent lost its place.
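A resumable workflow on top of that state class might look like the sketch below. An in-memory dict stands in for Redis so the example is self-contained, and the step names are illustrative:

```python
import asyncio
from datetime import datetime, timezone

class DictStore:
    """In-memory stand-in for RedisStore, just for this sketch."""
    def __init__(self):
        self._data = {}
    async def get(self, key):
        return self._data.get(key)
    async def set(self, key, value):
        self._data[key] = value

class AgentState:
    def __init__(self, session_id: str, store):
        self.session_id = session_id
        self.store = store

    async def checkpoint(self, step: str, data: dict):
        """Save progress after each step."""
        state = await self.store.get(self.session_id) or {}
        state.setdefault("completed_steps", []).append(step)
        state["last_data"] = data
        state["updated_at"] = datetime.now(timezone.utc).isoformat()
        await self.store.set(self.session_id, state)

    async def resume(self) -> dict:
        """Pick up where we left off."""
        return await self.store.get(self.session_id) or {
            "completed_steps": [], "last_data": {}
        }

async def run_workflow(state: AgentState, steps: list[str]):
    """Run each step once, skipping anything already checkpointed."""
    done = set((await state.resume())["completed_steps"])
    for step in steps:
        if step in done:
            continue  # already completed before the crash; do not redo it
        result = {"step": step}  # ... the real side-effecting work goes here ...
        await state.checkpoint(step, result)
```

If the process dies after "fetch", restarting `run_workflow` skips straight to "summarize"; the email in "send" only ever goes out once.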

What I Got Wrong

Two mistakes I keep making:

Building the agent before finding the problem. My first form navigator was a tech demo. "Look, AI reads screenshots!" Cool. Nobody needed it. The second version was a Chrome extension that overlays guidance on forms you are already filling out. Same AI, completely different product. The first version solved no problem. The second removes real friction.

Overloading the context window. I kept adding tools thinking "more capability is better." It is not. An agent with 5 focused tools outperforms one with 20 tools every time. The model spends less time deciding which tool to use and more time using them well.

The Stack That Works

For anyone building production agents today:

  • Framework: Build your own thin orchestration layer. LangChain adds complexity without enough value.
  • Models: Claude for complex reasoning, GPT-4o for speed, Gemini for cost. Use all three via fallback chains.
  • State: Redis for session state, PostgreSQL for persistent data, S3 for artifacts.
  • Monitoring: Log every tool call, every model response, every retry. You will need it for debugging.
  • Testing: Unit test your tools. Integration test the full agent loop. Never skip either.
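For the testing bullet, a unit test for a single tool can be very small. A pytest-style sketch; the `read_file_tool` here and its error contract are hypothetical, not from the systems described above:

```python
import os
import tempfile

# Hypothetical tool under test: returns file contents, or a structured
# error dict for a missing path (never raises to the agent loop).
def read_file_tool(args: dict) -> dict:
    path = args.get("path", "")
    if not os.path.isfile(path):
        return {"error": f"no such file: {path}"}
    with open(path) as f:
        return {"content": f.read()}

def test_read_file_tool_roundtrip():
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("hello")
        path = f.name
    assert read_file_tool({"path": path}) == {"content": "hello"}
    assert "error" in read_file_tool({"path": "/does/not/exist"})
```

The integration test then drives the whole agent loop against a fake model client, asserting on the sequence of tool calls rather than on model text.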

The agents that work in production are not the clever ones. They are the boring ones with good error handling, clear boundaries, and explicit state management.



I recently launched Rebill on Product Hunt. It recovers failed SaaS payments for $19/mo instead of $249/mo. If you run a SaaS, check it out.

I build production AI systems and full-stack applications. If you are looking for an AI engineer for a fixed-scope project, reach out at theagentthatcould@gmail.com or book a call at cal.com/diven-rastdus.
