bredmond1019

Posted on Jun 23 • Originally published at learn-agentic-ai.com

The 7 Building Blocks of Reliable AI Agents (Skip the Frameworks)

#ai #python #architecture #programming

When I started building multi-agent systems, I made the mistake most engineers make: I went looking for the right framework. I evaluated LangChain, AutoGPT, and a dozen other options. I built prototypes in each. I spent weeks on tooling instead of solving the actual problem.

The moment things clicked was when I stripped everything back to bare API calls and asked: what does every agent I've shipped actually need? The answer wasn't a framework. It was seven building blocks — patterns that show up in every reliable agentic system I've built, from the Python orchestration framework that drives parallel task execution to the SDLC agentic harness that ships production code.

Here's what I've learned from building and operating these systems: the best AI agents aren't "agentic" in the way the demos make them look. They're mostly deterministic code with surgical, well-scoped calls to language models.

The Two Types of AI Systems (And Why It Matters)

Before the building blocks, one distinction matters more than any other: whether a human is watching the system work.

AI Assistants (Human-in-the-Loop): Users interact directly, can correct mistakes in real time, and expect exploration. Multiple LLM calls make sense. Think pair programming with Cursor, or a chat interface.

Background Automation (No Human Oversight): Errors compound silently. Latency and cost are magnified. Every LLM call needs justification. This is the domain of production agents.

My Python orchestration framework and the support-automation system I built both fall into the second category. They run without a human watching every step. That constraint shaped every architectural decision — and it's why these seven building blocks matter.

# Background automation: minimize and isolate LLM calls
def process_support_ticket(ticket_data: dict) -> dict:
    # Deterministic routing first — no LLM needed here
    if ticket_data["type"] == "refund" and ticket_data["amount"] < 50:
        return handle_refund_automatically(ticket_data)

    # Use LLM only where reasoning is genuinely required
    if requires_classification(ticket_data):
        context = build_classification_context(ticket_data)
        classification = classify_with_llm(context)
        return route_by_classification(classification, ticket_data)

    return standard_response_template(ticket_data)

Most of what I ship is the outer shell — deterministic orchestration code. The LLM handles the reasoning that can't be hard-coded.

Building Block 1: The Intelligence Layer

This is the only actual AI component. Everything else is software engineering.

I use the Anthropic SDK directly — no intermediate abstractions. The API surface is small and stable. Every layer on top is a maintenance liability.

import anthropic

client = anthropic.Anthropic()

def get_llm_response(prompt: str) -> str:
    """The atomic unit of intelligence: a focused, scoped LLM call."""
    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return message.content[0].text

The hard part isn't the API call. It's everything around it: what you put in the prompt, how you validate what comes back, and how you handle failure. The remaining six building blocks cover exactly that.

Building Block 2: Memory (State Management)

Language models are stateless. They know nothing about previous calls unless you tell them. Memory management is your responsibility entirely.

In the Python orchestration framework, I maintain context across multi-step pipelines explicitly:

def chat_with_memory(conversation_history: list, new_message: str) -> tuple[str, list]:
    """Maintain conversation state across orchestration steps."""
    conversation_history.append({"role": "user", "content": new_message})

    message = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=1024,
        messages=conversation_history
    )

    assistant_message = message.content[0].text
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })

    # Keep context window bounded — trim oldest turns when needed
    if len(conversation_history) > 20:
        conversation_history = conversation_history[-20:]

    return assistant_message, conversation_history

In production, conversation history lives in the database, not in RAM. Each message carries a session ID and timestamp. Token counts are tracked explicitly so you can manage context windows before hitting limits — not after.
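As a rough illustration of database-backed memory, here is a minimal sketch that assumes SQLite as the store. The table layout, the function names (`open_store`, `save_message`, `load_history`), and the four-characters-per-token estimate are all assumptions made for the example, not the schema of any system described here.

```python
import sqlite3
import time

# Hypothetical schema: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL,
    role TEXT NOT NULL,
    content TEXT NOT NULL,
    token_count INTEGER NOT NULL,
    created_at REAL NOT NULL
)
"""

def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the message store."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def estimate_tokens(text: str) -> int:
    """Crude stand-in (about 4 characters per token); swap in a real count."""
    return max(1, len(text) // 4)

def save_message(conn: sqlite3.Connection, session_id: str,
                 role: str, content: str) -> None:
    """Persist one turn with its session ID, timestamp, and token count."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content, token_count, created_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (session_id, role, content, estimate_tokens(content), time.time()),
    )
    conn.commit()

def load_history(conn: sqlite3.Connection, session_id: str,
                 token_budget: int = 4000) -> list[dict]:
    """Return the newest turns that fit in token_budget, oldest first."""
    rows = conn.execute(
        "SELECT role, content, token_count FROM messages "
        "WHERE session_id = ? ORDER BY id DESC",
        (session_id,),
    ).fetchall()
    history, used = [], 0
    for role, content, tokens in rows:
        if used + tokens > token_budget:
            break
        history.append({"role": role, "content": content})
        used += tokens
    history.reverse()
    # Keep the trimmed history starting on a user turn.
    while history and history[0]["role"] != "user":
        history.pop(0)
    return history
```

Trimming by token budget rather than by message count means one very long turn cannot silently push the request over the context limit.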

Note:
In the SDLC agentic harness, each task worktree maintains its own context: the spec file, a running implementation report, and git history. The "memory" is the files on disk, not an in-memory list. This makes every step auditable and resumable.
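The files-on-disk idea can be sketched with a small checkpoint file per worktree. This is an illustration under assumptions, not the harness's actual layout: the file name `task_state.json`, the `completed_steps` key, and the helper names are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable

STATE_FILE = "task_state.json"  # hypothetical file name

def load_checkpoint(worktree: Path) -> dict:
    """Read the saved state, or start fresh if no checkpoint exists."""
    target = worktree / STATE_FILE
    if not target.exists():
        return {"completed_steps": []}
    return json.loads(target.read_text(encoding="utf-8"))

def save_checkpoint(worktree: Path, state: dict) -> None:
    """Write state to a temp file, then rename it over the target.

    os.replace swaps the file in one step, so a crash mid-write
    leaves the previous checkpoint intact.
    """
    fd, tmp_path = tempfile.mkstemp(dir=str(worktree), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
    os.replace(tmp_path, worktree / STATE_FILE)

def run_step(worktree: Path, name: str, action: Callable[[], None]) -> bool:
    """Run action unless a previous run already completed this step.

    Returns True if the step ran, False if it was skipped.
    """
    state = load_checkpoint(worktree)
    if name in state["completed_steps"]:
        return False
    action()
    state["completed_steps"].append(name)
    save_checkpoint(worktree, state)
    return True
```

Because the state is a plain file, rerunning the pipeline after a crash skips the steps already recorded, and the file itself can be inspected or committed alongside the work.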

Building Block 3: Tools (External Integration)

Tools let an agent take action in the world: read a file, call an API, query a database. They're powerful, and expensive. Every tool call costs at least two LLM round trips: one in which the model requests the tool, and one in which it reads the result.

The support-automation system I built uses tools conservatively. The vast majority of tickets route deterministically. Tools activate only when the agent genuinely can't proceed without external information.

tools = [
    {
        "name": "lookup_account_status",
        "description": "Look up a customer account's current status and plan details.",
        "input_schema": {
            "type": "object",
            "properties": {
                "account_id": {
                    "type": "string",
                    "description": "The customer account identifier"
                }
            },
            "required": ["account_id"]
        }
    }
]

def agent_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

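    # Process tool use if the model requested it
    if response.stop_reason == "tool_use":
        # One response can contain several tool_use blocks, and each one
        # needs a matching tool_result in the next user message.
        # execute_tool is assumed to return a string.
        tool_results = [
            {"type": "tool_result",
             "tool_use_id": block.id,
             "content": execute_tool(block.name, block.input)}
            for block in response.content if block.type == "tool_use"
        ]

        messages.extend([
            {"role": "assistant", "content": response.content},
            {"role": "user", "content": tool_results}
        ])

        final = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        # Join text blocks rather than assuming the first block is text
        return "".join(b.text for b in final.content if b.type == "text")

    return "".join(b.text for b in response.content if b.type == "text")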

The rule I follow: if a tool call can be replaced with a pre-query before the LLM sees the prompt, do that instead. Cheaper, faster, and deterministic.
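A minimal sketch of that pre-query rule, assuming the account ID is already known from the ticket: fetch the data in plain code and inline it into the prompt, so the model never has to ask for it. `build_prompt_with_account` and the injected `lookup_account` callable are hypothetical names used only for illustration.

```python
from typing import Callable

def build_prompt_with_account(
    ticket: dict,
    lookup_account: Callable[[str], dict],
) -> str:
    """Fetch account data deterministically and inline it into the prompt.

    lookup_account stands in for whatever database or API call would
    otherwise have been exposed to the model as a tool.
    """
    account = lookup_account(ticket["account_id"])
    # Sort keys so the same account always yields the same prompt text.
    facts = "\n".join(f"- {key}: {account[key]}" for key in sorted(account))
    return (
        "Account facts (fetched before this call):\n"
        f"{facts}\n\n"
        "Customer message:\n"
        f"{ticket['body']}"
    )
```

The result feeds a single model call, which replaces the request-then-result exchange a tool call would need.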

Building Block 4: Validation (Structured Output)

Unstructured LLM output is a liability in production systems. The fix is forcing structured responses and validating them before downstream code touches them.

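Anthropic's API supports structured output natively via tool use patterns. The example below takes the simpler route of asking for JSON in the system prompt and validating the result. I define schemas with Pydantic, which gives me type safety and validation logic in one place: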

from pydantic import BaseModel
from typing import Literal
import json

class TicketClassification(BaseModel):
    category: Literal["billing", "technical", "account", "general"]
    urgency: Literal["high", "medium", "low"]
    requires_human: bool
    summary: str

def classify_ticket(ticket_text: str) -> TicketClassification:
    """Structured classification: returns a validated object or raises."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        system=(
            "You are a support ticket classifier. "
            "Return a JSON object with these exact fields: "
            "category (billing/technical/account/general), "
            "urgency (high/medium/low), "
            "requires_human (boolean), "
            "summary (one sentence). Return only JSON."
        ),
        messages=[{"role": "user", "content": ticket_text}]
    )

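    raw = response.content[0].text
    # json.loads raises on non-JSON text; Pydantic raises ValidationError
    # on missing fields or values outside the Literal sets.
    data = json.loads(raw)
    return TicketClassification(**data)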

In the SDLC harness, every implement step returns a structured report. The orchestrator reads that report as data, not as prose. When an agent produces output that doesn't match the expected shape, it fails fast and clearly — not silently downstream.

Note:
Structured output is the single highest-leverage reliability improvement I've made to agentic systems. It converts probabilistic text generation into deterministic data extraction that downstream code can depend on.
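To make "fail fast and clearly" concrete, here is a standard-library sketch of a strict parser for model output. It swaps Pydantic for plain checks so it runs without dependencies, and it tolerates one common failure mode: JSON wrapped in a Markdown code fence. `InvalidModelOutput`, `Classification`, and `parse_classification` are hypothetical names.

```python
import json
from dataclasses import dataclass

FENCE = "`" * 3  # a Markdown code fence marker
CATEGORIES = ("billing", "technical", "account", "general")
URGENCIES = ("high", "medium", "low")
FIELDS = {"category", "urgency", "requires_human", "summary"}

class InvalidModelOutput(ValueError):
    """Raised when model text does not match the expected shape."""

@dataclass(frozen=True)
class Classification:
    category: str
    urgency: str
    requires_human: bool
    summary: str

def strip_code_fence(raw: str) -> str:
    """Remove a surrounding Markdown fence if the model added one."""
    text = raw.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening and closing fence lines, keep what is between.
        text = "\n".join(text.splitlines()[1:-1]).strip()
    return text

def parse_classification(raw: str) -> Classification:
    """Return a validated Classification or raise InvalidModelOutput."""
    try:
        data = json.loads(strip_code_fence(raw))
    except json.JSONDecodeError as exc:
        raise InvalidModelOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != FIELDS:
        raise InvalidModelOutput(f"expected exactly the fields {sorted(FIELDS)}")
    if data["category"] not in CATEGORIES:
        raise InvalidModelOutput(f"unknown category: {data['category']!r}")
    if data["urgency"] not in URGENCIES:
        raise InvalidModelOutput(f"unknown urgency: {data['urgency']!r}")
    if not isinstance(data["requires_human"], bool):
        raise InvalidModelOutput("requires_human must be a boolean")
    if not isinstance(data["summary"], str):
        raise InvalidModelOutput("summary must be a string")
    return Classification(**data)
```

A single exception type gives the caller one thing to catch, which makes the retry-or-fallback decision explicit at the call site.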

Building Block 5: Control Flow (Deterministic Routing)

This is the pattern that made the biggest difference when I productionized the support-automation system. Instead of handing control to the LLM, I use the LLM for classification and then let deterministic code do the routing.

def handle_support_request(ticket: dict) -> str:
    """One LLM call for classification; deterministic routing after."""
    classification = classify_ticket(ticket["body"])

    # Log the classification — this is invaluable for debugging
    print(f"Category: {classification.category}")
    print(f"Urgency: {classification.urgency}")
    print(f"Requires human: {classification.requires_human}")

    # Deterministic routing — no more LLM calls for most paths
    if classification.requires_human or classification.urgency == "high":
        return escalate_to_human_queue(ticket, classification)

    if classification.category == "billing":
        return handle_billing_ticket(ticket, classification)

    if classification.category == "technical":
        return handle_technical_ticket(ticket, classification)

    return generate_general_response(ticket, classification)

The key insight: the LLM's job is to understand the input, not to drive the execution. Once the intent is classified (and, if your schema carries a confidence score, that score clears your threshold), the rest is a state machine.

In the SDLC harness, this pattern runs end-to-end: each pipeline stage classifies what the previous agent produced (PASS / FAIL / PARTIAL), then routes deterministically — implement, test, review, document, or retry. No agent decides its own next step.
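One way to picture that routing is as a pure function from (stage, verdict, retry count) to the next step. The sketch below is an assumption-laden illustration, not the harness's real rules: the stage order, the retry limit, the choice to restart at `implement` on FAIL, and the `escalate` and `done` states are all invented for the example.

```python
from enum import Enum

class Verdict(Enum):
    PASS = "PASS"
    FAIL = "FAIL"
    PARTIAL = "PARTIAL"

# Assumed stage order and retry limit, for illustration only.
STAGES = ["implement", "test", "review", "document"]
MAX_RETRIES = 2

def next_step(stage: str, verdict: Verdict, retries: int) -> tuple[str, int]:
    """Return (next_stage, retry_count). A pure function; no model involved."""
    if verdict is Verdict.PASS:
        index = STAGES.index(stage)
        if index + 1 == len(STAGES):
            return "done", 0
        # Moving forward resets the retry counter.
        return STAGES[index + 1], 0
    if retries < MAX_RETRIES:
        # Assumption: FAIL goes back to implement, PARTIAL reruns the stage.
        target = "implement" if verdict is Verdict.FAIL else stage
        return target, retries + 1
    # Retry budget spent: hand the task to a human.
    return "escalate", retries
```

Because the function has no side effects, every transition can be unit-tested and logged, and the model's only contribution is the verdict itself.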

Building Block 6: Recovery (Error Handling)

Production agents fail. APIs time out, rate limits hit, models return malformed output. The question isn't whether failures happen — it's whether your system degrades gracefully.

The Python orchestration framework uses a circuit-breaker pattern combined with exponential backoff, so individual task failures don't cascade. The retry and fallback half looks like this:

import time
import logging
from typing import Optional

def resilient_llm_call(
    prompt: str,
    max_retries: int = 3,
    backoff_base: float = 2.0
) -> Optional[str]:
    """Production-ready call with retry and graceful fallback."""
    for attempt in range(max_retries):
        try:
            message = client.messages.create(
                model="claude-haiku-4-5-20251001",
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}]
            )
            result = message.content[0].text
            if not result or len(result.strip()) < 10:
                raise ValueError("Response too short — likely an error state")
            return result

        except Exception as e:
            logging.warning(
                f"LLM call failed (attempt {attempt + 1}/{max_retries}): {e}"
            )
            if attempt < max_retries - 1:
                time.sleep(backoff_base ** attempt)
            else:
                logging.error(f"All {max_retries} attempts exhausted")
                return None

    return None

def process_with_fallback(input_data: dict) -> str:
    """Graceful degradation when the LLM path fails."""
    ai_result = resilient_llm_call(build_prompt(input_data))

    if ai_result:
        return ai_result

    # Rule-based fallback — always have one before you ship
    logging.info("Falling back to rule-based processing")
    return rule_based_response(input_data)

Warning:
Build the fallback path before you build the AI path. If you can't describe what your system does when the LLM fails, you're not ready to ship it.
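The retry code above covers backoff but not the circuit breaker. A minimal sketch of one follows, assuming a simple consecutive-failure threshold and a fixed cooldown. The class, its parameters, and the injectable `clock` are hypothetical and not taken from the framework described here.

```python
import time
from typing import Callable, Optional

class CircuitBreaker:
    """Stop calling a failing dependency until a cooldown has passed."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None  # None means closed

    def allow(self) -> bool:
        """True if a call may go through right now."""
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open state).
        return self.clock() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()

    def call(self, fn: Callable[[], str], fallback: Callable[[], str]) -> str:
        """Run fn through the breaker; use fallback when open or on error."""
        if not self.allow():
            return fallback()
        try:
            result = fn()
        except Exception:
            self.record_failure()
            return fallback()
        self.record_success()
        return result
```

Retries protect a single request; the breaker protects everything queued behind it by skipping the slow, failing call entirely while the dependency is down.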

Building Block 7: Human Oversight

Not every decision should be fully automated. For operations with significant consequences — sending a message to a customer, modifying critical data, triggering an irreversible action — a human checkpoint is worth the latency.

The SDLC harness implements this at the review stage: after an agent implements and tests a task, a review step evaluates the result against acceptance criteria. Clean passes continue automatically; ambiguous or failed results either retry (automated) or escalate to a human. Automation handles scale; humans handle accountability.

from enum import Enum
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"

@dataclass
class ApprovalRequest:
    id: str
    content: str
    context: dict
    status: ApprovalStatus = ApprovalStatus.PENDING
    created_at: datetime = field(default_factory=datetime.now)

def generate_with_approval(
    prompt: str,
    require_approval: bool = True
) -> Optional[str]:
    """Generate content with an optional human approval gate."""
    draft = resilient_llm_call(prompt)

    if not draft:
        return None

    if not require_approval:
        return draft

    # In production: store the request, notify via webhook/Slack, await response
    request = ApprovalRequest(
        id=generate_id(),
        content=draft,
        context={"prompt": prompt}
    )
    store_approval_request(request)
    notify_approver(request)

    return poll_for_approval(request.id, timeout_seconds=300)

In the support-automation system, high-urgency escalations feed into a human review queue. The agent drafts a response; a team member approves or edits before it sends. The automation provides velocity; the human provides accountability where it matters.
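The waiting half of an approval gate can be sketched as a polling loop that fails closed: if nobody decides in time, nothing is sent. This is an illustration under assumptions. `wait_for_decision`, the `fetch_status` callable, and the status strings are hypothetical, and a production system would more likely react to a webhook than poll.

```python
import time
from typing import Callable, Optional

def wait_for_decision(
    request_id: str,
    fetch_status: Callable[[str], tuple[str, Optional[str]]],
    timeout_seconds: float = 300.0,
    interval_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll until the request is approved, rejected, or times out.

    fetch_status returns (status, content), where content is the text
    the reviewer approved, possibly edited. Returns that text on
    approval and None on rejection or timeout.
    """
    deadline = clock() + timeout_seconds
    while True:
        status, content = fetch_status(request_id)
        if status == "approved":
            return content
        if status == "rejected":
            return None
        if clock() >= deadline:
            # Fail closed: an unreviewed draft is never sent.
            return None
        sleep(interval_seconds)
```

Returning the reviewer's version of the content, rather than the original draft, is what lets a team member edit before the message goes out.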

Putting It All Together

These building blocks combine into a complete production loop:

class ReliableAgent:
    def __init__(self):
        self.conversation_memory: dict[str, list] = {}
        self.error_counts: dict[str, int] = {}

    def process(self, session_id: str, input_data: dict) -> str:
        try:
            # Block 2: Load conversation context
            history = self.conversation_memory.get(session_id, [])

            # Block 4 + 5: Classify with structured output, route deterministically
            classification = classify_ticket(input_data["body"])

            if classification.requires_human:
                # Block 7: Human oversight for critical paths
                return self.escalate_with_draft(input_data, classification)

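            if classification.urgency == "high":
                # Block 3: Pre-fetch account data in code rather than via a tool call
                account_info = fetch_account_details(input_data["account_id"])
                response = self.generate_with_context(
                    input_data, classification, account_info, history
                )
            else:
                # Block 1: Direct LLM call for routine responses
                response = resilient_llm_call(
                    build_prompt(input_data, classification, history)
                )
                if response is None:
                    # Retries exhausted: hand off to Block 6 instead of
                    # storing None in the conversation history
                    raise RuntimeError("LLM call failed after retries")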

            # Block 2: Update memory
            history.append({"role": "user", "content": input_data["body"]})
            history.append({"role": "assistant", "content": response})
            self.conversation_memory[session_id] = history[-10:]

            self.error_counts[session_id] = 0
            return response

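        except Exception as e:
            # Block 6: Recovery (log first so failures stay visible)
            logging.warning(f"Agent failed for session {session_id}: {e}")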
            count = self.error_counts.get(session_id, 0) + 1
            self.error_counts[session_id] = count

            if count > 5:
                return "System temporarily unavailable. A team member will follow up."

            return rule_based_response(input_data)

What This Architecture Makes Possible

The support-automation system I built processes a large volume of daily requests with this pattern. What I can tell you from operating it: the deterministic routing layer handles the majority of requests without any LLM call at all. Only the ones requiring genuine language understanding — nuanced complaints, ambiguous multi-part questions, edge cases the rules don't cover — involve a model call.

That constraint — use LLM calls only where they earn their cost — forced better system design across the board. The classification layer got sharper because I had to define what "requires reasoning" actually means. The fallback paths got more complete because I had to think about every failure mode. The observability layer became essential because I needed to see which paths were being taken and why.
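Seeing which paths are taken, and why, can start as one structured log line per routing decision plus a counter. The sketch below uses only the standard library; the logger name, the field names, and the `record_route` and `llm_share` helpers are assumptions for illustration.

```python
import json
import logging
from collections import Counter

logger = logging.getLogger("agent.routing")  # hypothetical logger name
route_counts: Counter = Counter()

def record_route(ticket_id: str, route: str, reason: str, used_llm: bool) -> str:
    """Count the decision and emit it as one JSON log line."""
    route_counts[(route, used_llm)] += 1
    line = json.dumps(
        {"ticket_id": ticket_id, "route": route,
         "reason": reason, "used_llm": used_llm},
        sort_keys=True,
    )
    logger.info(line)
    return line

def llm_share() -> float:
    """Fraction of recorded decisions that involved a model call."""
    total = sum(route_counts.values())
    if total == 0:
        return 0.0
    with_llm = sum(n for (_, used), n in route_counts.items() if used)
    return with_llm / total
```

Tracking the share of requests that reach a model turns "use LLM calls only where they earn their cost" into a number that can be watched over time.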

The SDLC agentic harness runs the same way. Each pipeline stage is mostly deterministic orchestration code. The agents are scoped to specific reasoning tasks: write implementation, evaluate test results, review against criteria. Everything else is file I/O, git operations, and control flow.

The Seven Building Blocks

To build a reliable production agent, you need all seven:

Intelligence: Direct API calls to the language model — scoped, purposeful, minimal
Memory: Explicit state management — conversation history, session context, persistent storage
Tools: External integrations — used sparingly, only where deterministic pre-fetching won't do
Validation: Structured output with schema enforcement — convert text generation into data extraction
Control Flow: LLM for classification, deterministic code for routing — never let the model drive execution
Recovery: Retry logic, exponential backoff, graceful degradation — always have a fallback
Human Oversight: Approval checkpoints for high-consequence operations — automation handles scale, humans handle accountability

You don't need a framework. You need these patterns, implemented cleanly, with the Anthropic SDK directly.

Start with one workflow. Map the classification logic. Build the deterministic routing. Add an LLM call only where you genuinely can't route without it. The framework will emerge from the patterns — and it will be one you understand completely.

Tip:
If you're evaluating frameworks right now, pause. Pick one real workflow in your system and write it using just these seven building blocks with direct SDK calls. The clarity you get from that exercise will shape every architectural decision after it.

If this was useful, I write about building production AI and agentic systems at learn-agentic-ai.com — including hands-on learning paths available in both English and Brazilian Portuguese. Come build something real.
