Imagine you’re building a personal research assistant. Its job is to ingest hundreds of academic PDFs, learn your unique writing style, and eventually draft comprehensive reports for you.
When you first launch it, you connect it to a bleeding-edge cloud model like Claude 3.5 Sonnet or GPT-4o via OpenRouter. It works beautifully. But after a month of heavy use, your API bill arrives—and it looks like a mortgage payment.
You decide to pivot. You want to move the heavy, repetitive daily query load to a local, quantized Llama 3 checkpoint running on a spare GPU in your office. But there is a catch: you don’t want your agent to lose its "soul." You want it to retain its persistent memory—the facts it has painstakingly learned about your project preferences, your past instructions, and your style—across this massive hardware migration. Furthermore, you want it to be smart enough to autonomously route simple tasks to your cheap local model while reserving the expensive cloud model for complex, high-stakes reasoning.
This is the exact point where most naive AI agent implementations break. They fail because they are built as monoliths, tightly coupled to a single LLM provider’s SDK.
To build truly resilient, cost-effective, and autonomous AI systems, we must decouple the agent's cognitive loop from the specific engine providing that cognition. We need to treat the LLM not as the application itself, but as a pluggable utility.
Welcome to the architecture of the Hermes Agent. In this guide, we will explore the design patterns, configuration strategies, and code structures required to build an "Operating System for LLMs"—where the soul persists, the brain executes, and the framework seamlessly connects them.
(The concepts and code demonstrated here are drawn from my ebook Hermes Agent, The Self-Evolving AI Workforce)
The Core Abstraction: The Brain as a Pluggable Service
To build a production-grade agent, we must first undergo a massive paradigm shift. An AI agent is not a single, monolithic intelligence. It is a structured software system where the LLM acts as the central processing unit.
In traditional operating systems, the kernel abstracts the underlying hardware. An application developer doesn’t write code for a specific Western Digital SSD or an Intel network card; they write to a standardized system call API (open(), read(), write()). The device driver acts as the adapter.
In a modern agent architecture, we apply this exact principle:
- The Framework is the kernel.
- The Tools are the filesystem.
- The Memory (volatile state and persistent SQLite/PostgreSQL storage) is the RAM and disk.
- The LLM Provider is the CPU.
How do we abstract the "thinking unit" so that our core execution loop—the executive function of the agent—can run identically whether it is powered by Claude Opus in a supercluster or a quantized Qwen model running on a developer's laptop in a coffee shop?
We solve this using three classic software engineering design patterns:
- The Application Factory Pattern (to build and configure our API clients).
- The Strategy Pattern (to dynamically select our providers).
- The Interface Segregation Principle (to define clean contracts between the agent loop and the model).
Pattern 1: The Application Factory (Forging the Brain)
The Application Factory Pattern is a creational pattern where a dedicated component is responsible for instantiating and configuring a complex object. Instead of hardcoding API clients throughout our codebase, we centralize this creation.
The framework reads a configuration file (config.yaml), determines which provider is requested, and dynamically stamps out the correct API client wrapper.
Let’s look at how this is structured within the core AIAgent class:
# run_agent.py - The Application Factory Pattern in Action
import os
import openai
import anthropic
from typing import Dict, Any
class AIAgent:
def __init__(
self,
base_url: str = None,
api_key: str = None,
provider: str = "auto",
api_mode: str = None,
model: str = "",
timeout: float = 30.0,
**kwargs
):
self.provider = provider
self.model = model
self.timeout = timeout
# The Factory dynamically resolves the client based on configuration
self.client = self._bootstrap_client(base_url, api_key)
def _bootstrap_client(self, base_url: str, api_key: str) -> Any:
"""
The factory method responsible for instantiating the correct
SDK client based on the selected provider strategy.
"""
resolved_provider = self.provider.lower()
if resolved_provider == "anthropic":
# Instantiate direct Anthropic SDK client
return anthropic.Anthropic(
api_key=api_key or os.getenv("ANTHROPIC_API_KEY"),
timeout=self.timeout
)
elif resolved_provider in ["openai", "openrouter", "lmstudio", "ollama"]:
# These providers all support the OpenAI-compatible specification
target_url = base_url or self._resolve_default_url(resolved_provider)
target_key = api_key or self._resolve_default_key(resolved_provider)
return openai.OpenAI(
base_url=target_url,
api_key=target_key,
timeout=self.timeout
)
else:
raise ValueError(f"Unsupported provider: {self.provider}")
def _resolve_default_url(self, provider: str) -> str:
defaults = {
"openai": "https://api.openai.com/v1",
"openrouter": "https://openrouter.ai/api/v1",
"ollama": "http://localhost:11434/v1",
"lmstudio": "http://localhost:1234/v1"
}
return defaults.get(provider, "")
def _resolve_default_key(self, provider: str) -> str:
keys = {
"openai": os.getenv("OPENAI_API_KEY"),
"openrouter": os.getenv("OPENROUTER_API_KEY"),
"ollama": "ollama", # Local endpoints rarely require valid keys
"lmstudio": "lmstudio"
}
return keys.get(provider, "")
Why This Matters
By isolating client creation inside this factory method, the rest of our agent’s execution loop (the run_conversation cycle) remains completely pristine. It doesn't know—and doesn't care—if it is talking to a local server or a cloud cluster. It simply calls a unified interface.
Pattern 2: The Strategy Pattern (Dynamic Provider Selection)
While the Application Factory builds our client, the Strategy Pattern determines which client configuration to build at any given moment.
Instead of forcing the developer to hardcode their environment, we can define a declarative configuration file (config.yaml) that outlines our routing strategies.
# config.yaml - The Neuroanatomy Blueprint
model:
provider: "auto" # The strategy selector
default: "anthropic/claude-3-5-sonnet"
# Fallback chain strategy for high-availability environments
provider_routing:
sort: "throughput"
order: ["anthropic", "openrouter", "ollama"]
ignore: ["bad_provider_endpoint"]
# Specialized auxiliary brains
auxiliary:
vision:
provider: "openai"
model: "gpt-4o-mini"
web_extract:
provider: "ollama"
model: "mistral"
session_compression:
provider: "ollama"
model: "qwen2.5-coder:7b"
Let's look at how the "auto" strategy functions. It acts as a master strategist, scanning the runtime environment to discover available credentials and local services:
# config_resolver.py - Implementing the "Auto" Strategy
import os
import socket
def detect_best_available_strategy() -> Dict[str, str]:
"""
Scans environment variables and local ports to automatically
select the most capable, cost-effective inference strategy.
"""
# 1. Check for premium cloud credentials first
if os.getenv("ANTHROPIC_API_KEY"):
return {"provider": "anthropic", "model": "claude-3-5-sonnet-20241022"}
if os.getenv("OPENAI_API_KEY"):
return {"provider": "openai", "model": "gpt-4o"}
if os.getenv("OPENROUTER_API_KEY"):
return {"provider": "openrouter", "model": "meta-llama/llama-3.1-70b-instruct"}
# 2. Fallback: Check if a local Ollama instance is running
if is_port_open("localhost", 11434):
return {"provider": "ollama", "model": "llama3:latest"}
# 3. Last resort fallback
raise RuntimeError("No viable LLM provider strategy could be auto-detected.")
def is_port_open(host: str, port: int) -> bool:
try:
with socket.create_connection((host, port), timeout=1.0):
return True
except OSError:
return False
This is Resilience Engineering at the architectural level. If you run this code on an expensive cloud GPU instance, it leverages high-performance APIs. If you run it offline on a train with no internet connection, it gracefully degrades to your local Ollama instance without throwing a single runtime exception.
The Signal Chain: A Recording Studio Analogy
To visualize how these patterns interact, let's use an analogy from professional music production.
+-------------------------------------------------------------+
| THE MIXING CONSOLE |
| (The AIAgent Loop) |
+------------------------------+------------------------------+
|
v
+-------------------------------------------------------------+
| THE PATCHBAY |
| (config.yaml) |
+------------------------------+------------------------------+
|
v
+-------------------------------------------------------------+
| THE AUTO-PATCHER |
| (Strategy: "auto") |
+------------------------------+------------------------------+
|
+------------------+------------------+
| (Cloud Active) | (Offline Mode)
v v
+-----------------------+ +-----------------------+
| PREMIUM PRE-AMP | | LOCAL PRE-AMP |
| (Anthropic SDK) | | (Ollama/Llama.cpp) |
+-----------------------+ +-----------------------+
- The Framework (
AIAgent): This is the mixing console. It routes signals, applies effects (tools), and writes the final output to tracks (persistent memory). - The Brain (LLM): This is the vocalist. It provides the raw creative signal.
- The Provider (Anthropic, Ollama, OpenRouter): This is the microphone pre-amplifier. The Anthropic pre-amp colors the sound with high-end clarity (deep reasoning), while the local Ollama pre-amp is rugged, cheap, and runs entirely on local power.
- The Configuration (
config.yaml): This is the patchbay on the studio wall. It determines exactly how signals are routed between components. - The Strategy Pattern (
provider: "auto"): This is the studio's automated routing system. The console scans the patchbay: "I see a high-end microphone plugged into input 1 (Anthropic API Key). I will route all primary vocals there. If that mic fails, I will instantly route the signal to the local dynamic mic on input 2."
Multi-Agent Concurrency and the Subagent Factory
The architecture becomes incredibly powerful—and complex—when we introduce concurrency.
In advanced multi-agent workflows, a parent agent might need to spawn multiple child agents (subagents) to work on tasks in parallel. For example, a lead research agent might spawn three subagents to summarize three different documents simultaneously.
What happens if the parent agent is running on Claude 3.5 Sonnet (expensive, slow, highly analytical) and the subagents are performing simple text-extraction tasks that could easily be handled by a cheap local model?
We must implement Context Propagation and a Subagent Factory.
To prevent asynchronous tasks from stepping on each other's toes, we use thread-local storage or asynchronous context variables (contextvars) to isolate each agent's execution state.
# model_tools.py - Thread-Local Event Loop Isolation
import asyncio
import threading
# Thread-local storage to ensure each worker thread has its own event loop
_worker_thread_local = threading.local()
def get_worker_loop() -> asyncio.AbstractEventLoop:
"""
Retrieves or creates a unique event loop isolated to the current execution thread.
This prevents race conditions when parent and child agents execute concurrently.
"""
loop = getattr(_worker_thread_local, 'loop', None)
if loop is None or loop.is_closed():
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
_worker_thread_local.loop = loop
return loop
When the parent agent calls a delegation tool to spawn a subagent, the framework snapshots the parent's configuration, applies any specific overrides defined in config.yaml for subagents, and passes that new configuration block to a fresh factory invocation:
# delegation_tool.py - Spawning a Subagent with Isolated Brains
def delegate_task(task_description: str, subagent_type: str) -> str:
"""
Spawns a child agent with specialized configuration to execute a subtask.
"""
# 1. Load system-wide configuration
base_config = load_system_config()
# 2. Apply specialized overrides for subagents
subagent_model = base_config.get("delegation", {}).get("model", "ollama/llama3")
subagent_provider = base_config.get("delegation", {}).get("provider", "ollama")
# 3. Use the Subagent Factory to spin up an isolated agent
loop = get_worker_loop()
child_agent = AIAgent(
provider=subagent_provider,
model=subagent_model,
system_instruction=f"You are a specialized subagent executing: {subagent_type}"
)
# 4. Run the child agent's cognitive loop in isolation
result = loop.run_until_complete(child_agent.think(task_description))
return result
This architecture ensures Neural Sovereignty. The parent agent's context window is not cluttered with the raw, low-level data processed by the child. The child's execution latency does not block the parent's main thread, and API costs are kept perfectly optimized.
Heterogeneous Intelligence Optimization: The Closed Learning Loop
A truly autonomous agent does not just execute; it learns from its environment over time. This is where we close the loop.
However, high-quality learning requires different cognitive profiles for different tasks. If you use a massive, high-latency model like Claude Opus for simple tasks like compressing chat history or parsing web HTML, you are burning money. Conversely, if you use a lightweight local model to generate your core long-term memory entries, your memory will quickly become corrupted with hallucinations and logical errors.
We solve this with Heterogeneous Intelligence Optimization—mapping specific tasks to the exact tier of intelligence they require:
+-----------------------------------+
| COGNITIVE ROUTER |
| (Inversion of Control) |
+-----------------+-----------------+
|
+--------------------------+--------------------------+
| | |
v v v
+------------------+ +------------------+ +------------------+
| CORTEX BRAIN | | CEREBELLUM BRAIN | | HIPPOCAMPUS |
| (Claude Sonnet) | | (Gemini Flash) | | (Local Qwen) |
| | | | | |
| - Executive Fn | | - Vision Tasks | | - Memory Comp. |
| - Code Writing | | - Web Scraping | | - Summarization |
+------------------+ +------------------+ +------------------+
- The Cortex (Executive Function): Powered by a frontier model (e.g., Claude 3.5 Sonnet). It handles high-level reasoning, orchestrates tool calls, and decides what actions to take.
- The Cerebellum (Fine Motor Control & Vision): Powered by a fast, multimodal model (e.g., Gemini 1.5 Flash). It handles web extraction, processes images/screenshots, and executes high-speed parsing.
- The Hippocampus (Memory Consolidation): Powered by a local, highly-efficient model (e.g., Qwen 2.5 Coder 7B running locally). It runs background cron jobs to compress old conversation logs, summarize daily context, and write clean markdown entries to the agent's persistent memory store (
MEMORY.md).
Here is how we represent this complete, production-ready neuroanatomy in our configuration layer, complete with strict Performance Budgets:
# production-config.yaml - Production Neuroanatomy & Performance Budgets
agent:
name: "Hermes-Arch"
version: "2.4.0"
# Primary Cognitive Engine (The Cortex)
cortex:
provider: "anthropic"
model: "claude-3-5-sonnet-20241022"
temperature: 0.2
max_tokens: 4096
timeout_policies:
request_timeout_seconds: 30
retry_attempts: 3
backoff_factor: 2.0
# Secondary Specialized Engines (The Cerebellum)
cerebellum:
vision:
provider: "openai"
model: "gpt-4o-mini"
request_timeout_seconds: 15
web_extraction:
provider: "openrouter"
model: "google/gemini-flash-1.5"
request_timeout_seconds: 10
# Background Memory Engines (The Hippocampus)
hippocampus:
context_compression:
provider: "ollama"
model: "qwen2.5-coder:7b"
request_timeout_seconds: 120 # Local models need more breathing room
max_tokens: 2048
embedding_generation:
provider: "ollama"
model: "nomic-embed-text"
request_timeout_seconds: 5
Configuration as a First-Class Citizen
In this architecture, configuration is not an afterthought or a collection of magic strings. It is a typed, hierarchical structure that represents the biological blueprint of your agent.
By defining strict performance budgets (like request_timeout_seconds: 120 for local cold-starts versus request_timeout_seconds: 15 for rapid cloud APIs), the agent's scheduler knows exactly when to assume a provider is dead, trigger a fallback strategy, save the current context safely to the database, and inform the user. It never crashes blindly.
Conclusion: The Engineering of Emergence
As technical builders, our role is shifting. We are no longer just programmers writing procedural, line-by-line instructions. When we build autonomous systems, we are System Architects of Cognitive Networks.
By decoupling the agent's soul (its persistent memory) from its brain (the LLM provider), we build systems that are:
- Cost-Resilient: Automatically routing low-level tasks to free local silicon.
- Infrastructure-Agile: Swapping underlying models instantly with a two-line config change.
- Highly Available: Automatically falling back to alternative cloud providers or local setups if an API goes down.
Treat your configuration with the respect an architect gives to the foundation of a cathedral. When you design your provider layer with clean abstractions, you aren't just writing code—you are architecting a mind.
Let's Discuss
- The Cost vs. Latency Tradeoff: Have you experimented with offloading background tasks (like summarization or embedding generation) to local models like Llama 3 or Qwen? What was the impact on your agent's overall latency and execution cost?
- Handling Provider Failures: In a production multi-agent system, how do you handle state recovery when a primary provider (e.g., Anthropic) times out halfway through a complex, multi-step agent trajectory? Do you prefer state-rollback or dynamic mid-flight routing?
The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the ebook Hermes Agent, The Self-Evolving AI Workforce: details link, you can find also my programming ebooks with AI here: Programming & AI eBooks.
Top comments (0)