DEV Community: Michal Szalinski

Why Your AI Agent Works in Dev and Breaks in Prod

Michal Szalinski — Mon, 08 Jun 2026 00:02:43 +0000

Your agent nailed every test case. You shipped it. Within 48 hours, users report hallucinated outputs, silently dropped tool calls, and responses that bear zero resemblance to what worked on your machine. You reload the same prompt locally. It works perfectly. Welcome to the most predictable failure mode in AI engineering: the dev-to-prod gap.

This is Crucible C01. We dissect the five failure modes that kill agents in production and give you the tools to catch them before your users do.

The Idea (60 Seconds)

Developers test agents in idealized conditions: deterministic inputs, warm context windows, generous API latency budgets, and sequential tool calls. Production exposes the opposite environment: cold starts strip context, rate limits compress timing, and parallel calls introduce race conditions. The agent that performed flawlessly at temperature 0 on a 2k-token context window collapses at temperature 0.7 on an 8k-token window.

The five failure modes are temperature drift, context window overflow, silent API errors, prompt drift, and race conditions in parallel tool calls. Each one has a detection pattern and a fix, and this article delivers both plus the CLI tool to automate the detection.

Why This Matters

AI agent failures differ from traditional software failures in one critical way: they are stochastic. A web API either returns 200 or 500. An AI agent returns something that looks plausible 90% of the time and is catastrophically wrong 10% of the time. That 10% is invisible in manual testing and devastating in production.

The economics compound fast because every failed agent interaction wastes tokens, and wasted tokens cost money. At scale, a subtly broken agent burns budget faster than a working one because it retries, loops, and rephrases instead of succeeding. A single temperature drift bug can double your API spend.

Reliability is the differentiator. The market is flooding with AI wrappers. The ones that survive will be the ones that work consistently, under load, with real user inputs. Crucible exists to make your agent one of the survivors.

Walkthrough

Mode 1: Temperature Drift

Detection pattern: Run the same prompt at your dev temperature and your prod temperature. Hash the outputs. Hash divergence signals drift.
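A minimal sketch of that hash comparison follows. It assumes you can wrap your model call in a function that takes a prompt and a temperature; the `generate` callable and both function names are illustrative, not part of agentprobe.

```python
import hashlib
from typing import Callable, Iterable


def output_hash(text: str) -> str:
    """Hash a whitespace-normalized output so formatting noise does not count as drift."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drift_rate(
    generate: Callable[[str, float], str],  # hypothetical wrapper around your model call
    prompts: Iterable[str],
    dev_temperature: float,
    prod_temperature: float,
) -> float:
    """Return the fraction of prompts whose output hash differs between the two settings."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    diverged = sum(
        output_hash(generate(p, dev_temperature)) != output_hash(generate(p, prod_temperature))
        for p in prompts
    )
    return diverged / len(prompts)
```

A rate near zero suggests the two environments behave alike on your test prompts; anything higher tells you which setting to pin first.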

The fix: Pin temperature to 0 in both environments. If you need sampling variance for creativity, isolate it to a single generation step and wrap the rest of the pipeline in deterministic calls. Document the temperature in your agent config file. Treat it like a database connection string: an infrastructure parameter, always explicit, zero room for runtime guesses.

Mode 2: Context Window Overflow

Detection pattern: Instrument your agent to log cumulative token count per conversation. When it crosses 75% of your model’s context limit, flag the conversation. Watch for truncated outputs, repeated phrases, or instructions that the model appears to have forgotten.
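One way to sketch the 75% flag is below. The `TokenBudget` class is hypothetical, and it assumes you feed it the tokens newly added on each turn rather than the provider's cumulative prompt count, which would double-count history.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    context_limit: int           # model context window in tokens (your model's value)
    warn_fraction: float = 0.75  # flag threshold from the detection pattern
    used: int = 0                # running total for this conversation

    def record(self, new_tokens: int) -> bool:
        """Add the tokens introduced this turn; return True once the conversation should be flagged."""
        self.used += new_tokens
        return self.used >= self.context_limit * self.warn_fraction
```

Call `record` once per turn and log the conversation ID whenever it returns True.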

The fix: Implement a context compaction strategy. Summarize older turns and replace them with a compressed summary token block. Set hard token budgets per turn and per conversation. When the budget is exhausted, either summarize or start a fresh context window with a recovery prompt that preserves the task state.

Mode 3: Silent API Errors

Detection pattern: Log every API call’s HTTP status code and response body. Count calls that return non-200 statuses. If your agent has retry logic, log whether the retry succeeded. A pattern of failed retries with continued execution signals swallowed errors.

The fix: Treat API errors as hard failures by default. Wrap every API call in a circuit breaker that halts the agent on persistent errors. Log the error, notify the orchestrator, and return a structured failure to the caller. Silent continuation on error state is the single most dangerous production behavior in any agent system.
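A bare-bones circuit breaker might look like the sketch below. The class names and the three-failure default are assumptions for illustration; a production breaker would usually add a cool-down before it lets calls through again.

```python
class CircuitOpenError(RuntimeError):
    """Raised when the breaker has tripped and further calls are refused."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.consecutive_failures = 0

    def call(self, func, *args, **kwargs):
        # Refuse to run once the failure streak reaches the limit.
        if self.consecutive_failures >= self.max_failures:
            raise CircuitOpenError(
                f"halted after {self.consecutive_failures} consecutive failures"
            )
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            raise  # surface the error to the caller instead of swallowing it
        self.consecutive_failures = 0  # any success resets the streak
        return result
```

The key property is that every failure is re-raised: the agent cannot keep executing on an error state without someone upstream deciding to let it.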

Mode 4: Prompt Drift

Detection pattern: Version your system prompts. On every agent run, hash the active prompt and compare it to the canonical hash. When outputs diverge between runs, diff the prompt versions first.
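The hash check can be sketched in a few lines. Both function names are illustrative, and the sketch assumes the canonical hash is stored next to the prompt in version control.

```python
import hashlib


def prompt_fingerprint(prompt: str) -> str:
    """SHA-256 of the exact prompt text; any edit, even whitespace, changes it."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def assert_prompt_unchanged(active_prompt: str, canonical_hash: str) -> None:
    """Fail the run loudly when the active prompt no longer matches the reviewed version."""
    actual = prompt_fingerprint(active_prompt)
    if actual != canonical_hash:
        raise RuntimeError(
            f"prompt drift: expected {canonical_hash[:12]}, got {actual[:12]}"
        )
```

Run the check at agent startup so a drifted prompt stops the run before it produces output.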

The fix: Lock system prompts in version control. Deploy prompt changes through the same review pipeline as code changes. Run regression tests: execute a benchmark suite against the old prompt, then the new prompt, and diff the results. Any change that shifts more than 10% of benchmark outputs requires manual review.

Mode 5: Race Conditions in Parallel Tool Calls

Detection pattern: Enable request-order logging. When your agent dispatches parallel calls, log the dispatch order and the completion order. Any inversion signals a potential race condition.

The fix: Avoid parallel tool calls unless you can guarantee idempotent, order-independent results. When parallelism is necessary, implement a reconciliation step that sorts responses by a sequence token before the agent processes them. Better yet, use a deterministic execution model: serialize all tool calls, accept the latency cost, and gain correctness.
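Here is a rough sketch of sequence-token reconciliation with asyncio. The helper names are hypothetical. Note that `asyncio.gather` already returns results in argument order; the explicit sort is there to show the pattern for transports that do not give you that guarantee.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Tuple


async def run_parallel_ordered(
    calls: List[Callable[[], Awaitable[Any]]],
) -> Tuple[List[Any], List[int]]:
    """Run tool calls concurrently; return results in dispatch order plus the completion order."""
    completion_order: List[int] = []

    async def tagged(seq: int, call: Callable[[], Awaitable[Any]]):
        result = await call()
        completion_order.append(seq)  # log when each call actually finished
        return seq, result

    pairs = await asyncio.gather(*(tagged(i, c) for i, c in enumerate(calls)))
    # Reconcile: sort by the sequence token before the agent sees anything.
    ordered = [result for _, result in sorted(pairs, key=lambda pair: pair[0])]
    return ordered, completion_order


def has_inversion(completion_order: List[int]) -> bool:
    """True when any call finished before one that was dispatched earlier."""
    return completion_order != sorted(completion_order)
```

Log `completion_order` on every run; an inversion is harmless here because the results are re-sorted, but it tells you where a race would bite if they were not.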

The Prompt Toolkit

1. Agent Failure Analyst Prompt

<role>
You are an Agent Failure Analyst for the Crucible diagnostic framework.
</role>

<input>
  <agent_architecture>
    {{AGENT_ARCHITECTURE_DESCRIPTION}}
  </agent_architecture>
  <failure_scenario>
    {{FAILURE_SCENARIO_DESCRIPTION}}
  </failure_scenario>
</input>

<task>
Analyze the agent architecture against the failure scenario. Identify which of the six failure modes are present or likely:

1. temperature_drift: Dev and prod temperature settings diverge.
2. context_overflow: Token count exceeds model context limit.
3. token_limit: Response truncation due to max_tokens ceiling.
4. prompt_drift: System prompt edits propagate uncontrolled cascading changes.
5. api_latency: Timeouts or rate limits cause silent failures.
6. race_condition: Parallel tool calls return out of order.

For each identified failure mode, provide:
- evidence: Specific architectural features or scenario details that indicate this mode.
- severity: critical, high, medium, low.
- reproduction_steps: Exact sequence to trigger the failure.
- fix_strategy: Concrete architectural change to eliminate the failure mode.
</task>

<output_format>
<analysis>
  <failure_mode name="..." present="true|false">
    <evidence>...</evidence>
    <severity>...</severity>
    <reproduction_steps>
      <step order="1">...</step>
      <step order="2">...</step>
    </reproduction_steps>
    <fix_strategy>...</fix_strategy>
  </failure_mode>
  <summary>...</summary>
  <priority_fixes>
    <fix order="1">...</fix>
    <fix order="2">...</fix>
  </priority_fixes>
</analysis>
</output_format>

2. agentprobe CLI

The agentprobe command-line tool scans your agent configuration for common failure modes, traces live runs with full instrumentation, diffs two runs to locate divergence points, and replays failed traces to test determinism.

Install and run:

cp agentprobe.py /usr/local/bin/agentprobe
chmod +x /usr/local/bin/agentprobe
agentprobe scan --config agent_config.json
agentprobe trace --config agent_config.json --prompt "Analyze the Q3 report"
agentprobe diff --run-a trace_001.json --run-b trace_002.json
agentprobe replay --trace trace_001.json

Download: agentprobe.py

Caveats

The five failure modes cover the most common production breakdowns, yet they remain an incomplete set. Model-specific quirks, provider-specific rate limit architectures, and custom orchestration logic introduce failure modes unique to your stack. Treat these five as your baseline scan, then extend the detection patterns to match your architecture.

The agentprobe tool instruments API calls and logs token counts, yet it relies on the provider reporting accurate token usage. Some providers approximate. Cross-check token counts against your own tokenizer when precision matters.

Determinism is a spectrum, not an all-or-nothing property. Temperature 0 reduces variance dramatically, yet even at temperature 0, some models exhibit minor non-determinism due to floating-point accumulation differences across hardware. Replay results that match 99% of tokens are as good as deterministic for practical purposes.

Philosophy

The Crucible stance: test in conditions that match production, or accept production failures as inevitable. Every shortcut in your testing pipeline compounds into a failure in your production pipeline. Agents are stochastic systems. Stochastic systems demand systematic testing, systematic observation, and systematic repair.

The dev-to-prod gap is avoidable. It requires treating your agent’s non-determinism as a first-class engineering concern, designing for the worst case from day one, and instrumenting everything. The tools in this article automate the detection. The fixes are architectural. The discipline is yours.

Crucible C01 is the first article in the Crucible Series by ArchonHQ. Each article dissects a specific AI agent failure mode and delivers the prompts and tools to eliminate it. Subscribe for full access to the series.


This article was originally published on ArchonHQ — practical AI that wins every day. Subscribe free to get new articles in your inbox.

Build Your Own MCP Server from Scratch

Michal Szalinski — Fri, 05 Jun 2026 00:04:01 +0000

Every AI agent ships with the same bottleneck: it can only reason over what it can reach. MCP servers dissolve that boundary. They expose tools, resources, and prompts to any compliant client over a JSON-RPC wire format so lean you can implement it in an afternoon. Yet most developers grab a framework, copy a template, and ship something they can barely debug. Forge starts differently. You will build an MCP server from the bare protocol up, understand every byte on the wire, and gain the mental model that makes every future server trivial.

The Idea (60 Seconds)

MCP is a JSON-RPC 2.0 protocol. A client sends a request. Your server returns a response. Three request types power the core loop:

initialize: handshake. Client and server exchange capabilities.
tools/list: discovery. Server returns every tool it offers, each with a JSON Schema describing its inputs.
tools/call: execution. Client names a tool and passes arguments. Server runs the handler and returns structured content.

Transport is either stdio (JSON-RPC over stdin/stdout) or HTTP (Streamable HTTP). Stdio is the simplest place to start: read a line from stdin, parse it, dispatch, write a line to stdout.

That is the entire architecture. Everything else is error handling, schema validation, and ergonomics.

Why This Matters

MCP servers are the new APIs. Where REST gave machines endpoints, MCP gives agents tools with typed inputs and structured outputs. Every integration layer from IDE assistants to autonomous workflows converges on this protocol. The standard is young. The primitives are stable. The surface area is small enough to hold in your head all at once.

Knowing the wire format gives you three advantages frameworks obscure:

Debugging: when a tool call fails, you can read the raw JSON-RPC message and pinpoint the fault in seconds.
Portability: any language, any runtime, any transport. Write a server in Bash if you want. The protocol is the contract.
Evolution: MCP will add capabilities. Understanding the base protocol lets you adopt new features by extension instead of full rewrites.

Forge articles build on this foundation. If you understand the three core requests and the JSON-RPC envelope, every subsequent pattern is just a new handler.

ArchonHQ is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

Walkthrough

The JSON-RPC Envelope

Every message shares the same shape:

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "get_weather",
    "arguments": { "city": "Portland" }
  }
}

The response mirrors it:

{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "content": [
      { "type": "text", "text": "72°F, clear skies" }
    ]
  }
}

Errors swap result for error:

{
  "jsonrpc": "2.0",
  "id": 1,
  "error": {
    "code": -32602,
    "message": "Invalid params: missing 'city'"
  }
}

Three fields matter: jsonrpc (always "2.0"), id (correlates request to response), and method (the dispatch key).

Tool Schema Design

Each tool advertises itself through a JSON Schema object. A well-designed schema is the difference between a tool agents use and one they fumble.

{
  "name": "get_weather",
  "description": "Retrieve current weather for a given city",
  "inputSchema": {
    "type": "object",
    "properties": {
      "city": {
        "type": "string",
        "description": "City name, e.g. Portland"
      }
    },
    "required": ["city"]
  }
}

Rules for effective schemas:

Mark every required parameter in required. Agents rely on this to construct valid calls.
Add description to each property. The agent reads descriptions to decide which tool to invoke and what values to pass.
Use enum for constrained values. This prevents hallucinated inputs.
Keep schemas flat. Nested objects are valid but harder for agents to populate correctly.

The Request Lifecycle

Your server runs a loop:

Step  Action
1     Read a JSON-RPC line from stdin
2     Parse the method
3     Dispatch to the matching handler
4     Handler returns a result or raises an error
5     Serialize the response to JSON
6     Write it to stdout

In Python with asyncio:

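async def handle_message(message):
    """Dispatch one parsed JSON-RPC request and return the full response envelope."""
    # list_tools and call_tool are assumed to be defined elsewhere in server.py
    method = message.get("method")
    response = {"jsonrpc": "2.0", "id": message.get("id")}
    if method == "initialize":
        response["result"] = {"capabilities": {"tools": {}}}
    elif method == "tools/list":
        response["result"] = {"tools": list_tools()}
    elif method == "tools/call":
        response["result"] = await call_tool(message["params"])
    else:
        # Errors replace result; -32601 is the JSON-RPC "Method not found" code
        response["error"] = {"code": -32601, "message": f"Method not found: {method}"}
    return response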
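The six-step lifecycle can be sketched as a synchronous stdio loop. This is an illustration under assumptions, not the mcpbuild implementation: `serve` and the `handlers` dict are hypothetical names, handlers are plain functions that take `params`, and batch requests are ignored.

```python
import json
import sys
from typing import Any, Callable, Dict, TextIO

Handler = Callable[[Dict[str, Any]], Any]


def serve(handlers: Dict[str, Handler],
          stdin: TextIO = sys.stdin,
          stdout: TextIO = sys.stdout) -> None:
    """Read one JSON-RPC message per line, dispatch it, write one response per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue
        try:
            message = json.loads(line)
        except json.JSONDecodeError:
            # -32700 is the JSON-RPC "Parse error" code; the id is unknown here
            reply = {"jsonrpc": "2.0", "id": None,
                     "error": {"code": -32700, "message": "Parse error"}}
        else:
            if not isinstance(message, dict):
                reply = {"jsonrpc": "2.0", "id": None,
                         "error": {"code": -32600, "message": "Invalid Request"}}
            else:
                handler = handlers.get(message.get("method"))
                if "id" not in message:
                    # Notifications carry no id and must not receive a response
                    if handler is not None:
                        handler(message.get("params", {}))
                    continue
                reply = {"jsonrpc": "2.0", "id": message["id"]}
                if handler is None:
                    reply["error"] = {"code": -32601, "message": "Method not found"}
                else:
                    reply["result"] = handler(message.get("params", {}))
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the client is waiting on this line
```

Passing the streams in as arguments lets you drive the loop from a test with `io.StringIO` before you wire it to a real client.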

Building the Server

Start with the mcpbuild CLI. Run mcpbuild init my-server and you get a project scaffold:

my-server/
  pyproject.toml
  server.py
  tools/
    __init__.py

server.py contains the JSON-RPC read loop and dispatch table. Each tool is a function registered by name. The add-tool command generates a stub handler and appends the tool schema to the registry. The run command boots the server on stdio (default) or HTTP transport.

The full CLI ships with this article. Download it, make it executable, and build your first MCP server in minutes.

The Prompt Toolkit

MCP Server Architect Prompt

Feed this prompt a server concept. Get back a complete specification ready for implementation.

<prompt>
<role>You are an MCP Server Architect. You produce complete MCP server specifications from a concept description.</role>
<input>
{{SERVER_CONCEPT}}
</input>
<output_format>
Return a specification with these sections:

1. **Server Identity**: name, version, description
2. **Tools**: For each tool, provide:
   - name (snake_case)
   - description (one sentence, action verb)
   - inputSchema (valid JSON Schema, flat preferred)
   - output shape (content types returned)
   - error cases (expected failure modes)
3. **Transport**: stdio or HTTP with rationale
4. **Auth**: required or none; if required, specify mechanism (API key header, OAuth scope, etc.)
5. **Error Handling Strategy**: per-tool error codes, fallback behavior, logging approach
</output_format>
<constraints>
- Every tool must have a description an agent can use for tool selection.
- Every inputSchema must include property-level descriptions.
- Prefer enum constraints over free-text where values are bounded.
- Transport choice must include latency and deployment context rationale.
- Error codes must use JSON-RPC standard codes where applicable (-32600, -32601, -32602) and custom codes in the -32000 to -32099 range for server-specific errors.
</constraints>
</prompt>

mcpbuild CLI

The mcpbuild CLI scaffolds, runs, and validates MCP servers from the terminal. Five commands cover the full lifecycle:

Command      Description
init <name>  Scaffold a new MCP server project with pyproject.toml, server.py, and tool stubs
add-tool     Interactive: enter tool name, description, and input schema JSON; generates a handler stub and registers the tool
run          Start the server (defaults to stdio transport; pass --transport http --port 8080 for HTTP)
validate     Check the server against the MCP spec: tool schemas are valid JSON Schema, error handlers exist for every tool, transport config is sound
test         Send test tools/list and tools/call messages to a running server and verify responses match the spec

Download the full implementation below. Single file, zero dependencies beyond the Python standard library (asyncio included).

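# Download: the link below opens a Google Drive viewer page, so save the file
# from the browser as mcpbuild.py (curl -O on this URL would fetch the HTML page, not the script)
# https://drive.google.com/file/d/1b1WFnBv0ZYcgQEW8KIVOKQtDzEKjPotm/view?usp=drive_link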


chmod +x mcpbuild.py

# Scaffold a project
./mcpbuild.py init weather-server

# Add a tool interactively
cd weather-server
../mcpbuild.py add-tool

# Run on stdio
../mcpbuild.py run

# Validate
../mcpbuild.py validate

# Test against a running server
../mcpbuild.py test

The CLI is a single Python file. Read it, modify it, make it yours. It uses raw JSON-RPC over stdio so you see exactly what flows between client and server.

Caveats

MCP is evolving. The spec adds capabilities and the reference implementations shift. The wire format is stable, but higher-level features like sampling, elicitation, and structured logging may change. Build on the core three methods (initialize, tools/list, tools/call) and you stay safe.

Stdio transport pairs with process-based hosting (Claude Desktop, IDE extensions). HTTP transport pairs with remote hosting. Pick the one that matches your deployment. Mixing both in one server adds complexity best reserved for later.

Schema validation is your first line of defense. Validate every incoming tools/call against the tool’s inputSchema before the handler runs. Reject early with a -32602 Invalid params error. This prevents malformed data from reaching your business logic.

Philosophy

Forge believes in building on the protocol, not around it. Frameworks accelerate; protocols ground. When you understand the JSON-RPC message format, the dispatch table, and the schema contract, frameworks become optional convenience rather than required dependency, letting you debug faster, ship leaner, and adapt when the spec evolves.

The best MCP server is the one you can explain on a whiteboard. Tool schema in, content out. Everything else is detail.

This is F01 in the ArchonHQ Forge Series. The next article, F02, covers tool schema design patterns that make agents invoke your tools correctly on the first try. Subscribe to ArchonHQ to unlock every Forge article, CLI tool, and prompt kit.



Build Your Own Private RAG Knowledge Base

Michal Szalinski — Wed, 03 Jun 2026 00:02:08 +0000

Every query you send to a cloud RAG service leaves your perimeter. Your documents, your questions, and your retrieved context all traverse networks you do not control, sit on servers you have no reason to trust, and become accessible to compliance teams you have yet to meet. The convenience is seductive and the cost is invisible until it becomes painfully visible. Bastion builds differently: your knowledge base stays on your hardware, your embeddings stay local, and your audit trail stays complete. This article gives you the architecture and the tooling to run Retrieval Augmented Generation entirely on your own terms.

The Idea (60 Seconds)

RAG augments a language model with retrieved context from your own documents. The typical pipeline ships your data to a cloud vector database and calls a remote embedding API for every query. A private RAG system replaces every cloud dependency with a local equivalent:

Storage and embeddings: Run models like all-MiniLM-L6-v2 or bge-large-en-v1.5 locally via sentence-transformers, and store the resulting vectors in SQLite with a cosine similarity function. Zero external database server dependencies.
Chunking: Split documents with fixed-size overlap, semantic boundaries, or markdown-aware strategies.
Generation: Route prompts to a local LLM through Ollama, llama.cpp, or any local inference engine.
Audit trail: Log every query and every retrieved chunk to a SQLite table. Compliance becomes a SQL query.

The result is a fully functional RAG system that sends zero bytes to external services.
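To make the storage and audit pieces concrete, here is a small sketch using only the standard library. It assumes embeddings arrive as plain lists of floats (in the setup above they would come from a local sentence-transformers model) and stores them as JSON text; the table and function names are illustrative.

```python
import json
import math
import sqlite3
from typing import List, Sequence, Tuple


def cosine_similarity(a_json: str, b_json: str) -> float:
    """Cosine similarity of two JSON-encoded vectors; 0.0 if either has zero length."""
    a, b = json.loads(a_json), json.loads(b_json)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Register the Python function so SQL queries can call it directly
    conn.create_function("cosine_similarity", 2, cosine_similarity)
    conn.execute("CREATE TABLE IF NOT EXISTS chunks "
                 "(id INTEGER PRIMARY KEY, text TEXT, embedding TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS audit_log "
                 "(id INTEGER PRIMARY KEY, query TEXT, chunk_id INTEGER, score REAL, "
                 "logged_at TEXT DEFAULT CURRENT_TIMESTAMP)")
    return conn


def add_chunk(conn: sqlite3.Connection, text: str, embedding: Sequence[float]) -> None:
    conn.execute("INSERT INTO chunks (text, embedding) VALUES (?, ?)",
                 (text, json.dumps(list(embedding))))
    conn.commit()


def search(conn: sqlite3.Connection, query: str, query_embedding: Sequence[float],
           top_k: int = 3) -> List[Tuple[int, str, float]]:
    """Return the top_k chunks by similarity and log each retrieval to the audit table."""
    rows = conn.execute(
        "SELECT id, text, cosine_similarity(embedding, ?) AS score "
        "FROM chunks ORDER BY score DESC LIMIT ?",
        (json.dumps(list(query_embedding)), top_k),
    ).fetchall()
    conn.executemany(
        "INSERT INTO audit_log (query, chunk_id, score) VALUES (?, ?, ?)",
        [(query, chunk_id, score) for chunk_id, _, score in rows],
    )
    conn.commit()
    return rows
```

This version scans every row on each query, which is fine for a sketch; the audit trail is then one `SELECT * FROM audit_log` away.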


Why This Matters

Privacy is a constraint that sharpens design. When you eliminate cloud dependencies, you also eliminate latency from network round trips, vendor lock-in from proprietary APIs, and data exposure from third-party processing. Your compliance team can audit the entire query history with a single SQL statement. Your security team can verify that zero data leaves the network perimeter. Your finance team can predict costs exactly, because local compute has a fixed price.

Regulated industries (healthcare, legal, defense, finance) operate under data residency rules. Sending patient records or classified documents through an external embedding API violates those rules by design. A private RAG system satisfies the rules by architecture rather than by policy or procedural override. The architecture itself enforces the boundary.

Performance improves too. Local embeddings on a modern GPU or Apple Silicon reach hundreds of embeddings per second. SQLite handles millions of vectors with sub-millisecond lookups when you pre-filter by collection. The bottleneck shifts from network latency to disk I/O, which you control entirely.

Walkthrough


Run Your Own LLM on a Laptop: The Complete Guide

Michal Szalinski — Tue, 02 Jun 2026 00:04:42 +0000


The Offer Equation

Michal Szalinski — Tue, 02 Jun 2026 00:03:26 +0000


Running AI Agents on a Laptop GPU - My 6GB VRAM Setup That Actually Works

Michal Szalinski — Tue, 02 Jun 2026 00:03:14 +0000


Running AI Agents on a Laptop GPU: My 6GB VRAM Setup That Actually Works

Michal Szalinski — Tue, 26 May 2026 21:03:00 +0000

You've seen the posts. "I'm running a 70B parameter model on my home server with 48GB VRAM." Cool story. Most of us are staring at a laptop with 6GB of VRAM and 32GB of system RAM, wondering if personal AI agents are beyond our reach.

They're within reach. I'm writing this on my everyday laptop in Melbourne, and my AI crew is running in the background right now. Water reminders, posture nudges, health research, meal planning, coding help, all happening locally, privately, and fast enough to keep pace.

Here's the setup, the models, the use cases, and the honest performance numbers.

Running AI Agents on a Laptop GPU - My 6GB VRAM Setup That Actually Works

How I run a personal AI crew on an everyday RTX 3060. Zero enterprise budget. Zero PhD. Zero excuses.

archonhq.ai

Your Content Is a Production Pipeline, Build It Like One

Michal Szalinski — Sun, 24 May 2026 21:40:00 +0000

You told yourself you'd post weekly. It's been six weeks. Your Substack dashboard mocks you with that sad "0 posts this month" counter.

Meanwhile, the AI bros on X are posting three times a day about "content leverage" while clearly using the same ChatGPT template as everyone else. Quantity up, quality sideways, audience numb.

There's a third option. You can treat content the way you treat production software: as a pipeline with intake, quality control, assembly, finishing, distribution, and feedback. Skip any step and you get either silence or garbage. Run every step and you get consistent, high-quality output while you sleep.

I know because I built it. This article you're reading? It came out of that pipeline. The other articles in this series? Same pipeline. Six Python scripts, five cron jobs, one environment file, zero frameworks.

Here's the architecture.

Your Content Is a Production Pipeline, Build It Like One

A system that discovers ideas, filters them, drafts them, QA's them, generates visuals, publishes, distributes, and measures

archonhq.ai


Build Your Own Cline Alternative in 200 Lines

Michal Szalinski — Fri, 22 May 2026 21:30:00 +0000

Your AI coding assistant vanishes overnight. Cline gets abandoned. Roo Code stops responding to issues. The VS Code extension that automated your file operations, ran terminal commands, and integrated with your preferred AI models suddenly throws deprecation warnings.

You’re back to copying code snippets manually. Context switching between terminal and editor. Explaining the same codebase structure to ChatGPT every session. The 40% productivity boost from autonomous coding assistance evaporates because someone else controlled the tools you relied on.

What if you could build your own AI coding assistant in an afternoon, own the entire stack, and customize it exactly for your workflow?

The Idea (60 Seconds)

You’ll create a minimal VS Code extension that handles file operations, executes terminal commands, and connects to any OpenAI-compatible API. The 200-line implementation provides autonomous coding capabilities through a simple chat interface that can read your codebase, modify files, and run commands. Setup takes 30 minutes. The result gives you permanent control over your AI coding workflow.


Why Build This Instead of Waiting for Alternatives

Dependency risk drops to zero. Commercial tools get discontinued. Open source projects get abandoned. Your custom extension lives in your codebase under your control. Zero external dependencies means zero abandonment risk.

Customization becomes unlimited. You control the prompts, the model endpoints, and the file operation logic. Add project-specific commands. Integrate with your deployment scripts. Modify the behavior to match your exact workflow.

API flexibility stays open. Connect to OpenAI, Anthropic, local Ollama instances, or any OpenAI-compatible endpoint. Switch providers by changing one configuration line. Your tool adapts to whatever AI infrastructure you prefer.

Walkthrough


How to Give Claude Perfect Memory

Michal Szalinski — Wed, 20 May 2026 21:13:00 +0000

Every time you start a new Claude session, you burn tokens re-establishing context. Over a month, that compounds into hours of wasted time and inconsistent outputs. A memory system pays for itself on day one. Layer one alone saves ten minutes per session. Layer three makes Claude genuinely useful for long-running projects where consistency across weeks matters.

Layer two takes about an hour and changes how Claude operates entirely.

Layer three turns Claude into a self-evolving second brain, trained on all your data, with persistent search and recall across every conversation you've ever had.

Here are all three.

https://open.substack.com/pub/michalszalinski/p/how-to-give-claude-perfect-memory?r=8ecvg&utm_campaign=post&utm_medium=web&showWelcomeOnShare=true


Clone Hermes Agent's Architecture for Your Own AI Assistant

Michal Szalinski — Tue, 19 May 2026 21:03:00 +0000

Your AI assistant forgets the conversation context after three exchanges. The tool calling fails when you chain multiple operations. The memory system breaks when handling complex workflows that span multiple sessions.

You’re cobbling together OpenAI function calls with custom prompt engineering while fighting race conditions in multi-step processes. The assistant that worked for simple Q&A completely falls apart when you need it to research, analyze, and execute a series of dependent tasks.

Meanwhile, Nous Research’s Hermes Agent handles complex workflows flawlessly. Multi-turn conversations maintain perfect context. Tool execution chains together seamlessly. The architecture scales from simple queries to sophisticated automation.

The Idea (60 Seconds)

You’ll reverse-engineer Hermes Agent’s core design patterns to build a production-grade AI assistant framework. The implementation uses a modular plugin system, persistent memory management, and standardized tool interfaces that handle complex workflows reliably. Setup takes 2 hours. The result gives you an assistant architecture that scales from basic chat to autonomous task execution.

Why This Architecture Instead of Simple Function Calling

Memory persistence solves context degradation. Standard chat implementations lose context as conversations grow. Hermes uses structured memory that maintains conversation state, user preferences, and task history across sessions. Your assistant remembers what you discussed yesterday and builds on previous work.

Plugin modularity enables unlimited expansion. Function calling requires hardcoded tool definitions. The Hermes pattern uses a plugin interface where tools register themselves dynamically. Add new capabilities by dropping Python files into a plugins directory. Zero core code changes.

Execution planning prevents tool chaos. Naive implementations call tools randomly based on user input. Hermes creates execution plans that sequence tool calls logically, handle dependencies, and recover from failures. That is the difference between “search for Python tutorials” and “search for Python tutorials, summarize the top 3, create a learning plan, and schedule practice sessions.”
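To illustrate dependency-ordered execution, here is a small sketch built on the standard library's `graphlib` (Python 3.9+). It is not Hermes Agent's actual planner; `run_plan`, the step names, and the shape of the step functions are assumptions made for the example.

```python
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List

Step = Callable[[Dict[str, Any]], Any]


def run_plan(steps: Dict[str, Step], depends_on: Dict[str, List[str]]) -> Dict[str, Any]:
    """Run each step only after its dependencies, handing it the results gathered so far."""
    # TopologicalSorter takes a mapping of node -> predecessors
    graph = {name: depends_on.get(name, []) for name in steps}
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        results[name] = steps[name](results)
    return results
```

Declaring that `summarize` depends on `search` and `plan` depends on `summarize` guarantees that order no matter how the steps were registered, and a circular dependency raises `graphlib.CycleError` instead of looping.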


Walkthrough

1. Core Agent Framework

Create the base agent class that handles conversation flow and tool coordination:

# agent.py
import json
import asyncio
from typing import Dict, List, Any, Optional
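from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class Message:
    role: str
    content: str
    timestamp: datetime
    # default_factory gives every message its own dict instead of a None default
    metadata: Dict[str, Any] = field(default_factory=dict)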

class HermesAgent:
    def __init__(self, model_client, memory_store, plugin_manager):
        self.model = model_client
        self.memory = memory_store
        self.plugins = plugin_manager
        self.conversation_id = None

    async def process_message(self, user_input: str) -> str:
        # Load conversation context
        context = await self.memory.get_context(self.conversation_id)

        # Create execution plan
        plan = await self.create_execution_plan(user_input, context)

        # Execute plan steps
        results = []
        for step in plan.steps:
            result = await self.execute_step(step)
            results.append(result)

        # Generate response
        response = await self.synthesize_response(results, user_input)

        # Store conversation state
        await self.memory.store_exchange(
            self.conversation_id, user_input, response, results
        )

        return response

2. Memory Management System

Implement persistent memory that maintains context across sessions:

# memory.py
import sqlite3
import json
from typing import Dict, List, Optional

class MemoryStore:
    def __init__(self, db_path: str):
        self.db_path = db_path
        self.init_database()

    def init_database(self):
        conn = sqlite3.connect(self.db_path)
        conn.execute('''
            CREATE TABLE IF NOT EXISTS conversations (
                id TEXT PRIMARY KEY,
                created_at TIMESTAMP,
                last_active TIMESTAMP,
                context_summary TEXT
            )
        ''')
        conn.execute('''
            CREATE TABLE IF NOT EXISTS messages (
                id INTEGER PRIMARY KEY,
                conversation_id TEXT,
                role TEXT,
                content TEXT,
                timestamp TIMESTAMP,
                metadata TEXT,
                FOREIGN KEY (conversation_id) REFERENCES conversations (id)
            )
        ''')
        conn.commit()
        conn.close()

    async def get_context(self, conversation_id: str) -> Dict:
        conn = sqlite3.connect(self.db_path)

        # Get recent messages
        messages = conn.execute('''
            SELECT role, content, timestamp, metadata 
            FROM messages 
            WHERE conversation_id = ? 
            ORDER BY timestamp DESC, id DESC
            LIMIT 20
        ''', (conversation_id,)).fetchall()

        # Get conversation summary
        summary = conn.execute('''
            SELECT context_summary 
            FROM conversations 
            WHERE id = ?
        ''', (conversation_id,)).fetchone()

        conn.close()

        return {
            'messages': [
                {
                    'role': msg[0], 
                    'content': msg[1], 
                    'timestamp': msg[2],
                    'metadata': json.loads(msg[3] or '{}')
                } 
                for msg in reversed(messages)
            ],
            'summary': summary[0] if summary else None
        }
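
The agent calls `memory.store_exchange(...)`, which the `MemoryStore` listing does not include. A possible shape for it is sketched below as a standalone function, assuming the two tables created by `init_database()`. Storing tool results as metadata on the assistant message and using ISO-8601 UTC timestamps are assumptions made for this sketch.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Any, List


def store_exchange(db_path: str, conversation_id: str, user_input: str,
                   response: str, results: List[Any]) -> None:
    """Persist one user/assistant exchange into the conversations and messages tables."""
    now = datetime.now(timezone.utc).isoformat()
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # Create the conversation row on first use, then bump last_active.
            conn.execute(
                "INSERT OR IGNORE INTO conversations (id, created_at, last_active) "
                "VALUES (?, ?, ?)",
                (conversation_id, now, now),
            )
            conn.execute(
                "UPDATE conversations SET last_active = ? WHERE id = ?",
                (now, conversation_id),
            )
            conn.execute(
                "INSERT INTO messages (conversation_id, role, content, timestamp, metadata) "
                "VALUES (?, 'user', ?, ?, '{}')",
                (conversation_id, user_input, now),
            )
            # Tool results ride along as metadata on the assistant message (an assumption).
            conn.execute(
                "INSERT INTO messages (conversation_id, role, content, timestamp, metadata) "
                "VALUES (?, 'assistant', ?, ?, ?)",
                (conversation_id, response, now,
                 json.dumps({"results": results}, default=str)),
            )
    finally:
        conn.close()
```

Both rows share one timestamp, so a reader that sorts by timestamp alone should also sort by `id` to keep the user and assistant messages in order. Inside an async method, the call can be wrapped in `asyncio.to_thread` so the blocking SQLite work stays off the event loop.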

3. Plugin System Architecture

Build the modular tool interface that enables dynamic capability expansion:

# plugins.py
import importlib.util
import os
from abc import ABC, abstractmethod
from typing import Dict, Any, List

class Plugin(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        pass

    @property
    @abstractmethod
    def description(self) -> str:
        pass

    @abstractmethod
    async def execute(self, parameters: Dict[str, Any]) -> Any:
        pass

    @abstractmethod
    def get_schema(self) -> Dict:
        pass

class PluginManager:
    def __init__(self, plugins_dir: str):
        self.plugins_dir = plugins_dir
        self.plugins: Dict[str, Plugin] = {}
        self.load_plugins()

    def load_plugins(self):
        for filename in os.listdir(self.plugins_dir):
            if filename.endswith('.py') and filename != '__init__.py':
                module_name = filename[:-3]
                spec = importlib.util.spec_from_file_location(
                    module_name, 
                    os.path.join(self.plugins_dir, filename)
                )
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)

                # Find Plugin subclasses
                for attr_name in dir(module):
                    attr = getattr(module, attr_name)
                    if (isinstance(attr, type) and 
                        issubclass(attr, Plugin) and 
                        attr != Plugin):
                        plugin_instance = attr()
                        self.plugins[plugin_instance.name] = plugin_instance

    def get_available_tools(self) -> List[Dict]:
        return [
            {
                'name': plugin.name,
                'description': plugin.description,
                'schema': plugin.get_schema()
            }
            for plugin in self.plugins.values()
        ]

4. Example Plugin Implementation

Create a web search plugin that follows the standard interface:

# plugins/web_search.py
import aiohttp
import json
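# Imports the Plugin base class from plugins.py. If the plugins/ directory
# contains an __init__.py, that package shadows plugins.py and this import fails.
from plugins import Plugin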

class WebSearchPlugin(Plugin):
    @property
    def name(self) -> str:
        return "web_search"

    @property
    def description(self) -> str:
        return "Search the web for current information"

    async def execute(self, parameters):
        query = parameters.get('query')
        max_results = parameters.get('max_results', 5)

        # Use your preferred search API
        async with aiohttp.ClientSession() as session:
            url = "https://api.search.brave.com/res/v1/web/search"
            headers = {"X-Subscription-Token": "your_api_key"}
            params = {"q": query, "count": max_results}

            async with session.get(url, headers=headers, params=params) as response:
                # Surface HTTP errors instead of parsing an error body as results
                response.raise_for_status()
                data = await response.json()

        results = []
        for item in data.get('web', {}).get('results', []):
            results.append({
                'title': item.get('title'),
                'url': item.get('url'),
                'description': item.get('description')
            })

        return {'results': results, 'query': query}

    def get_schema(self):
        return {
            'type': 'object',
            'properties': {
                'query': {'type': 'string', 'description': 'Search query'},
                'max_results': {'type': 'integer', 'description': 'Maximum results to return'}
            },
            'required': ['query']
        }
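
`get_schema()` describes the parameters a plugin expects, but nothing in the listings checks a call against that schema before `execute` runs. A small checker is sketched below under the assumption that schemas stay as flat as the one above. The function name and the error strings are hypothetical, and this is not a full JSON Schema validator.

```python
from typing import Any, Dict, List

# Maps JSON Schema type names to Python types; only simple scalar types are covered.
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_parameters(schema: Dict[str, Any], parameters: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the parameters look usable."""
    problems = []
    for key in schema.get("required", []):
        if key not in parameters:
            problems.append(f"missing required parameter: {key}")
    for key, value in parameters.items():
        spec = schema.get("properties", {}).get(key)
        if spec is None:
            problems.append(f"unexpected parameter: {key}")
            continue
        expected = _JSON_TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it explicitly for numeric fields
        if expected is not None and (
            not isinstance(value, expected)
            or (isinstance(value, bool) and expected is not bool)
        ):
            problems.append(f"{key} should be {spec['type']}, got {type(value).__name__}")
    return problems
```

Running this before `execute` turns a model that passes `"max_results": "5"` as a string into a readable error rather than a confusing API failure.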

5. Execution Planning

Implement the planning system that sequences tool calls intelligently:

# planner.py
import json
from typing import Dict, List, Optional
from dataclasses import dataclass

@dataclass
class ExecutionStep:
    tool_name: str
    parameters: Dict
    depends_on: Optional[List[str]] = None
    step_id: Optional[str] = None

class ExecutionPlanner:
    def __init__(self, model_client, plugin_manager):
        self.model = model_client
        self.plugins = plugin_manager

    async def create_plan(self, user_input: str, context: Dict) -> List[ExecutionStep]:
        available_tools = self.plugins.get_available_tools()

        planning_prompt = f"""
        User request: {user_input}
        Available tools: {json.dumps(available_tools, indent=2)}

        Create a step-by-step execution plan. Each step should use one tool.
        Consider dependencies between steps.

        Respond with a JSON array of steps:
        [
            {{
                "step_id": "step_1",
                "tool_name": "web_search",
                "parameters": {{"query": "Python tutorials"}},
                "depends_on": []
            }}
        ]
        """

        response = await self.model.complete(planning_prompt)
        steps_data = json.loads(response)

        return [ExecutionStep(**step) for step in steps_data]
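
The planner returns steps with a `depends_on` field, but the listings never use it to order execution. One way to honor those dependencies is a topological sort, sketched here with the standard library's `graphlib` (Python 3.9+). The `order_steps` name and its dict-based input are assumptions for illustration.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Optional


def order_steps(depends_on: Dict[str, Optional[List[str]]]) -> List[str]:
    """Return step ids in an order where every step follows its dependencies.

    `depends_on` maps each step_id to the ids it waits for (None means none).
    Raises ValueError on an unknown dependency or a cycle.
    """
    graph = {step_id: set(deps or []) for step_id, deps in depends_on.items()}
    for step_id, deps in graph.items():
        unknown = deps.difference(graph)
        if unknown:
            raise ValueError(f"{step_id} depends on unknown steps: {sorted(unknown)}")
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle
        raise ValueError(f"plan contains a dependency cycle: {exc.args[1]}") from exc
```

Building the input from a plan is one line: `order_steps({s.step_id: s.depends_on for s in steps})`.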

Caveats

Model quality determines planning effectiveness. The execution planner relies on the language model understanding tool capabilities and sequencing logic. Weaker models create inefficient plans or miss dependencies. GLM-5.1-level capability becomes essential for complex workflows.
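
Because plan quality depends on the model, it helps to reject a malformed plan before running any tool. The sketch below parses a reply and checks it against the set of registered tool names. The `parse_plan` name, the required field set, and the tolerance for a Markdown code fence around the JSON are assumptions rather than part of the original planner.

```python
import json
from typing import Any, Dict, List, Set

FENCE = "`" * 3  # built indirectly so this listing can sit inside a fenced block


def parse_plan(raw: str, known_tools: Set[str]) -> List[Dict[str, Any]]:
    """Parse a model reply into plan steps, raising ValueError on anything malformed."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with optional language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    steps = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(steps, list):
        raise ValueError("plan must be a JSON array of steps")
    for step in steps:
        if not isinstance(step, dict):
            raise ValueError(f"each step must be an object, got {type(step).__name__}")
        missing = {"step_id", "tool_name", "parameters"}.difference(step)
        if missing:
            raise ValueError(f"step is missing fields: {sorted(missing)}")
        if step["tool_name"] not in known_tools:
            raise ValueError(
                f"step {step['step_id']} uses unknown tool {step['tool_name']!r}"
            )
    return steps
```

A caller can catch `ValueError`, feed the message back to the model, and ask for a corrected plan instead of executing a broken one.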

Memory storage grows indefinitely. The SQLite implementation accumulates conversation history permanently. Add cleanup routines for conversations older than 30 days or implement conversation archiving to prevent database bloat.
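
A cleanup routine along those lines might look like the sketch below. It assumes `last_active` holds ISO-8601 UTC strings, which compare chronologically as text; the function name and the `max_age_days` parameter are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_old_conversations(conn: sqlite3.Connection, max_age_days: int = 30) -> int:
    """Delete conversations idle for longer than max_age_days, plus their messages.

    Returns the number of conversations removed.
    """
    cutoff = (datetime.now(timezone.utc) - timedelta(days=max_age_days)).isoformat()
    with conn:  # one transaction: both deletes commit together or not at all
        conn.execute(
            "DELETE FROM messages WHERE conversation_id IN "
            "(SELECT id FROM conversations WHERE last_active < ?)",
            (cutoff,),
        )
        cur = conn.execute(
            "DELETE FROM conversations WHERE last_active < ?", (cutoff,)
        )
    return cur.rowcount
```

Messages are deleted first because the schema declares a foreign key from `messages` to `conversations`; removing the parent rows first would leave orphans, or fail outright when foreign key enforcement is switched on.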

Plugin isolation remains minimal. Plugins execute in the same Python process with full system access. Malicious or buggy plugins can crash the entire agent. Consider sandboxing for production deployments handling untrusted plugins.

Philosophy

Building your own agent architecture creates compound advantages over time. Each plugin you add gives the planner one more capability to combine with the rest. The memory system accumulates your preferences and work patterns. The execution planner gets better at sequencing tasks for your specific use cases.

The Hermes architecture pattern scales from personal assistant to team automation platform. Start with web search and file operations. Add calendar integration, code analysis, and deployment tools. The modular design grows with your needs while maintaining reliability.

You own the entire stack. Zero vendor dependencies. Zero API rate limits. Zero feature deprecation risk.

Build Yours

Start with the core agent framework and memory system. Build one plugin. Test the execution planning with simple two-step workflows. The architecture becomes clear once you see it running.

What’s the first capability you’ll add to your agent? Drop your plugin ideas in the comments.

This article was originally published on ArchonHQ — practical AI that wins every day. Subscribe free to get new articles in your inbox.

Your ICP Is a Trap

Michal Szalinski — Sun, 17 May 2026 23:24:33 +0000

Hook

You spend six weeks building an AI agent that automates invoice processing for small businesses. You launch. Crickets. You post in three Discord servers, send 40 cold DMs, run $200 in ads. Zero paying customers. The product works. The demos are smooth. Sales stay at zero.

The problem stared you in the face the whole time. Your ICP was "small business owners who need automation." That describes 30 million people and excites exactly zero of them. You defined your ideal customer by demographics, by role, by company size. You listed who they are. You failed to ask whether they care, whether they spend, whether you can reach them, and whether you have any right to win.

The Idea (60 Seconds)

Your Ideal Customer Profile is a trap when it answers the wrong question. Most builders define ICP by demographics: age, income, job title, company size. Those attributes describe a person. They fail to predict behavior.

A strong ICP answers one question: Who is actively trying to solve this problem right now, has the ability to pay, and can be reached?

Urgency and situation beat demographics every time. A 42-year-old CFO at a logistics company drowning in manual reconciliation is your ICP. "CFOs at mid-market companies" is a demographic label that includes thousands of people perfectly happy with their spreadsheets.

The 4-Filter Test screens your ICP before you invest a single build hour. Pain. Market. Access. Fit. Each filter eliminates weak assumptions. Pass all four, and you have a target worth building for.

Two complementary question sequences sharpen the result. The Narrowing Funnel, derived from Alex Hormozi's framework, starts broad and drills to urgency. The Lighthouse Client Method, created by Rmosh, grounds your ICP in a real human being instead of an abstract persona.

Why This Matters

Every AI builder hits the same wall. You learn prompt engineering. You master agent frameworks. You ship something that works. Then you realize you built it for everybody, which means you built it for an audience of zero.

Generic ICPs produce generic messaging. Generic messaging produces low conversion and high churn. You attract people who kind of sort of need your thing. They sign up, poke around, and leave. Your retention numbers look like a cliff.

The cost compounds fast. Six weeks of building for the wrong audience means six weeks of code you may need to rewrite, six weeks of positioning you need to undo, and six weeks of motivation burned on a product zero people wanted.

The antidote is simple and ruthless: filter before you build. The 4-Filter Test takes 30 minutes and saves months.

Walkthrough

The 4-Filter Test

Run your ICP through these four filters in order. Fail any single one, and you stop. Revisit your assumptions. Pick a different target. Do it all before writing a single line of code.

Filter 1: Pain. Are real people experiencing this problem and actively seeking solutions?

This is the urgency filter. People complain about many things. People seek solutions for far fewer. Your ICP must have a problem painful enough that they are already looking for help, googling alternatives, posting in forums, asking colleagues.

Test: Search for the problem in Reddit, Twitter, industry Slack channels. If people are posting about it and asking for recommendations, pain is real. If you find only vague complaints, the pain is too low to drive purchase behavior.

Example: "Bookkeeping is tedious" is a complaint. "I spent 12 hours last weekend reconciling invoices and I am still behind" is a pain signal. The second person buys. The first person scrolls past your ad.

Filter 2: Market. Is there a group spending money on solutions already?

Existing spend proves willingness to pay. If zero people are spending money to solve this problem, you are fighting human inertia and budget allocation at the same time. That is a losing battle.

Test: Search for existing products, agencies, consultants, or freelancers serving this problem. Check their pricing pages. Look for G2 or Capterra listings. Paid competitors validate the market. Zero competitors usually signals zero market, and first-mover advantage is a myth for solo builders.

Example: Automation tools for real estate agents exist everywhere, and agents pay for them. That signals a market. A tool for "people who want to journal more creatively" faces a market of free alternatives and low willingness to pay.

Filter 3: Access. Can you reach these people through channels you can actually use?

A perfect ICP locked behind an unreachable channel is useless. If your target is Fortune 500 CTOs and your only channel is a Twitter account with 200 followers, you lack access. Access means you can put your message in front of your ICP repeatedly, at low cost, starting this week.

Test: List every channel where your ICP spends time. Then honestly assess whether you can show up there. Do you have followers there? Do you know someone who does? Can you write content they read? Can you cold-email them effectively?

Example: React developers are reachable through Twitter, Dev.to, Discord, GitHub, and conference communities. Mid-market hospital administrators are reachable through expensive trade shows and closed networks. Pick the ICP you can actually reach.

Filter 4: Fit. Does your skill or experience give you an edge with this group?

You need earned advantage. Domain knowledge, professional network, technical expertise, or lived experience that lets you build something better or faster than a random competitor. Fit is your moat at the earliest stage.

Test: Ask yourself what you know about this ICP that most people lack. If the answer is "zero," you are competing on execution alone against people who have both execution and insight.

Example: A former tax accountant building automation for tax firms has massive fit. A career developer building automation for dental practices has zero fit. Both can build the product. The former builds the right product faster.

The Narrowing Funnel (Hormozi-Derived)

Once your ICP passes all four filters, sharpen it with this question sequence. Each question narrows the field.

1. Who specifically? Start broad: "Business owners." Narrow: "E-commerce business owners." Narrower: "E-commerce business owners doing $1M to $10M in revenue." Each level removes people who dilute your message.
2. What is their situation? Describe the context that creates the problem. "E-commerce owners managing inventory across three warehouses with a team of five and lacking a dedicated operations person."
3. What is the painful version? Find the acute symptom. "SKU mismatches causing stockouts on best-selling items during peak season." This is what keeps them up at night.
4. What triggers them to seek help right now? Identify the event that converts latent frustration into active purchasing. "Black Friday inventory errors cost them $50K in lost sales last year, and Q4 is eight weeks away." That is urgency.
5. What is the outcome they would pay for? State the result in their language. "Eliminate SKU mismatches so every order ships correct and on time." Name the outcome, not the feature.

The Lighthouse Client Method (Rmosh)

The Narrowing Funnel gives you a precise segment. The Lighthouse Client Method grounds it in a real human being.

1. Identify one person you would love to help. A specific individual. A former colleague, a client you worked with, a person from a community you belong to. Someone you can picture clearly.
2. Map their entire day. From morning to evening, what do they do? Where do they spend time? What tools do they open? What meetings drain them? What tasks feel like wasted effort?
3. Find the friction point they complain about most. The thing they mention unprompted. The task that makes them groan. The process they describe as "the worst part of my week."
4. Build for that person, then generalize. Create the solution that eliminates their specific friction. Then ask: who else has this same friction in this same context? Those people are your ICP.

This method works because it anchors your product in observed behavior instead of imagined needs. You solve a real problem for a real person. Other people with the same problem recognize themselves in your messaging because it describes their actual experience.

The Two Big Beginner Mistakes

Mistake 1: "My ICP is everyone who might pay." This feels safe. It is the opposite of safe. Broad targeting produces generic messaging. Generic messaging converts at a fraction of specific messaging. You attract marginal customers who churn fast because the product serves everyone poorly instead of serving someone exceptionally.

Fix: Define your ICP by best-fit criteria and disqualifiers. Write down who you serve and who you deliberately exclude. Disqualifiers sharpen your positioning as much as qualifiers. "We help e-commerce operators doing $1M to $10M. Enterprise teams and solopreneurs fall outside our focus."

Mistake 2: Choosing a niche based on passion or identity, assuming the market rewards authenticity. Passion is a starting point. It falls short as a standalone strategy. The market rewards value, and value requires craft. Building for a niche you love but where you lack skill produces mediocre products that compete against people with genuine expertise.

Fix: Replace passion-first with craft plus pull. Craft means your skill gives you an edge. Pull means the market signals demand. When craft and pull align, you have a sustainable position. When they misalign, you have a hobby.

The Prompt Toolkit

ICP Extraction Prompt

Copy the prompt below, replace the placeholder with your business idea, and paste it into any LLM.

<role>You are a ruthless ICP analyst. You eliminate weak assumptions and surface the truth about whether a business idea has a viable target customer.</role>

<task>Run the 4-Filter Test on the business idea below. Score each filter from 1 to 10. For each filter, provide the score, the reasoning, and the specific evidence a builder should gather to validate or invalidate the score. Be brutally honest. Affirmative evidence only; discard wishful thinking.</task>

<filters>
 <filter name="Pain">
 <question>Are real people experiencing this problem and actively seeking solutions right now?</question>
 <scoring_guide>10 = People post daily in public forums begging for a fix. 5 = People complain occasionally. 1 = You assume the pain exists based on logic alone.</scoring_guide>
 </filter>
 <filter name="Market">
 <question>Is there a group already spending money to solve this problem?</question>
 <scoring_guide>10 = Multiple paid products with pricing pages and reviews. 5 = One or two niche tools exist. 1 = Zero paid solutions exist.</scoring_guide>
 </filter>
 <filter name="Access">
 <question>Can you reach these people through channels you can actually use this week?</question>
 <scoring_guide>10 = You already have an audience or direct connection. 5 = You can reach them through public communities. 1 = They hide behind gatekeepers and enterprise sales cycles.</scoring_guide>
 </filter>
 <filter name="Fit">
 <question>Does your skill or experience give you an edge with this group?</question>
 <scoring_guide>10 = You have years of domain expertise and a network. 5 = You have adjacent skills. 1 = You have zero connection to this world.</scoring_guide>
 </filter>
</filters>

<output_format>
Return your response in this exact structure:
<result>
 <idea_summary>One-sentence restatement of the idea</idea_summary>
 <filter name="Pain" score="X">Reasoning and evidence to gather</filter>
 <filter name="Market" score="X">Reasoning and evidence to gather</filter>
 <filter name="Access" score="X">Reasoning and evidence to gather</filter>
 <filter name="Fit" score="X">Reasoning and evidence to gather</filter>
 <total_score>X/40</total_score>
 <verdict>PASS if total >= 28, CONDITIONAL if 20-27, FAIL if below 20</verdict>
 <next_action>One specific thing the builder should do next</next_action>
</result>
</output_format>

<business_idea>
[PASTE YOUR BUSINESS IDEA HERE]
</business_idea>

Lighthouse Client Prompt

Copy the prompt below, answer the questions honestly, and paste it into any LLM.

<role>You are a product strategist who specializes in grounding abstract customer profiles in real human behavior. Your method is the Lighthouse Client Method: find one real person, observe their actual day, and surface the friction that drives purchasing.</role>

<task>Walk me through the Lighthouse Client Method step by step. Ask me one question at a time. Wait for my answer before proceeding to the next step. Complete all four steps.</task>

<steps>
 <step number="1" name="Identify">
 <instruction>Ask me to name one specific person I would love to help. This must be a real individual I can picture clearly: a former colleague, a past client, someone from a community I belong to. Ask for their first name (or alias), their role, and their industry.</instruction>
 </step>
 <step number="2" name="Map the Day">
 <instruction>Ask me to describe this person's typical workday from morning to evening. Prompt me to include: what tools they open, what meetings they attend, what tasks consume their time, and what feels like wasted effort. Probe for specifics.</instruction>
 </step>
 <step number="3" name="Find the Friction">
 <instruction>Ask me to identify the single task this person complains about most. The thing they mention unprompted. The process that makes them groan. Ask what they have tried to fix it and why those attempts fell short.</instruction>
 </step>
 <step number="4" name="Generalize">
 <instruction>Based on everything I shared, produce a one-paragraph ICP statement in this format: "People like [name], who are [role] at [type of company], who struggle with [specific friction] because [root cause], and who would pay for [outcome]."</instruction>
 </step>
</steps>

<output_format>
After I complete all four steps, output:
<lighthouse_result>
 <client_profile>Summary of the person I described</client_profile>
 <friction_point>The specific pain you identified</friction_point>
 <icp_statement>The one-paragraph ICP statement from Step 4</icp_statement>
 <validation_checklist>Three specific actions I should take this week to confirm this friction exists for five more people</validation_checklist>
</lighthouse_result>
</output_format>

ICP Validation CLI

Save the script below as icp_check.py, set your OPENROUTER_API_KEY environment variable, and run it.

import argparse
import json
import os
import sys
import urllib.request

def main():
    p = argparse.ArgumentParser(description="4-Filter ICP Assessment via OpenRouter")
    p.add_argument("idea", help="Your business idea description")
    p.add_argument("--model", default="google/gemini-2.0-flash-001")
    args = p.parse_args()
    key = os.environ.get("OPENROUTER_API_KEY", "")
    if not key:
        # Exit explicitly; an assert would be skipped under python -O
        sys.exit("Set OPENROUTER_API_KEY env var")
    prompt = f"""Score this business idea on the 4-Filter ICP Test. Each filter gets 1-10.
Filters: Pain (active problem seekers?), Market (existing spend?), Access (reachable channels?), Fit (your edge?).
Verdict: PASS if total >= 28, CONDITIONAL if 20-27, FAIL if below 20.
Idea: {args.idea}
Return JSON only: {{"pain": int, "market": int, "access": int, "fit": int, "total": int, "verdict": "PASS|CONDITIONAL|FAIL"}}"""
    body = json.dumps(
        {"model": args.model, "messages": [{"role": "user", "content": prompt}]}
    ).encode()
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=body,
        headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
    )
    # Close the HTTP response once the body has been read
    with urllib.request.urlopen(req) as http_resp:
        resp = json.loads(http_resp.read())
    # Assumes the model replies with bare JSON, as the prompt requests
    r = json.loads(resp["choices"][0]["message"]["content"])
    print(f"Pain: {r['pain']}/10 | Market: {r['market']}/10 | Access: {r['access']}/10 | Fit: {r['fit']}/10")
    print(f"Total: {r['total']}/40 | Verdict: {r['verdict']}")

if __name__ == "__main__":
    main()
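
The script prints the total and verdict exactly as the model returns them, and models do get arithmetic wrong. A local recomputation is sketched below, using the thresholds from the ICP Extraction Prompt (PASS at 28 or above, CONDITIONAL from 20 to 27, FAIL below 20). The `score_icp` name is hypothetical.

```python
from typing import Dict, Tuple

FILTERS = ("pain", "market", "access", "fit")


def score_icp(scores: Dict[str, int]) -> Tuple[int, str]:
    """Recompute total and verdict from the four filter scores.

    Raises ValueError if any filter is missing or outside the 1-10 range.
    """
    for name in FILTERS:
        value = scores.get(name)
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 10:
            raise ValueError(f"{name} must be an integer from 1 to 10, got {value!r}")
    total = sum(scores[name] for name in FILTERS)
    if total >= 28:
        return total, "PASS"
    if total >= 20:
        return total, "CONDITIONAL"
    return total, "FAIL"
```

Calling `score_icp(r)` on the parsed reply and printing its result keeps the verdict consistent with the four scores even when the model's own sum is off.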

Caveats

The 4-Filter Test eliminates bad ICPs fast. It can also create false confidence if you lie to yourself on any filter. Confirmation bias is the enemy. Run each filter assuming your ICP fails, and make the evidence prove otherwise. The opposite approach, seeking evidence that confirms your hope, leads to the same wasted months the test is designed to prevent.

The Lighthouse Client Method risks overfitting to one person. Your lighthouse client may have idiosyncratic needs that diverge from the broader market. After building for them, validate that the problem generalizes. Talk to five more people in the same segment. If three of five describe the same friction, you have product-market signal. If only one of five does, you have a consulting client.

Markets shift. An ICP that passes all four filters today may fail in six months as conditions change. Revisit the test quarterly. Treat your ICP as a hypothesis, and treat revenue as the experiment result.

Philosophy

The best product strategy starts with ruthless selection, and selection means elimination. Every person you exclude from your ICP makes your messaging sharper for the people who remain. Every filter you apply removes a possible path to wasted effort.

Building AI tools is easier than ever. The moat has moved from technical execution to problem selection. The builders who win are the ones who chose the right problem before they wrote a single function. The 4-Filter Test is how you choose correctly.

Specificity is generosity. A vague ICP leaves every reader uncertain whether this product serves them. A precise ICP tells the right people, "this was built for you, and you can see it." That clarity converts.

This is the first entry in the Caliber Series, a paid column on building and selling AI tools. The next article breaks down how to validate your ICP in 48 hours using nothing but free tools and five conversations. Upgrade to access the full series.