If you've been building AI agents this year, here's something counterintuitive: the best "AI agents" in production aren't actually that agentic. They're mostly well-engineered software with LLMs sprinkled in at strategic points.
That's the core insight behind 12-factor-agents, a GitHub repo that's quietly accumulated 19,788 stars and 475 HN points in under 14 months. Created by Dex Horthy (dhorthy) from HumanLayer, it's inspired by Heroku's Twelve-Factor App methodology, but applied to the messy reality of shipping LLM-powered software.
Most teams building with agent frameworks hit a wall at 70-80% reliability. The ones who break through? They don't build greenfield agent systems from scratch. They borrow small, modular patterns and embed them into their existing products.
So let's dig into 5 hidden production patterns from the 12-factor-agents repo that most developers completely overlook.
1. Treat Your LLM Calls Like Database Queries (Not Magic)
Why most people get this wrong: They treat every LLM interaction as a unique snowflake, stuffing infinite context and hoping for the best. This leads to wildly inconsistent outputs, astronomical token costs, and zero observability.
The pattern: Structure your LLM interactions like you would a database query — with clear inputs, deterministic transformations, and measurable outputs.
from langchain.prompts import PromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

# Define schemas for your inputs and outputs
response_schemas = [
    ResponseSchema(name="action", description="The specific action to take"),
    ResponseSchema(name="confidence", description="Confidence score 0-1"),
    ResponseSchema(name="reasoning", description="Brief explanation"),
]
parser = StructuredOutputParser.from_response_schemas(response_schemas)

# Now your LLM call is like a typed database query
template = PromptTemplate(
    template="""Extract structured data from the user request.
{format_instructions}
Request: {query}
""",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

# This gives you parseable output with a known shape
# instead of free-form text you have to regex-parse.
# After the model responds, parser.parse(text) returns a dict with those fields.
Why it works: When your LLM outputs are typed and structured, you can validate them, log them, and build error handling around them. No more json.decoder.JSONDecodeError panics in production at 3am.
Data: This pattern maps to the repo's Factor 4, "Tools are just structured outputs."
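To make "validate them" concrete, here is a minimal framework-free sketch that parses raw model text into a typed object and rejects anything malformed. The AgentDecision class, the parse_decision helper, and the ALLOWED_ACTIONS set are assumptions for illustration and do not come from the repo.

```python
import json
from dataclasses import dataclass

# Hypothetical action set; replace with the actions your product supports
ALLOWED_ACTIONS = {"book", "cancel", "escalate"}


@dataclass
class AgentDecision:
    action: str
    confidence: float
    reasoning: str


def parse_decision(raw: str) -> AgentDecision:
    """Parse and validate raw model text; raise ValueError on bad output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model returned invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    missing = {"action", "confidence", "reasoning"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["action"] not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {data['action']!r}")

    try:
        confidence = float(data["confidence"])
    except (TypeError, ValueError) as exc:
        raise ValueError("confidence must be a number") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")

    return AgentDecision(data["action"], confidence, str(data["reasoning"]))
```

Because every failure surfaces as a single exception type, the calling code can decide in one place whether to retry, fall back, or escalate.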
2. Build Feedback Loops, Not One-Shot Agents
Why most people get this wrong: They build an agent that does ONE big thing in one shot. When it fails (and it will), there's no recovery mechanism.
The pattern: Decompose complex tasks into small, recoverable steps with explicit feedback signals between each step.
import asyncio
from enum import Enum


class AgentState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    NEEDS_REVIEW = "needs_review"
    FAILED = "failed"
    COMPLETE = "complete"


class Step:
    def __init__(self, name: str, action_fn, review_fn=None):
        self.name = name
        self.action_fn = action_fn
        # Default review: accept any non-None result
        self.review_fn = review_fn or (lambda x: x is not None)
        self.result = None
        self.state = AgentState.PENDING

    async def execute(self, previous_result=None):
        # Each step receives the previous step's result as its input
        self.state = AgentState.RUNNING
        try:
            self.result = await self.action_fn(previous_result)
        except Exception:
            self.state = AgentState.FAILED
            raise
        if self.review_fn(self.result):
            self.state = AgentState.COMPLETE
        else:
            self.state = AgentState.NEEDS_REVIEW
        return self.result


# Build a pipeline of small, reviewable steps.
# classify_intent, fetch_relevant_docs, generate_answer, validate_output,
# and escalate_for_review are async functions you supply.
steps = [
    Step("classify", classify_intent),
    Step("fetch_context", fetch_relevant_docs),
    Step("generate_response", generate_answer),
    Step("validate_quality", validate_output),
]


# Each step can fail independently, be retried, or escalate
async def run_pipeline(user_input):
    result = user_input
    for step in steps:
        result = await step.execute(result)
        if step.state == AgentState.NEEDS_REVIEW:
            # Route to human review or retry with modified prompt
            await escalate_for_review(step)
    return result

# Entry point: asyncio.run(run_pipeline("Book a table for two"))
Why it works: When tasks are broken into small steps with explicit review points, you can: (1) catch failures early, (2) route edge cases to humans, and (3) collect training data from the review process.
Data: The 12-factor-agents repo makes closely related points in its "Small, focused agents" and "Contact humans with tool calls" factors: production agents need small steps and human-in-the-loop checkpoints.
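The "retry, then escalate" idea can be sketched as a small helper that feeds the reviewer's feedback into the next attempt and gives up after a fixed number of tries. The run_with_retries name, the (ok, feedback) return shape of review_fn, and the status strings are assumptions for illustration, not an API from the repo.

```python
import asyncio


async def run_with_retries(action_fn, review_fn, payload, max_attempts=3):
    """Run action_fn until review_fn accepts the result or attempts run out.

    action_fn(payload, feedback) is an async callable; feedback is None on
    the first attempt. review_fn(result) returns (ok, feedback).
    Returns (result, status), where status is "complete" or "needs_review".
    """
    feedback = None
    result = None
    for _ in range(max_attempts):
        result = await action_fn(payload, feedback)
        ok, feedback = review_fn(result)
        if ok:
            return result, "complete"
    # Attempts exhausted: hand the last result to a human reviewer
    return result, "needs_review"


async def demo():
    # Stand-in for an LLM call that only succeeds once it receives feedback
    async def draft(payload, feedback):
        return payload.upper() if feedback else payload

    def review(result):
        ok = result.isupper()
        return ok, None if ok else "respond in upper case"

    return await run_with_retries(draft, review, "hi")


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Capping attempts keeps token spend bounded, and the "needs_review" status gives the caller an explicit signal to route the case to a person instead of looping forever.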
3. Memory Is Not Just Context Window — It's a System
Why most people get this wrong: They dump everything into the context window and call it "memory." This doesn't scale, doesn't persist across sessions, and costs a fortune in tokens.
The pattern: Implement a tiered memory system with working memory, episodic memory, and semantic memory.
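from dataclasses import dataclass


@dataclass
class MemoryEntry:
    content: str
    memory_type: str   # "working" | "episodic" | "semantic"
    importance: float  # 0-1
    timestamp: float
    key: str = ""      # Lookup key, used for semantic entries


class TieredMemory:
    def __init__(self, max_working=5, max_episodic=50):
        self.working = []   # Current conversation context
        self.episodic = []  # Recent sessions
        self.semantic = {}  # Long-term learned facts
        self.max_working = max_working
        self.max_episodic = max_episodic

    def store(self, entry: MemoryEntry):
        if entry.memory_type == "working":
            self.working.append(entry)
            if len(self.working) > self.max_working:
                # Compress older working memory into one episodic entry
                self._store_episodic(self._compress(self.working[:-1]))
                self.working = [self.working[-1]]
        elif entry.memory_type == "episodic":
            self._store_episodic(entry)
        elif entry.memory_type == "semantic":
            self.semantic[entry.key] = entry  # Vectorized storage in prod

    def _store_episodic(self, entry: MemoryEntry):
        self.episodic.append(entry)
        # Drop the oldest episodes once the cap is exceeded
        self.episodic = self.episodic[-self.max_episodic:]

    def get_context(self, query: str) -> str:
        # Retrieve relevant memories based on query
        relevant = [
            *self.working[-3:],  # Recent working memory
            *self._search_episodic(query, top_k=3),
            *self._search_semantic(query, top_k=5),
        ]
        return "\n".join(m.content for m in relevant)

    def _search_episodic(self, query, top_k):
        return self._keyword_search(self.episodic, query, top_k)

    def _search_semantic(self, query, top_k):
        return self._keyword_search(list(self.semantic.values()), query, top_k)

    @staticmethod
    def _keyword_search(entries, query, top_k):
        # Placeholder keyword overlap; swap in vector search in prod
        words = set(query.lower().split())
        hits = [e for e in entries if words & set(e.content.lower().split())]
        return sorted(hits, key=lambda e: e.importance, reverse=True)[:top_k]

    def _compress(self, entries):
        # Summarize old entries to save space
        summary = f"Previous session summary: {len(entries)} interactions"
        return MemoryEntry(content=summary, memory_type="episodic",
                           importance=0.5, timestamp=entries[0].timestamp)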
Why it works: Tiered memory lets you retain important information long-term while keeping your working context lean. It loosely mirrors the distinction between working, episodic, and semantic memory in people.
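One way to keep the working context lean is to give retrieval a token budget and keep only the most important memories that fit. The sketch below is illustrative: the Memory class, fit_to_budget, and the rough "4 characters per token" estimate are assumptions, and a real system would use the model's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    content: str
    importance: float  # 0-1, higher is kept first


def estimate_tokens(text: str) -> int:
    # Rough heuristic: about 4 characters per token
    return max(1, len(text) // 4)


def fit_to_budget(memories: list[Memory], max_tokens: int) -> list[Memory]:
    """Keep the most important memories whose combined size fits max_tokens."""
    kept, used = [], 0
    for memory in sorted(memories, key=lambda m: m.importance, reverse=True):
        cost = estimate_tokens(memory.content)
        if used + cost <= max_tokens:
            kept.append(memory)
            used += cost
    return kept
```

This greedy pass is not optimal packing, but it is predictable: a low-importance memory never displaces a high-importance one.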
4. Evolvability — Your Prompts Need Version Control
Why most people get this wrong: They edit prompts directly in code, never track changes, and have no way to A/B test or roll back when a prompt update breaks production.
The pattern: Treat prompts as code. Store them in versioned files, use a prompt registry, and deploy prompt changes through your CI/CD pipeline.
# prompts/reservation/v1.yaml
system: |
  You are a restaurant reservation assistant.
  Always confirm: date, time, party size, name, phone.
  If the requested time is unavailable, suggest alternatives.
preamble: |
  Today's date: {date}
  Restaurant hours: 11am-10pm daily

# prompts/reservation/v2.yaml (A/B test)
system: |
  You are a friendly restaurant reservation assistant.
  Be warm and conversational while collecting:
  - Date (YYYY-MM-DD format)
  - Time (HH:MM 24-hour format)
  - Party size (1-20 guests)
  - Guest name and phone number
  If unavailable, offer 2 nearby alternatives.
---
# Load prompts from a registry (like feature flags).
# prompt_registry stands in for your own registry module.
from datetime import date

from prompt_registry import PromptRegistry

registry = PromptRegistry(
    backend="file",  # In prod: use a feature flag service
    path="./prompts",
)


async def get_prompt(task: str, version: str = "current") -> str:
    return await registry.get(task, version=version)


# In your agent (llm and log_prompt_version are your own client and logger):
async def handle_reservation(user_input: str, version: str = "v2"):
    prompt = await get_prompt("reservation", version=version)  # A/B test
    response = await llm.generate(prompt.format(
        date=date.today().isoformat(),
        user_input=user_input,
    ))
    await log_prompt_version("reservation", version, response)
    return response
Why it works: When prompts are versioned and deployed through CI/CD, you can roll back instantly, A/B test systematically, and trace every change through git history.
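For the A/B testing part, a common approach is to hash a stable user identifier into a bucket so each user always sees the same prompt version. This is a minimal sketch; assign_prompt_version, the "v1"/"v2" labels, and the rollout_percent parameter are assumptions for illustration.

```python
import hashlib


def assign_prompt_version(user_id: str, task: str, rollout_percent: int = 50) -> str:
    """Deterministically assign a user to "v1" or "v2" for a given task."""
    digest = hashlib.sha256(f"{task}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # Stable bucket in the range 0-99
    return "v2" if bucket < rollout_percent else "v1"
```

Including the task name in the hash keeps experiments independent, so a user in the "v2" group for reservations is not automatically in "v2" everywhere. Setting rollout_percent to 0 acts as an instant rollback to "v1".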
5. Observability Is Non-Negotiable — Here's the Minimal Stack
Why most people get this wrong: They add print() statements or at best log to stdout. When the agent does something unexpected in production, they have no way to trace why.
The pattern: Log every LLM call with inputs, outputs, token usage, latency, and a correlation ID. Build dashboards for cost, latency, and failure rates.
import json
import time
import uuid
from typing import Optional

from openai import AsyncOpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register a tracer provider once at startup; add an exporter for your backend
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

# Placeholder flat rate; replace with your model's actual per-token pricing
COST_PER_TOKEN_USD = 0.00001


async def traced_llm_call(
    prompt: str,
    model: str = "gpt-4o",
    metadata: Optional[dict] = None,
):
    call_id = str(uuid.uuid4())
    start = time.perf_counter()
    # Keep the span name constant; the call ID goes in an attribute
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("llm.call_id", call_id)
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_tokens_estimate", len(prompt) // 4)
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            elapsed_ms = (time.perf_counter() - start) * 1000
            usage = response.usage
            cost_usd = usage.total_tokens * COST_PER_TOKEN_USD
            span.set_attribute("llm.completion_tokens", usage.completion_tokens)
            span.set_attribute("llm.prompt_tokens", usage.prompt_tokens)
            span.set_attribute("llm.latency_ms", elapsed_ms)
            span.set_attribute("llm.cost_usd", cost_usd)
            # Structured log for downstream analysis
            print(json.dumps({
                "event": "llm_call",
                "call_id": call_id,
                "model": model,
                "latency_ms": elapsed_ms,
                "tokens": usage.total_tokens,
                "cost_usd": cost_usd,
                "metadata": metadata,
            }))
            return response.choices[0].message.content
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            print(json.dumps({
                "event": "llm_error",
                "call_id": call_id,
                "error": str(e),
                "metadata": metadata,
            }))
            raise
Why it works: Without observability, you're flying blind. With structured tracing, you can debug failures, optimize costs, and prove ROI to stakeholders.
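Once calls are logged as JSON lines, the dashboard numbers (cost, latency, failure rate) are a small aggregation away. The sketch below assumes log lines with "event", "latency_ms", and "cost_usd" fields like the ones printed above; summarize_llm_logs and its output keys are hypothetical names.

```python
import json


def summarize_llm_logs(lines):
    """Aggregate JSON log lines into call count, failure rate, latency, and cost."""
    calls, errors = [], 0
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # Skip non-JSON output mixed into stdout
        if not isinstance(event, dict):
            continue
        if event.get("event") == "llm_call":
            calls.append(event)
        elif event.get("event") == "llm_error":
            errors += 1

    total = len(calls) + errors
    latencies = [c["latency_ms"] for c in calls]
    return {
        "total_calls": total,
        "failure_rate": errors / total if total else 0.0,
        "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
        "max_latency_ms": max(latencies) if latencies else 0.0,
        "total_cost_usd": sum(c["cost_usd"] for c in calls),
    }
```

In production this aggregation would usually live in your log pipeline or metrics backend; the point is that structured events make it a query rather than a grep.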
What's the Core Lesson?
If there's one takeaway from 12-factor-agents, it's this: the best production AI systems aren't the most "agentic" ones — they're the best-engineered ones.
Don't build a greenfield agent framework from scratch. Take the patterns that work (structured outputs, feedback loops, tiered memory, versioned prompts, observability) and embed them into your existing product.
The repo is a living document — it's already been updated with community contributions since its initial launch. If you're building LLM applications in production, it's worth reading cover to cover.
Links:
- 12-factor-agents on GitHub (19,788 ⭐)
- HN Discussion — 475 points, 78 comments
- Dex Horthy — Creator (HumanLayer)
- swyx's talk summary of the principles
Data sources:
- GitHub API: 19,788 stargazers, TypeScript, 1,498 forks (as of May 2026)
- HN Algolia: 475 points, 78 comments
- HN Algolia "vibe coding" query: 865+784+616+434+353+405 points across multiple viral threads (Simon Willison, Bram Cohen, Fast.ai)
- Google News AI tech trend: Agentic AI / LLM reliability in production