DEV Community

韩

12-factor Agents: 5 Hidden Production Patterns Nobody Tells You About (19K GitHub Stars)

If you've been building AI agents this year, here's something counterintuitive: the best "AI agents" in production aren't actually that agentic. They're mostly well-engineered software with LLMs sprinkled in at strategic points.

That's the core insight behind 12-factor-agents, a GitHub repo that has quietly accumulated 19,788 stars and 475 HN points in under 14 months. Created by Dex Horthy (dhorthy) of HumanLayer, it takes its inspiration from Heroku's Twelve-Factor App methodology and applies it to the messy reality of shipping LLM-powered software.

Most teams building with agent frameworks hit a wall at 70-80% reliability. The ones who break through? They don't build greenfield agent systems from scratch. They borrow small, modular patterns and embed them into their existing products.

So let's dig into 5 hidden production patterns from the 12-factor-agents repo that most developers completely overlook.


1. Treat Your LLM Calls Like Database Queries (Not Magic)

Why most people get this wrong: They treat every LLM interaction as a unique snowflake, stuffing infinite context and hoping for the best. This leads to wildly inconsistent outputs, astronomical token costs, and zero observability.

The pattern: Structure your LLM interactions like you would a database query — with clear inputs, deterministic transformations, and measurable outputs.

from langchain.prompts import PromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema

# Define schemas for your inputs and outputs
response_schemas = [
    ResponseSchema(name="action", description="The specific action to take"),
    ResponseSchema(name="confidence", description="Confidence score 0-1"),
    ResponseSchema(name="reasoning", description="Brief explanation"),
]
parser = StructuredOutputParser.from_response_schemas(response_schemas)

# Now your LLM call is like a typed database query
template = PromptTemplate(
    template="""Extract structured data from the user request.
{format_instructions}

Request: {query}
""",
    input_variables=["query"],
    partial_variables={"format_instructions": parser.get_format_instructions()}
)

# The format instructions steer the model toward parseable, schema-shaped
# output; run the reply through parser.parse() instead of regex-parsing free text

Why it works: When your LLM outputs are typed and structured, you can validate them, log them, and build error handling around them. No more unhandled json.JSONDecodeError crashes in production at 3am.
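
As a rough illustration of the validation step, the sketch below checks a model reply using only the standard library. The field names mirror the schema above; `AgentDecision` and `parse_agent_reply` are hypothetical names introduced here, not part of LangChain or the 12-factor-agents repo.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentDecision:
    action: str
    confidence: float
    reasoning: str


def parse_agent_reply(raw: str) -> AgentDecision:
    """Parse and validate a JSON reply; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    missing = {"action", "confidence", "reasoning"} - data.keys()
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")

    try:
        confidence = float(data["confidence"])
    except (TypeError, ValueError) as exc:
        raise ValueError("confidence must be numeric") from exc
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return AgentDecision(str(data["action"]), confidence, str(data["reasoning"]))
```

Because every failure surfaces as a single exception type, the caller can decide in one place whether to retry, fall back, or escalate.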

Data: This pattern is cited across the 12-factor-agents repo, with examples in the "structured outputs" principle.


2. Build Feedback Loops, Not One-Shot Agents

Why most people get this wrong: They build an agent that does ONE big thing in one shot. When it fails (and it will), there's no recovery mechanism.

The pattern: Decompose complex tasks into small, recoverable steps with explicit feedback signals between each step.

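import asyncio
from enum import Enum

class AgentState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    NEEDS_REVIEW = "needs_review"
    FAILED = "failed"
    COMPLETE = "complete"

class Step:
    def __init__(self, name: str, action_fn, review_fn=None):
        self.name = name
        self.action_fn = action_fn
        # Default review: any non-None result passes
        self.review_fn = review_fn or (lambda x: x is not None)
        self.result = None
        self.state = AgentState.PENDING

    async def execute(self, payload):
        """Run the action on the previous step's output, then review the result."""
        self.state = AgentState.RUNNING
        try:
            self.result = await self.action_fn(payload)
        except Exception:
            self.state = AgentState.FAILED
            raise

        if self.review_fn(self.result):
            self.state = AgentState.COMPLETE
        else:
            self.state = AgentState.NEEDS_REVIEW

        return self.result

# Build a pipeline of small, reviewable steps.
# classify_intent, fetch_relevant_docs, generate_answer, validate_output, and
# escalate_for_review are application-specific coroutines you supply.
steps = [
    Step("classify", classify_intent),
    Step("fetch_context", fetch_relevant_docs),
    Step("generate_response", generate_answer),
    Step("validate_quality", validate_output),
]

# Each step can fail independently, be retried, or escalate
async def run_pipeline(user_input):
    payload = user_input
    for step in steps:
        # Feed each step the output of the one before it
        payload = await step.execute(payload)
        if step.state == AgentState.NEEDS_REVIEW:
            # Route to human review or retry with modified prompt
            await escalate_for_review(step)
            break  # Don't let later steps build on an unreviewed result
    return payload

# Entry point, e.g.: asyncio.run(run_pipeline("I need to change my booking"))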

Why it works: When tasks are broken into small steps with explicit review points, you can: (1) catch failures early, (2) route edge cases to humans, and (3) collect training data from the review process.
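
The retry-or-escalate decision can be sketched as follows. This is a minimal sketch rather than code from the repo: `StepOutcome`, `run_with_retries`, and the `max_attempts` default are assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class StepOutcome:
    status: str   # "complete" or "needs_review"
    result: Any   # last result produced, or None if every attempt raised
    attempts: int


async def run_with_retries(
    action: Callable[[Any], Awaitable[Any]],
    review: Callable[[Any], bool],
    payload: Any,
    max_attempts: int = 3,
) -> StepOutcome:
    """Retry a step until its review passes; otherwise flag it for a human."""
    result = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = await action(payload)
        except Exception:
            # Treat an exception like a failed review and try again.
            continue
        if review(result):
            return StepOutcome("complete", result, attempt)
    # Every attempt failed: hand the last result to a human reviewer.
    return StepOutcome("needs_review", result, max_attempts)
```

Bounding the attempts keeps a misbehaving step from looping forever, and the `needs_review` outcomes are exactly the cases worth saving as review data.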

Data: The 12-factor-agents repo emphasizes "feedback loops" as a core principle, with the insight that production agents need human-in-the-loop checkpoints.


3. Memory Is Not Just Context Window — It's a System

Why most people get this wrong: They dump everything into the context window and call it "memory." This doesn't scale, doesn't persist across sessions, and costs a fortune in tokens.

The pattern: Implement a tiered memory system with working memory, episodic memory, and semantic memory.

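from dataclasses import dataclass

@dataclass
class MemoryEntry:
    content: str
    memory_type: str  # "working" | "episodic" | "semantic"
    importance: float  # 0-1
    timestamp: float
    key: str = ""  # Lookup key, used for semantic entries

class TieredMemory:
    def __init__(self, max_working=5, max_episodic=50):
        self.working = []    # Current conversation context
        self.episodic = []   # Recent sessions
        self.semantic = {}   # Long-term learned facts
        self.max_working = max_working
        self.max_episodic = max_episodic

    def store(self, entry: MemoryEntry):
        if entry.memory_type == "working":
            self.working.append(entry)
            if len(self.working) > self.max_working:
                # Compress older working memory into one episodic entry
                self._store_episodic(self._compress(self.working[:-1]))
                self.working = [self.working[-1]]
        elif entry.memory_type == "episodic":
            self._store_episodic(entry)
        elif entry.memory_type == "semantic":
            self.semantic[entry.key] = entry  # Vectorized storage in prod
        else:
            raise ValueError(f"unknown memory_type: {entry.memory_type}")

    def get_context(self, query: str) -> str:
        # Retrieve relevant memories based on query
        relevant = [
            *self.working[-3:],  # Recent working memory
            *self._search_episodic(query, top_k=3),
            *self._search_semantic(query, top_k=5),
        ]
        return "\n".join(m.content for m in relevant)

    def _store_episodic(self, entry):
        self.episodic.append(entry)
        # Drop the oldest entries once the cap is exceeded
        self.episodic = self.episodic[-self.max_episodic:]

    def _search_episodic(self, query, top_k):
        return self._rank(self.episodic, query, top_k)

    def _search_semantic(self, query, top_k):
        return self._rank(self.semantic.values(), query, top_k)

    def _rank(self, entries, query, top_k):
        # Placeholder ranking by shared words; use embeddings in prod
        words = set(query.lower().split())
        scored = [(len(words & set(e.content.lower().split())), e) for e in entries]
        hits = [pair for pair in scored if pair[0] > 0]
        hits.sort(key=lambda pair: pair[0], reverse=True)
        return [e for _, e in hits[:top_k]]

    def _compress(self, entries):
        # Placeholder summary; in prod, summarize the entries with an LLM
        summary = f"Previous session summary: {len(entries)} interactions"
        return MemoryEntry(content=summary, memory_type="episodic",
                          importance=0.5, timestamp=entries[0].timestamp)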

Why it works: Tiered memory lets you retain important information long-term while keeping your working context lean. It loosely mirrors how human memory is commonly described: a small working set backed by larger, slower stores.
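
To make "keeping your working context lean" concrete, here is a minimal sketch of budgeted context assembly. It is not from the repo: `Memory`, `build_context`, and the character budget (a crude stand-in for a token budget) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    content: str
    importance: float  # 0-1, higher is kept first


def build_context(memories: list[Memory], max_chars: int) -> str:
    """Keep the most important memories that fit within a character budget."""
    # Visit entries from most to least important.
    ranked = sorted(
        range(len(memories)),
        key=lambda i: memories[i].importance,
        reverse=True,
    )
    chosen: list[int] = []
    used = 0
    for i in ranked:
        cost = len(memories[i].content) + 1  # +1 for the joining newline
        if used + cost > max_chars:
            continue  # too big for the remaining budget; a smaller one may fit
        chosen.append(i)
        used += cost
    # Emit in original order so the prompt still reads chronologically.
    return "\n".join(memories[i].content for i in sorted(chosen))
```

Selecting by importance but emitting in original order is a deliberate choice here: the model sees the highest-value facts without the timeline being scrambled.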


4. Evolvability — Your Prompts Need Version Control

Why most people get this wrong: They edit prompts directly in code, never track changes, and have no way to A/B test or roll back when a prompt update breaks production.

The pattern: Treat prompts as code. Store them in versioned files, use a prompt registry, and deploy prompt changes through your CI/CD pipeline.

# prompts/reservation/v1.yaml
system: |
  You are a restaurant reservation assistant.
  Always confirm: date, time, party size, name, phone.
  If the requested time is unavailable, suggest alternatives.

preamble: |
  Today's date: {date}
  Restaurant hours: 11am-10pm daily

# prompts/reservation/v2.yaml  — A/B test
system: |
  You are a friendly restaurant reservation assistant.
  Be warm and conversational while collecting:
  - Date (YYYY-MM-DD format)
  - Time (HH:MM 24-hour format)
  - Party size (1-20 guests)
  - Guest name and phone number
  If unavailable, offer 2 nearby alternatives.

---

# Load prompts from registry (like feature flags)
# NOTE: prompt_registry, llm, and log_prompt_version stand in for your own
# registry module, LLM client, and logging helper.
from datetime import date

from prompt_registry import PromptRegistry

registry = PromptRegistry(
    backend="file",           # In prod: use feature flag service
    path="./prompts"
)

async def get_prompt(task: str, version: str = "current") -> str:
    return await registry.get(task, version=version)

# In your agent:
async def handle_reservation(user_input: str):
    prompt = await get_prompt("reservation", version="v2")  # A/B test
    response = await llm.generate(prompt.format(
        date=date.today().isoformat(),
        user_input=user_input
    ))
    # Record which prompt version produced this response
    await log_prompt_version("reservation", "v2", response)
    return response

Why it works: When prompts are versioned and deployed through CI/CD, you can roll back instantly, A/B test systematically, and trace any change to its author and commit with git blame.
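
For teams without a registry service, a file-backed version can be quite small. The sketch below is one possible shape, assuming plain-text templates laid out as `<root>/<task>/<version>.txt` and a `CURRENT` pointer file; the `FilePromptRegistry` class and that layout are invented for this example and differ from the YAML files shown earlier.

```python
from pathlib import Path


class FilePromptRegistry:
    """Loads prompt templates stored as <root>/<task>/<version>.txt."""

    def __init__(self, root: str):
        self.root = Path(root)

    def versions(self, task: str) -> list[str]:
        """List the versions available for a task, sorted by name."""
        return sorted(p.stem for p in (self.root / task).glob("*.txt"))

    def get(self, task: str, version: str = "current") -> str:
        """Return the template text for a task at a given version."""
        if version == "current":
            # Assumption: a CURRENT file names the live version, so a
            # rollback is a one-line change reviewed like any other commit.
            version = (self.root / task / "CURRENT").read_text().strip()
        path = self.root / task / f"{version}.txt"
        if not path.is_file():
            raise KeyError(f"unknown prompt {task}/{version}")
        return path.read_text()
```

Because the pointer file lives in the same repository as the templates, promoting or rolling back a prompt goes through the same review and CI checks as a code change.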


5. Observability Is Non-Negotiable — Here's the Minimal Stack

Why most people get this wrong: They add print() statements or at best log to stdout. When the agent does something unexpected in production, they have no way to trace why.

The pattern: Log every LLM call with inputs, outputs, token usage, latency, and a correlation ID. Build dashboards for cost, latency, and failure rates.

import json
import time
import uuid

from openai import AsyncOpenAI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

# Register a provider once at startup; attach a span processor and exporter
# for your tracing backend in production
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

client = AsyncOpenAI()  # Reads OPENAI_API_KEY from the environment

# Placeholder blended rate; substitute your model's actual pricing
COST_PER_TOKEN_USD = 0.00001

async def traced_llm_call(
    prompt: str,
    model: str = "gpt-4o",
    metadata: dict | None = None
):
    call_id = str(uuid.uuid4())
    start = time.perf_counter()

    # Keep the span name constant; the unique ID belongs in an attribute
    with tracer.start_as_current_span("llm_call") as span:
        span.set_attribute("llm.call_id", call_id)
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.prompt_tokens_estimate", len(prompt) // 4)

        try:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )

            elapsed_ms = (time.perf_counter() - start) * 1000
            cost_usd = response.usage.total_tokens * COST_PER_TOKEN_USD
            span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)
            span.set_attribute("llm.prompt_tokens", response.usage.prompt_tokens)
            span.set_attribute("llm.latency_ms", elapsed_ms)
            span.set_attribute("llm.cost_usd", cost_usd)

            # Structured log for downstream analysis
            print(json.dumps({
                "event": "llm_call",
                "call_id": call_id,
                "model": model,
                "latency_ms": elapsed_ms,
                "tokens": response.usage.total_tokens,
                "cost_usd": cost_usd,
                "metadata": metadata
            }))

            return response.choices[0].message.content

        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            print(json.dumps({
                "event": "llm_error",
                "call_id": call_id,
                "error": str(e),
                "metadata": metadata
            }))
            raise

Why it works: Without observability, you're flying blind. With structured tracing, you can debug failures, optimize costs, and prove ROI to stakeholders.
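
Once calls are logged as JSON lines, the dashboard numbers fall out of a small aggregation. The sketch below assumes log lines shaped like the `llm_call` and `llm_error` events above; `summarize_llm_logs` and its output layout are hypothetical, and a real deployment would more likely compute these in its metrics backend.

```python
import json
from collections import defaultdict


def summarize_llm_logs(lines):
    """Aggregate JSON log lines into per-model stats and an overall error rate."""
    per_model = defaultdict(
        lambda: {"calls": 0, "cost_usd": 0.0, "latency_ms_total": 0.0}
    )
    errors = 0
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON output mixed into stdout
        if not isinstance(event, dict):
            continue
        kind = event.get("event")
        if kind == "llm_error":
            errors += 1
        elif kind == "llm_call":
            entry = per_model[event.get("model", "unknown")]
            entry["calls"] += 1
            entry["cost_usd"] += float(event.get("cost_usd", 0.0))
            entry["latency_ms_total"] += float(event.get("latency_ms", 0.0))

    successes = sum(entry["calls"] for entry in per_model.values())
    attempts = successes + errors
    return {
        "per_model": {
            model: {
                "calls": entry["calls"],
                "cost_usd": entry["cost_usd"],
                "avg_latency_ms": entry["latency_ms_total"] / entry["calls"],
            }
            for model, entry in per_model.items()
        },
        "error_rate": errors / attempts if attempts else 0.0,
    }
```

Errors are counted globally rather than per model because the error event above does not carry a model field; adding one would allow a per-model failure rate.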


What's the Core Lesson?

If there's one takeaway from 12-factor-agents, it's this: the best production AI systems aren't the most "agentic" ones — they're the best-engineered ones.

Don't build a greenfield agent framework from scratch. Take the patterns that work (structured outputs, feedback loops, tiered memory, versioned prompts, observability) and embed them into your existing product.

The repo is a living document — it's already been updated with community contributions since its initial launch. If you're building LLM applications in production, it's worth reading cover to cover.

Data sources:

  • GitHub API: 19,788 stargazers, TypeScript, 1,498 forks (as of May 2026)
  • HN Algolia: 475 points, 78 comments
  • HN Algolia "vibe coding" query: 865+784+616+434+353+405 points across multiple viral threads (Simon Willison, Bram Cohen, Fast.ai)
  • Google News AI tech trend: Agentic AI / LLM reliability in production