<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bo Romir</title>
    <description>The latest articles on DEV Community by Bo Romir (@bo_romir).</description>
    <link>https://dev.to/bo_romir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3862867%2Fb4c1e920-d3b7-45d0-a417-b13acfc9b495.png</url>
      <title>DEV Community: Bo Romir</title>
      <link>https://dev.to/bo_romir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bo_romir"/>
    <language>en</language>
    <item>
      <title>Two Kinds of AI Agents (And Why You Need Both)</title>
      <dc:creator>Bo Romir</dc:creator>
      <pubDate>Sun, 05 Apr 2026 23:11:16 +0000</pubDate>
      <link>https://dev.to/bo_romir/two-kinds-of-ai-agents-and-why-you-need-both-3158</link>
      <guid>https://dev.to/bo_romir/two-kinds-of-ai-agents-and-why-you-need-both-3158</guid>
      <description>&lt;p&gt;&lt;em&gt;Persistent context agents vs. stateless decision functions. When to build which, how to evaluate each, and why the industry is conflating two fundamentally different problems.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Andrej Karpathy ended his 2025 Year in Review with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"I don't think the industry has realized anywhere near 10% of their potential even at present capability."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I think the reason is that we're treating "AI agent" as one category when it's actually two. And the evaluation strategies, architectures, and failure modes are completely different for each.&lt;/p&gt;




&lt;h2&gt;The Two Agent Archetypes&lt;/h2&gt;

&lt;h3&gt;Type 1: The Persistent Context Agent&lt;/h3&gt;

&lt;p&gt;This is an AI that lives alongside you. It has memory. It knows your codebase, your preferences, your ongoing projects, what happened yesterday. It accumulates context over time and uses that context to be more helpful.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Code running on your machine with access to your files, git history, and environment&lt;/li&gt;
&lt;li&gt;A Slack-integrated ops assistant that knows your team's systems, reads channels, tracks tasks across days&lt;/li&gt;
&lt;li&gt;Cursor, which maintains a model of your entire project and evolves its understanding as you code&lt;/li&gt;
&lt;li&gt;An executive assistant that manages your calendar, email, and knows your contacts and preferences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Karpathy described this paradigm in his 2025 review when discussing Claude Code:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"It runs on your computer and with your private environment, data and context... it's not just a website you go to like Google, it's a little spirit/ghost that 'lives' on your computer."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The key insight: the value of a Type 1 agent is proportional to the context it has accumulated. A fresh instance with no memory is dramatically less useful than one that has been running for weeks.&lt;/p&gt;

&lt;h3&gt;Type 2: The Stateless Decision Function&lt;/h3&gt;

&lt;p&gt;This is an AI that receives structured input, makes one decision, and returns. No memory. No ongoing context. No personality. It runs identically the thousandth time as the first.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A content moderation classifier: takes a post, returns APPROVE or FLAG&lt;/li&gt;
&lt;li&gt;A loan underwriting decision: takes applicant data, returns APPROVE/DENY with confidence&lt;/li&gt;
&lt;li&gt;A code review checker: takes a diff, returns a list of issues&lt;/li&gt;
&lt;li&gt;A fraud detection function: takes a transaction, returns LEGITIMATE or SUSPICIOUS&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Anthropic described this pattern in "Building Effective Agents":&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"For many applications, optimizing single LLM calls with retrieval and in-context examples is usually enough."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The key insight: the value of a Type 2 agent is proportional to its accuracy at volume. It doesn't need to remember anything. It needs to be right 97%+ of the time, and to return the same answer for the same input every time.&lt;/p&gt;
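&lt;p&gt;In code, the Type 2 shape is just a pure function: structured input in, one decision out, no state. A minimal sketch, with a trivial keyword rule standing in for the single LLM call (the terms and confidence values are illustrative, not a real moderation policy):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    label: str         # "APPROVE" or "FLAG"
    confidence: float  # 0.0 to 1.0

def moderate(post_text: str) -> Decision:
    """Stateless Type 2 decision: same input, same output, no memory.

    A production version would make a single LLM call here; this sketch
    uses a keyword rule as a stand-in for the model.
    """
    flagged_terms = {"scam", "giveaway"}
    words = set(post_text.lower().split())
    if words.intersection(flagged_terms):
        return Decision("FLAG", 0.9)
    return Decision("APPROVE", 0.8)
```

&lt;p&gt;The determinism is the point: because the function holds no state, calling it a million times is no different from calling it once, which is exactly what makes it testable at scale.&lt;/p&gt;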




&lt;h2&gt;Why This Distinction Matters&lt;/h2&gt;

&lt;p&gt;The industry is collapsing these into one concept called "agents," and that's causing real problems.&lt;/p&gt;

&lt;p&gt;When OpenAI announced their Agents SDK and Responses API, they described agents as "systems that independently accomplish tasks on behalf of users." That definition covers both types. But the evaluation strategies, failure modes, and architectural decisions are completely different:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Type 1: Persistent Context&lt;/th&gt;
&lt;th&gt;Type 2: Stateless Decision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Value source&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Accumulated context&lt;/td&gt;
&lt;td&gt;Per-decision accuracy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;State&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Massive, evolving&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Failure mode&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context drift, stale memory, wrong associations&lt;/td&gt;
&lt;td&gt;Classification error, hallucinated reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Evaluation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard. Qualitative + long-horizon.&lt;/td&gt;
&lt;td&gt;Easy. Ground truth comparison at scale.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improvement loop&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Human feedback, memory curation&lt;/td&gt;
&lt;td&gt;Automated tournament, A/B testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost profile&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High per-instance (long context, many calls)&lt;/td&gt;
&lt;td&gt;Low per-decision ($0.001-0.01 each)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One per user/team&lt;/td&gt;
&lt;td&gt;Millions of identical calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Always-on, stateful&lt;/td&gt;
&lt;td&gt;Stateless API endpoint&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The mistake most teams make: building a Type 1 agent when they need a Type 2 function, or vice versa.&lt;/p&gt;




&lt;h2&gt;Karpathy's Frameworks Applied&lt;/h2&gt;

&lt;h3&gt;The "Verifiability" Lens&lt;/h3&gt;

&lt;p&gt;Karpathy's "Verifiability" post (Nov 2025) gives us the sharpest tool for deciding between types:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Software 1.0 easily automates what you can specify. Software 2.0 easily automates what you can verify."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Type 2 agents thrive where you can verify.&lt;/strong&gt; The transaction was fraudulent or it wasn't. The code compiles or it doesn't. The answer is right or it's wrong. Verification enables optimization: you can build a harness, run 300,000 evaluations, and find the prompt that maximizes your business-weighted scoring function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 1 agents operate where you can't fully verify.&lt;/strong&gt; Was that Slack summary helpful? Did the assistant make the right judgment call about which emails to flag? Was the memory it retained actually the important part of yesterday's meeting? These are judgment calls, not classification tasks. You can't score them with a simple function.&lt;/p&gt;

&lt;p&gt;This connects to Karpathy's insight about RLVR (Reinforcement Learning from Verifiable Rewards):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"By training LLMs against automatically verifiable rewards... the LLMs spontaneously develop strategies that look like 'reasoning' to humans."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Type 2 agents benefit directly from verifiability: you can build your own mini-RLVR at the prompt level using a tournament harness. Type 1 agents can't, because their value emerges over time from accumulated context, not from any single verifiable output.&lt;/p&gt;

&lt;h3&gt;The "Jagged Intelligence" Lens&lt;/h3&gt;

&lt;p&gt;Karpathy's "Ghosts vs Animals" framing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"LLMs display amusingly jagged performance characteristics - they are at the same time a genius polymath and a confused and cognitively challenged grade schooler."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This jaggedness manifests differently in each type:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 2:&lt;/strong&gt; Jaggedness shows up as subclass failures. The model correctly classifies 99% of one category but only 60% of another edge case category. You can find these jagged edges through automated testing and smooth them with better context data.&lt;/p&gt;
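&lt;p&gt;Finding those jagged edges is mechanical once you break accuracy down per category. A small sketch (the tuple shape is an assumption, not a standard eval format):&lt;/p&gt;

```python
from collections import defaultdict

def subclass_accuracy(examples):
    """Break overall accuracy down by category to expose jagged edges.

    `examples` is a list of (category, predicted, actual) tuples.
    Returns a dict mapping each category to its accuracy.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, predicted, actual in examples:
        totals[category] += 1
        if predicted == actual:
            hits[category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```

&lt;p&gt;An aggregate 99% can hide a 60% subclass; the per-category breakdown is what tells you where to add context data.&lt;/p&gt;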

&lt;p&gt;&lt;strong&gt;Type 1:&lt;/strong&gt; Jaggedness shows up as unpredictable context failures. The agent remembers a detail from three weeks ago but forgets something from yesterday. It handles one person's communication style brilliantly and completely misreads another. You can't find these edges through automated testing because they depend on the accumulated state.&lt;/p&gt;

&lt;h3&gt;The "Context Engineering" Lens&lt;/h3&gt;

&lt;p&gt;From Karpathy's YC talk on the new LLM app layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"LLM apps like Cursor bundle and orchestrate LLM calls for specific verticals. They do the 'context engineering.'"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Both types need context engineering, but it means something completely different for each:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 2 context engineering:&lt;/strong&gt; What structured data do you put in the context window to maximize decision accuracy? Historical outcomes, similar past cases, entity-specific patterns. This is optimizable: you can A/B test different context configurations and measure which one improves accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Type 1 context engineering:&lt;/strong&gt; What does the agent need to remember? How does it decide what's important? How does it structure its memory? How does it retrieve the right context for the current moment? This is closer to cognitive architecture than optimization. It's about building good systems for information management, not about finding the optimal input for a single call.&lt;/p&gt;




&lt;h2&gt;How to Evaluate Each Type&lt;/h2&gt;

&lt;h3&gt;Evaluating Type 2: The Harness&lt;/h3&gt;

&lt;p&gt;For Type 2 agents, evaluation is a solved problem (at least in principle). You need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Labeled ground truth data&lt;/strong&gt; (&amp;gt;1,000 examples, ideally &amp;gt;5,000)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A scoring function&lt;/strong&gt; that reflects business value (dollar-weighted, not just accuracy)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A tournament loop&lt;/strong&gt; that tests prompt variants against the data&lt;/li&gt;
&lt;/ol&gt;
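&lt;p&gt;The three ingredients above compose into a short loop. A minimal sketch of a tournament harness, where &lt;code&gt;decide&lt;/code&gt; and &lt;code&gt;score&lt;/code&gt; are placeholders you supply (the names are illustrative, not a real API):&lt;/p&gt;

```python
def run_tournament(prompt_variants, dataset, decide, score):
    """Score every prompt variant against labeled ground truth.

    decide(prompt, example) returns a predicted label; score(predicted,
    example) returns the business-weighted value of that decision.
    Returns the (name, mean_score) pair of the winning variant.
    """
    results = {}
    for name, prompt in prompt_variants.items():
        total = sum(score(decide(prompt, ex), ex) for ex in dataset)
        results[name] = total / len(dataset)
    # Highest mean business-weighted score wins the tournament.
    return max(results.items(), key=lambda item: item[1])
```

&lt;p&gt;Everything interesting lives inside &lt;code&gt;score&lt;/code&gt;: swap in a dollar-weighted function and the same loop optimizes for business value instead of raw accuracy.&lt;/p&gt;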

&lt;p&gt;Jason Wei (OpenAI):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"It's critical to have a single-number metric."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For Type 2, this works. Your single number is the business-weighted score. Run the tournament. Ship the winner.&lt;/p&gt;

&lt;p&gt;The tooling landscape:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data volume&lt;/th&gt;
&lt;th&gt;Error symmetry&lt;/th&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&amp;lt;1,000 examples&lt;/td&gt;
&lt;td&gt;Symmetric&lt;/td&gt;
&lt;td&gt;Promptfoo, DSPy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;1,000&lt;/td&gt;
&lt;td&gt;Symmetric&lt;/td&gt;
&lt;td&gt;DSPy with custom metric&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&amp;gt;1,000&lt;/td&gt;
&lt;td&gt;Dollar-weighted&lt;/td&gt;
&lt;td&gt;Custom harness&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DSPy (Stanford) is particularly interesting here. From the paper:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"DSPy modules are parameterized, meaning they can learn how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It automates prompt search using typed signatures. If your task fits cleanly into a typed input/output contract and standard metrics suffice, DSPy is likely all you need.&lt;/p&gt;

&lt;p&gt;But the moment your scoring involves dollars, domain-specific weights, or subclass analysis, you're building custom. No off-the-shelf tool handles "this $5,000 error is 500x worse than this $10 error" natively.&lt;/p&gt;
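&lt;p&gt;The custom part is usually small. A sketch of a dollar-weighted scorer (the sign convention and amounts are illustrative assumptions):&lt;/p&gt;

```python
def dollar_weighted_score(decisions):
    """Score decisions by dollar impact rather than raw accuracy.

    Each decision is a (correct, amount) pair: a wrong call on a
    $5,000 case costs 500x more than a wrong call on a $10 case,
    which plain accuracy treats as identical.
    """
    saved = sum(amount for correct, amount in decisions if correct)
    lost = sum(amount for correct, amount in decisions if not correct)
    return saved - lost
```

&lt;p&gt;Two prompt variants with identical accuracy can diverge wildly on this metric, which is why accuracy alone is the wrong single number for dollar-asymmetric tasks.&lt;/p&gt;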

&lt;h3&gt;Evaluating Type 1: The Hard Problem&lt;/h3&gt;

&lt;p&gt;Type 1 evaluation is genuinely unsolved. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The evaluation window is too long.&lt;/strong&gt; The value of a Type 1 agent emerges over days and weeks. A single-turn eval tells you almost nothing. You need to evaluate trajectories, not outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The state space is enormous.&lt;/strong&gt; The agent's behavior depends on everything it has seen and remembered. Two identical inputs can produce different (and both correct) outputs depending on accumulated context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ground truth is subjective.&lt;/strong&gt; Was that a good summary? Was that the right thing to remember? Was that an appropriate time to interrupt the user? These are judgment calls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The failure modes are subtle.&lt;/strong&gt; A Type 1 agent doesn't fail by producing a wrong classification. It fails by slowly drifting: accumulating stale context, developing incorrect associations, missing important signals while retaining unimportant ones.&lt;/p&gt;

&lt;p&gt;What works (imperfectly):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human evaluation at checkpoints.&lt;/strong&gt; Periodically review the agent's memory, decisions, and outputs. Does it still have the right context? Is it making good judgment calls?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression testing on known scenarios.&lt;/strong&gt; Feed the agent known situations and check if its behavior is reasonable. This doesn't test accumulated context but catches gross regressions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-consistency checks.&lt;/strong&gt; Does the agent contradict itself? Does its memory match reality?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User behavior signals.&lt;/strong&gt; Are users correcting the agent more often? Asking the same question twice? Overriding its suggestions? These are implicit failure signals.&lt;/li&gt;
&lt;/ul&gt;
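&lt;p&gt;The last signal is the easiest to instrument. A sketch of a rolling correction-rate monitor; the window size and drift threshold are illustrative assumptions, not recommendations:&lt;/p&gt;

```python
from collections import deque

class CorrectionRateMonitor:
    """Track an implicit failure signal: how often users override the agent.

    A rising correction rate over the recent window is an early warning
    of context drift, long before any single output looks wrong.
    """
    def __init__(self, window=100, threshold=0.2):
        self.events = deque(maxlen=window)  # 1 = corrected, 0 = accepted
        self.threshold = threshold

    def record(self, user_corrected):
        self.events.append(1 if user_corrected else 0)

    def correction_rate(self):
        if not self.events:
            return 0.0
        return sum(self.events) / len(self.events)

    def drifting(self):
        return self.correction_rate() > self.threshold
```

&lt;p&gt;This doesn't tell you &lt;em&gt;what&lt;/em&gt; drifted, only &lt;em&gt;that&lt;/em&gt; something did; it's a trigger for the human checkpoint review, not a replacement for it.&lt;/p&gt;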

&lt;p&gt;Hamel Husain's observation applies differently here:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Success with AI hinges on how fast you can iterate."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For Type 2, iteration means running more prompt variants through the harness. For Type 1, iteration means watching the agent operate in the real world and tuning its memory architecture, retrieval logic, and decision boundaries based on observed failures.&lt;/p&gt;




&lt;h2&gt;The Complementary Architecture&lt;/h2&gt;

&lt;p&gt;The real insight: production systems usually need both types working together.&lt;/p&gt;

&lt;p&gt;Consider a fintech company processing loan applications. They need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Type 1 agent&lt;/strong&gt; that lives in their Slack, reads channels, tracks ongoing customer escalations, knows the team's history, understands each analyst's expertise, and can answer "what's the status of Account X?" based on weeks of accumulated context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Type 2 agent&lt;/strong&gt; that processes 10,000 loan applications per day, making APPROVE/DENY/REVIEW decisions at 97% accuracy with a dollar-weighted scoring function.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Type 1 agent can't do the Type 2 task at the required accuracy and volume. It doesn't need to be 97% right on a narrow classification. It needs to be broadly helpful with deep context.&lt;/p&gt;

&lt;p&gt;The Type 2 agent can't do the Type 1 task at all. It has no memory, no ongoing context, no understanding of relationships or history.&lt;/p&gt;

&lt;p&gt;But they can feed each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Type 1 agent notices a pattern in team chats ("three analysts mentioned the same fraud vector this week") and flags it for investigation&lt;/li&gt;
&lt;li&gt;The Type 2 agent's accuracy data reveals that applications from region Y have a 40% false positive rate, suggesting a rule adjustment&lt;/li&gt;
&lt;li&gt;The Type 1 agent routes that insight to the right person with full context about why it matters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Dario Amodei's "complementary factors to intelligence" framework is useful here. Each agent type provides a factor that's complementary to the other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type 2 provides &lt;strong&gt;reliability at scale&lt;/strong&gt; (the Type 1 agent's weakness)&lt;/li&gt;
&lt;li&gt;Type 1 provides &lt;strong&gt;accumulated context and judgment&lt;/strong&gt; (which the Type 2 agent can't provide at all)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Karpathy Corollary&lt;/h2&gt;

&lt;p&gt;Karpathy's "Power to the People" essay (Apr 2025) makes a point about LLM impact that maps onto this framework:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"LLMs offer a very specific profile of capability - that of merely quasi-expert knowledge/performance, but simultaneously across a very wide variety of domains."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type 1 agents&lt;/strong&gt; exploit the breadth. They're the "quasi-expert across many domains" that helps a single person or team operate across more surface area than they could alone. Breadth + context = a really good generalist assistant.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type 2 agents&lt;/strong&gt; exploit the depth. On a narrow, well-defined task with good context data, they're not quasi-expert. They're better than the median human expert because they have access to historical patterns that no individual human can hold in memory. Depth + data = a specialist that outperforms.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;His observation from "Animals vs Ghosts" also applies:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Animals experience pressure for a lot more 'general' intelligence because of the highly multi-task and even actively adversarial multi-agent self-play environments they are min-max optimized within."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Type 1 agents are more animal-like: they need general intelligence across many tasks, adaptive behavior, social awareness. Type 2 agents are more ghost-like: spiky, narrow, optimized hard on a single verifiable dimension.&lt;/p&gt;




&lt;h2&gt;The Decision Framework&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;START
  |
  v
Does the task require memory across sessions?
Does value accumulate over time?
Is the task multi-domain and judgment-heavy?
  |
  YES to any --&amp;gt; TYPE 1: Persistent Context Agent
  |               Focus on: memory architecture, retrieval,
  |               context management, qualitative eval
  |
  NO to all
  |
  v
Is the task verifiable? (clear right/wrong)
Is it high volume? (&amp;gt;500/day)
Can errors be quantified in dollars?
  |
  YES to all --&amp;gt; TYPE 2: Stateless Decision Function
  |               Focus on: tournament harness, ground truth
  |               data, dollar-weighted scoring, A/B testing
  |
  YES to some --&amp;gt; TYPE 2 with simpler eval
  |               (DSPy, Promptfoo, standard metrics)
  |
  NO to all --&amp;gt; TYPE 1 (probably)
                or reconsider if you need an agent at all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
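&lt;p&gt;The same flowchart can be sketched as a plain function; the boolean questions mirror the diagram, and the returned strings are labels for this article's categories, not a real API:&lt;/p&gt;

```python
def choose_agent_type(needs_memory, value_accumulates, judgment_heavy,
                      verifiable, high_volume, dollar_quantifiable):
    """Encode the decision flowchart: Type 1 questions first, then Type 2."""
    # YES to any Type 1 question means persistent context.
    if needs_memory or value_accumulates or judgment_heavy:
        return "TYPE_1"
    criteria = [verifiable, high_volume, dollar_quantifiable]
    if all(criteria):
        return "TYPE_2_CUSTOM_HARNESS"   # tournament + dollar-weighted scoring
    if any(criteria):
        return "TYPE_2_SIMPLE_EVAL"      # DSPy, Promptfoo, standard metrics
    return "TYPE_1_OR_NO_AGENT"          # reconsider whether you need an agent
```

&lt;p&gt;Note the asymmetry: any single Type 1 signal routes you to Type 1, but Type 2 with a custom harness demands all three conditions hold.&lt;/p&gt;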



&lt;p&gt;Anthropic's advice remains the sanity check:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"We recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;What's Missing From the Ecosystem&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Type 1 eval frameworks.&lt;/strong&gt; Almost nonexistent. We have rich tooling for Type 2 (ground truth comparison) and almost nothing for evaluating persistent context agents over long horizons.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory architecture patterns.&lt;/strong&gt; Type 1 agents need standardized approaches to: what to remember, when to forget, how to retrieve, how to update. Right now every team reinvents this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Hybrid orchestration.&lt;/strong&gt; Systems that let a Type 1 agent delegate to Type 2 functions seamlessly. The persistent agent handles context and judgment; the stateless function handles high-volume decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context search optimization.&lt;/strong&gt; For Type 2, the biggest gains come from optimizing what data goes in the context window (not the prompt text). DSPy optimizes prompts beautifully. Nobody has built the equivalent for context window configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost-aware deployment patterns.&lt;/strong&gt; Type 1 agents are expensive per-instance (always-on, long context). Type 2 agents are cheap per-call. Understanding the cost structure determines what to automate and what to leave to humans.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;The Bottom Line&lt;/h2&gt;

&lt;p&gt;The industry keeps asking "how do I build an AI agent?" as if it's one question.&lt;/p&gt;

&lt;p&gt;It's two questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How do I build a reliable, testable decision function that can process thousands of identical decisions per day?&lt;/strong&gt; Answer: a stateless Type 2 agent with a tournament eval harness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;How do I build an AI that accumulates context, exercises judgment, and gets more useful over time?&lt;/strong&gt; Answer: a persistent Type 1 agent with memory architecture and qualitative evaluation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They're different architectures, different eval strategies, different failure modes, different cost structures.&lt;/p&gt;

&lt;p&gt;You probably need both. Know which one you're building.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Bo Romir writes about applied AI architecture and the infrastructure decisions that determine whether AI systems actually work in production.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://karpathy.bearblog.dev/year-in-review-2025/" rel="noopener noreferrer"&gt;"2025 LLM Year in Review"&lt;/a&gt; (2025)&lt;/li&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://karpathy.bearblog.dev/verifiability/" rel="noopener noreferrer"&gt;"Verifiability"&lt;/a&gt; (2025)&lt;/li&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://karpathy.bearblog.dev/animals-vs-ghosts/" rel="noopener noreferrer"&gt;"Animals vs Ghosts"&lt;/a&gt; (2025)&lt;/li&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://karpathy.bearblog.dev/the-space-of-minds/" rel="noopener noreferrer"&gt;"The Space of Minds"&lt;/a&gt; (2025)&lt;/li&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://karpathy.bearblog.dev/power-to-the-people/" rel="noopener noreferrer"&gt;"Power to the People"&lt;/a&gt; (2025)&lt;/li&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://karpathy.medium.com/software-2-0-a64152b37c35" rel="noopener noreferrer"&gt;"Software 2.0"&lt;/a&gt; (2017)&lt;/li&gt;
&lt;li&gt;Karpathy, A. &lt;a href="https://www.youtube.com/watch?v=LCEmiRjPEtQ" rel="noopener noreferrer"&gt;"Software Is Changing (Again)" - YC Talk&lt;/a&gt; (2025)&lt;/li&gt;
&lt;li&gt;Amodei, D. &lt;a href="https://darioamodei.com/essay/machines-of-loving-grace" rel="noopener noreferrer"&gt;"Machines of Loving Grace"&lt;/a&gt; (2024)&lt;/li&gt;
&lt;li&gt;Anthropic. &lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;"Building Effective Agents"&lt;/a&gt; (2024)&lt;/li&gt;
&lt;li&gt;Husain, H. &lt;a href="https://hamel.dev/blog/posts/evals/" rel="noopener noreferrer"&gt;"Your AI Product Needs Evals"&lt;/a&gt; (2024)&lt;/li&gt;
&lt;li&gt;Wei, J. &lt;a href="https://www.jasonwei.net/blog/evals" rel="noopener noreferrer"&gt;"Successful Language Model Evals"&lt;/a&gt; (2024)&lt;/li&gt;
&lt;li&gt;Khattab, O. et al. &lt;a href="https://arxiv.org/abs/2310.03714" rel="noopener noreferrer"&gt;"DSPy"&lt;/a&gt; (Stanford, 2023)&lt;/li&gt;
&lt;li&gt;OpenAI. &lt;a href="https://openai.com/index/new-tools-for-building-agents/" rel="noopener noreferrer"&gt;"New Tools for Building Agents"&lt;/a&gt; (2025)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>agents</category>
    </item>
  </channel>
</rss>
