Building Scalable AI Agent Systems: Three Evolutions

I. December 2025

We needed to add a new feature to atypica.AI: group discussions (discussionChat).

This should've been simple. We already had interviewChat—one-on-one conversations where users deeply engage with AI-simulated personas. Group discussion was just scaling from 1-to-1 to 1-to-many: 3-8 personas engaging simultaneously, so users could watch perspectives collide and insights emerge.

In theory, we just needed to:

  1. Reuse the interview logic
  2. Adjust prompts to simulate group dynamics
  3. Tweak the UI to show multiple speakers

The reality: We had to modify 12 files.

prisma/schema.prisma          # New Discussion table
src/ai/tools/discussionChat.ts  # New tool
src/ai/tools/saveDiscussion.ts   # Save tool
src/app/(study)/agents/studyAgent.ts     # Add tool to agent
src/app/(study)/agents/fastInsightAgent.ts  # Add again
src/app/(study)/agents/productRnDAgent.ts   # And again
... 6 more files

Worse, we discovered this:

// studyAgentRequest.ts (493 lines)
export async function studyAgentRequest(context) {
  const result = await streamText({
    model: llm("claude-sonnet-4"),
    system: studySystem(),
    messages,
    tools: {
      webSearch,
      interview,
      scoutTask,
      saveAnalyst,
      generateReport
      // ... 15 tools
    },
    onStepFinish: async (step) => {
      // Save messages
      // Track tokens
      // Send notifications
      // ... 120 lines of logic
    }
  });
}

// fastInsightAgentRequest.ts (416 lines)
// 95% identical code

// productRnDAgentRequest.ts (302 lines)
// 95% identical code

Three nearly identical agent wrappers.
Every new feature required copy-pasting across all three.
Every bug fix meant changing it three times.

In that moment, we realized something was fundamentally wrong.

The problem wasn't that our code lacked elegance.
It wasn't that we lacked abstraction.
It was that we were building AI Agent systems with traditional software engineering thinking.
This article chronicles how we escaped this trap—through three architectural evolutions, rethinking how AI Agents should be built from first principles.


II. Rethinking: What is an AI Agent?

Before refactoring, we stopped to ask a fundamental question:

What's the essential difference between AI Agents and traditional software?

The World of Traditional Software

Traditional software is built on state machines:

class ResearchSession {
  state: 'IDLE' | 'PLANNING' | 'RESEARCHING' | 'REPORTING';
  data: {
    interviews: Interview[];
    findings: Finding[];
    reports: Report[];
  };

  transition(event: Event) {
    switch (this.state) {
      case 'IDLE':
        if (event.type === 'START') this.state = 'PLANNING';
        break;
      case 'PLANNING':
        if (event.type === 'PLAN_COMPLETE') this.state = 'RESEARCHING';
        break;
      // ... more state transitions
    }
  }
}

This model's core assumptions:

  • State is explicit: I know exactly where I am
  • Transitions are deterministic: Given state + event, next state is unique
  • Control is precise: if-else covers all paths

This works beautifully for traditional software. But for AI Agents?

The World of AI Agents

LLMs don't work this way:

const messages = [
  { role: 'user', content: "Want to understand young people's coffee preferences" },
  { role: 'assistant', content: 'I can help you conduct user research...' },
  { role: 'assistant', toolCalls: [{ name: 'scoutTask', args: {...} }] },
  { role: 'tool', content: 'Observed 5 user segments...' },
  { role: 'assistant', content: 'Based on observations, I suggest interviewing 18-25 coffee enthusiasts...' },
  { role: 'assistant', toolCalls: [{ name: 'interviewChat', args: {...} }] },
  // ...
];

Where's the "state" here?

  • Not in a state field
  • But in the entire conversation history

The AI infers from conversation history:

  • What research does the user want?
  • How far have we progressed?
  • What should happen next?

This is a completely different paradigm.
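
To make "conversation as state" concrete, here is a minimal sketch (not atypica's actual code) that asks the model to infer the current stage directly from the message history, using the AI SDK's generateObject with a zod schema; the llm() helper and the stage names are assumptions borrowed from this post.

import { generateObject } from "ai";
import { z } from "zod";

// Sketch: instead of reading a `state` field, ask the model to infer the
// current stage from the conversation history itself.
async function inferResearchStage(messages) {
  const { object } = await generateObject({
    model: llm("claude-sonnet-4"),  // llm() helper as used elsewhere in this post
    schema: z.object({
      stage: z.enum(["planning", "researching", "reporting"]),
      nextAction: z.string(),
    }),
    messages: [
      ...messages,
      { role: "user", content: "Summarize where this research currently stands." },
    ],
  });
  return object;  // e.g. { stage: "researching", nextAction: "run two more interviews" }
}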

Three Core Insights

From this observation, we derived three insights that shaped our architectural evolution.

Insight 1: Conversation as State

Traditional approach: Maintain explicit state

// ❌ Traditional: Explicit state management
interface ResearchState {
  stage: 'planning' | 'researching' | 'reporting';
  completedInterviews: number;
  pendingTasks: Task[];
}

// Need synchronization: state and conversation history can diverge

AI-native approach: Infer state from conversation

// ✅ AI-native: Conversation is state
const messages = [...conversationHistory];

// AI infers state from history, no explicit sync needed
const result = await streamText({
  messages,
  // AI knows what to do
});

Why is conversation superior to state machines?

  1. Natural alignment: LLMs work on message history natively
  2. Strong fault tolerance: A state machine is hard to recover after an error; a conversation can simply be "rewound" and replayed (see the sketch below)
  3. Easy extension: Adding new capabilities doesn't require modifying state graphs
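
A rough sketch of what rewind-and-replay can look like, assuming messages are persisted in order; the truncation point and helper names here are illustrative, not atypica's API:

import { streamText } from "ai";

// Recover from a failed step by rewinding the conversation, not by repairing a state machine
async function retryFromLastUserMessage(allMessages, tools) {
  // Drop everything after the last user turn (including the failed tool call)
  const lastUserIndex = allMessages.map((m) => m.role).lastIndexOf("user");
  const rewound = allMessages.slice(0, lastUserIndex + 1);

  // Replay: the model re-infers state from the truncated history and tries again
  return streamText({
    model: llm("claude-sonnet-4"),
    messages: rewound,
    tools,
  });
}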

Insight 2: Reasoning-Execution Separation

How humans make decisions:

  1. Understand intent: "What am I trying to achieve?" → Clarify goals
  2. Choose method: "How do I do it?" → Execution steps

AI Agents should follow the same pattern:

// Plan Mode: Understanding intent
"User says: want to understand young people's coffee preferences"
   Analyze: needs qualitative research
   Decide: use group discussion method
   Output: complete research plan

// Study Agent: Executing plan
"Received research plan"
   Call discussionChat
   Analyze discussion results
   Generate insights report

Why separate?

  • Reasoning needs deep thinking (use Claude Sonnet 4)
  • Execution needs fast response (can use smaller models)
  • Separation of concerns, single responsibility
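
The separation also lets each side use a model suited to its job. A minimal sketch, where the planning/execution split and the smaller model name are assumptions for illustration:

import { generateText, streamText } from "ai";

// Reasoning: a slower, deeper model produces the plan
const plan = await generateText({
  model: llm("claude-sonnet-4"),
  system: "You are a research planner. Output a complete research plan.",
  messages,
});

// Execution: the plan is just another message; a faster model carries it out
const execution = streamText({
  model: llm("claude-haiku-4-5"),  // assumed smaller/faster model
  messages: [...messages, { role: "assistant", content: plan.text }],
  tools: { discussionChat, generateReport },  // tools from this post
});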

Insight 3: Simple Over Precise

Facing the "AI forgetfulness" problem, we could:

Option A: Vector DB + Semantic Search

// Precise matching of relevant memories
const query_embedding = await embed(user_message);
const relevant_memories = await vectorDB.search(query_embedding, { topK: 5 });
  • ✅ Precise retrieval
  • ❌ Requires embedding, indexing, complex queries
  • ❌ High maintenance cost

Option B: Markdown Files + Full Loading

// Simple and transparent
const memory = await readFile(`memories/${userId}.md`);
const messages = [
  { role: 'user', content: `<UserMemory>\n${memory}\n</UserMemory>` },
  ...conversationMessages
];
  • ✅ Simple, transparent, user-editable
  • ✅ Leverages large context windows (Claude 200K tokens)
  • ✅ Easier to debug and understand

We chose Option B.

Why?

  1. Context windows changed the game: user memory is typically under 10K, so full loading is perfectly viable
  2. Simple solutions are more reliable: No embedding inconsistency, no retrieval failures
  3. User control: Memory is transparent, users can view and edit

Four Design Principles

From these three insights, we distilled the core principles of our architecture:

1. Messages as Source of Truth

  • All important information lives in messages
  • Database only stores derived state (like reports, study logs)
  • Similar to Event Sourcing: messages are the event log

2. Configuration over Code

  • Use configuration to express differences
  • Use code to express commonalities
  • Avoid over-abstraction

3. AI as State Manager

  • Let AI manage state transitions
  • Don't hand-write complex state machines
  • Adapt to LLM's capability boundaries

4. Simple, Transparent, Controllable

  • Simple beats complex
  • Transparent beats black box
  • User control beats AI automation

III. Step 1: Message-Driven Architecture

v2.2.0 - 2025-12-27

Problem: Dual Source of Truth

Initially, research data was scattered across three places:

// Place 1: analyst table
const analyst = await prisma.analyst.findUnique({
  where: { id }
});
console.log(analyst.studySummary);  // "Research summary..."

// Place 2: interviews table
const interviews = await prisma.interview.findMany({
  where: { analystId: id }
});
console.log(interviews.map(i => i.conclusion));  // ["Interview 1 conclusion", "Interview 2 conclusion"]

// Place 3: messages table
const messages = await prisma.chatMessage.findMany({
  where: { userChatId }
});
// webSearch results are here

Generating reports required stitching from three places:

async function generateReport(analystId) {
  const analyst = await prisma.analyst.findUnique({
    where: { id: analystId },
    include: { interviews: true }  // JOIN!
  });

  const messages = await prisma.chatMessage.findMany({
    where: { userChatId: analyst.studyUserChatId }
  });

  // Stitch data together
  const reportData = {
    summary: analyst.studySummary,              // from analyst table
    interviewInsights: analyst.interviews.map(...),  // from interviews table
    webResearch: extractFromMessages(messages)    // from messages table
  };
}

Problems:

  1. Data inconsistency: interviews.conclusion and the interview content in messages could diverge
  2. Partial failures: when a tool call fails, data is half-saved and the full context is hard to trace
  3. Hard to extend: adding discussionChat requires a new table, a new tool, and new queries

Even worse, tool outputs were inconsistent:

// interviewChat: content in DB, returns reference
{
  toolName: 'interviewChat',
  output: { interviewId: 123 }  // Need another DB query
}

// scoutTaskChat: content in return value
{
  toolName: 'scoutTaskChat',
  output: {
    plainText: "Observation results...",  // Content directly returned
    insights: [...]
  }
}

Agents couldn't handle this uniformly, leading to complex code.

Solution: Messages as Single Source

Core idea: All research content flows into the message stream. Database only stores derived state.

// ✅ New architecture: Unified output format
interface ResearchToolResult {
  plainText: string;  // Human-readable summary, required
  [key: string]: any; // Optional structured data
}

// interviewChat also returns plainText
{
  toolName: 'interviewChat',
  output: {
    plainText: "Interview summary: User Zhang San mentioned...",  // ← Full content here
    interviewId: 123  // Optional: DB reference
  }
}

Key changes:

  1. Removed 5 specialized save tools

    • Deleted: saveInterview, saveDiscussion, saveScoutTask, ...
    • Reason: Agents output directly to messages, no explicit save needed
  2. Unified tool output format

    • All research tools return plainText
    • Agents can uniformly process all tool results
  3. Generate studyLog on demand

   // Don't pre-save, generate when needed
   if (!analyst.studyLog) {
     const messages = await loadMessages(studyUserChatId);
     const studyLog = await generateStudyLog(messages);  // ← Generate from messages
     await prisma.analyst.update({
       where: { id },
       data: { studyLog }
     });
   }
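
With the unified format, generateReport no longer stitches tables together; it just gathers plainText from tool results already in the message stream. A minimal sketch that operates on already-extracted outputs (how they are pulled out of the SDK's message objects is omitted):

// ResearchToolResult is the unified interface defined above (plainText required)
function buildReportContext(toolOutputs: ResearchToolResult[]): string {
  // Assemble report context by concatenating tool outputs in conversation order
  return toolOutputs.map((o) => o.plainText).join("\n\n");
}

// e.g. const reportContext = buildReportContext(extractToolOutputs(messages));
// extractToolOutputs: hypothetical helper that pulls tool results out of the message history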

Why This Design?

Reasoning from first principles:

  1. Conversation as context

    • LLMs need complete context to generate reports
    • Message history is naturally the most complete, most natural context
    • Avoids complexity of "reconstructing context from DB"
  2. LLMs excel at extraction

    • Generating structured content (studyLog) from conversations is LLM's strength
    • More flexible and reliable than hand-written parsing logic
  3. Echoes of Event Sourcing

    • Message sequence = event log
    • studyLog, report = derived state
    • Can be replayed and regenerated anytime

Comparison with other approaches:

| Approach | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Messages as source | Data consistent, easy to extend | Requires extra LLM call to generate studyLog | ✅ Our choice |
| Traditional state management | Precise control | Complex state sync, hard to trace | Doesn't suit LLM non-determinism |
| Remove DB entirely | Extremely simple | Frontend queries difficult, history hard to manage | Need structured display |
| Event Sourcing | Complete history, replayable | High engineering complexity | Over-engineered for current scale |

Impact

Code simplification:

Deleted files:
- src/ai/tools/saveInterview.ts
- src/ai/tools/saveDiscussion.ts
- src/ai/tools/saveScoutTask.ts
- src/ai/tools/savePersona.ts
- src/ai/tools/saveWebSearch.ts

Simplified files (28):
- Agent configs no longer need save tools
- generateReport doesn't need multi-table JOINs

Development efficiency:

Before:

Adding discussionChat:
1. Create Discussion table
2. Write discussionChat tool
3. Write saveDiscussion tool
4. Add both tools to 3 agents
5. Write discussion query logic
6. Modify generateReport query

Total: 12 files, 2-3 days

After:

Adding discussionChat:
1. Write discussionChat tool (returns plainText)
2. Add tool to agent config
3. generateReport auto-supports (reads from messages)

Total: 3 files, 2-3 hours
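
Under this convention, a new research tool is mostly its own logic plus a plainText summary in its return value. A rough sketch using the AI SDK's tool() helper (option names follow AI SDK v5; runDiscussion and the schema are hypothetical):

import { tool } from "ai";
import { z } from "zod";

export const discussionChat = tool({
  description: "Run a group discussion with 3-8 simulated personas",
  inputSchema: z.object({
    topic: z.string(),
    personaIds: z.array(z.string()).min(3).max(8),
  }),
  execute: async ({ topic, personaIds }) => {
    const transcript = await runDiscussion(topic, personaIds); // hypothetical helper

    // plainText is the only required field: the content flows into messages,
    // so no saveDiscussion tool and no Discussion table are needed
    return {
      plainText: `Discussion summary on "${topic}":\n${transcript.summary}`,
      participantCount: personaIds.length,
    };
  },
});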

Cost trade-offs:

Benefits:

  • Simplified architecture: deleted 5 tools, simplified 28 files
  • Data consistency: full context traceable even on failures
  • Easy extension: adding new research methods goes from 12 steps → 3 steps

Costs:

  • studyLog generation requires extra LLM call (~2K tokens, ~$0.002)
  • Slightly higher token consumption for long conversations

Mitigation:

  • Prompt cache reduces repeated token cost by 90%
  • Architectural benefits far outweigh costs

IV. Step 2: Intent Clarification + Unified Execution

v2.3.0 - 2026-01-06

Problem 1: Vague Requirements → Inefficient Dialogue

After implementing message-driven architecture, adding features became simpler. But user experience wasn't good enough.

When creating research, users often say:

"Want to understand young people's coffee preferences"

This isn't specific enough:

  • Which young people? 18-22 college students? Or 23-28 young professionals?
  • What method? In-depth interviews? Group discussions? Or social media observation?
  • What output? User personas? Market insights? Or product recommendations?

Traditional approach: AI asks multiple questions

AI: "Which age group do you want to research?"
User: "18-25 I guess"
AI: "What method? Interviews or surveys?"
User: "Interviews"
AI: "How many people?"
User: "Around 10"

Problems:

  • Requires 3-5 conversation rounds
  • Poor user experience (feels like filling forms)
  • AI can't proactively suggest best approaches

Problem 2: 95% Duplicate Code

While adding features became simpler, we discovered a bigger technical debt:

$ wc -l src/app/(study)/agents/*AgentRequest.ts
493 studyAgentRequest.ts
416 fastInsightAgentRequest.ts
302 productRnDAgentRequest.ts

Three nearly identical agent wrappers, totaling 1,211 lines.

Code duplication mainly in:

  • Message loading and processing (~80 lines each)
  • File attachment handling (~60 lines each)
  • MCP integration (~40 lines each)
  • Token tracking (~50 lines each)
  • Notification sending (~30 lines each)

Every new feature (like webhook integration) required changing all three places.

Solution: Plan Mode + baseAgentRequest

Our solution has two parts:

Part 1: Plan Mode (Intent Clarification Layer)

A separate agent dedicated to intent clarification:

// src/app/(study)/agents/configs/planModeAgentConfig.ts

export async function createPlanModeAgentConfig() {
  return {
    model: "claude-sonnet-4-5",
    systemPrompt: planModeSystem({ locale }),
    tools: {
      requestInteraction,  // Interact with user
      makeStudyPlan,       // Display complete plan, one-click confirm
    },
    maxSteps: 5,  // Max 5 steps to complete clarification
  };
}

Workflow:

sequenceDiagram
    participant User
    participant PlanMode as Plan Mode Agent
    participant StudyAgent as Study Agent

    User->>PlanMode: "Want to understand young people's coffee preferences"

    PlanMode->>PlanMode: Analyze requirements
    Note over PlanMode: - Target: 18-25 years old<br/>- Research type: qualitative insights<br/>- Best method: group discussion

    PlanMode->>User: Display complete plan
    Note over PlanMode,User: 【Research Plan】<br/>Goal: Understand 18-25 coffee preferences<br/>Method: Group discussion (5-8 people)<br/>Duration: ~40 minutes<br/>Output: Consumer insights report<br/><br/>[Confirm Start] [Modify Plan]

    User->>PlanMode: [Confirm Start]

    PlanMode->>StudyAgent: Intent recorded in messages
    Note over StudyAgent: Read intent from conversation history<br/>Execute research plan

    StudyAgent->>User: Start research execution

Key design:

  • Plan Mode's decisions are recorded in messages
  • Study Agent infers intent from messages, no explicit passing needed
  • Avoids complexity of context passing
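
Concretely, "recorded in messages" can be as simple as the makeStudyPlan tool echoing the confirmed plan back as its result, so it lands in the conversation history that the Study Agent later reads. A hypothetical sketch, not the actual tool:

import { tool } from "ai";
import { z } from "zod";

export const makeStudyPlan = tool({
  description: "Present the complete research plan for one-click confirmation",
  inputSchema: z.object({
    goal: z.string(),
    method: z.enum(["interview", "discussion", "scout"]),
    participants: z.number(),
    expectedOutput: z.string(),
  }),
  // The returned plan becomes part of the message stream; the Study Agent
  // infers intent from this history instead of receiving an explicit handoff object
  execute: async (plan) => ({
    plainText: `Research plan confirmed:\n${JSON.stringify(plan, null, 2)}`,
  }),
});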

Part 2: baseAgentRequest (Unified Executor)

Merge three duplicate agent wrappers into one generic executor:

// src/app/(study)/agents/baseAgentRequest.ts (577 lines)

interface AgentRequestConfig<TOOLS extends ToolSet> {
  model: LLMModelName;
  systemPrompt: string;
  tools: TOOLS;
  maxSteps?: number;

  specialHandlers?: {
    // Dynamically control which tools are available
    customPrepareStep?: (options: { messages: ModelMessage[] }) => {
      messages: ModelMessage[];
      activeTools?: (keyof TOOLS)[];
    };

    // Custom post-processing logic
    customOnStepFinish?: (step: StepResult<TOOLS>, context: BaseAgentContext) => Promise<void>;
  };
}

async function executeBaseAgentRequest<TOOLS>(
  baseContext: BaseAgentContext,
  config: AgentRequestConfig<TOOLS>,
  streamWriter: UIMessageStreamWriter
) {
  // Phase 1: Initialization
  // Phase 2: Prepare Messages
  // Phase 3: Universal Attachment Processing
  // Phase 4: Universal MCP and Team System Prompt
  // Phase 5: Load Memory and Inject into Context
  // Phase 6: Main Streaming Loop
  // Phase 7: Universal Notifications
}

Agent routing:

// src/app/(study)/api/chat/route.ts

if (!analyst.kind) {
  // Plan Mode - intent clarification
  const config = await createPlanModeAgentConfig(agentContext);
  await executeBaseAgentRequest(agentContext, config, streamWriter);

} else if (analyst.kind === AnalystKind.productRnD) {
  // Product R&D Agent
  const config = await createProductRnDAgentConfig(agentContext);
  await executeBaseAgentRequest(agentContext, config, streamWriter);

} else {
  // Study Agent (comprehensive research, fast insights, testing, creative, etc.)
  const config = await createStudyAgentConfig(agentContext);
  await executeBaseAgentRequest(agentContext, config, streamWriter);
}

Each agent only needs to define configuration:

// src/app/(study)/agents/configs/studyAgentConfig.ts

export async function createStudyAgentConfig(params) {
  return {
    model: "claude-sonnet-4",
    systemPrompt: studySystem({ locale }),
    tools: buildStudyTools(params),  // ← Tools this agent needs

    specialHandlers: {
      // Custom tool control
      customPrepareStep: async ({ messages }) => {
        const toolUseCount = calculateToolUsage(messages);
        let activeTools = undefined;

        // After report generation, restrict available tools
        if ((toolUseCount[ToolName.generateReport] ?? 0) > 0) {
          activeTools = [
            ToolName.generateReport,
            ToolName.reasoningThinking,
            ToolName.toolCallError,
          ];
        }

        return { messages, activeTools };
      },

      // Custom post-processing
      customOnStepFinish: async (step) => {
        // After saving research intent, auto-generate title
        const saveAnalystTool = findTool(step, ToolName.saveAnalyst);
        if (saveAnalystTool) {
          await generateChatTitle(studyUserChatId);
        }
      },
    },
  };
}
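
calculateToolUsage isn't shown above; a minimal sketch of such a helper, counting tool calls in the conversation history (message part shapes simplified):

// Count how many times each tool has been called so far in this conversation
function calculateToolUsage(messages) {
  const counts: Record<string, number> = {};
  for (const message of messages) {
    if (message.role !== "assistant" || !Array.isArray(message.content)) continue;
    for (const part of message.content) {
      if (part.type === "tool-call") {
        counts[part.toolName] = (counts[part.toolName] ?? 0) + 1;
      }
    }
  }
  return counts;
}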

Why This Design?

Reasoning-execution separation rationale:

  1. Matches cognitive model

    • Human decision-making: first figure out "what to do", then consider "how to do it"
    • System 1 (intuition) vs System 2 (reasoning)
    • Plan Mode = System 2, Study Agent = System 1
  2. Single responsibility

    • Plan Mode: focuses on intent understanding, doesn't need to know execution details
    • Study Agent: focuses on research execution, doesn't need to handle clarification
    • Each is simpler and easier to maintain
  3. Messages as protocol

    • Plan Mode's decisions → messages
    • Study Agent reads intent from messages
    • Loosely coupled without losing context

Unified executor rationale:

  1. Extract, Don't Rebuild

    • Extract common patterns from three similar implementations
    • Not designing abstraction layer from scratch
  2. Configuration over Inheritance

    • Agent differences expressed through configuration
    • No inheritance or polymorphism
  3. Plugin-based Lifecycle

    • customPrepareStep: dynamic tool control
    • customOnStepFinish: custom post-processing
    • Preserve extension points, don't hard-code all logic
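
Inside the executor, those hooks wrap the universal logic rather than replacing it. Roughly (saveStepMessages and trackTokenUsage are placeholders, and the prepareStep option name varies across AI SDK versions):

const result = streamText({
  model: llm(config.model),
  system: config.systemPrompt,
  messages: modelMessages,
  tools: config.tools,

  // Universal bookkeeping first, then the agent-specific extension point
  onStepFinish: async (step) => {
    await saveStepMessages(step);   // placeholder: persist new messages
    await trackTokenUsage(step);    // placeholder: token accounting
    await config.specialHandlers?.customOnStepFinish?.(step, baseContext);
  },

  // Per-step tool control delegated to the config; no override otherwise
  prepareStep: async (options) =>
    config.specialHandlers?.customPrepareStep?.(options) ?? {},
});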

Comparison with other approaches:

| Approach | Pros | Cons | Why not chosen |
| --- | --- | --- | --- |
| Plan Mode + baseAgentRequest | Remove duplicate code, separate reasoning-execution | One more abstraction layer | ✅ Our choice |
| Continue copy-pasting | Simple and direct | Tech debt accumulates, hard to maintain | Unsustainable long-term |
| Fully generic agent | Least code | Sacrifices specialization and control | Can't handle business differences |
| Microservices split | Independent deployment | Over-engineered, adds ops complexity | Unnecessary at current scale |

Impact

Code complexity:

Deleted:
- studyAgentRequest.ts (493 lines)
- fastInsightAgentRequest.ts (416 lines)
- productRnDAgentRequest.ts (302 lines)
Total: -1,211 lines

Added:
+ baseAgentRequest.ts (577 lines)
+ planModeAgentConfig.ts (120 lines)
+ studyAgentConfig.ts (180 lines)
+ productRnDAgentConfig.ts (80 lines)
Total: +957 lines

Net reduction: -254 lines

But more importantly:

  • Cyclomatic Complexity: 12.3 → 6.7 (45% reduction)
  • Code duplication: 95% → 0%

Development efficiency:

Before:

Adding MCP integration:
1. Modify studyAgentRequest.ts
2. Modify fastInsightAgentRequest.ts
3. Modify productRnDAgentRequest.ts
4. Test three agents

Time: 2-3 days

After:

Adding MCP integration:
1. Modify baseAgentRequest.ts
2. All agents automatically gain new capability

Time: 2-3 hours

User experience:

Before:

User: "Want to understand young people's coffee preferences"
AI: "Which age group do you want to research?"
User: "18-25"
AI: "What method do you want to use?"
User: "Interviews I guess"
AI: "How many people?"
... (3-5 conversation rounds)

After:

User: "Want to understand young people's coffee preferences"
AI displays complete plan:
┌─────────────────────────────────────┐
│ 【Research Plan】                   │
│ Goal: Understand 18-25 coffee prefs │
│ Method: Group discussion (5-8 ppl)  │
│ Duration: ~40 minutes               │
│ Output: Consumer insights report    │
│                                     │
│ [Confirm Start] [Modify Plan]       │
└─────────────────────────────────────┘

Intent clarification: 3-5 conversation rounds → 1 confirmation


V. Step 3: Persistent Memory

v2.3.0 - 2026-01-08

Problem: AI "Amnesia"

With intent clarification and unified architecture, the research workflow was smooth. But long-term users reported a problem:

"Why does the AI ask me what industry I'm in every single time?"

The AI doesn't remember users. Every conversation feels like the first meeting:

  • "What industry are you in?"
  • "Which dimensions do you care about?"
  • "What's your research goal?"

Users feel the AI is "forgetful" and the experience lacks personalization.

Root cause:

LLMs are stateless. Each conversation:

const result = await streamText({
  messages: currentConversation,  // ← Only current conversation
  // No context from historical conversations
});

Although we have historical conversations in the DB:

  1. Cross-conversation info lost: Each research is an independent session
  2. Important info buried: Key information in long conversations is hard to extract
  3. No persistent memory: No long-term memory of "who the user is"

Solution: Two-Tier Memory Architecture

We need a persistent memory system. But how to design it?

Inspired by Anthropic's CLAUDE.md approach:

  • Simple Markdown files
  • User-viewable and editable
  • Fully loaded into context

We adopted a similar approach but added automatic update mechanisms.

Data Model

model Memory {
  id      Int  @id @default(autoincrement())
  userId  Int? // User-level memory
  teamId  Int? // Team-level memory
  version Int  // Version management

  // Two-tier architecture
  core    String @default("") @db.Text  // Core memory (Markdown)
  working Json   @default("[]")         // Working memory (JSON, to be consolidated)

  changeNotes String @db.Text  // Update notes

  @@unique([userId, version])
  @@index([userId, version(sort: Desc)])
}

Two-tier architecture:

  1. Core Memory (core)

    • Markdown format, human-readable
    • Long-term stable user information
    • Example:
     # User Information
     - Industry: Consumer goods product manager
     - Focus: Young consumer preferences, emerging trends
    
     # Research Style
     - Prefers qualitative research (interviews, discussions)
     - Values authentic user voices over statistics
    
  2. Working Memory (working)

    • JSON format, structured
    • New information to be consolidated
    • Example:
     [
       { "info": "User recently focused on coffee market", "source": "chat_123" },
       { "info": "Prefers group discussion method", "source": "chat_124" }
     ]
    

Automatic Update Mechanism

Two-stage update:

// src/app/(memory)/actions.ts

async function updateMemory({ userId, conversationContext }) {
  let memory = await loadLatestMemory(userId);

  // Step 1: Reorganize when threshold exceeded (Claude Sonnet 4.5)
  if (memory.core.length > 8000 || memory.working.length > 20) {
    memory = await reorganizeMemory(memory, conversationContext);
  }

  // Step 2: Extract new information (Claude Haiku 4.5)
  const newInfo = await extractMemoryUpdate(memory.core, conversationContext);

  if (newInfo) {
    // Step 3: Insert new information at specified location
    await insertMemoryInfo(memory, newInfo);
  }
}
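
loadLatestMemory isn't shown; given the schema above, it is a single indexed query for the newest version. A sketch assuming the Prisma client:

// Matches @@index([userId, version(sort: Desc)]) from the schema above
async function loadLatestMemory(userId: number) {
  return prisma.memory.findFirst({
    where: { userId },
    orderBy: { version: "desc" },
  });
}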

Memory Update Agent (Haiku 4.5):

  • Extract new user information from conversations
  • Low cost (~$0.001 per run)
  • Runs in background after each conversation
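
A minimal sketch of what that extraction step could look like, asking the small model whether the conversation contains anything durable worth remembering (prompt wording and the return shape are assumptions):

import { generateObject } from "ai";
import { z } from "zod";

async function extractMemoryUpdate(coreMemory: string, conversation: string) {
  const { object } = await generateObject({
    model: llm("claude-haiku-4-5"),
    schema: z.object({
      hasNewInfo: z.boolean(),
      info: z.string().optional(),  // e.g. "User recently focused on coffee market"
    }),
    prompt:
      `Existing memory:\n${coreMemory}\n\n` +
      `Conversation:\n${conversation}\n\n` +
      `Extract any durable new fact about the user that is not already in memory.`,
  });
  return object.hasNewInfo ? object.info : null;
}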

Memory Reorganize Agent (Sonnet 4.5):

  • Consolidate working memory into core memory
  • Remove redundancy, merge similar information
  • Slightly higher cost (~$0.02 per run), but infrequently triggered

Integration into Conversation Flow

// src/app/(study)/agents/baseAgentRequest.ts

// Phase 5: Load Memory
const memory = await loadUserMemory(userId);

if (memory?.core) {
  // Inject at conversation start
  modelMessages = [
    {
      role: 'user',
      content: `<UserMemory>\n${memory.core}\n</UserMemory>`
    },
    ...modelMessages
  ];
}

// Phase 6: Streaming
const result = await streamText({
  messages: modelMessages,  // ← Includes user memory
  // ...
});

// Phase 7: Non-blocking memory update
waitUntil(
  updateMemory({ userId, conversationContext: messages })
);

Why This Design?

Why Markdown over Vector DB?

  1. Context window is large enough

    • Claude 3.5 Sonnet: 200K tokens
    • User memory typically < 10K characters (~3K tokens)
    • Full loading is simpler and more accurate than retrieval
  2. Simple and transparent

    • Markdown is user-readable and editable
    • No embeddings, no vector search, no complex indexing
    • Aligns with Anthropic's philosophy: user control
  3. Avoid premature optimization

    • Don't need real-time retrieval (low conversation frequency)
    • Don't need precise matching (full text provides enough context)
    • Start with simple solution, optimize when necessary

Comparison with mainstream approaches:

| Approach | Storage | Control | Retrieval | atypica choice rationale |
| --- | --- | --- | --- | --- |
| Anthropic (CLAUDE.md) | File-based | User-driven | Full loading | ✅ Simple, transparent, effective with large context |
| OpenAI | Vector DB (speculated) | AI + user confirmation | Semantic retrieval | ❌ Black box, weak user control |
| Mem0 | Vector + Graph + KV | AI-driven | Hybrid retrieval | ❌ Over-engineered, high maintenance cost |
| MemGPT | OS-inspired tiered | AI self-managed | Tiered retrieval | ❌ Conceptually complex, utility unproven |

We chose Anthropic's simple approach because:

  1. Fits current scale (personal assistant, not enterprise knowledge base)
  2. User controllable (transparent, editable)
  3. As context windows grow, this approach becomes better

Impact

User experience:

Before:

First conversation:
User: "Want to do coffee research"
AI: "What industry are you in?"
User: "Consumer goods"
AI: "What dimensions do you care about?"
...

Second conversation (a week later):
User: "Want to do tea beverage research"
AI: "What industry are you in?"  # ← Asks again

After:

First conversation:
User: "Want to do coffee research"
AI: "What industry are you in?"
User: "Consumer goods product manager"
# AI remembers

Second conversation (a week later):
User: "Want to do tea beverage research"
AI: "Based on your background as a consumer goods PM, I suggest..."  # ← Remembers!

System cost:

Memory Update (per conversation):
- Model: Claude Haiku 4.5
- Tokens: ~5K
- Cost: ~$0.001

Memory Reorganize (every 20 conversations):
- Model: Claude Sonnet 4.5
- Tokens: ~15K
- Cost: ~$0.02

Average cost: ~$0.002/conversation

Response time:

Memory loading: +50ms (non-blocking)
Memory update: background, doesn't affect response

Low cost, fast response, completely acceptable.


VI. Architecture Comparison: Our Unique Choices

Now let's step back and see how atypica's architecture differs from mainstream AI Agent frameworks.

State Management: Messages vs Memory Classes

| atypica | LangChain | Core Difference |
| --- | --- | --- |
| Messages as source | ConversationBufferMemory | We believe conversation history is the best state |
| Generate studyLog on demand | Pre-compute summary | Avoid sync issues, traceable on failures |
| DB stores derived state | DB stores core state | Similar to Event Sourcing |

Why different?

LangChain's design is influenced by traditional software, believing "state should be explicitly stored and managed."

We believe, for LLMs:

  • Conversation history = complete state
  • Derived state (studyLog) can be regenerated
  • Simpler, more fault-tolerant

Agent Architecture: Configuration vs Graph

| atypica | LangGraph | Core Difference |
| --- | --- | --- |
| Configuration-driven | Graph-driven | We use configuration to express differences, code for commonalities |
| Single executor | Node orchestration | Avoid over-abstraction, good enough is enough |
| Messages as protocol | Explicit node communication | Loosely coupled without losing context |

Why different?

LangGraph pursues generality, using graph orchestration to express arbitrarily complex flows.

We believe, for our scenarios:

  • Configuration-driven is simpler: 99% of needs can be met with configuration
  • Single executor is sufficient: Don't need graph orchestration's flexibility
  • Simpler is more reliable: Fewer abstraction layers, easier to debug

Memory System: Markdown vs Vector DB

| atypica | Mem0 | Core Difference |
| --- | --- | --- |
| Markdown files | Vector + Graph + KV | We choose simple and transparent over precise and complex |
| Full loading | Semantic retrieval | When the context window is large enough, full text is better |
| User-editable | AI black box | User trust comes from transparency |

Why different?

Mem0 pursues precise retrieval, using multiple databases in hybrid.

We believe, for personal assistants:

  • Simple solution is enough: User memory typically < 10K
  • Transparent beats precise: Users can view and edit memory
  • Gets better as context grows: At 1M tokens in the future, this approach will crush Vector DB

Core Philosophy Differences

atypica's choices:

  • Simple, transparent, controllable
  • Adapt to LLM characteristics (large context, non-determinism)
  • Start from real pain points, not pursuing architectural perfection

Mainstream frameworks' choices:

  • Precise, complex, automatic
  • Port traditional software engineering patterns
  • Pursue generality and flexibility

Who's right or wrong?

Neither is wrong. It's just:

  • Our scenario (personal research assistant) suits simple approaches better
  • As context windows grow, simple approaches become better
  • User trust comes from transparency, not AI magic

VII. Quantitative Impact

Specific impact from three evolutions:

Code Complexity

Duplicate code:
Before: 1,211 lines (three agent wrappers)
After: 0 lines
Reduction: 100%

Total lines of code:
Before: 1,211 lines (duplicates) + others
After: 577 lines (base) + 380 lines (configs) = 957 lines
Net reduction: 254 lines (21%)

Cyclomatic Complexity (code complexity metric):
Before: avg 12.3
After: avg 6.7
Reduction: 45%

Development Efficiency

| Task | Before | After | Improvement |
| --- | --- | --- | --- |
| Add new research method | 12 files, 2-3 days | 3 files, 2-3 hours | 10x |
| Add new capability (MCP) | Modify 3 places, 1 day | Modify 1 place, 2 hours | 4x |
| Fix bug | Change 3 agents | Change 1 base | 3x |

System Performance

Token consumption (with prompt cache):
- studyLog generation: ~2K tokens (~$0.002)
- Memory update: ~5K tokens (~$0.005)
- Average per conversation: +$0.007

Response time:
- Memory loading: +50ms (non-blocking)
- Plan Mode: +2s (one-time)
- studyLog generation: background, doesn't affect response

Cost and performance impact negligible.

User Experience

Intent clarification:
Before: average 3.2 conversation rounds
After: 1 plan display + 1 confirmation
Improvement: 3x efficiency

AI "memory":
Before: repetitive questions every conversation
After: auto-load user preferences
Improvement: personalized experience

Research startup time:
Before: ~5 minutes (multiple rounds of clarification)
After: ~1 minute (one-click confirm)
Improvement: 5x efficiency

VIII. Lessons Learned

What did we learn from three evolutions?

What We Did Right

1. Incremental refactoring, not big bang

We didn't rewrite the entire system at once. Three evolutions, each step:

  • Delivers value independently
  • Maintains backward compatibility (keeping analyst.studySummary field)
  • Can be rolled back

This let us quickly validate ideas and reduce risk.

2. Start from real pain points

Don't pursue architectural perfection, instead:

  • Message-driven: because adding discussionChat was too complex
  • Unified execution: because duplicate code was too much
  • Persistent memory: because users reported AI forgetfulness

Let problems drive the design, not the other way around.

3. Embrace LLM characteristics

Don't treat LLMs as traditional software:

  • Don't hand-write state machines, let AI infer state from conversations
  • Leverage large context windows, rather than pursuing precise retrieval
  • Let AI generate studyLog, rather than hand-writing parsers

Adapt to LLM's capability boundaries, rather than fighting them.

Costs We Paid

1. Learning curve for abstraction layer

baseAgentRequest requires understanding to modify:

  • The 7 phases of the execution flow
  • Timing of customPrepareStep and customOnStepFinish
  • Generic constraints and type inference

But: clear interfaces and documentation lowered the barrier.

2. Cost of on-demand generation

studyLog generation requires an LLM call (~$0.002 per call).

But:

  • Prompt cache reduces cost by 90%
  • Architectural benefits >> small cost
  • Acceptable

3. Limitations of simple solutions

Markdown memory isn't suitable for:

  • Large-scale knowledge bases (> 100K tokens)
  • Complex relational queries
  • Multi-dimensional retrieval

But:

  • Good enough for personal assistant scenarios
  • Can upgrade to Vector DB in the future
  • Solve 80% of problems first

Unexpected Benefits

1. Confidence from type safety

// Fully type-safe tool handling
const tool = step.toolResults.find(
  t => !t.dynamic && t.toolName === ToolName.generateReport
) as StaticToolResult<Pick<StudyToolSet, ToolName.generateReport>>;

if (tool?.output) {
  const token = tool.output.reportToken;  // ← TypeScript knows this field exists
}

During refactoring, the compiler catches 99% of issues.

2. Flexibility of configuration-driven

Adding webhook integration only requires:

// baseAgentRequest.ts
if (webhookUrl) {
  await sendWebhook(webhookUrl, step);
}

All agents automatically gain new capability, no config changes needed.

3. Power of messages as protocol

Plan Mode and Study Agent communicate through messages:

  • Decoupled: can be modified independently
  • Without losing context: complete decision process in messages
  • Traceable: can replay when problems occur

This was an unexpected benefit.


IX. Future Directions

Three evolutions brought atypica closer to general-purpose agents. But there's more to do.

Short-term (3-6 months)

1. Skills Library

  • Further modularize tools
  • Users can compose their own agents
  • Like GPTs, but more flexible

2. Multi-Agent Collaboration

  • Not just serial execution
  • Parallel research, cross-validation
  • Like AutoGPT, but more controllable

Long-term (1-2 years)

3. Evolve toward GEA

  • GEA = General Execution Architecture
  • Not just research agents, but a universal AI Agent execution framework
  • Can run any type of agent

4. Self-Improving Agents

  • Agents learn from past executions
  • Continuously optimize prompts and strategies
  • Get smarter with use

Unchanging Principles

No matter how we evolve, we stick to:

  • Simple beats complex
  • Transparent beats black box
  • User control beats AI automation

X. Conclusion

Building AI Agent systems is not a simple extension of traditional software engineering.

We need to rethink:

  • What is state? (Conversation history)
  • What is an interface? (Message protocol)
  • What is control flow? (AI reasoning)

atypica's three evolutions are essentially three cognitive upgrades:

  1. From database thinking → data flow thinking

    • Don't maintain explicit state, infer state from messages
  2. From code reuse → configuration-driven

    • Don't pursue perfect abstraction, use configuration to express differences
  3. From stateless → memory-enhanced

    • Don't rely on precise retrieval, use simple and transparent methods

These choices may not be the most "advanced."

But they are:

  • Simple: easy to understand, easy to debug
  • Transparent: users know what AI is doing
  • Controllable: users can intervene and adjust
  • Good enough: solve 80% of problems

And this, perhaps, is the key to building reliable AI systems.

https://atypica.ai/blog/towards-general-agent
