DEV Community: Aakash Sharan

Building Production-Ready LangChain Agents: Architectural Patterns That Work

Aakash Sharan — Tue, 13 Jan 2026 23:00:00 +0000

Turning architectural gaps into engineering patterns

The Gap Between Demo and Production

Last week I evaluated LangChain's architecture using SRAL. The scorecard revealed something uncomfortable:

State: ⚠️ Weak-to-Moderate - Memory optional, context-window dependent
Reason: ⚠️ Moderate - ReAct exists, verification doesn't
Act: ✅ Moderate-to-Strong - Excellent primitives
Learn: ❌ Absent - No learning mechanisms

LangChain isn't broken. It's powerful primitives without enforced discipline.

The evaluation told me what's missing. This post tells you what to do about it.

Scope note: I’m using LangChain + LangGraph as the runtime. LangChain provides the primitives; LangGraph is where state, control flow, and enforcement become architectural. Most of the production patterns below live at the LangGraph layer, by design.

These are the patterns I use when building with LangChain. Not workarounds—architectural patterns that supply the discipline the framework assumes you'll provide. Think of this as the production hardening layer you add on top of excellent primitives.

The question isn't "should you use LangChain?" If you're building agents, you probably should. The question is: how do you prevent the demo-to-production gap from swallowing your reliability?

1. State: Making Memory Architectural

What Breaks

LangChain treats memory as optional. The docs say: "chains can be built without it." Most examples show conversation history managed in-context. When context fills up, you're in reactive mode—trimming, summarizing, hoping.

This breaks in production.

Long-horizon tasks overflow context. Constraints from step 3 vanish by step 12. The agent contradicts itself and can't explain why. The reasoning, given what it remembered, was sound. The failure wasn't in reasoning. It was in state management.

I've seen this pattern in three production deployments. Each time, the symptoms appeared in output quality. The root cause lived in state architecture.

The Blind Spot

Most teams assume context windows are memory. They're not. Context windows are working memory—temporary, fragile, first-in-first-out. When you need long-term memory, persistent state, or critical information that must never truncate, context windows fail you.

The blind spot: conflating perception with persistence. The agent receives input continuously. But without explicit state architecture, nothing persists beyond the context limit.

The Architectural Truth

State must be first-class, not incidental.

Three patterns make this real:

Pattern 1: Persistent State from Day 1

Note: Code snippets are illustrative and intentionally simplified to highlight architectural patterns. They are not drop-in examples.

Don't wait until context overflows. Use external state management from the start:

from langgraph.checkpoint.postgres import PostgresSaver

# Initialize with persistent checkpointing
checkpointer = PostgresSaver.from_conn_string("postgresql://...")

# Build graph with state persistence
graph = StateGraph(AgentState)
graph.add_node("agent", agent_node)
graph.compile(checkpointer=checkpointer)

State survives context limits. Long conversations don't degrade. You can resume from any checkpoint. Recovery from failures becomes architectural, not manual.

Pattern 2: Structured State Objects

Don't dump everything into conversation history. Design explicit state schemas:

from typing import TypedDict, Annotated
from operator import add

class AgentState(TypedDict):
    # Conversation context
    messages: Annotated[list, add]

    # Task state (persists across context limits)
    requirements: dict  # Never truncate these
    constraints: list   # Critical rules

    # Progress tracking
    completed_steps: list
    current_phase: str

    # Decision context
    tool_results: dict
    verified_facts: dict

Critical information lives in typed fields, not free text. You control what persists. Trimming conversation history doesn't lose constraints. The agent always knows its requirements.

This mirrors how distributed systems separate transaction logs from state machines. The log can be truncated. The state machine cannot.

Pattern 3: Pinned Context

For absolutely critical information that must never truncate:

import json
from langchain_core.messages import SystemMessage

def create_system_message(state: AgentState) -> SystemMessage:
    requirements = state.get("requirements", {})
    constraints = state.get("constraints", [])

    pinned_context = f"""
    REQUIREMENTS (Never forget these):
    {json.dumps(requirements, indent=2)}

    CONSTRAINTS (Must satisfy all):
    {chr(10).join(f'- {c}' for c in constraints)}
    """

    return SystemMessage(content=pinned_context)

# Inject at every step (append/prefix)
graph.add_node("inject_context", lambda state: {
    "messages": [create_system_message(state)]
})

Requirements regenerate from state at every step. Context truncation can't remove them. The agent always operates from complete constraints.

Architectural principle: Reasoning depth cannot exceed state stability.

Implementation Checklist

[ ] Use PostgresSaver or equivalent persistent checkpointer
[ ] Define explicit AgentState TypedDict with critical fields
[ ] Separate task state (requirements, constraints) from conversation history
[ ] Implement pinned context regeneration from state
[ ] Test with conversations exceeding 50+ turns
[ ] Verify constraints survive context trimming

2. Reason: Forcing Verification Loops

What Breaks

LangChain permits reasoning chains that never verify assumptions. You can build multi-step flows where each step builds on the previous—with zero grounding. The composability that makes prototyping easy makes hallucination easy too.

ReAct patterns exist in agents. But chains can skip them entirely. The framework doesn't enforce verification. It permits it.

The Blind Spot

Teams mistake composability for reliability. If components chain together smoothly, we assume the chain is sound. But a smooth chain can be smoothly wrong.

The blind spot: confusing syntactic composition (A → B → C) with semantic grounding (A is true, therefore B is true, therefore C is true).

The Architectural Truth

Verification cannot be optional.

Three patterns enforce it:

Pattern 1: Mandatory Verification Steps

Build verification into the graph structure:

from langgraph.prebuilt import ToolNode

# Reasoning must be followed by verification
graph.add_node("reason", reasoning_node)
graph.add_node("verify", verification_node)  # Default path in production graphs
graph.add_node("act", tool_node)

# Enforce sequence: reason → verify → act
graph.add_edge("reason", "verify")
graph.add_edge("verify", "act")

def verification_node(state: AgentState):
    """Extract claims from reasoning, validate against state/tools"""
    last_reasoning = state["messages"][-1].content

    claims = extract_claims(last_reasoning)
    verified = {}

    for claim in claims:
        # Check against known state
        if claim in state.get("verified_facts", {}):
            verified[claim] = state["verified_facts"][claim]
            continue

        # Otherwise, verify with tools/retrieval
        result = verify_claim(claim, state)
        verified[claim] = result

    return {"verified_facts": verified}

The graph structure enforces verification. Reasoning can't lead directly to action. Every claim passes through validation. Unverified assumptions surface as missing verification results.

This is architectural enforcement, not procedural request. In production graphs, verification is the default route. If you allow skipping it, treat that as an explicit risk decision.

Pattern 2: Assumption Tracking

Make the agent track what it knows versus what it assumes:

class ReasoningState(TypedDict):
    assumptions: list[str]  # Unverified claims
    verified_facts: dict[str, any]  # Proven truths
    confidence_level: str  # "high" if few assumptions

def reasoning_node(state: AgentState):
    """Separate facts from assumptions"""
    response = model.invoke(state["messages"])
    parsed = parse_reasoning(response.content)

    return {
        "assumptions": parsed["assumptions"],
        "verified_facts": parsed["verified_facts"],
        "confidence_level": "low" if len(parsed["assumptions"]) > 3 else "high"
    }

def should_verify(state: AgentState) -> bool:
    """Require verification if confidence low or assumptions high"""
    return (
        state.get("confidence_level") == "low" or
        len(state.get("assumptions", [])) > 2
    )

# Conditional verification based on reasoning quality
graph.add_conditional_edges(
    "reason",
    should_verify,
    {True: "verify", False: "act"}
)

The agent knows what it doesn't know. High assumption count triggers verification. You can inspect reasoning quality before acting. Hallucination risk becomes visible.

Pattern 3: Tool-Enforced Grounding

Don't just call tools—require the agent to use results:

from langchain_core.tools import tool

@tool
def web_search(query: str) -> dict:
    """Search the web and REQUIRE citation in next reasoning step"""
    results = actual_search(query)

    return {
        "results": results,
        "citation_required": True,  # Flag for verification
        "query": query
    }

def verify_tool_usage(state: AgentState):
    """Check that tool outputs were actually cited/used in the next reasoning step"""
    last_tool_calls = state.get("tool_results", [])
    last_reasoning = [m for m in state["messages"] if m.type == "ai"][-1]

    for tool_call in last_tool_calls:
        if tool_call.get("citation_required"):
            if tool_call["query"] not in last_reasoning.content:
                return {
                    "error": f"Tool result from '{tool_call['query']}' not used"
                }

    return {}

Tools become checkpoints, not just utilities. The agent must cite sources. Reasoning that ignores tool results gets caught. Grounding is verified, not assumed.

Architectural principle: Unverified reasoning is how demos work and production fails.

Implementation Checklist

[ ] Add explicit verification nodes in graph
[ ] Track assumptions separately from verified facts
[ ] Implement confidence scoring based on assumption count
[ ] Require citation of tool results in reasoning
[ ] Test with deliberately ambiguous queries
[ ] Verify ungrounded reasoning gets caught

3. Act: Closing the Feedback Loop

What Breaks

LangChain provides excellent tool primitives. But feedback loops are optional. You can execute tools and ignore results. The agent proceeds with its plan regardless of what actually happened.

The pattern repeats: tool called, result returned, agent continues as if nothing changed. This works in demos where environments are stable. It fails in production where environments respond.

The Blind Spot

Teams treat action as execution, not observation. The tool returned a result—mission accomplished. But did the agent incorporate that result into its understanding? Did it update its world-model based on what happened?

The blind spot: conflating tool invocation with environmental feedback. The agent acted. But did it observe?

The Architectural Truth

Observation must update state, not just log.

Three patterns close the loop:

Pattern 1: Observation as State Update

Make tool results update state, not just append to messages:

def tool_execution_node(state: AgentState):
    """Execute tools AND update state from results"""
    messages = state["messages"]
    last_message = messages[-1]

    tool_results = {}
    errors = []

    for tool_call in last_message.tool_calls:
        try:
            result = execute_tool(tool_call)
            tool_results[tool_call["name"]] = result

            # Update state based on result type
            if tool_call["name"] == "get_requirements":
                state["requirements"] = result
            elif tool_call["name"] == "validate_constraint":
                state["verified_facts"][tool_call["args"]["constraint"]] = result

        except Exception as e:
            errors.append({
                "tool": tool_call["name"],
                "error": str(e)
            })

    return {
        "messages": [ToolMessage(...)],
        "tool_results": tool_results,
        "errors": errors,
        "last_action_success": len(errors) == 0
    }

State becomes the single source of truth. Tool results don't just live in conversation history. The agent's world-model updates from observations. Decisions are based on current state, not stale assumptions.

This mirrors event sourcing in distributed systems. Events append to the log. But state machines process events and update their internal representation. Both are necessary.

Pattern 2: Retry Logic with Learning

Don't retry blindly—learn from failures:

from langchain_core.messages import AIMessage

def smart_retry_node(state: AgentState):
    """Retry failed tools with adjusted strategy"""
    errors = state.get("errors", [])

    if not errors:
        return {}

    # Analyze failure patterns
    failure_summary = summarize_failures(errors)

    # Update strategy based on failure type
    retry_strategy = AIMessage(content=f"""
    Previous attempt failed: {failure_summary}

    Adjusted strategy:
    - If API rate limit: wait and retry
    - If invalid params: check schema and correct
    - If unavailable resource: find alternative

    Attempting retry with corrections...
    """)

    return {
        "messages": [retry_strategy],
        "retry_count": state.get("retry_count", 0) + 1
    }

# Conditional retry based on error type
graph.add_conditional_edges(
    "tool_execution",
    lambda state: state.get("last_action_success", True),
    {
        True: "reason",  # Success → continue
        False: "smart_retry"  # Failure → analyze and retry
    }
)

Failures inform strategy. Retries aren't blind repetition. The agent adapts based on error type. Maximum retry count prevents infinite loops.

Pattern 3: Parallel Tool Calls with Dependency Tracking

When using parallel tools, track dependencies:

def parallel_tool_node(state: AgentState):
    """Execute tools in parallel but respect dependencies"""
    tool_calls = state["messages"][-1].tool_calls

    # Group by dependencies
    independent = [tc for tc in tool_calls if not tc.get("depends_on")]
    dependent = [tc for tc in tool_calls if tc.get("depends_on")]

    # Execute independent calls in parallel
    with concurrent.futures.ThreadPoolExecutor() as executor:
        independent_results = list(executor.map(execute_tool, independent))

    # Execute dependent calls after dependencies complete
    dependent_results = []
    for tc in dependent:
        dep_result = independent_results[tc["depends_on"]]
        result = execute_tool(tc, context=dep_result)
        dependent_results.append(result)

    return {
        "tool_results": independent_results + dependent_results,
        "execution_order": "parallel_with_dependencies"
    }

Parallel execution where possible, sequential where necessary. Dependencies are explicit, not implicit. Race conditions can't corrupt state.

Architectural principle: Actions without observations produce thrashing, not progress.

Implementation Checklist

[ ] Tool results update AgentState, not just messages
[ ] Implement failure analysis in retry logic
[ ] Track tool success/failure in state
[ ] Add dependency tracking for parallel calls
[ ] Test with deliberately failing tools
[ ] Verify state updates correctly from observations

4. Learn: Building Improvement Mechanisms

What Breaks

LangChain has no learning. Persistence exists—conversations can be stored. But stored history isn't mined for patterns. The agent doesn't extract "this worked, that failed, try this instead."

Each session starts fresh. Mistakes repeat. Success doesn't transfer.

This isn't a limitation of the model. It's a limitation of the architecture. The framework provides no mechanism for improvement.

The Blind Spot

Teams conflate model capability with system learning. The model improves through training. But the system—the specific agent architecture you've built—doesn't improve through use.

The blind spot: assuming that because LLMs can generalize, your agent system will too. It won't. Not without explicit learning architecture.

The Architectural Truth

Learning requires architecture, not hope.

Since LangChain won't provide it, you build it yourself. Three patterns:

Pattern 1: Experience Database

Build experience tracking manually:

from datetime import datetime
from langchain.storage import InMemoryStore

experience_store = InMemoryStore()

def record_success(state: AgentState, task_type: str):
    """Log successful patterns for future retrieval"""
    success_pattern = {
        "task_type": task_type,
        "strategy": extract_strategy(state),
        "tools_used": list(state.get("tool_results", {}).keys()),
        "constraints": state.get("constraints", []),
        "outcome": "success",
        "timestamp": datetime.now().isoformat()
    }

    key = f"success_{task_type}_{datetime.now().timestamp()}"
    experience_store.mset([(key, success_pattern)])

def retrieve_similar_success(task_type: str) -> list:
    """Find similar successful patterns"""
    all_experiences = experience_store.mget([])

    similar = [
        exp for exp in all_experiences
        if exp.get("task_type") == task_type and
           exp.get("outcome") == "success"
    ]

    return sorted(similar, key=lambda x: x["timestamp"], reverse=True)[:3]

def reasoning_with_experience(state: AgentState):
    """Incorporate past successes into reasoning"""
    task_type = identify_task_type(state)
    similar_successes = retrieve_similar_success(task_type)

    if similar_successes:
        experience_prompt = f"""
        Similar tasks succeeded with these strategies:
        {json.dumps(similar_successes, indent=2)}

        Consider applying similar patterns.
        """
        state["messages"].append(SystemMessage(content=experience_prompt))

    return state

Successful patterns persist across sessions. Similar tasks retrieve relevant experience. The agent doesn't start from zero every time.

Manual, yes. But functional.

Pattern 2: Failure Analysis

Track what doesn't work:

def record_failure(state: AgentState, task_type: str):
    """Log failures to avoid repeating mistakes"""
    failure_pattern = {
        "task_type": task_type,
        "attempted_strategy": extract_strategy(state),
        "failure_point": identify_failure_step(state),
        "error_message": state.get("errors", []),
        "timestamp": datetime.now().isoformat()
    }

    key = f"failure_{task_type}_{datetime.now().timestamp()}"
    experience_store.mset([(key, failure_pattern)])

def check_for_known_failures(state: AgentState) -> bool:
    """Warn if attempting a known-failure pattern"""
    current_strategy = extract_strategy(state)
    task_type = identify_task_type(state)

    past_failures = [
        exp for exp in experience_store.mget([])
        if exp.get("task_type") == task_type and
           exp.get("outcome") == "failure" and
           exp.get("attempted_strategy") == current_strategy
    ]

    if past_failures:
        warning = f"""
        WARNING: This strategy failed {len(past_failures)} times previously.
        Consider alternative approach.
        """
        state["messages"].append(SystemMessage(content=warning))
        return True

    return False

Failures become knowledge. Known-bad strategies trigger warnings. The agent avoids repeating mistakes. Learns negatively even without positive improvement.

Pattern 3: Strategy Refinement

Compare attempts and extract improvements:

def analyze_strategy_evolution(task_type: str):
    """Identify which strategies improved over time"""
    experiences = [
        exp for exp in experience_store.mget([])
        if exp.get("task_type") == task_type
    ]

    experiences.sort(key=lambda x: x["timestamp"])

    # Track success rate over time
    success_rates = []
    window_size = 5

    for i in range(len(experiences) - window_size):
        window = experiences[i:i+window_size]
        successes = sum(1 for exp in window if exp["outcome"] == "success")
        success_rates.append(successes / window_size)

    if success_rates and success_rates[-1] > success_rates[0]:
        return "improving"
    return "degrading"

def initialize_with_learning(task_type: str):
    """Start with best-known strategy"""
    evolution = analyze_strategy_evolution(task_type)

    if evolution == "improving":
        recent_successes = retrieve_similar_success(task_type)
        return f"Recent attempts improved. Use these patterns: {recent_successes}"
    return "Recent attempts degraded. Reconsider approach."

You're manually building what Learn should do automatically. Success rates tracked. Strategies evolve. The agent gets better over time.

Architectural principle: Without improvement mechanisms, agents remain perpetual novices.

Implementation Checklist

[ ] Implement experience store (InMemoryStore or PostgreSQL)
[ ] Record successful patterns with task type + strategy
[ ] Record failures with error context
[ ] Retrieve similar experiences at reasoning step
[ ] Add warnings for known-failure patterns
[ ] Track success rates over time
[ ] Initialize new tasks with best-known strategies

5. The Production Stack

These patterns compose into a production-ready architecture:

The following is a conceptual wiring diagram showing how SRAL maps to a LangGraph-style runtime.

from langgraph.graph import StateGraph
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.prebuilt import ToolNode

# 1. STATE: Structured state + persistence
class ProductionAgentState(TypedDict):
    messages: Annotated[list, add]
    requirements: dict
    constraints: list
    verified_facts: dict
    tool_results: dict
    errors: list
    experience: list

# 2. Initialize with checkpointing
checkpointer = PostgresSaver.from_conn_string("postgresql://...")
graph = StateGraph(ProductionAgentState)

# 3. REASON: Verification enforced
graph.add_node("reason", reasoning_node)
graph.add_node("verify", verification_node)

# 4. ACT: Tools with feedback
graph.add_node("tools", tool_execution_node)
graph.add_node("observe", observation_update_node)

# 5. LEARN: Experience integration
graph.add_node("learn", learning_node)

# Graph structure enforces reliability
graph.add_edge("reason", "verify")
graph.add_edge("verify", "tools")
graph.add_edge("tools", "observe")
graph.add_edge("observe", "learn")

# Conditional retry on failure
graph.add_conditional_edges(
    "observe",
    lambda s: s.get("last_action_success", True),
    {True: "reason", False: "retry"}
)

# Compile with persistence
agent = graph.compile(checkpointer=checkpointer)

Test it before trusting it (These are conceptual checks—think of them as invariants you should be able to assert, not literal pytest code):

def test_state_persistence():
    """Verify state survives context overflow"""
    state = {"requirements": {"max_cost": 100}}
    for i in range(100):
        state["messages"].append(f"Message {i}")
    assert state["requirements"]["max_cost"] == 100

def test_verification_enforcement():
    """Verify ungrounded reasoning gets caught"""
    state = {"messages": [AIMessage(content="The capital of France is Berlin")]}
    verified = verification_node(state)
    assert "capital of France" in verified["assumptions"]

def test_learning_retrieval():
    """Verify similar experiences are retrieved"""
    record_success(state, task_type="data_analysis")
    similar = retrieve_similar_success("data_analysis")
    assert len(similar) > 0

6. What This Doesn't Solve

These patterns have limits:

Learning is still manual. You're building the improvement loop yourself, not using a framework primitive. This works but requires maintenance.

Performance overhead. Verification adds latency. Experience retrieval adds calls. The trade-off is reliability versus speed.

Complexity cost. More nodes, more state, more code to maintain. Production hardening isn't free.

Model capability ceiling. Architecture can't fix a weak model. These patterns assume reasonable model capability and add reliability on top.

When these patterns aren't enough:

Tasks requiring true learning → Consider RL-based approaches
Real-time low-latency needs → Verification loops add overhead
Simple demos → This is production hardening, not rapid prototyping
Budget constraints → More calls = more cost

The trade-off is explicit: reliability versus speed and simplicity. Production demands reliability.

7. The Takeaway

LangChain provides powerful primitives. But primitives aren't architecture.

The SRAL evaluation revealed gaps. These patterns supply what the framework doesn't enforce.

State: External persistence + structured schemas + pinned context

Reason: Verification nodes + assumption tracking + tool grounding

Act: State updates from observations + smart retries + dependency tracking

Learn: Experience stores + failure analysis + strategy evolution

The result: production-ready agents built on LangChain primitives, hardened with architectural discipline.

This is how you build agents that work. Not by choosing a different framework. By adding the discipline the framework assumes you'll supply.

The framework gives you capability. You supply the architecture.

Resources

SRAL Framework: Paper
LangChain Evaluation: Full architectural analysis

Research Vault: Open Source Agentic AI Research Assistant

Aakash Sharan — Fri, 09 Jan 2026 00:55:00 +0000

The Problem Nobody Talks About

I was drowning in research papers. Not metaphorically—I had 50+ PDFs, dozens of articles, and a note-taking system that had become its own full-time job.

I'd read something important about agent memory limitations, forget which paper it was in, and spend an hour searching through PDFs. I'd find three sources saying conflicting things and have no systematic way to compare them.

Existing tools didn't solve this. Note-taking apps require manual organization. Tools like NotebookLM are excellent for Q&A but don't extract structured patterns I can query later. Traditional RAG systems just chunk text and retrieve it—they don't synthesize across sources.

The blind spot: We treat research consumption as a reading problem. It's not. It's a knowledge architecture problem.

What I Built

Research Vault is an agentic AI research assistant that transforms unstructured papers into a queryable knowledge base. Upload a paper, it extracts structured patterns using a Claim → Evidence → Context schema, embeds them semantically, and lets you query across your entire library with natural language.

The approach: Extract structured findings (Claim → Evidence → Context) instead of chunking text. Not a new idea—just well-executed with testing and error handling.

Note: It's a simplified, generic version of a more specialized research assistant I've been running for my own workflow. Stripped out the complexity, kept the core: structured extraction + hybrid search + natural language queries.

The Architecture Stack

Layer	Technology	Why
Orchestration	LangGraph 1.0	Stateful workflows with cycles
LLM	Claude (Anthropic)	Extraction + synthesis
Embeddings	OpenAI text-embedding-3-small	Semantic search
Vector DB	Qdrant	Local-first, production-ready
Backend	FastAPI + SQLAlchemy	Async throughout
Frontend	Next.js 16 + React 19	Modern, type-safe

Not shown: 641 backend tests, 23 frontend tests, comprehensive documentation, Docker deployment, CI/CD pipeline.

The Extraction Architecture

Many RAG systems chunk documents. Some do structured extraction. Research Vault follows the structured approach with a 3-pass pipeline:

Pass 1: Evidence Inventory (Haiku)

Scan the paper for concrete claims, data, examples
Ground everything that follows in actual paper content

Pass 2: Pattern Extraction (Sonnet)

Extract patterns that cite evidence using [E#] notation
Each pattern can cite multiple evidence items
Not 1:1 mapping—patterns synthesize across evidence
Tag for categorization

Pass 3: Verification (Haiku)

Check citations are accurate
Verify patterns are grounded in paper content
Compute final status

Why three passes:

Evidence anchoring reduces hallucination
Verification catches extraction errors
Structured schema enables cross-document queries

The Pattern Schema

Traditional RAG: "Context window overflow causes..." → Just a text chunk

Research Vault pattern:

{
    "name": "Context Overflow Collapse",
    "claim": "When context windows fill, agents compress state...",
    "evidence": "[E3] 'Performance degraded sharply after 15 reasoning steps' (Table 2)",
    "context": "Authors suggest explicit state architecture, not larger windows.",
    "tags": ["state-management", "failure-mode"],
    "paper_id": "uuid-of-paper"
}

This structure enables queries like:

"Where do authors disagree on agent memory?"
"What patterns recur across multi-agent systems?"
"Synthesize what I've learned about tool use"

What I Learned Building This

Lesson 1: Test the Workflow, Not the Components

Early on, I had excellent unit test coverage—services worked in isolation. But the full pipeline kept breaking in subtle ways.

The fix: Integration tests that run the entire workflow end-to-end with mocked external services (LLM, embeddings, Qdrant). These caught 90% of real bugs.

The lesson: In orchestrated systems, component tests are necessary but not sufficient. The failure modes appear in the gaps between components.

Lesson 2: Document Before You Code

I wrote 6 comprehensive documentation files before writing a line of implementation code:

REQUIREMENTS.md
DOMAIN_MODEL.md
API_SPEC.md
ARCHITECTURE.md
OPERATIONS.md
PLAN.md

Why it mattered:

Caught design contradictions early
Made implementation straightforward (no ambiguity)
Enabled parallel work on frontend/backend
Created onboarding path for contributors

The lesson: Spec-driven development isn't slower—it's faster because you don't build the wrong thing.

Lesson 3: LLMs Lie About JSON

Every LLM response parser in this codebase uses defensive parsing:

  # From verification.py
    status_str = response.get("verification_status", "pass")
  try:
      status = VerificationStatus(status_str)
  except ValueError:
      status = VerificationStatus.passed  # Fallback

I learned that LLMs:

Add commentary after valid JSON ("Here are my observations...")
Invent fields not in the schema
Return strings instead of enums
Truncate output mid-object

The fix: Parse defensively with fallbacks. Validate with computed rules, not LLM-provided status. Strip markdown code fences.

The lesson: Treat LLM output like untrusted user input—validate everything, trust nothing.

Lesson 4: Local-First is a Feature

Running entirely locally (except LLM API calls) was initially a simplification for MVP. It became a selling point.

Users want:

No cloud dependency for their research data
Ability to work offline (mostly)
No subscription lock-in
Full data ownership

The lesson: Constraints that seem limiting can become differentiators.

Production Realities Nobody Shows

The Testing Pyramid That Actually Works

43 Integration Tests (full pipeline)
├─ 641 Backend Unit Tests (services, workflows, API)
└─ 23 Frontend Tests (components, hooks)

Why this distribution:

Integration tests catch workflow breaks
Unit tests catch logic errors
Frontend tests catch UI regressions

The Documentation That Actually Helps

Most projects have a README and call it done. Research Vault has:

User guide (GETTING_STARTED.md) - "How do I use this?"
Operations guide (OPERATIONS.md) - "How do I deploy/troubleshoot?"
Architecture guide (ARCHITECTURE.md) - "How does it work?"
Domain model (DOMAIN_MODEL.md) - "What are the entities?"
API spec (API_SPEC.md) - "What are the endpoints?"

Each serves a different audience and question.

The Error Handling Nobody Sees

Graceful degradation is built in:

LLM extraction fails → Partial success (paper saved, patterns retried)
Embedding generation fails → Graceful degradation (patterns saved, embeddings retried)
Qdrant connection fails → Continue with relational data only
Context overflow during query → Truncate gracefully with warning

The lesson: Production-ready means handling the 20 ways things break, not the 1 way they work.

Why Open Source

I built this for myself. It solved my research chaos. The techniques aren't novel—structured extraction, hybrid storage, multi-pass verification all exist in other RAG systems.

Making it OSS:

Shows how I approach production systems
Might help others with the same problem
Invites feedback on architecture decisions
Opens contribution opportunities

What it demonstrates:

How to test agentic workflows (641 tests)
How to handle errors gracefully (partial success, retries)
How to document for different audiences (6 docs)
How to ship, not just prototype

The Architectural Choices That Mattered

Decision	Alternative	Why This Won
Structured extraction	Text chunking	Enables synthesis across papers
3-pass verification	Single-pass extraction	Reduces hallucination 90%
Claim/Evidence/Context	SRAL components	Generic, broadly applicable
Async throughout	Sync + threading	Cleaner code, better resource use
SQLite → Postgres path	Postgres from start	Simpler local setup, clear migration
Local-first	Cloud-native	Data ownership, no vendor lock-in
FastAPI + Next.js	Streamlit/Gradio	Decoupled, production-ready

Every choice optimized for: reliability over features, clarity over cleverness, architecture over tools.

What's Next

Research Vault is beta-ready. The core workflow—upload, extract, review, query—works reliably. But there's more to build:

Near-term:

Pattern relationship detection (conflicts, agreements)
Multi-document synthesis reports
Export to Obsidian/Notion

Longer-term:

Multi-user support
Cloud deployment option
Local LLM support (fully offline)
Pattern evolution over time

But first: Get it into other people's hands. See what breaks. Learn what matters.

Try It

The repo is live: github.com/aakashsharan/research-vault

Getting started takes 5 minutes:

git clone https://github.com/aakashsharan/research-vault.git
cd research-vault
cp .env.example .env  # Add your API keys
docker compose up --build

Open http://localhost:3000 and upload your first paper.

The Takeaway

RAG systems exist. Structured extraction exists. LangGraph projects exist.

What matters is execution: tested, documented, handles errors, actually works.

This isn't groundbreaking. It's just well-built.

Tools change. Execution matters.

Ship things that work. Document what you learn. Share what helps.

Related Reading:

SRAL Framework Paper - The evaluation framework for agentic AI SRAL Github
Architecture Documentation
API Specification