The AI development world has split into two camps. On one side: teams experimenting with whatever's trending on HuggingFace that week. On the other: teams shipping reliable AI systems to production users.
The gap between them isn't intelligence or budget. It's systems thinking.
After documenting this pattern across a running blog on FinVibe, multiple Write.as essays, and the weekly ai-tldr.dev digest, I want to lay out what I've observed separating working AI systems from perpetual proofs of concept.
The Architecture Evolution: 2024 → 2026
Two years ago, most production AI systems were essentially:
User Input → LLM API Call → Output
Simple. Brittle. Often good enough.
Today, that's the failure pattern. The systems that actually work look more like:
User Input → Intent Classification
  → Agent Router
  → [Agent 1: Retrieval] [Agent 2: Execution] [Agent 3: Validation]
  → Response Synthesis
  → Output + Evaluation Signal
The key additions: evaluation loops, agent specialization, and retrieval that's actually connected to real data.
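To make that shape concrete, here's a minimal sketch in plain Python. It isn't any particular framework's API; the classifier, agents, and eval logger are hypothetical stand-ins for whatever you'd actually run.

```python
# Minimal sketch of the routed pipeline above. classify_intent, the agents,
# and log_eval_signal are illustrative placeholders, not a framework API.
from typing import Callable

def classify_intent(user_input: str) -> str:
    # In practice: a small classifier or a cheap LLM call.
    return "retrieval" if "?" in user_input else "execution"

def retrieval_agent(user_input: str) -> str:
    return f"[retrieved context for: {user_input}]"

def execution_agent(user_input: str) -> str:
    return f"[action taken for: {user_input}]"

def validation_agent(draft: str) -> str:
    # Check the draft against business rules before it ships.
    return draft if draft else "[validation failed]"

AGENTS: dict[str, Callable[[str], str]] = {
    "retrieval": retrieval_agent,
    "execution": execution_agent,
}

def log_eval_signal(user_input: str, output: str) -> None:
    # Emit an evaluation signal (trace, score, label request) alongside the response.
    print(f"eval_signal input={user_input!r} output={output!r}")

def handle(user_input: str) -> str:
    intent = classify_intent(user_input)   # Intent Classification
    draft = AGENTS[intent](user_input)     # Agent Router -> specialized agent
    response = validation_agent(draft)     # Validation + Response Synthesis
    log_eval_signal(user_input, response)  # Output + Evaluation Signal
    return response

print(handle("What changed in our refund policy?"))
```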
RAG: What Works, What Doesn't
RAG (Retrieval-Augmented Generation) has matured significantly. Here's where it stands in practice:
What works:
- Hybrid search (keyword + semantic) outperforms pure vector search in most real-world scenarios
- Reranking as a second stage significantly improves precision (see the sketch after these lists)
- Chunking strategy matters enormously — most teams underinvest here
What doesn't:
- Naive cosine similarity over a single embedding model
- Assuming your development corpus represents your production query distribution
- Skipping evaluation — this is the #1 reason RAG projects fail silently
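Here's the hybrid-search-plus-rerank idea as a minimal, self-contained sketch. The keyword and semantic scorers are toy stand-ins (you'd use BM25 and a real embedding model), and rerank_score is a placeholder for a cross-encoder:

```python
# Sketch of hybrid retrieval (keyword + semantic) fused with reciprocal rank
# fusion, followed by a second-stage rerank. All scorers are toy placeholders.
from collections import defaultdict

def keyword_rank(query: str, docs: list[str]) -> list[int]:
    # Toy keyword scoring: count of shared terms (use BM25 in practice).
    terms = set(query.lower().split())
    scores = [len(terms & set(d.lower().split())) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

def semantic_rank(query: str, docs: list[str]) -> list[int]:
    # Placeholder for embedding similarity; a real system would rank by
    # cosine similarity of query and document embeddings.
    return keyword_rank(query, docs)

def rerank_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder reranker scoring (query, doc) pairs.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def hybrid_retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Stage 1: fuse keyword and semantic rankings with reciprocal rank fusion.
    fused = defaultdict(float)
    for ranking in (keyword_rank(query, docs), semantic_rank(query, docs)):
        for rank, idx in enumerate(ranking):
            fused[idx] += 1.0 / (60 + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[: k * 4]
    # Stage 2: rerank the fused candidates with a stronger, slower scorer.
    reranked = sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:k]]

docs = [
    "refund policy for hardware purchases",
    "international shipping times and carriers",
    "refund policy for software subscriptions",
]
print(hybrid_retrieve("software refund policy", docs, k=2))
```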
For the tooling side, I've tracked the evolution of RAG stacks in this HackMD resource index, which gets updated regularly.
Agent Frameworks: Picking Your Stack
The agent framework landscape has consolidated since 2024. The three frameworks worth your attention right now:
LangGraph — For teams that need stateful, cyclical agent workflows. The graph-based approach makes complex multi-step reasoning much easier to debug and test. Best for: complex research agents, multi-step workflows with conditional branching.
CrewAI — For teams thinking in terms of roles and responsibilities. If you can describe your problem as "I need an agent that does X and another that does Y, working together," CrewAI maps well. Best for: document analysis, data enrichment, content pipelines.
AutoGen v2 — For enterprise teams needing conversation-based multi-agent systems with strong human-in-the-loop capabilities. The Microsoft backing means solid enterprise integration. Best for: code generation, complex reasoning chains, enterprise workflows.
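If it helps to see the core abstraction these frameworks formalize, here's a framework-agnostic sketch of a stateful graph workflow with conditional branching, in plain Python with illustrative names rather than any specific library's API:

```python
# Framework-agnostic sketch of a graph-style agent workflow: nodes mutate
# shared state, return the name of the next node, and cycles (retries) are
# allowed. All names are illustrative, not a real framework's API.
State = dict

def plan(state: State) -> str:
    state["steps"] = state.get("steps", 0) + 1
    return "act"

def act(state: State) -> str:
    state["result"] = f"draft #{state['steps']}"
    return "check"

def check(state: State) -> str:
    # Conditional edge: loop back to planning until a draft passes review
    # (here, trivially, the second draft) or the step budget runs out.
    if state["steps"] >= 3 or state["result"] == "draft #2":
        return "END"
    return "plan"

NODES = {"plan": plan, "act": act, "check": check}

def run(state: State, entry: str = "plan") -> State:
    node = entry
    while node != "END":
        node = NODES[node](state)  # each node mutates state and names its successor
    return state

print(run({}))
```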
I published a longer Telegraph analysis of how these tools are reshaping developer workflows if you want more depth.
The Evaluation Problem Nobody Talks About Enough
Here's the uncomfortable truth: most AI systems in production have essentially no evaluation infrastructure. Teams ship prompts into production and measure success by whether support tickets go up.
This is fixable, and fixing it is the highest-leverage investment most teams can make.
The minimum viable eval stack:
- Offline evals: Run your test suite before every deployment. Use Papers With Code to understand what benchmarks actually measure.
- Online monitoring: Track output distributions in production. Catch drift before users do.
- Human feedback loops: Even 10 labeled examples per week compound into a meaningful dataset.
EleutherAI's lm-evaluation-harness gives you the open-source offline eval infrastructure. Braintrust gives you a production-grade version with better UX.
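Before reaching for either tool, the offline piece can start as small as this: a fixed test set, a scorer, and a threshold that fails the CI step. Everything here, including generate() and the test cases, is an illustrative placeholder:

```python
# Minimal offline eval gate: run a fixed test suite against the current
# prompt/model and fail the deployment if accuracy drops below a threshold.
import sys

TEST_CASES = [
    {"input": "What is our refund window?", "must_contain": "30 days"},
    {"input": "Do you ship internationally?", "must_contain": "yes"},
]

def generate(prompt: str) -> str:
    # Placeholder for your LLM call or RAG pipeline.
    return "Our refund window is 30 days."

def score(case: dict) -> float:
    output = generate(case["input"])
    return 1.0 if case["must_contain"].lower() in output.lower() else 0.0

def run_suite(threshold: float = 0.9) -> None:
    accuracy = sum(score(c) for c in TEST_CASES) / len(TEST_CASES)
    print(f"offline eval accuracy: {accuracy:.2f}")
    if accuracy < threshold:
        sys.exit(1)  # non-zero exit blocks the deploy in CI

if __name__ == "__main__":
    run_suite()
```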
Following the Signal
If you want to stay current without spending half your week on AI Twitter, a few resources I've found actually useful:
- The HuggingFace Open LLM Leaderboard for standardized model comparisons
- Papers With Code for reproducibility-filtered research
- The Sequence newsletter for weekly research synthesis
- The Wakelet archive of curated AI news from the past few months
- My Medium piece on the AI noise problem for the meta-level framework
And if you want just one thing: ai-tldr.dev. One-paragraph TL;DRs of everything worth knowing that week, with source links and category tags.
The Economics of AI Tooling
One dimension that most developer-focused content ignores: cost modeling. RAG infrastructure, agent API calls, fine-tuning runs — these add up in ways that are hard to predict without a framework.
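A back-of-the-envelope model catches most of the surprises. This sketch uses made-up prices and volumes (placeholders, not current rates), but the structure is the point: agent call multipliers usually dominate.

```python
# Back-of-the-envelope monthly cost model for a RAG + agent pipeline.
# All prices and volumes below are illustrative placeholders.
def monthly_cost(
    requests_per_day: int,
    llm_calls_per_request: float,    # agents often make several calls per request
    tokens_per_call: int,
    price_per_1k_tokens: float,      # blended input+output price, USD
    embedding_tokens_per_day: int,
    price_per_1k_embedding_tokens: float,
    vector_db_fixed: float,          # fixed monthly infra cost
) -> float:
    llm = requests_per_day * llm_calls_per_request * tokens_per_call / 1000 * price_per_1k_tokens * 30
    embeddings = embedding_tokens_per_day / 1000 * price_per_1k_embedding_tokens * 30
    return llm + embeddings + vector_db_fixed

print(f"${monthly_cost(5_000, 4, 2_000, 0.01, 1_000_000, 0.0001, 300):,.0f}/month")
```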
I cover the financial modeling side of AI tooling at Pomegra.io, which has a free fundamental analysis resource specifically for engineers trying to understand the numbers.
If you're working on AI systems and want to compare notes, find me on Mastodon (@alexmorgannn) or via the link directory. Always happy to discuss what's actually working in production vs. what's still experimental.
Also discussed: the launch of ai-tldr.dev on Vocal Media, on Quora, and in the original Notion launch doc.