The AI development world has split into two camps. On one side: teams experimenting with whatever's trending on HuggingFace that week. On the other: teams shipping reliable AI systems to production users.
The gap between them isn't intelligence or budget. It's systems thinking.
After documenting this pattern across a running blog on FinVibe, multiple Write.as essays, and the weekly ai-tldr.dev digest, I want to lay out what I've observed separating working AI systems from perpetual proofs of concept.
The Architecture Evolution: 2024 → 2026
Two years ago, most production AI systems were essentially:
User Input → LLM API Call → Output
Simple. Brittle. Often good enough.
Today, that's the failure pattern. The systems that actually work look more like:
User Input → Intent Classification
  → Agent Router
  → [Agent 1: Retrieval] [Agent 2: Execution] [Agent 3: Validation]
  → Response Synthesis
  → Output + Evaluation Signal
The key additions: evaluation loops, agent specialization, and retrieval that's actually connected to real data.
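To make that shape concrete, here's a minimal sketch in plain Python. It isn't any particular framework's API; the classifier, agents, and eval logger are hypothetical stand-ins for whatever you'd actually run.

```python
# Minimal sketch of the routed pipeline above. classify_intent, the agents,
# and log_eval_signal are illustrative placeholders, not a framework API.
from typing import Callable

def classify_intent(user_input: str) -> str:
    # In practice: a small classifier or a cheap LLM call.
    return "retrieval" if "?" in user_input else "execution"

def retrieval_agent(user_input: str) -> str:
    return f"[retrieved context for: {user_input}]"

def execution_agent(user_input: str) -> str:
    return f"[action taken for: {user_input}]"

def validation_agent(draft: str) -> str:
    # Check the draft against business rules before it ships.
    return draft if draft else "[validation failed]"

AGENTS: dict[str, Callable[[str], str]] = {
    "retrieval": retrieval_agent,
    "execution": execution_agent,
}

def log_eval_signal(user_input: str, output: str) -> None:
    # Emit an evaluation signal (trace, score, label request) alongside the response.
    print(f"eval_signal input={user_input!r} output={output!r}")

def handle(user_input: str) -> str:
    intent = classify_intent(user_input)   # Intent Classification
    draft = AGENTS[intent](user_input)     # Agent Router -> specialized agent
    response = validation_agent(draft)     # Validation + Response Synthesis
    log_eval_signal(user_input, response)  # Output + Evaluation Signal
    return response

print(handle("What changed in our refund policy?"))
```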
RAG: What Works, What Doesn't
RAG (Retrieval-Augmented Generation) has matured significantly. Here's where it stands in practice:
What works:
- Hybrid search (keyword + semantic) outperforms pure vector search in most real-world scenarios
- Reranking as a second stage significantly improves precision (see the sketch after these lists)
- Chunking strategy matters enormously — most teams underinvest here
What doesn't:
- Naive cosine similarity over a single embedding model
- Assuming your development corpus represents your production query distribution
- Skipping evaluation — this is the #1 reason RAG projects fail silently
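Here's the hybrid-search-plus-rerank idea as a minimal, self-contained sketch. The keyword and semantic scorers are toy stand-ins (you'd use BM25 and a real embedding model), and rerank_score is a placeholder for a cross-encoder:

```python
# Sketch of hybrid retrieval (keyword + semantic) fused with reciprocal rank
# fusion, followed by a second-stage rerank. All scorers are toy placeholders.
from collections import defaultdict

def keyword_rank(query: str, docs: list[str]) -> list[int]:
    # Toy keyword scoring: count of shared terms (use BM25 in practice).
    terms = set(query.lower().split())
    scores = [len(terms & set(d.lower().split())) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

def semantic_rank(query: str, docs: list[str]) -> list[int]:
    # Placeholder for embedding similarity; a real system would rank by
    # cosine similarity of query and document embeddings.
    return keyword_rank(query, docs)

def rerank_score(query: str, doc: str) -> float:
    # Stand-in for a cross-encoder reranker scoring (query, doc) pairs.
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def hybrid_retrieve(query: str, docs: list[str], k: int = 3) -> list[str]:
    # Stage 1: fuse keyword and semantic rankings with reciprocal rank fusion.
    fused = defaultdict(float)
    for ranking in (keyword_rank(query, docs), semantic_rank(query, docs)):
        for rank, idx in enumerate(ranking):
            fused[idx] += 1.0 / (60 + rank)
    candidates = sorted(fused, key=fused.get, reverse=True)[: k * 4]
    # Stage 2: rerank the fused candidates with a stronger, slower scorer.
    reranked = sorted(candidates, key=lambda i: rerank_score(query, docs[i]), reverse=True)
    return [docs[i] for i in reranked[:k]]

docs = [
    "refund policy for hardware purchases",
    "international shipping times and carriers",
    "refund policy for software subscriptions",
]
print(hybrid_retrieve("software refund policy", docs, k=2))
```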
For the tooling side, I've tracked the evolution of RAG stacks in this HackMD resource index, which gets updated regularly.
Agent Frameworks: Picking Your Stack
The agent framework landscape has consolidated since 2024. The three frameworks worth your attention right now:
LangGraph — For teams that need stateful, cyclical agent workflows. The graph-based approach makes complex multi-step reasoning much easier to debug and test. Best for: complex research agents, multi-step workflows with conditional branching.
CrewAI — For teams thinking in terms of roles and responsibilities. If you can describe your problem as "I need an agent that does X and another that does Y, working together," CrewAI maps well. Best for: document analysis, data enrichment, content pipelines.
AutoGen v2 — For enterprise teams needing conversation-based multi-agent systems with strong human-in-the-loop capabilities. The Microsoft backing means solid enterprise integration. Best for: code generation, complex reasoning chains, enterprise workflows.
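If it helps to see the core abstraction these frameworks formalize, here's a framework-agnostic sketch of a stateful graph workflow with conditional branching, in plain Python with illustrative names rather than any specific library's API:

```python
# Framework-agnostic sketch of a graph-style agent workflow: nodes mutate
# shared state, return the name of the next node, and cycles (retries) are
# allowed. All names are illustrative, not a real framework's API.
State = dict

def plan(state: State) -> str:
    state["steps"] = state.get("steps", 0) + 1
    return "act"

def act(state: State) -> str:
    state["result"] = f"draft #{state['steps']}"
    return "check"

def check(state: State) -> str:
    # Conditional edge: loop back to planning until a draft passes review
    # (here, trivially, the second draft) or the step budget runs out.
    if state["steps"] >= 3 or state["result"] == "draft #2":
        return "END"
    return "plan"

NODES = {"plan": plan, "act": act, "check": check}

def run(state: State, entry: str = "plan") -> State:
    node = entry
    while node != "END":
        node = NODES[node](state)  # each node mutates state and names its successor
    return state

print(run({}))
```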
I published a longer Telegraph analysis of how these tools are reshaping developer workflows if you want more depth.
The Evaluation Problem Nobody Talks About Enough
Here's the uncomfortable truth: most AI systems in production have essentially no evaluation infrastructure. Teams ship prompts into production and measure success by whether support tickets go up.
This is fixable, and fixing it is the highest-leverage investment most teams can make.
The minimum viable eval stack:
- Offline evals: Run your test suite before every deployment. Use Papers With Code to understand what benchmarks actually measure.
- Online monitoring: Track output distributions in production. Catch drift before users do.
- Human feedback loops: Even 10 labeled examples per week compound into a meaningful dataset.
EleutherAI's lm-evaluation-harness gives you the open-source offline eval infrastructure. Braintrust gives you a production-grade version with better UX.
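Before reaching for either tool, the offline piece can start as small as this: a fixed test set, a scorer, and a threshold that fails the CI step. Everything here, including generate() and the test cases, is an illustrative placeholder:

```python
# Minimal offline eval gate: run a fixed test suite against the current
# prompt/model and fail the deployment if accuracy drops below a threshold.
import sys

TEST_CASES = [
    {"input": "What is our refund window?", "must_contain": "30 days"},
    {"input": "Do you ship internationally?", "must_contain": "yes"},
]

def generate(prompt: str) -> str:
    # Placeholder for your LLM call or RAG pipeline.
    return "Our refund window is 30 days."

def score(case: dict) -> float:
    output = generate(case["input"])
    return 1.0 if case["must_contain"].lower() in output.lower() else 0.0

def run_suite(threshold: float = 0.9) -> None:
    accuracy = sum(score(c) for c in TEST_CASES) / len(TEST_CASES)
    print(f"offline eval accuracy: {accuracy:.2f}")
    if accuracy < threshold:
        sys.exit(1)  # non-zero exit blocks the deploy in CI

if __name__ == "__main__":
    run_suite()
```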
Following the Signal
If you want to stay current without spending half your week on AI Twitter, a few resources I've found actually useful:
- The HuggingFace Open LLM Leaderboard for standardized model comparisons
- Papers With Code for reproducibility-filtered research
- The Sequence newsletter for weekly research synthesis
- The Wakelet archive of curated AI news from the past few months
- My Medium piece on the AI noise problem for the meta-level framework
And if you want just one thing: ai-tldr.dev. One-paragraph TL;DRs of everything worth knowing that week, with source links and category tags.
The Economics of AI Tooling
One dimension that most developer-focused content ignores: cost modeling. RAG infrastructure, agent API calls, fine-tuning runs — these add up in ways that are hard to predict without a framework.
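A back-of-the-envelope model catches most of the surprises. This sketch uses made-up prices and volumes (placeholders, not current rates), but the structure is the point: agent call multipliers usually dominate.

```python
# Back-of-the-envelope monthly cost model for a RAG + agent pipeline.
# All prices and volumes below are illustrative placeholders.
def monthly_cost(
    requests_per_day: int,
    llm_calls_per_request: float,    # agents often make several calls per request
    tokens_per_call: int,
    price_per_1k_tokens: float,      # blended input+output price, USD
    embedding_tokens_per_day: int,
    price_per_1k_embedding_tokens: float,
    vector_db_fixed: float,          # fixed monthly infra cost
) -> float:
    llm = requests_per_day * llm_calls_per_request * tokens_per_call / 1000 * price_per_1k_tokens * 30
    embeddings = embedding_tokens_per_day / 1000 * price_per_1k_embedding_tokens * 30
    return llm + embeddings + vector_db_fixed

print(f"${monthly_cost(5_000, 4, 2_000, 0.01, 1_000_000, 0.0001, 300):,.0f}/month")
```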
I cover the financial modeling side of AI tooling at Pomegra.io, which has a free fundamental analysis resource specifically for engineers trying to understand the numbers.
If you're working on AI systems and want to compare notes, find me on Mastodon (@alexmorgannn) or via the link directory. Always happy to discuss what's actually working in production vs. what's still experimental.
Also discussed: the launch of ai-tldr.dev on Vocal Media, on Quora, and in the original Notion launch doc.