How I built a production-ready AI assistant that decides when to search the web, process documents, and run multi-minute research tasks without losing progress if things go wrong.
Most "AI chatbot" tutorials stop at the same place: wrap an LLM, stream tokens, done. That's a prototype. Production is a different beast entirely.
Over the past three years building AI-native applications, I've shipped chatbots that need to do more than answer questions. They need to act: search the web for current information, process uploaded documents, run multi-step research that takes minutes, and deliver results even if the user closes the browser.
This article walks through the architecture I landed on after multiple production deployments. The key insight: agentic chat is a distributed systems problem, not just an AI problem.
The Architecture
Here's the simplified flow:
```text
User message
  → Elysia API (auth + validation)
  → Vercel AI SDK (streaming + tool calling)
  → Claude decides: respond directly, or use a tool?
      → Tool: Web Search (Exa API, instant)
      → Tool: Document Lookup (pgvector RAG query)
      → Tool: Deep Research (Inngest background function, 1-5 min)
  → Stream response back to client
```
Three layers, each solving a different problem:
- Streaming layer. Vercel AI SDK handles the chat protocol, token streaming, and tool call orchestration.
- Tool layer. Claude decides when to invoke tools based on user intent.
- Durability layer. Inngest ensures long-running tasks complete, even if the server restarts.
Tool Calling: Let the AI Decide
The most important shift from a "chatbot" to an "agent" is tool calling. Instead of hardcoding "if user says X, do Y," you give the model a set of tools and let it choose.
Here's the shape of a tool definition with the Vercel AI SDK:
```typescript
const webSearchTool = tool({
  description: "Search the web for current information",
  parameters: z.object({
    query: z.string().describe("The search query"),
  }),
  execute: async ({ query }) => {
    const results = await exa.searchAndContents(query, {
      numResults: 5,
      text: true,
    });
    return results.map((r) => ({
      title: r.title,
      url: r.url,
      content: r.text,
    }));
  },
});
```
You register tools with the model, and Claude determines from the conversation whether to invoke them. Ask "what's the weather in Oslo?" and it calls web search. Ask "summarize the PDF I uploaded" and it queries the vector store. Ask something it already knows, and it just responds.
This is fundamentally different from building a routing layer yourself. The model handles intent classification as a side effect of generating a response.
The RAG Pipeline: Documents to Embeddings to Answers
Document processing follows a well-established pipeline, but the details matter more than most tutorials suggest.
Ingestion:
```text
Upload (PDF/DOCX/image)
  → Unstructured.io (extraction + layout analysis)
  → Text chunks (semantic splitting, ~500 tokens each)
  → Embedding generation (OpenAI ada-002)
  → pgvector storage (Neon PostgreSQL)
```
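To make the chunking step concrete, here's a deliberately naive sketch. It approximates the ~500-token target with the rough heuristic that one token is about four characters, and splits on sentence boundaries rather than doing true semantic splitting (which Unstructured.io handles in the real pipeline). The function name and heuristic are mine, not part of the production code.

```typescript
// Naive chunker: groups sentences into pieces of roughly `maxTokens`,
// using the approximation 1 token ≈ 4 characters. Illustrative only —
// the actual pipeline uses Unstructured.io's semantic splitting.
const CHARS_PER_TOKEN = 4;

function chunkText(text: string, maxTokens = 500): string[] {
  const maxChars = maxTokens * CHARS_PER_TOKEN;
  // Split on sentence-ending punctuation followed by whitespace.
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";

  for (const sentence of sentences) {
    if (current && current.length + sentence.length + 1 > maxChars) {
      chunks.push(current); // current chunk is full; start a new one
      current = sentence;
    } else {
      current = current ? `${current} ${sentence}` : sentence;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

The point of chunking at a fixed token budget is that each chunk must fit comfortably inside an embedding request and, later, inside the model's context window alongside several sibling chunks.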
I chose Unstructured.io because it handles the nasty cases: scanned PDFs, mixed layouts, tables, embedded images with OCR. If you've ever tried to extract clean text from a real-world PDF, you know that pdf-parse gives you garbage for anything complex.
Retrieval:
When the chatbot's document lookup tool fires, it runs a cosine similarity search against pgvector:
```sql
SELECT content, metadata, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE project_id = $2
ORDER BY embedding <=> $1
LIMIT 5;
```
The results feed back into Claude's context as tool output, and it synthesizes an answer with citations.
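If the `<=>` operator looks opaque: in pgvector it computes cosine distance, which is why the query subtracts it from 1 to get similarity. A pure TypeScript version of the same math (my illustration, not part of the stack) makes the relationship explicit:

```typescript
// pgvector's `<=>` operator returns cosine distance: 1 - cos(theta).
// So `1 - cosineDistance(a, b)` corresponds to the `similarity`
// column in the SQL query above.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Identical vectors have distance 0 (similarity 1); orthogonal vectors have distance 1 (similarity 0). The `ORDER BY embedding <=> $1` clause sorts ascending by distance, so the closest chunks come first.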
Where It Gets Hard: Long-Running Tasks
Web search returns in seconds. Document lookup returns in milliseconds. But what about deep research: a task that runs for 1-5 minutes, spawning multiple sub-queries, synthesizing sources, and building a grounded report?
You can't hold an HTTP connection open for 5 minutes. You can't stream a response that takes 3 minutes to start generating. And if your serverless function times out at 60 seconds, your user gets nothing.
This is where durable execution matters.
Why Durable Execution Matters
A durable execution engine treats your function like a state machine. Each step is checkpointed. If the process crashes mid-way, it resumes from the last checkpoint, not from the beginning.
Here's the deep research function using Inngest:
```typescript
const deepResearch = inngest.createFunction(
  { id: "deep-research", retries: 3 },
  { event: "research/start" },
  async ({ event, step }) => {
    // Step 1: Generate sub-queries from the user's question
    const subQueries = await step.run("generate-queries", async () => {
      return generateSubQueries(event.data.question);
    });

    // Step 2: Execute each sub-query (parallelized)
    const searchResults = await Promise.all(
      subQueries.map((query, i) =>
        step.run(`search-${i}`, () => exa.searchAndContents(query))
      )
    );

    // Step 3: Synthesize into a report
    const report = await step.run("synthesize", async () => {
      return synthesizeReport(searchResults, event.data.question);
    });

    // Step 4: Store result and notify user
    await step.run("deliver", async () => {
      await db.insert(researchReports).values({
        projectId: event.data.projectId,
        content: report,
      });
      await sendNotification(event.data.userId, "Research complete");
    });

    return report;
  }
);
```
Each `step.run()` is a checkpoint. If `search-2` fails due to a rate limit, Inngest retries that step, not the entire function. The query-generation step and the searches that already completed don't re-execute. The user gets a complete report even if the infrastructure hiccuped three times along the way.
Why not BullMQ or a simple queue? Because queues give you at-most-once or at-least-once delivery, but they don't give you step-level checkpointing. If your worker crashes after completing 3 of 5 sub-queries, a queue restarts the entire job. Durable execution restarts from step 4.
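The mechanism behind step-level checkpointing can be boiled down to memoization keyed by step name. This is a toy sketch of the idea, not Inngest's actual implementation (which persists checkpoints durably and replays the function from the top on resume):

```typescript
// Toy illustration of step-level checkpointing — NOT Inngest's internals.
// Each step's result is memoized by name; a re-run after a crash replays
// completed steps from storage and only executes the ones that never finished.
type Checkpoints = Map<string, unknown>;

async function runStep<T>(
  checkpoints: Checkpoints,
  name: string,
  fn: () => Promise<T>
): Promise<T> {
  if (checkpoints.has(name)) {
    // Step already completed in a previous run: replay its result.
    return checkpoints.get(name) as T;
  }
  const result = await fn();
  checkpoints.set(name, result); // persist before moving on
  return result;
}
```

Run the same function twice against the same checkpoint store and the second run skips straight past every step that already succeeded. That's the property a plain queue cannot give you: the unit of retry is the step, not the job.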
The User Experience
From the user's perspective:
- They ask a complex question
- The chatbot says "I'll research this in depth. This may take a few minutes."
- The chatbot sends an event to Inngest, which starts the background function
- The user can close the browser, go make coffee, whatever
- When the research completes, they get a notification
- They open the app and find a grounded report with citations
No progress is ever lost. No research restarts from scratch. The result is always delivered.
The Part Nobody Warns You About: Mobile
I mentioned this is a distributed systems problem. Nowhere is that more apparent than on mobile.
On web, the Vercel AI SDK gives you useChat() with streaming, tool call rendering, and state management. It's good.
On mobile (React Native / Expo), you're mostly on your own. The ecosystem for agentic chatbots on mobile is immature. Here's what I had to build from scratch:
- Streaming handler. React Native doesn't have native ReadableStream support in all environments. You end up parsing SSE events manually.
- Tool call UI. When the agent calls a tool, you need to show a loading state specific to that tool ("Searching the web..."), then render the tool result inline. No library does this for you on mobile.
- Background task completion. When a deep research task finishes while the app is backgrounded, you need push notifications. This means hooking Inngest's completion event into your push notification service.
- Auth across platforms. The chat session needs to be authenticated, which means mobile auth tokens need to flow through to the same API that handles web sessions.
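The manual SSE parsing mentioned above is fiddly mostly because events can arrive split across network chunks. A minimal sketch of the approach (the function name and callback shape are mine; a production parser would also handle `event:` fields and CRLF line endings):

```typescript
// Minimal SSE parser for environments without ReadableStream support.
// Feed raw text chunks in as they arrive; parsed `data:` payloads come
// out via the callback. Per the SSE format, events end at a blank line.
function createSSEParser(onData: (payload: string) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    let sep: number;
    // Process every complete event ("\n\n"-terminated) in the buffer;
    // anything after the last terminator stays buffered for the next chunk.
    while ((sep = buffer.indexOf("\n\n")) !== -1) {
      const rawEvent = buffer.slice(0, sep);
      buffer = buffer.slice(sep + 2);
      for (const line of rawEvent.split("\n")) {
        if (line.startsWith("data:")) {
          onData(line.slice(5).trim());
        }
      }
    }
  };
}
```

The key detail is the buffer: a token can be cut in half mid-chunk by the transport, so you can only emit an event once you've seen its terminating blank line.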
The lesson: if you're planning to ship an agentic chatbot on both web and mobile, budget at least 3x the time you'd expect for the mobile portion.
What I'd Do Differently
After shipping this pattern multiple times:
- Start with durable execution from day one. Don't build a synchronous chatbot and bolt on background jobs later. Design for async from the start.
- Keep tools simple. Each tool should do one thing. Don't build a mega-tool that searches the web AND processes documents. Let the model compose simple tools.
- Test tool selection, not just tool execution. Write tests that verify: given this user message, does the model select the right tool? This catches regressions you won't find with unit tests.
- Stream partial progress for long tasks. Even if the full research takes 5 minutes, send periodic updates ("Found 3 relevant sources, synthesizing..."). Users tolerate waiting when they see progress.
The Stack
For reference, here's what I use:
| Layer | Tool | Why |
|---|---|---|
| Chat protocol | Vercel AI SDK | Best streaming + tool calling DX |
| LLM | Claude | Strong tool calling, long context |
| Web search | Exa API | Better relevance than Google Custom Search |
| Document extraction | Unstructured.io | Handles real-world PDFs |
| Embeddings storage | pgvector (Neon) | No separate vector DB to manage |
| Durable execution | Inngest | Step-level checkpointing, no Redis |
| API | Elysia | Type-safe, fast, composable |
| Web | TanStack Start | SSR + modern React |
Every piece is swappable. Don't like Neon? Use Supabase. Prefer LangChain over Vercel AI SDK? The architecture stays the same. The pattern is what matters, not the specific vendor.
If you're starting from scratch, my advice: get the durable execution layer right first. Everything else (streaming, tool calling, RAG) is well-documented. But the part where your AI tasks survive failures and always deliver results? That's what makes users trust your product.
Magnus Rødseth builds AI-native applications and is the creator of Eden Stack, a production-ready starter kit for AI-native SaaS.