Every backend team is getting the same request: "add AI to it." Most teams bolt on an OpenAI call in a route handler and call it done. Then they hit streaming, timeouts, cost explosions, and hallucination-powered data corruption.
Here's how to build backends that integrate LLMs properly — with streaming, tool use, cost controls, and graceful degradation.
## The Architecture Problem
LLM calls are fundamentally different from your typical API call:
| Traditional API | LLM API |
|---|---|
| 50-200ms latency | 2-30 seconds latency |
| Deterministic output | Non-deterministic output |
| Fixed cost per call | Variable cost (by token) |
| Structured response | Unstructured text |
| Retry-safe | May produce different results |
If you treat an LLM call like a database query, you'll build a system that's slow, expensive, and unreliable. You need different patterns.
## Pattern 1: Streaming Responses with SSE
Users stare at a blank screen for 10 seconds while your LLM generates a response. They leave. The fix: stream tokens as they arrive.
```typescript
import { OpenAI } from "openai";
import { Router, Request, Response } from "express";

const openai = new OpenAI();
const router = Router();

router.post("/chat", async (req: Request, res: Response) => {
  const { messages } = req.body;

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
  }

  res.write("data: [DONE]\n\n");
  res.end();
});
```
**Critical detail:** Set a timeout on the SSE connection. LLM providers have outages. Without a timeout, your connection hangs forever.
```typescript
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30_000);

try {
  const stream = await openai.chat.completions.create(
    {
      model: "gpt-4o",
      messages,
      stream: true,
    },
    { signal: controller.signal }
  );
  // ... stream handling
} finally {
  clearTimeout(timeout);
}
```
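On the client side, you read this stream with `fetch` and a `ReadableStream` reader; the fiddly part is splitting the decoded text back into `data:` events. Here is a minimal parser for the frame format the endpoint above emits (`parseSseChunk` is my name, not a library function):

```typescript
// Parses decoded SSE text into the content strings emitted by the /chat endpoint.
// Handles `data: {...}` frames and the final `data: [DONE]` sentinel.
function parseSseChunk(raw: string): { contents: string[]; done: boolean } {
  const contents: string[] = [];
  let done = false;
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank lines and comments
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") {
      done = true;
      continue;
    }
    contents.push(JSON.parse(payload).content);
  }
  return { contents, done };
}
```

A real client would accumulate bytes from `response.body.getReader()`, decode with `TextDecoder`, and buffer partial lines across chunks before feeding complete lines through a parser like this.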
## Pattern 2: Tool Use (Function Calling)
LLMs can't access your database. But they can tell you what to access. This is tool use — the LLM describes the function call, your backend executes it, and you feed the result back.
```typescript
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_order_status",
      description: "Look up the status of a customer order by ID",
      parameters: {
        type: "object",
        properties: {
          order_id: { type: "string", description: "The order ID" },
        },
        required: ["order_id"],
      },
    },
  },
];

// The tool execution map — YOUR code, not the LLM's
const toolHandlers: Record<string, (args: any) => Promise<any>> = {
  get_order_status: async ({ order_id }) => {
    const order = await db.orders.findById(order_id);
    if (!order) return { error: "Order not found" };
    return { status: order.status, eta: order.estimatedDelivery };
  },
};

async function chatWithTools(messages: any[]) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
  });

  const msg = response.choices[0].message;

  if (msg.tool_calls) {
    const toolResults = await Promise.all(
      msg.tool_calls.map(async (call) => {
        const handler = toolHandlers[call.function.name];
        if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
        const args = JSON.parse(call.function.arguments);
        const result = await handler(args);
        return {
          role: "tool" as const,
          tool_call_id: call.id,
          content: JSON.stringify(result),
        };
      })
    );

    // Feed results back and get final answer
    return chatWithTools([...messages, msg, ...toolResults]);
  }

  return msg.content;
}
```
**The security trap:** Never let the LLM construct raw SQL or directly call arbitrary functions. Define a strict allowlist of tools. Validate every argument. The LLM is an untrusted input source — treat it like user input.
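One way to enforce that allowlist and argument validation, sketched with plain runtime checks (you could equally reuse the Zod approach from Pattern 4; `validateToolCall` and its rules are my names, not part of the OpenAI SDK):

```typescript
// Per-tool argument validators: the allowlist IS the keys of this object.
// Each validator returns an error message, or null when the args are valid.
const argValidators: Record<string, (args: unknown) => string | null> = {
  get_order_status: (args) => {
    if (typeof args !== "object" || args === null) return "args must be an object";
    const { order_id } = args as Record<string, unknown>;
    if (typeof order_id !== "string" || !/^[A-Za-z0-9_-]{1,64}$/.test(order_id)) {
      return "order_id must be a short alphanumeric string";
    }
    return null;
  },
};

// Returns parsed, validated args, or throws before any handler can run.
function validateToolCall(name: string, rawArguments: string): Record<string, unknown> {
  const validator = argValidators[name];
  if (!validator) throw new Error(`Tool not in allowlist: ${name}`);
  let args: unknown;
  try {
    args = JSON.parse(rawArguments);
  } catch {
    throw new Error(`Tool ${name}: arguments are not valid JSON`);
  }
  const problem = validator(args);
  if (problem) throw new Error(`Tool ${name}: ${problem}`);
  return args as Record<string, unknown>;
}
```

Call this on `call.function.name` and `call.function.arguments` before looking up the handler, so a hallucinated tool name or an injection-shaped argument fails closed.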
## Pattern 3: Cost Control Middleware
One chatty user can burn through your entire monthly budget in a day. You need per-user, per-endpoint token budgets.
```typescript
import { Request, Response, NextFunction } from "express";

interface UsageRecord {
  tokensUsed: number;
  resetAt: number;
}

const usage = new Map<string, UsageRecord>();

function costGuard(maxTokensPerHour: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    // Assumes an auth middleware has already populated req.user
    const userId = req.user.id;
    const now = Date.now();

    let record = usage.get(userId);
    if (!record || now > record.resetAt) {
      record = { tokensUsed: 0, resetAt: now + 3_600_000 };
      usage.set(userId, record);
    }

    if (record.tokensUsed >= maxTokensPerHour) {
      return res.status(429).json({
        error: "Token budget exceeded",
        resetAt: new Date(record.resetAt).toISOString(),
      });
    }

    // Post-response accounting: the chat handler is expected to stash
    // the token count it observed on the response object
    res.on("finish", () => {
      const tokens = (res as any).__tokensUsed || 0;
      record!.tokensUsed += tokens;
    });

    next();
  };
}

router.post("/chat", costGuard(50_000), chatHandler);
```
For production, swap the in-memory Map with Redis and track costs in dollars, not just tokens. Different models have different pricing.
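Converting tokens to dollars is a small pure function over a pricing table. The per-million-token prices below are illustrative placeholders, not current rates — check your provider's pricing page before using real numbers:

```typescript
// Illustrative per-million-token prices in USD — verify against current pricing.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "gpt-4o": { inputPerM: 2.5, outputPerM: 10 },
  "gpt-4o-mini": { inputPerM: 0.15, outputPerM: 0.6 },
};

// Estimates the dollar cost of one completion from its token counts.
function estimateCostUsd(
  model: string,
  promptTokens: number,
  completionTokens: number
): number {
  const price = PRICING[model];
  if (!price) throw new Error(`No pricing entry for model: ${model}`);
  return (
    (promptTokens / 1e6) * price.inputPerM +
    (completionTokens / 1e6) * price.outputPerM
  );
}
```

In the middleware above, you would feed `response.usage.prompt_tokens` and `response.usage.completion_tokens` through this and accumulate dollars per user instead of raw tokens.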
## Pattern 4: Structured Output Validation
LLMs return strings. Your backend needs structured data. Use Zod to validate LLM output and retry on failure.
```typescript
import { z } from "zod";

const SentimentSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
  summary: z.string().max(200),
});

async function analyzeSentiment(
  text: string,
  retries = 2
): Promise<z.infer<typeof SentimentSchema>> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: `Analyze sentiment. Return JSON: {"sentiment": "positive"|"negative"|"neutral", "confidence": 0-1, "summary": "brief explanation"}`,
      },
      { role: "user", content: text },
    ],
  });

  try {
    // JSON.parse can throw on malformed output — that must also trigger a retry
    const raw = JSON.parse(response.choices[0].message.content!);
    const result = SentimentSchema.safeParse(raw);
    if (result.success) return result.data;
    throw new Error(JSON.stringify(result.error.flatten()));
  } catch (err) {
    if (retries > 0) {
      console.warn("LLM output validation failed, retrying", err);
      return analyzeSentiment(text, retries - 1);
    }
    throw new Error("LLM output validation failed after retries");
  }
}
```
JSON mode (`response_format: { type: "json_object" }`), or structured outputs if your provider supports them, is essential. Without it, the LLM might return markdown-wrapped JSON, extra commentary, or partial objects.
## Pattern 5: Graceful Degradation
Your LLM provider goes down at 3 AM. Your entire product shouldn't go down with it.
```typescript
async function smartSearch(query: string): Promise<SearchResult[]> {
  try {
    // Try AI-enhanced search first
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: query,
    });
    return await vectorDb.search(embedding.data[0].embedding, { limit: 10 });
  } catch (error) {
    console.error("AI search failed, falling back to text search", error);
    // Degrade to traditional full-text search
    return await db.products.search(query, { limit: 10 });
  }
}
```
Design every AI feature with a fallback. If the LLM is down, what does the user see? If the answer is "a broken page," you've built a fragile system.
## Pattern 6: Prompt Caching and Deduplication
Same question, same answer, same cost? Cache it.
```typescript
import crypto from "node:crypto";

function hashPrompt(messages: any[]): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(messages))
    .digest("hex");
}

const promptCache = new Map<string, { result: string; expiresAt: number }>();

async function cachedCompletion(messages: any[]): Promise<string> {
  const key = hashPrompt(messages);
  const cached = promptCache.get(key);
  if (cached && Date.now() < cached.expiresAt) {
    return cached.result;
  }

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
  });
  const result = response.choices[0].message.content!;

  promptCache.set(key, {
    result,
    expiresAt: Date.now() + 3_600_000, // 1 hour TTL
  });

  return result;
}
```
For production, use Redis with TTL. Consider semantic similarity caching for near-duplicate questions — but that adds complexity. Start simple.
## Common Mistakes
1. No timeout on LLM calls. Provider outages happen weekly. Set a 30s abort signal on every call.
2. Passing raw user input as the system prompt. This enables prompt injection. Always separate system instructions from user content. Never interpolate user text into the system message.
3. Awaiting long LLM calls inside request handlers. A 10-second awaited call doesn't block the Node.js event loop, but it pins the request open, ties up connections and memory, and wrecks tail latency. Use streaming or offload to a background job queue.
4. No cost tracking. You will get a surprise bill. Meter every request, set alerts, and implement per-user budgets from day one.
5. Trusting LLM output. The LLM will occasionally return malformed JSON, hallucinated data, or responses that don't match your schema. Always validate. Always have a retry strategy. Always have a fallback.
6. One model for everything. Use GPT-4o for complex reasoning, GPT-4o-mini for simple classification, and embeddings models for search. Matching the model to the task cuts costs 10-50x.
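Mistake 6 is cheap to avoid with an explicit routing table. The task categories and model choices below are illustrative assumptions to tune for your workload, not a canonical mapping:

```typescript
type TaskKind = "complex_reasoning" | "classification" | "embedding";

// Illustrative routing table: expensive model only where it earns its cost.
const MODEL_FOR_TASK: Record<TaskKind, string> = {
  complex_reasoning: "gpt-4o",
  classification: "gpt-4o-mini",
  embedding: "text-embedding-3-small",
};

function pickModel(task: TaskKind): string {
  return MODEL_FOR_TASK[task];
}
```

Routing through one function also gives you a single place to log which model served which task, which feeds directly into the cost tracking from Pattern 3.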
## The Minimal Production Setup
If you're adding AI to an existing backend, start here:
- Streaming endpoint with SSE + timeout
- Structured output with Zod validation + retry
- Cost middleware with per-user token budgets
- Graceful degradation — every AI feature has a non-AI fallback
- Observability — log every LLM call with model, tokens, latency, and cost
You don't need LangChain. You don't need a vector database (yet). You need solid engineering patterns applied to a new kind of external dependency.
Part of my Production Backend Patterns series. Follow for more practical backend engineering that ships.