DEV Community

Young Gao

Building AI-Ready Backends: Streaming, Tool Use, and LLM Integration Patterns (2026)

Every backend team is getting the same request: "add AI to it." Most teams bolt on an OpenAI call in a route handler and call it done. Then they hit streaming, timeouts, cost explosions, and hallucination-powered data corruption.

Here's how to build backends that integrate LLMs properly — with streaming, tool use, cost controls, and graceful degradation.

The Architecture Problem

LLM calls are fundamentally different from your typical API call:

| Traditional API | LLM API |
| --- | --- |
| 50–200 ms latency | 2–30 s latency |
| Deterministic output | Non-deterministic output |
| Fixed cost per call | Variable cost (per token) |
| Structured response | Unstructured text |
| Retry-safe | May produce different results |

If you treat an LLM call like a database query, you'll build a system that's slow, expensive, and unreliable. You need different patterns.

Pattern 1: Streaming Responses with SSE

Users stare at a blank screen for 10 seconds while your LLM generates a response. They leave. The fix: stream tokens as they arrive.

import { OpenAI } from "openai";
import { Router, Request, Response } from "express";

const openai = new OpenAI();
const router = Router();

router.post("/chat", async (req: Request, res: Response) => {
  const { messages } = req.body;

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  });

  try {
    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content;
      if (content) {
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
      }
    }
    res.write("data: [DONE]\n\n");
  } catch (err) {
    // A provider error mid-stream should end the response, not hang it
    res.write(`data: ${JSON.stringify({ error: "stream failed" })}\n\n`);
  } finally {
    res.end();
  }
});

Critical detail: Set a timeout on the SSE connection. LLM providers have outages. Without a timeout, your connection hangs forever.

const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30_000);

try {
  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  }, { signal: controller.signal });
  // ... stream handling
} finally {
  clearTimeout(timeout);
}
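On the client, the stream above can be consumed with `fetch` and a reader. A minimal sketch — `parseSSELines` is a hypothetical helper, and the buffering handles the fact that a network read can split an SSE event across chunks:

```typescript
// Extract the `content` fields from a buffer of complete `data: ...` events.
function parseSSELines(buffer: string): string[] {
  const contents: string[] = [];
  for (const line of buffer.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice(6);
    if (payload === "[DONE]") break;
    const { content } = JSON.parse(payload) as { content: string };
    contents.push(content);
  }
  return contents;
}

// Consume the /chat endpoint's SSE stream (browser or Node 18+ fetch).
async function streamChat(messages: unknown[]): Promise<string> {
  const res = await fetch("/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  let full = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split("\n\n");
    buffer = events.pop()!; // keep any partial event for the next read
    full += parseSSELines(events.join("\n\n") + "\n\n").join("");
  }
  return full;
}
```

In a real UI you would append each token to the DOM as it arrives instead of accumulating a single string.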

Pattern 2: Tool Use (Function Calling)

LLMs can't access your database. But they can tell you what to access. This is tool use — the LLM describes the function call, your backend executes it, and you feed the result back.

const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_order_status",
      description: "Look up the status of a customer order by ID",
      parameters: {
        type: "object",
        properties: {
          order_id: { type: "string", description: "The order ID" },
        },
        required: ["order_id"],
      },
    },
  },
];

// The tool execution map — YOUR code, not the LLM's
const toolHandlers: Record<string, (args: any) => Promise<any>> = {
  get_order_status: async ({ order_id }) => {
    const order = await db.orders.findById(order_id);
    if (!order) return { error: "Order not found" };
    return { status: order.status, eta: order.estimatedDelivery };
  },
};

async function chatWithTools(messages: any[], depth = 0): Promise<string | null> {
  // Guard against runaway tool-call loops
  if (depth > 5) throw new Error("Too many tool-call rounds");

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
  });

  const msg = response.choices[0].message;

  if (msg.tool_calls) {
    const toolResults = await Promise.all(
      msg.tool_calls.map(async (call) => {
        const handler = toolHandlers[call.function.name];
        if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
        const args = JSON.parse(call.function.arguments);
        const result = await handler(args);
        return {
          role: "tool" as const,
          tool_call_id: call.id,
          content: JSON.stringify(result),
        };
      })
    );

    // Feed results back and get final answer
    return chatWithTools([...messages, msg, ...toolResults], depth + 1);
  }

  return msg.content;
}

The security trap: Never let the LLM construct raw SQL or directly call arbitrary functions. Define a strict allowlist of tools. Validate every argument. The LLM is an untrusted input source — treat it like user input.
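Concretely, that validation can sit in front of the handler map, so nothing executes until the arguments pass a check you wrote. A dependency-free sketch (the Zod approach used later in this article works just as well); `validateToolArgs`, `argChecks`, and the regex are illustrative assumptions:

```typescript
// A check returns null when the arguments are acceptable, or an error message.
type ArgCheck = (args: Record<string, unknown>) => string | null;

const argChecks: Record<string, ArgCheck> = {
  get_order_status: (args) =>
    typeof args.order_id === "string" && /^[A-Za-z0-9_-]{1,64}$/.test(args.order_id)
      ? null
      : "order_id must be a short alphanumeric string",
};

// Reject unknown tools and invalid arguments before any handler runs.
function validateToolArgs(name: string, rawArgs: string): Record<string, unknown> {
  const check = argChecks[name];
  if (!check) throw new Error(`Tool not on allowlist: ${name}`);
  const args = JSON.parse(rawArgs) as Record<string, unknown>;
  const error = check(args);
  if (error) throw new Error(`Invalid arguments for ${name}: ${error}`);
  return args;
}
```

Call this on `call.function.arguments` before looking up the handler, and the allowlist and the validation live in one place.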

Pattern 3: Cost Control Middleware

One chatty user can burn through your entire monthly budget in a day. You need per-user, per-endpoint token budgets.

interface UsageRecord {
  tokensUsed: number;
  resetAt: number;
}

const usage = new Map<string, UsageRecord>();

function costGuard(maxTokensPerHour: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    const userId = req.user.id; // assumes auth middleware has populated req.user
    const now = Date.now();
    let record = usage.get(userId);

    if (!record || now > record.resetAt) {
      record = { tokensUsed: 0, resetAt: now + 3_600_000 };
      usage.set(userId, record);
    }

    if (record.tokensUsed >= maxTokensPerHour) {
      return res.status(429).json({
        error: "Token budget exceeded",
        resetAt: new Date(record.resetAt).toISOString(),
      });
    }

    // Attach usage tracker to request for post-response accounting
    res.on("finish", () => {
      const tokens = (res as any).__tokensUsed || 0;
      record!.tokensUsed += tokens;
    });

    next();
  };
}

router.post("/chat", costGuard(50_000), chatHandler);

For production, swap the in-memory Map with Redis and track costs in dollars, not just tokens. Different models have different pricing.
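Dollar-based accounting is a small helper on top of the token counts. A sketch — the per-million-token prices below are placeholders I've assumed for illustration, not current rates; always pull pricing from your provider's published table:

```typescript
// Placeholder USD prices per 1M tokens — replace with your provider's real rates.
const PRICE_PER_1M_TOKENS: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.5, output: 10 },        // assumed pricing
  "gpt-4o-mini": { input: 0.15, output: 0.6 }, // assumed pricing
};

// Convert a call's token usage into an estimated dollar cost.
function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICE_PER_1M_TOKENS[model];
  if (!price) throw new Error(`No pricing configured for model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}
```

Failing loudly on an unpriced model is deliberate: an unmetered model is exactly how surprise bills happen.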

Pattern 4: Structured Output Validation

LLMs return strings. Your backend needs structured data. Use Zod to validate LLM output and retry on failure.

import { z } from "zod";

const SentimentSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
  summary: z.string().max(200),
});

async function analyzeSentiment(
  text: string,
  retries = 2
): Promise<z.infer<typeof SentimentSchema>> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: `Analyze sentiment. Return JSON: {"sentiment": "positive"|"negative"|"neutral", "confidence": 0-1, "summary": "brief explanation"}`,
      },
      { role: "user", content: text },
    ],
  });

  let raw: unknown = null;
  try {
    raw = JSON.parse(response.choices[0].message.content!);
  } catch {
    // Malformed JSON counts as a validation failure and triggers the retry path
  }
  const result = SentimentSchema.safeParse(raw);

  if (result.success) return result.data;

  if (retries > 0) {
    console.warn("LLM output validation failed, retrying", result.error.flatten());
    return analyzeSentiment(text, retries - 1);
  }

  throw new Error("LLM output validation failed after retries");
}

JSON mode (the `response_format: { type: "json_object" }` option above), or structured outputs where your provider supports them, is essential. Without it, the LLM might return markdown-wrapped JSON, extra commentary, or partial objects.
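When JSON mode isn't available, or as a last line of defense in front of the schema check, a tolerant parser handles exactly those failure modes. A sketch; `extractJson` is a hypothetical helper:

```typescript
// Pull a JSON object out of LLM output that may be fenced or wrapped in prose.
function extractJson(raw: string): unknown {
  // Strip a ```json ... ``` markdown fence if present
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const candidate = fenced ? fenced[1] : raw;
  // Fall back to the first {...} span to drop leading/trailing commentary
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end === -1) throw new Error("No JSON object found");
  return JSON.parse(candidate.slice(start, end + 1));
}
```

Feed the result into `SentimentSchema.safeParse` as before; the schema remains the real gate.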

Pattern 5: Graceful Degradation

Your LLM provider goes down at 3 AM. Your entire product shouldn't go down with it.

async function smartSearch(query: string): Promise<SearchResult[]> {
  try {
    // Try AI-enhanced search first
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: query,
    });
    return await vectorDb.search(embedding.data[0].embedding, { limit: 10 });
  } catch (error) {
    console.error("AI search failed, falling back to text search", error);
    // Degrade to traditional full-text search
    return await db.products.search(query, { limit: 10 });
  }
}

Design every AI feature with a fallback. If the LLM is down, what does the user see? If the answer is "a broken page," you've built a fragile system.
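One way to make that decision systematic is a small circuit breaker: after repeated failures, skip the AI path entirely for a cooldown window instead of paying a timeout on every request. A minimal sketch; the threshold and cooldown values are illustrative:

```typescript
// After `threshold` consecutive primary failures, route straight to the
// fallback for `cooldownMs` before trying the primary path again.
class Breaker {
  private failures = 0;
  private openUntil = 0;
  constructor(private threshold = 3, private cooldownMs = 60_000) {}

  async run<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (Date.now() < this.openUntil) return fallback(); // breaker open: skip AI path
    try {
      const result = await primary();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      if (++this.failures >= this.threshold) {
        this.openUntil = Date.now() + this.cooldownMs;
        this.failures = 0;
      }
      return fallback();
    }
  }
}
```

Wrapping `smartSearch`'s two branches in `breaker.run(aiSearch, textSearch)` keeps latency flat during an outage instead of eating a 30-second timeout per request.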

Pattern 6: Prompt Caching and Deduplication

Same question, same answer, same cost? Cache it.

import crypto from "node:crypto";

function hashPrompt(messages: any[]): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(messages))
    .digest("hex");
}

const promptCache = new Map<string, { result: string; expiresAt: number }>();

async function cachedCompletion(messages: any[]): Promise<string> {
  const key = hashPrompt(messages);
  const cached = promptCache.get(key);

  if (cached && Date.now() < cached.expiresAt) {
    return cached.result;
  }

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
  });

  const result = response.choices[0].message.content!;
  promptCache.set(key, {
    result,
    expiresAt: Date.now() + 3_600_000, // 1 hour TTL
  });

  return result;
}

For production, use Redis with TTL. Consider semantic similarity caching for near-duplicate questions — but that adds complexity. Start simple.

Common Mistakes

1. No timeout on LLM calls. Provider outages happen weekly. Set a 30s abort signal on every call.

2. Passing raw user input as the system prompt. This enables prompt injection. Always separate system instructions from user content. Never interpolate user text into the system message.

3. Long-running LLM calls in request handlers. An awaited 10-second LLM call doesn't block Node's event loop, but it does tie up the connection, trip proxy timeouts, and leave the user staring at nothing. Use streaming or offload to a background job queue.

4. No cost tracking. You will get a surprise bill. Meter every request, set alerts, and implement per-user budgets from day one.

5. Trusting LLM output. The LLM will occasionally return malformed JSON, hallucinated data, or responses that don't match your schema. Always validate. Always have a retry strategy. Always have a fallback.

6. One model for everything. Use GPT-4o for complex reasoning, GPT-4o-mini for simple classification, and embedding models for search. Matching the model to the task can cut costs by 10-50x.
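That routing can be a one-liner rather than a framework. A sketch; the task names and model choices are illustrative assumptions:

```typescript
// Route each task type to the cheapest model that handles it well.
type Task = "reasoning" | "classification" | "embedding";

function modelFor(task: Task): string {
  switch (task) {
    case "reasoning":
      return "gpt-4o"; // complex multi-step work
    case "classification":
      return "gpt-4o-mini"; // cheap, fast labeling
    case "embedding":
      return "text-embedding-3-small"; // search and similarity
  }
}
```

Centralizing the choice in one function also means a model upgrade is a one-line change instead of a grep across the codebase.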

The Minimal Production Setup

If you're adding AI to an existing backend, start here:

  1. Streaming endpoint with SSE + timeout
  2. Structured output with Zod validation + retry
  3. Cost middleware with per-user token budgets
  4. Graceful degradation — every AI feature has a non-AI fallback
  5. Observability — log every LLM call with model, tokens, latency, and cost
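The observability item can be a thin wrapper around every LLM call. A sketch; the `usage` field names mirror what chat-completion responses typically report, and the default logger is an assumption — plug in your own:

```typescript
interface LlmCallLog {
  model: string;
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
}

// Run an LLM call and log its model, token counts, and latency in one place.
async function withLlmLogging<T>(
  model: string,
  call: () => Promise<{ result: T; usage: { prompt_tokens: number; completion_tokens: number } }>,
  log: (entry: LlmCallLog) => void = (e) => console.log(JSON.stringify(e))
): Promise<T> {
  const start = Date.now();
  const { result, usage } = await call();
  log({
    model,
    inputTokens: usage.prompt_tokens,
    outputTokens: usage.completion_tokens,
    latencyMs: Date.now() - start,
  });
  return result;
}
```

With every call funneled through one wrapper, adding cost estimation or alerting later is a single-file change.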

You don't need LangChain. You don't need a vector database (yet). You need solid engineering patterns applied to a new kind of external dependency.


Part of my Production Backend Patterns series. Follow for more practical backend engineering that ships.
