Every backend team is getting the same request: "add AI to it." Most teams bolt on an OpenAI call in a route handler and call it done. Then they hit streaming, timeouts, cost explosions, and hallucination-powered data corruption.
Here's how to build backends that integrate LLMs properly — with streaming, tool use, cost controls, and graceful degradation.
## The Architecture Problem
LLM calls are fundamentally different from your typical API call:
| Traditional API | LLM API |
|---|---|
| 50-200ms latency | 2-30 seconds latency |
| Deterministic output | Non-deterministic output |
| Fixed cost per call | Variable cost (by token) |
| Structured response | Unstructured text |
| Retry-safe | May produce different results |
If you treat an LLM call like a database query, you'll build a system that's slow, expensive, and unreliable. You need different patterns.
## Pattern 1: Streaming Responses with SSE
Users stare at a blank screen for 10 seconds while your LLM generates a response. They leave. The fix: stream tokens as they arrive.
```typescript
import { OpenAI } from "openai";
import { Router, Request, Response } from "express";

const openai = new OpenAI();
const router = Router();

router.post("/chat", async (req: Request, res: Response) => {
  const { messages } = req.body;

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const stream = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    stream: true,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) {
      res.write(`data: ${JSON.stringify({ content })}\n\n`);
    }
  }

  res.write("data: [DONE]\n\n");
  res.end();
});
```
**Critical detail:** Set a timeout on the SSE connection. LLM providers have outages. Without a timeout, your connection hangs forever.
```typescript
const controller = new AbortController();
const timeout = setTimeout(() => controller.abort(), 30_000);

try {
  const stream = await openai.chat.completions.create(
    {
      model: "gpt-4o",
      messages,
      stream: true,
    },
    { signal: controller.signal }
  );
  // ... stream handling
} finally {
  clearTimeout(timeout);
}
```
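On the client side, you read this stream with `fetch` and a `ReadableStream` reader; the fiddly part is splitting the decoded text back into `data:` events. Here is a minimal parser for the frame format the endpoint above emits (`parseSseChunk` is my name, not a library function):

```typescript
// Parses decoded SSE text into the content strings emitted by the /chat endpoint.
// Handles `data: {...}` frames and the final `data: [DONE]` sentinel.
function parseSseChunk(raw: string): { contents: string[]; done: boolean } {
  const contents: string[] = [];
  let done = false;
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data: ")) continue; // skip blank lines and comments
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") {
      done = true;
      continue;
    }
    contents.push(JSON.parse(payload).content);
  }
  return { contents, done };
}
```

A real client would accumulate bytes from `response.body.getReader()`, decode with `TextDecoder`, and buffer partial lines across chunks before feeding complete lines through a parser like this.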
## Pattern 2: Tool Use (Function Calling)
LLMs can't access your database. But they can tell you what to access. This is tool use — the LLM describes the function call, your backend executes it, and you feed the result back.
```typescript
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_order_status",
      description: "Look up the status of a customer order by ID",
      parameters: {
        type: "object",
        properties: {
          order_id: { type: "string", description: "The order ID" },
        },
        required: ["order_id"],
      },
    },
  },
];

// The tool execution map — YOUR code, not the LLM's
const toolHandlers: Record<string, (args: any) => Promise<any>> = {
  get_order_status: async ({ order_id }) => {
    const order = await db.orders.findById(order_id);
    if (!order) return { error: "Order not found" };
    return { status: order.status, eta: order.estimatedDelivery };
  },
};

async function chatWithTools(messages: any[]) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools,
  });

  const msg = response.choices[0].message;

  if (msg.tool_calls) {
    const toolResults = await Promise.all(
      msg.tool_calls.map(async (call) => {
        const handler = toolHandlers[call.function.name];
        if (!handler) throw new Error(`Unknown tool: ${call.function.name}`);
        const args = JSON.parse(call.function.arguments);
        const result = await handler(args);
        return {
          role: "tool" as const,
          tool_call_id: call.id,
          content: JSON.stringify(result),
        };
      })
    );

    // Feed results back and get final answer
    return chatWithTools([...messages, msg, ...toolResults]);
  }

  return msg.content;
}
```
**The security trap:** Never let the LLM construct raw SQL or directly call arbitrary functions. Define a strict allowlist of tools. Validate every argument. The LLM is an untrusted input source — treat it like user input.
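One way to enforce that allowlist and argument validation, sketched with plain runtime checks (you could equally reuse the Zod approach from Pattern 4; `validateToolCall` and its rules are my names, not part of the OpenAI SDK):

```typescript
// Per-tool argument validators: the allowlist IS the keys of this object.
// Each validator returns an error message, or null when the args are valid.
const argValidators: Record<string, (args: unknown) => string | null> = {
  get_order_status: (args) => {
    if (typeof args !== "object" || args === null) return "args must be an object";
    const { order_id } = args as Record<string, unknown>;
    if (typeof order_id !== "string" || !/^[A-Za-z0-9_-]{1,64}$/.test(order_id)) {
      return "order_id must be a short alphanumeric string";
    }
    return null;
  },
};

// Returns parsed, validated args, or throws before any handler can run.
function validateToolCall(name: string, rawArguments: string): Record<string, unknown> {
  const validator = argValidators[name];
  if (!validator) throw new Error(`Tool not in allowlist: ${name}`);
  let args: unknown;
  try {
    args = JSON.parse(rawArguments);
  } catch {
    throw new Error(`Tool ${name}: arguments are not valid JSON`);
  }
  const problem = validator(args);
  if (problem) throw new Error(`Tool ${name}: ${problem}`);
  return args as Record<string, unknown>;
}
```

Call this on `call.function.name` and `call.function.arguments` before looking up the handler, so a hallucinated tool name or an injection-shaped argument fails closed.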
## Pattern 3: Cost Control Middleware
One chatty user can burn through your entire monthly budget in a day. You need per-user, per-endpoint token budgets.
```typescript
import { Request, Response, NextFunction } from "express";

interface UsageRecord {
  tokensUsed: number;
  resetAt: number;
}

const usage = new Map<string, UsageRecord>();

function costGuard(maxTokensPerHour: number) {
  return async (req: Request, res: Response, next: NextFunction) => {
    // Assumes an auth middleware has already populated req.user
    const userId = req.user.id;
    const now = Date.now();

    let record = usage.get(userId);
    if (!record || now > record.resetAt) {
      record = { tokensUsed: 0, resetAt: now + 3_600_000 };
      usage.set(userId, record);
    }

    if (record.tokensUsed >= maxTokensPerHour) {
      return res.status(429).json({
        error: "Token budget exceeded",
        resetAt: new Date(record.resetAt).toISOString(),
      });
    }

    // Post-response accounting: the chat handler is expected to stash
    // the token count it observed on the response object
    res.on("finish", () => {
      const tokens = (res as any).__tokensUsed || 0;
      record!.tokensUsed += tokens;
    });

    next();
  };
}

router.post("/chat", costGuard(50_000), chatHandler);
```
For production, swap the in-memory Map with Redis and track costs in dollars, not just tokens. Different models have different pricing.
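Converting tokens to dollars is a small pure function over a pricing table. The per-million-token prices below are illustrative placeholders, not current rates — check your provider's pricing page before using real numbers:

```typescript
// Illustrative per-million-token prices in USD — verify against current pricing.
const PRICING: Record<string, { inputPerM: number; outputPerM: number }> = {
  "gpt-4o": { inputPerM: 2.5, outputPerM: 10 },
  "gpt-4o-mini": { inputPerM: 0.15, outputPerM: 0.6 },
};

// Estimates the dollar cost of one completion from its token counts.
function estimateCostUsd(
  model: string,
  promptTokens: number,
  completionTokens: number
): number {
  const price = PRICING[model];
  if (!price) throw new Error(`No pricing entry for model: ${model}`);
  return (
    (promptTokens / 1e6) * price.inputPerM +
    (completionTokens / 1e6) * price.outputPerM
  );
}
```

In the middleware above, you would feed `response.usage.prompt_tokens` and `response.usage.completion_tokens` through this and accumulate dollars per user instead of raw tokens.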
## Pattern 4: Structured Output Validation
LLMs return strings. Your backend needs structured data. Use Zod to validate LLM output and retry on failure.
```typescript
import { z } from "zod";

const SentimentSchema = z.object({
  sentiment: z.enum(["positive", "negative", "neutral"]),
  confidence: z.number().min(0).max(1),
  summary: z.string().max(200),
});

async function analyzeSentiment(
  text: string,
  retries = 2
): Promise<z.infer<typeof SentimentSchema>> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content: `Analyze sentiment. Return JSON: {"sentiment": "positive"|"negative"|"neutral", "confidence": 0-1, "summary": "brief explanation"}`,
      },
      { role: "user", content: text },
    ],
  });

  try {
    // JSON.parse can throw on malformed output — that must also trigger a retry
    const raw = JSON.parse(response.choices[0].message.content!);
    const result = SentimentSchema.safeParse(raw);
    if (result.success) return result.data;
    throw new Error(JSON.stringify(result.error.flatten()));
  } catch (err) {
    if (retries > 0) {
      console.warn("LLM output validation failed, retrying", err);
      return analyzeSentiment(text, retries - 1);
    }
    throw new Error("LLM output validation failed after retries");
  }
}
```
JSON mode (`response_format: { type: "json_object" }`), or structured outputs if your provider supports them, is essential. Without it, the LLM might return markdown-wrapped JSON, extra commentary, or partial objects.
## Pattern 5: Graceful Degradation
Your LLM provider goes down at 3 AM. Your entire product shouldn't go down with it.
```typescript
async function smartSearch(query: string): Promise<SearchResult[]> {
  try {
    // Try AI-enhanced search first
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: query,
    });
    return await vectorDb.search(embedding.data[0].embedding, { limit: 10 });
  } catch (error) {
    console.error("AI search failed, falling back to text search", error);
    // Degrade to traditional full-text search
    return await db.products.search(query, { limit: 10 });
  }
}
```
Design every AI feature with a fallback. If the LLM is down, what does the user see? If the answer is "a broken page," you've built a fragile system.
## Pattern 6: Prompt Caching and Deduplication
Same question, same answer, same cost? Cache it.
```typescript
import crypto from "node:crypto";

function hashPrompt(messages: any[]): string {
  return crypto
    .createHash("sha256")
    .update(JSON.stringify(messages))
    .digest("hex");
}

const promptCache = new Map<string, { result: string; expiresAt: number }>();

async function cachedCompletion(messages: any[]): Promise<string> {
  const key = hashPrompt(messages);
  const cached = promptCache.get(key);
  if (cached && Date.now() < cached.expiresAt) {
    return cached.result;
  }

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
  });
  const result = response.choices[0].message.content!;

  promptCache.set(key, {
    result,
    expiresAt: Date.now() + 3_600_000, // 1 hour TTL
  });

  return result;
}
```
For production, use Redis with TTL. Consider semantic similarity caching for near-duplicate questions — but that adds complexity. Start simple.
## Common Mistakes
1. No timeout on LLM calls. Provider outages happen weekly. Set a 30s abort signal on every call.
2. Passing raw user input as the system prompt. This enables prompt injection. Always separate system instructions from user content. Never interpolate user text into the system message.
3. Awaiting long LLM calls inside request handlers. A 10-second awaited call doesn't block the Node.js event loop, but it pins the request open, ties up connections and memory, and wrecks tail latency. Use streaming or offload to a background job queue.
4. No cost tracking. You will get a surprise bill. Meter every request, set alerts, and implement per-user budgets from day one.
5. Trusting LLM output. The LLM will occasionally return malformed JSON, hallucinated data, or responses that don't match your schema. Always validate. Always have a retry strategy. Always have a fallback.
6. One model for everything. Use GPT-4o for complex reasoning, GPT-4o-mini for simple classification, and embeddings models for search. Matching the model to the task cuts costs 10-50x.
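Mistake 6 is cheap to avoid with an explicit routing table. The task categories and model choices below are illustrative assumptions to tune for your workload, not a canonical mapping:

```typescript
type TaskKind = "complex_reasoning" | "classification" | "embedding";

// Illustrative routing table: expensive model only where it earns its cost.
const MODEL_FOR_TASK: Record<TaskKind, string> = {
  complex_reasoning: "gpt-4o",
  classification: "gpt-4o-mini",
  embedding: "text-embedding-3-small",
};

function pickModel(task: TaskKind): string {
  return MODEL_FOR_TASK[task];
}
```

Routing through one function also gives you a single place to log which model served which task, which feeds directly into the cost tracking from Pattern 3.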
## The Minimal Production Setup
If you're adding AI to an existing backend, start here:
- Streaming endpoint with SSE + timeout
- Structured output with Zod validation + retry
- Cost middleware with per-user token budgets
- Graceful degradation — every AI feature has a non-AI fallback
- Observability — log every LLM call with model, tokens, latency, and cost
You don't need LangChain. You don't need a vector database (yet). You need solid engineering patterns applied to a new kind of external dependency.
Part of my Production Backend Patterns series. Follow for more practical backend engineering that ships.