Harshdeep Singh

Posted on Jun 2 • Originally published at theharshdeepsingh.com

How to Integrate the OpenAI API into a Production Express App

#openai #express #node #ai

Last year I helped a startup integrate the OpenAI API into their product. It was a chat feature — users could ask questions about their data and get natural language answers. The integration took about a day. Three days after launch, the founder messaged me: "Hey, something's wrong. Our AWS bill just showed an unexpected charge."

It was $340. For three days. They had 60 users.

The issue wasn't a bug. It was that production API usage looks nothing like a tutorial. The tutorial shows you a call to openai.chat.completions.create() and the response that comes back. It doesn't show you what happens when users send 500-token messages, when they open 15 browser tabs that each maintain their own chat context, or when one user fires 30 requests per minute because they think the app is broken.

This guide covers what the tutorials skip: rate limiting, token counting, cost guards, streaming, error handling with retries, and model selection. These aren't optional additions — they're what separates a demo from a production feature.

Why Production Is Different

Here's the gap between tutorial code and production code, stated plainly:

| Concern | Tutorial Code | Production Code |
| --- | --- | --- |
| Cost control | Not mentioned | Token counting, spending limits, model selection by task |
| Rate limiting | Not mentioned | Per-user and per-IP limits to prevent abuse |
| Error handling | try/catch that logs to console | Typed errors, retries with backoff, user-facing messages |
| Response delivery | Wait for full completion, return at once | Streaming via SSE, so the response appears as it generates |
| Context management | Each request is independent | Conversation history managed, truncated at token limit |
| Secrets management | API key hardcoded or in .env (no rotation) | Rotation strategy, usage monitoring, per-feature keys |

Let's build a production-grade Express API that addresses all of this. We'll go layer by layer.

The Architecture

┌─────────────────────────────────────────────────────────┐
│  CLIENT (Browser / Mobile)                              │
│  POST /api/chat { messages: [...] }                    │
│  POST /api/chat/stream (SSE)                            │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  EXPRESS MIDDLEWARE STACK                               │
│                                                         │
│  1. auth middleware     (verify user session)           │
│  2. express-rate-limit  (15 req/min per user or IP)     │
│  3. tokenGuard()        (reject if > 4,000 tokens)      │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  ROUTE HANDLER                                          │
│                                                         │
│  Select model by task type                              │
│  Build messages array from context                      │
│  Call openai.chat.completions.create()                  │
│  Stream or return response                              │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  OPENAI API                                             │
│  Model: gpt-4o-mini (default) / gpt-4o (complex tasks)  │
└─────────────────────────────────────────────────────────┘

Project Setup

mkdir express-openai && cd express-openai
npm init -y
npm pkg set type=module   # the code in this guide uses ES module import syntax
npm install express openai express-rate-limit tiktoken dotenv
npm install --save-dev nodemon

# .env
OPENAI_API_KEY=sk-proj-your-key-here
PORT=3001

Step 1: The OpenAI Client (Configured for Production)

Don't instantiate the OpenAI client inside route handlers. Create it once, configure it for production, and export it:

// src/openaiClient.js
import "dotenv/config"; // load OPENAI_API_KEY from .env before the client reads it
import OpenAI from "openai";

export const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 3,     // retry on transient failures (rate limits, timeouts)
  timeout: 30_000,   // 30 second timeout — don't hang forever
});

// Model selection by task complexity
export const MODELS = {
  fast: "gpt-4o-mini",   // classification, simple Q&A, summarization
  smart: "gpt-4o",        // complex reasoning, code generation, analysis
};

The maxRetries and timeout settings matter. The SDK's defaults are 2 retries and a 10-minute timeout, so an unconfigured client can keep your Express server's response object open for minutes while a hung OpenAI request plays out. If you're running on a serverless function, you pay for that idle time.
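One consequence worth working out: the timeout applies to each attempt rather than to the whole call, so retries multiply the worst-case wait. Below is a rough sketch of that arithmetic. It assumes every attempt runs to the full timeout and that the delay between attempts doubles up to a cap; the 500 ms base and 8 s cap are illustrative assumptions, and worstCaseDurationMs is a hypothetical helper, not part of the SDK.

```javascript
// Upper bound on how long one SDK call can take when every attempt times out.
// baseDelayMs and maxDelayMs are assumed backoff parameters for illustration.
function worstCaseDurationMs({ timeoutMs, maxRetries, baseDelayMs = 500, maxDelayMs = 8_000 }) {
  const attempts = maxRetries + 1; // the first try plus each retry
  let backoffMs = 0;
  for (let retry = 0; retry < maxRetries; retry++) {
    // the delay doubles before each retry, up to the cap
    backoffMs += Math.min(baseDelayMs * 2 ** retry, maxDelayMs);
  }
  return attempts * timeoutMs + backoffMs;
}

console.log(worstCaseDurationMs({ timeoutMs: 30_000, maxRetries: 3 })); // 123500
```

Under these assumptions the settings above allow roughly two minutes per call. If a proxy or load balancer in front of the app has a shorter idle timeout, it will cut the request off before the SDK gives up, so lower timeout or maxRetries to fit.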

Step 2: Token Counting and Cost Guard

The tiktoken npm package is a JavaScript/WASM port of OpenAI's tokenizer, so its counts for message text match what the API sees; the per-message overhead added below is an approximation. Use it to reject requests before they hit the API:

// src/tokenCounter.js
import { encoding_for_model } from "tiktoken";

export function countMessageTokens(messages, model = "gpt-4o-mini") {
  const enc = encoding_for_model(model);
  try {
    let totalTokens = 3; // every reply is primed with a few tokens
    for (const message of messages) {
      // Approximate framing overhead per message (role/content delimiters).
      // Erring slightly high is the safe direction for a guard.
      totalTokens += 5;
      // Only strings can be encoded; non-string content (e.g. image parts) is skipped
      if (typeof message.role === "string") totalTokens += enc.encode(message.role).length;
      if (typeof message.content === "string") totalTokens += enc.encode(message.content).length;
    }
    return totalTokens;
  } finally {
    enc.free(); // tiktoken's WASM encoder needs explicit cleanup, even if encode() throws
  }
}

// Express middleware — rejects requests over the token limit
export function tokenGuard(maxInputTokens = 4_000) {
  return (req, res, next) => {
    const messages = req.body?.messages;

    if (!Array.isArray(messages)) {
      return res.status(400).json({ error: "messages must be an array" });
    }

    const tokenCount = countMessageTokens(messages);

    if (tokenCount > maxInputTokens) {
      return res.status(400).json({
        error: `Message too long: ${tokenCount} tokens (limit: ${maxInputTokens}). Shorten your message or clear the conversation.`,
        tokenCount,
        limit: maxInputTokens,
      });
    }

    req.tokenCount = tokenCount; // pass downstream for logging
    next();
  };
}

A note on the limit: GPT-4o-mini's context window is 128K tokens, so 4,000 is conservative. But conservative is good here — a user who sends 30,000 tokens in one request is either doing something unusual or has a bug in their client. Reject it, log it, and let them know to clear their context.
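Rejecting is the simplest policy, but for a long-running chat the alternative is to trim the history so it fits. Here is a minimal sketch of that idea, assuming you want to keep any system messages and then as many of the most recent turns as the budget allows. truncateHistory and roughCount are hypothetical names; in the real app you would pass countMessageTokens as the counter instead of the rough stand-in used here.

```javascript
// Keep system messages plus the newest turns that fit within maxTokens.
// countTokens is injected so this stays testable without tiktoken.
function truncateHistory(messages, maxTokens, countTokens) {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");

  const kept = [];
  let used = countTokens(system);

  // Walk backwards so the most recent messages are kept first
  for (let i = rest.length - 1; i >= 0; i--) {
    const cost = countTokens([rest[i]]);
    if (used + cost > maxTokens) break;
    kept.unshift(rest[i]);
    used += cost;
  }

  return [...system, ...kept];
}

// Rough stand-in counter for the demo: about 4 characters per token plus overhead
const roughCount = (msgs) =>
  msgs.reduce((sum, m) => sum + Math.ceil(m.content.length / 4) + 4, 0);

const conversation = [
  { role: "system", content: "You answer questions about the user's data." },
  { role: "user", content: "x".repeat(400) },
  { role: "assistant", content: "y".repeat(400) },
  { role: "user", content: "What was total revenue in May?" },
];

console.log(truncateHistory(conversation, 150, roughCount).map((m) => m.role));
// [ 'system', 'assistant', 'user' ]
```

The oldest user message is dropped because including it would exceed the 150-token budget. Whether silently dropping old turns is acceptable depends on the product; some apps summarize the dropped turns instead.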

Step 3: Rate Limiting

One user shouldn't be able to drain your API budget or trigger OpenAI rate limits for everyone else. Add rate limiting before the AI routes:

// src/middleware/rateLimiter.js
import rateLimit from "express-rate-limit";

export const aiRateLimiter = rateLimit({
  windowMs: 60 * 1000,  // 1-minute window
  max: 15,               // 15 requests per minute per IP
  standardHeaders: true, // return RateLimit headers
  legacyHeaders: false,
  message: {
    error: "Too many requests. Please wait a moment before trying again.",
    retryAfter: 60,
  },
  keyGenerator: (req) => {
    // Use authenticated user ID if available, otherwise fall back to IP.
    // req.user is only populated if your auth middleware runs before this limiter.
    return req.user?.id || req.ip;
  },
});

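// Stricter limit for expensive models, keyed the same way as aiRateLimiter
export const smartModelLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 5,
  standardHeaders: true,
  legacyHeaders: false,
  message: { error: "Too many complex requests. Rate limited for 60 seconds." },
  keyGenerator: (req) => req.user?.id || req.ip,
});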
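A request-rate limit bounds how often a user can call the API, but not how much those calls cost in total. A complementary guard is a per-user spending cap. The sketch below is a minimal in-memory version, assuming a daily limit in USD and a cost figure derived from the usage data the API returns. createBudgetTracker and its methods are hypothetical names, and a real deployment would keep this state in a shared store such as Redis rather than in process memory.

```javascript
// In-memory daily spend tracker, keyed by user ID.
// Illustrative only: state is lost on restart and not shared across processes.
function createBudgetTracker(dailyLimitUsd, now = () => Date.now()) {
  const spend = new Map(); // userId -> { day, totalUsd }
  const today = () => new Date(now()).toISOString().slice(0, 10); // UTC date string

  function entryFor(userId) {
    const day = today();
    const entry = spend.get(userId);
    if (!entry || entry.day !== day) {
      const fresh = { day, totalUsd: 0 }; // a new day resets the counter
      spend.set(userId, fresh);
      return fresh;
    }
    return entry;
  }

  return {
    // Call before the OpenAI request
    canSpend: (userId) => entryFor(userId).totalUsd < dailyLimitUsd,
    // Call after the response with the estimated cost of that request
    record: (userId, costUsd) => {
      entryFor(userId).totalUsd += costUsd;
    },
    spentToday: (userId) => entryFor(userId).totalUsd,
  };
}

const budget = createBudgetTracker(0.5); // hypothetical cap: $0.50 per user per day
budget.record("user-1", 0.3);
console.log(budget.canSpend("user-1")); // true
budget.record("user-1", 0.25);
console.log(budget.canSpend("user-1")); // false
console.log(budget.canSpend("user-2")); // true
```

In a route, you would check canSpend before calling OpenAI and respond with a 429 or 402 when it returns false, then call record once the response's usage is known.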

Step 4: Error Handling with Typed OpenAI Errors

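The OpenAI Node SDK throws typed errors. Use them instead of string-matching on err.message. Register this handler with app.use() after your routes so that anything passed to next(err) reaches it: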

// src/middleware/openaiErrorHandler.js
import OpenAI from "openai";

export function handleOpenAIError(err, req, res, next) {
  if (err instanceof OpenAI.APIError) {
    console.error(`OpenAI API error: ${err.status} ${err.name}`, {
      message: err.message,
      requestId: err.headers?.["x-request-id"],
    });

    if (err.status === 429) {
      return res.status(429).json({
        error: "AI service is busy. Please try again in a moment.",
        retryAfter: parseInt(err.headers?.["retry-after"] || "5", 10),
      });
    }

    if (err.status === 400) {
      return res.status(400).json({
        error: "Invalid request to AI service. Check your message format.",
      });
    }

    if (err.status === 401) {
      console.error("OpenAI authentication failed — check OPENAI_API_KEY");
      return res.status(503).json({ error: "AI service unavailable." });
    }
  }

  // Anything else (non-OpenAI errors, or OpenAI statuses not handled above) goes to your generic error handler
  next(err);
}
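The handler above covers 429, 400, and 401 and hands everything else to the generic handler. One way to make the mapping unit-testable, and to give upstream 5xx responses and timeouts a user-facing message too, is to factor the decision into a pure function. This is a sketch rather than the article's implementation: toClientError is a hypothetical name, and treating 403 like 401 and returning 502 for everything else are assumptions you may want to change.

```javascript
// Pure mapping from an upstream failure to what the client should see.
// status is undefined for connection errors and timeouts.
function toClientError(status, retryAfterHeader) {
  if (status === 429) {
    const seconds = Number.parseInt(retryAfterHeader ?? "", 10);
    return {
      status: 429,
      body: {
        error: "AI service is busy. Please try again in a moment.",
        retryAfter: Number.isNaN(seconds) ? 5 : seconds, // default when the header is missing
      },
    };
  }
  if (status === 400) {
    return { status: 400, body: { error: "Invalid request to AI service. Check your message format." } };
  }
  if (status === 401 || status === 403) {
    // Our credentials are the problem, not the user's request
    return { status: 503, body: { error: "AI service unavailable." } };
  }
  // 5xx from upstream, timeouts, and network failures
  return { status: 502, body: { error: "AI service is temporarily unavailable. Please try again." } };
}

console.log(toClientError(429, "12").body.retryAfter); // 12
console.log(toClientError(429, undefined).body.retryAfter); // 5
console.log(toClientError(undefined).status); // 502
```

The Express handler then shrinks to logging plus `const { status, body } = toClientError(err.status, retryAfter)` followed by `res.status(status).json(body)`, and the mapping can be tested without Express or OpenAI.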

Step 5: The Chat Endpoint (Non-Streaming)

Let's wire everything together for a standard, non-streaming response first:

// src/routes/chat.js
import express from "express";
import { openai, MODELS } from "../openaiClient.js";
import { tokenGuard } from "../tokenCounter.js";
import { aiRateLimiter } from "../middleware/rateLimiter.js";

const router = express.Router();

router.post(
  "/",
  aiRateLimiter,
  tokenGuard(4_000),
  async (req, res, next) => {
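    // Caution: useSmartModel comes from the client, so any caller can opt into the pricier model.
    // Gate it server-side (plan, role, or smartModelLimiter) before trusting it.
    const { messages, useSmartModel = false } = req.body;
    const model = useSmartModel ? MODELS.smart : MODELS.fast;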

    try {
      const completion = await openai.chat.completions.create({
        model,
        messages,
        max_tokens: 1_000, // cap output tokens to control cost
        temperature: 0.7,
      });

      const reply = completion.choices[0].message;
      const usage = completion.usage;

      res.json({
        message: reply,
        usage: {
          inputTokens: usage.prompt_tokens,
          outputTokens: usage.completion_tokens,
          totalTokens: usage.total_tokens,
          estimatedCostUsd: estimateCost(usage, model),
        },
      });
    } catch (err) {
      next(err);
    }
  }
);

function estimateCost(usage, model) {
  // Prices in USD per million tokens (as of mid-2025; check OpenAI's pricing page before relying on them)
  const pricing = {
    "gpt-4o-mini": { input: 0.15, output: 0.60 },
    "gpt-4o": { input: 2.50, output: 10.00 },
  };
  const p = pricing[model] || pricing["gpt-4o-mini"];
  const inputCost = (usage.prompt_tokens / 1_000_000) * p.input;
  const outputCost = (usage.completion_tokens / 1_000_000) * p.output;
  return Number((inputCost + outputCost).toFixed(6));
}

export default router;

Notice max_tokens: 1_000. Without it, the model can keep generating up to its maximum output length, which is 16,384 tokens on current gpt-4o and gpt-4o-mini snapshots. If a user asks it to "write me a book," it will try. The max_tokens cap is your backstop.
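With both the input guard and the output cap in place, the worst-case cost of a single request becomes a number you can compute. The sketch below works through it for gpt-4o-mini using the limits and prices already shown in this guide; maxCostPerRequestUsd is a hypothetical helper, and the hourly figure assumes one user sending the largest allowed request at the full 15-per-minute rate.

```javascript
// Worst-case cost of one request given the input guard and the output cap.
// Prices are USD per million tokens.
function maxCostPerRequestUsd({ maxInputTokens, maxOutputTokens, inputPrice, outputPrice }) {
  const input = (maxInputTokens / 1_000_000) * inputPrice;
  const output = (maxOutputTokens / 1_000_000) * outputPrice;
  return Number((input + output).toFixed(6));
}

const perRequest = maxCostPerRequestUsd({
  maxInputTokens: 4_000,  // tokenGuard limit
  maxOutputTokens: 1_000, // max_tokens cap
  inputPrice: 0.15,       // gpt-4o-mini input
  outputPrice: 0.60,      // gpt-4o-mini output
});

// 15 requests per minute, sustained for an hour
const perUserPerHour = Number((perRequest * 15 * 60).toFixed(4));

console.log(perRequest);     // 0.0012
console.log(perUserPerHour); // 1.08
```

So even with every guard in this guide active, one user pushing the limits on the cheap model can cost about a dollar an hour. That is small per user, but it is the number to multiply by your user count when you decide how strict the limits should be.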

Step 6: Streaming Responses with Server-Sent Events

Streaming makes AI features feel responsive. Instead of a blank screen for 3–8 seconds, the user sees text appear word by word. It's the difference between "this feels AI-powered" and "this is broken."

// src/routes/chat-stream.js
import express from "express";
import { openai, MODELS } from "../openaiClient.js";
import { tokenGuard } from "../tokenCounter.js";
import { aiRateLimiter } from "../middleware/rateLimiter.js";

const router = express.Router();

router.post(
  "/stream",
  aiRateLimiter,
  tokenGuard(4_000),
  async (req, res, next) => {
    const { messages } = req.body;

    // Establish SSE connection
    res.setHeader("Content-Type", "text/event-stream");
    res.setHeader("Cache-Control", "no-cache");
    res.setHeader("Connection", "keep-alive");
    res.setHeader("Access-Control-Allow-Origin", "*"); // restrict to your front-end origin in production
    res.flushHeaders(); // send headers immediately

    // Frame every event the same way: "data: <json>" followed by a blank line
    const send = (event) => res.write(`data: ${JSON.stringify(event)}\n\n`);

    try {
      const stream = await openai.chat.completions.create({
        model: MODELS.fast,
        messages,
        max_tokens: 1_000,
        stream: true,
        stream_options: { include_usage: true }, // final chunk carries exact token usage
      });

      // Stop generating (and paying) if the client disconnects mid-stream
      res.on("close", () => {
        if (!res.writableEnded) stream.controller.abort();
      });

      let usage = null;

      for await (const chunk of stream) {
        if (chunk.usage) usage = chunk.usage; // only set on the last chunk, which has no choices

        const delta = chunk.choices[0]?.delta?.content ?? "";
        if (delta) send({ type: "delta", content: delta });

        // finish_reason "length" means the max_tokens cap cut the reply short
        if (chunk.choices[0]?.finish_reason === "length") {
          send({ type: "warning", message: "Response truncated at token limit" });
        }
      }

      send({ type: "done", usage });
      res.end();
    } catch (err) {
      // Headers are already sent, so report the failure as an SSE event instead of a status code
      console.error("Streaming error:", err.message);
      if (!res.writableEnded && !res.destroyed) {
        send({ type: "error", message: "Generation failed. Please try again." });
        res.end();
      }
    }
  }
);

export default router;
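On the client side, the browser's EventSource API only issues GET requests, so a POST endpoint like this one is usually consumed with fetch and a small manual parser. Here is a minimal sketch of the parsing half, assuming the { type, content } event shape used above. parseSSE is a hypothetical helper name, and the loop at the bottom simulates network chunks rather than making a real request.

```javascript
// Splits buffered SSE text into complete events and returns the unparsed remainder.
// Network chunks can end mid-event, so the caller keeps `rest` and prepends it
// to the next chunk.
function parseSSE(buffer) {
  const events = [];
  const frames = buffer.split("\n\n");
  const rest = frames.pop(); // last piece is incomplete (or empty)

  for (const frame of frames) {
    for (const line of frame.split("\n")) {
      if (line.startsWith("data: ")) {
        events.push(JSON.parse(line.slice(6)));
      }
    }
  }
  return { events, rest };
}

// Simulate two network chunks, the first ending mid-event
let pending = "";
let text = "";
for (const chunk of [
  'data: {"type":"delta","content":"Hel"}\n\ndata: {"type":"del',
  'ta","content":"lo"}\n\ndata: {"type":"done"}\n\n',
]) {
  const { events, rest } = parseSSE(pending + chunk);
  pending = rest;
  for (const e of events) if (e.type === "delta") text += e.content;
}
console.log(text); // Hello
```

In a browser you would read response.body with getReader(), decode each chunk with a TextDecoder, and feed the decoded text through parseSSE in the same way, appending each delta to the chat bubble as it arrives.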


Streaming vs. Non-Streaming — When to Use Which

| Factor | Non-Streaming | Streaming (SSE) |
| --- | --- | --- |
| User experience | Blank screen until done (3–8s) | Text appears word by word, feels instant |
| Complexity | Standard REST response | SSE connection, chunked parsing on frontend |
| Usage logging | Easy: completion.usage has exact token counts | Harder: token counts arrive only in the final chunk, and only if you request them with stream_options: { include_usage: true } |
| Caching | Can cache the full response | Can't cache a stream |
| Best for | API-to-API calls, short responses, classification tasks | User-facing chat, long completions, code generation |
| Serverless functions | Works everywhere | Needs a long-running connection: use Vercel Edge Functions or a real server |

Testing Your OpenAI Integration

Mocking the OpenAI API in tests is a trap. The mock will pass but the real integration will fail in ways you didn't anticipate — different error formats, unexpected token usage, streaming chunk structure variations.

Instead:

Unit test everything except the API call. Test your token counting, your error handler, your response formatter — all without touching OpenAI. These functions should be pure and deterministic.
Use a cheap model for integration tests. gpt-4o-mini is $0.15 per million input tokens. Your integration test suite probably costs fractions of a cent to run. Run it.
Record and replay for expensive tests. Libraries like nock or VCR-style recording let you record real API responses and replay them in future test runs without hitting the API.

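// Example: testing the token guard middleware in isolation
// createMockMiddlewareContext is a small test helper of your own (not shown here) that returns
// a fake req, a res that records its status and JSON body, and a mock next function.
import { tokenGuard } from "../src/tokenCounter.js";
import { createMockMiddlewareContext } from "./helpers.js";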

test("tokenGuard rejects messages over the limit", async () => {
  const guard = tokenGuard(10); // tiny limit for test
  const { req, res, next } = createMockMiddlewareContext({
    body: {
      messages: [{ role: "user", content: "a".repeat(500) }],
    },
  });

  await guard(req, res, next);

  expect(res.statusCode).toBe(400);
  expect(res.body.error).toContain("too long");
  expect(next).not.toHaveBeenCalled();
});

TL;DR

Initialize the OpenAI client once with maxRetries and timeout set. Don't instantiate it in route handlers: you'd create a new client on every request, and it's easy to end up with the SDK defaults (2 retries, 10-minute timeout) instead of values you chose.
Count tokens before you call the API. Use tiktoken to measure input size and reject oversized requests before they cost you money. Set a max_tokens cap on output for the same reason.
Rate limit by user ID, not just IP. Authenticated users with the same IP (corporate NAT, mobile networks) would all share a single IP limit — use their user ID as the rate limit key.
Use typed error handling — instanceof OpenAI.APIError gives you the status code, request ID, and message. Turn 429s into user-friendly retry prompts, not 500 errors.
Stream for user-facing features, skip it for internal calls. SSE streaming transforms the UX for chat interfaces. For batch processing or API-to-API calls, non-streaming is simpler to implement and log.
Test everything except the API call. Token counting, error handling, and response formatting are all pure functions you can test cheaply. For integration tests, use gpt-4o-mini — it's cheap enough to run in CI.

Top comments (1)

江欢（JackSoul） • Jun 3

Great production checklist — especially that token counting + per-user limits need to exist before launch, not after the first bill surprise.

One extra pattern I’d add: keep a hard budget boundary outside the app code too. App-level guards are necessary, but they’re easy to bypass later when you add a worker, cron job, admin script, or second service that calls the model directly.

A practical setup is:

app-level per-user/per-feature limits,
separate API keys per customer/environment,
prepaid or hard monthly caps,
usage logs grouped by key + model + endpoint,
alerts before the cap, but request rejection at the cap.

That separation makes the “one user opens 15 tabs” case easier to contain, and it also helps debug whether cost came from chat, background jobs, or internal tooling.