Aadesh Kumar

How to Add AI Features to Your SaaS App Without Breaking Everything

LLM integrations look simple in demos. In production, they fail in ways most tutorials never cover.

What you will learn in this guide:

  • How to structure an LLM integration that does not block your API
  • How to implement streaming responses for better user experience
  • How to manage token costs before they become a budget problem
  • How to version prompts without redeploying your application
  • How to build graceful fallback when the AI call fails or times out

The Problem With Most AI Integration Tutorials

Most LLM integration guides stop at "call the API, log the response." That is enough to demo. It is not enough to ship.

Production AI features fail in ways that do not appear in tutorials: the LLM API times out under load, a prompt that worked in testing produces hallucinated output for slightly different user input, token costs spike 10x when a user submits a longer-than-expected document, and slow AI calls pile up inside request handlers until they drag every other feature down with them.

This guide covers the integration patterns that prevent those failures — in Node.js with Express, using the Anthropic and OpenAI APIs as the reference implementations. The patterns apply to any LLM provider.

Step 1: Structure the Integration to Not Block Your API

The single most common mistake in first-pass LLM integrations is calling the LLM provider inline in a request handler and holding the HTTP request open until the model finishes. LLM calls are slow — typically 1–15 seconds for a full response depending on output length. Holding requests open that long under concurrent load exhausts connection pools, trips proxy and load-balancer timeouts, and ties results to requests the client may have already abandoned.

Wrong approach:

// DON'T do this — it holds the request open for the entire LLM round trip
app.post('/api/summarize', requireAuth, async (req, res) => {
  const { text } = req.body;

  // This can take 5-15 seconds — terrible under concurrent requests
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: `Summarize: ${text}` }]
  });

  res.json({ summary: completion.choices[0].message.content });
});

Correct approach for long-running AI tasks — use a background job:

const Queue = require('bull');
const aiQueue = new Queue('ai-tasks', process.env.REDIS_URL); // Bull accepts a Redis connection URL directly

// Route handler: enqueue the job, return immediately
app.post('/api/summarize', requireAuth, async (req, res) => {
  const { text, documentId } = req.body;

  const job = await aiQueue.add('summarize', {
    userId: req.user.id,
    documentId,
    text
  });

  // Return job ID immediately — client polls or receives result via websocket
  res.status(202).json({ jobId: job.id, status: 'processing' });
});

// Status endpoint — client polls this
app.get('/api/summarize/:jobId', requireAuth, async (req, res) => {
  const job = await aiQueue.getJob(req.params.jobId);
  if (!job) return res.status(404).json({ error: 'Job not found' });

  const state = await job.getState();
  const result = job.returnvalue;

  res.json({ jobId: job.id, status: state, result: result || null });
});

// Worker processes the job outside the request lifecycle
aiQueue.process('summarize', async (job) => {
  const { userId, documentId, text } = job.data;

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: buildSummaryPrompt(text) }],
    max_tokens: 500
  });

  const summary = completion.choices[0].message.content;

  // Persist the result
  await db.documents.update({ summary }, { where: { id: documentId, userId } });

  return { summary, documentId };
});

For features where the user is waiting in the UI and needs a real-time response — a chat interface, an inline text suggestion — streaming is the correct approach instead of background jobs.

Step 2: Implement Streaming Responses

Streaming returns tokens to the client as they are generated rather than waiting for the full response. For the user, this is the difference between a blank screen for 8 seconds and text appearing incrementally within 500ms.

// Streaming endpoint using Server-Sent Events (SSE)
app.get('/api/chat/stream', requireAuth, async (req, res) => {
  const { message, conversationId } = req.query;

  // Set SSE headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('X-Accel-Buffering', 'no'); // Disable Nginx buffering

  res.flushHeaders();

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: await buildConversationHistory(conversationId, message),
      max_tokens: 1000,
      stream: true  // Enable streaming
    });

    let fullContent = '';

    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta?.content || '';

      if (delta) {
        fullContent += delta;

        // Send token to client
        res.write(`data: ${JSON.stringify({ token: delta })}\n\n`);
      }
    }

    // Stream finished: persist the full response (handles any finish_reason, not just 'stop')
    await db.messages.create({
      conversationId,
      role: 'assistant',
      content: fullContent
    });

    // Signal completion to client
    res.write(`data: ${JSON.stringify({ done: true })}\n\n`);
    res.end();
  } catch (err) {
    // Send error event to client so the UI can respond gracefully
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});

Client-side SSE consumption:

function streamAIResponse(message, conversationId, onToken, onComplete, onError) {
  const params = new URLSearchParams({ message, conversationId });
  const eventSource = new EventSource(`/api/chat/stream?${params}`);

  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);

    if (data.token) {
      onToken(data.token);  // Append token to UI
    }

    if (data.done) {
      eventSource.close();
      onComplete();
    }

    if (data.error) {
      eventSource.close();
      onError(new Error(data.error));
    }
  };

  eventSource.onerror = () => {
    eventSource.close();
    onError(new Error('Stream connection failed'));
  };

  // Return cleanup function
  return () => eventSource.close();
}

Step 3: Manage Token Costs Before They Manage You

Token costs are invisible in development and punishing in production. A feature that costs a fraction of a cent per call on a 500-token input can cost a hundred times more when a user submits a 50,000-word document.

Enforce input length limits at the application layer:

const { encode } = require('gpt-tokenizer');

function validateTokenBudget(text, maxInputTokens = 4000) {
  const tokenCount = encode(text).length;

  if (tokenCount > maxInputTokens) {
    throw new Error(
      `Input too long: ${tokenCount} tokens (maximum: ${maxInputTokens}). ` +
      `Please shorten your input by approximately ${tokenCount - maxInputTokens} tokens.`
    );
  }

  return tokenCount;
}

// Usage in route handler
app.post('/api/summarize', requireAuth, async (req, res) => {
  try {
    const inputTokens = validateTokenBudget(req.body.text, 4000);
    // Proceed with AI call
  } catch (err) {
    return res.status(400).json({ error: err.message, code: 'TOKEN_LIMIT_EXCEEDED' });
  }
});

Track usage per user for cost attribution and abuse prevention:

async function trackAIUsage(userId, feature, inputTokens, outputTokens, model) {
  // Approximate cost per token in USD; keep in sync with current provider pricing
  const costs = {
    'gpt-4o': { input: 0.0000025, output: 0.00001 },
    'claude-sonnet-4-20250514': { input: 0.000003, output: 0.000015 }
  };

  const modelCosts = costs[model] || { input: 0, output: 0 };
  const estimatedCost = (inputTokens * modelCosts.input) + (outputTokens * modelCosts.output);

  await db.aiUsage.create({
    userId,
    feature,
    model,
    inputTokens,
    outputTokens,
    estimatedCostUsd: estimatedCost,
    timestamp: new Date()
  });

  // Check monthly spend limit per user
  const monthlySpend = await db.aiUsage.sum('estimatedCostUsd', {
    where: {
      userId,
      timestamp: { [Op.gte]: startOfMonth() }
    }
  });

  if (monthlySpend > USER_MONTHLY_AI_LIMIT_USD) {
    throw new Error('Monthly AI usage limit reached');
  }
}

Use caching for repeated identical requests:

const crypto = require('crypto');

async function cachedCompletion(prompt, options = {}) {
  // Hash the prompt plus options so different models or parameters never share a cache entry
  const cacheKey = `ai:${crypto.createHash('sha256').update(prompt + JSON.stringify(options)).digest('hex')}`;
  const cached = await redis.get(cacheKey);

  if (cached) {
    return JSON.parse(cached);
  }

  const result = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    ...options
  });

  // Cache for 1 hour — adjust based on how dynamic the content needs to be
  await redis.set(cacheKey, JSON.stringify(result), { EX: 3600 });

  return result;
}

Caching is particularly effective for features like document classification, tag suggestion, and sentiment analysis — where the same input frequently reappears across users.
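
A brief usage sketch of the cachedCompletion helper above for classification; the ticketText variable and category list are hypothetical:

// Identical ticket text always maps to the same category, so a cache hit skips the LLM call entirely
const completion = await cachedCompletion(
  `Classify this support ticket as one of: billing, bug, feature-request.\n\n${ticketText}`,
  { max_tokens: 10 }
);
const category = completion.choices[0].message.content.trim();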

Step 4: Version Your Prompts Without Redeploying

Prompts are logic. When a prompt changes, behavior changes. Hardcoding prompts inside application code means every prompt iteration requires a code deployment — slowing iteration and making rollback difficult.

Store prompts in the database with versioning:

CREATE TABLE prompt_templates (
  id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  feature     VARCHAR(100) NOT NULL,
  version     INTEGER NOT NULL,
  is_active   BOOLEAN DEFAULT FALSE,
  template    TEXT NOT NULL,
  created_at  TIMESTAMP DEFAULT NOW(),

  UNIQUE(feature, version)
);

-- Only one active version per feature
CREATE UNIQUE INDEX idx_active_prompt ON prompt_templates (feature)
  WHERE is_active = TRUE;

The application-side prompt service loads the active template and fills in its variables:

// Prompt management service
class PromptService {
  constructor() {
    this.cache = new Map();
  }

  async getPrompt(feature) {
    // Cache in memory for 5 minutes to avoid DB hits on every request
    if (this.cache.has(feature)) {
      const { prompt, cachedAt } = this.cache.get(feature);
      if (Date.now() - cachedAt < 300000) return prompt;
    }

    const template = await db.promptTemplates.findOne({
      where: { feature, isActive: true }
    });

    if (!template) throw new Error(`No active prompt found for feature: ${feature}`);

    this.cache.set(feature, { prompt: template.template, cachedAt: Date.now() });
    return template.template;
  }

  buildPrompt(template, variables) {
    return template.replace(/\{\{(\w+)\}\}/g, (_, key) => {
      if (!(key in variables)) throw new Error(`Missing prompt variable: ${key}`);
      return variables[key];
    });
  }
}

const promptService = new PromptService();

// Usage
const template = await promptService.getPrompt('document-summary');
const prompt = promptService.buildPrompt(template, {
  documentText: req.body.text,
  outputLanguage: req.user.preferredLanguage || 'English',
  maxLength: '3 sentences'
});

This pattern lets non-engineering team members iterate on prompts through an admin interface without touching application code, and enables instant rollback by deactivating one version and activating the previous one.
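
A minimal rollback sketch, assuming the same Sequelize-style db.promptTemplates model used above and a hypothetical admin-only helper:

// Deactivate the current version and reactivate an earlier one in a single transaction
// (the partial unique index above guarantees only one active version per feature)
async function rollbackPrompt(feature, targetVersion) {
  await db.sequelize.transaction(async (t) => {
    await db.promptTemplates.update(
      { isActive: false },
      { where: { feature, isActive: true }, transaction: t }
    );
    await db.promptTemplates.update(
      { isActive: true },
      { where: { feature, version: targetVersion }, transaction: t }
    );
  });

  // Drop the in-memory copy so the rolled-back prompt is picked up on the next request
  promptService.cache.delete(feature);
}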

Step 5: Build Graceful Fallback for AI Failures

LLM APIs have outages. They return 429 rate limit errors. They time out. An AI feature that makes the rest of your application unavailable when the LLM provider has an incident is a reliability problem waiting to happen.

class AIService {
  constructor() {
    this.timeout = 30000; // 30 second timeout
    this.retryAttempts = 2;
    this.retryDelay = 1000;
  }

  async complete(prompt, options = {}) {
    for (let attempt = 0; attempt <= this.retryAttempts; attempt++) {
      try {
        const controller = new AbortController();
        const timeoutId = setTimeout(() => controller.abort(), this.timeout);

        const result = await openai.chat.completions.create(
          { model: 'gpt-4o', messages: [{ role: 'user', content: prompt }], ...options },
          { signal: controller.signal }
        );

        clearTimeout(timeoutId);
        return { success: true, data: result.choices[0].message.content };

      } catch (err) {
        const isRetryable = err.status === 429 || err.status >= 500 || err.name === 'AbortError';
        const isLastAttempt = attempt === this.retryAttempts;

        if (!isRetryable || isLastAttempt) {
          return { success: false, error: err.message, code: err.status };
        }

        // Exponential backoff between retries
        await new Promise(r => setTimeout(r, this.retryDelay * Math.pow(2, attempt)));
      }
    }
  }
}

// Route handler with graceful degradation
const aiService = new AIService();

app.post('/api/documents/:id/suggest-tags', requireAuth, async (req, res) => {
  const document = await db.documents.findByPk(req.params.id);
  const prompt = await buildTagSuggestionPrompt(document.content);

  const result = await aiService.complete(prompt, { max_tokens: 100 });

  if (!result.success) {
    // Graceful fallback: return empty suggestions rather than an error
    // Log the failure for monitoring
    console.error('AI tag suggestion failed:', result.error);

    return res.json({
      tags: [],
      source: 'fallback',
      message: 'Automatic tag suggestions are temporarily unavailable'
    });
  }

  const tags = parseTagsFromCompletion(result.data);
  res.json({ tags, source: 'ai' });
});

The key principle: an AI feature should degrade gracefully, not fail loudly. A tag suggestion that returns empty results is acceptable. A tag suggestion that throws an unhandled exception and crashes the document view is not.

AI Builders and LLM Integration in 2026

A growing category of full-stack AI builders now generates LLM integration scaffolding as part of the application output — including streaming handlers, prompt storage patterns, and token tracking. For teams evaluating whether to build this infrastructure from scratch, platforms like imagine.bo include AI feature scaffolding in their generated backends, which can compress the initial integration work significantly.

For teams reviewing generated LLM integration code or building their own, the patterns above represent the production baseline. Generated scaffolding is a starting point; understanding why each pattern exists is what allows you to extend and debug it.

Testing AI Integrations

// Mock the LLM client in tests: never call the real API in unit tests.
// Mocking the 'openai' package directly will not work if the app does `new OpenAI()`,
// so mock the app's own client module instead (the path shown here is hypothetical).
jest.mock('../lib/openai', () => ({
  chat: {
    completions: {
      create: jest.fn()
    }
  }
}));

const openai = require('../lib/openai');

describe('Document summarization', () => {
  it('returns summary on successful AI call', async () => {
    openai.chat.completions.create.mockResolvedValue({
      choices: [{ message: { content: 'This is a test summary.' }, finish_reason: 'stop' }]
    });

    const response = await request(app)
      .post('/api/summarize')
      .set('Authorization', `Bearer ${testToken}`)
      .send({ text: 'Sample document text', documentId: 'doc-123' });

    expect(response.status).toBe(202); // Accepted for processing
    expect(response.body.jobId).toBeDefined();
  });

  it('returns fallback response when AI call fails', async () => {
    openai.chat.completions.create.mockRejectedValue(
      Object.assign(new Error('Service unavailable'), { status: 503 })
    );

    const response = await request(app)
      .post('/api/documents/doc-123/suggest-tags')
      .set('Authorization', `Bearer ${testToken}`);

    expect(response.status).toBe(200);
    expect(response.body.tags).toEqual([]);
    expect(response.body.source).toBe('fallback');
  });
});

Common Questions

Which LLM provider should I use in 2026?
The honest answer is: it depends on the task. OpenAI's GPT-4o and Anthropic's Claude models are the two most widely used in production SaaS integrations. For code generation tasks, Claude consistently performs well. For multimodal tasks involving images, GPT-4o has broader support. Running both behind an abstraction layer and switching based on task type is a pattern used by teams with high AI feature density.
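
A minimal sketch of that abstraction, assuming openai and anthropic clients are already configured and that the task-type names are your own:

// Route by task type; both branches return plain text so callers stay provider-agnostic
async function completeByTask(taskType, prompt, maxTokens = 500) {
  if (taskType === 'code' || taskType === 'long-form') {
    // Anthropic Messages API
    const msg = await anthropic.messages.create({
      model: 'claude-sonnet-4-20250514',
      max_tokens: maxTokens,
      messages: [{ role: 'user', content: prompt }]
    });
    return msg.content[0].text;
  }

  // Default to OpenAI for everything else
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    max_tokens: maxTokens,
    messages: [{ role: 'user', content: prompt }]
  });
  return completion.choices[0].message.content;
}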

How do I prevent users from abusing AI features?
Rate limiting at the user level, monthly token budgets, and input length validation are the three primary controls. Implementing all three is appropriate for any AI feature with non-trivial per-call cost. Requiring a paid plan to access AI features is also a common and legitimate approach.
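
A per-user limiter sketch using express-rate-limit, assuming requireAuth has already populated req.user; the window and limit values are placeholders:

const rateLimit = require('express-rate-limit');

// Key the limiter on user ID rather than IP so shared networks are not penalized
const aiRateLimiter = rateLimit({
  windowMs: 60 * 1000,   // 1 minute window
  max: 10,               // 10 AI calls per user per minute
  keyGenerator: (req) => req.user.id,
  message: { error: 'Too many AI requests, please slow down' }
});

app.post('/api/summarize', requireAuth, aiRateLimiter, async (req, res) => {
  // ...token validation and job enqueueing as in Step 1 and Step 3
});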

Should AI features be synchronous or asynchronous?
It depends on the expected latency. Features where users expect an instant response — inline text suggestions, short completions — should use streaming. Features where users understand processing will take time — document analysis, report generation — should use background jobs with status polling or websocket updates.

How do I handle LLM hallucinations in production?
For factual tasks, constrain the output format to structured JSON and validate the response against a schema before using it. For tasks where hallucination is particularly costly — financial data, medical context, legal summaries — add a human review step before surfacing AI output to end users. For general-purpose text generation, disclosing that content is AI-generated and providing an edit interface is the standard approach.
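
A minimal sketch of the structured-output approach, assuming the zod package for validation; the tag schema and helper name are illustrative:

const { z } = require('zod');

const tagSchema = z.object({ tags: z.array(z.string()).min(1).max(5) });

async function suggestTagsStrict(documentText) {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    // json_object mode requires the word "JSON" to appear in the prompt
    response_format: { type: 'json_object' },
    messages: [{
      role: 'user',
      content: `Return JSON of the form {"tags": ["..."]} with 1-5 topical tags for:\n\n${documentText}`
    }],
    max_tokens: 100
  });

  try {
    const parsed = tagSchema.safeParse(JSON.parse(completion.choices[0].message.content));
    if (parsed.success) return parsed.data.tags;
  } catch (err) {
    // Not valid JSON at all; fall through to the fallback below
  }

  // Treat schema failures like any other AI failure: never trust unvalidated output
  return [];
}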

Tags: webdev, ai, saas, javascript
