You have a working product. Customers use it. Revenue comes in. Now your team wants to add AI — smart search, document processing, a support chatbot, something. Every architecture conversation ends up in the same place: "maybe we should rebuild the whole thing on a modern stack."
You don't need to rebuild anything. Almost every AI feature I've added to production systems has been an additional layer on top of the existing architecture, not a replacement of it. The tricky part is knowing which layer, and where.
The Three Integration Patterns
When people say "add AI to our product," they almost always mean one of three things. The patterns are distinct enough that they require different technical approaches — and picking the wrong one wastes months.
Pattern 1: RAG for Domain Knowledge
Use this when your product has information that a general-purpose model doesn't know — your documentation, your customers' previous orders, your company's specific policies, your product catalog.
RAG (retrieval-augmented generation) works by indexing your domain content as vector embeddings, retrieving the relevant pieces at query time, and passing them to the LLM as context. The model doesn't need to "learn" your data — it reads it on demand.
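The retrieval step can be sketched in a few lines. This is a minimal in-memory illustration, not the Pikkuna implementation: the `Chunk` type and the `cosineSimilarity`, `topK`, and `buildContext` helpers are hypothetical names, and a production system would call an embedding model and query a vector database instead of comparing hand-written vectors.

```typescript
// Hypothetical shape of an indexed chunk: text plus its embedding vector.
interface Chunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function topK(queryEmbedding: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Join the retrieved chunks into the context string passed to the LLM.
function buildContext(chunks: Chunk[]): string {
  return chunks.map((c) => c.text).join("\n\n");
}
```

The model never sees the whole knowledge base, only the output of `buildContext`, which is why updating the index changes answers without touching the model.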
I built this for Pikkuna: a support chatbot across 30 languages and 35 countries that resolved 70% of support queries without a human agent. The entire document base — 1,614 docs — is in English. Cross-lingual retrieval works because text-embedding-3-large encodes semantic meaning across languages, not surface-level words. A Finnish customer asking "Missä on tilaukseni?" retrieves the same English FAQ chunk as an English-speaking customer asking "Where is my order?"
One ingestion pipeline, one vector index, zero translation infrastructure.
When RAG is the right pattern:
- Your product has domain-specific knowledge a general model doesn't have
- You need the AI to answer from your data, not general web knowledge
- You can version and update the knowledge base independently from the model
The failure mode to watch: retrieval quality determines response quality. If you surface irrelevant or incomplete chunks, the model produces a confident-sounding but incorrect answer. Measuring what you retrieve is as important as measuring what the model generates.
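One way to measure retrieval on its own is a hit rate over a small hand-labeled set of queries. The sketch below assumes you have such a set; the `RetrievalCase` shape and `hitRateAtK` are hypothetical names, and hit rate is only one of several possible retrieval metrics.

```typescript
// One labeled evaluation case: the ids retrieved for a query (best first)
// and the ids of chunks a human marked as relevant.
interface RetrievalCase {
  query: string;
  retrievedIds: string[];
  relevantIds: string[];
}

// Fraction of cases where at least one relevant chunk appears in the top k.
function hitRateAtK(cases: RetrievalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedIds.slice(0, k).some((id) => c.relevantIds.includes(id)),
  ).length;
  return hits / cases.length;
}
```

Tracking this number separately from answer quality tells you whether a bad answer came from retrieval or from generation.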
Pattern 2: Document Processing
Use this when users upload documents — PDFs, invoices, contracts, receipts — and you need structured data out of them.
The workflow is straightforward: extract text (PyMuPDF or pdf2image + OCR for scanned documents), pass it to an LLM with a specific extraction prompt, validate the output against a schema, persist structured records. The AI is a parsing layer, not a reasoning layer.
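The four steps can be sketched as a small pipeline with injected dependencies. Everything here is an assumption for illustration: `PipelineSteps`, `processDocument`, and the step names are hypothetical, and real implementations would wrap a PDF library, an LLM client, a schema validator, and a database.

```typescript
// Either a validated record or a reason to send the document to manual review.
type PipelineResult<T> =
  | { ok: true; record: T }
  | { ok: false; reason: string };

// Hypothetical step signatures, injected so each step can be swapped or stubbed.
interface PipelineSteps<T> {
  extractText: (file: Uint8Array) => Promise<string>;
  callModel: (text: string) => Promise<unknown>;
  validate: (raw: unknown) => T | null;
  persist: (record: T) => Promise<void>;
}

async function processDocument<T>(
  file: Uint8Array,
  steps: PipelineSteps<T>,
): Promise<PipelineResult<T>> {
  // 1. Extract text (or OCR output for scanned documents).
  const text = await steps.extractText(file);
  if (text.trim().length === 0) {
    return { ok: false, reason: "no text extracted" };
  }
  // 2. Ask the model for structured data.
  const raw = await steps.callModel(text);
  // 3. Validate against the schema before anything is written.
  const record = steps.validate(raw);
  if (record === null) {
    return { ok: false, reason: "schema validation failed" };
  }
  // 4. Persist only validated records.
  await steps.persist(record);
  return { ok: true, record };
}
```

Keeping the model call behind a plain function makes the point of the pattern visible: the LLM is one replaceable parsing step between conventional code on both sides.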
I built a PDF forensics tool for htpbe.tech that runs a 5-layer analysis on uploaded PDFs — xref structure, metadata, digital signatures, LTV, tool fingerprints — in under 9 seconds. The AI layer handles the classification of anomalies; the forensics extraction layer is conventional code.
When document processing is the right pattern:
- Users submit unstructured documents you need structured data from
- The output schema is well-defined (invoice fields, contract clauses, form values)
- You need to handle format variation without writing a new parser for each template
Pattern 3: UX Augmentation
Use this when you want to improve an existing user interaction — search, classification, recommendations, autocomplete, smart filters.
This is usually the lowest-risk pattern because it degrades gracefully. If the AI call fails or returns a low-confidence result, you fall back to the existing behavior. No user notices.
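A minimal sketch of that fallback logic, assuming the AI call returns a confidence score between 0 and 1. The `Prediction` shape, the `withFallback` helper, and the 0.8 default threshold are illustrative assumptions, not values from any of the systems described here.

```typescript
// Hypothetical result shape for an AI-backed feature.
interface Prediction<T> {
  value: T;
  confidence: number; // 0..1
}

// Use the AI result only if the call succeeds and is confident enough;
// otherwise return whatever the existing code path produces.
async function withFallback<T>(
  aiCall: () => Promise<Prediction<T>>,
  existingBehavior: () => T | Promise<T>,
  minConfidence = 0.8,
): Promise<{ value: T; source: "ai" | "fallback" }> {
  try {
    const prediction = await aiCall();
    if (prediction.confidence >= minConfidence) {
      return { value: prediction.value, source: "ai" };
    }
  } catch {
    // Swallowed on purpose: the feature degrades, the page does not fail.
  }
  return { value: await existingBehavior(), source: "fallback" };
}
```

Returning the `source` alongside the value makes it easy to log how often the fallback fires, which is itself a useful health metric.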
For pi-pi.ee, a B2B e-commerce platform across 28 languages and 32 EU markets, I added AI-assisted product classification that routes items to the correct HS tariff codes for customs declarations. The existing product catalog and checkout flow are untouched. The AI runs as a background enrichment step. If it misclassifies, a human corrects it — the correction feeds back into the system.
When UX augmentation is the right pattern:
- You want to improve an existing interaction, not replace it
- A fallback to the old behavior is acceptable
- You can measure the improvement in user behavior (search success rate, task completion, etc.)
Where AI Sits in Your Architecture
The three integration points — API gateway, background worker, direct call — have different tradeoffs.
Direct call from a Server Action or API route is the right choice for low-latency user-facing features. A user types a search query; you embed it, retrieve context, stream a response. The AI is in the critical path.
```typescript
// app/api/search/route.ts
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
import { retrieveContext } from "@/lib/rag/retrieve";

export async function POST(request: Request) {
  const { query } = await request.json();

  // Retrieval happens first; if it fails, treat it as "no results"
  const chunks = await retrieveContext(query).catch(() => []);

  if (chunks.length === 0) {
    // Graceful fallback to conventional search
    return Response.json({ fallback: true });
  }

  const context = chunks.map((c) => c.text).join("\n\n");

  const result = streamText({
    model: openai("gpt-4o-mini"),
    system: `Answer based only on this context:\n\n${context}`,
    prompt: query,
  });

  return result.toDataStreamResponse();
}
```
Background worker via BullMQ is the right choice for anything that doesn't need a sub-second response — document processing, batch enrichment, re-indexing, generating summaries of long content. The job runs asynchronously; the UI polls for completion or receives a webhook when it's done.
```typescript
// lib/workers/document-processor.ts
import { Worker } from "bullmq";
import { connection } from "@/lib/redis";
import { extractStructuredData } from "@/lib/ai/extract";
import { db } from "@/lib/db";
import { documents } from "@/lib/db/schema";
import { eq } from "drizzle-orm";

export const documentWorker = new Worker(
  "document-processing",
  async (job) => {
    const { documentId, text } = job.data as {
      documentId: string;
      text: string;
    };

    // The LLM call runs off the request path, inside the worker
    const extracted = await extractStructuredData(text);

    // Mark the document as processed once structured data is stored
    await db
      .update(documents)
      .set({ structured: extracted, status: "processed" })
      .where(eq(documents.id, documentId));
  },
  { connection }
);
```
API gateway enrichment — running AI on every request before it reaches your application — is almost always the wrong choice. The latency budget disappears fast, and a model outage takes down your entire product. Keep AI out of your critical request path unless the feature genuinely requires it.
Choosing the Right Model
The instinct is to use the most capable model available. That's usually wrong.
GPT-4o is roughly 15× more expensive per token than GPT-4o-mini. For most tasks — classification, extraction, short-form generation from structured context — GPT-4o-mini produces equivalent results. I use it as the default and only escalate to a more capable model when quality evaluation shows the smaller model actually failing.
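The price gap is easier to reason about as arithmetic on your own traffic. The sketch below is an illustration only: the per-million-token prices are assumptions that may be out of date (check the provider's current pricing page), and `requestCost` and `monthlyCost` are hypothetical helpers.

```typescript
// Price per one million tokens, in USD.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

// ASSUMED prices for illustration; replace with current published prices.
const ASSUMED_PRICES: Record<string, ModelPrice> = {
  "gpt-4o": { inputPerMillion: 2.5, outputPerMillion: 10 },
  "gpt-4o-mini": { inputPerMillion: 0.15, outputPerMillion: 0.6 },
};

// Cost of a single request in USD.
function requestCost(
  price: ModelPrice,
  inputTokens: number,
  outputTokens: number,
): number {
  return (
    (inputTokens * price.inputPerMillion +
      outputTokens * price.outputPerMillion) /
    1_000_000
  );
}

// Projected monthly cost for a given request volume.
function monthlyCost(
  price: ModelPrice,
  requestsPerMonth: number,
  inputTokens: number,
  outputTokens: number,
): number {
  return requestCost(price, inputTokens, outputTokens) * requestsPerMonth;
}
```

Under these assumed prices, a request with 1,000 input tokens and 200 output tokens costs about $0.0045 on the larger model and about $0.00027 on the smaller one. At one million requests per month that is roughly $4,500 versus $270.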
The decision framework I use:
| Task type | Starting model | Escalate if |
|---|---|---|
| RAG chat, short answers | gpt-4o-mini | Multi-step reasoning required |
| Document extraction | gpt-4o-mini | Complex unstructured layout |
| Long-form generation | gpt-4o | Quality threshold not met |
| Code generation | gpt-4o | gpt-4o-mini misses edge cases |
Never reach for a model you can't justify based on task-specific quality evaluation. "It's better" is not a justification — it's an assumption.
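The table can be encoded directly, with escalation gated on a measured pass rate instead of a hunch. This is a sketch under assumptions: `chooseModel`, the task labels, and the pass-rate inputs are hypothetical, and it only models the one escalation step from the smaller to the larger model. What to do when the larger model also misses the target is left open.

```typescript
type TaskType =
  | "rag-chat"
  | "document-extraction"
  | "long-form"
  | "code-generation";
type Model = "gpt-4o-mini" | "gpt-4o";

// Starting model per task type, taken from the table above.
const STARTING_MODEL: Record<TaskType, Model> = {
  "rag-chat": "gpt-4o-mini",
  "document-extraction": "gpt-4o-mini",
  "long-form": "gpt-4o",
  "code-generation": "gpt-4o",
};

interface ModelChoice {
  model: Model;
  escalated: boolean;
}

// measuredPassRate is the share of eval cases the starting model passes,
// or null if no task-specific evaluation has been run yet.
function chooseModel(
  task: TaskType,
  measuredPassRate: number | null,
  targetPassRate: number,
): ModelChoice {
  const start = STARTING_MODEL[task];
  const failing = measuredPassRate !== null && measuredPassRate < targetPassRate;
  if (failing && start === "gpt-4o-mini") {
    return { model: "gpt-4o", escalated: true };
  }
  // No evidence of failure (or nothing larger modeled here): keep the start.
  return { model: start, escalated: false };
}
```

Without an evaluation result the function never escalates, which is the rule the framework is meant to enforce.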
What Production-Ready AI Actually Requires
Adding an OpenAI call to a route handler takes 20 minutes. Shipping it to production takes longer, because several things will go wrong that the documentation doesn't cover.
Output validation. LLMs return text. Your application needs structured data. Use Zod to validate extraction output — if the model returns a malformed object, you need to catch that before it writes garbage to your database.
```typescript
import { z } from "zod";

const InvoiceSchema = z.object({
  vendor: z.string(),
  amount: z.number().positive(),
  currency: z.string().length(3),
  date: z.string().regex(/^\d{4}-\d{2}-\d{2}$/),
  vatNumber: z.string().optional(),
});

export type Invoice = z.infer<typeof InvoiceSchema>;

// Placeholder: callLLM stands for your own wrapper around the model client.
// It is assumed to return the model's output already parsed from JSON.
declare function callLLM(text: string): Promise<unknown>;

export async function extractInvoice(text: string): Promise<Invoice | null> {
  const raw = await callLLM(text);
  const result = InvoiceSchema.safeParse(raw);

  if (!result.success) {
    // Log the failure for review, return null to trigger manual processing
    console.error("Invoice extraction failed validation", result.error);
    return null;
  }

  return result.data;
}
```
Rate limiting. OpenAI has per-minute token limits. If you don't enforce your own limits upstream, you will hit theirs — and your users will see 429 errors. For vatnode.dev I implemented a 30 req/min per-IP limit with Redis. Same pattern applies to any AI-backed endpoint.
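The core of a per-key limit fits in a small class. This is an in-memory fixed-window sketch, not the Redis implementation mentioned above: `FixedWindowLimiter` is a hypothetical name, and a shared store such as Redis is needed as soon as more than one server instance handles traffic.

```typescript
// In-memory fixed-window limiter. A Redis-backed version would keep the
// same counters in Redis so that all server instances share them.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(
    limit: number,
    windowMs: number,
    now: () => number = () => Date.now(),
  ) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true if the request is allowed, false if it should get a 429.
  allow(key: string): boolean {
    const t = this.now();
    const current = this.windows.get(key);
    if (!current || t - current.start >= this.windowMs) {
      // First request from this key, or the previous window has expired.
      this.windows.set(key, { start: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}

// 30 requests per minute per IP, matching the limit described above.
const aiEndpointLimiter = new FixedWindowLimiter(30, 60_000);
```

A route handler would call `aiEndpointLimiter.allow(ip)` before touching the model and return a 429 itself when it gets `false`, so the upstream limit is never reached.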
Cost caps. At scale, uncapped AI calls will surprise you. Set hard spending limits in the OpenAI dashboard and implement soft limits in your application — track token usage per tenant, fire an alert when a customer approaches a threshold, reject requests when they exceed it. I covered the full set of production cost controls in Keeping OpenAI Costs Under Control in Production.
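The application-side soft limit can be sketched as a per-tenant token budget. The `TenantTokenBudget` class, the three decision values, and the 80% alert ratio are assumptions for illustration. A real version would persist usage and reset it each billing period.

```typescript
type BudgetDecision = "allow" | "allow-and-alert" | "reject";

class TenantTokenBudget {
  private used = new Map<string, number>();
  private readonly monthlyLimit: number;
  private readonly alertRatio: number;

  constructor(monthlyLimit: number, alertRatio = 0.8) {
    this.monthlyLimit = monthlyLimit;
    this.alertRatio = alertRatio;
  }

  // Check a request expected to consume `tokens` and record it if allowed.
  consume(tenantId: string, tokens: number): BudgetDecision {
    const next = (this.used.get(tenantId) ?? 0) + tokens;
    if (next > this.monthlyLimit) {
      // Over the cap: reject without recording the usage.
      return "reject";
    }
    this.used.set(tenantId, next);
    return next >= this.monthlyLimit * this.alertRatio
      ? "allow-and-alert"
      : "allow";
  }
}
```

The caller fires the alert on `"allow-and-alert"` and returns an error to the tenant on `"reject"`. The hard spending limit in the provider dashboard stays in place as the last line of defense.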
Graceful fallbacks. Every AI call should have a defined fallback. For a smart search feature, the fallback is conventional keyword search. For a document extractor, the fallback is a manual review queue. A model outage should degrade the feature, not crash the page.
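Timeouts. Streaming responses can hang. Set an explicit timeout on every LLM call; AbortSignal.timeout(15000) is a reasonable starting point for generation. Background workers need their own limit as well: BullMQ does not ship a built-in per-job timeout option, so enforce one inside the processor, for example by passing an AbortSignal to the LLM call there too.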
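A minimal sketch of a deadline wrapper built on `AbortSignal.timeout`. The `withTimeout` helper is a hypothetical name. It assumes a runtime that provides `AbortSignal.timeout` (current Node.js and browsers do), and it only frees resources if the wrapped operation actually honors the signal it is given.

```typescript
// Run an abortable operation with a hard deadline. The operation receives
// the signal so it can cancel its own work, for example by passing it to
// fetch or to an SDK call that accepts an abort signal.
function withTimeout<T>(
  op: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const signal = AbortSignal.timeout(ms);
  const deadline = new Promise<never>((_, reject) => {
    signal.addEventListener("abort", () => reject(signal.reason), {
      once: true,
    });
  });
  // Whichever settles first wins; a timeout rejects with a TimeoutError.
  return Promise.race([op(signal), deadline]);
}
```

The same wrapper can bound an LLM call inside a route handler or inside a background job processor. In both places the caller catches the `TimeoutError` and takes the fallback path.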
Fine-Tuning: Almost Always Not Worth It
Fine-tuning means training a new version of a model on your own data. It sounds appealing — a model that "knows your domain." In practice, it's expensive, slow to iterate, and almost never necessary.
For domain knowledge, use RAG. Update the knowledge base without retraining anything.
For output formatting, use structured output with a strict schema (OpenAI's response_format: { type: "json_schema" } or Vercel AI SDK's generateObject). No fine-tuning needed.
Fine-tuning is worth considering only in two narrow cases: you need to radically change the model's style or personality for a specific audience, or you have thousands of labeled examples of a classification task and latency matters enough that you need a smaller custom model. That's not the case for most products.
The Evaluation Problem
How do you know your AI feature actually works?
This is the question most teams skip until the customer complains. "It seemed fine in testing" is not a quality signal.
For RAG, I measure the percentage of queries that clear the confidence threshold and are never escalated to a human. At Pikkuna, a query counts as "resolved" if the user doesn't open a support ticket within 10 minutes of receiving the chatbot answer. After 90 days in production, 70% of queries were resolved. That's a real number, not a demo metric.
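The 10-minute rule translates into a short computation over two event logs. This is a sketch with assumed data shapes: `ChatAnswer`, `Ticket`, and `resolutionRate` are hypothetical names, and timestamps are taken to be epoch milliseconds.

```typescript
interface ChatAnswer {
  userId: string;
  answeredAt: number; // epoch ms
}

interface Ticket {
  userId: string;
  openedAt: number; // epoch ms
}

const TEN_MINUTES_MS = 10 * 60 * 1000;

// Share of chatbot answers NOT followed by a ticket from the same user
// within the window.
function resolutionRate(
  answers: ChatAnswer[],
  tickets: Ticket[],
  windowMs: number = TEN_MINUTES_MS,
): number {
  if (answers.length === 0) return 0;
  const resolved = answers.filter(
    (a) =>
      !tickets.some(
        (t) =>
          t.userId === a.userId &&
          t.openedAt >= a.answeredAt &&
          t.openedAt - a.answeredAt <= windowMs,
      ),
  ).length;
  return resolved / answers.length;
}
```

The nested scan is quadratic, which is fine for a daily report. For large logs you would index tickets by user first.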
For document extraction, track validation failure rates per document type. A 2% failure rate is acceptable; 15% means the extraction prompt needs work or the document type needs a different approach.
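Tracking this per document type only needs the validation outcome of each extraction. A minimal sketch, assuming each outcome is logged with its document type; `ExtractionOutcome`, `failureRates`, and `typesNeedingWork` are hypothetical names.

```typescript
interface ExtractionOutcome {
  documentType: string;
  valid: boolean; // did the output pass schema validation?
}

// Validation failure rate per document type, as a fraction between 0 and 1.
function failureRates(outcomes: ExtractionOutcome[]): Map<string, number> {
  const totals = new Map<string, { total: number; failed: number }>();
  for (const o of outcomes) {
    const entry = totals.get(o.documentType) ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (!o.valid) entry.failed += 1;
    totals.set(o.documentType, entry);
  }
  const rates = new Map<string, number>();
  for (const [type, { total, failed }] of totals) {
    rates.set(type, failed / total);
  }
  return rates;
}

// Document types whose failure rate is above the given threshold.
function typesNeedingWork(rates: Map<string, number>, maxRate: number): string[] {
  return [...rates]
    .filter(([, rate]) => rate > maxRate)
    .map(([type]) => type)
    .sort();
}
```

Splitting by type matters because a healthy overall rate can hide one template that fails most of the time.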
For UX augmentation, compare the target metric before and after: search success rate, task completion rate, time to find a result. A/B test if you have enough traffic.
Pick your metric before you ship. Otherwise you'll be guessing whether it works.
What to Avoid
Don't rebuild working systems. If your backend handles orders reliably, add AI as a layer on top. Rewriting the order system to "be more AI-native" breaks things that work and delivers nothing to users.
Don't over-promise accuracy. LLMs make mistakes. Any feature that presents AI output as authoritative — without a way for users to correct it — will erode trust when the model is wrong. Always design a correction path.
Don't start with fine-tuning. See above.
Don't put AI in the critical path without a fallback. If the OpenAI API goes down and your checkout breaks, you've made a bad architectural decision that has nothing to do with AI.
Don't skip evaluation. "It works in the demo" is the worst possible signal. Demo inputs are cherry-picked. Production inputs will find every edge case.
Results
The three production systems I've built with AI integration:
| System | Pattern | Key metric |
|---|---|---|
| Pikkuna chatbot | RAG | 70% ticket resolution, P95 <500ms |
| htpbe.tech | Document processing | <9 sec analysis, files up to 10 MB |
| pi-pi.ee classification | UX augmentation | 32 EU markets, 0 per-market prompt variants |
None of these required rewriting existing systems. Each one added a layer.
AI works as an interface layer between your existing systems and new capabilities — not as a new foundation to rebuild around. The integrations that ship successfully start narrow, measure honestly, and add complexity only when a specific problem demands it.
If you're building a product for the EU market and want to add AI features without risking what already works — get in touch. I've built these integrations across SaaS, e-commerce, and document processing systems, and I can tell you within one conversation whether your use case fits a RAG pattern, document extraction, or something simpler.
I'm available for freelance projects and long-term engagements.
Related reading: Building a i18n RAG Chatbot — the full technical implementation of the Pikkuna support chatbot, including ingestion pipeline, hybrid search, and streaming UI.