Most developers can call an LLM API in ten lines of code. Very few can turn that call into a reliable production system.
These six JavaScript patterns are what separate a demo AI feature from something you can actually ship in a product.
Streaming Responses Instead of Waiting for the Whole Completion
Most tutorials show a blocking request. The user waits several seconds and then receives the full response.
Before
import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

async function generateReply(message) {
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }]
  })
  return response.content[0].text
}
The UI freezes while the model generates the entire response.
After
import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

async function streamReply(message, res) {
  res.setHeader("Content-Type", "text/event-stream")
  res.setHeader("Cache-Control", "no-cache")
  res.setHeader("Connection", "keep-alive")
  const stream = await client.messages.stream({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }]
  })
  for await (const chunk of stream) {
    if (
      chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta"
    ) {
      // caveat: a fragment containing a blank line would break this naive
      // SSE framing; JSON-encode the payload if your outputs can contain one
      res.write(`data: ${chunk.delta.text}\n\n`)
    }
  }
  res.end()
}
Streaming reduces perceived latency dramatically. Users see output within hundreds of milliseconds instead of waiting several seconds.
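The stream has to be consumed incrementally on the client too. Here is a minimal browser-side sketch, assuming the handler above is mounted at a hypothetical POST `/api/chat` route and each SSE event carries a plain text fragment:

```javascript
// Splits buffered SSE data into complete events plus an incomplete tail.
function extractEvents(buffer) {
  // SSE events are separated by a blank line; the tail may be a partial
  // event, so hand it back to be prepended to the next network read
  const parts = buffer.split("\n\n")
  const rest = parts.pop()
  const texts = parts.filter(p => p.startsWith("data: ")).map(p => p.slice(6))
  return { texts, rest }
}

async function consumeStream(message, onText) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ message })
  })
  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ""
  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    const { texts, rest } = extractEvents(
      buffer + decoder.decode(value, { stream: true })
    )
    buffer = rest
    texts.forEach(onText) // render each fragment as it arrives
  }
}
```

For example, `consumeStream(prompt, t => outputEl.textContent += t)` appends tokens to a DOM node as they arrive, where `outputEl` is whatever element your UI renders into.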
Retry Logic for Model Failures
LLM APIs fail more often than typical REST services. Rate limits, timeouts, and safety refusals happen frequently.
Before
import OpenAI from "openai"

const openai = new OpenAI()

async function generateSummary(text) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: text }]
  })
  return response.choices[0].message.content
}
One failure means the feature breaks.
After
async function generateSummary(text, retries = 3, delayMs = 1000) {
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: text }]
    })
    return response.choices[0].message.content
  } catch (error) {
    if (retries > 0) {
      // exponential backoff: wait 1s, then 2s, then 4s between attempts
      await new Promise(r => setTimeout(r, delayMs))
      return generateSummary(text, retries - 1, delayMs * 2)
    }
    throw error
  }
}
Retries with backoff prevent transient failures from breaking your application.
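Retrying everything is wasteful: a malformed request will fail identically every time. A sketch of a more selective policy, assuming the SDK attaches an HTTP `status` property to thrown errors (the OpenAI Node SDK does on its `APIError`), with jitter so concurrent clients do not retry in lockstep:

```javascript
// 429 (rate limit) and 5xx (server-side failure) can succeed on a second
// attempt; other 4xx input errors will not heal, so fail fast on those.
function isRetryable(error) {
  return error.status === 429 || (error.status >= 500 && error.status < 600)
}

// Exponential delay (1s, 2s, 4s, ...) plus up to 250ms of random jitter.
function backoffDelay(attempt, baseMs = 1000) {
  return baseMs * 2 ** attempt + Math.random() * 250
}

// Generic wrapper: retry any async operation under this policy.
async function withRetries(fn, maxAttempts = 4) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (error) {
      if (attempt >= maxAttempts - 1 || !isRetryable(error)) throw error
      await new Promise(r => setTimeout(r, backoffDelay(attempt)))
    }
  }
}
```

Usage would look like `await withRetries(() => openai.chat.completions.create({ ... }))`, keeping the retry policy in one place instead of inside every feature function.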
Embeddings + Vector Search Instead of Huge Prompts
Many developers try to stuff entire documents into prompts. That fails quickly because of context limits.
Before
async function answerQuestion(question, docs) {
  const prompt = `
Answer the question using this documentation:

${docs.join("\n\n")}

Question: ${question}
`
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }]
  })
  return response.choices[0].message.content
}
Large documentation sets overflow context windows and increase costs.
After
async function semanticSearch(query) {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query
  })
  const results = await pinecone.index("docs").query({
    vector: embedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  })
  return results.matches.map(m => m.metadata.text)
}
Then only relevant content is passed to the model.
async function answerQuestion(question) {
  const context = await semanticSearch(question)
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Use this context:\n${context.join("\n\n")}`
      },
      { role: "user", content: question }
    ]
  })
  return response.choices[0].message.content
}
This Retrieval-Augmented Generation (RAG) pattern is the backbone of production AI applications.
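The query side assumes documents were already embedded and stored. A sketch of the indexing side, reusing the `openai` and `pinecone` clients from the snippets above, with a deliberately naive paragraph-based splitter (real systems usually chunk by tokens with overlap):

```javascript
// Pack paragraphs into chunks of at most maxChars characters.
// Caveat: a single paragraph longer than maxChars becomes an oversized chunk.
function chunkText(text, maxChars = 1000) {
  const chunks = []
  let current = ""
  for (const para of text.split("\n\n")) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current)
      current = para
    } else {
      current = current ? current + "\n\n" + para : para
    }
  }
  if (current) chunks.push(current)
  return chunks
}

async function indexDocument(id, text) {
  const chunks = chunkText(text)
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: chunks // the embeddings API accepts an array: one vector per chunk
  })
  await pinecone.index("docs").upsert(
    response.data.map((d, i) => ({
      id: `${id}-${i}`,
      values: d.embedding,
      metadata: { text: chunks[i] } // stored so semanticSearch can return it
    }))
  )
}
```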
Tool Calling Instead of Guessing
Without tools, models hallucinate answers. With tools, they can fetch real data.
Before
async function getWeather(city) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "user", content: `What is the weather in ${city}?` }
    ]
  })
  return response.choices[0].message.content
}
The model invents answers.
After
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get real weather data",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string" }
        },
        required: ["city"]
      }
    }
  }
]
Then, when the model responds with a tool call, execute it and append the result to the conversation.
if (toolCall.function.name === "get_weather") {
  // arguments arrive as a JSON string and must be parsed
  const args = JSON.parse(toolCall.function.arguments)
  const result = await weatherAPI(args.city)
  messages.push({
    role: "tool",
    tool_call_id: toolCall.id, // links the result to the model's request
    content: JSON.stringify(result)
  })
}
Tool calling grounds the model in real systems and eliminates most hallucinated answers.
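Putting the pieces together requires two API calls: one where the model decides to use the tool, and one where it turns the tool result into an answer. A sketch of that round trip, assuming the `tools` array above, an `openai` client in scope, and a hypothetical `weatherAPI(city)` helper:

```javascript
// Builds the tool-result message the API expects for a given tool call.
function toolResultMessage(toolCall, result) {
  return {
    role: "tool",
    tool_call_id: toolCall.id, // ties this result to the model's request
    content: JSON.stringify(result)
  }
}

async function getWeather(city) {
  const messages = [
    { role: "user", content: `What is the weather in ${city}?` }
  ]
  const first = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools
  })
  const reply = first.choices[0].message
  messages.push(reply) // keep the assistant turn that requested the tool
  for (const toolCall of reply.tool_calls ?? []) {
    if (toolCall.function.name === "get_weather") {
      const args = JSON.parse(toolCall.function.arguments)
      messages.push(toolResultMessage(toolCall, await weatherAPI(args.city)))
    }
  }
  // second call lets the model phrase the tool output as an answer
  const second = await openai.chat.completions.create({
    model: "gpt-4o",
    messages,
    tools
  })
  return second.choices[0].message.content
}
```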
Cost Control with Embedding Caching
Generating embeddings repeatedly can become expensive at scale.
Before
async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  })
  return response.data[0].embedding
}
Every request generates a new embedding.
After
import Redis from "ioredis"

const redis = new Redis()

async function getEmbedding(text) {
  const cacheKey = `embedding:${text}`
  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  })
  const embedding = response.data[0].embedding
  // expire entries after a day so the cache cannot grow without bound
  await redis.set(cacheKey, JSON.stringify(embedding), "EX", 60 * 60 * 24)
  return embedding
}
In systems where the same inputs recur, the cache absorbs those calls entirely, which can cut embedding costs dramatically.
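Using the raw text as the Redis key works, but long documents make unwieldy keys. A sketch of a hashed key helper layered on the cache above, using Node's built-in `crypto` module (the whitespace normalization is an assumption about your data, so drop it if exact input matters to your embeddings):

```javascript
import { createHash } from "node:crypto"

// Hashing keeps Redis keys short and fixed-length even for long documents,
// and normalizing whitespace lets trivially different inputs share one entry.
function embeddingCacheKey(text) {
  const normalized = text.trim().replace(/\s+/g, " ")
  return "embedding:" + createHash("sha256").update(normalized).digest("hex")
}
```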
Structured Output Instead of Free Text
Parsing raw text responses is fragile.
Before
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Extract name and email" }]
})

const text = response.choices[0].message.content
Now your application must parse unpredictable text.
After
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "contact",
      strict: true, // required for the schema to be enforced exactly
      schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" }
        },
        required: ["name", "email"],
        additionalProperties: false // strict mode requires this
      }
    }
  },
  messages: [{ role: "user", content: "Extract name and email" }]
})
Now the output is guaranteed to match your schema.
Structured outputs remove a large class of parsing bugs from AI integrations.
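The schema-constrained reply still arrives as a JSON string in `message.content`, so one parse yields a typed object. A small sketch (in production, also check the message's `refusal` field before parsing):

```javascript
// Parses the structured-output reply into a plain object.
// Safe to parse directly because the schema is enforced server-side.
function parseContact(response) {
  return JSON.parse(response.choices[0].message.content)
}
```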
JavaScript developers already understand async APIs, streaming data, caching layers, and distributed systems. AI engineering mostly means applying those same patterns to LLM infrastructure.
If you are building AI features with JavaScript today, the bigger productivity shift comes from learning how to work alongside AI tools themselves. The workflow patterns in the AI-augmented JavaScript developer guide explain how teams are actually shipping these systems in production.
Pick one of the patterns above and implement it this week. Most AI demos break in production. These patterns are how you prevent that.