6 JavaScript Patterns That Turn LLM APIs Into Production AI Systems

Most developers can call an LLM API in ten lines of code. Very few can turn that call into a reliable production system.

These six JavaScript patterns are what separate a demo AI feature from something you can actually ship in a product.


Streaming Responses Instead of Waiting for the Whole Completion

Most tutorials show a blocking request. The user waits several seconds and then receives the full response.

Before

import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

async function generateReply(message) {
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }]
  })

  return response.content[0].text
}

The user sees nothing until the model has generated the entire response.

After

import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

async function streamReply(message, res) {
  res.setHeader("Content-Type", "text/event-stream")
  res.setHeader("Cache-Control", "no-cache")
  res.setHeader("Connection", "keep-alive")

  const stream = await client.messages.stream({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }]
  })

  for await (const chunk of stream) {
    if (
      chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta"
    ) {
      // JSON-encode each delta so newlines inside it don't break SSE framing
      res.write(`data: ${JSON.stringify(chunk.delta.text)}\n\n`)
    }
  }

  res.end()
}

Streaming reduces perceived latency dramatically. Users see output within hundreds of milliseconds instead of waiting several seconds.
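
On the client, a minimal sketch of consuming that stream with the browser's EventSource. The /api/reply route and the output element are hypothetical, for illustration only:

// Hypothetical GET route that calls streamReply(message, res) on the server
const source = new EventSource(
  "/api/reply?message=" + encodeURIComponent("Summarize this article")
)

source.onmessage = (event) => {
  // Each SSE event carries one JSON-encoded text delta
  document.getElementById("output").textContent += JSON.parse(event.data)
}

// The server ends the response when the stream finishes; stop reconnecting
source.onerror = () => source.close()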


Retry Logic for Model Failures

LLM APIs fail more often than typical REST services. Rate limits, timeouts, and safety refusals happen frequently.

Before

async function generateSummary(text) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: text }]
  })

  return response.choices[0].message.content
}

One failure means the feature breaks.

After

import OpenAI from "openai"

const openai = new OpenAI()

async function generateSummary(text, retries = 3, delay = 1000) {
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: text }]
    })

    return response.choices[0].message.content
  } catch (error) {
    if (retries > 0) {
      // Exponential backoff: wait 1s, 2s, 4s before successive retries
      await new Promise(r => setTimeout(r, delay))
      return generateSummary(text, retries - 1, delay * 2)
    }

    throw error
  }
}

Retries with backoff prevent transient failures from breaking your application.
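
One refinement worth making: a safety refusal or an invalid request will fail the same way every time, so only transient errors deserve a retry. A small guard, assuming the SDK error exposes an HTTP status field as the official OpenAI Node SDK does:

function isRetryable(error) {
  // Rate limits (429) and server-side errors (5xx) are transient;
  // 4xx errors like invalid requests will fail identically on retry
  return error?.status === 429 || (error?.status >= 500 && error?.status < 600)
}

Then gate the retry with if (retries > 0 && isRetryable(error)) so permanent failures surface immediately.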


Embeddings + Vector Search Instead of Huge Prompts

Many developers try to stuff entire documents into prompts. That fails quickly because of context limits.

Before

async function answerQuestion(question, docs) {
  const prompt = `
  Answer the question using this documentation:

  ${docs.join("\n\n")}

  Question: ${question}
  `

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }]
  })

  return response.choices[0].message.content
}

Large documentation sets overflow context windows and increase costs.

After

import OpenAI from "openai"
import { Pinecone } from "@pinecone-database/pinecone"

const openai = new OpenAI()
const pinecone = new Pinecone()

async function semanticSearch(query) {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query
  })

  // Retrieve the five most similar chunks from the "docs" index
  const results = await pinecone.index("docs").query({
    vector: embedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  })

  return results.matches.map(m => m.metadata.text)
}

Then only relevant content is passed to the model.

async function answerQuestion(question) {
  const context = await semanticSearch(question)

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Use this context:\n${context.join("\n\n")}`
      },
      { role: "user", content: question }
    ]
  })

  return response.choices[0].message.content
}

This Retrieval-Augmented Generation (RAG) pattern is the backbone of most production AI applications.
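
For semanticSearch to return anything, the documentation has to be embedded and upserted first. A minimal indexing sketch under the same assumptions (a Pinecone index named "docs", with the raw text kept in metadata):

async function indexDocs(docs) {
  for (const [i, text] of docs.entries()) {
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: text
    })

    // Keep the original text in metadata so queries can return it directly
    await pinecone.index("docs").upsert([
      {
        id: `doc-${i}`,
        values: embedding.data[0].embedding,
        metadata: { text }
      }
    ])
  }
}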


Tool Calling Instead of Guessing

Without tools, models hallucinate answers. With tools, they can fetch real data.

Before

async function getWeather(city) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "user", content: `What is the weather in ${city}?` }
    ]
  })

  return response.choices[0].message.content
}

The model invents answers.

After

const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get real weather data",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string" }
        },
        required: ["city"]
      }
    }
  }
]
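
The tools array goes on the request itself. A minimal sketch, assuming messages holds the conversation so far:

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  tools
})

// The model may respond with a tool call instead of text
const toolCall = response.choices[0].message.tool_calls?.[0]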

Then execute the tool call.

// Echo the assistant's tool-call message into the history first;
// the API requires it before any role: "tool" message
messages.push(response.choices[0].message)

if (toolCall?.function.name === "get_weather") {
  // Arguments arrive as a JSON string and must be parsed
  const args = JSON.parse(toolCall.function.arguments)
  const result = await weatherAPI(args.city)

  messages.push({
    role: "tool",
    tool_call_id: toolCall.id,
    content: JSON.stringify(result)
  })
}
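
With the tool result in the history, a second request lets the model compose its final answer from real data. A minimal sketch:

const followUp = await openai.chat.completions.create({
  model: "gpt-4o",
  messages
})

console.log(followUp.choices[0].message.content)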

Tool calling grounds the model in real systems and eliminates most hallucinated answers.


Cost Control with Embedding Caching

Generating embeddings repeatedly can become expensive at scale.

Before

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  })

  return response.data[0].embedding
}

Every request generates a new embedding.

After

import Redis from "ioredis"
import OpenAI from "openai"

const redis = new Redis()
const openai = new OpenAI()

async function getEmbedding(text) {
  const cacheKey = `embedding:${text}`

  // Return the cached vector if this exact text was embedded before
  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  })

  const embedding = response.data[0].embedding

  await redis.set(cacheKey, JSON.stringify(embedding))

  return embedding
}

Caching can reduce embedding costs by 60 to 80 percent in real systems.
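
Two refinements worth considering: hash the key so long inputs don't become multi-kilobyte Redis keys, and set an expiry so the cache doesn't grow without bound. A sketch using Node's built-in crypto module (the 30-day TTL is an arbitrary choice):

import { createHash } from "node:crypto"

function embeddingKey(text) {
  // SHA-256 keeps the key short and fixed-length regardless of input size
  return "embedding:" + createHash("sha256").update(text).digest("hex")
}

async function cacheEmbedding(text, embedding) {
  // "EX" sets a TTL in seconds so stale entries eventually expire
  await redis.set(
    embeddingKey(text),
    JSON.stringify(embedding),
    "EX",
    60 * 60 * 24 * 30
  )
}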


Structured Output Instead of Free Text

Parsing raw text responses is fragile.

Before

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Extract name and email" }]
})

const text = response.choices[0].message.content

Now your application must parse unpredictable text.

After

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "contact",
      // Strict mode enforces the schema; it requires additionalProperties: false
      strict: true,
      schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" }
        },
        required: ["name", "email"],
        additionalProperties: false
      }
    }
  },
  messages: [{ role: "user", content: "Extract name and email" }]
})

With strict mode enabled, the output is guaranteed to match your schema.
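
Consuming it is then a single JSON.parse with no defensive text handling:

const contact = JSON.parse(response.choices[0].message.content)

console.log(contact.name, contact.email)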

Structured outputs remove a large class of parsing bugs from AI integrations.


JavaScript developers already understand async APIs, streaming data, caching layers, and distributed systems. AI engineering mostly means applying those same patterns to LLM infrastructure.

If you are building AI features with JavaScript today, the bigger productivity shift comes from learning how to work alongside AI tools themselves. The workflow patterns in the AI-augmented JavaScript developer guide explain how teams are actually shipping these systems in production.

Pick one of the patterns above and implement it this week. Most AI demos break in production. These patterns are how you prevent that.
