6 JavaScript Patterns That Turn LLM APIs Into Production AI Systems

Most developers can call an LLM API in ten lines of code. Very few can turn that call into a reliable production system.

These six JavaScript patterns are what separate a demo AI feature from something you can actually ship in a product.


Streaming Responses Instead of Waiting for the Whole Completion

Most tutorials show a blocking request. The user waits several seconds and then receives the full response.

Before

import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

async function generateReply(message) {
  const response = await client.messages.create({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }]
  })

  return response.content[0].text
}

The user sees nothing until the model has generated the entire response.

After

import Anthropic from "@anthropic-ai/sdk"

const client = new Anthropic()

async function streamReply(message, res) {
  res.setHeader("Content-Type", "text/event-stream")
  res.setHeader("Cache-Control", "no-cache")
  res.setHeader("Connection", "keep-alive")

  const stream = await client.messages.stream({
    model: "claude-opus-4-6",
    max_tokens: 1024,
    messages: [{ role: "user", content: message }]
  })

  for await (const chunk of stream) {
    if (
      chunk.type === "content_block_delta" &&
      chunk.delta.type === "text_delta"
    ) {
      // JSON-encode each delta so newlines inside it don't break SSE framing
      res.write(`data: ${JSON.stringify(chunk.delta.text)}\n\n`)
    }
  }

  res.end()
}

Streaming reduces perceived latency dramatically. Users see output within hundreds of milliseconds instead of waiting several seconds.
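
On the client, a minimal sketch of consuming that stream with the browser's EventSource. The /api/reply route and the output element are hypothetical, for illustration only:

// Hypothetical GET route that calls streamReply(message, res) on the server
const source = new EventSource(
  "/api/reply?message=" + encodeURIComponent("Summarize this article")
)

source.onmessage = (event) => {
  // Each SSE event carries one JSON-encoded text delta
  document.getElementById("output").textContent += JSON.parse(event.data)
}

// The server ends the response when the stream finishes; stop reconnecting
source.onerror = () => source.close()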


Retry Logic for Model Failures

LLM APIs fail more often than typical REST services. Rate limits, timeouts, and safety refusals happen frequently.

Before

async function generateSummary(text) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: text }]
  })

  return response.choices[0].message.content
}

One failure means the feature breaks.

After

import OpenAI from "openai"

const openai = new OpenAI()

async function generateSummary(text, retries = 3, delay = 1000) {
  try {
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: text }]
    })

    return response.choices[0].message.content
  } catch (error) {
    if (retries > 0) {
      // Exponential backoff: wait 1s, 2s, 4s before successive retries
      await new Promise(r => setTimeout(r, delay))
      return generateSummary(text, retries - 1, delay * 2)
    }

    throw error
  }
}

Retries with backoff prevent transient failures from breaking your application.
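
One refinement worth making: a safety refusal or an invalid request will fail the same way every time, so only transient errors deserve a retry. A small guard, assuming the SDK error exposes an HTTP status field as the official OpenAI Node SDK does:

function isRetryable(error) {
  // Rate limits (429) and server-side errors (5xx) are transient;
  // 4xx errors like invalid requests will fail identically on retry
  return error?.status === 429 || (error?.status >= 500 && error?.status < 600)
}

Then gate the retry with if (retries > 0 && isRetryable(error)) so permanent failures surface immediately.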


Embeddings + Vector Search Instead of Huge Prompts

Many developers try to stuff entire documents into prompts. That fails quickly because of context limits.

Before

async function answerQuestion(question, docs) {
  const prompt = `
  Answer the question using this documentation:

  ${docs.join("\n\n")}

  Question: ${question}
  `

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: prompt }]
  })

  return response.choices[0].message.content
}

Large documentation sets overflow context windows and increase costs.

After

import OpenAI from "openai"
import { Pinecone } from "@pinecone-database/pinecone"

const openai = new OpenAI()
const pinecone = new Pinecone()

async function semanticSearch(query) {
  const embedding = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: query
  })

  // Retrieve the five most similar chunks from the "docs" index
  const results = await pinecone.index("docs").query({
    vector: embedding.data[0].embedding,
    topK: 5,
    includeMetadata: true
  })

  return results.matches.map(m => m.metadata.text)
}

Then only relevant content is passed to the model.

async function answerQuestion(question) {
  const context = await semanticSearch(question)

  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      {
        role: "system",
        content: `Use this context:\n${context.join("\n\n")}`
      },
      { role: "user", content: question }
    ]
  })

  return response.choices[0].message.content
}

This Retrieval-Augmented Generation (RAG) pattern is the backbone of most production AI applications.
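
For semanticSearch to return anything, the documentation has to be embedded and upserted first. A minimal indexing sketch under the same assumptions (a Pinecone index named "docs", with the raw text kept in metadata):

async function indexDocs(docs) {
  for (const [i, text] of docs.entries()) {
    const embedding = await openai.embeddings.create({
      model: "text-embedding-3-small",
      input: text
    })

    // Keep the original text in metadata so queries can return it directly
    await pinecone.index("docs").upsert([
      {
        id: `doc-${i}`,
        values: embedding.data[0].embedding,
        metadata: { text }
      }
    ])
  }
}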


Tool Calling Instead of Guessing

Without tools, models hallucinate answers. With tools, they can fetch real data.

Before

async function getWeather(city) {
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "user", content: `What is the weather in ${city}?` }
    ]
  })

  return response.choices[0].message.content
}

The model invents answers.

After

const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get real weather data",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string" }
        },
        required: ["city"]
      }
    }
  }
]
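
The tools array goes on the request itself. A minimal sketch, assuming messages holds the conversation so far:

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  tools
})

// The model may respond with a tool call instead of text
const toolCall = response.choices[0].message.tool_calls?.[0]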

Then execute the tool call.

// Echo the assistant's tool-call message into the history first;
// the API requires it before any role: "tool" message
messages.push(response.choices[0].message)

if (toolCall?.function.name === "get_weather") {
  // Arguments arrive as a JSON string and must be parsed
  const args = JSON.parse(toolCall.function.arguments)
  const result = await weatherAPI(args.city)

  messages.push({
    role: "tool",
    tool_call_id: toolCall.id,
    content: JSON.stringify(result)
  })
}
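
With the tool result in the history, a second request lets the model compose its final answer from real data. A minimal sketch:

const followUp = await openai.chat.completions.create({
  model: "gpt-4o",
  messages
})

console.log(followUp.choices[0].message.content)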

Tool calling grounds the model in real systems and eliminates most hallucinated answers.


Cost Control with Embedding Caching

Generating embeddings repeatedly can become expensive at scale.

Before

async function getEmbedding(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  })

  return response.data[0].embedding
}

Every request generates a new embedding.

After

import Redis from "ioredis"
import OpenAI from "openai"

const redis = new Redis()
const openai = new OpenAI()

async function getEmbedding(text) {
  const cacheKey = `embedding:${text}`

  // Return the cached vector if this exact text was embedded before
  const cached = await redis.get(cacheKey)
  if (cached) return JSON.parse(cached)

  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text
  })

  const embedding = response.data[0].embedding

  await redis.set(cacheKey, JSON.stringify(embedding))

  return embedding
}

Caching can reduce embedding costs by 60 to 80 percent in real systems.
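
Two refinements worth considering: hash the key so long inputs don't become multi-kilobyte Redis keys, and set an expiry so the cache doesn't grow without bound. A sketch using Node's built-in crypto module (the 30-day TTL is an arbitrary choice):

import { createHash } from "node:crypto"

function embeddingKey(text) {
  // SHA-256 keeps the key short and fixed-length regardless of input size
  return "embedding:" + createHash("sha256").update(text).digest("hex")
}

async function cacheEmbedding(text, embedding) {
  // "EX" sets a TTL in seconds so stale entries eventually expire
  await redis.set(
    embeddingKey(text),
    JSON.stringify(embedding),
    "EX",
    60 * 60 * 24 * 30
  )
}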


Structured Output Instead of Free Text

Parsing raw text responses is fragile.

Before

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Extract name and email" }]
})

const text = response.choices[0].message.content

Now your application must parse unpredictable text.

After

const response = await openai.chat.completions.create({
  model: "gpt-4o",
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "contact",
      // Strict mode enforces the schema; it requires additionalProperties: false
      strict: true,
      schema: {
        type: "object",
        properties: {
          name: { type: "string" },
          email: { type: "string" }
        },
        required: ["name", "email"],
        additionalProperties: false
      }
    }
  },
  messages: [{ role: "user", content: "Extract name and email" }]
})

With strict mode enabled, the output is guaranteed to match your schema.
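
Consuming it is then a single JSON.parse with no defensive text handling:

const contact = JSON.parse(response.choices[0].message.content)

console.log(contact.name, contact.email)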

Structured outputs remove a large class of parsing bugs from AI integrations.


JavaScript developers already understand async APIs, streaming data, caching layers, and distributed systems. AI engineering mostly means applying those same patterns to LLM infrastructure.

If you are building AI features with JavaScript today, the bigger productivity shift comes from learning how to work alongside AI tools themselves. The workflow patterns in the AI-augmented JavaScript developer guide explain how teams are actually shipping these systems in production.

Pick one of the patterns above and implement it this week. Most AI demos break in production. These patterns are how you prevent that.
