The $10,000 Lesson: Building Cost-Efficient AI Features with Function Calling and Caching

#ai #llm #costoptimization #nextjs

I remember the exact moment a client saw the AI pipeline cost. It was a Tuesday morning, and the number made them say "shut it down."

That pipeline was rewriting job descriptions for a platform with over a million listings. The idea was solid: use a capable LLM to turn raw ATS text into structured, SEO-friendly content. But the cost per listing added up fast. The feature was technically impressive and completely uneconomical.

That was a hard lesson. But it taught me something I've used on every AI project since: building cost-efficient AI features isn't about picking the cheapest model. It's about architecture. You can cut costs dramatically without cutting quality if you design the system right.

Here's what actually works.

Function Calling Cuts Token Waste by Half or More

The biggest hidden cost in AI features is generating text you don't need. Most developers send a prompt and let the LLM write a paragraph of commentary when all they need is a structured data point. Across a bulk pipeline, that filler adds up to thousands of output tokens you pay for and throw away.

Function calling (or structured output) fixes this. You tell the model exactly what fields to return, and it outputs only those fields in JSON. No fluff.

Here's the pattern I use in production:

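import OpenAI from "openai";

// The client reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "Extract structured data from the raw job description." },
    { role: "user", content: rawDescription }
  ],
  // `tools` and `tool_choice` replace the deprecated `functions` and `function_call` parameters.
  tools: [{
    type: "function",
    function: {
      name: "extract_job_listing",
      parameters: {
        type: "object",
        properties: {
          title: { type: "string" },
          salary_min: { type: "number" },
          salary_max: { type: "number" },
          remote: { type: "boolean" },
          skills: { type: "array", items: { type: "string" } }
        },
        required: ["title", "remote", "skills"]
      }
    }
  }],
  // Force the model to call this function instead of replying in prose.
  tool_choice: { type: "function", function: { name: "extract_job_listing" } }
});

// The arguments arrive as a JSON string on the first tool call.
const result = JSON.parse(completion.choices[0].message.tool_calls[0].function.arguments);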

No introductory paragraph. No "Sure, here is the extracted data." Just the JSON object. On a bulk extraction pipeline, this cuts token usage significantly compared to freeform prompts asking for the same data.

The side benefit is reliability. When you enforce a schema, you leave the model far less room to return fields that shouldn't exist. I've used conditional schema flags (like presence guards) to stop the model from fabricating content for sections the source doesn't contain, for example ensuring it never invents a previous job or education entry. That's not just cost optimization. It's trust.
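One way to read "presence guards" is a boolean flag paired with each optional section, checked in code after the model responds. The sketch below is only that reading, not the production schema: the `ResumeExtraction` shape, the `has_*` flag names, and `applyPresenceGuards` are all hypothetical.

```typescript
// Hypothetical extraction result: each optional section is paired with a
// boolean "presence guard" the model must set from the source text.
interface ResumeExtraction {
  name: string;
  has_previous_job: boolean;
  previous_job?: { company: string; title: string };
  has_education: boolean;
  education?: { school: string; degree: string };
}

// Drop any guarded section whose flag is false, so a fabricated entry
// never reaches storage even if the model filled it in anyway.
function applyPresenceGuards(raw: ResumeExtraction): ResumeExtraction {
  const cleaned: ResumeExtraction = { ...raw };
  if (!cleaned.has_previous_job) delete cleaned.previous_job;
  if (!cleaned.has_education) delete cleaned.education;
  return cleaned;
}

const modelOutput: ResumeExtraction = {
  name: "Jane Doe",
  has_previous_job: false,
  previous_job: { company: "Acme", title: "Engineer" }, // filled in despite the false flag
  has_education: true,
  education: { school: "State University", degree: "BSc" },
};

const safe = applyPresenceGuards(modelOutput);
console.log(safe.previous_job); // undefined
console.log(safe.education?.degree); // "BSc"
```

The guard costs one boolean per section in the output, which is small next to the cost of storing a made-up work history.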

Caching Is Free Money, but Most People Do It Wrong

Everyone knows to cache LLM responses. But the naive approach (keyed by the exact prompt string) misses most of the savings.

The trick is to cache at multiple levels.

Embedding cache. If you're embedding documents for a RAG pipeline, you pay again every time the same text is re-embedded, whether that's an unchanged document chunk on re-index or a query you've already seen. Store the embedding vectors in a database and look them up by content hash before calling the API. The initial seeding period is the most expensive; after that, repeated text hits the cache and costs nothing, so embedding spend drops as the cache warms up.
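A minimal sketch of the lookup-before-call pattern, assuming a Node.js runtime. The `Map` stands in for the database table, and `Embedder` is a hypothetical wrapper around whatever embeddings API you call; neither comes from a real pipeline.

```typescript
import { createHash } from "node:crypto";

// Hypothetical signature for the function that calls the embeddings API.
type Embedder = (text: string) => Promise<number[]>;

// In-memory stand-in for a database table keyed by content hash.
const embeddingStore = new Map<string, number[]>();

// Identical text always produces the same key; any change produces a new one.
function contentHash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

async function getEmbedding(text: string, embed: Embedder): Promise<number[]> {
  const key = contentHash(text);
  const cached = embeddingStore.get(key);
  if (cached) return cached; // cache hit: no API call, no cost

  const vector = await embed(text); // cache miss: pay once
  embeddingStore.set(key, vector);
  return vector;
}
```

Note that a content hash only matches byte-identical text. Two differently worded questions still produce two embeddings, so this cache pays off on re-indexing and on repeated queries, not on paraphrases.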

LLM response cache with semantic keying. Exact prompt matching is too brittle. One user might ask "summarize this" and another "give me a summary." Those should hit the same cache entry. Normalize the prompt first (lowercase it, collapse whitespace, and map known phrasing variants to one canonical task), then use a deterministic hash of the normalized prompt and the function call parameters as the key. Store the entry in a cache like Redis with a TTL tied to data freshness: 24 hours for stable content, 1 hour for rapidly changing data.
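Here is a rough sketch of that keying, assuming a Node.js runtime. The `TASK_ALIASES` table is a hypothetical, hand-maintained way to make the two example phrasings collide; lowercasing and whitespace cleanup alone would not do it. The helper names are illustrative too.

```typescript
import { createHash } from "node:crypto";

// Hypothetical phrasing table: variants that should resolve to one task.
const TASK_ALIASES: Record<string, string> = {
  "summarize this": "summarize",
  "give me a summary": "summarize",
};

function normalizePrompt(prompt: string): string {
  const cleaned = prompt
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop punctuation
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
  return TASK_ALIASES[cleaned] ?? cleaned;
}

// Serialize with sorted keys so parameter order cannot change the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => stableStringify(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value);
}

function cacheKey(prompt: string, params: Record<string, unknown>): string {
  const payload = `${normalizePrompt(prompt)}|${stableStringify(params)}`;
  return createHash("sha256").update(payload).digest("hex");
}

// TTLs from the text: 24 hours for stable content, 1 hour for volatile data.
function ttlSeconds(freshness: "stable" | "volatile"): number {
  return freshness === "stable" ? 24 * 60 * 60 : 60 * 60;
}

const keyA = cacheKey("Summarize this!", { docId: "42", maxWords: 100 });
const keyB = cacheKey("give me a   summary", { maxWords: 100, docId: "42" });
console.log(keyA === keyB); // true
```

With Redis, the value from `ttlSeconds` would be passed as the expiry when writing the entry, for example through the `EX` option of `SET`.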

Smart cache invalidation. This is where most people fail. They cache forever and serve stale data. Set cache TTLs based on the data source. If the underlying data changes (new job listings, updated user profile), invalidate the cache for that specific key. This prevents the "I updated my resume but the AI still sees the old version" problem.
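A sketch of source-scoped invalidation, using an in-memory cache for illustration. The class name, the `sourceId` convention (something like `"resume:17"`), and the explicit `now` parameter are assumptions made for the example. In Redis, the same idea is usually a set of cache keys per source.

```typescript
// Cache entries remember which data source they were derived from, so a
// change to that source can evict exactly the affected entries.
class SourceTaggedCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private keysBySource = new Map<string, Set<string>>();

  set(key: string, value: V, sourceId: string, ttlMs: number, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + ttlMs });
    const keys = this.keysBySource.get(sourceId) ?? new Set<string>();
    keys.add(key);
    this.keysBySource.set(sourceId, keys);
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // the TTL is the backstop for missed invalidations
      return undefined;
    }
    return entry.value;
  }

  // Call this when the underlying record changes. Returns how many keys
  // were tracked for the source (some may already have expired).
  invalidateSource(sourceId: string): number {
    const keys = this.keysBySource.get(sourceId);
    if (!keys) return 0;
    keys.forEach((key) => this.entries.delete(key));
    this.keysBySource.delete(sourceId);
    return keys.size;
  }
}
```

Wiring `invalidateSource("resume:17")` into the resume update handler is what prevents the stale-resume problem: the next request misses the cache and regenerates from fresh data.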

The total impact: on a production system handling many LLM calls daily, caching can eliminate a large portion of API calls after the initial warmup. Many requests never touch the API after the first identical query.

Batch API and Prompt Compression for Heavy Workloads

OpenAI's Batch API offers a 50% discount on most models. The tradeoff is latency: results come back within a 24-hour completion window, not in seconds. That's perfect for nightly enrichment jobs, not for user-facing chat.

On a large job board platform, I moved the description rewrite pipeline to Batch API. Processing thousands of listings overnight cut the cost per listing in half compared to synchronous calls. Even at that rate, the overall cost was significant, which is why we're evaluating cheaper models like DeepSeek V4 Flash (roughly 23x cheaper than GPT-4.1) for that workload.
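For a sense of what a nightly batch looks like, here is a sketch that builds the JSONL input file. Each line is one request with a `custom_id`, `method`, `url`, and `body`, which is the format the Batch API expects. The `Listing` shape, the system prompt, and the 800-token cap are placeholders, not the real pipeline's values.

```typescript
// Hypothetical listing shape for the example.
interface Listing {
  id: string;
  rawDescription: string;
}

// One JSONL line per listing. custom_id is how results are matched back
// to listings, since batch output order is not guaranteed.
function toBatchLine(listing: Listing, model: string): string {
  return JSON.stringify({
    custom_id: `listing-${listing.id}`,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model,
      max_tokens: 800, // assumed cap for illustration
      messages: [
        { role: "system", content: "Rewrite the raw job description as structured, SEO-friendly content." },
        { role: "user", content: listing.rawDescription },
      ],
    },
  });
}

function buildBatchFile(listings: Listing[], model: string): string {
  return listings.map((listing) => toBatchLine(listing, model)).join("\n");
}

const batchFile = buildBatchFile(
  [
    { id: "101", rawDescription: "Sr. React dev.\nRemote OK." },
    { id: "102", rawDescription: "Warehouse associate, night shift." },
  ],
  "gpt-4o-mini",
);
console.log(batchFile.split("\n").length); // 2
```

From there the file is uploaded with purpose `batch` and submitted as a batch job against `/v1/chat/completions` with a `24h` completion window. A separate job polls for completion and joins the results back on `custom_id`.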

Prompt compression is another lever. Strip out unnecessary context. If the system prompt is long and you're sending many requests, every token you remove from the system prompt multiplies across every request. I've trimmed prompts by a noticeable margin just by removing redundant instructions and using shorter examples.
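The multiplication is worth doing explicitly before deciding how much time to spend trimming. A small sketch follows; the function name and all three input numbers are illustrative, so plug in your own volume and your provider's current input price.

```typescript
// Monthly savings from removing tokens from a prompt that is sent on every request.
function monthlyPromptSavingsUsd(
  tokensRemoved: number,
  requestsPerMonth: number,
  inputPricePerMillionUsd: number,
): number {
  return (tokensRemoved * requestsPerMonth * inputPricePerMillionUsd) / 1e6;
}

// Illustrative numbers: trimming 400 tokens, 1,000,000 requests a month,
// and an assumed input price of $0.15 per million tokens.
const savings = monthlyPromptSavingsUsd(400, 1_000_000, 0.15);
console.log(savings.toFixed(2)); // "60.00"
```

At a small-model price the saving is modest. The same 400 tokens on a model that costs ten times more per input token is ten times the saving, so compression matters most on the expensive tiers.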

Model Selection: When to Pay for 4o and When to Use Flash

I maintain a simple decision tree:

Complex reasoning, legal, or finance tasks: GPT-4o or 4.1. The output quality justifies the cost.
Structured data extraction, classification, summarization: GPT-4o mini or Gemini 2.0 Flash. They're fast and cheap.
Bulk processing with loose quality requirements: DeepSeek V4 Flash. At roughly 23x cheaper than GPT-4.1, it's economical for pipelines where occasional errors are acceptable.
Real-time, high-volume, moderate quality: Gemini 2.0 Flash. Its free tier offers generous limits, and the paid rate is lower than GPT-4o mini.

Suppose a client insists on using a top-tier model for everything. Switching extraction tasks to a cheaper model and chat responses to a cost-effective provider can drop the monthly bill dramatically. Quality, measured by user satisfaction, barely moves.
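In code, the decision tree above can be as plain as a lookup table. This is a sketch: the `TaskKind` names are invented for the example, and `"deepseek-v4-flash"` is a placeholder identifier, so check the provider's documentation for the real model id.

```typescript
// Task categories mirroring the decision tree in the text.
type TaskKind =
  | "complex_reasoning"
  | "extraction"
  | "classification"
  | "summarization"
  | "bulk_low_stakes"
  | "realtime_high_volume";

const MODEL_BY_TASK: Record<TaskKind, string> = {
  complex_reasoning: "gpt-4o", // legal, finance, multi-step reasoning
  extraction: "gpt-4o-mini",
  classification: "gpt-4o-mini",
  summarization: "gpt-4o-mini",
  bulk_low_stakes: "deepseek-v4-flash", // placeholder id, occasional errors acceptable
  realtime_high_volume: "gemini-2.0-flash",
};

function pickModel(task: TaskKind): string {
  return MODEL_BY_TASK[task];
}

console.log(pickModel("extraction")); // "gpt-4o-mini"
console.log(pickModel("complex_reasoning")); // "gpt-4o"
```

Keeping the mapping in one table means a pricing change or a new model is a one-line edit. The `Record<TaskKind, string>` type also makes the compiler flag any task category that has no model assigned.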

Guardrails Prevent Runaway Costs

The most expensive bug is an infinite loop. An AI agent that retries on failure with no retry cap, or a user who spams the generate button, can burn through significant money in minutes.

I set three hard guardrails on every AI feature:

Per-request token limits. Hard cap on max_tokens. Never let the model decide how long to answer.
Rate limiting per user. Reasonable limits on requests per minute and per day on generation endpoints.
Cost alerts. A simple script that checks the daily API usage and sends a notification if it exceeds a threshold. A runaway prompt can cause the model to generate overly long responses, and a cost alert catches it before it escalates.
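The three guardrails above can be sketched in a few lines. All names, limits, and thresholds here are illustrative assumptions. The in-memory limiter only works for a single process, so a multi-instance deployment would keep the counters in a shared store such as Redis.

```typescript
// 1. Per-request cap: pass this as max_tokens on every completion call.
const MAX_OUTPUT_TOKENS = 800;

// 2. Per-user sliding-window rate limiter.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true and records the request if the user is under the limit.
  allow(userId: string, now: number = Date.now()): boolean {
    const recent = (this.hits.get(userId) ?? []).filter((t) => now - t < this.windowMs);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.hits.set(userId, recent);
    return allowed;
  }
}

// 3. Cost alert: run on a schedule against the day's spend so far.
function shouldAlert(dailySpendUsd: number, thresholdUsd: number): boolean {
  return dailySpendUsd >= thresholdUsd;
}

const limiter = new SlidingWindowLimiter(2, 60_000); // 2 requests per minute
console.log(limiter.allow("user-1", 0)); // true
console.log(limiter.allow("user-1", 1_000)); // true
console.log(limiter.allow("user-1", 2_000)); // false
console.log(shouldAlert(120, 100)); // true
```

Rejected requests are not recorded, so a user hammering the button does not extend their own lockout. They simply get through again once the oldest accepted request leaves the window.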

These are not theoretical. I've seen a pipeline burn through hundreds of dollars faster than expected because of a missing guardrail. Now I never ship an AI feature without all three.

If your team is wrestling with AI feature costs and shipping slower because of it, that's the kind of thing I help with. I've been building production AI pipelines, breaking them, and figuring out what actually works. Happy to compare notes.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.