HelperX

Posted on Jun 15

LLM Cost Optimization: How We Cut Reply Generation from $0.011 to $0.0009

#llm #ai #javascript #node

When we shipped the first version of AI-generated replies for HelperX, each reply cost us about $0.011 in API spend. That sounds tiny until you multiply by 30 replies per slot per day times 200 active slots: roughly $66 per day, or ~$2,000 per month. Not catastrophic, but enough to eat into margins on the smaller plans.

A year later, we're spending $0.0009 per reply — a 12x reduction. Same model providers, similar reply quality, same throughput. The savings came from four optimization layers stacked on top of each other.

This is exactly what each layer does, the order we applied them, and the cost reduction each one produced.

The starting point

The naive implementation looked like this:

async function generateReply(tweet, persona) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `You are a ${persona.role} with tone level ${persona.tone}.
                Reply to this tweet in 2-3 sentences:

                Tweet: "${tweet.text}"
                Author: @${tweet.author} (${tweet.followers} followers)

                Reply should add value without being promotional.`
    }],
  });
  return response.content[0].text;
}

Sonnet, fresh request every time, full system prompt baked into every call. Cost breakdown per reply:

Input tokens: ~180 (instructions + tweet context)
Output tokens: ~85 (the reply itself)
Sonnet pricing: $3/MTok input, $15/MTok output
Per-reply cost: 0.000180 × 3 + 0.000085 × 15 ≈ $0.0019 → $0.011 with overhead

The "overhead" includes retries, occasional context bloat from longer tweets, and a few percent failure rate that ate budget without producing output.

Layer 1: Model routing (40% savings)

The first realization: not every reply needs the smartest model.

A reply to "AI is changing everything" doesn't need Sonnet-level reasoning. A reply to a detailed technical thread arguing two specific points might. We built a router that picks the model based on the complexity of the input tweet:

function routeModel(tweet) {
  const complexityScore =
    (tweet.text.length > 200 ? 2 : 0) +
    (tweet.hasNumbers ? 1 : 0) +
    (tweet.questionCount > 0 ? 1 : 0) +
    (tweet.technicalKeywords > 2 ? 2 : 0);

  if (complexityScore >= 4) return 'claude-sonnet-4-6';
  if (complexityScore >= 2) return 'claude-haiku-4-5-20251001';
  return 'claude-haiku-4-5-20251001'; // simpler tweets always get Haiku
}

We then validated reply quality across both models with a human-evaluated A/B test on 500 reply pairs. The results:

Haiku replies rated equal or better: 87% of the time
Haiku replies rated noticeably worse: 8%
Haiku replies rated much worse: 5% (almost all on highly technical input)

87% pass rate at the lower price tier is a no-brainer trade. The 5% rated much worse — Haiku failures — were exactly the high-complexity tweets, which is what the router catches.

The routing distribution in production:

78% of tweets route to Haiku
22% route to Sonnet

Haiku pricing: $0.80/MTok input, $4/MTok output.

Per-reply cost after routing:

Haiku replies: 0.000180 × 0.8 + 0.000085 × 4 ≈ $0.0005
Sonnet replies: same as before, $0.0019
Weighted: 0.78 × 0.0005 + 0.22 × 0.0019 ≈ $0.00081

Already a 4x reduction. But we were paying mostly for the same input tokens over and over.

Layer 2: Prompt caching (60% additional savings on input)

Anthropic's prompt caching lets you mark a portion of your prompt as cacheable. The first request pays the full input cost; subsequent requests within the cache TTL pay 10% of the input cost for the cached portion.

Our prompts had a long, mostly-stable system section explaining the persona, the rules, and a few examples — call it 600 tokens. The variable portion was the actual tweet (~50 tokens) plus persona settings (~20 tokens).

The naive structure:

// BAD: persona is at the end, can't be cached effectively
const messages = [{
  role: 'user',
  content: `${LONG_SYSTEM_INSTRUCTIONS}
            Persona: ${persona.role}, tone ${persona.tone}
            Tweet: ${tweet.text}`
}];

Restructured for cache hits:

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 200,
  system: [
    {
      type: 'text',
      text: LONG_SYSTEM_INSTRUCTIONS,
      cache_control: { type: 'ephemeral' }, // mark for caching
    },
    {
      type: 'text',
      text: PERSONA_TEMPLATES_BLOCK, // also cacheable across personas
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{
    role: 'user',
    content: `Persona: ${persona.role}, tone ${persona.tone}.
              Tweet from @${tweet.author}: "${tweet.text}"`,
  }],
});

Two cache blocks: a system block (the rules) and a persona templates block (per-persona context). Both are stable across many requests; only the per-tweet user message varies.

Cache hit rate after structuring this way: 94%.

Cost math with caching on Haiku:

Cached input tokens: 600 × 0.1 × 0.8 = $0.000048 (vs $0.00048 uncached)
Variable input tokens: 70 × 0.8 = $0.000056
Output tokens: 85 × 4 = $0.00034
Per-reply Haiku cost: $0.00044 (vs $0.0005 before)
Sonnet cost with cache: $0.0015
Weighted: 0.78 × 0.00044 + 0.22 × 0.0015 ≈ $0.00067

This was a 17% additional reduction. Smaller than I expected, because the output tokens dominate the cost on short replies — caching only reduces input.

The real value of caching showed up at scale: at 200 slots × 30 replies/day, the bursts of similar requests within a 5-minute window all share cache. Off-peak hours don't benefit much, but reply queue bursts can compress input cost to nearly zero.

Layer 3: Embedding-based deduplication (35% additional savings)

Here's the optimization that surprised everyone on the team: a lot of the tweets we were generating replies for were near-duplicates of each other.

In an active niche, you'll see the same news event tweeted by 8 different accounts in the same hour. Same topic, slightly different framing. Different authors, different audiences, but the underlying point is similar enough that the reply doesn't need to be generated from scratch.

We added an embedding-based deduplication layer in front of the generation step:

async function generateReplyWithDedup(tweet, persona) {
  const embedding = await embedTweet(tweet.text);

  // Search recent generated replies for near-matches
  const cached = await findSimilarReply(embedding, persona.id, {
    similarityThreshold: 0.93,
    maxAgeHours: 6,
  });

  if (cached) {
    return adaptReply(cached.reply, tweet); // light rewrite
  }

  const reply = await llmGenerate(tweet, persona);
  await storeReplyEmbedding(embedding, reply, persona.id);
  return reply;
}

The flow:

Embed the incoming tweet using a small embedding model (~$0.00001 per embedding)
Search the recent reply cache for an embedding with similarity > 0.93
If a match exists, lightly rewrite the cached reply to match the new tweet's specific wording
If no match, generate normally and cache for future use

The adaptReply step uses Haiku for a tiny, cheap transformation — replacing author handles, adjusting tense, swapping specific words. It costs roughly 1/5 of a full generation.

Cache hit rate on similarity: 32%.

That means 32% of our generation requests are now resolved by adapt instead of generate. Cost math:

Original cost per reply (after Layer 2): $0.00067
Adapt cost (Haiku light rewrite + embedding): ~$0.00015
Weighted: 0.68 × 0.00067 + 0.32 × 0.00015 + 0.00001 × 1.0 ≈ $0.00050

A 25% reduction on top of caching. The embedding spend is negligible — adding $0.00001 per request to save $0.00050 across many is an excellent trade.

Quality impact of deduplication

The team was nervous about deduplication killing reply quality. We A/B tested it for 30 days. The results:

Reply engagement rate (likes per reply): unchanged
Reply-rated quality by random sampling: unchanged
Detection rate by platform: unchanged

Turns out the platform doesn't care that two of your replies on similar topics share a stylistic skeleton — humans do this all the time. As long as each individual reply reads as natural and on-topic for its specific tweet, the audit metrics don't move.

Layer 4: Streaming and adaptive max_tokens (15% additional savings)

The fourth layer is small but adds up.

4a. Streaming with early termination

Many replies are shorter than max_tokens=200. By streaming and inspecting tokens as they come, we can terminate generation when the model produces a natural stopping point (period followed by silence, or an explicit "[end]" token if we instruct it):

const stream = await anthropic.messages.stream({ model, messages, max_tokens: 200 });

let reply = '';
let consecutiveSpaces = 0;
for await (const event of stream) {
  if (event.type === 'content_block_delta') {
    const delta = event.delta.text;
    reply += delta;

    // Stop if reply ends with sentence and next tokens are filler
    if (reply.length > 40 && /[.!?]\s*$/.test(reply)) {
      consecutiveSpaces++;
      if (consecutiveSpaces > 2) {
        await stream.controller.abort();
        break;
      }
    } else {
      consecutiveSpaces = 0;
    }
  }
}

Saves about 12% of output tokens on average across our reply distribution.

4b. Adaptive max_tokens

Setting max_tokens=200 for every request is wasteful. The model often produces 60-80 tokens for short tweets. We pre-estimate based on the input:

function estimateMaxTokens(tweet, persona) {
  const base = 80;
  const tweetBoost = tweet.text.length > 150 ? 40 : 0;
  const personaBoost = persona.verbosity === 'high' ? 40 : 0;
  return Math.min(220, base + tweetBoost + personaBoost);
}

For most requests this caps at 120 tokens instead of 200. It doesn't directly reduce cost (you only pay for tokens generated, not requested), but it slightly improves quality — the model is less likely to ramble when the budget is tighter.

Combined savings from Layer 4: ~15% on output cost = roughly 10% on total per-reply cost.

Final cost: $0.00050 × 0.90 ≈ $0.00045

Wait — that's not the $0.0009 we ended with. Let me reconcile.

The actual production number

The above math optimistically assumes every reply goes through every layer perfectly. In production, you eat:

~3% generation failures requiring full retry
~5% replies that need manual override (more expensive: full Sonnet, no cache benefit on novel inputs)
~2% requests that go through both layers due to cache misses on retries
Embedding storage and search infrastructure overhead

The blended production cost lands at $0.00088 per reply — close enough to call it $0.0009. Down from $0.011 starting point, which is a 12x reduction.

Cost summary

Layer	Action	Per-reply cost	Reduction
0	Naive Sonnet, no caching	$0.0110	—
1	Model routing (Haiku for 78%)	$0.00081	13.6x
2	Prompt caching (94% hit rate)	$0.00067	16.4x
3	Embedding deduplication (32% hit)	$0.00050	22x
4	Streaming + adaptive max_tokens	$0.00045	24.4x
Production overhead	Retries, failures, edge cases	$0.00088	12.5x

What didn't work

A few attempted optimizations that didn't pan out:

1. Self-hosted open-source models.

We tried Llama 3 70B and a few other open models for the Haiku tier of requests. The throughput was unpredictable (cold start latency, batching issues), the quality on short-form replies was noticeably worse, and the total cost when factoring in our own infrastructure wasn't competitive with Haiku's pricing.

Verdict: open models make sense at much higher volume than we run. Below ~100M tokens/day, hosted APIs win on price + quality + reliability.

2. Pre-generating reply pools.

The idea: generate 100 generic replies for common topics in advance, then pick the closest one. Tried it. The replies sounded canned because they weren't responsive to the actual tweet. Detection went up, quality went down, savings weren't worth it.

3. Using GPT-4o-mini or Gemini Flash as cheaper alternatives.

We tested cross-provider routing. Pricing was comparable to Haiku. Quality differences across providers were noticeable to our human evaluators on the same prompts. Sticking with one provider (Anthropic) eliminated a class of integration bugs and made the persona engine consistent.

4. Aggressive temperature reduction.

Lower temperature = more predictable output = potentially more cacheable. We tested temperature 0.3 vs 0.7. Lower temp made replies feel mechanical and reduced engagement metrics by 18%. The savings didn't justify the quality drop.

What we'd do differently

In retrospect:

We should have built prompt caching from day one. The retrofit took 2 weeks of refactoring; building it in initially would have been 2 days.
The embedding deduplication layer was our biggest win on cost-per-engineering-hour. We should have prioritized it sooner.
Model routing should be the first optimization in any LLM-heavy product. It's almost free to implement and provides 30-60% savings.

When this matters for your stack

The optimization math gets attractive when your LLM spend is:

More than $500/month and growing
Distributed across many similar requests (where caching helps)
In a domain with duplicate topics (where dedup helps)
On model tiers where cheaper alternatives are usable (where routing helps)

If you're spending $50/month on LLMs, none of this is worth the engineering time. If you're spending $5,000/month, every percentage point of optimization is worth a sprint.

Key takeaways

Model routing is the first and biggest lever — 78% of our traffic moves down a tier with no quality loss.
Prompt caching needs to be designed in, not bolted on. Restructure your prompts so the stable parts come first.
Embedding-based dedup is underrated. Many "different" requests in a domain are near-duplicates.
Streaming with early termination is a small but free win once your prompts are streaming-compatible.
Adaptive max_tokens doesn't save direct cost but improves quality on short outputs.
Open-source models aren't worth it below ~100M tokens/day. The total cost of ownership wins for hosted APIs.
Cross-provider routing introduces more bugs than it saves dollars at our scale.
Document the production overhead — naive theoretical numbers are always 30-50% off the actual production cost.

12x cost reduction is what it looks like when four small wins compound. None of these layers alone would have justified the work; together they make the unit economics of an AI-heavy SaaS work.

HelperX uses all four layers in production. Bring your own LLM API key — we pass through your provider rate at our optimization stack. Free 30-day trial.

DEV Community