# Cutting Your AI Bill by 40%: Cost Optimization Patterns in TypeScript
Your AI infrastructure doesn't have to be a black hole for your budget.
Let's face it: AI costs are spiraling out of control. That "quick experiment" with GPT-4o just burned through your monthly API budget. The AI-powered feature you shipped last quarter? It's now your single largest infrastructure expense.
I've been there. At Juspay, we process millions of AI requests daily across 13 different providers. Our first naive implementation was costing us $47,000 per month just in API calls. Today, we run the same workload for $18,000 — a 62% reduction — without sacrificing quality.
Here's the blueprint we built into NeuroLink, our universal AI SDK for TypeScript.
## The Hidden Cost Multipliers
Before diving into solutions, let's understand why AI costs explode:
- Model over-provisioning: Using GPT-4o for tasks that Gemini Flash handles flawlessly
- Redundant calls: Same tool invocations repeated across sessions
- Token bloat: Conversations that grow until they hit context limits
- Blind spots: No visibility into which requests are expensive
| Model | Input (1M tokens) | Output (1M tokens) | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, coding |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Long context, analysis |
| Gemini 2.5 Flash | $0.15 | $0.60 | Speed, high volume |
| Gemini 3 Pro | $1.25 | $10.00 | Balanced performance |
The gap between the cheapest and most expensive model is **20x** for input tokens and **25x** for output tokens.
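To make that spread concrete, here's some back-of-the-envelope arithmetic using prices from the table above (the monthly token volumes are invented for illustration):

```typescript
// Price per 1M tokens, taken from the comparison table above.
const pricing = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "claude-3.5-sonnet": { input: 3.0, output: 15.0 },
  "gemini-2.5-flash": { input: 0.15, output: 0.6 },
};

// Dollar cost for a given token count on a given model.
function requestCost(
  model: keyof typeof pricing,
  inputTokens: number,
  outputTokens: number
): number {
  const p = pricing[model];
  return (
    (inputTokens / 1_000_000) * p.input +
    (outputTokens / 1_000_000) * p.output
  );
}

// A hypothetical month: 10M input tokens, 2M output tokens.
console.log(requestCost("gpt-4o", 10_000_000, 2_000_000)); // 45.00
console.log(requestCost("gemini-2.5-flash", 10_000_000, 2_000_000)); // 2.70
```

Same workload, $45 versus $2.70. That multiplier is what every pattern below is exploiting.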
## Pattern 1: Intelligent Model Routing
The simplest win: route requests to the cheapest model that can handle them.
NeuroLink's cost optimization mode does this automatically:
```typescript
import { NeuroLink } from "@juspay/neurolink";

// Automatic cost-aware routing
const neurolink = new NeuroLink({
  enableOrchestration: true,
});

// Simple tasks → cheapest model
const greeting = await neurolink.generate({
  input: { text: "Say hello in Spanish" },
  // Automatically routes to Gemini Flash ($0.15/1M input)
});

// Complex tasks → capable model
const analysis = await neurolink.generate({
  input: { text: "Analyze this quarterly financial report and identify risks" },
  // Automatically routes to GPT-4o or Claude Sonnet
});
```
CLI equivalent:
```bash
# Force cost optimization
npx @juspay/neurolink generate "Summarize this text" --optimize-cost
```
Real-world impact: We saw a 34% cost reduction just by enabling automatic routing for customer support chatbots. Simple FAQ responses hit Gemini Flash; complex escalations use Claude.
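Under the hood, routing like this comes down to a complexity heuristic. Here's a hand-rolled sketch of the idea; the keyword list, length threshold, and model names are my own illustrative choices, not NeuroLink's actual routing logic:

```typescript
// Crude complexity heuristic: long prompts or analysis-flavored
// keywords go to a capable model; everything else goes to the
// cheap one. Thresholds here are illustrative guesses.
function pickModel(prompt: string): string {
  const complexHints = ["analyze", "refactor", "prove", "architect"];
  const looksComplex =
    prompt.length > 2000 ||
    complexHints.some((w) => prompt.toLowerCase().includes(w));
  return looksComplex ? "claude-3-5-sonnet" : "gemini-2.5-flash";
}

console.log(pickModel("Say hello in Spanish")); // gemini-2.5-flash
console.log(pickModel("Analyze this quarterly report")); // claude-3-5-sonnet
```

A real router would also consider required context length, tool use, and latency budgets, but even a heuristic this crude captures most of the savings.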
## Pattern 2: Tool Result Caching
Tool calls are expensive. The same database query shouldn't cost you $0.05 every single time.
NeuroLink's ToolCache provides production-grade result caching with multiple eviction strategies:
```typescript
import { ToolCache, ToolResultCache } from "@juspay/neurolink";

// Initialize cache with LRU eviction
const cache = new ToolCache({
  ttl: 5 * 60 * 1000, // 5 minutes
  maxSize: 1000,
  strategy: "lru", // Least Recently Used
  enableAutoCleanup: true,
});

// Cache-aside pattern (recommended)
const userData = await cache.getOrSet(
  "getUserById:123",
  async () => {
    // Only executes on a cache miss
    return await expensiveDatabaseQuery(123);
  },
  30000 // Custom TTL: 30 seconds for this entry
);

// Specialized wrapper for tool results
const resultCache = new ToolResultCache({
  ttl: 120000,
  strategy: "lfu", // Least Frequently Used
});

resultCache.cacheResult("getUserById", { id: 123 }, { name: "Alice" });
const cached = resultCache.getCachedResult("getUserById", { id: 123 });
```
Cache invalidation with patterns:
```typescript
// Invalidate all user cache entries
await cache.invalidate("getUserById:*");

// Monitor performance
const stats = cache.getStats();
console.log(`Hit rate: ${(stats.hitRate * 100).toFixed(1)}%`);
// => Hit rate: 84.2%
```
Real-world impact: Our document processing pipeline went from 12,000 API calls per hour to 800. A 93% reduction.
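The arithmetic behind that drop is worth internalizing: at a given hit rate, only cache misses cost anything. A quick sketch (the per-call cost is an example figure, not a measured one):

```typescript
// Effective hourly cost of a tool at a given cache hit rate.
// Only misses reach the backend and cost money.
function hourlyCost(
  callsPerHour: number,
  hitRate: number,
  costPerCall: number
): number {
  const misses = callsPerHour * (1 - hitRate);
  return misses * costPerCall;
}

// 12,000 calls/hour at $0.05 each, before and after a ~93% hit rate.
console.log(hourlyCost(12_000, 0, 0.05)); // 600 (no cache)
console.log(hourlyCost(12_000, 0.933, 0.05)); // ~40, roughly the 800-call level
```

Note the leverage: every percentage point of hit rate removes that percentage of cost, so even a mediocre cache pays for itself immediately.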
## Pattern 3: Context Compaction
Conversations grow. A session that starts at 500 tokens ends up at 15,000 tokens. You're paying for every single one.
NeuroLink's Context Compaction automatically manages this with a 4-stage pipeline:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  conversationMemory: {
    enabled: true,
    enableSummarization: true,
    contextCompaction: {
      enabled: true,
      threshold: 0.8, // Trigger at 80% context usage
      enablePruning: true, // Stage 1: Remove old tool outputs
      enableDeduplication: true, // Stage 2: Deduplicate file reads
      enableSlidingWindow: true, // Stage 4: Truncate oldest messages
      maxToolOutputBytes: 50 * 1024, // 50KB tool output limit
      fileReadBudgetPercent: 0.6, // 60% of context for files
    },
  },
});
```
The 4-stage pipeline:
| Stage | Action | Cost |
|---|---|---|
| 1. Tool Output Pruning | Replace old results with placeholders | Free |
| 2. File Deduplication | Keep only latest file reads | Free |
| 3. LLM Summarization | Summarize older messages | Cheap (Flash) |
| 4. Sliding Window | Truncate oldest messages | Free |
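To make stage 1 concrete, here's a minimal sketch of what tool-output pruning looks like. The message shape and placeholder string are assumptions for illustration, not NeuroLink's internal types:

```typescript
type Message = { role: "user" | "assistant" | "tool"; content: string };

// Replace tool outputs older than the last N messages with a short
// placeholder, keeping the conversation shape intact.
function pruneToolOutputs(messages: Message[], keepRecent = 4): Message[] {
  const cutoff = messages.length - keepRecent;
  return messages.map((m, i) =>
    m.role === "tool" && i < cutoff
      ? { ...m, content: "[tool output pruned]" }
      : m
  );
}
```

Old tool results rarely matter once the model has already responded to them, which is why this stage is free: no LLM call, immediate token savings.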
Monitor context usage:
```typescript
const stats = await neurolink.getContextStats(
  "session-123",
  "anthropic",
  "claude-sonnet-4-20250514"
);

if (stats) {
  console.log(`Usage: ${(stats.usageRatio * 100).toFixed(0)}%`);
  console.log(`Tokens: ${stats.estimatedInputTokens} / ${stats.availableInputTokens}`);
  console.log(`Needs compaction: ${stats.shouldCompact}`);
}
```
Real-world impact: Our customer support sessions averaged 8,000 tokens before compaction. After: 2,400 tokens. A 70% reduction.
## Pattern 4: Budget Monitoring & Circuit Breakers
You can't optimize what you can't measure. NeuroLink's analytics give you real-time cost visibility:
```typescript
const result = await neurolink.generate({
  input: { text: "Generate a report" },
  enableAnalytics: true,
  enableEvaluation: true,
});

console.log(result.analytics);
// {
//   provider: "google-ai",
//   model: "gemini-2.5-flash",
//   tokens: { input: 150, output: 450, total: 600 },
//   cost: 0.00027, // $0.00027 for this request
//   responseTime: 850,
//   toolsUsed: ["getCurrentTime", "readFile"]
// }
```
Build budget-aware middleware:
```typescript
import { NeuroLink } from "@juspay/neurolink";

class BudgetMiddleware {
  private dailyBudget = 100; // $100/day
  private spentToday = 0; // Reset this on a daily timer in production

  async beforeRequest(options: any) {
    // Estimate cost before execution
    const estimatedCost = this.estimateCost(options);
    if (this.spentToday + estimatedCost > this.dailyBudget) {
      throw new Error(
        `Daily budget exceeded: $${this.spentToday.toFixed(2)} / $${this.dailyBudget}`
      );
    }
    return options;
  }

  async afterResponse(result: any) {
    if (result.analytics?.cost) {
      this.spentToday += result.analytics.cost;

      // Alert on expensive requests
      if (result.analytics.cost > 0.10) {
        console.warn(`High cost request: $${result.analytics.cost}`);
      }
    }
    return result;
  }

  private estimateCost(options: any): number {
    // Rough heuristic: ~4 characters per token at a worst-case
    // price per 1M tokens; replace with your provider's real rates.
    const chars = options?.input?.text?.length ?? 0;
    return (chars / 4 / 1_000_000) * 15;
  }
}
```
CLI cost tracking:
```bash
# Get cost breakdown
npx @juspay/neurolink generate "Analyze this" --enable-analytics --format json | jq '.analytics.cost'
# 0.000012
```
## Pattern 5: Request Batching
Sometimes you can't avoid multiple tool calls. Batch them.
```typescript
import { RequestBatcher } from "@juspay/neurolink";

const batcher = new RequestBatcher({
  maxBatchSize: 10, // Flush when 10 requests are queued
  maxWaitMs: 100, // Or after 100ms, whichever comes first
  enableParallel: true, // Execute batch items in parallel
  groupByServer: true, // Group by server for efficiency
});

// Set up the batch executor; executeToolCall stands in for your
// own tool-execution function
batcher.setExecutor(async (requests) => {
  return Promise.all(
    requests.map(async (r) => {
      const result = await executeToolCall(r.tool, r.args, r.serverId);
      return { success: true, result };
    })
  );
});

// Multiple calls are automatically batched
const [user1, user2, user3] = await Promise.all([
  batcher.add("getUserById", { id: 1 }, "db-server"),
  batcher.add("getUserById", { id: 2 }, "db-server"),
  batcher.add("getUserById", { id: 3 }, "db-server"),
]);
```
Real-world impact: Our data enrichment pipeline dropped from 450 API calls/minute to 45 batched calls. 90% reduction in connection overhead.
## Putting It All Together
Here's a production-ready configuration that implements all patterns:
```typescript
import { NeuroLink, ToolCache } from "@juspay/neurolink";

// Initialize with all optimizations enabled
const neurolink = new NeuroLink({
  // 1. Cost-aware routing
  enableOrchestration: true,

  // 2. Conversation memory with compaction
  conversationMemory: {
    enabled: true,
    enableSummarization: true,
    summarizationProvider: "vertex",
    summarizationModel: "gemini-2.5-flash", // Use a cheap model for summaries
    contextCompaction: {
      enabled: true,
      threshold: 0.75, // Trigger earlier for more savings
    },
  },

  // 3. Analytics for visibility
  enableAnalytics: true,
});

// 4. Set up tool caching
const cache = new ToolCache({
  ttl: 10 * 60 * 1000, // 10 minutes
  maxSize: 2000,
  strategy: "lru",
});

// Wrap expensive operations
async function getCachedData(key: string, fetcher: () => Promise<any>) {
  return cache.getOrSet(key, fetcher, 60000);
}

// Monitor and alert
neurolink.on("generation:complete", (event) => {
  if (event.analytics.cost > 0.05) {
    console.warn(`Expensive request: $${event.analytics.cost} for ${event.model}`);
  }
});
```
## The Bottom Line
| Optimization | Implementation Effort | Typical Savings |
|---|---|---|
| Cost-aware routing | 1 line (`enableOrchestration: true`) | 30-40% |
| Tool caching | ~10 lines | 60-90% |
| Context compaction | ~5 lines | 50-70% |
| Budget monitoring | ~20 lines | Prevents overruns |
| Request batching | ~15 lines | 40-60% |
Combined: Most teams see 40-60% cost reduction within a week of implementation.
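That combined number follows from the fact that savings compound multiplicatively on the remaining spend rather than adding up. With illustrative rates:

```typescript
// Each optimization removes a fraction of whatever spend is left,
// so the remaining spend is the product of (1 - rate) terms.
function combinedSavings(...rates: number[]): number {
  const remaining = rates.reduce((acc, r) => acc * (1 - r), 1);
  return 1 - remaining;
}

// E.g. 34% from routing, then 30% of the remainder from compaction:
console.log(combinedSavings(0.34, 0.3)); // ≈ 0.538, i.e. ~54% off
```

This is also why stacking a third or fourth pattern yields smaller absolute gains: each one only acts on what the previous patterns left behind.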
## Key Takeaways
- Start with visibility: Enable analytics before optimizing. You need data.
- Cache aggressively: Tool results are the biggest win for most applications.
- Let the AI shrink itself: Context compaction runs continuously, saving tokens.
- Route intelligently: Not every request needs a $15/1M token model.
- Set limits: Budget alerts prevent nasty surprises.
The patterns above aren't theoretical — they're running in production at Juspay, processing millions of requests daily. They've cut our AI infrastructure costs by more than half while improving response times.
Your AI bill doesn't have to be a mystery. With the right patterns, it's entirely controllable.
NeuroLink — The Universal AI SDK for TypeScript
- GitHub: github.com/juspay/neurolink
- Install: `npm install @juspay/neurolink`
- Docs: docs.neurolink.ink
- Blog: blog.neurolink.ink — 150+ technical articles