The Pipeline That Worked Too Well
I had to kill a pipeline that was doing exactly what it was supposed to do. It was rewriting job descriptions at scale, improving SEO, and running without errors. The client asked me to shut it down anyway.
The problem wasn't quality. It was cost. Running GPT-class models across a million listings added up faster than anyone expected. The pipeline worked perfectly, and that was the problem. Every perfect run cost money. Over time, the bill became the feature that mattered most.
That moment changed how I think about AI agent architecture. Most teams building AI features into their Next.js SaaS focus on accuracy, latency, and user experience. They forget the fourth dimension: cost per action. And that's the one that kills projects.
Three Ways Naive Agents Waste Your Budget
I've seen these patterns emerge across multiple projects. Here are the three that hurt most.
Redundant context resends. Every time your agent calls the LLM, it sends the system prompt, the conversation history, and the user input. If you have 10 agents running in parallel for different users, you're sending the same system prompt 10 times. At scale, those repeated tokens can make up a large share of your input bill.
No caching strategy. Most teams treat every LLM call as unique. But many calls are identical or nearly identical. Same user query, same context, same expected output. Without caching, you pay full price for every duplicate.
Expensive models for everything. GPT-4 is great for complex reasoning. It's expensive overkill for simple classification, extraction, or rewriting. But most teams use one model for everything because it's easier to build that way. Easy to build, expensive to run.
These three patterns are the reason so many AI features don't survive their first billing cycle. The pipeline I had to shut down suffered from all of them.
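To make "cost per action" concrete, here is a minimal sketch of the arithmetic. The names `ActionProfile`, `inputCostPerAction`, and `systemPromptShare` are hypothetical, the 500-token user input is an assumed figure, and the prices are the per-million input-token figures quoted later in this article, so check them against your provider's current pricing.

```typescript
// Illustrative prices in USD per million input tokens (assumed, verify before use).
const INPUT_PRICE_PER_MILLION = {
  'gpt-4o-mini': 0.15,
  'gpt-4o': 2.5,
} as const;

type PricedModel = keyof typeof INPUT_PRICE_PER_MILLION;

interface ActionProfile {
  model: PricedModel;
  systemPromptTokens: number; // resent on every call
  userInputTokens: number;
  callsPerAction: number; // includes retries and multi-step agent calls
}

// Input-side cost of one user-facing action, in USD.
// Output tokens are ignored to keep the sketch small; real bills include them.
function inputCostPerAction(p: ActionProfile): number {
  const tokens = (p.systemPromptTokens + p.userInputTokens) * p.callsPerAction;
  return (tokens / 1e6) * INPUT_PRICE_PER_MILLION[p.model];
}

// Fraction of the input tokens that is just the repeated system prompt.
function systemPromptShare(p: ActionProfile): number {
  return p.systemPromptTokens / (p.systemPromptTokens + p.userInputTokens);
}

// Hypothetical profile: one rewrite call per listing on the expensive model.
const rewrite: ActionProfile = {
  model: 'gpt-4o',
  systemPromptTokens: 2000,
  userInputTokens: 500,
  callsPerAction: 1,
};

const perListing = inputCostPerAction(rewrite);
console.log(`input cost per listing: $${perListing.toFixed(5)}`);
console.log(`system prompt share of input: ${systemPromptShare(rewrite) * 100}%`);
```

Under these assumed numbers, each listing costs a fraction of a cent in input tokens, but a million listings comes to about $6,250 of input spend, and 80% of those input tokens are the same system prompt sent again.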
Cache What the LLM Already Knows
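The first fix is always caching. If your system prompt is 2,000 tokens and you send it 100 times, that's 200,000 input tokens spent on text the model has already seen. Cache it.
Here's a general pattern that works with any LLM provider: an application-level response cache keyed on the prompt content. It complements provider-side prompt caching, which discounts repeated prompt prefixes. OpenAI and Anthropic both support that, and newer providers are adding it too.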
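import { createHash } from 'node:crypto';
import OpenAI from 'openai';
import Redis from 'ioredis';

// Assumed setup: the official openai SDK and an ioredis client
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Cache key based on prompt content, not just user identity.
// The model is part of the key so different models never share an entry, and
// JSON.stringify keeps the boundaries between the parts unambiguous.
function buildCacheKey(model: string, systemPrompt: string, userInput: string): string {
  const hash = createHash('sha256')
    .update(JSON.stringify([model, systemPrompt, userInput]))
    .digest('hex');
  return `llm:${hash}`;
}

// Check cache before making the API call
async function getCompletion(
  systemPrompt: string,
  userInput: string,
  options: { useCache?: boolean; model?: string } = {}
) {
  const model = options.model || 'gpt-4o-mini';
  const key = buildCacheKey(model, systemPrompt, userInput);

  if (options.useCache) {
    const cached = await redis.get(key);
    if (cached) return JSON.parse(cached);
  }

  const response = await openai.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userInput }
    ]
  });

  if (options.useCache) {
    // Expire after one hour so stale responses age out
    await redis.setex(key, 3600, JSON.stringify(response));
  }
  return response;
}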
This isn't complicated. But most teams skip it because they don't think about caching until the bill arrives. By then, the damage is done.
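To see the effect without Redis or an API key, the same idea can be sketched as an in-memory wrapper around any model-calling function. Everything here is illustrative: `withResponseCache`, `ModelCall`, and `fakeModel` are hypothetical names, and the fake model stands in for a real, billed LLM call.

```typescript
type ModelCall = (systemPrompt: string, userInput: string) => Promise<string>;

interface CacheEntry {
  value: string;
  expiresAt: number; // epoch milliseconds
}

// Wraps a model-calling function with an in-memory TTL cache.
function withResponseCache(
  callModel: ModelCall,
  ttlMs: number,
  now: () => number = Date.now
): ModelCall {
  const cache = new Map<string, CacheEntry>();

  return async (systemPrompt, userInput) => {
    // Serializing a tuple keeps the two parts unambiguous in the key.
    const key = JSON.stringify([systemPrompt, userInput]);
    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;

    const value = await callModel(systemPrompt, userInput);
    cache.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Stand-in for a real LLM call so the sketch runs offline.
let paidCalls = 0;
const fakeModel: ModelCall = async (_systemPrompt, userInput) => {
  paidCalls += 1; // every increment is a call you would be billed for
  return `rewritten: ${userInput}`;
};

// One-hour TTL, matching the Redis example above.
const cachedModel = withResponseCache(fakeModel, 60 * 60 * 1000);
```

Calling `cachedModel` twice with the same prompt and input increments `paidCalls` only once. Note that two identical requests arriving at the same moment would both miss the cache; this sketch does not deduplicate in-flight calls.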
Batch the Cheap Work, Route the Expensive Work
Not every LLM call needs the same horsepower. Classifying a job listing as remote or on-site is a trivial task. Extracting structured data from a legal document is not.
A pattern that works well is a two-tier routing approach. Simple tasks go to a cheap model. Complex tasks go to an expensive one. The router itself is a cheap call that decides where to send the work.
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

type TaskDifficulty = 'simple' | 'complex';

async function routeTask(input: string): Promise<TaskDifficulty> {
  // A quick, cheap call to classify the task
  const classification = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Classify this task as simple or complex. Respond with one word.' },
      { role: 'user', content: input }
    ],
    max_tokens: 5
  });
  // The reply is a chat completion object, so read the text out of it first
  const label = classification.choices[0]?.message.content?.toLowerCase() ?? '';
  // Anything not clearly labeled "complex" goes to the cheap model
  return label.includes('complex') ? 'complex' : 'simple';
}

async function processWithFallback(input: string) {
  const difficulty = await routeTask(input);
  const model = difficulty === 'simple'
    ? 'gpt-4o-mini' // $0.15 per million input tokens
    : 'gpt-4o'; // $2.50 per million input tokens
  // That's roughly a 17x difference in input price
  return openai.chat.completions.create({
    model,
    messages: [{ role: 'user', content: input }]
  });
}
For the job description rewrite pipeline that got shut down, this pattern alone could have made a significant difference. I evaluated DeepSeek V4 Flash as a replacement: roughly 23x cheaper than GPT-4.1, with sufficient quality for the task. The pipeline could have stayed alive with better routing.
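The savings from routing depend on how much of your traffic is actually simple, and the router itself is not free. A rough sketch of that arithmetic, using the two input prices quoted above; `blendedPrice`, `savingsVsAllExpensive`, and the 80% simple share are hypothetical, chosen only for illustration.

```typescript
const PRICE_CHEAP = 0.15; // USD per million input tokens (gpt-4o-mini, as quoted above)
const PRICE_EXPENSIVE = 2.5; // USD per million input tokens (gpt-4o, as quoted above)

// Effective input price per million task tokens with two-tier routing.
// simpleShare: fraction of task tokens sent to the cheap model (0 to 1).
// routerTokenRatio: router input tokens per task token, billed at the cheap
// price (1 means the router reads each input once; 0 means no router).
function blendedPrice(simpleShare: number, routerTokenRatio = 1): number {
  if (simpleShare < 0 || simpleShare > 1) {
    throw new RangeError('simpleShare must be between 0 and 1');
  }
  const routed =
    simpleShare * PRICE_CHEAP + (1 - simpleShare) * PRICE_EXPENSIVE;
  return routed + routerTokenRatio * PRICE_CHEAP;
}

// Fraction saved compared with sending everything to the expensive model.
function savingsVsAllExpensive(simpleShare: number, routerTokenRatio = 1): number {
  return 1 - blendedPrice(simpleShare, routerTokenRatio) / PRICE_EXPENSIVE;
}

// Assumed workload: 80% of task tokens are simple.
const effective = blendedPrice(0.8);
const saved = savingsVsAllExpensive(0.8);
console.log(`effective price: $${effective.toFixed(2)} per million input tokens`);
console.log(`saved: ${(saved * 100).toFixed(0)}%`);
```

With an assumed 80% simple share and a router that reads every input once, the effective input price works out to about $0.77 per million tokens instead of $2.50, a saving of roughly 69%. If almost nothing is simple, the router only adds cost, so it is worth measuring the split before building it.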
Structured Output Reduces Retries
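The most expensive LLM call is the one that gives you bad output and forces a retry. If your agent returns malformed JSON, you pay again to fix it. If it hallucinates a field, you pay to regenerate.
Function calling with strict JSON schemas prevents most of this. With strict mode enabled, the arguments the model returns conform to the schema, so in normal operation there are no partial outputs, no parsing errors, and no retry loops for malformed data. The schema constrains the shape, not the truth of the values, so content checks still matter.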
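import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Strict mode requires every property to be listed in `required` and
// `additionalProperties: false` on every object, so a missing salary is
// expressed as null values rather than an omitted field.
const extractionSchema = {
  name: 'extract_job_details',
  description: 'Extract structured job information from raw text',
  strict: true,
  parameters: {
    type: 'object',
    properties: {
      title: { type: 'string' },
      company: { type: 'string' },
      salary_range: {
        type: 'object',
        properties: {
          min: { type: ['number', 'null'] }, // null when no salary is listed
          max: { type: ['number', 'null'] },
          currency: { type: ['string', 'null'] }
        },
        required: ['min', 'max', 'currency'],
        additionalProperties: false
      },
      remote: { type: 'boolean' }
    },
    required: ['title', 'company', 'salary_range', 'remote'],
    additionalProperties: false
  }
};

async function extractJobDetails(rawJobText: string) {
  // `tools` and `tool_choice` replace the deprecated `functions` and `function_call`
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: rawJobText }],
    tools: [{ type: 'function', function: extractionSchema }],
    tool_choice: { type: 'function', function: { name: 'extract_job_details' } }
  });
  // The arguments arrive as a JSON string on the first tool call
  const call = response.choices[0]?.message.tool_calls?.[0];
  return call?.type === 'function' ? JSON.parse(call.function.arguments) : null;
}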
This pattern eliminated retries in the AI Resume Tailor I built. The model either returns valid structured data or fails cleanly. No hallucinated fields, no broken downstream pipelines. Every retry you avoid is money you keep.
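"Fails cleanly" is easier to rely on when the application checks the shape itself before anything downstream touches the data. Below is a minimal, dependency-free sketch of such a check for the job details described above. `parseJobDetails` and the type guards are hypothetical helpers, and the `min <= max` rule is an assumed business check, not something the schema enforces.

```typescript
interface SalaryRange {
  min: number;
  max: number;
  currency: string;
}

interface JobDetails {
  title: string;
  company: string;
  remote: boolean;
  salary_range?: SalaryRange;
}

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

function isSalaryRange(value: unknown): value is SalaryRange {
  if (!isRecord(value)) return false;
  const { min, max, currency } = value;
  return (
    typeof min === 'number' &&
    typeof max === 'number' &&
    typeof currency === 'string' &&
    min <= max // assumed sanity rule, beyond what a JSON schema expresses
  );
}

// Parses the JSON arguments string from a function call.
// Returns null instead of throwing, so callers can decide whether to retry.
function parseJobDetails(rawArguments: string): JobDetails | null {
  let data: unknown;
  try {
    data = JSON.parse(rawArguments);
  } catch {
    return null; // malformed JSON
  }
  if (!isRecord(data)) return null;

  const { title, company, remote, salary_range } = data;
  if (
    typeof title !== 'string' ||
    typeof company !== 'string' ||
    typeof remote !== 'boolean'
  ) {
    return null;
  }
  // A missing or null salary is allowed; a malformed one is not.
  if (salary_range === undefined || salary_range === null) {
    return { title, company, remote };
  }
  if (!isSalaryRange(salary_range)) return null;
  return { title, company, remote, salary_range };
}
```

A `null` result is the signal to log the failure and decide deliberately whether a retry is worth paying for, rather than retrying by default.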
The Architecture That Pays for Itself
The teams I see that succeed with AI features treat cost as a first-class constraint, not an afterthought. They design their agent architecture knowing exactly how much each action costs. They cache aggressively. They route intelligently. They validate output before paying for retries.
The pipeline I had to shut down would have survived with better architecture from the start. Those three patterns (redundant context, no caching, expensive models for everything) are exactly what killed it.
If your team is building AI features into a Next.js SaaS and the costs are climbing faster than the value, that's the kind of problem I help with. Happy to compare notes on designing an agent architecture that doesn't burn your budget.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.