I spent months building an LLM scoring pipeline that processed 10,000 job listings a day. It worked beautifully in staging. Then it hit production and the bills started climbing fast.
The problem wasn't the model. The problem was that I had built a demo, not a production system. The gap between "it works" and "it works reliably at scale" is where most AI agent projects die. Founders burn their runway on API bills. Engineering teams ship something that works for the first 100 requests and falls apart by request 1,000.
Here's what I learned about building a reliability layer that actually survives production.
The Cost Explosion Nobody Warns You About
My first mistake was treating the OpenAI API like a utility. I sent prompts, got responses, moved on. No tracking. No budgets. No cost-per-request visibility.
A few weeks in, I checked the billing dashboard and saw a number that made me rethink the architecture entirely. The pipeline was making redundant calls. It was re-embedding the same documents every run. It was using GPT-4 for tasks that GPT-4o mini could handle perfectly well.
I fixed it with two changes.
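First, I routed all batch processing through OpenAI's Batch API. It costs about half as much as the synchronous endpoints and handles the same throughput. The price is latency: results arrive asynchronously, in my case after a few hours, and always within the 24-hour completion window. For a daily scoring pipeline that doesn't need real-time responses, that tradeoff is a no-brainer.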
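import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// JobListing, SCORING_PROMPT, and uploadBatchFile are defined elsewhere in the pipeline.
async function submitScoringBatch(listings: JobListing[]): Promise<string> {
  // One request per listing; custom_id ties each result back to its listing.
  const batch = listings.map((listing) => ({
    custom_id: listing.id,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: SCORING_PROMPT },
        { role: "user", content: JSON.stringify(listing) }
      ],
      max_tokens: 500
    }
  }));

  // uploadBatchFile is expected to upload the requests and return the file ID.
  const response = await openai.batches.create({
    input_file_id: await uploadBatchFile(batch),
    endpoint: "/v1/chat/completions",
    completion_window: "24h"
  });

  // The batch ID is what you poll later to fetch results.
  return response.id;
}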
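The uploadBatchFile helper isn't shown above. The Batch API takes its input as a JSONL file (one JSON request per line), so the serialization half of such a helper might look like the sketch below. BatchRequest and toBatchJsonl are hypothetical names I'm using for illustration, and the upload step itself (through the Files API with purpose "batch") is left out so the example runs without network access.

```typescript
// Hypothetical shape for one line of a Batch API input file.
interface BatchRequest {
  custom_id: string;
  method: "POST";
  url: string;
  body: Record<string, unknown>;
}

// Serialize requests as JSONL: one JSON object per line, no wrapping array.
function toBatchJsonl(requests: BatchRequest[]): string {
  return requests.map((request) => JSON.stringify(request)).join("\n");
}

// Example with two placeholder requests.
const jsonl = toBatchJsonl([
  { custom_id: "job-1", method: "POST", url: "/v1/chat/completions", body: { model: "gpt-4o-mini" } },
  { custom_id: "job-2", method: "POST", url: "/v1/chat/completions", body: { model: "gpt-4o-mini" } }
]);
console.log(jsonl.split("\n").length); // one line per request
```

Keeping serialization separate from the upload makes the file format easy to test before any API call is made.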
Second, I added model routing based on task complexity. Simple classification goes to GPT-4o mini at a fraction of the cost. Complex reasoning stays on GPT-4. The pipeline checks the task type before making a call, not after.
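A minimal sketch of that routing, assuming the pipeline already knows each task's type up front. The task categories and the pickModel name are hypothetical, and "extraction" is my own example of a second simple task.

```typescript
// Hypothetical task categories; a real pipeline would define its own.
type TaskType = "classification" | "extraction" | "reasoning";

// Decide the model from the task type before any API call is made.
const MODEL_BY_TASK: Record<TaskType, string> = {
  classification: "gpt-4o-mini",
  extraction: "gpt-4o-mini",
  reasoning: "gpt-4"
};

function pickModel(task: TaskType): string {
  return MODEL_BY_TASK[task];
}

console.log(pickModel("classification")); // gpt-4o-mini
console.log(pickModel("reasoning")); // gpt-4
```

Because the table is typed as a Record over TaskType, adding a new task type without assigning it a model is a compile-time error rather than a surprise on the bill.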
Retry Logic Is Not Optional
LLM APIs fail. Not often, but when they do, it's at the worst possible moment. Rate limits, timeouts, and transient server errors all happen.
The naive approach is to catch the error and retry immediately. That's how you get a thundering herd problem where every failed request retries at the same instant and slams the API even harder.
I switched to exponential backoff with jitter. Each retry waits longer, with a random offset to spread the load. After three failures, I stop retrying and log the error for manual review instead of burning more API calls on a lost cause.
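async function callWithRetry(prompt: string, maxRetries = 3): Promise<string> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{ role: "user", content: prompt }]
      });
      // content is typed as string | null; an empty response is retried like any other error.
      const content = response.choices[0]?.message.content;
      if (content == null) throw new Error("Model returned no content");
      return content;
    } catch (error) {
      // Out of attempts: surface the error to the caller for logging and manual review.
      if (attempt === maxRetries - 1) throw error;
      // Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of random jitter, capped at 30s.
      const delay = Math.min(1000 * Math.pow(2, attempt) + Math.random() * 1000, 30000);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  // Only reachable when maxRetries <= 0; keeps the declared return type honest.
  throw new Error("callWithRetry needs at least one attempt");
}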
This pattern saved the pipeline from cascading failures during an API outage that lasted long enough to take down dependent systems. The pipeline slowed down, queued work, and recovered automatically when the API came back.
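One refinement worth considering: the retry loop above retries every error, including ones that will never succeed on a second try, such as a malformed request. The sketch below assumes the caught error exposes an HTTP status code. The list of retryable statuses is my assumption, not an official contract, and both function names are hypothetical.

```typescript
// Retry only failures that might succeed later: 408 (timeout), 429 (rate limit), any 5xx.
function isRetryableStatus(status: number | undefined): boolean {
  if (status === undefined) return true; // no HTTP status: likely a network error or timeout
  return status === 408 || status === 429 || status >= 500;
}

// Same delay formula as the retry loop, with the jitter source injectable for testing.
function backoffDelayMs(attempt: number, jitter: number = Math.random()): number {
  return Math.min(1000 * Math.pow(2, attempt) + jitter * 1000, 30000);
}

console.log(isRetryableStatus(429)); // true
console.log(isRetryableStatus(400)); // false
console.log(backoffDelayMs(2, 0.5)); // 4500
```

Failing immediately on a 400 or 401 saves the backoff time and the wasted calls, and sends the error straight to manual review.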
Function Calling As a Guardrail, Not a Feature
Most people think of function calling as a way to let the LLM take actions. I think of it as a way to constrain what the LLM can output.
The problem with free text generation is that the model will eventually produce something you didn't ask for. In my scoring pipeline, I needed structured output: a score between 1 and 10, a confidence level, and a one-sentence justification. Nothing else.
Function calling with a strict JSON schema turned the model's output into something I could parse and validate before it touched the rest of the system.
const scoringFunction = {
  name: "score_listing",
  description: "Score a job listing for relevance",
  parameters: {
    type: "object",
    properties: {
      score: {
        type: "number",
        minimum: 1,
        maximum: 10,
        description: "Relevance score"
      },
      confidence: {
        type: "number",
        minimum: 0,
        maximum: 1,
        description: "Confidence in the score"
      },
      reasoning: {
        type: "string",
        maxLength: 200,
        description: "Brief justification"
      }
    },
    required: ["score", "confidence", "reasoning"]
  }
};
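The schema acts as a contract, but by default the API doesn't guarantee that the model honors every constraint in it, especially bounds like minimum, maximum, and maxLength. So I added a validation step that checks every field against its constraints before writing anything. If the model can't produce valid output, the call fails fast instead of polluting the database with garbage.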
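A minimal sketch of such a validation step, assuming the function-call arguments arrive as a JSON string (which is how the chat completions API returns them). ScoreResult, inRange, and parseScoreArguments are hypothetical names, and the checks simply mirror the constraints in the schema above.

```typescript
interface ScoreResult {
  score: number;
  confidence: number;
  reasoning: string;
}

// True only for finite numbers within [min, max]; rejects NaN, strings, and undefined.
function inRange(value: unknown, min: number, max: number): value is number {
  return typeof value === "number" && isFinite(value) && value >= min && value <= max;
}

// Parse and validate the raw arguments; throws instead of returning partial data.
function parseScoreArguments(rawArguments: string): ScoreResult {
  const data = JSON.parse(rawArguments); // throws SyntaxError on malformed JSON
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  if (!inRange(data.score, 1, 10)) throw new Error("score must be a number from 1 to 10");
  if (!inRange(data.confidence, 0, 1)) throw new Error("confidence must be a number from 0 to 1");
  if (typeof data.reasoning !== "string" || data.reasoning.length > 200) {
    throw new Error("reasoning must be a string of at most 200 characters");
  }
  // Copy only the expected fields so any extra keys are dropped.
  return { score: data.score, confidence: data.confidence, reasoning: data.reasoning };
}

const valid = parseScoreArguments('{"score": 7, "confidence": 0.8, "reasoning": "Strong skills match."}');
console.log(valid.score); // 7
```

Throwing on the first violation keeps the failure loud: the caller can log the raw arguments and skip the database write entirely.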
Observability Is How You Find Problems Before Users Do
You can't fix what you can't see. I learned this the hard way when a subtle prompt drift caused the scoring pipeline to silently return lower-quality results for days before anyone noticed.
I wired up Sentry for error tracking and LogRocket for session replay on the frontend. But the real value came from adding structured logging to every LLM call.
Each request logs: the model used, the prompt hash, the response time, the token count, the result, and any errors. This gives me a searchable history of every interaction with the API.
logger.info("LLM call completed", {
  model: "gpt-4o-mini",
  promptHash: hash(prompt),
  duration: Date.now() - start,
  tokens: response.usage.total_tokens,
  score: parsedResult.score,
  error: null
});
When something goes wrong, I don't guess. I query the logs. I can trace a bad result back to the prompt that produced it (through its hash), the response it returned, and the tokens it consumed. That turns debugging from a guessing game into a data exercise.
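For cost-per-request visibility, a log entry can carry an estimated cost next to the prompt hash. The sketch below is one way to build it, under a few assumptions: hashPrompt is a simple non-cryptographic FNV-1a-style hash that stands in for whatever hash function the pipeline really uses, all names are hypothetical, and the prices are caller-supplied placeholders rather than real rates.

```typescript
interface TokenUsage {
  prompt_tokens: number;
  completion_tokens: number;
}

// USD per one million tokens; supplied by the caller, never hard-coded.
interface Pricing {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

// FNV-1a-style 32-bit hash over UTF-16 code units, returned as 8 hex characters.
function hashPrompt(prompt: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < prompt.length; i++) {
    h ^= prompt.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, then wrap to unsigned 32 bits.
    h = (h + (h << 1) + (h << 4) + (h << 7) + (h << 8) + (h << 24)) >>> 0;
  }
  return ("00000000" + h.toString(16)).slice(-8);
}

function estimateCostUsd(usage: TokenUsage, pricing: Pricing): number {
  return (
    usage.prompt_tokens * pricing.inputUsdPerMillion +
    usage.completion_tokens * pricing.outputUsdPerMillion
  ) / 1e6;
}

function buildLogEntry(model: string, prompt: string, usage: TokenUsage, pricing: Pricing, durationMs: number) {
  return {
    model: model,
    promptHash: hashPrompt(prompt),
    durationMs: durationMs,
    tokens: usage.prompt_tokens + usage.completion_tokens,
    estimatedCostUsd: estimateCostUsd(usage, pricing)
  };
}

// Placeholder prices for illustration only; look up current rates before relying on the numbers.
const examplePricing: Pricing = { inputUsdPerMillion: 1, outputUsdPerMillion: 2 };
const entry = buildLogEntry("gpt-4o-mini", "Score this listing", { prompt_tokens: 1200, completion_tokens: 300 }, examplePricing, 840);
console.log(entry);
```

Logging a hash instead of the full prompt keeps entries small and searchable. Storing the prompt text once, keyed by that hash, is what makes it possible to recover the exact wording later.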
The Reliability Layer Is a Competitive Advantage
Most AI products ship without any of this. They work in demos because demos don't have 10,000 concurrent requests or unpredictable API behavior or users who will notice a 500 error instantly.
If you're a founder shipping an AI feature, your competitors are probably cutting corners on reliability. They're not handling retries. They're not monitoring costs. They're not validating output.
That means you can win by doing the boring work. Add a batch API route. Implement exponential backoff. Constrain your model's output with function calling. Log everything.
It's not glamorous. But it's the difference between a product that works and a product that works consistently enough that people trust it with their workflows.
If your team is wrestling with LLM reliability in production and shipping slower because of it, that's the kind of thing I help with. I'm happy to compare notes.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.