The first time my scoring pipeline ran against a full day's batch, it took 47 minutes and cost $86. The second run took three hours because half the requests hit rate limits and the retry logic was too aggressive. By the third day I had a queue of unprocessed listings growing faster than the system could score them.
That's when I learned the difference between a RAG demo and a production RAG pipeline.
The system I built processes over 10,000 job listings daily, extracts structured fields via GPT-4 function calling, scores each listing against candidate profiles, and serves results through a REST API. It's been running in production for months. Here's what the architecture actually looks like and where most tutorials skip the hard parts.
Chunking Strategy: Why Document Size Matters More Than You Think
The naive approach is to dump the entire job description into a single chunk and call it done. That works until you need to extract 18 structured fields from a 2,000-word posting that mixes requirements, responsibilities, and boilerplate legalese.
I landed on a two-pass strategy. First pass: split the raw description into semantic sections using sentence boundary detection and topic shifts. Second pass: feed each section to GPT-4 with a focused extraction schema.
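To make the first pass concrete, here is a minimal sketch of a section splitter. It swaps in a much simpler heuristic than sentence-boundary and topic-shift detection: short lines that match a keyword are treated as headings. The `SECTION_HINTS` table, the 60-character heading threshold, and the function name are all hypothetical and would need tuning against real listings.

```typescript
// Hypothetical section labels and the keywords that signal them.
const SECTION_HINTS: { label: string; pattern: RegExp }[] = [
  { label: 'requirements', pattern: /\b(requirements|qualifications|must have)\b/i },
  { label: 'responsibilities', pattern: /\b(responsibilities|the role)\b/i },
  { label: 'compensation', pattern: /\b(salary|compensation|benefits)\b/i },
  { label: 'legal', pattern: /\b(equal opportunity|accommodation)\b/i },
];

interface Section {
  label: string; // 'intro' until the first heading is seen
  text: string;
}

// Return the label of the first hint that matches a short, heading-like line.
function matchHeading(line: string): string | null {
  if (line.length > 60) return null; // long lines are treated as body text
  for (const hint of SECTION_HINTS) {
    if (hint.pattern.test(line)) return hint.label;
  }
  return null;
}

// Split a raw description into labelled sections, dropping blank lines.
function splitIntoSections(description: string): Section[] {
  const sections: Section[] = [];
  let current: Section = { label: 'intro', text: '' };

  for (const rawLine of description.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;

    const label = matchHeading(line);
    if (label !== null) {
      if (current.text !== '') sections.push(current);
      current = { label, text: '' };
    } else {
      current.text += (current.text === '' ? '' : '\n') + line;
    }
  }
  if (current.text !== '') sections.push(current);
  return sections;
}
```

A known weakness of this stand-in: a short body line such as "Competitive salary" is misread as a heading, which is one reason to prefer real topic-shift detection in production.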
interface ExtractionSchema {
  role_title: string;
  required_skills: string[];
  nice_to_have: string[];
  salary_range: { min: number; max: number; currency: string } | null;
  experience_years: { min: number; max: number };
  location_type: 'remote' | 'hybrid' | 'onsite';
  visa_sponsorship: boolean;
}
The key insight: smaller chunks with explicit schemas produce more reliable structured output than one large chunk with a complex schema. My error rate dropped from 12% to under 2% when I switched from single-chunk to multi-chunk extraction. The tradeoff is more API calls per listing, which brings me to cost.
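Multi-chunk extraction means each section yields only part of the record, so the pieces have to be merged afterwards. A minimal sketch of one way to do that, assuming a "first non-empty value wins" rule for scalar fields and a case-insensitive union for skill lists. The trimmed-down `Extraction` type and both function names are illustrative, not the pipeline's actual code.

```typescript
// A reduced version of the extraction record, for illustration only.
interface Extraction {
  role_title: string;
  required_skills: string[];
  nice_to_have: string[];
  location_type: 'remote' | 'hybrid' | 'onsite';
}

// Remove blanks and case-insensitive duplicates, keeping first-seen spelling.
function dedupe(values: string[]): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  for (const value of values) {
    const trimmed = value.trim();
    const key = trimmed.toLowerCase();
    if (key !== '' && !seen.has(key)) {
      seen.add(key);
      out.push(trimmed);
    }
  }
  return out;
}

// Merge per-section partial results into one record.
function mergePartials(partials: Partial<Extraction>[]): Partial<Extraction> {
  const merged: Partial<Extraction> = {};
  const required: string[] = [];
  const nice: string[] = [];

  for (const p of partials) {
    // Scalars: the first section that supplies a value wins.
    if (merged.role_title === undefined && p.role_title) merged.role_title = p.role_title;
    if (merged.location_type === undefined && p.location_type) merged.location_type = p.location_type;
    // Lists: accumulate across all sections.
    if (p.required_skills) required.push(...p.required_skills);
    if (p.nice_to_have) nice.push(...p.nice_to_have);
  }

  merged.required_skills = dedupe(required);
  merged.nice_to_have = dedupe(nice);
  return merged;
}
```

"First wins" is a deliberate simplification. If two sections disagree on a scalar field, a real pipeline would probably want to log the conflict rather than silently pick one.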
Embedding Model Choice and Vector Store Reality
I evaluated three embedding setups: OpenAI's text-embedding-3-small, text-embedding-3-large, and an open-source alternative via Ollama. For job listings, the difference between the small and large model was negligible on recall metrics, and the small model is several times cheaper per token.
For the vector store, I tested Pinecone and pgvector on the same dataset of 50,000 listings. Pinecone was easier to set up. pgvector was cheaper and eliminated a network hop since my PostgreSQL instance already held the metadata. I went with pgvector.
The real cost isn't the vector store itself. It's the embedding generation at scale. At 10,000 listings per day, the small model costs about $0.40 in embeddings. The large model would cost $5.20. Over a 30-day month that is about $12 versus $156, a difference of $144 for embeddings alone.
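The daily figures project forward with simple multiplication. A quick sketch, assuming a flat 30-day month and a constant daily volume (both simplifications); `projectCost` is a hypothetical helper name.

```typescript
// Project a flat daily spend over a number of days, rounded to cents.
function projectCost(dailyUsd: number, days: number): number {
  return Math.round(dailyUsd * days * 100) / 100;
}

// Daily embedding costs quoted in the text for 10,000 listings per day.
const smallDailyUsd = 0.4;
const largeDailyUsd = 5.2;

const smallMonthlyUsd = projectCost(smallDailyUsd, 30);
const largeMonthlyUsd = projectCost(largeDailyUsd, 30);
const monthlyGapUsd = projectCost(largeDailyUsd - smallDailyUsd, 30);

console.log({ smallMonthlyUsd, largeMonthlyUsd, monthlyGapUsd });
// → { smallMonthlyUsd: 12, largeMonthlyUsd: 156, monthlyGapUsd: 144 }
```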
Cost Optimization: Batch API Changed Everything
OpenAI's Batch API is the single biggest cost lever I found. It gives you 50% off the standard API price in exchange for deferred processing. For job listings that don't need real-time scoring, that tradeoff is free money.
const batchRequest = {
  custom_id: `listing-${listingId}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: extractionMessages,
    response_format: { type: 'json_object' }
  }
};
I batch submissions every two hours. The results come back within 30 minutes to 4 hours. The cost dropped from $86 per full-day run to $32 for the same volume. That's a 63% reduction for accepting a few hours of latency on data that was already hours old.
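The Batch API takes its input as a JSONL file with one request object per line, so each two-hour window of requests has to be serialized before upload. A minimal sketch of that step, assuming the request shape shown above. The helper names and the duplicate-ID check are illustrative additions, and the upload and batch-creation calls are left out.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface BatchRequest {
  custom_id: string;
  method: 'POST';
  url: '/v1/chat/completions';
  body: {
    model: string;
    messages: ChatMessage[];
    response_format: { type: 'json_object' };
  };
}

// Build one batch line for a listing (same shape as the request above).
function buildBatchRequest(listingId: string, messages: ChatMessage[]): BatchRequest {
  return {
    custom_id: `listing-${listingId}`,
    method: 'POST',
    url: '/v1/chat/completions',
    body: { model: 'gpt-4o-mini', messages, response_format: { type: 'json_object' } },
  };
}

// Serialize requests to JSONL. Results are matched back by custom_id,
// so duplicates are rejected up front.
function toJsonl(requests: BatchRequest[]): string {
  const seen = new Set<string>();
  for (const request of requests) {
    if (seen.has(request.custom_id)) {
      throw new Error(`duplicate custom_id: ${request.custom_id}`);
    }
    seen.add(request.custom_id);
  }
  return requests.map((request) => JSON.stringify(request)).join('\n');
}
```

The resulting string is what gets written to a file and uploaded with the `batch` purpose before the batch is created. Because output order is not guaranteed to match input order, `custom_id` is the join key for mapping results back to listings.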
The streaming approach I see recommended everywhere would have been a mistake here. Streaming helps for user-facing chat where latency matters. For batch scoring, it adds complexity and cost with zero benefit.
Error Handling: The Part Nobody Writes About
Rate limits broke my pipeline three times in the first week. Here's what I learned.
OpenAI enforces rate limits at the organization level rather than per API key, with quotas that vary by model, and some models share a quota. If your GPT-4 and GPT-4o-mini requests end up drawing on the same pool, a burst of GPT-4 calls can exhaust the limit and block your cheap GPT-4o-mini requests.
My fix was a token bucket per model tier with separate queues:
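const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  // capacity: maximum tokens held; refillRate: tokens added per second
  constructor(private capacity: number, private refillRate: number) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }

  // Add the tokens earned since the last refill, capped at capacity
  private refill(): void {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillRate);
    this.lastRefill = now;
  }

  async consume(tokens: number): Promise<void> {
    this.refill();
    if (this.tokens < tokens) {
      const waitTime = (tokens - this.tokens) / this.refillRate * 1000;
      await sleep(waitTime + 100); // 100ms buffer
      this.refill();
    }
    this.tokens -= tokens;
  }
}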
The 100ms buffer is the part that matters. Rate limit enforcement isn't instant. If you retry the exact moment the bucket refills, you'll still get a 429. A small buffer makes the difference between a stable pipeline and one that oscillates between idle and rate-limited.
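The bucket throttles outgoing calls, but a 429 can still slip through, and retrying too eagerly is what turned the second run into three hours. A minimal sketch of a gentler retry delay, assuming exponential backoff with full jitter and an optional server-supplied wait hint in seconds. The function names, the 500ms base, and the 60s cap are illustrative values, not the pipeline's actual settings.

```typescript
// Full-jitter exponential backoff: a random delay between 0 and
// min(capMs, baseMs * 2^attempt). `random` is injectable for testing.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 60000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// If the server supplied a wait hint (in seconds), honour it and add the
// same 100ms buffer used in the bucket; otherwise fall back to jitter.
function retryDelayMs(
  attempt: number,
  hintSeconds?: number,
  random: () => number = Math.random,
): number {
  if (hintSeconds !== undefined && hintSeconds >= 0) {
    return Math.ceil(hintSeconds * 1000) + 100;
  }
  return backoffDelayMs(attempt, 500, 60000, random);
}
```

Jitter matters most when many workers hit the limit together: without it they all wake at the same instant and collide again.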
Evals for Scoring Accuracy
I run a weekly evaluation against a labeled test set of 500 listings. The eval checks three things: field extraction accuracy, scoring consistency (same listing scored twice should get similar results), and hallucination rate (fields populated with data not in the source text).
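Two of those checks reduce to simple ratios over the labeled set. A minimal sketch, assuming every field is flattened to a string (or null) and that "grounded" means the predicted value appears verbatim in the source after whitespace and case normalization. The `EvalCase` shape and both function names are hypothetical, and the substring test is a crude stand-in for a real grounding check.

```typescript
type Fields = Record<string, string | null>;

interface EvalCase {
  source: string; // raw listing text
  expected: Fields; // human labels
  predicted: Fields; // model output
}

const normalize = (s: string): string => s.toLowerCase().replace(/\s+/g, ' ').trim();

// Share of labelled fields the model matched exactly (after normalization).
function fieldAccuracy(cases: EvalCase[]): number {
  let correct = 0;
  let total = 0;
  for (const c of cases) {
    for (const key of Object.keys(c.expected)) {
      total++;
      const want = c.expected[key] ?? null;
      const got = c.predicted[key] ?? null;
      const match = want === null ? got === null : got !== null && normalize(got) === normalize(want);
      if (match) correct++;
    }
  }
  return total === 0 ? 1 : correct / total;
}

// Share of populated fields whose value never appears in the source text.
function hallucinationRate(cases: EvalCase[]): number {
  let ungrounded = 0;
  let populated = 0;
  for (const c of cases) {
    const haystack = normalize(c.source);
    for (const key of Object.keys(c.predicted)) {
      const value = c.predicted[key] ?? null;
      if (value === null) continue;
      populated++;
      if (haystack.indexOf(normalize(value)) === -1) ungrounded++;
    }
  }
  return populated === 0 ? 0 : ungrounded / populated;
}
```

The substring test over-reports hallucinations for fields the model legitimately paraphrases or infers, such as a `location_type` of `hybrid` derived from "three days in the office", so derived fields need their own rule.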
The hallucination check catches the most insidious bugs. Early on, the model started inventing salary ranges for listings that didn't mention compensation. The schema had salary_range as optional, but the model was filling it with plausible-looking numbers. The fix was a presence guard in the function call schema that required an explicit boolean flag before the model could populate any optional field.
const schema = {
  has_salary_info: { type: 'boolean' },
  salary_range: {
    type: 'object',
    properties: { ... },
    // Only validate if has_salary_info is true
  }
};
This pattern is worth adopting anywhere your LLM extracts optional fields. Without it, the model will confidently invent data that looks correct but isn't.
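The flag only helps if something enforces it after the model responds. A minimal sketch of that post-processing step, assuming the model returns `has_salary_info` alongside an optional `salary_range`. The type names, the function name, and the sanity checks are illustrative, not the production validator.

```typescript
interface SalaryRange {
  min: number;
  max: number;
  currency: string;
}

interface RawExtraction {
  has_salary_info: boolean;
  salary_range?: SalaryRange | null;
}

// Keep salary_range only when the model explicitly flagged that the source
// mentions compensation AND the range itself is well-formed.
function applyPresenceGuard(raw: RawExtraction): SalaryRange | null {
  if (!raw.has_salary_info) return null; // discard anything the model filled in anyway

  const range = raw.salary_range;
  if (range === undefined || range === null) return null;

  const wellFormed =
    isFinite(range.min) &&
    isFinite(range.max) &&
    range.min >= 0 &&
    range.min <= range.max &&
    range.currency.trim() !== '';
  return wellFormed ? range : null;
}
```

Dropping the value when the flag is false is the important branch: it turns a confident invention into an explicit null that downstream scoring can treat as "unknown".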
The Tradeoff I Still Wrestle With
The biggest unresolved tension is the AI description rewrite pipeline. I built it, it worked well, and the client shut it down because the cost at 1M+ listings was too high with GPT-4 class models. I'm evaluating DeepSeek V4 Flash as a roughly 23x cheaper alternative that might make the economics work.
If your team is building a production RAG pipeline and wrestling with cost containment or reliability at scale, that's exactly the kind of thing I help with. Happy to compare notes on what's worked and what hasn't.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.