I Built an AI Agent Pipeline That Processes 10,000+ Jobs Daily. Here’s What Almost Broke It

#ai #llm #nextjs #architecture

I watched the CPU graph on our MongoDB Atlas cluster spike to 100% for the third time that week. The scraping pipeline was using deep skip() pagination, and every run was scanning a million documents to find the next batch. That’s when I learned the first hard rule of production AI agents: your model is not the bottleneck. Your data access patterns are.

I’ve been running a production job board platform that ingests, scores, and surfaces over 10,000 job listings every day. It’s live. It has real users. And it has broken in ways that taught me more about AI reliability than any tutorial ever could. If you’re building an AI agent pipeline that needs to stay up at scale, here’s what I wish someone had told me before the first outage.

The Architecture That Survived a Bot Storm

Let me walk through the core design. The system has three layers: ingestion, scoring, and delivery.

Ingestion pulls from multiple ATS APIs (Greenhouse, Lever, Ashby, Workable, Recruitee) on a cron schedule. Each source returns raw job listings with inconsistent formatting. Some have HTML descriptions. Others have plain text. A few are missing entire fields.
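
To make the inconsistency concrete, here is a minimal sketch of a normalization step that could sit right after ingestion. The `RawListing` and `NormalizedListing` shapes, the field names, and the `normalizeListing` helper are all hypothetical, not the platform’s actual code. The tag stripping is a naive regex for illustration; a real pipeline would likely use a proper HTML parser.

```typescript
// Hypothetical shapes: real ATS payloads differ per source.
interface RawListing {
  id: string;
  title?: string;
  description?: string; // may be HTML or plain text
  location?: string;
}

interface NormalizedListing {
  id: string;
  title: string;
  description: string; // plain text after tag stripping
  location: string | null;
  missingFields: string[]; // names of fields the source did not provide
}

// Naive tag stripper for illustration only; this is not a sanitizer.
function stripHtml(input: string): string {
  return input
    .replace(/<[^>]*>/g, ' ')
    .replace(/&nbsp;/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
}

function normalizeListing(raw: RawListing): NormalizedListing {
  const missingFields: string[] = [];
  if (!raw.title) missingFields.push('title');
  if (!raw.description) missingFields.push('description');
  if (!raw.location) missingFields.push('location');

  return {
    id: raw.id,
    title: (raw.title ?? '').trim(),
    description: stripHtml(raw.description ?? ''),
    location: raw.location ? raw.location.trim() : null,
    missingFields,
  };
}
```

Recording `missingFields` instead of throwing lets the scoring layer decide whether a partial listing is still worth sending to the model.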

Scoring is where the LLM comes in. Every listing gets passed to GPT-4 with a function call that extracts structured fields: job title, required skills, experience level, location type, and a relevance score for each candidate profile on the platform. The function schema looks like this:

const extractJobSchema = {
  name: 'extract_job_fields',
  description: 'Extract structured fields from a raw job listing',
  parameters: {
    type: 'object',
    properties: {
      title: { type: 'string' },
      skills: { type: 'array', items: { type: 'string' } },
      experience_level: { type: 'string', enum: ['entry', 'mid', 'senior', 'lead'] },
      location_type: { type: 'string', enum: ['remote', 'hybrid', 'onsite'] },
      relevance_score: { type: 'number', minimum: 0, maximum: 100 }
    },
    required: ['title', 'skills', 'experience_level', 'location_type', 'relevance_score']
  }
};

Delivery exposes a REST API that serves scored listings to downstream consumers. That’s the simple version. The hard parts are everything else.

The Pagination Problem That Almost Took Us Down

The first major failure was invisible until it wasn’t. The scraping loop used MongoDB’s skip() to paginate through the collection. At 100,000 documents, the CPU was fine. At 500,000, you could see the spike. At one million, the cluster started timing out during scraping runs.

Here’s why: skip() doesn’t skip work. The server still walks every document up to the offset. So a query that skips 900,000 documents is actually reading 900,000 documents, discarding them, and returning the next batch. Near the end of a million-document collection, that’s effectively a full collection scan every scrape cycle.

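The fix was cursor-based pagination. Instead of skip(N), you query { _id: { $gt: lastSeenId } } with a sort on _id and a limit. The query seeks straight to that position in the index instead of walking everything before it. No scanning. No CPU spikes. When you also filter on a field like source, a compound index on { source: 1, _id: 1 } keeps that seek efficient.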

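// Before (bad): the server walks and discards page * batchSize documents on every call
const jobs = await db.collection('jobs')
  .find({ source: 'greenhouse' })
  .skip(page * batchSize)
  .limit(batchSize)
  .toArray();

// After (good): seek past the last _id seen, so each call reads only one batch
// (lastSeenId is the _id of the final document in the previous batch)
const jobs = await db.collection('jobs')
  .find({ source: 'greenhouse', _id: { $gt: lastSeenId } })
  .sort({ _id: 1 })
  .limit(batchSize)
  .toArray();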

We implemented this and the CPU graphs flattened overnight. The lesson: optimize your data layer before you touch your prompt.
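
For completeness, here is a rough sketch of the loop that drives cursor pagination across a whole collection. It is written against a generic `fetchPage` callback rather than the MongoDB driver, so it can be tried without a database. `forEachBatch`, `fetchPage`, and `getCursor` are hypothetical names; in the real pipeline `fetchPage` would wrap the `$gt` query shown above.

```typescript
// fetchPage(after, limit) returns up to `limit` items whose cursor value is
// greater than `after`, in ascending cursor order. `after` is null on the first call.
type FetchPage<T, C> = (after: C | null, limit: number) => Promise<T[]>;

async function forEachBatch<T, C>(
  fetchPage: FetchPage<T, C>,
  getCursor: (item: T) => C,
  batchSize: number,
  handle: (batch: T[]) => Promise<void> | void,
): Promise<number> {
  let lastSeen: C | null = null;
  let total = 0;

  while (true) {
    const batch = await fetchPage(lastSeen, batchSize);
    if (batch.length === 0) break; // nothing left after the cursor

    await handle(batch);
    total += batch.length;

    // Advance the cursor to the last item of this batch.
    lastSeen = getCursor(batch[batch.length - 1]);

    // A short page means the collection is exhausted; skip the extra round trip.
    if (batch.length < batchSize) break;
  }

  return total; // number of items processed
}
```

Because the cursor is the last value actually seen, documents inserted behind it during a run are not revisited, and the cost of each call stays flat no matter how deep into the collection the loop is.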

Cost Management Is a Pipeline Problem

The AI rewrite pipeline was a different kind of failure. We built a system that took raw ATS job descriptions and rewrote them into SEO-optimized content using GPT-4. The quality was great. The cost was catastrophic.

At 10,000 listings per day, each rewrite cost roughly $0.02 with GPT-4. That’s $200 per day. For a feature that improved SEO but didn’t directly generate revenue. The client shut it down.

I learned to think about cost as a first-class constraint, not an afterthought. Now I evaluate every LLM call against a cost budget before writing a single line of code. For the rewrite pipeline, I’m evaluating DeepSeek V4 Flash as an alternative. Early tests show comparable quality at roughly 23x lower cost. If it holds up, the pipeline becomes viable again.
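
The budget check itself is simple arithmetic, and it is worth writing down before any prompt work starts. This sketch uses the figures from this section (10,000 listings per day, roughly $0.02 per rewrite, and a candidate model at roughly 23x lower cost, which is still an early estimate). The function names and the $25 daily budget are made up for illustration.

```typescript
function dailyCostUsd(listingsPerDay: number, costPerCallUsd: number): number {
  return listingsPerDay * costPerCallUsd;
}

function fitsBudget(
  listingsPerDay: number,
  costPerCallUsd: number,
  dailyBudgetUsd: number,
): boolean {
  return dailyCostUsd(listingsPerDay, costPerCallUsd) <= dailyBudgetUsd;
}

const LISTINGS_PER_DAY = 10000;
const GPT4_COST_PER_REWRITE = 0.02; // approximate, from the numbers above
const CHEAPER_COST_PER_REWRITE = GPT4_COST_PER_REWRITE / 23; // early estimate, unconfirmed
const DAILY_BUDGET_USD = 25; // hypothetical budget, for illustration only

console.log(dailyCostUsd(LISTINGS_PER_DAY, GPT4_COST_PER_REWRITE).toFixed(2)); // "200.00"
console.log(dailyCostUsd(LISTINGS_PER_DAY, CHEAPER_COST_PER_REWRITE).toFixed(2)); // "8.70"
console.log(fitsBudget(LISTINGS_PER_DAY, GPT4_COST_PER_REWRITE, DAILY_BUDGET_USD)); // false
console.log(fitsBudget(LISTINGS_PER_DAY, CHEAPER_COST_PER_REWRITE, DAILY_BUDGET_USD)); // true
```

At $200 per day the rewrite feature runs to about $6,000 over a 30-day month. At the estimated lower rate it would be closer to $260.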

For the scoring pipeline, I batch requests using OpenAI’s Batch API. That gives a 50% discount with the same model quality. The tradeoff is latency: batch results come back in hours, not seconds. For nightly scoring jobs, that’s perfectly acceptable.
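
The Batch API takes a JSONL file as input, with one request per line. Here is a rough sketch of building that file from a set of listings. The `custom_id`, `method`, `url`, and `body` keys follow OpenAI’s documented batch input format. The helper names, the system prompt, and the listing shape are assumptions, and the sketch leaves out the function schema for brevity.

```typescript
interface BatchRequestLine {
  custom_id: string; // used to match results back to listings
  method: 'POST';
  url: '/v1/chat/completions';
  body: {
    model: string;
    messages: { role: 'system' | 'user'; content: string }[];
  };
}

function toBatchLine(listingId: string, model: string, listingText: string): BatchRequestLine {
  return {
    custom_id: listingId,
    method: 'POST',
    url: '/v1/chat/completions',
    body: {
      model,
      messages: [
        // Placeholder prompt; the real one would carry the extraction instructions.
        { role: 'system', content: 'Extract structured fields from the job listing.' },
        { role: 'user', content: listingText },
      ],
    },
  };
}

// One JSON object per line. JSON.stringify escapes newlines inside the listing
// text, so each request stays on a single line.
function buildBatchJsonl(listings: { id: string; text: string }[], model: string): string {
  return listings.map((l) => JSON.stringify(toBatchLine(l.id, model, l.text))).join('\n');
}
```

Using the listing’s own ID as `custom_id` matters because batch results are not guaranteed to come back in input order.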

Error Handling: The Part Nobody Talks About

AI agents fail in ways traditional software doesn’t. A model returns a malformed JSON response. An API rate limit kicks in mid-batch. A downstream source changes its schema without warning.

I built three layers of defense.

First, every LLM call is wrapped in a retry with exponential backoff. Not just for rate limits, but for transient failures. The API might return a 500 error, or the response might be truncated. Retry with a fresh request.
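
A minimal sketch of that retry wrapper, assuming the caller supplies the actual request as a function. `withRetry`, `RetryOptions`, and the `isRetryable` hook are hypothetical names. The random jitter is an addition of mine that is commonly paired with backoff, not something described above.

```typescript
interface RetryOptions {
  maxAttempts: number; // total tries, including the first
  baseDelayMs: number; // upper bound for the delay before the second attempt
  maxDelayMs: number; // cap on any single delay
  isRetryable?: (err: unknown) => boolean; // defaults to retrying every error
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  let lastError: unknown;

  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const retryable = opts.isRetryable ? opts.isRetryable(err) : true;
      if (!retryable || attempt === opts.maxAttempts) break;

      // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
      const ceiling = Math.min(opts.maxDelayMs, opts.baseDelayMs * Math.pow(2, attempt - 1));
      // Random jitter keeps parallel workers from retrying in lockstep.
      await sleep(Math.random() * ceiling);
    }
  }

  throw lastError;
}
```

In practice `isRetryable` is where you would separate rate limits and server errors from failures that will never succeed, such as an invalid request.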

Second, I validate every structured output against its schema before using it. If the relevance_score field is missing or out of range, the job gets flagged for manual review instead of being silently published. This prevents bad data from propagating downstream.
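
Here is a hand-rolled sketch of that validation step, checked against the extract_job_fields schema from earlier. In a real codebase a schema library would probably do this job. `validateExtractedJob` and the result shape are hypothetical names, and the sketch assumes the model’s arguments arrive as a JSON string.

```typescript
interface ExtractedJob {
  title: string;
  skills: string[];
  experience_level: 'entry' | 'mid' | 'senior' | 'lead';
  location_type: 'remote' | 'hybrid' | 'onsite';
  relevance_score: number;
}

type ValidationResult =
  | { ok: true; job: ExtractedJob }
  | { ok: false; errors: string[] };

const LEVELS = ['entry', 'mid', 'senior', 'lead'];
const LOCATIONS = ['remote', 'hybrid', 'onsite'];

function validateExtractedJob(rawJson: string): ValidationResult {
  let data: unknown;
  try {
    data = JSON.parse(rawJson);
  } catch (_err) {
    return { ok: false, errors: ['response is not valid JSON'] };
  }
  if (typeof data !== 'object' || data === null || Array.isArray(data)) {
    return { ok: false, errors: ['response is not a JSON object'] };
  }

  const d = data as Record<string, unknown>;
  const title = d['title'];
  const skills = d['skills'];
  const level = d['experience_level'];
  const location = d['location_type'];
  const score = d['relevance_score'];
  const errors: string[] = [];

  // Each entry names a field that is missing or fails its schema constraint.
  if (typeof title !== 'string') errors.push('title');
  if (!Array.isArray(skills) || !skills.every((s) => typeof s === 'string')) errors.push('skills');
  if (typeof level !== 'string' || LEVELS.indexOf(level) === -1) errors.push('experience_level');
  if (typeof location !== 'string' || LOCATIONS.indexOf(location) === -1) errors.push('location_type');
  if (typeof score !== 'number' || score < 0 || score > 100) errors.push('relevance_score');

  if (errors.length > 0) return { ok: false, errors };
  return { ok: true, job: d as unknown as ExtractedJob };
}
```

A result with `ok: false` is what would send a listing to the manual review queue, with `errors` telling the reviewer which fields to look at.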

Third, I monitor the failure rate per model per endpoint. When the error rate for a specific model crosses 5% in a 15-minute window, I route traffic to a fallback model automatically. This saved us during a GPT-4 outage that lasted 45 minutes. Users saw no interruption because the pipeline switched to GPT-4o-mini for scoring and degraded gracefully.

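// callModel and logger are defined elsewhere in the codebase.
// Per-call safety net: if the primary model throws, try the fallback once.
async function callWithFallback(prompt: string, primary: string, fallback: string) {
  try {
    return await callModel(primary, prompt);
  } catch (err) {
    logger.warn(`Primary model ${primary} failed, falling back to ${fallback}`, err);
    // If the fallback also throws, the error propagates to the caller.
    return await callModel(fallback, prompt);
  }
}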
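
The function above falls back one call at a time. The 5% in 15 minutes rule needs a little state on top of it. Below is a rough sketch of a sliding-window tracker that could drive that routing decision. `ErrorRateTracker`, `pickModel`, and the `minSamples` guard are assumptions of mine, not the production implementation.

```typescript
class ErrorRateTracker {
  private events: { at: number; ok: boolean }[] = [];

  constructor(
    private readonly windowMs: number = 15 * 60 * 1000, // 15-minute window
    private readonly threshold: number = 0.05, // 5% error rate
    private readonly minSamples: number = 20, // assumption: avoid tripping on one early failure
  ) {}

  // Record the outcome of one call. Timestamps are expected in ascending order.
  record(ok: boolean, now: number = Date.now()): void {
    this.events.push({ at: now, ok });
    this.prune(now);
  }

  // True when the error rate inside the window is above the threshold.
  shouldFallback(now: number = Date.now()): boolean {
    this.prune(now);
    if (this.events.length < this.minSamples) return false;
    const failures = this.events.filter((e) => !e.ok).length;
    return failures / this.events.length > this.threshold;
  }

  // Drop events that are older than the window.
  private prune(now: number): void {
    const cutoff = now - this.windowMs;
    while (this.events.length > 0 && this.events[0].at < cutoff) {
      this.events.shift();
    }
  }
}

function pickModel(
  tracker: ErrorRateTracker,
  primary: string,
  fallback: string,
  now: number = Date.now(),
): string {
  return tracker.shouldFallback(now) ? fallback : primary;
}
```

One tracker per model per endpoint matches the monitoring granularity described above. Because old events age out of the window, traffic drifts back to the primary model on its own once the failures stop.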

Monitoring That Tells You What’s Breaking

You can’t fix what you can’t see. I instrument every step of the pipeline with structured logging and metrics. Each LLM call logs the model name, token count, latency, and response status. Each batch job logs the number of items processed, the success rate, and the total cost.

I use Sentry for error tracking and LogRocket for session replay on the frontend. But the most valuable monitoring is a simple dashboard that shows the pipeline health in real time: ingestion rate, scoring throughput, error rate, and cost per 1,000 listings.
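
As an illustration of how those dashboard numbers can fall out of the per-call logs, here is a small sketch that rolls log records up into a health summary. The `LlmCallLog` shape, the per-call `costUsd` field, and the `summarize` name are assumptions. It also assumes one LLM call per listing when computing cost per 1,000 listings.

```typescript
interface LlmCallLog {
  model: string;
  tokens: number;
  latencyMs: number;
  ok: boolean; // response status collapsed to success or failure
  costUsd: number; // assumed to be computed from token counts upstream
}

interface PipelineHealth {
  calls: number;
  errorRate: number; // 0 to 1
  avgLatencyMs: number;
  costPer1kListingsUsd: number;
}

function summarize(logs: LlmCallLog[]): PipelineHealth {
  const calls = logs.length;
  if (calls === 0) {
    return { calls: 0, errorRate: 0, avgLatencyMs: 0, costPer1kListingsUsd: 0 };
  }

  let failures = 0;
  let totalLatency = 0;
  let totalCost = 0;
  for (const log of logs) {
    if (!log.ok) failures++;
    totalLatency += log.latencyMs;
    totalCost += log.costUsd;
  }

  return {
    calls,
    errorRate: failures / calls,
    avgLatencyMs: totalLatency / calls,
    // Average cost per call scaled to 1,000 listings (one call per listing assumed).
    costPer1kListingsUsd: (totalCost / calls) * 1000,
  };
}
```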

When the bot storm hit (a single crawler pulling 35GB in one session), the dashboard showed the bandwidth spike immediately. I blocked the crawler at the Cloudflare edge with a WAF rule in under five minutes. Without monitoring, I would have discovered it hours later, when the server started timing out.

What I’d Tell a Founder Evaluating AI Agents

If your team is building an AI agent pipeline and you’re worried about reliability at scale, start with the data layer. Make sure your database can handle the throughput before you add LLM calls. Optimize your queries. Add cursors. Cache aggressively.

Then think about cost as a pipeline constraint, not a per-call number. Batch what you can. Use cheaper models for high-volume tasks. Build fallback routing so one model outage doesn’t take down your whole system.

The AI part is the easiest part. The hard part is everything around it: the data plumbing, the error handling, the monitoring, the cost math. Get those right and your agent pipeline will stay up when it matters.

If your team is wrestling with AI agent reliability and shipping slower because of it, that’s the kind of thing I help with. I’m happy to compare notes on how I build production AI pipelines.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.