Scaling an LLM Scoring Pipeline From One Job to 10,000 a Day

#llm #architecture #jobboard #production

The first time the pipeline ran against a heavy batch of listings, my MongoDB Atlas cluster nearly buckled. CPU spiked, the API queue backed up, and the OpenAI costs climbed faster than I expected. Most tutorials show the same loop: process one listing, call GPT-4, wait for the response, store the result, repeat. At low volume that approach works fine. At scale, it falls apart.

The job board platform this pipeline powers now processes 10,000+ listings daily. The LLM scoring pipeline, built with GPT-4 function calling, ranks every job against candidate profiles and exposes a REST API for downstream consumers. Getting there meant learning what actually matters when you push an AI pipeline past the demo stage. Here is what I found.

Why Function Calling, Not Free-Text Prompts

Suppose you start with a standard chat completion and an instruction like "score this job from 1 to 10 for relevance to a software engineer." Some responses return just a number, others a paragraph explaining the score, and some hallucinate entirely nonexistent job attributes. Inconsistent output makes downstream processing fragile.

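Function calling fixes most of that. By defining a JSON schema with required fields like relevance_score, category, and matched_skills, you get structured, parseable arguments back instead of free text. No regex scraping. One caveat: the output is not deterministic, and unless strict mode is enabled the API does not guarantee that the arguments match the schema, so it is still worth validating the parsed object before storing it.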

Here is the schema I settled on after several revisions:

const scoringSchema = {
  type: "function",
  function: {
    name: "score_job_listing",
    description: "Score a job listing for relevance to a target profile",
    parameters: {
      type: "object",
      properties: {
        relevance_score: {
          type: "number",
          description: "Relevance score from 0 to 100",
          minimum: 0,
          maximum: 100
        },
        category: {
          type: "string",
          enum: ["high_match", "partial_match", "low_match", "irrelevant"]
        },
        matched_skills: {
          type: "array",
          items: { type: "string" },
          description: "Skills from the listing that match the target profile"
        },
        reasoning: {
          type: "string",
          description: "One-sentence justification for the score"
        }
      },
      required: ["relevance_score", "category", "matched_skills"]
    }
  }
};

Leaving reasoning out of the required list while still giving it a description was a deliberate choice. GPT nearly always fills it in, but if the model decides to skip it, the pipeline does not break. Keeping nice-to-have fields optional alongside a small set of required ones saves a lot of manual edge-case handling over time.
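A minimal sketch of how that tolerance can look when reading a tool call back. The helper name parseScore and the empty-string default are hypothetical, not part of the original pipeline; the only thing it relies on is that the function arguments arrive as a JSON string.

```javascript
// Hypothetical helper: turn the JSON string from tool_call.function.arguments
// into a validated score object. Returns null when the payload is unusable.
const CATEGORIES = ["high_match", "partial_match", "low_match", "irrelevant"];

function parseScore(rawArguments) {
  let parsed;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return null; // malformed JSON from the model
  }
  if (parsed === null || typeof parsed !== "object") return null;

  const { relevance_score, category, matched_skills, reasoning } = parsed;
  const scoreOk =
    typeof relevance_score === "number" &&
    relevance_score >= 0 &&
    relevance_score <= 100;
  const categoryOk = CATEGORIES.includes(category);
  const skillsOk =
    Array.isArray(matched_skills) &&
    matched_skills.every((skill) => typeof skill === "string");
  if (!scoreOk || !categoryOk || !skillsOk) return null;

  return {
    relevance_score,
    category,
    matched_skills,
    // reasoning is optional in the schema, so default it instead of failing
    reasoning: typeof reasoning === "string" ? reasoning : "",
  };
}
```

A null return is the signal to route the listing into whatever retry path you use, rather than storing a half-formed score.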

Batch Processing and Why It Matters

Processing one job per API call means each request pays a full round-trip latency penalty. At hundreds of jobs that is slow. At thousands it becomes economically painful.

Batching listings in groups works better. Instead of calling GPT-4 for each job, you pass an array of job descriptions and ask the model to return an array of scores. The prompt says: "You are given an array of job descriptions. Return an array of score objects in the same order. Do NOT skip any. If a description is empty, score it as irrelevant and leave an empty matched_skills array."

This cuts the number of API calls by roughly the batch size. Each call carries more tokens because it sends more context, but that overhead is small compared to the latency and rate-limit savings. It also makes staying within OpenAI's tiered rate limits much more manageable.
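Here is a rough sketch of the batching step. The names chunk, batchScoringSchema, and buildBatchMessage are illustrative, and the listing_id field is an assumption I am adding so results can be matched by id instead of relying on array order alone.

```javascript
// Split listings into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Hypothetical wrapper schema: one function call returns a score per listing.
const batchScoringSchema = {
  type: "function",
  function: {
    name: "score_job_listings",
    description: "Score each job listing for relevance to a target profile",
    parameters: {
      type: "object",
      properties: {
        scores: {
          type: "array",
          items: {
            type: "object",
            properties: {
              listing_id: { type: "string" },
              relevance_score: { type: "number", minimum: 0, maximum: 100 },
              category: {
                type: "string",
                enum: ["high_match", "partial_match", "low_match", "irrelevant"],
              },
              matched_skills: { type: "array", items: { type: "string" } },
            },
            required: ["listing_id", "relevance_score", "category", "matched_skills"],
          },
        },
      },
      required: ["scores"],
    },
  },
};

// Serialise one group as the user message, keeping only what the model needs.
function buildBatchMessage(listings) {
  return JSON.stringify(
    listings.map(({ id, title, description }) => ({ id, title, description }))
  );
}
```

With a batch size of 20 (an arbitrary number for illustration), chunk(listings, 20) turns 10,000 listings into 500 requests.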

Batching also forces you to design idempotency properly. When a single request fails, it can affect many jobs at once. One approach is to assign a unique batch ID to each request and deduplicate on the database side. If a batch request times out or returns a partial response, the retry logic checks which jobs already have scores in the database and only resubmits the missing ones.
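The retry check described above can be sketched roughly like this. The field names (jobId, batchId) and both helper names are assumptions for illustration; the operation objects follow the shape the MongoDB bulkWrite API accepts.

```javascript
// Given the ids submitted in a batch and the ids that already have a stored
// score, return the ids that still need to be resubmitted.
function findUnscored(batchJobIds, scoredJobIds) {
  const scored = new Set(scoredJobIds);
  return batchJobIds.filter((id) => !scored.has(id));
}

// Build upsert operations keyed on jobId so writing the same score twice
// (for example after a retried batch) never creates a duplicate document.
function toUpsertOps(batchId, scores) {
  return scores.map((score) => ({
    updateOne: {
      filter: { jobId: score.jobId },
      update: { $set: { ...score, batchId } },
      upsert: true,
    },
  }));
}
```

The list of already-scored ids would come from a query against the scores collection before each retry; that lookup is left out here to keep the sketch self-contained.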

The Ingestion to Delivery Flow

The full pipeline has four stages. Ingestion pulls fresh listings from multiple ATS APIs (Greenhouse, Lever, Ashby, Workable, Recruitee). Normalisation cleans the raw data: strip HTML, unify job types, and remove duplicates by comparing listing URLs. Scoring is the batch LLM pipeline. Delivery pushes the scored listings into MongoDB and exposes them through a REST API with cursor-based pagination.
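For the delivery stage, cursor-based pagination can be sketched as below. This assumes Node.js (for Buffer) and a sortable string field called jobId; both the field and the helper names are hypothetical, and the real API may encode its cursors differently.

```javascript
// Encode the last seen id into an opaque cursor string.
function encodeCursor(lastId) {
  return Buffer.from(JSON.stringify({ lastId }), "utf8").toString("base64url");
}

// Decode a cursor; anything malformed is treated as "start from the beginning".
function decodeCursor(cursor) {
  if (!cursor) return null;
  try {
    const { lastId } = JSON.parse(Buffer.from(cursor, "base64url").toString("utf8"));
    return typeof lastId === "string" ? lastId : null;
  } catch {
    return null;
  }
}

// Build the filter, sort, and limit for one page of results.
function buildPageQuery(cursor, limit = 50) {
  const lastId = decodeCursor(cursor);
  return {
    filter: lastId ? { jobId: { $gt: lastId } } : {},
    sort: { jobId: 1 },
    limit: limit + 1, // one extra row reveals whether another page exists
  };
}

// Trim the extra row and attach the cursor for the next request.
function toPage(rows, limit = 50) {
  const items = rows.slice(0, limit);
  const hasMore = rows.length > limit;
  return {
    items,
    nextCursor: hasMore ? encodeCursor(items[items.length - 1].jobId) : null,
  };
}
```

Unlike skip/limit paging, the query cost stays flat however deep a consumer pages, because each request seeks directly to the last id it saw.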

The most important insight from the ingestion stage: raw ATS data varies wildly in quality. One platform sends the job description as a rich HTML blob with inline styles, another sends plain text truncated at a short character limit, and a third wraps everything in a CDATA tag. A normalisation layer that passes each listing through a consistent pipeline is essential: strip tags, decode entities, extract salary ranges with regex, and fill missing fields with sensible defaults. Without this normalisation, the LLM would score garbage inputs and produce garbage outputs.
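A simplified sketch of such a normalisation layer follows. The function names, the handful of entities it decodes, the USD-only salary pattern, and the "unspecified" default are all assumptions for illustration; a production version would likely lean on a proper HTML parser.

```javascript
const ENTITIES = {
  "&amp;": "&", "&lt;": "<", "&gt;": ">",
  "&quot;": '"', "&#39;": "'", "&nbsp;": " ",
};

// Reduce a raw ATS description to plain text.
function cleanDescription(raw) {
  return String(raw ?? "")
    .replace(/<!\[CDATA\[([\s\S]*?)\]\]>/g, "$1") // unwrap CDATA sections
    .replace(/<(script|style)\b[\s\S]*?<\/\1>/gi, " ") // drop script/style bodies
    .replace(/<[^>]+>/g, " ") // strip remaining tags
    .replace(/&(?:amp|lt|gt|quot|#39|nbsp);/g, (m) => ENTITIES[m]) // decode entities
    .replace(/\s+/g, " ")
    .trim();
}

// Pull a range like "$120,000 - $150,000" or "$90k to $110k" out of the text.
function extractSalaryRange(text) {
  const m = text.match(
    /\$\s?(\d[\d,]*)(k)?\s*(?:[-\u2013]|to)\s*\$?\s?(\d[\d,]*)(k)?/i
  );
  if (!m) return null;
  const toNumber = (digits, k) =>
    Number(digits.replace(/,/g, "")) * (k ? 1000 : 1);
  return { min: toNumber(m[1], m[2]), max: toNumber(m[3], m[4]) };
}

// Apply the same steps to every listing, whatever the source.
function normaliseListing(raw) {
  const description = cleanDescription(raw.description);
  return {
    title: String(raw.title ?? "").trim(),
    description,
    jobType: raw.jobType ?? "unspecified", // default for missing fields
    salary: extractSalaryRange(description),
  };
}
```

Every source goes through the same function, so the scoring stage only ever sees one shape of listing.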

Error Handling Without Burning Money

OpenAI API failures are not rare. They happen every day: rate limits, timeouts, server errors, and occasionally malformed responses. The worst approach is to retry immediately with the same batch; exponential backoff with jitter is standard practice here. But the bigger cost saver is a dead letter queue.
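For reference, a minimal sketch of exponential backoff with full jitter. The base delay, cap, attempt limit, and function names are assumed values for illustration, not the ones the pipeline actually uses.

```javascript
// Full jitter: pick a random delay between 0 and min(cap, base * 2^attempt).
function backoffDelayMs(
  attempt,
  { baseMs = 500, capMs = 30000, random = Math.random } = {}
) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry an async operation. `isRetryable` decides which errors deserve
// another attempt; everything else is rethrown immediately.
async function withRetry(
  operation,
  { maxAttempts = 5, isRetryable = () => true, ...backoff } = {}
) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      if (attempt + 1 >= maxAttempts || !isRetryable(err)) throw err;
      await sleep(backoffDelayMs(attempt, backoff));
    }
  }
}
```

The randomness matters as much as the exponent: without jitter, every worker that hit the same rate limit retries at the same moment and hits it again. In practice isRetryable would typically let through rate-limit and server errors while rejecting validation errors that no retry can fix.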

If a batch fails multiple times, it gets pushed to a separate collection with the failure reason and raw payload. A scheduled job scans the dead letter queue periodically, tries a cheaper model like GPT-4o-mini for a simpler scoring task, and if that still fails it logs the listing for manual review. This approach reduces retry costs meaningfully because you stop hammering GPT-4 on clearly broken inputs.
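The routing logic can be sketched as a small decision function plus the record that lands in the dead letter collection. The attempt threshold, field names, and action labels here are hypothetical.

```javascript
const MAX_PRIMARY_ATTEMPTS = 3; // assumed threshold, tune to your budget

// Build the document stored in the dead letter collection.
function toDeadLetter(batch, error, now = new Date()) {
  return {
    batchId: batch.batchId,
    payload: batch.listings, // raw payload kept for later inspection
    failureReason: error instanceof Error ? error.message : String(error),
    attempts: batch.attempts,
    fallbackTried: false,
    failedAt: now.toISOString(),
  };
}

// Decide what happens to a batch after a failed attempt:
// keep retrying the primary model, hand it to the cheaper model via the
// dead letter queue, or give up and flag it for a human.
function nextAction(state) {
  if (state.attempts < MAX_PRIMARY_ATTEMPTS) return "retry_primary";
  if (!state.fallbackTried) return "try_cheaper_model";
  return "manual_review";
}
```

Keeping the decision in one pure function makes the retry budget easy to reason about and easy to test without touching the API.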

Another effective trick is pre-validating listings before they enter the scoring queue. Any record with a missing title, empty description, or a language that the normalisation layer doesn't recognise gets flagged and skipped. The model never sees it. That prevents wasting tokens on unscoreable content and keeps pipeline throughput steady.
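A pre-validation gate along those lines might look like this. The supported-language set, the language field, and the problem labels are assumptions; the point is only that the checks are cheap and run before any tokens are spent.

```javascript
const SUPPORTED_LANGUAGES = new Set(["en"]); // assumed; extend as needed

const isBlank = (value) => typeof value !== "string" || value.trim() === "";

// Check one listing and report every reason it cannot be scored.
function preValidate(listing) {
  const problems = [];
  if (isBlank(listing.title)) problems.push("missing_title");
  if (isBlank(listing.description)) problems.push("empty_description");
  if (!SUPPORTED_LANGUAGES.has(listing.language)) {
    problems.push("unsupported_language");
  }
  return { ok: problems.length === 0, problems };
}

// Split listings into those that go to the model and those that get flagged.
function partitionForScoring(listings) {
  const scoreable = [];
  const flagged = [];
  for (const listing of listings) {
    const result = preValidate(listing);
    if (result.ok) scoreable.push(listing);
    else flagged.push({ listing, problems: result.problems });
  }
  return { scoreable, flagged };
}
```

Recording the reasons alongside each flagged listing makes it easy to see later which source is producing the unusable records.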

What Would Be Different on a Second Pass

If I were building this today, I would design a streaming architecture from the start instead of batch polling. A message broker such as RabbitMQ, or Redis used as a queue, would decouple ingestion from scoring and give the scoring workers a way to apply backpressure. The current cron-based schedule works but creates bursts of load on MongoDB and OpenAI simultaneously.

I would also invest earlier in observability. Without insight into how long each stage takes, which ATS source has the highest error rate, or how many jobs fail scoring per hour, you are debugging blind. Once structured logging and a simple dashboard are in place, patterns become obvious. For example, it is common to find that one particular ATS source produces a disproportionate share of pipeline failures because its API returns inconsistent field names. That data tells you where to focus.
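As a rough illustration of how little is needed to get that signal, here is a sketch of one structured log line per stage run and an error-rate rollup per ATS source. The event shape and function names are hypothetical.

```javascript
// Emit one JSON log line per stage run and return the entry.
function logEvent(stage, source, durationMs, ok, write = console.log) {
  const entry = {
    ts: new Date().toISOString(),
    stage,
    source,
    durationMs,
    ok,
  };
  write(JSON.stringify(entry));
  return entry;
}

// Roll events up into an error rate per source, worst offender first.
function errorRateBySource(events) {
  const totals = new Map();
  for (const { source, ok } of events) {
    const t = totals.get(source) ?? { total: 0, failed: 0 };
    t.total += 1;
    if (!ok) t.failed += 1;
    totals.set(source, t);
  }
  return [...totals.entries()]
    .map(([source, { total, failed }]) => ({
      source,
      total,
      failed,
      errorRate: failed / total,
    }))
    .sort((a, b) => b.errorRate - a.errorRate);
}
```

Because every line is plain JSON, the same events can feed a dashboard later without changing the pipeline code.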

This pipeline taught me that production AI work is not about clever prompts. It is about data engineering: normalisation, batching, error handling, and knowing when to let something fail instead of spending more money trying to save it. The LLM is just one component in a system that has to survive real traffic, real network issues, and real messy data.

If your team is wrestling with an AI pipeline that works in a demo but stumbles under real load, that is exactly the kind of thing I help with. I build production AI pipelines that stay stable when the volume spikes. Happy to compare notes.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.