DEV Community: Abdul Rehman

Production RAG at Scale: Lessons from Processing 10,000+ Listings Daily

Abdul Rehman — Wed, 15 Jul 2026 09:05:08 +0000

I spent time debugging a RAG pipeline that looked correct in staging and behaved differently under real load. The job board I built processes thousands of listings every day, scoring each one against candidate profiles using LLM function calls. Getting that pipeline stable and affordable took more than a good vector store choice. It took understanding where the system actually breaks.

Here's what I learned about chunking, embeddings, vector stores, cost control, and observability when RAG stops being a demo and starts being a production service.

Chunking Strategy: The Decision That Determines Everything

Most tutorials treat chunking as a parameter you tune later. In production, your chunking strategy determines your embedding quality, your retrieval accuracy, and your cost structure from day one.

For job listings, I tried three approaches before landing on one that worked:

Fixed-size chunks (512 tokens). Simple but terrible for this use case. A job description is a structured document with sections: responsibilities, qualifications, benefits. Fixed chunks split those sections arbitrarily. You lose the semantic boundary between "must have 5 years Python" and "nice to have React experience." Retrieval becomes noisy.

Semantic chunking with section headers. Better. I split on markdown headers and paragraph breaks. Each chunk preserved a complete thought. But some sections were too long (qualifications could run 800 tokens) and others too short (benefits might be two lines). Uneven chunk sizes meant wasted embedding calls on tiny sections and missed context on large ones.

Recursive character splitting with overlap. This is what stayed. I split on double newlines first, then single newlines, then sentences. Each chunk targets roughly 400 tokens with some overlap. The overlap is critical: it catches the boundary cases where a sentence spans two chunks. For job listings specifically, I added a pre-processing step that normalizes the raw ATS output into a consistent structure before chunking. Greenhouse and Lever both return different formats. Normalizing first means the chunker sees clean input.

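// Assuming LangChain's splitter, which is where this class name comes from
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'

async function chunkJobListing(text: string): Promise<string[]> {
  const normalized = normalizeListing(text) // strip HTML, unify line endings
  const splitter = new RecursiveCharacterTextSplitter({
    // This splitter measures length in characters by default, so 400 here is
    // well under 400 tokens. Scale the value up or use a token-aware splitter
    // to hit a token target.
    chunkSize: 400,
    chunkOverlap: 50,
    separators: ['\n\n', '\n', '. ', ' '],
  })
  return splitter.splitText(normalized)
}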

The key lesson: design your chunking around your document type, not around a generic token budget. Job listings have a natural structure. Legal documents have a different one. Product descriptions have another. Match the chunker to the content.
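The snippet above calls normalizeListing without showing it. A minimal sketch of what such a helper might look like follows. The function body is an assumption, not the production normalizer, and the regex-based tag stripping is deliberately naive compared with a real HTML parser.

```typescript
// Hypothetical sketch of normalizeListing; the article does not show the real one.
function normalizeListing(raw: string): string {
  return raw
    .replace(/\r\n?/g, '\n')                     // unify line endings to \n
    .replace(/<br\s*\/?>/gi, '\n')               // keep line breaks expressed as <br>
    .replace(/<\/(p|div|li|h[1-6])>/gi, '\n\n')  // closing block tags become paragraph breaks
    .replace(/<[^>]+>/g, '')                     // drop remaining tags (naive, not a parser)
    .replace(/&nbsp;/gi, ' ')                    // decode the two most common entities
    .replace(/&amp;/gi, '&')
    .replace(/[ \t]+/g, ' ')                     // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, '\n')                    // trim spaces around newlines
    .replace(/\n{3,}/g, '\n\n')                  // at most one blank line between paragraphs
    .trim();
}
```

Because block tags become double newlines, the recursive splitter's first separator lines up with the listing's own paragraph boundaries regardless of which ATS produced the markup.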

Embedding Model: OpenAI vs Local

I started with text-embedding-3-small. It works, it's reliable, and the API is trivial to call. At thousands of listings per day, each listing producing several chunks, I was generating tens of thousands of embeddings daily. The cost was manageable but not trivial.

I tested a self-hosted alternative: Llama 3.1 via Ollama on the same AWS EC2 instance running the application. The embedding quality was noticeably worse for this use case. Job descriptions contain domain-specific language: "ATS-compliant resume," "Greenhouse integration," "equity compensation." The local model blurred these distinctions. Two listings that required very different skill sets ended up with similar embeddings. That result is not entirely surprising, since Llama 3.1 is a general-purpose generative model rather than one trained specifically to produce embeddings.

The tradeoff was clear. OpenAI's embeddings cost money but returned accurate matches. The self-hosted model was free but produced noisy retrieval that required more LLM calls to correct downstream. I stayed with OpenAI.

One optimization that helped: I batch embedding requests. Instead of calling the API once per chunk, I send arrays of up to 100 chunks in a single call. OpenAI charges per token, not per call, so batching saves no token cost. But it replaces N sequential round trips with a single request, which cuts latency and keeps the pipeline fast enough to process the daily volume within the scraping window.
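The batching described above can be sketched roughly as follows. The names toBatches, embedAll, and embedBatch are hypothetical. embedBatch stands in for whatever wrapper sends one embeddings request and returns one vector per input, in input order.

```typescript
// Split an array into consecutive batches of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Assumed wrapper around a single embeddings API request.
type EmbedBatch = (inputs: string[]) => Promise<number[][]>;

// Embed every chunk using one request per batch instead of one per chunk.
async function embedAll(
  chunks: string[],
  embedBatch: EmbedBatch,
  batchSize = 100,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const result = await embedBatch(batch);
    // Guard against a silent mismatch between inputs and returned vectors.
    if (result.length !== batch.length) {
      throw new Error(`expected ${batch.length} vectors, got ${result.length}`);
    }
    vectors.push(...result);
  }
  return vectors;
}
```

The length check is there because a mismatch between chunks and vectors is exactly the kind of quiet failure that is hard to spot later in retrieval.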

Vector Store: Pinecone vs pgvector

I evaluated both and ended up with a hybrid that surprised me.

Pinecone is easy to set up and fast at query time. I had a working PoC in an afternoon. But at production scale, the cost added up fast. A Pinecone pod index with enough capacity for the vector volume I needed cost more than I wanted to pay per month. And I needed retention beyond a short window for historical matching.

pgvector inside PostgreSQL was more work upfront. I had to write the index setup, the query logic, and the maintenance scripts myself. But the savings were dramatic. My existing PostgreSQL instance on the same EC2 box handled the vector workload with no additional infrastructure cost. Query performance was close enough for approximate nearest neighbor search with the right index configuration (IVFFlat with appropriate lists and probes).

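-- lists = 100 is a starting point; size it to the number of rows
CREATE INDEX idx_job_embeddings ON job_chunks
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- probes sets how many lists each query scans (example value; higher means better recall, slower queries)
SET ivfflat.probes = 10;

SELECT * FROM job_chunks
ORDER BY embedding <=> $1::vector
LIMIT 20;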

The real win was transactional consistency. With Pinecone, I had to manage sync between PostgreSQL and the vector store. A listing could be in one but not the other during failures. With pgvector, the embedding lives in the same database as the listing data. One transaction, one source of truth. No reconciliation scripts.
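To make the "one transaction" point concrete, here is a rough sketch of writing a listing and its chunk embeddings together. The table and column names (job_listings, listing_id, content) and the SqlClient interface are assumptions for illustration. A node-postgres client checked out from a pool would satisfy the interface.

```typescript
// Minimal client shape so the sketch stays self-contained.
interface SqlClient {
  query(text: string, params?: unknown[]): Promise<unknown>;
}

// pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(',')}]`;
}

// Insert the listing and all of its chunks, or none of them.
async function saveListingWithChunks(
  client: SqlClient,
  listing: { id: string; title: string },
  chunks: { content: string; embedding: number[] }[],
): Promise<void> {
  await client.query('BEGIN');
  try {
    await client.query(
      'INSERT INTO job_listings (id, title) VALUES ($1, $2)',
      [listing.id, listing.title],
    );
    for (const chunk of chunks) {
      await client.query(
        'INSERT INTO job_chunks (listing_id, content, embedding) VALUES ($1, $2, $3::vector)',
        [listing.id, chunk.content, toVectorLiteral(chunk.embedding)],
      );
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK'); // listing and chunks are discarded together
    throw err;
  }
}
```

All statements have to run on the same connection for the transaction to hold, so this should be given a dedicated client rather than a pool that may hand each query to a different connection.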

I use Pinecone for prototyping new RAG features. I use pgvector for production. That pattern has held across multiple projects.

LLM Scoring at Scale: Function Calls and Cost Control

The embedding search returns candidate chunks. Then I need an LLM to score each job listing's relevance to a specific candidate profile. This is the expensive part.

I use GPT-4o with function calling for scoring. The function schema is strict: it outputs a relevance score (0-100), a brief justification, and a list of matching skills. The schema constrains the shape of the output, but a schema alone cannot stop the model from fabricating matches. That takes a separate check after the call: every skill in the output must be present in the listing text.

const scoreFunction: ChatCompletionTool = {
  type: 'function',
  function: {
    name: 'score_job_relevance',
    parameters: {
      type: 'object',
      properties: {
        relevance_score: { type: 'number', minimum: 0, maximum: 100 },
        justification: { type: 'string', maxLength: 200 },
        matching_skills: { type: 'array', items: { type: 'string' } },
      },
      required: ['relevance_score', 'justification', 'matching_skills'],
    },
  },
}
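The requirement that every returned skill appear in the listing text is not something a JSON schema can express, so it needs a check after the call. A minimal sketch, assuming a simple case-insensitive substring match is good enough (checkSkillsAgainstListing and normalizeText are hypothetical names):

```typescript
interface ScoreResult {
  relevance_score: number;
  justification: string;
  matching_skills: string[];
}

// Lowercase and collapse whitespace so formatting noise does not affect matching.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Separate skills that literally appear in the listing from those that do not.
function checkSkillsAgainstListing(
  result: ScoreResult,
  listingText: string,
): { grounded: string[]; ungrounded: string[] } {
  const haystack = normalizeText(listingText);
  const grounded: string[] = [];
  const ungrounded: string[] = [];
  for (const skill of result.matching_skills) {
    const needle = normalizeText(skill);
    if (needle.length > 0 && haystack.includes(needle)) {
      grounded.push(skill);
    } else {
      ungrounded.push(skill);
    }
  }
  return { grounded, ungrounded };
}
```

Substring matching is crude: "Java" will match a listing that only mentions "JavaScript". A word-boundary match or a skills taxonomy would be stricter, at the cost of more moving parts.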

At thousands of listings per day, scoring each one with GPT-4o would cost far more than is sustainable.

Three things brought it down:

OpenAI Batch API. I send scoring jobs in batches of several hundred at a time. The Batch API gives a significant discount and completes within a few hours. Since the scoring pipeline runs overnight during the scraping window, latency isn't a problem. The batch completes before the morning traffic spike.

Caching. If a listing has already been scored for a candidate profile with the same skill vector, I return the cached result. The cache key is a hash of the listing ID plus the candidate's skill vector, so only profiles that produce an identical vector hit the cache. The hit rate is meaningful for repeat candidates.

Model tiering. Not every listing needs GPT-4o. Listings for common roles (software engineer, sales representative) score well enough with GPT-4o-mini. I reserve GPT-4o for niche roles where accuracy matters more (compliance officer, biomedical engineer). The routing logic is simple: if the job title matches a list of common titles, use the cheaper model.
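The caching and tiering rules above might be sketched like this. It assumes the candidate's skills are available as a list of strings and that the common-title list is a plain array. Both are illustrative choices, and the two titles shown are just the examples from the text.

```typescript
import { createHash } from 'node:crypto';

// Example titles only; the real list is not given in the article.
const COMMON_TITLES = ['software engineer', 'sales representative'];

// Route common roles to the cheaper model and everything else to the larger one.
function pickModel(jobTitle: string): 'gpt-4o' | 'gpt-4o-mini' {
  const title = jobTitle.toLowerCase();
  return COMMON_TITLES.some((t) => title.includes(t)) ? 'gpt-4o-mini' : 'gpt-4o';
}

// Build a cache key from the listing ID and a canonical form of the skills.
// Trimming, lowercasing, de-duplicating, and sorting make the key independent
// of order and casing, so equivalent profiles share one cache entry.
function scoreCacheKey(listingId: string, skills: string[]): string {
  const canonical = [...new Set(skills.map((s) => s.trim().toLowerCase()))].sort();
  return createHash('sha256')
    .update(JSON.stringify([listingId, canonical]))
    .digest('hex');
}
```

Canonicalizing before hashing is what turns "same candidate, slightly different input" into a cache hit. Without it, a hash only matches byte-identical input.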

Total daily LLM cost dropped substantially. The Batch API alone accounted for much of the savings.

Observability: What Broke and How I Caught It

The pipeline ran silently for a period before I discovered it was failing on a noticeable percentage of listings. The errors were non-fatal: a malformed ATS response would cause the chunker to produce empty chunks, which the embedding step would skip silently, and the listing would never appear in search results. No crashes, no alerts.

I added Sentry for error tracking and LogRocket for session replay on the frontend. But the real win was adding structured logging with a correlation ID that traces a single listing through the entire pipeline: ingestion, normalization, chunking, embedding, storage, scoring.

logger.info('Listing processed', {
  correlationId,          // same ID on every log line for this listing, from ingestion to scoring
  listingId: id,
  source: 'greenhouse',
  chunks: chunkCount,
  embeddingLatency: ms,
  score: relevanceScore,
})

This let me build a dashboard showing pipeline health per source. Some ATS sources processed successfully at a high rate. Others failed more often because their API returns HTML in description fields that the normalizer didn't handle. I fixed the normalizer, and the failure rate dropped significantly.
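A per-source health view can be derived from log records like the one above. The sketch below treats a listing that produced zero chunks as a failure, which is an assumption based on the silent empty-chunk bug described earlier. The ListingLog shape and healthBySource name are hypothetical.

```typescript
interface ListingLog {
  listingId: string;
  source: string;
  chunks: number; // 0 means the chunker produced nothing usable
}

interface SourceHealth {
  total: number;
  failed: number;
  failureRate: number;
}

// Aggregate processed-listing logs into a failure rate per ATS source.
function healthBySource(logs: ListingLog[]): Map<string, SourceHealth> {
  const health = new Map<string, SourceHealth>();
  for (const log of logs) {
    const entry = health.get(log.source) ?? { total: 0, failed: 0, failureRate: 0 };
    entry.total += 1;
    if (log.chunks === 0) entry.failed += 1; // never searchable, so count it as failed
    entry.failureRate = entry.failed / entry.total;
    health.set(log.source, entry);
  }
  return health;
}
```

An alert on failureRate per source would have surfaced the empty-chunk problem without anyone needing to notice missing search results.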

Without the correlation ID, I would have found that bug much later, if at all.

What I'd Do Differently

If I started this pipeline today, I'd skip Pinecone entirely and go straight to pgvector. The setup cost is higher, but the operational simplicity of having one database is worth it. I'd also build the observability layer before the pipeline, not after.

The biggest surprise was how much of the work was data normalization, not AI. Most of the bugs came from inconsistent ATS output formats, not from the LLM or the vector store. The AI parts are surprisingly reliable once you handle the data plumbing correctly.

If your team is building a production RAG pipeline and hitting the same kind of silent failure modes or cost surprises, that's the kind of thing I help with. Happy to compare notes on what worked and what didn't.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

How to Build AI Agents That Won't Delete Your Database

Abdul Rehman — Tue, 14 Jul 2026 09:02:08 +0000

I watched a test agent try to delete a PostgreSQL table. It had the credentials. It had the intent. The only thing stopping it was a single line in the system prompt that said "you are a read-only assistant."

That line held. But I've seen enough close calls to know that a prompt is not a safety net.

Every week there's a new horror story. An agent that wiped a production database. An agent that mailed 10,000 customers the wrong message. An agent that ran up a $50,000 API bill in an hour. The common thread is not malicious intent. It's architecture that treats the LLM as a trusted operator instead of a powerful but unreliable intern.

The answer is not "don't use agents." The answer is "use agents with the right guardrails." I've built systems that use LLMs for scoring, generation, and structured extraction at production scale. Here are the patterns that keep agents from doing damage.

Default to Read-Only

Every agent I build starts life as a read-only system. It can observe, analyze, and report. It cannot write, delete, or execute.

This is not a feature you add later. It is the default state. You promote the agent to write access only when you have proven that it needs it and that you can control what it writes.

The implementation is straightforward. Give the agent a database connection with read-only credentials. Wrap all write operations behind a separate service that requires explicit approval. Never let the agent call a mutation API directly.

Here is what that looks like in practice with OpenAI function calling:

const readOnlyFunctions = [
  {
    name: "query_database",
    description: "Execute a SELECT query against the database. Read only.",
    parameters: {
      type: "object",
      properties: {
        sql: { type: "string", description: "The SELECT query to execute" },
      },
      required: ["sql"],
    },
  },
  {
    name: "search_records",
    description: "Search for records matching criteria.",
    parameters: { /* ... */ },
  },
];

// The write operations themselves live in a separate system the agent cannot reach.
// The only write-related tool the agent gets is a REQUEST, which a human must approve.
const writeFunctions = [
  {
    name: "request_write",
    description: "Request a write operation. Requires human approval.",
    parameters: {
      type: "object",
      properties: {
        action: { type: "string", enum: ["update", "delete", "insert"] },
        target: { type: "string" },
        reason: { type: "string" },
      },
      required: ["action", "target", "reason"],
    },
  },
];

The agent can ask for a write. It cannot perform one. This forces a human-in-the-loop gate at the architectural level, not just the prompt level.

Human-in-the-Loop Gates That Actually Work

The classic "human in the loop" is a yes/no confirmation dialog. It works until users develop approval fatigue and start clicking "yes" without reading. I've seen it happen.

The better pattern is to require the human to re-express the intent in their own words. Instead of "approve this action," show a summary and ask the user to type or confirm the specific change. For example, rather than a button that says "Delete user 1234," show the user a preview of what will happen and require them to type "delete user 1234" to proceed.

For high-risk actions, add a cooldown period. The agent cannot execute the same action twice within a window. This prevents a runaway loop where the agent floods the approval queue with identical requests.
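One way to sketch that cooldown is a small in-memory guard keyed on action and target. ActionCooldown is a hypothetical name, and the injected clock exists only to make the behavior easy to test. A multi-process deployment would need shared storage instead of a Map.

```typescript
class ActionCooldown {
  private readonly lastAllowed = new Map<string, number>();
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(windowMs: number, now: () => number = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true if the action may proceed, false if an identical
  // action was already allowed inside the cooldown window.
  tryAcquire(action: string, target: string): boolean {
    const key = `${action}:${target}`;
    const current = this.now();
    const previous = this.lastAllowed.get(key);
    if (previous !== undefined && current - previous < this.windowMs) {
      return false;
    }
    this.lastAllowed.set(key, current);
    return true;
  }
}
```

Rejected attempts do not extend the window here, so a looping agent cannot lock an action out indefinitely. Whether that is the right choice depends on how aggressive the guard should be.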

Consider a social media engagement agent that autonomously replies to conversations. Suppose you give it the ability to generate replies but not to post them. A human reviews each generated reply, edits if needed, and clicks publish. That extra step catches hallucinations, tone mismatches, and the kind of inappropriate joke an LLM thinks is funny. The agent never touches the posting API at all. That separation is what saves you.

Idempotent Actions and Safe Retries

Agents fail. They time out. They retry. The worst thing you can do is let an agent retry a non-idempotent action.

Every mutation your agent performs should be safe to call twice. Use idempotency keys. Generate a unique key for each action, and if the agent retries with the same key, the system deduplicates it.

Here is a pattern I use for job application agents that need to submit applications:

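async function submitApplication(application: Application, idempotencyKey: string) {
  // Reserve the key before calling the ATS. This assumes a unique constraint
  // on idempotencyKey, so a retry or concurrent call fails here instead of
  // submitting a second time.
  try {
    await db.applications.create({
      data: { idempotencyKey, status: 'pending' },
    });
  } catch (err) {
    // Key already reserved: return the stored record rather than resubmitting
    const existing = await db.applications.findUnique({
      where: { idempotencyKey },
    });
    if (existing) return existing;
    throw err; // some other database error
  }

  // Submit the application
  const result = await atsApi.submit(application);

  // Record the outcome against the reserved key
  return db.applications.update({
    where: { idempotencyKey },
    data: { status: result.status, submittedAt: new Date() },
  });
}

The agent always passes an idempotency key. If the API call times out and the agent retries, the second call finds the reserved key and returns the stored record instead of submitting again. A record still marked 'pending' means the outcome of the first attempt is unknown and needs reconciliation, which is safer than a blind resubmit. No duplicate applications, no double charges, no double deletions.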

Sandbox the Agent's Environment

The agent should never have direct access to your production database or any production API. This is non-negotiable.

Create a proxy layer that sits between the agent and every external system. This proxy validates every request against a policy. The policy defines what the agent is allowed to do, what data it can see, and what it cannot touch.

For databases, I run agents against a replica that is read-only and lag-tolerant. The agent can see near-real-time data but cannot write to it. If the agent needs to write, it goes through the proxy which logs every mutation and enforces rate limits.

For external APIs, the proxy strips dangerous capabilities. If the agent is calling a CRM API, the proxy removes the DELETE endpoint from the available routes. The agent never sees the full API surface. It sees only what the proxy exposes.
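A proxy policy of this kind can be as simple as an allowlist of method and path pairs, with everything else denied. The sketch below is illustrative: the CRM routes are invented examples and isAllowed is a hypothetical helper.

```typescript
type HttpMethod = 'GET' | 'POST' | 'PUT' | 'PATCH' | 'DELETE';

interface Route {
  method: HttpMethod;
  path: string; // exact path, or a prefix ending in '*'
}

// Example policy: anything not listed is denied, so DELETE is never reachable.
const CRM_POLICY: Route[] = [
  { method: 'GET', path: '/contacts/*' },
  { method: 'POST', path: '/notes' },
];

// Deny by default; allow only requests that match a policy entry.
function isAllowed(policy: Route[], method: HttpMethod, path: string): boolean {
  return policy.some((route) => {
    if (route.method !== method) return false;
    return route.path.endsWith('*')
      ? path.startsWith(route.path.slice(0, -1))
      : path === route.path;
  });
}
```

Deny-by-default matters more than the matching logic: a new destructive endpoint added to the upstream API stays invisible to the agent until someone deliberately lists it.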

Error Recovery Without Cascading

When an agent's action fails, the default behavior should be to stop, not to retry blindly. A circuit breaker pattern works well here: suppose the agent gets three errors in a row from the same service. The circuit opens. The agent cannot call that service again until a human resets it.

This prevents a single misbehaving agent from hammering a downstream system during an outage. It also prevents the agent from making things worse by trying to "fix" a failed operation with a destructive alternative.
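The circuit breaker described above can be sketched in a few lines. This version is a minimal illustration with a hypothetical CircuitBreaker class. It opens after three consecutive failures and stays open until reset is called, which is meant to be wired to a human operator action rather than a timer.

```typescript
class CircuitBreaker {
  private consecutiveErrors = 0;
  private open = false;
  private readonly threshold: number;

  constructor(threshold = 3) {
    this.threshold = threshold;
  }

  // The agent must check this before calling the service.
  canCall(): boolean {
    return !this.open;
  }

  // A success clears the streak of consecutive errors.
  recordSuccess(): void {
    this.consecutiveErrors = 0;
  }

  // A failure extends the streak and opens the circuit at the threshold.
  recordFailure(): void {
    this.consecutiveErrors += 1;
    if (this.consecutiveErrors >= this.threshold) {
      this.open = true;
    }
  }

  // Intended for a human operator, not for the agent itself.
  reset(): void {
    this.consecutiveErrors = 0;
    this.open = false;
  }
}
```

Keeping one breaker per downstream service means an outage in one system does not block the agent from reading others.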

I also log every action the agent takes, including the full reasoning chain. When something goes wrong, I can replay the agent's decision-making process and see exactly where it went off the rails. This is invaluable for debugging and for improving the guardrails.

The Architecture That Scales

The pattern that works across all agent systems I've designed is this: the agent is a proposal engine, not an execution engine. It generates ideas, arguments, and plans. A separate, tightly controlled execution layer decides whether to carry them out.

The agent never touches the production database. It never calls a mutation API directly. It never has admin credentials. It is given the minimum permissions needed to do its job, and those permissions are enforced at the infrastructure level, not the prompt level.

If your team is wrestling with AI agent safety and shipping slower because you're afraid of what the agent might do, that is the kind of thing I help with. I build production AI systems that are powerful enough to be useful and safe enough to ship without fear. Happy to compare notes.


How to Build an AI Agent Pipeline That Won't Delete Your Database

Abdul Rehman — Mon, 13 Jul 2026 09:02:51 +0000

An AI agent in production can delete your database. Not because the model is malicious. Because it doesn't know what it doesn't know. It sees a modal it can't parse, guesses the wrong action, and that guess gets executed against a live system.

I've seen this pattern in enough projects to know it's not a theoretical risk. The question is not whether your agent will make a mistake. It's whether your architecture can survive that mistake.

Here's what I've learned building production LLM pipelines, browser automation agents, and RAG systems. The safety patterns that actually work.

Sandboxed Execution Is Not Optional

Every AI agent action should run in a context that cannot reach production data directly. This sounds obvious. I've seen teams skip it because "the model is just generating text, it can't do anything dangerous."

A model generating text can call functions. Functions can hit APIs. APIs can delete rows.

On a job platform I built, the LLM scoring pipeline processes listings at scale. The pipeline has a strict architecture: the model never touches the database. It receives structured inputs, returns structured outputs, and a middleware layer validates every field before it touches a row.

// The agent never gets direct DB access
// It talks to a validation middleware
async function processAgentAction(action: AgentAction): Promise<ActionResult> {
  // Step 1: Validate the action against allowed operations
  const validation = validateAction(action, ALLOWED_ACTIONS);

  // Step 2: Log every action for audit, including rejected and pending ones
  await auditLog.create({
    agent: action.agentId,
    action: action.type,
    timestamp: new Date(),
    approved: validation.valid && action.type === 'read' // only valid reads are auto-approved
  });

  if (!validation.valid) {
    return { status: 'rejected', reason: validation.error };
  }

  // Step 3: The agent can only perform READ operations on its own
  // Any WRITE is parked until a human explicitly approves it
  if (action.type === 'write') {
    return { status: 'pending_approval', action };
  }

  return executeRead(action);
}

The rule is simple: the agent proposes, the system disposes. No write path exists without an approval gate.

Human in the Loop Means Real Approval, Not a Formality

Suppose you're building an autonomous job application module. The agent browses listings, fills forms, and submits. Full automation sounds great until the agent applies to the wrong job with the wrong resume.

What works is per-action approval. The candidate sees each match, reviews a summary, and explicitly approves before the agent acts. This is not a checkbox they click once and forget. It's a deliberate decision each time.

The pattern matters because approval fatigue is real. If you make someone approve 100 actions, they stop reading. They tap through. That's worse than no approval at all because it creates a false sense of safety.

The way to keep that sustainable is to batch approvals at a granularity that matches human attention span. Show 5-10 matches at a time with clear previews of what the agent will do. The candidate reads the job title, company, and a short summary. They approve. The agent handles the rest.

For higher-risk actions like payment or account changes, the approval gate needs to be explicit. A separate confirmation modal. A required text input. Something that forces the user to actually engage.

Idempotency Is Your Safety Net

The most dangerous AI agent bug I've seen is not the one that does the wrong thing. It's the one that does the wrong thing repeatedly.

An idempotent action produces the same result no matter how many times you call it. If your agent sends a "create lead" API call, and the API creates a duplicate lead each time, a retry loop or a confused agent can flood your system with garbage.

On a job platform, every pipeline step is idempotent by design:

// Idempotent job processing
// Running this 10 times produces the same result as running it once
async function processJobListing(
  jobId: string,
  source: string,
  rawData: RawListing, // parsed ATS payload; RawListing is an assumed type
): Promise<void> {
  // Relies on a unique constraint on externalId: a second run updates
  // the existing row instead of inserting a duplicate
  const externalId = `${source}_${jobId}`;
  await db.jobListing.upsert({
    where: { externalId },
    create: {
      externalId,
      title: rawData.title,
      company: rawData.company,
      // ... other fields
    },
    update: {
      // Only update fields that should change
      lastSeenAt: new Date(),
      status: rawData.status,
    }
  });
}

The upsert pattern is your friend. It says "insert if new, update if exists." No duplicates. No matter how many times the pipeline runs.

For browser automation agents, idempotency means checking state before acting. "Is this button already clicked? Is this form already submitted? Is this modal already dismissed?" The agent should read first, act second.

Rollback Strategies That Actually Work

Every AI agent action should be reversible or logged well enough to reverse manually.

For an autonomous apply module, that means a proxy email inbox. Every application goes through a unique email address. If something goes wrong, the candidate can see exactly what was sent, to whom, and when. They can follow up manually. The system logs every field that was submitted.

For database actions, rollback means transactions. If an agent writes to the database, it should do so inside a transaction that can be rolled back if the next validation step fails.

// Wrap agent writes in a transaction
// If any step fails, everything rolls back
async function executeAgentActionWithRollback(action: AgentAction): Promise<unknown> {
  // The transaction commits only if the callback resolves without throwing
  return db.$transaction(async (tx) => {
    // Step 1: Execute the action
    // action.table should be checked against an allowlist before reaching this point
    const result = await tx[action.table].create({
      data: action.data
    });

    // Step 2: Validate the result with a separate process
    const validation = await validateResult(result);
    if (!validation.passed) {
      // Throw to trigger rollback
      throw new Error(`Validation failed: ${validation.reason}`);
    }

    // Returned to the caller only after validation passed and the commit succeeded
    return result;
  });
}

The transaction boundary creates a clean line. If the agent hallucinates, the database stays clean. You retry with a corrected prompt, not with a data cleanup script.

The Architecture That Survives Production

Here's the pattern I use now for every AI agent pipeline I build:

Isolation layer: The agent never touches production data directly. It reads through a restricted API, writes through a validation middleware.
Per-action approval: Every write action requires explicit human confirmation. Reads are auto-approved but logged.
Idempotent operations: Every action produces the same result on retry. Upserts, not inserts. State checks before state changes.
Transaction boundaries: All writes happen inside transactions. Validation gates sit between the write and the commit.
Full audit trail: Every action is logged with agent ID, timestamp, input, output, and approval status. You can replay any session.
Kill switch: One API call stops all agent activity. Not a rate limit. A hard stop.
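The kill switch is the least obvious item in that list to implement, so here is a minimal in-process sketch. KillSwitch and runAgentAction are hypothetical names. A real deployment with several workers would keep the flag in shared storage so that a single API call reaches all of them.

```typescript
class KillSwitch {
  private stopped = false;
  private reason = '';

  // Hard stop: every later action is refused until resume() is called.
  stop(reason: string): void {
    this.stopped = true;
    this.reason = reason;
  }

  resume(): void {
    this.stopped = false;
    this.reason = '';
  }

  // Throws if agent activity has been halted.
  assertRunning(): void {
    if (this.stopped) {
      throw new Error(`Agent activity halted: ${this.reason}`);
    }
  }
}

// Every agent action goes through this wrapper, so there is one place to stop them all.
function runAgentAction<T>(killSwitch: KillSwitch, action: () => T): T {
  killSwitch.assertRunning();
  return action();
}
```

Unlike a rate limit, this does not slow the agent down or queue work. It refuses outright, which is the behavior you want when something is actively going wrong.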

These patterns apply whether you're building an LLM scoring pipeline, a browser automation agent, or a RAG system. The specifics change. The principles don't.

If your team is evaluating AI agent integration and wondering how to ship fast without shipping dangerous, that's the kind of thing I help with. Happy to compare notes on what's worked and what hasn't.


How to Build AI Agents That Don't Delete Your Production Database

Abdul Rehman — Sun, 12 Jul 2026 09:02:50 +0000

Suppose you give an AI agent a database connection string with write permissions and a tool to execute raw SQL. It will do exactly what you asked. That's the problem.

I've seen teams make this mistake. They treat agent safety as an afterthought, something to add after something breaks. But by then the damage is done. The guardrails I rely on came from studying those failures and building systems that prevent them from happening in the first place.

Read-Only Connections Are Your First Line of Defense

The single most effective safety measure I've implemented costs nothing and takes five minutes. Give your agent a read-only database connection by default. Only promote to write access when a human explicitly approves the operation.

import { Pool } from 'pg';

// Database connection factory with role-based permissions
// Pools are created once; building a new Pool on every call would leak connections
const readPool = new Pool({
  connectionString: process.env.DB_READONLY_URL,
  // This user has SELECT only on all tables
});

const writePool = new Pool({
  connectionString: process.env.DB_WRITE_URL,
  // This user has INSERT, UPDATE, DELETE on specific tables
});

function getDbConnection(operation: 'read' | 'write'): Pool {
  return operation === 'read' ? readPool : writePool;
}

// Agent tool definition
const queryDatabase = {
  name: 'query_database',
  execute: async (sql: string, context: AgentContext) => {
    const connection = getDbConnection(
      context.hasWriteApproval ? 'write' : 'read'
    );
    return connection.query(sql);
  }
};

This pattern alone prevents an entire category of catastrophic failures. The agent can explore data, analyze patterns, and generate insights without ever touching production records. When it genuinely needs to write, the flow goes through a human approval step.

Consider an agent that parses natural language buyer requirements and matches them against property opportunities in a PostgreSQL database. The agent should never have write access. It should only query and return results. That constraint forces you to design a separate ingestion pipeline for new data, which turns out to be the right architecture anyway.

Human-in-the-Loop Isn't Optional, It's Architecture

The teams that skip human approval are the ones that deal with the consequences. I've learned to treat human approval as a design constraint from day one, not a feature to add later.

My pattern looks like this: every write operation generates a preview. The preview shows exactly what will change, in plain language and in structured data. A human reviews it. If they approve, the change executes. If they reject it, the agent gets feedback and tries a different approach.

// Approval queue for write operations
interface WriteOperation {
  id: string;
  agentId: string;
  operation: 'insert' | 'update' | 'delete';
  targetTable: string;
  preview: {
    affectedRows: number;
    beforeState?: Record<string, unknown>;
    afterState: Record<string, unknown>;
  };
  status: 'pending' | 'approved' | 'rejected';
  createdAt: Date;
}

async function requestWriteApproval(
  operation: WriteOperation
): Promise<boolean> {
  // Store in approval queue
  await db.insert(operationQueue).values(operation);

  // Notify human via Slack/email
  await notifyApprovalChannel({
    agentId: operation.agentId,
    preview: operation.preview,
    approvalUrl: `${APP_URL}/approvals/${operation.id}`,
  });

  // Wait for human response (with timeout)
  return pollForApproval(operation.id, TIMEOUT_30_MINUTES);
}
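pollForApproval is called above but not shown. A rough sketch of how such a helper might work follows. It differs from the call above in that the status lookup is injected as getStatus (an assumption, so the example does not depend on a database), and it treats a timeout the same as a rejection.

```typescript
type ApprovalStatus = 'pending' | 'approved' | 'rejected';

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical version of pollForApproval with an injected status lookup.
async function pollForApproval(
  getStatus: () => Promise<ApprovalStatus>,
  timeoutMs: number,
  intervalMs = 5000,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await getStatus();
    if (status === 'approved') return true;
    if (status === 'rejected') return false;
    // Still pending: wait, but never sleep past the deadline.
    await sleep(Math.min(intervalMs, Math.max(0, deadline - Date.now())));
  }
  return false; // timed out, so fail closed
}
```

Failing closed on timeout is the important design choice: an unanswered approval request must never turn into an executed write.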

Think about an agent tasked with updating job listing statuses. It finds a bug in a partner feed that marks all listings as "closed." Without human approval, it would archive thousands of active listings in one pass. The human reviewer catches the anomaly in the preview, rejects the operation, and the feed gets fixed instead. That's the value of a review step.

Output Validation Is a Contract, Not a Suggestion

LLMs are probabilistic. They will occasionally output malformed JSON, invent fields, or return data that doesn't match your schema. If your agent passes that output directly to your database or API, you're asking for trouble.

I validate every agent output against a strict schema before it touches any production system. If the output doesn't match, the agent retries or the operation fails safely.

import { z } from 'zod';

// Schema for agent output that will be written to the database
const JobUpdateSchema = z.object({
  listingId: z.string().uuid(),
  status: z.enum(['active', 'paused', 'closed', 'draft']),
  salaryRange: z.object({
    min: z.number().positive(),
    max: z.number().positive(),
  }).refine(data => data.max >= data.min, {
    message: 'Max salary must be >= min salary',
  }),
  // Self-reported by the model: forces an explicit confirmation, but the
  // listing ID must still be looked up in the database before any write
  hasValidListingId: z.literal(true),
});

function validateAgentOutput<T>(
  output: unknown,
  schema: z.ZodSchema<T>
): T {
  const result = schema.safeParse(output);

  if (!result.success) {
    // Log the failure for debugging
    logger.error('Agent output validation failed', {
      errors: result.error.flatten(),
      rawOutput: output,
    });

    // Don't proceed. Fail safely.
    throw new AgentValidationError(
      'Agent output did not pass validation',
      result.error
    );
  }

  return result.data;
}

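The hasValidListingId field is a pattern I use frequently. It forces the LLM to explicitly confirm it found a valid reference before proceeding. It is a self-report, though, so it cannot catch a hallucinated listing ID on its own: a model that invents an ID can just as easily set the flag to true. The ID still has to be looked up against the database before the write goes through.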
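A minimal sketch of checking a listing ID against real data, independent of anything the model claims, is shown below. The exists callback is an assumption standing in for a real lookup, for example a SELECT by primary key. assertListingExists is a hypothetical name.

```typescript
// Assumed lookup, e.g. a query that checks the listings table by ID.
type ListingExists = (listingId: string) => Promise<boolean>;

// Throws unless the ID resolves to a real listing; returns the ID on success.
async function assertListingExists(
  listingId: string,
  exists: ListingExists,
): Promise<string> {
  if (!(await exists(listingId))) {
    throw new Error(`Unknown listing ID in agent output: ${listingId}`);
  }
  return listingId;
}
```

Running this after schema validation and before the write means a well-formed but invented UUID is rejected at the same point as malformed output.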

Sandboxed Execution Environments Contain the Damage

Even with read-only connections and output validation, your agent might still need to run code. Maybe it's generating SQL, executing API calls, or running Python scripts for data analysis. Never run that code in your main application process.

I run all agent-generated code in isolated sandboxes. For simple operations, I use Docker containers with no network access and a read-only filesystem. For more complex scenarios, I spin up ephemeral environments that are destroyed after each execution.

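import Docker from 'dockerode'; // assuming the dockerode client

const docker = new Docker();
const SANDBOX_TIMEOUT_MS = 10_000; // assumed limit; tune per workload

// Execute agent-generated code in a sandboxed Docker container
async function executeInSandbox(code: string): Promise<SandboxResult> {
  const container = await docker.createContainer({
    Image: 'node:20-slim',
    Cmd: ['node', '-e', code],
    NetworkDisabled: true,          // No network access
    HostConfig: {
      ReadonlyRootfs: true,         // Read-only filesystem (a HostConfig option)
      Memory: 256 * 1024 * 1024,    // 256MB memory limit
      CpuPeriod: 100000,
      CpuQuota: 50000,              // 0.5 CPU limit
    },
  });

  // Kill runaway code such as an infinite loop
  const timer = setTimeout(() => container.kill().catch(() => {}), SANDBOX_TIMEOUT_MS);
  try {
    await container.start();        // createContainer alone does not run anything
    const output = await container.wait();
    // Fetch each stream separately; the raw buffers may still need demultiplexing
    const stdout = await container.logs({ stdout: true, stderr: false });
    const stderr = await container.logs({ stdout: false, stderr: true });
    return {
      exitCode: output.StatusCode,
      stdout: stdout.toString(),
      stderr: stderr.toString(),
    };
  } finally {
    clearTimeout(timer);
    await container.remove({ force: true }); // clean up even if something threw
  }
}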

Suppose an agent is extracting text from PDFs and structuring it. One run produces an infinite loop. In the sandbox, it times out and returns an error. The main application keeps running. The sandbox absorbed the failure, and the only cost was a few seconds of compute.

Log Everything, Especially Failures

You can't fix what you can't see. Every agent interaction should produce structured logs that capture the input, the output, the decisions made, and any errors encountered. This isn't just for debugging. It's for understanding how your agent behaves in production so you can improve its safety over time.

// Structured agent logging
interface AgentLog {
  agentId: string;
  sessionId: string;
  timestamp: Date;
  input: {
    query: string;
    context: Record<string, unknown>;
    tools: string[];
  };
  output: {
    response: string;
    toolsCalled: Array<{
      toolName: string;
      args: Record<string, unknown>;
      result: unknown;
    }>;
  };
  safety: {
    validationPassed: boolean;
    sandboxUsed: boolean;
    humanApprovalRequired: boolean;
    approvalStatus?: 'approved' | 'rejected';
  };
  errors?: Array<{
    type: string;
    message: string;
    stack?: string;
  }>;
}

I send these logs to Sentry for error tracking and to a dedicated analytics database for pattern analysis. When something goes wrong, I can replay the exact sequence of events that led to the failure. When something goes right, I can extract the patterns that made it work.

Reliability Is a Design Choice

The teams that ship reliable AI agents don't get lucky. They make deliberate architectural decisions that constrain what the agent can do, validate what it produces, and contain the damage when something goes wrong.

If your team is wrestling with agent reliability and shipping slower because of it, that's the kind of thing I help with. Happy to compare notes.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

Production AI Scoring: Processing 10,000+ Job Listings Daily with GPT-4

Abdul Rehman — Sat, 11 Jul 2026 09:01:59 +0000

I spent months building an AI scoring pipeline that processes over 10,000 job listings every day. The first version was a mess. Slow, expensive, and unreliable. The second version worked. Here's what I learned about the architecture decisions that actually matter when you put LLMs in a production data pipeline.

Most tutorials show you how to call an API and get a response. They don't show you what happens when you need to do that 10,000 times a day without burning your budget or losing accuracy. I had to figure that out the hard way.

The Problem With Naive AI Scoring

My first approach was embarrassingly simple. Take a job listing, dump the full text into a GPT prompt, and ask for a relevance score. It worked on three test listings. It fell apart at 500.

Three problems emerged immediately.

First, latency. Each call took 3 to 8 seconds. At 10,000 listings, that's 8 to 22 hours of sequential processing. Even with parallel batching, the wall clock time was unacceptable for a system that needed to surface fresh listings within minutes.
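The arithmetic behind those numbers, as a small sketch. The concurrency of 50 in the last line is an illustrative assumption, and the model ignores rate limits and retries.

```typescript
// Wall-clock hours to run `count` calls that take `secondsPerCall` each,
// with `concurrency` calls in flight at once (idealized: no rate limits).
function wallClockHours(
  count: number,
  secondsPerCall: number,
  concurrency = 1
): number {
  return (count * secondsPerCall) / concurrency / 3600;
}

const sequentialBest = wallClockHours(10_000, 3);  // about 8.3 hours
const sequentialWorst = wallClockHours(10_000, 8); // about 22.2 hours
const withWorkers = wallClockHours(10_000, 8, 50); // about 0.44 hours
```

Even the idealized parallel case lands near half an hour at the slow end, which is still too slow for a "fresh within minutes" target.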

Second, cost. GPT-4 token counts were all over the place. Some job descriptions were 200 words. Others were 2,000. I was paying for the long ones the same way as the short ones, with no control over how much context the model consumed.

Third, inconsistency. The same listing scored differently depending on how I phrased the prompt. Minor wording changes produced wildly different relevance scores. That's fine for a prototype. It's a disaster for a production system where recruiters depend on consistent ranking.

I needed structured output, predictable token usage, and a pipeline that could scale horizontally without breaking the bank.

Why Function Calling Changed the Game

GPT-4 function calling gave me a way to enforce structure on the output. Instead of asking for a score and hoping the model returned a number, I defined a schema that the model had to follow.

const scoringSchema = {
  name: "score_job_listing",
  description: "Score a job listing for relevance to a candidate profile",
  parameters: {
    type: "object",
    properties: {
      relevance_score: {
        type: "number",
        description: "Relevance score from 0 to 100",
        minimum: 0,
        maximum: 100
      },
      match_reasons: {
        type: "array",
        items: { type: "string" },
        description: "Top 3 reasons for this score"
      },
      skill_match_count: {
        type: "integer",
        description: "Number of candidate skills found in the listing"
      },
      seniority_level: {
        type: "string",
        enum: ["junior", "mid", "senior", "lead", "executive"]
      }
    },
    required: ["relevance_score", "match_reasons", "skill_match_count", "seniority_level"]
  }
};

This forced the model to return a predictable JSON object every time. No more parsing free text. No more guessing whether a score was 75 or 0.75. The schema acted as a contract between the pipeline and the LLM.
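A schema in the request is still worth checking on the way back in. Function call arguments arrive as a JSON string, so a defensive parse is cheap. The sketch below is one way to do it by hand; `parseScore` and `ListingScore` are hypothetical names, and a validation library would work just as well.

```typescript
interface ListingScore {
  relevance_score: number;
  match_reasons: string[];
  skill_match_count: number;
  seniority_level: string;
}

const SENIORITY_LEVELS = ["junior", "mid", "senior", "lead", "executive"];

// Parses the JSON string from a function call's `arguments` field.
// Returns null when anything falls outside the contract.
function parseScore(rawArguments: string): ListingScore | null {
  let data: unknown;
  try {
    data = JSON.parse(rawArguments);
  } catch {
    return null; // not valid JSON at all
  }
  if (typeof data !== "object" || data === null) return null;

  const d = data as Record<string, unknown>;
  const score = d.relevance_score;
  const reasons = d.match_reasons;
  const skills = d.skill_match_count;
  const level = d.seniority_level;

  if (typeof score !== "number" || score < 0 || score > 100) return null;
  if (!Array.isArray(reasons) || !reasons.every((r) => typeof r === "string")) return null;
  if (typeof skills !== "number" || !Number.isInteger(skills) || skills < 0) return null;
  if (typeof level !== "string" || !SENIORITY_LEVELS.includes(level)) return null;

  return {
    relevance_score: score,
    match_reasons: reasons as string[],
    skill_match_count: skills,
    seniority_level: level,
  };
}
```

A null result can feed straight into the retry or dead-letter path instead of writing a bad row to the database.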

The real win was downstream. With structured output, I could store scores directly in PostgreSQL, build SQL queries that filtered by score thresholds, and feed the results into a REST API without any transformation layer. The pipeline became predictable.

The Architecture That Survived Production

The final architecture had four stages, each with a specific job.

Stage one was ingestion. Listings arrived from multiple ATS sources via their public APIs. I normalized them into a common schema: title, description, company, location, skills. No LLM calls at this stage. Just data cleaning and deduplication.

Stage two was pre-filtering. Before any AI call, I ran a fast keyword and rule-based filter. Listings that clearly didn't match the candidate's criteria were dropped without ever touching the LLM. This eliminated about 40% of listings and saved a significant chunk of API costs.

Stage three was the scoring pipeline. This ran in batches of 50 listings per worker, with each batch sent to GPT-4 via function calling. I used a simple queue system with retry logic. Failed calls were retried up to three times with exponential backoff. Listings that failed all retries were logged and sent to a dead-letter queue for manual review.
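The retry and dead-letter behavior described above can be sketched roughly like this. The names `withRetries` and `scoreBatch`, the 500 ms base delay, and the injectable `sleep` are assumptions for illustration, not the pipeline's actual code.

```typescript
type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs `task` once, then retries up to `maxRetries` times, doubling the
// delay each time. Returns null after the final failure.
async function withRetries<T>(
  task: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 500,
  sleep: Sleep = realSleep
): Promise<T | null> {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch {
      if (attempt === maxRetries) return null;
      await sleep(baseDelayMs * 2 ** attempt); // 500ms, 1s, 2s
    }
  }
  return null;
}

// Scores every item; anything that exhausts its retries is set aside
// for manual review instead of stopping the batch.
async function scoreBatch<T, R>(
  items: T[],
  score: (item: T) => Promise<R>,
  sleep: Sleep = realSleep
): Promise<{ scored: R[]; deadLetter: T[] }> {
  const scored: R[] = [];
  const deadLetter: T[] = [];
  for (const item of items) {
    const result = await withRetries(() => score(item), 3, 500, sleep);
    if (result === null) deadLetter.push(item);
    else scored.push(result);
  }
  return { scored, deadLetter };
}
```

Passing `sleep` in as a parameter keeps the backoff testable without real waiting.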

Stage four was delivery. Scored listings were written to the database with their relevance score and match reasons. The REST API served them to the frontend and to external integrations. The entire pipeline from ingestion to API availability was under 5 minutes for most listings.

Cost Control: The Real Engineering Challenge

The biggest mistake I see teams make is treating LLM calls like regular API calls. They're not. A single GPT-4 call can cost ten times more than another depending on input length and output structure.

I implemented three cost controls that made the difference between viable and unviable.

First, I used a smaller, cheaper model for pre-filtering. Before sending a listing to GPT-4 for scoring, I ran a lightweight GPT-4o-mini call that classified listings into three buckets: likely match, possible match, and clear miss. Only the first two buckets went to GPT-4. This cut GPT-4 usage by about 60%.

Second, I optimized prompt length aggressively. Every word in the system prompt cost money on every call. I trimmed candidate profiles to the essential signals: skills, experience level, preferred locations. No biography, no career narrative, no fluff.

Third, I cached aggressively. Listings that were scored once and hadn't changed were never rescored. The same candidate profile scored the same listing only once. This sounds obvious, but I've seen production systems that rescored everything on every pipeline run because nobody added a cache check.
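A cache check like that can be small. Here is a minimal in-memory sketch, assuming the key is the candidate profile ID, the listing ID, and a hash of the listing text. `ScoreCache` and `contentHash` are hypothetical names, and a production version would sit on Redis or a database table rather than a Map.

```typescript
// FNV-1a style hash over UTF-16 code units: non-cryptographic, but enough
// to notice that a listing's text changed.
function contentHash(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

class ScoreCache {
  private readonly scores = new Map<string, number>();

  // An edited listing produces a new hash, so it misses the cache and gets rescored.
  private key(profileId: string, listingId: string, listingText: string): string {
    return `${profileId}:${listingId}:${contentHash(listingText)}`;
  }

  get(profileId: string, listingId: string, listingText: string): number | undefined {
    return this.scores.get(this.key(profileId, listingId, listingText));
  }

  set(profileId: string, listingId: string, listingText: string, score: number): void {
    this.scores.set(this.key(profileId, listingId, listingText), score);
  }
}
```

The pipeline would call `get` before every LLM request and only score on a miss.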

Total cost per listing landed well under one cent. At 10,000 listings daily, that's a manageable operational expense. Without these controls, it would have been five to ten times higher.

What I'd Do Differently

If I were building this today, I'd make three changes.

I'd use a batch processing strategy from day one instead of the streaming approach I started with. The OpenAI Batch API offers 50% cost reduction for asynchronous workloads. My pipeline was mostly batch-friendly, but I designed it for real-time processing first and had to retrofit.

I'd invest more in the pre-filtering stage. The keyword and rule-based filter was effective, but a lightweight embedding similarity search would have caught more false positives before they reached the LLM. I added this later, but it should have been in the initial design.

I'd build better observability into the scoring pipeline. I had basic logging, but I didn't track score drift over time. I only noticed that scores were trending downward after a prompt change when a client complained. A simple dashboard tracking average scores per day would have caught this immediately.
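The drift check does not need a dashboard to start. A rough sketch, assuming scores are stored with an ISO date string per day; the 5-point tolerance and both function names are illustrative.

```typescript
interface DailyAverage {
  day: string;
  averageScore: number;
}

// Groups scores by day and returns the per-day average, oldest day first.
function dailyAverages(
  scores: Array<{ day: string; score: number }>
): DailyAverage[] {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const { day, score } of scores) {
    const t = totals.get(day) ?? { sum: 0, count: 0 };
    t.sum += score;
    t.count += 1;
    totals.set(day, t);
  }
  return Array.from(totals.entries())
    .sort(([a], [b]) => a.localeCompare(b))
    .map(([day, t]) => ({ day, averageScore: t.sum / t.count }));
}

// Flags the latest day when its average moved more than `tolerance`
// points away from the mean of all earlier days.
function hasDrifted(averages: DailyAverage[], tolerance = 5): boolean {
  if (averages.length < 2) return false;
  const latest = averages[averages.length - 1].averageScore;
  const prior = averages.slice(0, -1);
  const baseline =
    prior.reduce((sum, a) => sum + a.averageScore, 0) / prior.length;
  return Math.abs(latest - baseline) > tolerance;
}
```

Running this after each pipeline run and alerting on true would surface a prompt regression within a day.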

Production AI pipelines have more in common with ETL systems than with chat applications. The same engineering practices that make data pipelines reliable apply here: idempotency, retry logic, dead-letter queues, monitoring, and cost tracking. The LLM is just one component in a larger system.

If your team is building this kind of pipeline and wondering how to keep accuracy high while managing costs at scale, that's the kind of thing I help with. Happy to compare notes on what's worked and what hasn't.


Building a Production-Grade AI Pipeline: Scoring 10,000+ Listings Daily with LLMs

Abdul Rehman — Fri, 10 Jul 2026 09:01:35 +0000

I learned the hard way that a working LLM pipeline and a production LLM pipeline are two different things.

When I first built the scoring system for a job board platform, I thought: throw GPT-4 at each listing, ask it to rate relevance, done. It worked for 100 listings. It worked for 1,000. Then I scaled to 10,000 listings a day and my OpenAI bill hit a number that made the client's eyes go wide.

That's when I stopped treating the LLM like a magic black box and started treating it like a production service with cost, latency, and failure modes.

Here's what actually matters when you're building an LLM pipeline at volume.

Function Calling Over Fine-Tuning

The first decision was how to get structured output from the LLM. Every job listing needed a relevance score, a category, and a confidence level. I needed JSON, not prose.

Some teams reach for fine-tuning when they want consistent formatting. I don't. Fine-tuning locks you into a specific model version, requires ongoing retraining as your data shifts, and gives you less control over output structure than function calling does.

OpenAI's function calling (now tool calling) lets you define a JSON schema and the model fills it in. I defined a schema with fields for relevance_score (0-100), category (enum of job types), and confidence (high/medium/low). The model returns valid JSON every time because it's trained to do exactly that.

const scoringFunction = {
  name: "score_listing",
  description: "Score a job listing for relevance to the candidate profile",
  parameters: {
    type: "object",
    properties: {
      relevance_score: {
        type: "number",
        description: "Relevance score from 0 to 100",
        minimum: 0,
        maximum: 100
      },
      category: {
        type: "string",
        enum: ["engineering", "design", "marketing", "sales", "operations", "other"]
      },
      confidence: {
        type: "string",
        enum: ["high", "medium", "low"]
      }
    },
    required: ["relevance_score", "category", "confidence"]
  }
};

That schema gave me parseable output without regex hacks or post-processing. Every response fits the shape I defined, and if the model deviates, I catch it at the API level.

Batch Processing Saves More Than Money

The obvious win with OpenAI's Batch API is cost: 50% cheaper than real-time endpoints. But the real win is operational.

When you process 10,000 listings a day, you don't need results in 2 seconds. You need results in 2 hours. Batch mode lets you submit all 10,000 jobs at once, walk away, and come back to a finished file. Batch requests draw on their own quota, separate from the real-time rate limits, so there are no connection retries and no backoff logic to write.

I built a pipeline that collects listings throughout the day, submits a batch at midnight, and processes results by morning. The per-listing cost dropped from about $0.003 with GPT-4o mini real-time to $0.0015 with batch. At 10,000 listings a day, that's $15 saved daily. Over a 30-day month, about $450.

The Batch API also keeps failure handling simple. Requests that fail are written to a separate error file instead of interrupting the run, so I just poll for completion, then check the output file and the error file.

async function submitBatch(listings: Listing[]): Promise<string> {
  const batchItems = listings.map((listing, index) => ({
    // Results are matched back by custom_id, not by position in the output file
    custom_id: `listing-${index}`,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model: "gpt-4o-mini",
      messages: [
        { role: "system", content: "Score the following job listing..." },
        { role: "user", content: JSON.stringify(listing) }
      ],
      // Each tool entry wraps the function definition with its type
      tools: [{ type: "function", function: scoringFunction }],
      tool_choice: { type: "function", function: { name: "score_listing" } }
    }
  }));

  const response = await openai.batches.create({
    input_file_id: await uploadBatchFile(batchItems),
    endpoint: "/v1/chat/completions",
    completion_window: "24h"
  });

  return response.id;
}

The batch file upload step is the only extra complexity. After that, it's fire and forget.
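Reading the results is the mirror image of submitting them. The sketch below parses a downloaded output file, assuming each JSONL line carries a `custom_id`, a `response` with `status_code` and `body`, and an `error` field. `parseBatchOutput` is a hypothetical helper name, and downloading the file itself is left out.

```typescript
interface BatchOutputLine {
  custom_id: string;
  response: {
    status_code: number;
    body: {
      choices: Array<{
        message: {
          tool_calls?: Array<{ function: { arguments: string } }>;
        };
      }>;
    };
  } | null;
  error: { message: string } | null;
}

// Maps each custom_id to its parsed tool-call arguments and collects
// the IDs of requests that failed or returned unusable output.
function parseBatchOutput(
  jsonl: string
): { scores: Map<string, unknown>; failed: string[] } {
  const scores = new Map<string, unknown>();
  const failed: string[] = [];

  for (const line of jsonl.split("\n")) {
    if (line.trim() === "") continue;
    const row = JSON.parse(line) as BatchOutputLine;
    const args =
      row.response?.body.choices[0]?.message.tool_calls?.[0]?.function.arguments;

    if (row.error || row.response?.status_code !== 200 || args === undefined) {
      failed.push(row.custom_id);
      continue;
    }
    try {
      scores.set(row.custom_id, JSON.parse(args));
    } catch {
      failed.push(row.custom_id); // arguments were not valid JSON
    }
  }
  return { scores, failed };
}
```

Anything in `failed` can go to the same review queue as other scoring failures.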

Error Handling That Doesn't Burn Your Budget

LLMs fail in unpredictable ways. Timeouts, malformed JSON, content filters, model overloads. If you handle each failure with a retry to the same endpoint, you burn money and time.

I built a three-tier error handling system:

Transient failures (rate limits, 500s): retry with exponential backoff, max 3 attempts.
Content filter hits: skip the listing, log the reason, move on. Don't retry.
Malformed output: fall back to a simpler model (GPT-4o mini) with a stricter prompt. If that also fails, flag the listing for manual review.

The key insight: not every listing needs a perfect score. Missing 1% of listings is acceptable if it keeps the pipeline running. Trying to force 100% coverage will cost you in latency and dollars.

I also added a circuit breaker. If error rates exceed 10% in a 5-minute window, the pipeline pauses and alerts me. That's saved me from runaway costs more than once.
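The three tiers and the circuit breaker fit in a few dozen lines. This is a sketch under assumptions: the failure kinds, `classifyFailure`, and `CircuitBreaker` are hypothetical names, and a real breaker would probably also require a minimum number of calls before tripping.

```typescript
type FailureKind =
  | "rate_limit"
  | "server_error"
  | "content_filter"
  | "malformed_output";
type FailureAction = "retry" | "skip" | "fallback";

// Maps each failure tier to the action described above.
function classifyFailure(kind: FailureKind): FailureAction {
  switch (kind) {
    case "rate_limit":
    case "server_error":
      return "retry";
    case "content_filter":
      return "skip";
    case "malformed_output":
      return "fallback";
  }
}

class CircuitBreaker {
  private events: Array<{ at: number; failed: boolean }> = [];
  private readonly maxErrorRate: number;
  private readonly windowMs: number;

  constructor(maxErrorRate = 0.1, windowMs = 5 * 60 * 1000) {
    this.maxErrorRate = maxErrorRate;
    this.windowMs = windowMs;
  }

  // Records one call outcome and drops events older than the window.
  record(failed: boolean, now: number = Date.now()): void {
    this.events.push({ at: now, failed });
    this.events = this.events.filter((e) => now - e.at <= this.windowMs);
  }

  // Open means: stop sending work and alert someone.
  isOpen(): boolean {
    if (this.events.length === 0) return false;
    const failures = this.events.filter((e) => e.failed).length;
    return failures / this.events.length > this.maxErrorRate;
  }
}
```

The pipeline would call `record` after every LLM request and check `isOpen` before pulling the next batch from the queue.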

Observability: You Can't Optimize What You Can't See

When I started, I logged nothing. Then a batch of 5,000 listings returned zero results and I had no idea why. Turns out a schema change broke the function definition and every response was empty.

Now every step produces structured logs: submission time, batch ID, completion time, token usage, cost, error type. I ship these to a simple logging service and query them to find bottlenecks.

The metric that matters most is cost per useful listing. Not just cost per API call. Some listings get scored but the score is low confidence and needs human review. Those cost money without producing value. I track the ratio of high-confidence scores to total cost. That tells me if my prompt or model choice is efficient.
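As a rough sketch, the metric is total spend divided by the number of results you can actually use. Treating only high-confidence scores as useful is an assumption here, as are the type and function names.

```typescript
interface ScoredListing {
  costUsd: number;
  confidence: "high" | "medium" | "low";
}

// Total spend divided by the number of high-confidence results.
// Lower-confidence scores still cost money, so they push the figure up.
function costPerUsefulListing(listings: ScoredListing[]): number {
  const totalCost = listings.reduce((sum, l) => sum + l.costUsd, 0);
  const useful = listings.filter((l) => l.confidence === "high").length;
  return useful === 0 ? Infinity : totalCost / useful;
}

const sampleRun: ScoredListing[] = [
  { costUsd: 0.002, confidence: "high" },
  { costUsd: 0.002, confidence: "high" },
  { costUsd: 0.002, confidence: "low" },
  { costUsd: 0.002, confidence: "medium" },
];
const usefulCost = costPerUsefulListing(sampleRun); // about $0.004, double the raw per-call cost
```

If this number rises after a prompt or model change while cost per call stays flat, the change made the output less usable.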

I also monitor latency distribution. Batch mode has a wide tail. Most jobs complete in 30 minutes, but some take 6 hours. If a batch hasn't completed after 12 hours, I investigate. Usually it's a large file that got queued behind higher-priority customers.

When Cheap Models Are Good Enough

The original pipeline used GPT-4 for scoring. It worked, but it was expensive. I switched to GPT-4o mini and saw no meaningful drop in scoring accuracy for the use case. The categories are broad, the relevance scores are relative. A 2% error rate doesn't matter when you're ranking 10,000 listings.

I'm now evaluating DeepSeek V4 Flash as an even cheaper alternative. Early tests show similar quality at roughly 23x lower cost. If it holds up in production, that's $15/day becoming $0.65/day.

The lesson: test the cheapest model first. Upgrade only when you have evidence that quality suffers. Most teams do the reverse.

If your team is wrestling with LLM integration at scale and shipping slower because of it, that's the kind of thing I help with. I've built pipelines that process tens of thousands of items daily without breaking the bank or the API. Happy to compare notes.


Your AI Agent Keeps Failing After 50 Clean Demos, Stop Treating the LLM Like a Black Box

Abdul Rehman — Thu, 09 Jul 2026 09:01:48 +0000

I watched a resume tailoring agent run 50 perfect demos. Every test resume produced clean, accurate output. On the first real user submission, the model fabricated a degree, invented a job title, and added a company that didn't exist. The demo never caught it because the demo data was curated. Production data is never curated.

The problem wasn't the model. It was the contract.

Most teams treat the LLM as a black box. Feed it text, get text back, move on. When it fails in production, they blame the model. But the model is doing exactly what you asked. The fault is in the system prompt you never read, the schema you never validated, and the fallback you never built.

Here's what I've learned building production AI pipelines that don't fall apart when real users show up.

The demo trap

Every LLM demo looks good because it runs on clean inputs. The resume tailor I built exposed this immediately. In testing, the model correctly extracted work history, education, and skills from every sample. Then a real upload arrived with a nonstandard format, and the model filled in the gaps by inventing data.

Why? Because the schema said "education" was required. The LLM chose to satisfy the schema rather than return null. It wasn't malicious. It was following instructions.

This is the first thing you need to understand about production LLMs: they will satisfy the schema you give them, even if it means lying. The fix isn't to prompt harder. It's to design a schema that lets the model say "I don't know" without breaking the pipeline.

When I built a job scoring pipeline processing 10,000 listings daily, every extraction function included a confidence field and a boolean flag for insufficient data. The model was allowed to skip a record. That single change eliminated fabricated data from the pipeline.

Read the system prompt (the one you didn't write)

Every model provider ships a default system prompt. If you haven't read yours, you're guessing.

The first thing I do with any new LLM integration is audit the default system prompt. I found one that included "be helpful and thorough." That sounds harmless until you realize "thorough" means the model will guess when it doesn't have enough information. In a production pipeline, guessing is worse than failing.

The fix was explicit. I added a confidence threshold to every function call. If the model couldn't hit 0.8 confidence, the record was skipped instead of getting fabricated data.

const extractionConfig = {
  functions: [{
    name: "extract_job_details",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string" },
        company: { type: "string" },
        confidence: { type: "number", minimum: 0, maximum: 1 },
        has_insufficient_data: { type: "boolean" }
      },
      required: ["confidence", "has_insufficient_data"]
    }
  }]
}

The model must be allowed to decline. If there's no path to say "I can't do this," it will fabricate. Every production system I build now includes this pattern.
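On the receiving side, the skip decision is a small gate. A minimal sketch, assuming the parsed function call output has the fields from the schema above; `acceptExtraction` is a hypothetical name and 0.8 is the threshold mentioned earlier.

```typescript
interface Extraction {
  title?: string;
  company?: string;
  confidence: number;
  has_insufficient_data: boolean;
}

// Returns the extraction only when the model both had enough data and
// cleared the confidence threshold. Otherwise the record is skipped.
function acceptExtraction(
  extraction: Extraction,
  minConfidence = 0.8
): Extraction | null {
  if (extraction.has_insufficient_data) return null;
  if (extraction.confidence < minConfidence) return null;
  return extraction;
}

const kept = acceptExtraction({
  title: "Backend Engineer",
  company: "Acme Corp",
  confidence: 0.93,
  has_insufficient_data: false,
});
const skipped = acceptExtraction({
  confidence: 0.4,
  has_insufficient_data: true,
});
```

A null result is a normal outcome here, not an error, so it should be counted and logged rather than retried.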

Validate the output, not the intent

Most teams validate the input to an LLM call. Almost none validate the output. That's backward.

The LLM's output is the only thing that touches your downstream system. If it's malformed, the entire pipeline breaks. I built a resume tailoring system where the output was validated against a strict JSON schema before it ever reached the document generator.

The key pattern is conditional presence flags. Every optional field has a corresponding boolean guard. The model must set the guard to true before it can fill in the field. If the guard is false, the field is omitted entirely. No fabricated data can leak through.

const resumeOutputSchema = {
  type: "object",
  properties: {
    has_work_experience: { type: "boolean" },
    work_experience: {
      type: "array",
      items: { /* company, role, duration */ }
    },
    has_education: { type: "boolean" },
    education: {
      type: "array",
      items: { /* institution, degree, year */ }
    }
  },
  required: ["has_work_experience", "has_education"],
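  allOf: [
    {
      if: { properties: { has_work_experience: { const: true } } },
      then: { required: ["work_experience"] },
      // Guard is false: the field must be absent
      else: { not: { required: ["work_experience"] } }
    },
    {
      if: { properties: { has_education: { const: true } } },
      then: { required: ["education"] },
      // Guard is false: the field must be absent
      else: { not: { required: ["education"] } }
    }
  ]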
}

This eliminated hallucinations in the resume pipeline. The model could set a guard to false, and the system would simply skip that section. No invented data, no downstream corruption.

Build for failure, not success

A production AI agent needs a failure chain, not a success path.

The most common mistake I see is a single retry loop. If the LLM fails, try again. Maybe three times. That's not a fallback chain. That's hope.

Real fallback chains look like this:

Try the primary model with the full prompt and strict schema
If confidence is low, try a simpler prompt with fewer constraints
If the simpler prompt fails, fall back to a deterministic rule-based extractor
If nothing works, surface the error to a human with enough context to fix it
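The chain above could be sketched as an ordered list of steps, each of which either succeeds or explains why it did not. All names here (`Attempt`, `ChainStep`, `runFallbackChain`) are hypothetical, and each step's `run` function stands in for a model call or a rule-based extractor.

```typescript
type Attempt<T> = { ok: true; value: T } | { ok: false; reason: string };

interface ChainStep<T> {
  name: string;
  run: (doc: string) => Promise<Attempt<T>>;
}

interface ChainResult<T> {
  handledBy: string; // step name, or "human_review" when every step failed
  value: T | null;
  reasons: string[]; // why earlier steps were rejected
}

// Tries each step in order and stops at the first success. If every step
// fails, the collected reasons give a human reviewer context.
async function runFallbackChain<T>(
  doc: string,
  steps: ChainStep<T>[]
): Promise<ChainResult<T>> {
  const reasons: string[] = [];
  for (const step of steps) {
    try {
      const attempt = await step.run(doc);
      if (attempt.ok) {
        return { handledBy: step.name, value: attempt.value, reasons };
      }
      reasons.push(`${step.name}: ${attempt.reason}`);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      reasons.push(`${step.name}: ${message}`);
    }
  }
  return { handledBy: "human_review", value: null, reasons };
}
```

Logging `handledBy` for every document is also what produces a breakdown of how much traffic each tier absorbs.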

I built this into a document analysis pipeline for legal documents. The primary model (GPT-4o) attempted full extraction. If confidence dropped below threshold, the system fell back to a simpler model with reduced scope. If that also failed, it flagged the document for manual review.

The breakdown: 87% of documents handled by the primary model. 11% by the fallback. 2% needed human review. That 2% was the difference between a reliable system and a constant fire drill.

The loop that doesn't break

Agent loops are where most production systems die. The agent gets stuck in a cycle, burning tokens and producing nothing useful.

The fix is a dead man's switch. Every loop iteration must increment a counter. If the counter exceeds a threshold, the agent must stop and return whatever it has, even if incomplete. This prevents the silent budget drain that kills so many production AI systems.

I apply this to every agent loop I build now. The agent can iterate up to N times. After that, it must produce a partial result and explain what it couldn't complete. Bounded loops, always.

async function runAgent(task, maxIterations = 5) {
  let result = null;
  for (let i = 0; i < maxIterations; i++) {
    result = await agentStep(task, result);
    if (result.status === "complete") return result;
    if (result.status === "stuck") break;
  }
  // Stuck, or the iteration budget ran out: return what exists, marked as partial
  return { ...(result ?? { status: "stuck" }), partial: true };
}

The loop is bounded. The agent must produce something. It cannot spin forever.

If your team is wrestling with AI agents that work in demos but fail when real users show up, that's the kind of thing I help with. I build production AI pipelines that don't fall apart, and I'm happy to compare notes on what's breaking in your system.


How to Build a Reliable LLM Pipeline for Your AI MVP Without Over-Engineering

Abdul Rehman — Wed, 08 Jul 2026 09:03:01 +0000

I once built an AI pipeline that was shut down after a single month. The LLM costs were unsustainable, and worse, the outputs were unreliable enough that we couldn't trust them in production. That failure taught me something I still use today: evaluation isn't a phase you add later. It's the scaffolding that makes the whole thing work.

If you're building an AI-powered MVP, you're probably torn between shipping fast and shipping something that doesn't embarrass you. You don't have time for a full observability stack or a human-in-the-loop review team. You need lightweight, automated checks that catch the worst failures without slowing you down. Here's what I've learned from shipping production LLM pipelines at scale.

Start With Structured Outputs, Not Free Text

The single highest-leverage thing you can do for reliability is to force the LLM into a structured schema. Free-text responses are a debugging nightmare. A JSON schema with typed fields and validation is something you can unit test.

On a production job board platform I built, every job listing goes through an LLM scoring pipeline that extracts structured data from raw ATS descriptions. The prompt uses function calling with a strict JSON schema. Here's a simplified version of the pattern:

import { z } from 'zod';

const JobListingSchema = z.object({
  title: z.string().min(1),
  company: z.string().min(1),
  location: z.string().optional(),
  salaryRange: z.object({
    min: z.number().positive(),
    max: z.number().positive(),
    currency: z.string().length(3),
  }).optional(),
  requirements: z.array(z.string()),
  hasRequirements: z.boolean(),
}).refine(
  // Guard: requirements must be non-empty exactly when hasRequirements is true
  (data) => data.hasRequirements === (data.requirements.length > 0),
  { message: 'hasRequirements must agree with the requirements array' }
);

function validateLLMOutput(raw: unknown) {
  const result = JobListingSchema.safeParse(raw);
  if (!result.success) {
    // Log the failure, retry with a simpler prompt, or fall back
    throw new Error(`LLM output failed validation: ${result.error}`);
  }
  return result.data;
}

The key detail is the hasRequirements guard. If the LLM fabricates requirements for a listing that has none, the guard catches it because the schema enforces consistency. This pattern came from an AI resume tailor I built where fabricated experience data was the biggest risk. Conditional presence flags let the LLM honestly say "this field doesn't exist" instead of hallucinating a default.

Add this on day one. It takes ten minutes to write a Zod schema and costs nothing at runtime. It will save you hours of debugging later.

Unit Test Your Prompts Like Production Code

Most teams treat prompts as art, not engineering. They tweak them in a playground, copy-paste the final version, and hope for the best. That's how regressions sneak in.

I write unit tests for every prompt that goes into production. The test calls the LLM with a known input and asserts that the output matches the schema and contains expected values. It's not testing the LLM's reasoning, it's testing that my prompt, model, and schema still work together after a change.

// jest or vitest
import { describe, it, expect } from 'vitest';

describe('job listing extraction prompt', () => {
  it('extracts title and company from a raw description', async () => {
    const input = 'We are hiring a Senior Software Engineer at Acme Corp.';
    const output = await runExtractionPrompt(input);
    const parsed = JobListingSchema.parse(output);
    expect(parsed.title).toContain('Software Engineer');
    expect(parsed.company).toBe('Acme Corp');
    expect(parsed.hasRequirements).toBe(false);
  });
});

These tests are cheap. They cost a few cents per run. Run them in CI on every pull request. If a prompt change breaks the schema, you catch it before it reaches production. I've caught dozens of regressions this way, from model upgrades that changed output formatting to accidental whitespace in a system prompt that broke JSON parsing.

Build a Two-Tier Evaluation for Cost and Speed

Full LLM evaluation is expensive. Running every output through GPT-4 for quality scoring will blow your MVP budget. But you can't ship blind either.

The trick is a two-tier system. Use a cheap, fast model to run lightweight checks on every output. Only escalate to an expensive model when the cheap one flags a problem.

On the job board platform, every scored listing goes through a lightweight consistency check using a smaller model. It verifies that the extracted fields are internally consistent (e.g., salary min <= max, location string matches expected format). If the check passes, the listing is published. If it fails, the listing is queued for a more expensive review pass.

async function evaluateListing(listing: JobListing): Promise<'pass' | 'fail'> {
  const cheapCheck = await runConsistencyCheck(listing, 'gpt-4o-mini');
  if (cheapCheck.passed) return 'pass';

  // Escalate to a more thorough review
  const expensiveReview = await runDeepReview(listing, 'gpt-4o');
  return expensiveReview.verdict;
}

This pattern cut our evaluation costs by roughly 80% compared to running every listing through the expensive model. The false negative rate was low enough that we never missed a real hallucination. And the latency stayed under a second for most listings because the cheap model handled the volume.

Catch Hallucinations With Simple Heuristics Before You Reach for AI

You don't need an AI guardrail service to detect hallucinations. Start with simple rules that catch the most common failure modes.

Presence guards: If the source data doesn't contain a phone number, the LLM should not output one. Enforce this with a schema-level boolean flag.
Length bounds: If the source description is 200 words, the LLM shouldn't generate a 2000-word summary. Cap output length in the prompt and validate after.
Known entity lists: If you're extracting company names, check against a known list. If the LLM outputs a company that doesn't exist, reject it.

These heuristics won't catch every hallucination. But they catch the dangerous ones: the fabricated phone number, the fake requirement, the inflated salary. And they cost nothing to run.
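The three rules can be sketched as one function that returns a list of flags. The field names, the flag strings, and the default length ratio are assumptions for illustration; the known-company list would come from your own data.

```typescript
interface ExtractedFields {
  phone?: string;
  summary: string;
  company: string;
}

const digitsOnly = (s: string): string => s.replace(/\D/g, "");
const wordCount = (s: string): number =>
  s.trim().split(/\s+/).filter(Boolean).length;

// Returns one flag per heuristic that the output violates.
function findHallucinationFlags(
  source: string,
  output: ExtractedFields,
  knownCompanies: ReadonlySet<string>, // lowercase names
  maxLengthRatio = 1
): string[] {
  const flags: string[] = [];

  // Presence guard: a phone number must literally appear in the source.
  if (output.phone !== undefined) {
    const digits = digitsOnly(output.phone);
    if (digits.length === 0 || !digitsOnly(source).includes(digits)) {
      flags.push("phone_not_in_source");
    }
  }

  // Length bound: the summary should not outgrow the text it summarizes.
  if (wordCount(output.summary) > wordCount(source) * maxLengthRatio) {
    flags.push("summary_too_long");
  }

  // Known entity list: reject companies we have never seen.
  if (!knownCompanies.has(output.company.toLowerCase())) {
    flags.push("unknown_company");
  }

  return flags;
}
```

An empty array means the output passed all three checks; anything else can be routed to review with the flags attached.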

On the job board platform, we added a simple check that compares the extracted location against a geocoding API. If the location string doesn't resolve to a real place, the listing is flagged. It caught a surprising number of hallucinated cities.

Ship With Confidence, Then Iterate

The goal isn't perfect evaluation on day one. The goal is a baseline that prevents catastrophic failures while you learn what real users do with your AI feature.

Start with structured outputs and unit tests. Add a cheap consistency check. Use simple heuristics for the most common hallucinations. That's enough for an MVP. Once you have real traffic and real feedback, you can layer on more sophisticated evaluation: human review loops, A/B testing of prompts, cost optimization.

I learned this the hard way when my first AI pipeline died from cost and trust issues. The second one survived because I built the evaluation scaffolding first. The third one scaled.

If your team is wrestling with LLM reliability and shipping slower because of it, that's the kind of thing I help with. Happy to compare notes on what's worked in production.


Your AI Agent Is Only as Reliable as Your Observability Layer

Abdul Rehman — Tue, 07 Jul 2026 09:03:09 +0000

I learned this the hard way. I had a pipeline processing 10,000 job listings daily, scoring each one with GPT-4 function calling, and it was a black box. When something broke, I had no idea what broke, why it broke, or how much it cost while it was breaking.

That's not a production system. That's a prototype running on production infrastructure.

The difference between a demo and a reliable AI agent isn't better prompts or a fancier model. It's observability. Logging, cost tracking, error monitoring, and latency metrics turned that fragile pipeline into something I could trust. Here's exactly what I did and why.

The Problem: An LLM Pipeline With No Dashboard

The system scored 10,000+ job listings daily using GPT-4 function calling. Each listing went through a multi-step pipeline: fetch from source, normalize the data, call the LLM for scoring, parse the structured output, and store the result.

For weeks it ran fine. Then one day it didn't.

A model update changed the output format silently. The function call returned valid JSON but with a different key structure. The pipeline kept running, kept returning 200s, and kept storing null scores. I didn't notice for three days. Three days of bad data, wasted API costs, and a backlog of unscored listings.

That's when I stopped treating observability as optional.

What to Log: The Minimum Viable Trace

You don't need Datadog or a full OpenTelemetry setup to start. You need three things: a trace ID per request, structured logs for every LLM call, and a way to query them.

Here's the pattern I settled on after trying a few approaches:

interface LLMTrace {
  traceId: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  success: boolean;
  error?: string;
  inputPreview: string; // first 200 chars
  outputPreview: string; // first 200 chars
  timestamp: Date;
}

Every LLM call gets one of these. The trace ID ties it back to the original job listing. The token counts let me track cost per job. The success flag catches silent failures where the API returned 200 but the output was garbage.

I log this to a dedicated llm_traces collection. It's separate from application logs because LLM calls have their own failure modes and cost implications. Mixing them with general app logs makes both harder to query.
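As a rough sketch of how one of these records might be assembled, the helper below turns the facts known after a single call into an `LLMTrace`. The `CallResult` shape, `buildTrace`, and the `schemaValid` flag are assumptions made for illustration. The point it shows is that `success` should mean "usable output", not "the API returned 200".

```typescript
interface LLMTrace {
  traceId: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  success: boolean;
  error?: string;
  inputPreview: string;
  outputPreview: string;
  timestamp: Date;
}

// Hypothetical summary of what the caller knows once one LLM call has finished.
interface CallResult {
  model: string;
  input: string;
  output: string;
  promptTokens: number;
  completionTokens: number;
  startedAtMs: number;
  finishedAtMs: number;
  schemaValid: boolean; // result of whatever output validation the pipeline runs
  error?: string;
}

const PREVIEW_CHARS = 200;

function buildTrace(traceId: string, call: CallResult): LLMTrace {
  const trace: LLMTrace = {
    traceId,
    model: call.model,
    promptTokens: call.promptTokens,
    completionTokens: call.completionTokens,
    latencyMs: call.finishedAtMs - call.startedAtMs,
    // A call only counts as successful if it did not error AND its output validated.
    success: call.error === undefined && call.schemaValid,
    inputPreview: call.input.slice(0, PREVIEW_CHARS),
    outputPreview: call.output.slice(0, PREVIEW_CHARS),
    timestamp: new Date(call.finishedAtMs),
  };
  if (call.error !== undefined) {
    trace.error = call.error;
  }
  return trace;
}
```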

What I Learned From the First Week of Logging

The first week of structured logging told me things I didn't want to hear.

First, latency was all over the map. Some calls completed in 400ms. Others took 12 seconds for the same model and prompt structure. No pattern I could see without the trace data. Once I had it, I found the culprit: prompt length variance. Listings with long descriptions pushed token counts past a threshold where the model's response time doubled. I added a prompt truncation step and latency flattened.

Second, cost. I had no idea what a single scoring call cost until I logged token counts per trace. The average was $0.003 per listing. At 10,000 daily listings, that's $30 a day. Not a problem. But the tail was ugly. Some listings with huge descriptions cost 10x the average. I added a max-token cap and saved about 18% on monthly API spend.

Third, silent failures. The most dangerous kind. The LLM returned valid JSON with a score of 0 for every listing in a batch. No error. No exception. Just useless data. The success flag in my trace caught it because the output didn't match the expected schema. Without that check, I'd have shipped bad scores for days.
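The cost figures above fall out of simple arithmetic on the logged token counts. The sketch below shows one way to compute per-trace cost and surface the expensive tail. The prices are placeholders, not real provider rates, and `traceCostUsd` and `findCostOutliers` are hypothetical helper names.

```typescript
// Assumed prices in USD per one million tokens. Substitute your provider's current rates.
interface ModelPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Cost of one call, derived from the token counts stored on its trace.
function traceCostUsd(promptTokens: number, completionTokens: number, price: ModelPrice): number {
  return (
    (promptTokens * price.inputPerMillion + completionTokens * price.outputPerMillion) / 1_000_000
  );
}

// Returns the costs that exceed `multiple` times the average cost of the sample.
function findCostOutliers(costs: number[], multiple: number = 10): number[] {
  if (costs.length === 0) return [];
  const average = costs.reduce((sum, cost) => sum + cost, 0) / costs.length;
  return costs.filter((cost) => cost > average * multiple);
}

// Example with placeholder prices: 1,000 prompt tokens and 100 completion tokens.
const examplePrice: ModelPrice = { inputPerMillion: 2, outputPerMillion: 10 };
const perListing = traceCostUsd(1000, 100, examplePrice); // 0.003 USD under these assumed prices
const perDay = perListing * 10_000; // roughly 30 USD at 10,000 listings a day
```

Because the average includes the outliers themselves, a lower `multiple` or a median-based baseline may be a better fit when the tail is heavy.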

How to Structure Agent Traces

A single LLM call is easy to trace. An agent that makes multiple calls, retries, and conditional branches is harder. You need a span model.

Each trace becomes a parent with child spans. The parent is the overall job. Each child span is one LLM call, one tool invocation, or one decision point. They share the trace ID and carry their own timing and status.

interface AgentSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  spanType: 'llm_call' | 'tool_use' | 'decision' | 'retry';
  model?: string;
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
  status: 'success' | 'error' | 'timeout';
  errorMessage?: string;
  timestamp: Date;
}

This structure lets me answer questions like: which step in the pipeline is slowest? Which model calls fail most often? Are retries actually succeeding or just burning money?
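Questions like "which step is slowest" reduce to a group-by over the stored spans. Here is a minimal in-memory sketch. In practice this would more likely be a database aggregation, and `summarizeBySpanType` and `SpanStats` are names invented for the example.

```typescript
interface AgentSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  spanType: 'llm_call' | 'tool_use' | 'decision' | 'retry';
  model?: string;
  inputTokens: number;
  outputTokens: number;
  durationMs: number;
  status: 'success' | 'error' | 'timeout';
  errorMessage?: string;
  timestamp: Date;
}

interface SpanStats {
  count: number;
  avgDurationMs: number;
  errorRate: number; // share of spans whose status is not 'success'
}

// Groups spans by type and reports how slow and how failure-prone each step is.
function summarizeBySpanType(spans: AgentSpan[]): Record<string, SpanStats> {
  const totals: Record<string, { count: number; durationMs: number; failures: number }> = {};
  for (const span of spans) {
    const entry = totals[span.spanType] ?? { count: 0, durationMs: 0, failures: 0 };
    entry.count += 1;
    entry.durationMs += span.durationMs;
    if (span.status !== 'success') entry.failures += 1;
    totals[span.spanType] = entry;
  }

  const stats: Record<string, SpanStats> = {};
  for (const spanType of Object.keys(totals)) {
    const entry = totals[spanType];
    stats[spanType] = {
      count: entry.count,
      avgDurationMs: entry.durationMs / entry.count,
      errorRate: entry.failures / entry.count,
    };
  }
  return stats;
}
```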

I found that retries on timeout errors had a 40% success rate. Retries on rate limit errors had a 90% success rate. That changed how I handled each case. Timeout retries got a longer cooldown. Rate limit retries just needed a short backoff.
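One way to encode "different error, different retry behavior" is a small policy table. The sketch below is an illustration only: the delay and attempt values are invented, not the ones used in the pipeline, and `retryDelayMs` is a hypothetical helper.

```typescript
type RetryableError = 'timeout' | 'rate_limit';

// Hypothetical policy values: timeouts get a long cooldown and few attempts,
// rate limits get a short backoff and more attempts.
const RETRY_POLICY: Record<RetryableError, { baseDelayMs: number; maxAttempts: number }> = {
  timeout: { baseDelayMs: 5000, maxAttempts: 2 },
  rate_limit: { baseDelayMs: 500, maxAttempts: 5 },
};

// Returns how long to wait before retry number `attempt` (1-based),
// or null once the retry budget for that error type is spent.
function retryDelayMs(error: RetryableError, attempt: number): number | null {
  const policy = RETRY_POLICY[error];
  if (attempt < 1 || attempt > policy.maxAttempts) return null;
  // Exponential backoff: base, 2x base, 4x base, ...
  return policy.baseDelayMs * Math.pow(2, attempt - 1);
}
```

Adding random jitter to the returned delay is a common refinement when many workers retry at once.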

Catching Failures Before They Compound

The most expensive bug I caught with observability wasn't a crash. It was a silent schema drift.

The LLM had been returning scores as integers between 1 and 100. Then an API update changed the output format. Scores came back as strings like "85.0" instead of 85. The pipeline accepted them, stored them, and the frontend displayed them. Everything looked fine until someone noticed the sorting was wrong. String comparison sorts "9" after "85". Listings with single-digit scores were ranking above high-scoring ones.

A simple type check in the trace handler caught it:

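function validateScore(output: unknown): number | null {
  // Accept numbers directly, and numeric strings such as "85.0" after strict parsing.
  // Number() rejects trailing garbage ("85abc") that parseFloat would silently accept.
  const parsed =
    typeof output === 'number' ? output
    : typeof output === 'string' && output.trim() !== '' ? Number(output)
    : NaN;
  if (Number.isFinite(parsed) && parsed >= 1 && parsed <= 100) {
    return parsed; // always a real number, so downstream sorting is numeric
  }
  return null; // triggers alert
}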

When the score is null, the trace logs a warning and the pipeline pauses that batch. No bad data propagates. No silent corruption.

The Metrics That Matter for AI Agents

Not all metrics are equal. Here's what I track and why.

Token cost per job. This is your unit economics. If you don't know what one AI operation costs, you can't price your product or catch regressions. A prompt change that adds 200 tokens might not feel like much until you multiply it by 10,000 daily jobs. That's 2 million extra tokens a day. At GPT-4o pricing, that's about $6 a day. Over a month, $180. Over a year, over $2,000. For one prompt change.

Error rate by error type. Timeout errors and content filter errors need different responses. Timeouts need retry with backoff. Content filter errors need prompt adjustment. If you lump them together, you can't tune either.

Latency p50, p95, p99. The average hides the problem. A p50 of 800ms looks fine until you see the p99 is 14 seconds. That tail is what kills user experience. I set an alert when p95 exceeds 3 seconds. It's fired twice. Both times the cause was a prompt that had drifted to include too much context.

Cost per model. If you're routing between models (GPT-4o for complex scoring, GPT-4o-mini for simple classification), you need to know the actual cost split. I found that 30% of my calls were hitting the expensive model when the cheap one would have worked. A routing rule fixed it.
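The latency percentiles above are easy to compute from logged durations. This sketch uses the nearest-rank method and the 3-second p95 alert threshold mentioned in the text. `percentile` and `latencyReport` are names made up for the example, and a monitoring tool would normally do this for you.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile needs at least one sample");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

interface LatencyReport {
  p50: number;
  p95: number;
  p99: number;
  alert: boolean; // true when p95 exceeds the threshold
}

function latencyReport(latenciesMs: number[], p95AlertMs: number = 3000): LatencyReport {
  const p95 = percentile(latenciesMs, 95);
  return {
    p50: percentile(latenciesMs, 50),
    p95,
    p99: percentile(latenciesMs, 99),
    alert: p95 > p95AlertMs,
  };
}
```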

The Trust Loop

Observability does something more important than catching bugs. It builds trust.

When a founder asks "is the AI working right now?" you need to answer with data, not vibes. A dashboard showing the last 100 traces, their status, latency, and cost is worth more than any testing suite. It proves the system is doing what you think it's doing.

I built a simple status page that shows the last hour of pipeline activity. Green traces for success, red for failures, yellow for retries. The founder checks it once a day. When it's all green, they don't worry about the AI. That trust lets me iterate faster. I can deploy a prompt change, watch the traces for 10 minutes, and know immediately if it broke something.

Without observability, every deployment is a leap of faith. With it, every deployment is a measured risk with a rollback trigger.

The Real Cost of Skipping Observability

I see teams skip observability because it feels like overhead. Another system to maintain. Another dashboard to check. Another thing that can break.

The real overhead is debugging a black box. Three days of bad data. A $500 API bill you can't explain. A founder who's lost confidence in the AI feature.

Observability is the cheapest insurance you can buy for an AI system. A few extra fields in your database. A simple status endpoint. A weekly cost report. That's it. You don't need a PhD in distributed tracing. You need a trace ID, a success flag, and a timestamp.

If your team is shipping AI features and finding yourself unable to answer basic questions about cost, latency, or failure rates, that's the kind of thing I help with. Happy to compare notes on what's worth tracking and what's noise.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

Your AI Agent's Logs Are More Important Than Its Prompt

Abdul Rehman — Mon, 06 Jul 2026 09:03:44 +0000

I spent weeks perfecting a system prompt for a job description rewrite pipeline. The prompt had persona blocks, banned-word lists, structured output schemas, the works. It was beautiful.

Then I deployed it to production and watched it rewrite "Senior Software Engineer" into "a visionary digital architect who orchestrates cross-functional synergies." The prompt was fine. The problem was I had no idea what the model was actually doing until users told me.

That's when I learned the hard truth about production AI: your prompt is a guess. Your logs are proof.

The Hallucination That Cost Me a Pipeline

The job board platform I built processes over 10,000 listings daily. Each one goes through an LLM scoring pipeline that ranks relevance, extracts skills, and normalizes job titles. Early on, I trusted the prompt to handle everything.

One day the pipeline started tagging "Entry Level" positions as "Executive" roles. The system prompt hadn't changed. The model version hadn't changed. But the output quality had silently degraded.

I spent three days tweaking prompts before I thought to check the logs. The issue was obvious once I looked: the token usage per request had dropped by 40%. The model was truncating its reasoning because a cost-saving change I'd made to the temperature setting was causing shorter completions. The prompt was fine. The configuration was broken.

Here's what I should have had from day one:

interface LLMLogEntry {
  requestId: string;
  model: string;
  systemPrompt: string;
  userPrompt: string;
  completion: string;
  tokensUsed: number;
  latencyMs: number;
  temperature: number;
  timestamp: Date;
  score?: number; // quality score from downstream validation
}

Every single LLM call gets logged with full context. Not just the input and output, but the configuration, the timing, and the downstream quality signal. Without this, you're debugging blind.
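With `tokensUsed` on every entry, a drop like the 40% one described above can be caught by comparing a recent window against a baseline window. The following is a minimal sketch. The 25% threshold and the helper names are assumptions, not values from the original pipeline.

```typescript
function mean(values: number[]): number {
  if (values.length === 0) throw new Error("mean needs at least one value");
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

// Relative change of the recent mean against the baseline mean.
// A result of -0.4 means usage dropped by 40%.
function relativeChange(baseline: number[], recent: number[]): number {
  const baselineMean = mean(baseline);
  if (baselineMean === 0) throw new Error("baseline mean must be non-zero");
  return (mean(recent) - baselineMean) / baselineMean;
}

// Flags drift in either direction once it exceeds the (assumed) threshold.
function hasTokenDrift(baseline: number[], recent: number[], threshold: number = 0.25): boolean {
  return Math.abs(relativeChange(baseline, recent)) > threshold;
}
```

The same comparison works for latency or for the downstream quality score, as long as both windows come from the same endpoint and model.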

The Three Signals You Can't Ignore

After that incident, I built a structured logging layer around every AI call in the system. Three metrics matter more than anything else.

Latency drift. When your average response time jumps from 800ms to 2.4 seconds, something changed. It could be API congestion, a model update, or a prompt that's growing too long. I catch this with Sentry performance monitoring on the API layer. A sudden latency spike is usually the first sign of trouble, hours before users complain.

Token consumption per task. The job description rewrite pipeline was shut down because GPT-4.1 costs made it uneconomical at 1M+ listings. I knew the per-request cost, but I didn't have a dashboard showing total weekly spend trending upward. Now I log token counts per request and aggregate them by endpoint. When a single prompt starts costing 30% more per completion, I see it immediately.

Output structure failures. The most dangerous failure mode is when the model returns valid JSON with wrong data. A hallucinated skill, a fabricated company name, a salary range that doesn't match the listing. I validate every structured output against the original input data. If the model claims a job requires "10+ years of React" and the source listing says "2 years preferred," that's a log entry and a retry.
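Validating output against the source can start very simply. The sketch below checks two things from the paragraph above: extracted skills that never appear in the listing, and experience claims larger than anything the listing states. Both helpers are hypothetical, and substring matching is a crude assumption ("Java" will match "JavaScript").

```typescript
// Returns the extracted skills that never appear in the source text (case-insensitive).
function findUnsupportedSkills(sourceText: string, extractedSkills: string[]): string[] {
  const haystack = sourceText.toLowerCase();
  return extractedSkills.filter((skill) => haystack.indexOf(skill.toLowerCase().trim()) === -1);
}

// Largest "N years" / "N+ years" figure mentioned in a piece of text, or null if none.
function maxYearsMentioned(text: string): number | null {
  const matches = text.match(/(\d+)\+?\s*years?/gi);
  if (!matches) return null;
  return Math.max(...matches.map((match) => parseInt(match, 10)));
}

// A claim is supported only if the source mentions at least as many years as the claim does.
function isYearsClaimSupported(sourceText: string, claim: string): boolean {
  const claimed = maxYearsMentioned(claim);
  if (claimed === null) return true; // the claim makes no statement about years
  const stated = maxYearsMentioned(sourceText);
  return stated !== null && claimed <= stated;
}
```

With these, a model claim of "10+ years of React" against a source that says "2 years preferred" is rejected, which is the case that should produce a log entry and a retry.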

Why Prompt Engineering Alone Won't Save You

I see teams pour weeks into prompt optimization and skip observability entirely. They treat the prompt like a configuration file you can perfect once and forget. That's wrong for two reasons.

First, the model changes under you. OpenAI, Anthropic, Google, they all update their models silently. A prompt that worked perfectly in March produces gibberish in June. Without logs, you can't tell if the prompt broke or the model changed. With logs, you compare last week's completions to this week's and see the drift immediately.

Second, production data is messier than your test set. Your test prompts are clean, well-formed, and representative. Real user input is garbled, truncated, and malicious. LogRocket session replays showed me a user pasting a job description with Unicode characters that caused the LLM to return empty responses. The prompt handled it fine. The input was the problem. Without session replay, I'd still be rewriting system prompts.

Here's the logging middleware I use now:

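async function logLLMCall<T>(
  model: string,
  prompt: string,
  executor: () => Promise<T>,
  context: Record<string, unknown>
): Promise<T> {
  const start = Date.now();

  // Logging is best-effort: a failed log write must never mask the LLM result or error.
  // `db` is the application's database client.
  const writeLog = async (fields: { completion: string | null; success: boolean; error?: string }) => {
    try {
      await db.llmLogs.create({
        data: { model, prompt, latencyMs: Date.now() - start, context, timestamp: new Date(), ...fields }
      });
    } catch (logError) {
      console.error('Failed to write LLM log', logError);
    }
  };

  try {
    const result = await executor();
    await writeLog({ completion: JSON.stringify(result), success: true });
    return result;
  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`.
    const message = error instanceof Error ? error.message : String(error);
    await writeLog({ completion: null, success: false, error: message });
    throw error;
  }
}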

Every call gets logged. Every failure gets captured. Every latency spike gets recorded. This pattern costs almost nothing to implement and saves hours of debugging.

The Dashboard That Changed How I Ship AI

I built a simple dashboard that shows three things: average latency by model, total tokens consumed per day, and error rate by endpoint. That's it. Three charts.

The first time I saw it in production, I noticed the error rate for one endpoint was 12%. The endpoint was the job scoring pipeline. Twelve percent of all scoring requests were failing silently and returning default scores. Users saw accurate results for 88% of jobs and meaningless default scores for the other 12%. They couldn't tell. The system didn't alert.

The root cause was a race condition in the batch processing queue. Two concurrent requests would overwrite each other's state. The logs showed the pattern immediately: paired failures with identical timestamps. Without that dashboard, I might have blamed the model and wasted weeks on prompt engineering.

I use Sentry for error tracking and LogRocket for session replay on the frontend. But the LLM-specific logging lives in the application database. Sentry catches crashes. The structured logs catch silent failures.

When Your AI Agent Needs a Doctor, Not a Coach

If your team is deploying AI features and shipping slower because you can't tell if a failure is the prompt, the model, or the data, that's the kind of thing I help with. I build production observability into AI pipelines from day one, so you catch hallucinations, cost spikes, and latency issues before they reach your users. Happy to compare notes on what's worked for your setup.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

The $10,000 Lesson: Building Cost-Efficient AI Features with Function Calling and Caching

Abdul Rehman — Sun, 05 Jul 2026 09:01:45 +0000

I remember the exact moment a client saw the AI pipeline cost. It was a Tuesday morning, and the number made them say "shut it down."

That pipeline was rewriting job descriptions for a platform with over a million listings. The idea was solid: use a capable LLM to turn raw ATS text into structured, SEO-friendly content. But the cost per listing added up fast. The feature was technically impressive and completely uneconomical.

That was a hard lesson. But it taught me something I've used on every AI project since: building cost-efficient AI features isn't about picking the cheapest model. It's about architecture. You can cut costs dramatically without cutting quality if you design the system right.

Here's what actually works.

Function Calling Cuts Token Waste by Half or More

The biggest hidden cost in AI features is generating text you don't need. Most developers send a prompt and let the LLM write a paragraph of commentary when all they need is a structured data point. That's paying for thousands of tokens of filler.

Function calling (or structured output) fixes this. You tell the model exactly what fields to return, and it outputs only those fields in JSON. No fluff.

Here's the pattern I use in production:

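// Assumes `openai` is an initialized client from the official SDK
// and `rawDescription` holds the raw job text.
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "Extract structured data from the raw job description." },
    { role: "user", content: rawDescription }
  ],
  // `functions` / `function_call` are the legacy parameters; newer code uses `tools` / `tool_choice`.
  functions: [{
    name: "extract_job_listing",
    parameters: {
      type: "object",
      properties: {
        title: { type: "string" },
        salary_min: { type: "number" },
        salary_max: { type: "number" },
        remote: { type: "boolean" },
        skills: { type: "array", items: { type: "string" } }
      },
      required: ["title", "remote", "skills"]
    }
  }],
  // Force this function so the model cannot answer in prose.
  function_call: { name: "extract_job_listing" }
});

// `function_call` is optional on the message, so guard before parsing its JSON arguments.
const functionCall = completion.choices[0]?.message.function_call;
if (!functionCall) {
  throw new Error("Model did not return the expected function call");
}
const result = JSON.parse(functionCall.arguments);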

No introductory paragraph. No "Sure, here is the extracted data." Just the JSON object. On a bulk extraction pipeline, this cuts token usage significantly compared to freeform prompts asking for the same data.

The side benefit is reliability. When you enforce a schema, you eliminate hallucinations about fields that shouldn't exist. I've used conditional schema flags (like presence guards) to prevent the model from fabricating fields that don't exist, for example ensuring it never invents a previous job or education entry. That's not just cost optimization. It's trust.

Caching Is Free Money, but Most People Do It Wrong

Everyone knows to cache LLM responses. But the naive approach (keyed by the exact prompt string) misses most of the savings.

The trick is to cache at multiple levels.

Embedding cache. If you're embedding documents for a RAG pipeline, you pay for the same embedding again every time identical text is re-embedded: a re-ingested listing, an unchanged chunk, a repeated query. Store the embedding vectors in a database and look them up by content hash before calling the API. The initial seeding period is the most expensive; after that, repeated content hits the cache and costs nothing. Embedding costs drop substantially as the cache warms up.

LLM response cache with normalized keying. Exact prompt matching is too brittle: trivial differences in whitespace, casing, or parameter order produce different keys for the same request. Use a deterministic hash of the normalized prompt and the function call parameters. Hashing only collapses requests that normalize to the same string, so if you want "summarize this" and "give me a summary" to hit the same cache entry, map both to a canonical task name before hashing. Store the key in a cache like Redis with a TTL tied to data freshness: 24 hours for stable content, 1 hour for rapidly changing data.

Smart cache invalidation. This is where most people fail. They cache forever and serve stale data. Set cache TTLs based on the data source. If the underlying data changes (new job listings, updated user profile), invalidate the cache for that specific key. This prevents the "I updated my resume but the AI still sees the old version" problem.

The total impact: on a production system handling many LLM calls daily, caching can eliminate a large portion of API calls after the initial warmup. Many requests never touch the API after the first identical query.
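The keying, TTL, and invalidation ideas above fit in a small sketch. This is an in-memory stand-in for illustration, assuming Node's built-in `crypto` module. `normalizePrompt`, `cacheKey`, and `TtlCache` are hypothetical names, and a production setup would more likely use Redis with native expiry.

```typescript
import { createHash } from "node:crypto";

// Collapse casing and whitespace so trivially different prompts share a key.
function normalizePrompt(prompt: string): string {
  return prompt.toLowerCase().replace(/\s+/g, " ").trim();
}

// Deterministic key: parameters are serialized with sorted keys so that
// { a, b } and { b, a } hash identically. Assumes flat parameter objects.
function cacheKey(prompt: string, params: Record<string, string | number | boolean>): string {
  const sortedParams = Object.keys(params).sort().map((key) => [key, params[key]]);
  const payload = JSON.stringify([normalizePrompt(prompt), sortedParams]);
  return createHash("sha256").update(payload).digest("hex");
}

// Minimal TTL cache with explicit invalidation. The clock is injectable for testing.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: V, ttlMs: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it so stale data is never served
      return undefined;
    }
    return entry.value;
  }

  // Call this when the underlying data changes (for example, a user updates their resume).
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

Including a content hash or version of the underlying record in the parameters is another way to get invalidation for free: changed data produces a new key.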

Batch API and Prompt Compression for Heavy Workloads

OpenAI's Batch API offers a 50% discount on most models. The tradeoff is latency: results come back in hours, not seconds. That's perfect for nightly enrichment jobs, not for user-facing chat.

On a large job board platform, I moved the description rewrite pipeline to Batch API. Processing thousands of listings overnight cut the cost per listing in half compared to synchronous calls. Even at that rate, the overall cost was significant, which is why we're evaluating cheaper models like DeepSeek V4 Flash (roughly 23x cheaper than GPT-4.1) for that workload.

Prompt compression is another lever. Strip out unnecessary context. If the system prompt is long and you're sending many requests, every token you remove from the system prompt multiplies across every request. I've trimmed prompts by a noticeable margin just by removing redundant instructions and using shorter examples.

Model Selection: When to Pay for 4o and When to Use Flash

I maintain a simple decision tree:

Complex reasoning, legal, or finance tasks: GPT-4o or 4.1. The output quality justifies the cost.
Structured data extraction, classification, summarization: GPT-4o mini or Gemini 2.0 Flash. They're fast and cheap.
Bulk processing with loose quality requirements: DeepSeek V4 Flash. At roughly 23x cheaper than GPT-4.1, it's economical for pipelines where occasional errors are acceptable.
Real-time, high-volume, moderate quality: Gemini 2.0 Flash. Its free tier offers generous limits, and the paid rate is lower than GPT-4o mini.

Suppose a client insists on using a top-tier model for everything. Switching extraction tasks to a cheaper model and chat responses to a cost-effective provider can drop the monthly bill dramatically. Quality, measured by user satisfaction, barely moves.

Guardrails Prevent Runaway Costs

The most expensive bug is an infinite loop. An AI agent that retries on failure, or a user who spams the generate button, can burn through significant money in minutes.

I set three hard guardrails on every AI feature:

Per-request token limits. Hard cap on max_tokens. Never let the model decide how long to answer.
Rate limiting per user. Reasonable limits on requests per minute and per day on generation endpoints.
Cost alerts. A simple script that checks the daily API usage and sends a notification if it exceeds a threshold. A runaway prompt can cause the model to generate overly long responses. Cost alerts catch it early before it escalates.

These are not theoretical. I've seen a pipeline burn through hundreds of dollars faster than expected because of a missing guardrail. Now I never ship an AI feature without all three.
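The three guardrails can each be expressed in a few lines. The sketch below is illustrative only: the class and function names are invented, the limiter is a simple in-memory fixed window, and real limits and thresholds depend on your product.

```typescript
// Guardrail 1: never request more completion tokens than the hard cap.
function clampMaxTokens(requested: number, hardCap: number): number {
  return Math.max(1, Math.min(requested, hardCap));
}

// Guardrail 2: fixed-window rate limiter, one counter per user.
class FixedWindowRateLimiter {
  private windows = new Map<string, { windowStart: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed, false if the user is over the limit.
  allow(userId: string, nowMs: number): boolean {
    const entry = this.windows.get(userId);
    if (!entry || nowMs - entry.windowStart >= this.windowMs) {
      this.windows.set(userId, { windowStart: nowMs, count: 1 });
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

// Guardrail 3: decide whether today's spend warrants a notification.
function shouldAlertOnSpend(dailySpendUsd: number, thresholdUsd: number): boolean {
  return dailySpendUsd > thresholdUsd;
}
```

In a multi-instance deployment the limiter state would need to live in shared storage rather than process memory.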

If your team is wrestling with AI feature costs and shipping slower because of it, that's the kind of thing I help with. I've been building production AI pipelines, breaking them, and figuring out what actually works. Happy to compare notes.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.

I Built an AI Pipeline That Scores 10,000+ Listings Daily Without Breaking the Bank

Abdul Rehman — Sat, 04 Jul 2026 09:01:40 +0000

The Moment the Cost Problem Snapped Into Focus

The first time I ran the LLM scoring pipeline against our full backlog of job listings, I watched the OpenAI API costs climb in real time. What worked beautifully for 100 test listings was economically impossible at 10,000 per day.

This wasn't a side project. This was a production job board platform I was building for a client, processing listings from major ATS sources. Users needed scores. The client needed the system to be profitable. And I needed to rethink everything about how I was calling LLMs at scale.

The naive approach was simple: take a listing, send it to GPT-4 with a prompt, get a score back. Simple, but expensive. At scale, that pattern would have made the product economically unviable.

So I rebuilt the pipeline from the ground up. Here's what the final architecture looks like.

Ingestion: The Data Pipeline Behind the Scoring

Before scoring anything, you need clean data at scale. The platform pulls from five major ATS providers using their public APIs. Greenhouse, Lever, Ashby, Workable, and Recruitee all expose listing data without OAuth, which makes ingestion straightforward, but each returns a different shape of data.

The ingestion layer normalizes everything into a standard schema: title, company, description, location, posted date, and metadata. Then it writes to a MongoDB collection that the scoring pipeline reads from.

The first failure here was pagination. I was using MongoDB's skip() for offset-based pagination when reading from the collection. At 1M+ documents, deep skip calls caused Atlas CPU spikes because skip() doesn't skip computation. It scans every document up to the offset. The more listings we ingested, the worse it got.

The fix was cursor-based pagination using the _id field. Instead of skipping, the query says "give me the next 100 documents after this one." No scanning. No CPU spikes. The change took an afternoon to implement and permanently solved a problem that had been causing weekly incidents.
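The difference is easiest to see in the query shape. The sketch below builds the "next page after this `_id`" query and runs it against an in-memory array standing in for the collection. It assumes string `_id` values for simplicity, and `nextPageQuery`, `runQuery`, and `readAllPages` are hypothetical helpers. With the MongoDB Node driver, the same spec would be passed as `collection.find(q.filter).sort(q.sort).limit(q.limit)`.

```typescript
interface Doc {
  _id: string;
}

interface PageQuery {
  filter: { _id?: { $gt: string } };
  sort: { _id: 1 };
  limit: number;
}

// Builds the query for the page that follows `lastId` (null means "first page").
// The filter on the indexed _id field lets the database seek instead of scanning an offset.
function nextPageQuery(lastId: string | null, pageSize: number = 100): PageQuery {
  return {
    filter: lastId === null ? {} : { _id: { $gt: lastId } },
    sort: { _id: 1 },
    limit: pageSize,
  };
}

// In-memory stand-in for the database: applies filter, sort, and limit.
function runQuery<T extends Doc>(docs: T[], query: PageQuery): T[] {
  const after = query.filter._id ? query.filter._id.$gt : null;
  return docs
    .filter((doc) => after === null || doc._id > after)
    .sort((a, b) => (a._id < b._id ? -1 : a._id > b._id ? 1 : 0))
    .slice(0, query.limit);
}

// Walks the whole collection page by page, carrying only the last seen _id.
function readAllPages<T extends Doc>(docs: T[], pageSize: number): T[][] {
  const pages: T[][] = [];
  let lastId: string | null = null;
  while (true) {
    const page: T[] = runQuery(docs, nextPageQuery(lastId, pageSize));
    if (page.length === 0) break;
    pages.push(page);
    lastId = page[page.length - 1]._id;
  }
  return pages;
}
```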

But pagination was just the warm-up. The real challenge was still ahead.

LLM Scoring: Function Calling Over Freeform Prompts

For the scoring pipeline, I needed structured, predictable output. Freeform prompts with "return a JSON object" instructions are fragile. One day the LLM decides to add commentary. The next day it renames a key. That breaks downstream systems.

Function calling fixed this. Here's the schema I use for scoring a job listing against a candidate profile:

const scoringFunctions = [
  {
    name: 'score_job_match',
    description: 'Score how well a job listing matches a candidate profile',
    parameters: {
      type: 'object',
      properties: {
        overall_score: {
          type: 'number',
          description: 'Match score from 0 to 100',
        },
        skill_match: {
          type: 'number',
          description: 'How well the candidate skills match requirements, 0-100',
        },
        experience_match: {
          type: 'number',
          description: 'How well candidate experience matches, 0-100',
        },
        location_match: {
          type: 'number',
          description: 'Location compatibility, 0-100',
        },
        reasons: {
          type: 'array',
          items: { type: 'string' },
          description: 'Top 3 reasons for this score',
        },
      },
      required: ['overall_score', 'skill_match', 'experience_match', 'location_match', 'reasons'],
    },
  },
];

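With function calling, the LLM returns the same JSON structure far more reliably than a freeform prompt does. Parsing errors and invented keys become rare instead of routine, and the output can go almost directly into the database. It is still worth validating the arguments first, since function calling on its own does not guarantee schema-conformant output.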
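A defensive parse of the function call arguments is cheap insurance before anything is stored. Here is a minimal sketch against the `score_job_match` schema above. `parseJobMatchScore` is a hypothetical helper, and a schema library would do the same job with less code.

```typescript
interface JobMatchScore {
  overall_score: number;
  skill_match: number;
  experience_match: number;
  location_match: number;
  reasons: string[];
}

const SCORE_FIELDS = ['overall_score', 'skill_match', 'experience_match', 'location_match'] as const;

// Parses the raw `arguments` string from the function call and returns a typed
// score, or null when the JSON is malformed or any field is missing or out of range.
function parseJobMatchScore(rawArguments: string): JobMatchScore | null {
  let data: unknown;
  try {
    data = JSON.parse(rawArguments);
  } catch {
    return null;
  }
  if (typeof data !== 'object' || data === null) return null;
  const record = data as Record<string, unknown>;

  for (const field of SCORE_FIELDS) {
    const value = record[field];
    if (typeof value !== 'number' || !Number.isFinite(value) || value < 0 || value > 100) {
      return null;
    }
  }

  const reasons = record['reasons'];
  if (!Array.isArray(reasons) || !reasons.every((reason) => typeof reason === 'string')) {
    return null;
  }
  return data as JobMatchScore;
}
```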

But even with perfect output structure, the cost problem remained.

Cost Management: Three Strategies That Made It Work

This is where most people give up on production AI. They see the OpenAI bill, panic, and either kill the feature or ship a broken version. I went through both phases before landing on a working approach.

Strategy 1: Batch everything possible.

OpenAI's Batch API gives you a 50% cost reduction in exchange for delayed processing. For scoring, that's fine. Listings don't need scores within seconds. They need scores within hours. Each line of the batch input file wraps the same request body you would send to the real-time API. I queue up 500 scoring requests, submit them as a batch file, and typically collect the results 30 to 60 minutes later (the API itself only commits to a 24-hour completion window). The per-listing cost drops immediately, and daily throughput stays the same.
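For reference, the batch input is a JSONL file where each line carries a `custom_id`, the HTTP method, the endpoint URL, and the usual request body. The sketch below builds that file content as a string. The `ScoringRequest` shape, the prompt wording, and the helper names are assumptions for illustration. Uploading the file and creating the batch are separate API calls not shown here.

```typescript
// Hypothetical shape of one queued scoring job.
interface ScoringRequest {
  listingId: string;
  listingText: string;
  profileSummary: string;
}

// One JSONL line in the Batch API input format. `custom_id` is how results
// are matched back to listings, since output order is not guaranteed.
function toBatchLine(request: ScoringRequest, model: string): string {
  return JSON.stringify({
    custom_id: request.listingId,
    method: "POST",
    url: "/v1/chat/completions",
    body: {
      model,
      messages: [
        { role: "system", content: "Score how well the job listing matches the candidate profile." },
        { role: "user", content: `Listing:\n${request.listingText}\n\nProfile:\n${request.profileSummary}` },
      ],
    },
  });
}

// Joins all requests into the newline-delimited content of the batch file.
function buildBatchFile(requests: ScoringRequest[], model: string): string {
  return requests.map((request) => toBatchLine(request, model)).join("\n");
}
```

Because `JSON.stringify` escapes newlines inside strings, multi-line listing text still ends up on a single JSONL line.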

Strategy 2: Tier your models.

Not every listing needs a GPT-4-level analysis. Simple listings with clear skill requirements get scored with GPT-4o mini. Complex executive roles or ambiguous descriptions go to GPT-4. The routing logic is straightforward: if the description is under 500 words and the required skills are well-defined, use the cheap model. Otherwise, escalate.

This alone cut the average per-listing cost by about 70% without measurable accuracy loss. The key insight is that most data in most systems is simple. Only a fraction needs the heavy model. Design for the majority.
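The routing rule can be sketched as a pure function. The 500-word cutoff comes from the text. Treating "well-defined skills" as "at least three extracted skills" is an assumption made only for this example, as are the `Listing` shape and the `chooseModel` name.

```typescript
interface Listing {
  description: string;
  requiredSkills: string[];
}

type ScoringModel = "gpt-4o-mini" | "gpt-4";

function wordCount(text: string): number {
  const trimmed = text.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// Short listings with clearly listed skills go to the cheap model; everything else escalates.
// `minSkills` is an assumed proxy for "well-defined skill requirements".
function chooseModel(listing: Listing, maxWords: number = 500, minSkills: number = 3): ScoringModel {
  const isShort = wordCount(listing.description) < maxWords;
  const hasClearSkills = listing.requiredSkills.length >= minSkills;
  return isShort && hasClearSkills ? "gpt-4o-mini" : "gpt-4";
}
```

Logging which branch each listing took makes it possible to check later whether the cheap path is actually as accurate as expected.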

Strategy 3: Cache aggressively.

If a listing has been scored before and nothing changed, don't pay to score it again. I built a cache layer keyed on a hash of the listing content plus the candidate profile ID. The pipeline checks the cache before making any LLM call. Hit rate runs around 40% on repeat listings. That's 40% of requests that cost nothing.

Even with all three strategies, the client's AI rewrite pipeline got shut down. The cost at 1M+ listing scale was still too high for the budget. That's the reality of production AI. You don't solve cost once. You keep optimizing, or you find a model that's cheap enough and try again. That's what I'm evaluating with DeepSeek V4 Flash right now.

The REST API Layer: Making Scored Data Consumable

The scored listings don't just sit in a database. They're served through a REST API that downstream consumers query.

The API accepts filters for score ranges, locations, skills, and posting dates. Each endpoint logs query patterns so I can optimize indexes and cache popular queries. The response format is compact JSON with the score fields grouped under a single score object, making it easy for frontend developers and integration partners to consume without transformation.

// API response shape for scored listings
{
  "id": "listing_abc123",
  "title": "Senior Software Engineer",
  "score": {
    "overall": 87,
    "skill": 92,
    "experience": 85,
    "location": 80,
    "reasons": [
      "Strong TypeScript and React experience matches requirements",
      "5 years of backend experience aligns with senior role",
      "Remote-first company, location not a barrier"
    ]
  },
  "posted_at": "2026-05-01T10:00:00Z"
}

The API layer also handles rate limiting, auth via API keys, and request validation. It's the part users and integrators see, so it has to be fast and reliable. Every endpoint returns scores within 50ms because all the LLM work happened hours ago during the batch window.

What I'd Do Differently

If I were starting this system today, I'd skip the GPT-4-only phase entirely and start with model tiering from day one. The cost of "let's just get it working with the best model" is real, and it creates pressure to optimize under fire instead of by design.

I'd also build the cache layer before the scoring pipeline, not after. Adding caching retroactively meant replaying scores that were already paid for. Building it first would have saved thousands in the first month.

And I'd have the cost conversation with the client earlier. The rewrite pipeline that got shut down would have been designed differently if we'd agreed on cost constraints before writing code. But that's a lesson in communication, not engineering.

Production AI isn't about using the smartest model. It's about designing for cost, latency, and reliability from day one. The model choice matters, but your architecture matters more.

If your team is building AI features and hitting the wall between "it works on my machine" and "it works at scale without burning cash", that's the kind of problem I help with. You can see how I build production AI pipelines at primestrides.com. Happy to compare notes on your specific challenges.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.