OpenAI Responses API vs Custom RAG: Cost, Latency and Control in 2026


When you need to add document retrieval to an LLM application, you have two realistic paths: use OpenAI's built-in file_search tool via the Responses API, or build and manage your own RAG pipeline. The first option ships in a day; the second gives you full control over chunking, embeddings, retrieval logic, and cost. Picking the wrong one early will either lock you into an opaque managed service or waste weeks of engineering before the product even launches.

What the Responses API Actually Gives You

The OpenAI Responses API (released in 2025 as the successor to the Assistants API) exposes file_search as a first-class tool backed by managed vector stores. You upload documents, attach them to a vector store, and the model retrieves relevant chunks automatically during inference.

Here is a minimal working example:

from openai import OpenAI

client = OpenAI()

with open("security-policy.pdf", "rb") as f:
    uploaded = client.files.create(file=f, purpose="assistants")

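# Vector stores live outside the beta namespace in SDK versions that ship the Responses API
vector_store = client.vector_stores.create(name="policy-docs")
# Attach the file and wait for indexing to finish before querying
client.vector_stores.files.create_and_poll(
    vector_store_id=vector_store.id,
    file_id=uploaded.id
)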

response = client.responses.create(
    model="gpt-4o",
    input="What is the password rotation policy?",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}]
)

print(response.output_text)

Zero-ops retrieval that works on day one. What you give up: control over the embedding model and the ranking internals. By default the response does not list which chunks were retrieved (you have to request the search results explicitly), and query-time filtering is limited to the attribute filters the tool exposes. The store is entirely managed: you cannot replicate it or move it to another provider.

Building a Custom RAG Pipeline

A custom pipeline involves four steps: chunk documents, embed chunks, store in a vector database, and query at runtime. Here is a stripped-down version using pgvector and OpenAI embeddings — every part is explicit and auditable:

import psycopg2
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list:
    return client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    ).data[0].embedding

def search_chunks(query: str, conn, top_k: int = 5) -> list:
    qvec = embed(query)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT content, 1 - (embedding <=> %s::vector) AS score "
            "FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (qvec, qvec, top_k)
        )
        return cur.fetchall()

def answer_with_rag(question: str, conn) -> str:
    results = search_chunks(question, conn)
    for content, score in results:
        print(f"[{score:.3f}] {content[:80]}")  # audit trail

    # Join the retrieved chunks into one context block, one chunk per line
    context = "\n".join(c for c, _ in results)
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based only on:\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return resp.choices[0].message.content

More code, but you own the full chain: similarity scores, metadata filters, chunk size, embedding model. You can add BM25 hybrid search, apply filters like doc_type = policy AND dept = engineering, or re-rank with a cross-encoder without touching the underlying data store.
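The chunking step is the one part of the four-step pipeline that the snippet above leaves out. As a rough sketch of what "owning chunk size" can look like, here is a fixed-size character splitter with overlap. The function name chunk_text and the default sizes are assumptions made for illustration, not values from this article; many pipelines split on tokens, sentences, or headings instead.

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():  # skip windows that are only whitespace
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Each window repeats the last `overlap` characters of the previous one.
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk would then go through embed() and be inserted into the chunks table. The overlap is there so that a sentence cut at a window boundary still appears whole in at least one chunk.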

Cost Breakdown

Here is what you are actually paying for in each approach.

Responses API with file_search:

Vector store storage: $0.10/GB/day — a 1 GB corpus costs $3/month just to sit there
Retrieval tokens: chunks injected into the context window are billed as input tokens at the model's input rate, and you have limited control over how many tokens that adds per query
No way to optimize retrieval cost without reducing document volume

Custom RAG:

Embedding at index time: ~$0.02 per million tokens with text-embedding-3-small
Vector database: $0 (self-hosted pgvector or Chroma) to ~$50/month for a managed Qdrant cluster
Query embedding: same $0.02/million rate
LLM call: you control exactly what goes into context, so you control the per-query cost

At small scale — under 10,000 queries/month on a few dozen documents — the Responses API wins on engineering time. At medium scale (100k+ queries/month, a few hundred documents), a custom pgvector setup typically runs 30-50% cheaper in pure infrastructure cost. That crossover assumes you are not counting the engineering hours to maintain the pipeline, which are real.
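To make the arithmetic behind these figures easy to rerun with your own numbers, here is a small back-of-the-envelope calculator. It uses only the two rates quoted above ($0.10/GB/day for storage, $0.02 per million embedding tokens). The function names and the example workload (100,000 queries of about 20 tokens each) are assumptions for illustration, and LLM token costs are deliberately left out because they depend on how much context you inject.

```python
# Rates as quoted in this article; check current pricing before relying on them.
STORAGE_USD_PER_GB_DAY = 0.10
EMBED_USD_PER_MILLION_TOKENS = 0.02


def managed_storage_cost(corpus_gb: float, days: int = 30) -> float:
    """Managed vector store storage cost over a billing period."""
    return corpus_gb * STORAGE_USD_PER_GB_DAY * days


def embedding_cost(total_tokens: int) -> float:
    """Cost of embedding a given number of tokens (indexing or queries)."""
    return total_tokens / 1_000_000 * EMBED_USD_PER_MILLION_TOKENS


def query_embedding_cost(queries: int, avg_query_tokens: int) -> float:
    """Monthly cost of embedding the queries themselves in a custom pipeline."""
    return embedding_cost(queries * avg_query_tokens)


# Hypothetical workload: 1 GB corpus, 100k queries/month, ~20 tokens per query.
print(f"storage: ${managed_storage_cost(1.0):.2f}/month")
print(f"query embeddings: ${query_embedding_cost(100_000, 20):.2f}/month")
```

On these assumptions, query embedding is a rounding error on the custom side; the vector database bill and the context tokens sent to the LLM are what dominate the comparison.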

Latency and Observability

The Responses API retrieval is fast, typically 200-400 ms for a vector search, but opaque by default. Unless you explicitly request the search results in the response, you do not see which chunks were fetched or how they scored, and the retrieval internals stay out of reach either way. When the model starts hallucinating, that makes it hard to tell whether the right context was retrieved in the first place.

Custom RAG gives you full traceability at every step. Logging similarity scores lets you set alert thresholds — for example, reject answers when no chunk scores above 0.70 — and build a proper audit trail of what context the model received. For security-sensitive applications, that audit trail is often a compliance requirement, not a nice-to-have. The security hardening checklists we publish cover what that logging and audit chain should look like end-to-end.
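The threshold idea can be sketched as a small gate in front of the LLM call. This is a minimal illustration, assuming results arrive as (content, score) tuples like the rows returned by search_chunks; the names has_confident_match, guarded_answer, and REFUSAL are hypothetical, and 0.70 is simply the example value mentioned above, not a recommended setting.

```python
from typing import Callable, Sequence, Tuple

Result = Tuple[str, float]  # (chunk content, similarity score)

REFUSAL = "I could not find relevant context to answer this question."


def has_confident_match(results: Sequence[Result], min_score: float = 0.70) -> bool:
    """True if at least one retrieved chunk scores at or above the threshold."""
    return any(score >= min_score for _, score in results)


def guarded_answer(
    question: str,
    results: Sequence[Result],
    generate: Callable[[str, Sequence[Result]], str],
    min_score: float = 0.70,
) -> str:
    """Call the LLM only when retrieval looks trustworthy; otherwise refuse."""
    if not has_confident_match(results, min_score):
        return REFUSAL
    return generate(question, results)


# Stand-in for the real LLM call, so the gate can be exercised offline.
fake_llm = lambda q, r: f"answer from {len(r)} chunks"
print(guarded_answer("rotation policy?", [("rotate every 90 days", 0.82)], fake_llm))
print(guarded_answer("rotation policy?", [("unrelated text", 0.41)], fake_llm))
```

Logging the refusals alongside their top scores gives you a direct signal for tuning the threshold, since a cutoff that works for one embedding model or corpus may be too strict or too loose for another.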

When to Use Each

Use the Responses API with file_search if:

You are prototyping and need to ship in 24 hours
Document volume is small (under 500 files, under 200 MB)
You do not need metadata filtering
You have no infrastructure to run a vector database
Auditability is not a hard requirement

Build a custom RAG pipeline if:

You need metadata filtering at query time (by date, department, classification level)
You need full observability and cost control at scale
You want to swap embedding models without migrating an opaque external store
You process sensitive documents that cannot leave your infrastructure
You need hybrid search (dense vectors + BM25)
You are building on an open-weights model and cannot use managed stores anyway

The Takeaway

The Responses API file_search is genuinely good for getting retrieval working fast. The managed setup removes operational friction and the integration is clean. But you pay for that convenience in three ways: reduced cost predictability at scale, limited visibility into retrieval quality, and lock-in to a single infrastructure provider.

A custom RAG pipeline costs more to build upfront. In exchange, you own the full chain — chunking strategy, embedding model, retrieval logic, metadata filters, and the complete audit trail. For teams operating at meaningful scale or with security and compliance requirements, those trade-offs almost always favor the custom approach.

Start with the Responses API to validate the use case quickly. Migrate to a custom pipeline when you hit the first real limit — whether that is a cost spike, a hallucination you cannot debug, or a metadata filter you cannot express. By that point, you will know exactly why the migration is worth it.

I run AYI NEDJIMI Consultants, a cybersecurity consulting firm. We publish free security hardening checklists — PDF and Excel.