AlaiKrm

Posted on Jul 3

What a Production RAG System Actually Looks Like After 18 Months

#ai #production #rag #systemdesign

I want to write the post I wish existed when I started building enterprise RAG systems. Not a tutorial on how to build one. There are enough of those. This is about what the system looks like after eighteen months of real usage has shaped it into something that actually works reliably.

The gap between a RAG system that works in a demo and one that works reliably in production is significant and specific. I am going to describe that gap concretely because "production-ready" gets used as a vague quality signal when it should describe a specific set of components and properties.

The ingestion pipeline has become a system of record

In a demo RAG system, documents go in and they stay in. You index your documents, you query them, done. In a production system that has been running for eighteen months, the ingestion pipeline has evolved into something that looks more like a data management system than a simple indexing operation.

Every document in our index has a canonical identifier that links it to its source, a version history, a status field (current, superseded, archived), an access control tier, and a freshness score based on when it was last verified against the source system. When a new version of a document is ingested, the previous version is marked as superseded rather than deleted, giving us rollback capability if the new version has quality problems. When a document is deleted from the source system, it is flagged in the index rather than removed immediately, to allow for investigation of why it was deleted before the vectors disappear from retrieval.

The code that implements this is not complicated. It is approximately 300 lines of Python that runs every four hours against our primary document sources. What makes it work is the discipline of treating document lifecycle as a first-class concern rather than an afterthought.

class DocumentLifecycleManager:
    def ingest(self, doc_path: str, source_system: str, metadata: dict):
        canonical_id = self.compute_canonical_id(doc_path, source_system)
        existing = self.index.get_by_canonical_id(canonical_id)

        if existing and existing.content_hash == self.compute_hash(doc_path):
            self.index.update_last_verified(canonical_id)
            return  # content unchanged, just update freshness timestamp

        if existing:
            self.index.mark_superseded(canonical_id)  # version the old one

        new_doc = self.prepare_document(doc_path, canonical_id, metadata)
        self.index.insert(new_doc)
        self.log_ingestion_event(canonical_id, "new_version" if existing else "first_ingestion")

    def run_freshness_audit(self, max_age_days: int = 30):
        stale_docs = self.index.get_docs_not_verified_since(days=max_age_days)
        for doc in stale_docs:
            source_exists = self.check_source_system(doc.source_path)
            if not source_exists:
                self.index.mark_for_review(doc.canonical_id, reason="source_deleted")
            else:
                self.re_ingest(doc.source_path, doc.source_system)

This is not the code you write in a tutorial. It is the code you write after you have been bitten by stale documents surfacing in retrieval for the third time.

The retrieval layer has access control baked in, not bolted on

The single most important architectural lesson from eighteen months of running a multi-user enterprise RAG system is that retrieval-layer access control cannot be an afterthought. If access control is applied after retrieval, as a filter on the results before they are shown to the user, you have a system that has already retrieved restricted content and is now choosing not to show it. That distinction matters in environments where the fact that restricted content exists is itself sensitive information.

The current architecture applies access control as a pre-retrieval filter, using metadata on the vector embeddings to ensure that the retrieval query itself only considers vectors that the requesting user is authorized to access.

def retrieve_with_access_control(
    query: str,
    user_context: UserContext,
    k: int = 8
) -> list:
    # Build filter from user's actual permissions, not a static role
    access_filter = {
        "$or": [
            {"access_tier": "public"},
            {"access_tier": "internal", "departments": {"$in": user_context.departments}},
            {"access_tier": "restricted", "authorized_users": {"$in": [user_context.user_id]}},
        ]
    }

    results = vectorstore.similarity_search_with_score(
        query=query,
        k=k * 2,  # retrieve more, will filter and re-rank
        filter=access_filter
    )

    # Re-rank by combining similarity score with document freshness
    reranked = self.freshness_weighted_rerank(results)
    return reranked[:k]

def freshness_weighted_rerank(self, results: list, freshness_weight: float = 0.15) -> list:
    scored = []
    for doc, similarity_score in results:
        days_since_verified = (datetime.now() - doc.metadata["last_verified"]).days
        freshness = max(0, 1 - (days_since_verified / 180))  # decay over 6 months
        combined_score = (1 - freshness_weight) * similarity_score + freshness_weight * freshness
        scored.append((doc, combined_score))
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)]

The freshness-weighted reranking came from a specific incident where a user was getting outdated answers from old documents that scored well on semantic similarity but poorly on factual currency. The weight is small (15%) but meaningful, and it has reduced complaints about outdated answers by roughly half.

Observability is not optional and it is not metrics dashboards

The observability layer in our current system is built around a concept I think of as "explainability by default." For every query-response cycle, we log enough information that a human investigator can reconstruct exactly what happened, starting from the user's query and ending with the generated response.

The log record for each interaction includes: the original query, the access control context applied to the retrieval, the top-k results with their similarity scores and freshness scores, the assembled prompt (hashed, not stored as raw text), the model and prompt version used, the response, the response latency by component, and any errors or fallbacks that occurred.

This logging architecture has two specific properties that I want to call out because they were not in the original design and had to be added after incidents made their absence painful.

The component-level latency tracking was added after we spent three days debugging a performance regression that turned out to be a specific document type causing embedding failures that triggered silent retries. The overall latency looked slightly elevated but not alarming. Component-level latency showed the embedding step spiking on a specific document category.

The model and prompt version logging was added after a prompt update introduced a subtle behavior change that we did not catch in evaluation because our evaluation set did not cover the edge case well. When user reports started coming in about changed response formatting, we could not initially determine whether the issue was the prompt change, a model update the provider had deployed, or a data quality issue. Version logging lets us correlate behavior changes with specific deployments.

The evaluation suite has become a regression prevention system

The evaluation suite we run against every change to the retrieval configuration or prompts has grown from twelve queries in the initial version to 340 query-response pairs representing the actual distribution of queries we see in production.

The queries are categorized by type (factual lookup, synthesis across documents, policy question, procedural guide) and by sensitivity of the underlying content. We measure recall at k for retrieval, groundedness of the response against retrieved content, and instruction following for the behavioral constraints in the system prompt.

The most important addition was the adversarial query set. These are queries specifically designed to probe failure modes: queries where we know the answer has changed recently, queries that reference content the user should not have access to, queries that contain instruction-like text designed to test prompt injection resistance, and queries where the AI's honest answer should be "I don't know" rather than a generated response.

This adversarial set has caught three significant issues that the standard evaluation set missed, in all three cases because the issue only manifested on query types that normal users do not typically ask but adversarial users or edge cases produce. Finding those in the evaluation suite is considerably better than finding them in production.

What the system cannot do well, and why that is okay

After eighteen months I have a clear-eyed view of where this system fails. It struggles with queries that require reasoning across more than four or five documents simultaneously. It degrades significantly when the knowledge base has not been maintained and documents are stale. It cannot handle procedural tasks that require maintaining state across multiple turns. It produces inconsistent results for queries that are ambiguous enough that the same query phrased differently would retrieve different document sets.

None of these failures are surprising and most of them are inherent to the current state of the technology rather than to specific implementation choices. What matters is that the failure modes are known, the system communicates uncertainty honestly when it encounters them, and the monitoring infrastructure surfaces new failure patterns quickly enough that they can be addressed before they erode user trust at scale.

The system that works reliably in production is not the one that never fails. It is the one where failure is understood, observable, and recoverable.

Top comments (3)

Mateo Ruiz • Jul 3

The point about the ingestion pipeline becoming a system of record really stood out. A lot of RAG discussions focus on embeddings and retrieval quality, but production issues are often caused by stale, duplicated, or poorly governed data not the vector search itself.

I also liked the emphasis on adversarial evals. Testing only "happy path" queries gives a false sense of confidence. The real confidence comes from knowing how the system behaves when access controls, prompt injection attempts, or outdated knowledge are involved. That's the difference between a demo and a system people can actually trust.

Tae Kim • Jul 4

The document lifecycle model you're describing -- canonical identifier, version chain, superseded versus deleted distinction -- is exactly the ingestion design that gets skipped in tutorials because it's unglamorous, but it's what separates a RAG system that degraded silently over 18 months from one that didn't. The superseded-versus-deleted distinction is particularly underrated: soft-deleting vectors is what lets you diagnose whether a retrieval regression was caused by a source change rather than a model or chunking change, because you can diff what the retriever was returning before and after. One thing I'd be curious about: how you handle the case where a document's canonical identifier is stable but its embedding distribution has drifted -- when content changes significantly enough that old vectors are semantically stale even though the lifecycle status still says current. The 4-hour ingest cadence also suggests you've accepted some indexing lag, so I'm guessing the freshness score feeds into query-time retrieval weighting rather than being purely an observability signal?

Med Marrouchi • Jul 3

Stale and poor data quality is the worst enemy for RAG. Data needs to be structured and well maintained as much as possible to have a reliable system.