WonderLab

Posted on May 19

RAG Series (20): Enterprise RAG Architecture

#rag #ragas #ai #qdrant

The Gap Between Demo and Production

Every article in this series has shared one architectural assumption: a single vector store, accessible to everyone, returning any document to any user.

That works in a demo. In an enterprise environment, it breaks immediately:

Company A's documents can be retrieved by Company B's users
Financial data can be pulled by any employee
HR policies are visible to contractors
One user hammers the API and takes down the service for everyone else

Production enterprise RAG needs four layers (multi-tenancy, access control, caching, rate limiting), applied to each request in this order:

Incoming request
  ↓ rate limit check   — is this user still within quota?
  ↓ cache lookup       — has this question been answered before?
  ↓ tenant routing     — which knowledge base?
  ↓ permission filter  — within that KB, what can this user see?
  ↓ retrieve + generate — answer from authorized content only
  ↓ cache write        — store for next time

This article implements each layer.

Layer 1: Multi-Tenancy

Strategy: one Qdrant Collection per tenant

Each customer or department gets its own Qdrant Collection. Collections are isolated from one another: a query against globex_corp cannot return acme_corp's content, because each Collection holds its own points and its own index.

from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

qdrant_client = QdrantClient(":memory:")   # production: host="qdrant-server"

tenant_stores: dict[str, QdrantVectorStore] = {}

for tenant_id, docs in TENANT_DOCS.items():
    qdrant_client.create_collection(
        collection_name=tenant_id,
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # size must match the embedding model's output dimension
    )
    store = QdrantVectorStore(
        client=qdrant_client,
        collection_name=tenant_id,
        embedding=embeddings,
    )
    store.add_documents(docs)
    tenant_stores[tenant_id] = store

Routing is trivial — the request carries a tenant_id, the service selects the matching store:

def get_retriever(tenant_id: str, role: str, k: int = 3):
    if tenant_id not in tenant_stores:
        raise ValueError(f"Unknown tenant: {tenant_id}")
    store = tenant_stores[tenant_id]
    # permission filter added in Layer 2
    ...

Why not a shared Collection with a tenant_id metadata filter?

It works technically, but carries a risk: a filter bug means Tenant A's data leaks to Tenant B. There's no hard boundary. Collection-level isolation also makes teardown clean — removing a tenant means dropping their Collection, with no residue.

For soft isolation (departments within one company), metadata filtering is fine. For hard isolation (different customers), separate Collections are safer.

Layer 2: Access Control

Strategy: documents carry access_level; retrieval injects a Qdrant filter

Each document declares its access level in metadata:

from langchain_core.documents import Document

Document(
    page_content="Annual bonus: S-tier 3 months, A-tier 2 months...",
    metadata={"source": "hr-policy", "access_level": "hr_only"},
)
Document(
    page_content="Robot control system: EtherCAT bus, latency <1ms...",
    metadata={"source": "robot-spec", "access_level": "engineering_only"},
)

Roles map to the access levels they can see:

ROLE_PERMISSIONS: dict[str, list[str]] = {
    "admin":    ["public", "engineering_only", "hr_only", "finance_only"],
    "engineer": ["public", "engineering_only"],
    "hr":       ["public", "hr_only"],
    "finance":  ["public", "finance_only"],
    "employee": ["public"],
}

At retrieval time, the role's allowed levels become a Qdrant MatchAny filter:

from qdrant_client.models import Filter, FieldCondition, MatchAny

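def get_retriever(tenant_id: str, role: str, k: int = 3):
    # Tenant routing from Layer 1: reject tenants that have no Collection
    if tenant_id not in tenant_stores:
        raise ValueError(f"Unknown tenant: {tenant_id}")
    # Unknown roles fall back to public documents only
    levels = ROLE_PERMISSIONS.get(role, ["public"])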

    access_filter = Filter(
        must=[
            FieldCondition(
                key="metadata.access_level",
                match=MatchAny(any=levels),
            )
        ]
    )
    return tenant_stores[tenant_id].as_retriever(
        search_kwargs={"k": k, "filter": access_filter}
    )

This filter executes at the vector database layer, not the application layer. Unauthorized documents never leave the database — they aren't returned to the application, so there's nothing to leak.
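The effect of the filter can be checked without a running Qdrant instance. The sketch below is a pure-Python stand-in for the MatchAny condition, not Qdrant itself. The `visible_sources` helper and the sample `DOCS` list are illustrative assumptions that mirror the metadata and role table shown above.

```python
ROLE_PERMISSIONS: dict[str, list[str]] = {
    "admin":    ["public", "engineering_only", "hr_only", "finance_only"],
    "engineer": ["public", "engineering_only"],
    "hr":       ["public", "hr_only"],
    "finance":  ["public", "finance_only"],
    "employee": ["public"],
}

# Hypothetical sample metadata, mirroring the documents used in this article
DOCS = [
    {"source": "company-intro",    "access_level": "public"},
    {"source": "robot-spec",       "access_level": "engineering_only"},
    {"source": "hr-policy",        "access_level": "hr_only"},
    {"source": "financial-report", "access_level": "finance_only"},
]


def visible_sources(role: str) -> list[str]:
    """Emulate MatchAny: keep a doc only if its access_level is in the role's list."""
    levels = ROLE_PERMISSIONS.get(role, ["public"])  # unknown roles see public docs only
    return [d["source"] for d in DOCS if d["access_level"] in levels]


print(visible_sources("engineer"))    # public + engineering docs
print(visible_sources("hr"))          # public + HR docs
print(visible_sources("contractor"))  # role not in the table: public only
```

Falling back to public-only access for unknown roles is a deny-by-default choice: a typo in a role name reduces what the user sees rather than widening it.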

Layer 3: Caching

Strategy: (tenant_id, role, question) as cache key, TTL 300 seconds

import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CacheEntry:
    answer: str
    created_at: float = field(default_factory=time.time)

class QueryCache:
    def __init__(self, ttl_seconds: int = 300):
        self._store: dict[tuple, CacheEntry] = {}
        self._ttl = ttl_seconds

    def get(self, tenant_id, role, question) -> Optional[str]:
        entry = self._store.get((tenant_id, role, question.strip().lower()))
        if entry and (time.time() - entry.created_at) < self._ttl:
            return entry.answer
        return None

    def set(self, tenant_id, role, question, answer) -> None:
        self._store[(tenant_id, role, question.strip().lower())] = CacheEntry(answer)

Including role in the cache key matters: an engineer and an HR manager asking the same question get different contexts (different documents pass the permission filter), so they may get different answers. Cache entries are not cross-role reusable.
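To see the role-scoped key and the TTL in action, here is a self-contained variant of the cache. It is a sketch, not the repository's code: the injectable `clock` parameter and the eviction of expired entries on read are additions made here so that expiry can be demonstrated without waiting 300 seconds.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CacheEntry:
    answer: str
    created_at: float


class QueryCache:
    """Same keying as the cache above, with an injectable clock (an assumption)."""

    def __init__(self, ttl_seconds: int = 300, clock: Callable[[], float] = time.time):
        self._store: dict[tuple[str, str, str], CacheEntry] = {}
        self._ttl = ttl_seconds
        self._clock = clock

    @staticmethod
    def _key(tenant_id: str, role: str, question: str) -> tuple[str, str, str]:
        # Normalize whitespace and case so trivially different spellings share an entry
        return (tenant_id, role, question.strip().lower())

    def get(self, tenant_id: str, role: str, question: str) -> Optional[str]:
        key = self._key(tenant_id, role, question)
        entry = self._store.get(key)
        if entry is None:
            return None
        if self._clock() - entry.created_at >= self._ttl:
            del self._store[key]  # drop the expired entry instead of keeping it forever
            return None
        return entry.answer

    def set(self, tenant_id: str, role: str, question: str, answer: str) -> None:
        self._store[self._key(tenant_id, role, question)] = CacheEntry(answer, self._clock())


# A fake clock makes the TTL observable immediately
now = [1000.0]
cache = QueryCache(ttl_seconds=300, clock=lambda: now[0])
cache.set("acme_corp", "hr", "What is the annual bonus?", "S-tier 3 months, A-tier 2 months")

hit = cache.get("acme_corp", "hr", "  what is the annual bonus? ")          # same role, normalized text
other_role = cache.get("acme_corp", "engineer", "What is the annual bonus?")  # different role: miss
now[0] += 301
expired = cache.get("acme_corp", "hr", "What is the annual bonus?")          # past the TTL: miss

print(hit, other_role, expired)
```

The engineer's lookup misses even though the question text is identical, which is exactly what keeps an HR-only answer from being served to another role.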

Layer 4: Rate Limiting

Strategy: sliding window, 5 requests per user per minute

import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, max_requests: int = 5, window_seconds: int = 60):
        self._max = max_requests
        self._window = window_seconds
        self._log: dict[str, list[float]] = defaultdict(list)

    def allow(self, user_id: str) -> bool:
        now = time.time()
        self._log[user_id] = [t for t in self._log[user_id]
                               if now - t < self._window]
        if len(self._log[user_id]) >= self._max:
            return False
        self._log[user_id].append(now)
        return True

Sliding window vs. fixed window: a fixed window allows bursting at boundaries. A user can send 5 requests at second 59 and 5 more at second 61, getting 10 requests through in about two seconds, because the counter resets at second 60. A sliding window enforces the limit across any 60-second interval.
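The boundary burst is easy to reproduce. The sketch below compares a fixed-window counter with a sliding-window log on the same ten timestamps. Both classes are illustrative: `FixedWindowLimiter` is not part of the article's code, and both take `now` as an argument (an assumption made here) so the timing is deterministic.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Counts requests per aligned window: 0-59s, 60-119s, and so on."""

    def __init__(self, max_requests: int = 5, window_seconds: int = 60):
        self._max = max_requests
        self._window = window_seconds
        self._counts: dict[tuple[str, int], int] = defaultdict(int)

    def allow(self, user_id: str, now: float) -> bool:
        bucket = (user_id, int(now // self._window))  # counter resets at each boundary
        if self._counts[bucket] >= self._max:
            return False
        self._counts[bucket] += 1
        return True


class SlidingWindowLimiter:
    """Keeps a log of timestamps and counts those within the last window."""

    def __init__(self, max_requests: int = 5, window_seconds: int = 60):
        self._max = max_requests
        self._window = window_seconds
        self._log: dict[str, list[float]] = defaultdict(list)

    def allow(self, user_id: str, now: float) -> bool:
        recent = [t for t in self._log[user_id] if now - t < self._window]
        self._log[user_id] = recent
        if len(recent) >= self._max:
            return False
        recent.append(now)
        return True


# 5 requests at t=59s, then 5 more at t=61s
timestamps = [59.0] * 5 + [61.0] * 5

fixed = FixedWindowLimiter()
sliding = SlidingWindowLimiter()
fixed_allowed = sum(fixed.allow("alice", t) for t in timestamps)
sliding_allowed = sum(sliding.allow("alice", t) for t in timestamps)

print(fixed_allowed, sliding_allowed)  # 10 5
```

The fixed window admits all ten requests because the second batch lands in a fresh bucket. The sliding window still sees the first five in its log and rejects the rest.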

Experiment Results

Scenario A: Normal Retrieval

Engineer alice queries company info and technical docs:

Q: What type of company is ACME Corp?
A: ACME Corp is a smart manufacturing company.
Sources: [company-intro, robot-spec]   ← public + engineering docs, correct
elapsed: 995ms

Q: What communication protocol does ACME's robot system use?
A: ACME Corp's robot control system uses the EtherCAT real-time bus.
Sources: [company-intro, robot-spec]   ← engineering doc correctly retrieved
elapsed: 1709ms

Scenario B: Permission Filtering

The key thing to read here is the sources array — not whether docs_retrieved > 0:

[B1] Engineer alice asks about annual bonus (hr_only doc):
  Sources: [company-intro, robot-spec]   ← hr-policy is NOT in sources
  A: The reference material does not contain information about the bonus policy.

[B2] HR bob asks about net profit (finance_only doc):
  Sources: [company-intro, hr-policy]    ← financial-report is NOT in sources
  A: The reference material does not contain ACME's 2025 net profit.

[B3] HR bob asks about annual leave (hr_only doc):
  Sources: [company-intro, hr-policy]    ← hr-policy correctly appears
  A: Year 1: 12 days. Each additional year: +2 days. Maximum: 20 days.

What access control actually looks like in practice: hr-policy never appears in alice's sources list; financial-report never appears in bob's sources list. The Qdrant filter intercepts these documents at the database layer. The LLM never receives them, so it correctly responds that the information isn't available.

This is the right behavior: users still get documents they can access (public + their role-specific docs); only the restricted documents are absent.

Scenario C: Tenant Isolation

[C1] Globex user charlie asks about ACME Corp's headcount:
  Tenant: globex_corp
  Sources: [products, company-intro]   ← these are Globex's own docs
  A: The reference material does not contain ACME's employee count.

[C2] Globex user queries their own product lines:
  Sources: [company-intro, products]   ← Globex docs correctly returned
  A: GlexCloud, GlexAnalytics, GlexAI...

Charlie is querying the globex_corp Collection for ACME Corp information. Nothing about ACME comes back, because ACME's content does not exist in Globex's Collection; the retriever can only return Globex's own documents.

Scenario D: Cache Hit

First request (Scenario A1): 995ms, cache_hit=false
Same question repeated:        0ms, cache_hit=true

The reported 0ms (rounded; a dictionary lookup takes microseconds) means the repeated request skipped both retrieval and LLM generation entirely. For frequently repeated questions, such as company policy, common workflows, and product FAQs, the savings compound quickly.

Scenario E: Rate Limiting

Config: 5 req / 60s / user

Request 1: allowed
Request 2: allowed
Request 3: allowed
Request 4: allowed
Request 5: allowed
Request 6: RATE LIMITED   ← limit enforced
Request 7: RATE LIMITED

The rate limiter correctly allowed 5 and blocked 2 out of 7 requests.

FastAPI Service Layer

The four layers above are wired together in a single query() function, then exposed via FastAPI:

from fastapi import FastAPI, HTTPException, Header
from pydantic import BaseModel

app = FastAPI(title="Enterprise RAG Service")

class QueryRequest(BaseModel):
    tenant_id: str
    question: str

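# Plain def rather than async def: query() is a blocking call, and FastAPI runs
# sync endpoints in a threadpool instead of on the event loop.
@app.post("/query")
def query_endpoint(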
    req: QueryRequest,
    x_user_id: str = Header(...),     # user identity from request header
    x_user_role: str = Header(...),   # user role from request header
):
    result = query(
        tenant_id=req.tenant_id,
        user_id=x_user_id,
        role=x_user_role,
        question=req.question,
    )
    if result.rate_limited:
        raise HTTPException(status_code=429, detail="Too many requests")
    return {
        "answer":    result.answer,
        "sources":   result.sources,
        "cache_hit": result.cache_hit,
    }

Start with: uvicorn enterprise_rag:app --host 0.0.0.0 --port 8080
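The query() function that the endpoint calls is not shown in this article. A minimal sketch of how it might chain the layers, in the order of the request pipeline from the introduction, could look like this. The `QueryResult` fields are taken from how the endpoint uses the result; everything else (the keyword-only dependencies, the plain dict cache, the `fake_rag` stub) is an assumption for illustration, and the repository's version may differ.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class QueryResult:
    answer: str = ""
    sources: list[str] = field(default_factory=list)
    cache_hit: bool = False
    rate_limited: bool = False


def query(
    tenant_id: str,
    user_id: str,
    role: str,
    question: str,
    *,
    allow: Callable[[str], bool],  # e.g. RateLimiter.allow
    cache: dict,                   # stand-in for QueryCache
    retrieve_and_generate: Callable[[str, str, str], tuple[str, list[str]]],
) -> QueryResult:
    # 1. Rate limit check: reject before doing any work
    if not allow(user_id):
        return QueryResult(rate_limited=True)

    # 2. Cache lookup, keyed by tenant, role and normalized question
    key = (tenant_id, role, question.strip().lower())
    if key in cache:
        answer, sources = cache[key]
        return QueryResult(answer=answer, sources=sources, cache_hit=True)

    # 3-5. Tenant routing, permission filter, retrieval and generation
    answer, sources = retrieve_and_generate(tenant_id, role, question)

    # 6. Cache write for the next identical request
    cache[key] = (answer, sources)
    return QueryResult(answer=answer, sources=sources)


# Stubs so the flow can be exercised without Qdrant or an LLM
calls: list[str] = []

def fake_rag(tenant_id: str, role: str, question: str) -> tuple[str, list[str]]:
    calls.append(question)
    return "ACME Corp is a smart manufacturing company.", ["company-intro"]

budget = {"alice": 2}

def allow(user_id: str) -> bool:
    if budget.get(user_id, 0) <= 0:
        return False
    budget[user_id] -= 1
    return True

cache: dict = {}
args = ("acme_corp", "alice", "engineer", "What type of company is ACME Corp?")
deps = dict(allow=allow, cache=cache, retrieve_and_generate=fake_rag)

first = query(*args, **deps)   # miss: runs the pipeline
second = query(*args, **deps)  # hit: served from cache
third = query(*args, **deps)   # quota exhausted
print(first.cache_hit, second.cache_hit, third.rate_limited, len(calls))
```

Because the rate limit check runs before the cache lookup, cache hits still count against the user's quota. Swapping the two steps would make cached answers free, at the cost of letting a client hammer the cache without limit.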

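In production, x_user_id and x_user_role should come from a decoded JWT whose signature has been verified, not from raw client headers. The same applies to tenant_id, which this demo reads from the client-supplied request body: a caller who can choose the tenant can bypass tenant isolation.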

Production Upgrade Path

Component	Demo Implementation	Production Replacement
Qdrant	`:memory:`	Dedicated server, `host="qdrant-server"`
Cache	In-process dict	Redis (distributed, persistent, TTL native)
Rate limiter	In-process counter	Redis + sliding-window Lua script (safe across instances)
User identity	Raw Header	JWT token decode + signature verification
Logging	print()	Structured logs + alerting on LLM call volume / latency / errors

Full Code

Complete code is open-sourced at:

https://github.com/chendongqi/llm-in-action/tree/main/20-enterprise-rag

Key file:

enterprise_rag.py — full implementation: multi-tenancy + access control + cache + rate limiting + FastAPI + scenario verification

How to run:

git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/20-enterprise-rag
cp .env.example .env
pip install -r requirements.txt
python enterprise_rag.py

Summary

This article implemented a four-layer enterprise RAG architecture. Key findings:

Collection-level tenant isolation — separate Qdrant Collections per tenant provide a physical boundary; metadata filtering alone offers no hard guarantee
Permissions enforced at the DB layer — Qdrant's MatchAny filter means restricted documents never leave the database; there's nothing for the application to leak
Cache key must include role — same question, different role → different context → potentially different answer; cross-role cache reuse produces wrong results
Sliding window beats fixed window — eliminates boundary bursting; any 60-second interval is bounded, not just aligned windows
Access control is about absence — users see the documents they're allowed to see; restricted documents simply don't appear in sources; the LLM correctly reports "no information available" for what it never received

The gap between a RAG demo and a RAG production system is mostly engineering, not algorithms.

DEV Community