The Gap Between Demo and Production
Every article in this series has shared one architectural assumption: a single vector store, accessible to everyone, returning any document to any user.
That works in a demo. In an enterprise environment, it breaks immediately:
- Company A's documents can be retrieved by Company B's users
- Financial data can be pulled by any employee
- HR policies are visible to contractors
- One user hammers the API and takes down the service for everyone else
Production enterprise RAG needs four layers (multi-tenancy, access control, caching, rate limiting), which combine into this request flow:
Incoming request
↓ rate limit check — is this user still within quota?
↓ cache lookup — has this question been answered before?
↓ tenant routing — which knowledge base?
↓ permission filter — within that KB, what can this user see?
↓ retrieve + generate — answer from authorized content only
↓ cache write — store for next time
This article implements each layer.
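The order of the checks above can be sketched as one function. This is a minimal illustration rather than the article's actual query() implementation: the Pipeline class, the stubbed answer_for callable, and the QueryResult fields are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class QueryResult:
    answer: Optional[str] = None
    cache_hit: bool = False
    rate_limited: bool = False


@dataclass
class Pipeline:
    # Hypothetical stand-ins for the real layers built later in the article.
    allow: Callable[[str], bool]                # rate limiter
    answer_for: Callable[[str, str, str], str]  # routing + filter + retrieve + generate
    cache: dict = field(default_factory=dict)

    def query(self, tenant_id: str, user_id: str, role: str, question: str) -> QueryResult:
        # 1. Rate limit first, so rejected requests cost nothing downstream.
        if not self.allow(user_id):
            return QueryResult(rate_limited=True)
        # 2. Cache lookup, keyed by tenant and role as well as the question.
        key = (tenant_id, role, question.strip().lower())
        if key in self.cache:
            return QueryResult(answer=self.cache[key], cache_hit=True)
        # 3-5. Tenant routing, permission filter, retrieval and generation.
        answer = self.answer_for(tenant_id, role, question)
        # 6. Cache write for the next identical request.
        self.cache[key] = answer
        return QueryResult(answer=answer)


calls = []
pipeline = Pipeline(
    allow=lambda user_id: user_id != "blocked",
    answer_for=lambda t, r, q: calls.append((t, r, q)) or f"answer for {t}/{r}",
)
first = pipeline.query("acme_corp", "alice", "engineer", "What is ACME?")
second = pipeline.query("acme_corp", "alice", "engineer", "what is acme? ")
denied = pipeline.query("acme_corp", "blocked", "engineer", "What is ACME?")
```

Here the second call is served from the cache without invoking answer_for again, and the blocked user is rejected before any cache or retrieval work happens.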
Layer 1: Multi-Tenancy
Strategy: one Qdrant Collection per tenant
Each customer or department gets its own Qdrant Collection. Collections are physically isolated — you can't search acme_corp's content by querying globex_corp, because the two collections are entirely separate vector spaces.
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Assumed to be defined elsewhere in the script:
#   TENANT_DOCS: dict mapping tenant_id -> list of Documents
#   embeddings:  an embedding model whose output dimension is 1024
qdrant_client = QdrantClient(":memory:")  # production: host="qdrant-server"
tenant_stores: dict[str, QdrantVectorStore] = {}

for tenant_id, docs in TENANT_DOCS.items():
    # One collection per tenant; size must match the embedding dimension.
    qdrant_client.create_collection(
        collection_name=tenant_id,
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    )
    store = QdrantVectorStore(
        client=qdrant_client,
        collection_name=tenant_id,
        embedding=embeddings,
    )
    store.add_documents(docs)
    tenant_stores[tenant_id] = store
Routing is trivial — the request carries a tenant_id, the service selects the matching store:
def get_retriever(tenant_id: str, role: str, k: int = 3):
    if tenant_id not in tenant_stores:
        raise ValueError(f"Unknown tenant: {tenant_id}")
    store = tenant_stores[tenant_id]
    # permission filter added in Layer 2
    ...
Why not a shared Collection with a tenant_id metadata filter?
It works technically, but carries a risk: a filter bug means Tenant A's data leaks to Tenant B. There's no hard boundary. Collection-level isolation also makes teardown clean — removing a tenant means dropping their Collection, with no residue.
For soft isolation (departments within one company), metadata filtering is fine. For hard isolation (different customers), separate Collections are safer.
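The difference between the two isolation styles can be shown with a toy model. Plain Python lists stand in for vector collections here, and the function names and document texts are invented for the illustration; nothing below touches Qdrant.

```python
# Toy model: plain Python containers stand in for vector collections.
SHARED = [
    {"tenant_id": "acme_corp", "text": "acme internal doc"},
    {"tenant_id": "globex_corp", "text": "globex internal doc"},
]
SEPARATE = {
    "acme_corp": ["acme internal doc"],
    "globex_corp": ["globex internal doc"],
}


def search_shared(tenant_id=None):
    # Soft isolation: every call site must remember to pass tenant_id.
    return [d["text"] for d in SHARED if tenant_id is None or d["tenant_id"] == tenant_id]


def search_separate(tenant_id):
    # Hard isolation: the tenant selects the collection itself.
    return list(SEPARATE[tenant_id])


leaked = search_shared()                   # filter forgotten by a buggy caller
isolated = search_separate("globex_corp")  # can only ever see Globex documents
```

With the shared list, one forgotten argument returns every tenant's documents. With separate containers there is no call that returns another tenant's data by accident, which is the property the per-tenant Collection design relies on.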
Layer 2: Access Control
Strategy: documents carry access_level; retrieval injects a Qdrant filter
Each document declares its access level in metadata:
from langchain_core.documents import Document

Document(
    page_content="Annual bonus: S-tier 3 months, A-tier 2 months...",
    metadata={"source": "hr-policy", "access_level": "hr_only"},
)
Document(
    page_content="Robot control system: EtherCAT bus, latency <1ms...",
    metadata={"source": "robot-spec", "access_level": "engineering_only"},
)
Roles map to the access levels they can see:
ROLE_PERMISSIONS: dict[str, list[str]] = {
    "admin": ["public", "engineering_only", "hr_only", "finance_only"],
    "engineer": ["public", "engineering_only"],
    "hr": ["public", "hr_only"],
    "finance": ["public", "finance_only"],
    "employee": ["public"],
}
At retrieval time, the role's allowed levels become a Qdrant MatchAny filter:
from qdrant_client.models import Filter, FieldCondition, MatchAny

def get_retriever(tenant_id: str, role: str, k: int = 3):
    # Tenant routing from Layer 1
    if tenant_id not in tenant_stores:
        raise ValueError(f"Unknown tenant: {tenant_id}")
    # Unknown roles fall back to public-only access
    levels = ROLE_PERMISSIONS.get(role, ["public"])
    access_filter = Filter(
        must=[
            FieldCondition(
                key="metadata.access_level",
                match=MatchAny(any=levels),
            )
        ]
    )
    return tenant_stores[tenant_id].as_retriever(
        search_kwargs={"k": k, "filter": access_filter}
    )
This filter executes at the vector database layer, not the application layer. Unauthorized documents never leave the database — they aren't returned to the application, so there's nothing to leak.
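The effect of the filter can be modeled in plain Python, which is handy for unit-testing the role table without a running database. The sketch below is only a model of the matching rule ("access_level is one of the role's levels"), not how Qdrant evaluates the filter internally. The visible_sources helper is hypothetical, and the access levels for company-intro and financial-report are inferred from the scenarios later in the article.

```python
ROLE_PERMISSIONS = {
    "admin": ["public", "engineering_only", "hr_only", "finance_only"],
    "engineer": ["public", "engineering_only"],
    "hr": ["public", "hr_only"],
    "finance": ["public", "finance_only"],
    "employee": ["public"],
}

# Metadata only; the page content is irrelevant to the permission check.
DOCS = [
    {"source": "company-intro", "access_level": "public"},
    {"source": "robot-spec", "access_level": "engineering_only"},
    {"source": "hr-policy", "access_level": "hr_only"},
    {"source": "financial-report", "access_level": "finance_only"},
]


def visible_sources(role: str) -> list[str]:
    """Return the sources a role may retrieve, mirroring the MatchAny rule."""
    levels = ROLE_PERMISSIONS.get(role, ["public"])  # unknown roles: public only
    return [d["source"] for d in DOCS if d["access_level"] in levels]


engineer_view = visible_sources("engineer")
hr_view = visible_sources("hr")
unknown_view = visible_sources("contractor")
```

A role that is missing from the table, such as the hypothetical "contractor" above, sees only public documents, so a typo in a role name fails closed rather than open.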
Layer 3: Caching
Strategy: (tenant_id, role, question) as cache key, TTL 300 seconds
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CacheEntry:
    answer: str
    created_at: float = field(default_factory=time.time)

class QueryCache:
    def __init__(self, ttl_seconds: int = 300):
        self._store: dict[tuple, CacheEntry] = {}
        self._ttl = ttl_seconds

    def get(self, tenant_id, role, question) -> Optional[str]:
        entry = self._store.get((tenant_id, role, question.strip().lower()))
        # Entries older than the TTL are treated as misses
        if entry and (time.time() - entry.created_at) < self._ttl:
            return entry.answer
        return None

    def set(self, tenant_id, role, question, answer) -> None:
        self._store[(tenant_id, role, question.strip().lower())] = CacheEntry(answer)
Including role in the cache key matters: an engineer and an HR manager asking the same question get different contexts (different documents pass the permission filter), so they may get different answers. Cache entries are not cross-role reusable.
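A small experiment makes the key behavior concrete. TinyCache below is a simplified stand-in for QueryCache with an injectable clock, which is an assumption added so that TTL expiry can be shown without sleeping; the question and answer strings are illustrative.

```python
import time
from typing import Callable, Optional


class TinyCache:
    """Simplified stand-in for QueryCache with an injectable clock."""

    def __init__(self, ttl_seconds: float = 300, clock: Callable[[], float] = time.time):
        self._store: dict = {}
        self._ttl = ttl_seconds
        self._clock = clock

    def _key(self, tenant_id: str, role: str, question: str) -> tuple:
        # Normalizing case and whitespace lets trivially different phrasings share an entry.
        return (tenant_id, role, question.strip().lower())

    def get(self, tenant_id: str, role: str, question: str) -> Optional[str]:
        entry = self._store.get(self._key(tenant_id, role, question))
        if entry and (self._clock() - entry[1]) < self._ttl:
            return entry[0]
        return None

    def set(self, tenant_id: str, role: str, question: str, answer: str) -> None:
        self._store[self._key(tenant_id, role, question)] = (answer, self._clock())


now = [1000.0]  # fake clock, advanced manually
cache = TinyCache(ttl_seconds=300, clock=lambda: now[0])
cache.set("acme_corp", "hr", "What is the bonus policy?", "S-tier: 3 months")

same_role = cache.get("acme_corp", "hr", "  what is the BONUS policy? ")
other_role = cache.get("acme_corp", "engineer", "What is the bonus policy?")
now[0] += 301  # move past the TTL
expired = cache.get("acme_corp", "hr", "What is the bonus policy?")
```

The HR user hits the cache despite the different casing and whitespace, the engineer misses because the role is part of the key, and the entry stops being served once the TTL has passed.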
Layer 4: Rate Limiting
Strategy: sliding window, 5 requests per user per minute
import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, max_requests: int = 5, window_seconds: int = 60):
        self._max = max_requests
        self._window = window_seconds
        self._log: dict[str, list[float]] = defaultdict(list)

    def allow(self, user_id: str) -> bool:
        now = time.time()
        # Keep only the timestamps that still fall inside the window
        self._log[user_id] = [t for t in self._log[user_id]
                              if now - t < self._window]
        if len(self._log[user_id]) >= self._max:
            return False
        self._log[user_id].append(now)
        return True
Sliding window vs. fixed window: a fixed window allows bursting at boundaries. With one-minute windows, a user can send 5 requests at second 59 and 5 more at second 61, getting 10 requests through in about 2 seconds. A sliding window enforces the limit across any 60-second interval.
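The boundary burst is easy to reproduce with explicit timestamps. The two helper functions below are written for this comparison only and are not part of the article's code; they count how many requests each strategy would admit for a given list of arrival times.

```python
def fixed_window_allowed(timestamps, max_requests=5, window=60):
    """Count admitted requests when the counter resets every `window` seconds."""
    counts = {}
    allowed = 0
    for t in timestamps:
        bucket = int(t // window)  # aligned window index: 0-59s, 60-119s, ...
        if counts.get(bucket, 0) < max_requests:
            counts[bucket] = counts.get(bucket, 0) + 1
            allowed += 1
    return allowed


def sliding_window_allowed(timestamps, max_requests=5, window=60):
    """Count admitted requests when only the trailing `window` seconds are considered."""
    log = []
    allowed = 0
    for t in timestamps:
        log = [s for s in log if t - s < window]  # drop entries older than the window
        if len(log) < max_requests:
            log.append(t)
            allowed += 1
    return allowed


# Five requests just before the minute boundary, five just after.
burst = [59.0] * 5 + [61.0] * 5
fixed = fixed_window_allowed(burst)
sliding = sliding_window_allowed(burst)
```

For this burst the fixed window admits all 10 requests because they fall into two different aligned minutes, while the sliding window admits only the first 5.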
Experiment Results
Scenario A: Normal Retrieval
Engineer alice queries company info and technical docs:
Q: What type of company is ACME Corp?
A: ACME Corp is a smart manufacturing company.
Sources: [company-intro, robot-spec] ← public + engineering docs, correct
elapsed: 995ms
Q: What communication protocol does ACME's robot system use?
A: ACME Corp's robot control system uses the EtherCAT real-time bus.
Sources: [company-intro, robot-spec] ← engineering doc correctly retrieved
elapsed: 1709ms
Scenario B: Permission Filtering
The key thing to read here is the sources array — not whether docs_retrieved > 0:
[B1] Engineer alice asks about annual bonus (hr_only doc):
Sources: [company-intro, robot-spec] ← hr-policy is NOT in sources
A: The reference material does not contain information about the bonus policy.
[B2] HR bob asks about net profit (finance_only doc):
Sources: [company-intro, hr-policy] ← financial-report is NOT in sources
A: The reference material does not contain ACME's 2025 net profit.
[B3] HR bob asks about annual leave (hr_only doc):
Sources: [company-intro, hr-policy] ← hr-policy correctly appears
A: Year 1: 12 days. Each additional year: +2 days. Maximum: 20 days.
What access control actually looks like in practice: hr-policy never appears in alice's sources list; financial-report never appears in bob's sources list. The Qdrant filter intercepts these documents at the database layer. The LLM never receives them, so it correctly responds that the information isn't available.
This is the right behavior: users still get documents they can access (public + their role-specific docs); only the restricted documents are absent.
Scenario C: Tenant Isolation
[C1] Globex user charlie asks about ACME Corp's headcount:
Tenant: globex_corp
Sources: [products, company-intro] ← these are Globex's own docs
A: The reference material does not contain ACME's employee count.
[C2] Globex user queries their own product lines:
Sources: [company-intro, products] ← Globex docs correctly returned
A: GlexCloud, GlexAnalytics, GlexAI...
Charlie is querying the globex_corp Collection for ACME Corp information. Of course nothing comes back — ACME's content doesn't physically exist in Globex's Collection.
Scenario D: Cache Hit
First request (Scenario A1): 995ms, cache_hit=false
Same question repeated: 0ms, cache_hit=true
0ms means the repeated request skipped both retrieval and LLM generation entirely. For frequently repeated questions — company policy, common workflows, product FAQs — caching compounds quickly.
Scenario E: Rate Limiting
Config: 5 req / 60s / user
Request 1: allowed
Request 2: allowed
Request 3: allowed
Request 4: allowed
Request 5: allowed
Request 6: RATE LIMITED ← limit enforced
Request 7: RATE LIMITED
The rate limiter correctly allowed 5 and blocked 2 out of 7 requests.
FastAPI Service Layer
The four layers above are wired together in a single query() function, then exposed via FastAPI:
from fastapi import FastAPI, HTTPException, Header
from pydantic import BaseModel

app = FastAPI(title="Enterprise RAG Service")

class QueryRequest(BaseModel):
    tenant_id: str
    question: str

@app.post("/query")
async def query_endpoint(
    req: QueryRequest,
    x_user_id: str = Header(...),    # user identity from request header
    x_user_role: str = Header(...),  # user role from request header
):
    result = query(
        tenant_id=req.tenant_id,
        user_id=x_user_id,
        role=x_user_role,
        question=req.question,
    )
    if result.rate_limited:
        raise HTTPException(status_code=429, detail="Too many requests")
    return {
        "answer": result.answer,
        "sources": result.sources,
        "cache_hit": result.cache_hit,
    }
Start with: uvicorn enterprise_rag:app --host 0.0.0.0 --port 8080
In production, x_user_id and x_user_role should come from JWT token decoding, not raw client headers.
Production Upgrade Path
| Component | Demo Implementation | Production Replacement |
|---|---|---|
| Qdrant | :memory: | Dedicated server, host="qdrant-server" |
| Cache | In-process dict | Redis (distributed, persistent, TTL native) |
| Rate limiter | In-process counter | Redis + sliding-window Lua script (safe across instances) |
| User identity | Raw Header | JWT token decode + signature verification |
| Logging | print() | Structured logs + alerting on LLM call volume / latency / errors |
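As one possible shape for the structured-logging row, the sketch below emits one JSON object per query using only the standard library. The JsonFormatter class, the fields key passed through extra=, and the field names are assumptions for illustration, not the repository's logging setup.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "event": record.getMessage()}
        # `fields` is a hypothetical attribute supplied via `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("enterprise_rag")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info(
    "query_completed",
    extra={"fields": {"tenant_id": "acme_corp", "role": "engineer",
                      "cache_hit": False, "elapsed_ms": 995}},
)
line = json.loads(stream.getvalue())
```

Because every line is machine-parseable, alerting on latency or error counts becomes a query over fields such as elapsed_ms rather than a regex over print() output.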
Full Code
Complete code is open-sourced at:
https://github.com/chendongqi/llm-in-action/tree/main/20-enterprise-rag
Key file:
- enterprise_rag.py — full implementation: multi-tenancy + access control + cache + rate limiting + FastAPI + scenario verification
How to run:
git clone https://github.com/chendongqi/llm-in-action
cd llm-in-action/20-enterprise-rag
cp .env.example .env
pip install -r requirements.txt
python enterprise_rag.py
Summary
This article implemented a four-layer enterprise RAG architecture. Key findings:
- Collection-level tenant isolation — separate Qdrant Collections per tenant provide a physical boundary; metadata filtering alone offers no hard guarantee
- Permissions enforced at the DB layer — Qdrant's MatchAny filter means restricted documents never leave the database; there's nothing for the application to leak
- Cache key must include role — same question, different role → different context → potentially different answer; cross-role cache reuse produces wrong results
- Sliding window beats fixed window — eliminates boundary bursting; any 60-second interval is bounded, not just aligned windows
- Access control is about absence — users see the documents they're allowed to see; restricted documents simply don't appear in sources; the LLM correctly reports "no information available" for what it never received
The gap between a RAG demo and a RAG production system is mostly engineering, not algorithms.