We’ve all seen impressive GenAI demos. Yet, in day‑to‑day engineering, the questions are softer but more real: How do we keep answers trustworthy? How do we respect access boundaries without slowing teams down? This article offers a practical, human‑centered path—from raw documents and images to a secure, explainable knowledge layer—powered by session‑aware authorization (STS) and a simple agent + tools pattern.
Tone and structure are inspired by thoughtful architecture writing like “Architecture of AI‑Driven Systems” on Python Plain English, focusing on clarity, trade‑offs, and gentle guidance rather than hype.
Why This Matters
- RAG without authorization is a liability. Enterprise data needs session‑scoped controls, revocation, and auditability.
- Accuracy is not enough; answers must be explainable and reproducible across versions.
- Multimodal inputs (PDFs, images) require consistent ingestion and normalization before indexing.
Architecture Snapshot
- Knowledge Base Services: ingestion, chunking, embedding, indexing (vector + graph), retrieval, and an STS manager for authorization.
- Agent Services: an agent wrapper orchestrates LLMs, tools, and guardrails; file upload and history modules support UX continuity.
- Tool Services: domain tools (retriever, SQL, custom) invoked by agents.
Flow: Upload → Initialize → Read/Image2Text → Chunk → Embed → Index (Vector + Graph) → Retrieve → STS Filter → Agent Compose → Response with citations.
RAG Architecture:
*(Figure: end‑to‑end RAG architecture diagram.)*
- Image‑to‑Text: extract text from images, unify format.
- Initialization: bootstraps pipelines, configs, and version stamps.
Design tips:
- Normalize MIME and metadata early; downstream pipelines assume clean structure.
- Batch I/O with retries; track ingestion version to reproduce embeddings.
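The tips above can be sketched in a few lines. This is a minimal illustration, not a prescribed API: `ContentItem`, `normalize`, and the `INGESTION_VERSION` stamp are hypothetical names chosen for the example.

```python
import mimetypes
from dataclasses import dataclass, field

INGESTION_VERSION = "2024-06-01"  # hypothetical ingestion version stamp

@dataclass
class ContentItem:
    path: str
    raw: bytes
    mime: str = ""
    metadata: dict = field(default_factory=dict)

def normalize(item: ContentItem) -> ContentItem:
    # Canonicalize the MIME type early so downstream chunkers and
    # embedders can assume a clean, predictable structure.
    guessed, _ = mimetypes.guess_type(item.path)
    item.mime = (guessed or "application/octet-stream").lower()
    # Stamp the ingestion version so embeddings can be reproduced later.
    item.metadata.setdefault("ingestion_version", INGESTION_VERSION)
    return item

doc = normalize(ContentItem(path="report.pdf", raw=b"%PDF-1.7 ..."))
print(doc.mime)  # application/pdf
```

Stamping the version at ingestion time (rather than at query time) is what makes a later re-embedding run reproducible.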
Smart Chunking for Better Retrieval
- Chunking: semantic and rule‑based chunkers.
- Keep chunks small enough for LLM context but rich in metadata (section, page, hierarchy).
- Add relationship edges to support graph queries (e.g., section→subsection).
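A toy rule‑based chunker shows the idea: each chunk carries page and position metadata, and a parent→child edge is emitted for the graph index. The names (`Chunk`, `rule_based_chunks`) are illustrative, and the fixed‑width split stands in for a real semantic chunker.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict

def rule_based_chunks(pages, max_chars=500):
    """Split page texts into chunks, carrying page/position metadata
    and parent->child edges for later graph indexing."""
    chunks, edges = [], []
    for page_no, text in enumerate(pages, start=1):
        parent_id = f"page-{page_no}"
        for i in range(0, len(text), max_chars):
            chunk_id = f"{parent_id}-chunk-{i // max_chars}"
            chunks.append(Chunk(
                text=text[i:i + max_chars],
                metadata={"id": chunk_id, "page": page_no},
            ))
            edges.append((parent_id, "contains", chunk_id))  # graph edge
    return chunks, edges
```

The edges are cheap to produce at chunking time and become the raw material for the graph index described next.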
Embeddings + Dual Indexes
- Embeddings: choose model, normalize vectors, stamp versions.
- Vector Indexing: push to a vector store for semantic search.
- Graph Indexing: persist relationships and provenance.
Why two indexes?
- Vector search finds semantically related content.
- Graph retrieves lineage and context (citations, related sections), improving explainability.
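A toy in‑memory stand‑in makes the dual‑index idea concrete: one structure holds normalized vectors for semantic search, the other holds relationship edges for lineage. `DualIndex` is a sketch for this article, not a real vector or graph store.

```python
import math
from collections import defaultdict

class DualIndex:
    """Toy in-memory stand-in for a vector store plus a graph store."""
    def __init__(self):
        self.vectors = {}              # chunk_id -> unit vector
        self.edges = defaultdict(set)  # chunk_id -> related chunk ids

    def index(self, chunk_id, vec, related=()):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        self.vectors[chunk_id] = [x / norm for x in vec]  # normalize on write
        for other in related:
            self.edges[chunk_id].add(other)
            self.edges[other].add(chunk_id)  # provenance links both ways

    def semantic(self, query_vec, k=3):
        norm = math.sqrt(sum(x * x for x in query_vec)) or 1.0
        q = [x / norm for x in query_vec]
        scored = sorted(self.vectors.items(),
                        key=lambda kv: -sum(a * b for a, b in zip(q, kv[1])))
        return [cid for cid, _ in scored[:k]]

    def neighbors(self, chunk_id):
        return sorted(self.edges[chunk_id])
```

Vector search answers "what is similar?"; the `neighbors` lookup answers "where did it come from, and what sits next to it?", which is what citations need.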
Retrieval Orchestration
- Vector and Graph Retrievers: specialized retrievers.
- Hybrid Retrieval Orchestrator: fuses results from both stores.
Pattern:
- Try semantic (vector) for recall.
- Expand via graph for context and provenance.
- Fuse, rank, and return with metadata for STS filtering.
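One simple, well‑known way to implement the fusion step is Reciprocal Rank Fusion (RRF); the sketch below assumes each retriever returns a ranked list of chunk ids.

```python
def rrf_fuse(vector_hits, graph_hits, k=60):
    """Reciprocal Rank Fusion: combine two ranked id lists into one.
    Items ranked highly by either retriever float to the top."""
    scores = {}
    for hits in (vector_hits, graph_hits):
        for rank, doc_id in enumerate(hits):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["c1", "c2", "c3"], ["c2", "c4"])
print(fused)  # "c2" ranks first because it appears in both lists
```

RRF needs no score calibration between the two stores, which is exactly why it works well as a first fusion strategy before investing in learned rankers.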
STS‑Aware Authorization
- STS Manager: resolves session→permissions, applies policies, and filters retrieval candidates.
- Enforce authorization before the agent composes answers; never let tools see disallowed content.
Benefits:
- Session‑scoped access, policy revocation, and audit trails.
- Keeps disallowed content out of prompts entirely, reducing the blast radius of prompt injection and data leakage.
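A minimal filter sketch, assuming the STS manager has already resolved the session token into a set of scopes (the `Session`, `acl_scope`, and audit shapes here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Session:
    user: str
    scopes: frozenset  # resolved from the session token by the STS manager

def sts_filter(session, candidates):
    """Drop retrieval candidates the session is not entitled to see,
    *before* any of them can reach the agent's prompt."""
    allowed, audit = [], []
    for chunk in candidates:
        required = chunk.get("acl_scope", "public")
        if required == "public" or required in session.scopes:
            allowed.append(chunk)
        else:
            audit.append({"user": session.user, "denied": chunk["id"]})
    return allowed, audit  # the audit trail supports revocation reviews

sess = Session(user="alice", scopes=frozenset({"finance"}))
docs = [{"id": "d1", "acl_scope": "finance"}, {"id": "d2", "acl_scope": "hr"}]
allowed, audit = sts_filter(sess, docs)
print([d["id"] for d in allowed])  # ['d1']; 'd2' is logged, never prompted
```

The key property: denied chunks are recorded but never concatenated into context, so the model cannot be tricked into quoting them.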
Agents + Tools: The Execution Layer
- Agent Wrapper: wires LLM prompts, tools, and guardrails; manages tool selection.
- Tools: retriever and SQL tools for controlled data access.
- Compose answers with citations sourced from graph metadata.
Execution pattern:
- Agent decides → Tool executes → STS filters → Agent composes → Return answer + sources.
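The execution pattern can be sketched end to end. This is a deliberately simplified loop: tool choice is a keyword heuristic standing in for LLM‑driven selection, and all function and field names are invented for the example.

```python
def run_agent(query, tools, session, sts_filter):
    """Minimal agent loop: decide -> execute -> filter -> compose."""
    # A real agent would let the LLM choose from tool descriptions.
    tool_name = "sql" if query.lower().startswith("how many") else "retriever"
    candidates = tools[tool_name](query)
    authorized = sts_filter(session, candidates)    # gate before composing
    answer = " ".join(c["text"] for c in authorized)
    sources = [c["source"] for c in authorized]     # citations travel along
    return {"answer": answer, "sources": sources, "tool": tool_name}

tools = {
    "retriever": lambda q: [{"text": "RAG pairs retrieval with generation.",
                             "source": "handbook.pdf#p3"}],
    "sql": lambda q: [{"text": "42 rows.", "source": "warehouse.orders"}],
}
result = run_agent("What is RAG?", tools, session={"user": "alice"},
                   sts_filter=lambda s, c: c)  # pass-through filter for demo
print(result["tool"], result["sources"])
```

Note that the filter sits between tool execution and composition, so even a misbehaving tool cannot leak disallowed content into the final answer.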
Observability, Versioning, and Deletion
- knowledge_base_services/deletion/: right‑to‑be‑forgotten and data lifecycle.
- agent_services/history_services/: conversational trace for monitoring and explainability.
- Index/embedding version stamps to reproduce runs.
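Right‑to‑be‑forgotten only works if a deletion touches every store at once. A sketch, assuming chunk ids are prefixed by their document id (an invented convention for this example):

```python
def forget_document(doc_id, vector_store, graph_edges, history):
    """Purge a document's chunks from the vector store, the graph,
    and the conversational history in one pass."""
    removed = [cid for cid in list(vector_store) if cid.startswith(doc_id)]
    for cid in removed:
        vector_store.pop(cid)
        graph_edges.pop(cid, None)
    for edges in graph_edges.values():
        edges.difference_update(removed)       # drop dangling graph links
    history[:] = [t for t in history if t.get("source") not in removed]
    return removed  # log what was purged for the audit trail
```

Returning the purged ids gives the audit trail something concrete to record, which matters when deletion is a compliance obligation rather than a convenience.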
Example Flow (Generic Pseudo‑Code)
```python
# 1) Ingest + Normalize
content_items = reading.read_batch(files)
image_text = image_to_text.extract(images)
normalized = initializer.normalize(content_items + image_text)

# 2) Chunk + Embed
chunks = chunker.semantic(normalized)
vecs = embedder.batch_embed(chunks)

# 3) Index (Vector + Graph)
vector_index.write(vecs, chunks)
graph_index.link(chunks, relations=...)

# 4) Retrieve with STS filter
candidates = hybrid_retriever.search(query, k=10)
authorized = sts.filter(session, candidates)

# 5) Agent + Tools compose
answer = agent.run(
    query=query,
    tools=[retriever_tool, sql_tool],
    context=authorized,
    with_citations=True,
)
```
Gentle guidance: keep module names and interfaces simple. Start with clear, testable boundaries—ingest, chunk, embed, index, retrieve, filter, compose—and iterate. Good names reduce cognitive load and make onboarding kinder.
What to Showcase in the Post
- Ingestion dispatch by MIME and metadata (reading, image‑to‑text, initialization).
- Semantic chunker attaching rich metadata (chunking).
- Batched embeddings + vector indexing with versioned names (embeddings, vector index).
- Hybrid retrieval orchestrator with fusion and fallbacks (vector retriever, graph retriever).
- STS filter gating results before agent sees them (STS manager).
- Agent tool wiring and citation composition (agent wrapper, tools).
Benchmarks and Learnings
- Track latency across stages: ingestion, embedding, indexing, retrieval, STS filtering, agent composition.
- Measure precision@k and citation correctness.
- Common pitfalls: over‑aggressive chunking, stale embeddings after content updates, authorization drift.
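Precision@k is small enough to implement inline rather than pull in an eval framework; the helper below is one common definition (relevant hits in the top k, divided by k retrieved):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved ids that are actually relevant."""
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / max(len(top), 1)

score = precision_at_k(["c1", "c9", "c3"], relevant={"c1", "c3"}, k=3)
print(score)  # 2 of the 3 top hits are relevant -> ~0.667
```

Tracking this per embedding/index version is what surfaces the "stale embeddings after content updates" pitfall before users do.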
Quick Demo Hooks
Consider adding a minimal script that:
- Loads a sample doc + image.
- Runs ingestion→chunk→embed→index.
- Executes a hybrid retrieval for a test query.
- Applies STS filter for two different sessions.
- Prints answer with citations and filtered item counts.
Optional starting point:
```powershell
# Create a tiny virtual environment and run the demo (Windows PowerShell)
python -m venv .venv; .\.venv\Scripts\Activate.ps1
python demo/sts_rag_demo.py
```
Closing Checklist for Enterprise‑Grade RAG
- Ingestion discipline with consistent metadata.
- Chunking strategy matched to content structure.
- Dual index (vector + graph) for recall + explainability.
- Retrieval orchestration with fusion and fallbacks.
- STS enforcement before agent composition.
- Observability: versions, histories, and deletion paths.
About the Author
Written by Suraj Khaitan, Gen AI Architect working on serverless AI and cloud platforms.