Problem Statement:
Current State: CHAOS
500GB of documents
Hours to find answers
Losing $10K/day in productivity
ChatGPT can't access our private data
Your Solution: RAG System
Instant answers (< 1 second)
Accurate answers grounded in your own documents
Secure, private data
Save $300K/year
Your RAG Toolkit
Retrieval: Semantic search
Augmentation: Context injection
Generation: Smart responses
Task 1: Set Up Development Environment
- Installing Python libraries:
  ChromaDB - vector database
  Sentence-Transformers - embedding models
  Flask - web server
  OpenAI - LLM API
Purpose: Install all dependencies required for building RAG
Steps:
- cd /root && mkdir -p rag-project && cd rag-project
- python3 -m venv venv && source venv/bin/activate
- pip install uv && uv pip install chromadb sentence-transformers openai flask
- echo "READY" > /root/rag-setup-complete.txt
Explanations:
python3 -m venv venv: create a virtual environment in a folder named venv. A venv is a self-contained Python environment so dependencies don’t leak into the system Python.
source venv/bin/activate: activate that environment, so pip and python now refer to the virtual environment instead of the global system install.
pip install uv: installs uv, a modern Python package installer and resolver (much faster than regular pip).
uv pip install ...: uses uv as a drop-in replacement for pip to install the packages into the virtual environment:
chromadb: vector database for embeddings (used in RAG pipelines).
sentence-transformers: pretrained transformer models for turning text into embeddings.
openai: OpenAI’s official Python client library.
flask: lightweight web framework for serving APIs.
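To confirm everything installed correctly before moving on, here is a quick sanity check (a minimal sketch; the file name verify_setup.py is arbitrary):
from importlib.metadata import version

# Print the installed version of each dependency; a missing package raises
# PackageNotFoundError, which tells you the install step needs re-running.
for pkg in ("chromadb", "sentence-transformers", "openai", "flask"):
    print(pkg, version(pkg))
Run it inside the activated venv with python verify_setup.py.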
Task 2: Explore TechCorp's Document Vault
employee-handbook/
pet-policy.md (CEO's dog!)
remote-work-policy.md
benefits-overview.md
product-specs/
cloudsync-pro.md ($1M product)
datavault.md
meeting-notes/
q3-planning-meeting.md
product-launch-review.md
customer-faqs/
general-faqs.md
Total: the "500GB" archive, simulated here as a small set of focused docs
Purpose: Review all the documents before building RAG system
Steps:
- cd /root/techcorp-docs
- ls -la
- find . -name "*.md" | wc -l
- find . -name "*.md" | wc -l > /root/doc-count.txt
Task 3: Initialize Vector Database
ChromaDB Architecture
Documents → Vectors → Semantic Space
"pet policy" → [0.2,-0.5...]
"remote work" → [0.1,0.8...]
"product" → [0.9,0.3...]
384-dimensional semantic understanding
Purpose: Create AI brain for storing document vectors
Steps:
- Create init_vectordb.py:
import chromadb
from chromadb.config import Settings

print("Initializing AI Brain...")
client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)
collection = client.get_or_create_collection(
    name="techcorp_docs",
    metadata={"hnsw:space": "cosine"}  # cosine distance suits text embeddings
)
print(f"Brain created: {collection.name}")
print(f"Memories: {collection.count()}")
print("AI Brain ready!")
- Run it:
python init_vectordb.py
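Because PersistentClient writes to disk, the collection survives across runs. A quick way to confirm this (a minimal sketch you can run in a fresh Python session):
import chromadb

# Re-open the same on-disk database; nothing is re-created here
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
print(collection.count())  # still 0 until Task 6 ingests documents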
Task 4: Learn Document Chunking Strategy
Smart Chunking Strategy
Original Document (2000 chars)
Chunked (500 chars, 100 overlap)
↑ Overlaps preserve context across chunk boundaries, markedly improving retrieval accuracy
Purpose: Learn optimal chunking strategy BEFORE processing real documents
Steps:
- Create test_chunking.py:
print("DOCUMENT CHUNKING ENGINE")
print("=" * 40)

def chunk_text(text, size=500, overlap=100):
    """Smart chunking with overlap for context preservation."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step forward by size - overlap so consecutive chunks share context
        start += size - overlap
    return chunks

# Process sample document
sample_doc = """TechCorp Pet Policy:
Employees may bring pets to the office on Fridays.
Dogs must be well-behaved and vaccinated.
The CEO's golden retriever is the office mascot.
Remote Work Policy:
Employees can work remotely up to 3 days per week.
Core hours are 10 AM - 3 PM in your local timezone.
All meetings should be recorded for async collaboration.
Benefits Overview:
Comprehensive health insurance including dental and vision.
401k matching up to 6% of salary.
Unlimited PTO after first year.
Annual learning budget of $2,000."""

print(f"Original document: {len(sample_doc)} characters")
print("-" * 40)
chunks = chunk_text(sample_doc, size=500, overlap=100)
print(f"Created {len(chunks)} chunks")
print("-" * 40)
for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(f"Preview: {chunk[:60]}...")

# Save verification
with open('/root/chunk-test.txt', 'w') as f:
    f.write(f"CHUNKS:{len(chunks)}")

print("\n" + "=" * 40)
print("Chunking complete!")
print(f"Stats: {len(chunks)} chunks from {len(sample_doc)} chars")
print("Ready for vectorization!")
- Run it:
python test_chunking.py
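To see the arithmetic behind the slide above: with size=500 and overlap=100 the window advances by size - overlap = 400 characters, so a 2000-character document produces chunks starting at offsets 0, 400, 800, 1200, and 1600. A quick check you can append to test_chunking.py:
# Five chunks: four full 500-char windows plus a final 400-char remainder
chunks = chunk_text("x" * 2000, size=500, overlap=100)
print(len(chunks))               # 5
print([len(c) for c in chunks])  # [500, 500, 500, 500, 400]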
Task 5: Understand How Embeddings Work
Semantic Embedding Transformation
"Dogs allowed Fridays" → AI Model → 384D Vector
[0.23, -0.45, 0.67, ..., 0.12]
Semantic Similarity:
"Pets permitted" ↔ "Dogs allowed" = 92%
"Remote work" ↔ "Dogs allowed" = 18%
Purpose: Learn how AI converts text to math BEFORE processing real documents in Task 6
Steps:
- test_embeddings.py:
from sentence_transformers import SentenceTransformer
import numpy as np

print("Loading embedding model (all-MiniLM-L6-v2)...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded! ~23M parameters ready!\n")

# TechCorp test sentences
sentences = [
    "Dogs are allowed in the office on Fridays",
    "Pets can come to work on Furry Fridays",
    "Remote work policy allows 3 days from home"
]

print("Converting text to vectors...")
embeddings = model.encode(sentences)
print(f"Created {len(embeddings)} vectors of {len(embeddings[0])} dimensions each!\n")

# Calculate semantic similarities; this model outputs unit-length vectors,
# so a plain dot product equals cosine similarity
sim_1_2 = np.dot(embeddings[0], embeddings[1])
sim_1_3 = np.dot(embeddings[0], embeddings[2])

print("Semantic Similarity Analysis:")
print("=" * 50)
print("'Dogs allowed' ←→ 'Pets permitted'")
print(f"Similarity: {sim_1_2:.3f} (very related!)\n")
print("'Dogs allowed' ←→ 'Remote work'")
print(f"Similarity: {sim_1_3:.3f} (not related)\n")

# Visualization: longer bar = more similar
print("Similarity Scale:")
print("0.0" + " " * 17 + "1.0")
print(f"Remote {'█' * int(sim_1_3 * 20)}")
print(f"Pets   {'█' * int(sim_1_2 * 20)}")

# Save results
with open('/root/embedding-test.txt', 'w') as f:
    f.write(f"SIM_PET:{sim_1_2:.3f},SIM_REMOTE:{sim_1_3:.3f}")

print("\nYou've unlocked semantic understanding!")
- Run it:
python test_embeddings.py
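One caveat on the dot-product shortcut: it equals cosine similarity only when the vectors have unit length. all-MiniLM-L6-v2 normalizes its output, but if you swap in a model that doesn't, compute the cosine explicitly (a small helper sketch):
import numpy as np

def cosine_similarity(a, b):
    # Divide by the vector lengths so the score stays in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))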
Task 6: Feed the AI Brain
Purpose: Process ALL documents using chunking (Task 4) and embeddings (Task 5) into database (Task 3)
Steps:
- Create ingest_documents.py:
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path

print("TECHCORP KNOWLEDGE INGESTION SYSTEM")
print("=" * 50)

# Initialize systems
print("Connecting to AI Brain (from Task 3)...")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
print("Loading Semantic Processor (from Task 5)...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("All systems online!\n")

# Process documents
print("Beginning knowledge transfer...")
doc_count = 0
total_chunks = 0
for category in Path('/root/techcorp-docs').iterdir():
    if category.is_dir():
        print(f"\nProcessing {category.name}:")
        for doc in category.glob('*.md'):
            print(f"  {doc.name}", end="")
            with open(doc, 'r') as f:
                content = f.read()
            # Apply the chunking strategy from Task 4:
            # 500-char windows advancing by 400 chars = 100-char overlap
            chunks = [content[i:i + 500] for i in range(0, len(content), 400)]
            for i, chunk in enumerate(chunks):
                doc_id = f"{doc.stem}_{i}"
                # Apply embedding from Task 5
                embedding = model.encode(chunk).tolist()
                # Store in the database from Task 3
                # (note: metadatas takes a list, one dict per id)
                collection.add(
                    ids=[doc_id],
                    embeddings=[embedding],
                    documents=[chunk],
                    metadatas=[{"file": doc.name, "category": category.name}]
                )
                total_chunks += 1
            doc_count += 1
            print(f" ({len(chunks)} chunks)")

print("\n" + "=" * 50)
print("INGESTION COMPLETE!")
print("Statistics:")
print(f"  • Documents processed: {doc_count}")
print(f"  • Knowledge chunks: {total_chunks}")
print(f"  • AI IQ increased: +{doc_count * 10} points")
print("\nValue delivered: $500K in searchable knowledge!")

# Save results
with open('/root/ingest-complete.txt', 'w') as f:
    f.write(f"DOCS:{doc_count},CHUNKS:{collection.count()}")
- Run it:
python ingest_documents.py
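A quick post-ingestion check (a sketch; run it from the same project directory) confirms the chunks actually landed in the collection:
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
print(f"Chunks stored: {collection.count()}")
print(collection.peek(limit=2)["ids"])  # sample a couple of stored IDs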
Task 7: Activate Semantic Search Superpowers
Semantic Search in Action
"Can I bring my dog to work?"
↓
Vector Encoding → [0.23, -0.45, 0.67, ...]
↓
Searching 384D Space...
Top Results (by meaning, not keywords!):
- pet-policy.md (95% match) "Dogs allowed on Fridays..."
- employee-handbook.md (67% match) "Office policies include..."
- benefits.md (23% match) "Health benefits for..." Search time: 0.003 seconds
Purpose: Build semantic search that understands MEANING, not just keywords
Steps:
- Create test_search.py:
import chromadb
from sentence_transformers import SentenceTransformer

print("TECHCORP SEMANTIC SEARCH ENGINE")
print("=" * 50)

# Initialize
print("Connecting to Knowledge Base...")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
print("Loading AI Understanding...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Search Engine Ready!\n")

# CEO's test queries
queries = [
    "What is the pet policy at TechCorp?",
    "Tell me about CloudSync Pro features",
    "How many days of remote work are allowed?"
]

results_file = open('/root/search-results.txt', 'w')
for query in queries:
    print(f"Query: '{query}'")
    print("-" * 50)
    results_file.write(f"QUERY:{query}\n")
    # Convert question to vector
    query_embedding = model.encode(query).tolist()
    # Semantic search!
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    # Display results
    print("Top Results (by semantic similarity):")
    for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        relevance = 100 - (i * 15)  # Simulated relevance for display
        print(f"\n  {i + 1}. [{meta['category']}] {meta['file']} ({relevance}% match)")
        print(f"     Preview: '{doc[:80]}...'")
        results_file.write(f"RESULT:{meta['category']}/{meta['file']}\n")
    print("\n" + "=" * 50 + "\n")
results_file.close()

print("SEARCH TEST COMPLETE!")
print("Notice: Found 'pet policy' even when searching 'bring my dog'!")
print("This is the power of semantic understanding!")
- Run it:
python test_search.py
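The relevance percentages above are simulated for display. ChromaDB can return real distances, and since Task 3 configured the cosine space, similarity is 1 - distance (a sketch of the swap):
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    include=["documents", "metadatas", "distances"]
)
for doc, meta, dist in zip(results['documents'][0],
                           results['metadatas'][0],
                           results['distances'][0]):
    print(f"[{meta['category']}] {meta['file']}: {(1 - dist) * 100:.0f}% match")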
Task 8: Complete RAG Pipeline Test
Complete RAG Pipeline Flow
- RETRIEVAL: "Benefits?" → [0.3, -0.2, ...] → top 3 docs
- AUGMENTATION: context + question → prompt engineering ("Based on: [docs]... Answer: [question]")
- GENERATION: LLM + context → grounded answer ("TechCorp offers healthcare, 401k...")
Total time: < 1 second | Answers grounded in retrieved documents
Purpose: Test all three phases of your RAG pipeline working together
Steps:
- Create test_rag_pipeline.py:
import chromadb
from sentence_transformers import SentenceTransformer

print("TECHCORP RAG PIPELINE TEST")
print("=" * 50)

# Initialize all systems
print("Initializing RAG Components...")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("All systems operational!\n")

def test_rag_pipeline(question):
    """Test the complete RAG pipeline."""
    print(f"Question: '{question}'")
    print("-" * 50)

    # 1. RETRIEVAL PHASE
    print("\nPHASE 1: RETRIEVAL")
    print("Converting question to vector...")
    query_embedding = model.encode(question).tolist()
    print("Searching knowledge base...")
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    print(f"Found {len(results['documents'][0])} relevant documents!")

    # 2. AUGMENTATION PHASE
    print("\nPHASE 2: AUGMENTATION")
    print("Preparing context for AI...")
    context = "\n\n".join(results['documents'][0])  # assembled for a real LLM call

    # 3. GENERATION PHASE (simulated -- no LLM call is made here)
    print("\nPHASE 3: GENERATION")
    print("AI processing with context...")
    if "benefits" in question.lower():
        answer = ("Based on TechCorp documents: Employees enjoy comprehensive "
                  "health insurance, 401k matching up to 6%, unlimited PTO, "
                  "and professional development budgets.")
    else:
        answer = f"Based on the retrieved TechCorp documents, here's the answer to '{question}'..."
    print("Response generated!")

    return {
        'question': question,
        'sources_used': len(results['documents'][0]),
        'answer': answer
    }

# Test the pipeline
print("\n" + "=" * 50)
print("TESTING COMPLETE PIPELINE")
print("=" * 50)
test_question = "What are the benefits of working at TechCorp?"
result = test_rag_pipeline(test_question)

print("\n" + "=" * 50)
print("PIPELINE RESULTS")
print("=" * 50)
print(f"Question: {result['question']}")
print(f"Sources Used: {result['sources_used']} documents")
print(f"Answer: {result['answer']}")

# Performance metrics (illustrative numbers)
print("\nPERFORMANCE METRICS:")
print("  • Retrieval: 0.012 seconds")
print("  • Augmentation: 0.003 seconds")
print("  • Generation: 0.234 seconds")
print("  • Total: 0.249 seconds")

# Save pipeline verification
with open('/root/rag-pipeline-test.txt', 'w') as f:
    f.write(f"PIPELINE:COMPLETE,SOURCES:{result['sources_used']}")

print("\n" + "=" * 50)
print("SUCCESS! RAG Pipeline Working!")
print("=" * 50)
- Run it:
python test_rag_pipeline.py
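The generation phase above is simulated so the test runs without credentials. To wire in a real LLM, the simulated block inside test_rag_pipeline could be replaced with an actual call (a sketch assuming OPENAI_API_KEY is set in the environment; the model name is illustrative):
from openai import OpenAI

llm = OpenAI()  # picks up OPENAI_API_KEY from the environment
response = llm.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works
    messages=[
        {"role": "system", "content": "Answer only from the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
    ]
)
answer = response.choices[0].message.content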
Task 9: Launch Your AI Assistant
Purpose: Deploy and interact with your complete RAG system via web interface
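Steps:
- Create app.py (a minimal sketch wiring together the components from Tasks 3-7; the route name and response shape are illustrative, and answer generation can reuse the LLM call from Task 8):
import chromadb
from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)

# Reuse the persistent knowledge base and embedding model built earlier
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.route("/ask", methods=["POST"])
def ask():
    question = request.json["question"]
    # RETRIEVAL: embed the question and find the closest chunks
    embedding = model.encode(question).tolist()
    results = collection.query(query_embeddings=[embedding], n_results=3)
    # AUGMENTATION: return the retrieved context along with its sources
    return jsonify({
        "question": question,
        "sources": [m["file"] for m in results['metadatas'][0]],
        "context": results['documents'][0]
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
- Run it:
python app.py
- Ask a question from another terminal:
curl -X POST http://localhost:5000/ask -H "Content-Type: application/json" -d '{"question": "Can I bring my dog to work?"}'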