Adarsh Singh

AI: RAG Python Problem

Problem Statement:
Current State: CHAOS
500GB of documents
Hours to find answers
Losing $10K/day in productivity
ChatGPT can't access our private data

Your Solution: RAG System
Instant answers (< 1 second)
Accurate, source-grounded responses
Secure, private data
Save $300K/year

Your RAG Toolkit
Retrieval: Semantic search
Augmentation: Context injection
Generation: Smart responses

Task 1: Set Up Development Environment

  1. Installing Python libraries:
     • ChromaDB - vector DB
     • Transformers - ML models
     • Flask - web server
     • OpenAI - LLM API

Purpose: Install all dependencies required for building RAG

Steps:

  • cd /root && mkdir -p rag-project && cd rag-project
  • python3 -m venv venv && source venv/bin/activate
  • pip install uv && uv pip install chromadb sentence-transformers openai flask
  • echo "READY" > /root/rag-setup-complete.txt

Explanations:

  • python3 -m venv venv: create a virtual environment in a folder named venv. A venv is a self-contained Python environment so dependencies don’t leak into the system Python.

  • source venv/bin/activate: activate that environment, so pip and python now refer to the virtual environment instead of the global system install.

  • pip install uv: installs uv, a modern Python package installer and resolver (much faster than regular pip).
    uv pip install ...: uses uv as a drop-in replacement for pip to install packages into the virtual environment:

  • chromadb: vector database for embeddings (used in RAG pipelines).

  • sentence-transformers: pretrained transformer models for turning text into embeddings.

  • openai: OpenAI’s official Python client library.

  • flask: lightweight web framework for serving APIs.
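
Optional sanity check: with the venv still active, confirm everything imports and report the installed versions. This is a minimal sketch; verify_setup.py is just a suggested name.

# verify_setup.py - optional sanity check (assumes the venv from the steps above is active)
import chromadb, sentence_transformers, openai, flask  # fails fast if anything is missing

from importlib.metadata import version

for pkg in ("chromadb", "sentence-transformers", "openai", "flask"):
    print(f"{pkg}: {version(pkg)}")
print("All imports OK - environment ready")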

Task 2: Explore TechCorp's Document Vault

 employee-handbook/
   pet-policy.md (CEO's dog!)
   remote-work-policy.md
   benefits-overview.md
 product-specs/
   cloudsync-pro.md ($1M product)
   datavault.md
 meeting-notes/
   q3-planning-meeting.md
   product-launch-review.md
 customer-faqs/
   general-faqs.md

Total: the "500GB" corpus is simulated here as a small set of focused docs

Purpose: Review all the documents before building the RAG system

Steps:

cd /root/techcorp-docs
ls -la
find . -name "*.md" | wc -l
find . -name "*.md" | wc -l > /root/doc-count.txt
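
If you'd rather do the count in Python, here is a pathlib equivalent of the find pipeline above (count_docs.py is a hypothetical name, not part of the lab):

# count_docs.py - Python equivalent of: find . -name "*.md" | wc -l
from pathlib import Path

md_files = list(Path("/root/techcorp-docs").rglob("*.md"))
print(f"Found {len(md_files)} markdown documents")
Path("/root/doc-count.txt").write_text(str(len(md_files)))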

Task 3: Initialize Vector Database

ChromaDB Architecture
Documents → Vectors → Semantic Space
"pet policy" → [0.2,-0.5...]
"remote work" → [0.1,0.8...]
"product" → [0.9,0.3...]
384-dimensional semantic understanding

Purpose: Create AI brain for storing document vectors

Steps:

  1. Create init_vectordb.py:
import chromadb
from chromadb.config import Settings

print(" Initializing AI Brain...")
client = chromadb.PersistentClient(
    path="./chroma_db",
    settings=Settings(anonymized_telemetry=False)
)

collection = client.get_or_create_collection(
    name="techcorp_docs",
    metadata={"hnsw:space": "cosine"}
)

print(f" Brain Created: {collection.name}")
print(f" Memories: {collection.count()}")
print(" AI Brain Ready!")
  2. Run it: python init_vectordb.py
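
The hnsw:space metadata tells Chroma to rank results by cosine distance, and because PersistentClient writes to ./chroma_db on disk, the collection survives restarts. A quick check you can run at any time (verify_db.py is a suggested name):

# verify_db.py - optional check that the collection persists on disk
import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
print(f"Collection '{collection.name}' holds {collection.count()} entries")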

Task 4: Learn Document Chunking Strategy

Smart Chunking Strategy
Original Document (2000 chars)
Chunked (500 chars, 100 overlap)
↑ Overlap preserves context across chunk boundaries, which noticeably improves retrieval accuracy

Purpose: Learn optimal chunking strategy BEFORE processing real documents

Steps:

  1. Create test_chunking.py:
import os

print(" DOCUMENT CHUNKING ENGINE")
print("="*40)

def chunk_text(text, size=500, overlap=100):
    """Smart chunking with overlap for context preservation"""
    chunks = []
    start = 0

    while start < len(text):
        end = min(start + size, len(text))
        chunk = text[start:end]
        chunks.append(chunk)

        if end >= len(text):
            break

        start += size - overlap

    return chunks

# Process sample document
sample_doc = """TechCorp Pet Policy: 
Employees may bring pets to the office on Fridays. 
Dogs must be well-behaved and vaccinated. 
The CEO's golden retriever is the office mascot.

Remote Work Policy:
Employees can work remotely up to 3 days per week.
Core hours are 10 AM - 3 PM in your local timezone.
All meetings should be recorded for async collaboration.

Benefits Overview:
Comprehensive health insurance including dental and vision.
401k matching up to 6% of salary.
Unlimited PTO after first year.
Annual learning budget of $2,000."""

print(f" Original document: {len(sample_doc)} characters")
print("-"*40)

chunks = chunk_text(sample_doc, size=500, overlap=100)

print(f" Created {len(chunks)} chunks")
print("-"*40)

for i, chunk in enumerate(chunks, 1):
    print(f"\nChunk {i} ({len(chunk)} chars):")
    print(f"Preview: {chunk[:60]}...")

# Save verification
with open('/root/chunk-test.txt', 'w') as f:
    f.write(f"CHUNKS:{len(chunks)}")

print("\n" + "="*40)
print(" Chunking complete!")
print(f" Stats: {len(chunks)} chunks from {len(sample_doc)} chars")
print(" Ready for vectorization!")

  2. Run it: python test_chunking.py
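
To see where the diagram's numbers come from: with size=500 and overlap=100, the window advances 500 - 100 = 400 characters per step, so a 2000-character document yields chunks starting at offsets 0, 400, 800, 1200, and 1600, five in total. A self-contained check (chunk_math.py is a hypothetical name; it reuses the chunk_text logic above):

# chunk_math.py - sanity-check the stride math (size=500, overlap=100 -> stride 400)
def chunk_text(text, size=500, overlap=100):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start += size - overlap
    return chunks

doc = "x" * 2000                        # a 2000-char document, as in the diagram
chunks = chunk_text(doc)
print(f"{len(chunks)} chunks")          # 5
print([len(c) for c in chunks])         # [500, 500, 500, 500, 400]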

Task 5: Understand How Embeddings Work

Semantic Embedding Transformation
"Dogs allowed Fridays" → AI Model → 384D Vector
[0.23, -0.45, 0.67, ..., 0.12]
Semantic Similarity:
"Pets permitted" ↔ "Dogs allowed" = 92%
"Remote work" ↔ "Dogs allowed" = 18%

Purpose: Learn how AI converts text to math BEFORE processing real documents in Task 6

Steps:

  1. Create test_embeddings.py:
from sentence_transformers import SentenceTransformer
import numpy as np

print(" Loading Google's AI Brain (all-MiniLM-L6-v2)...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(" Brain loaded! 90M parameters ready!\n")

# TechCorp test sentences
sentences = [
    "Dogs are allowed in the office on Fridays",
    "Pets can come to work on Furry Fridays",
    "Remote work policy allows 3 days from home"
]

print(" Converting text to vectors...")
embeddings = model.encode(sentences)
print(f" Created {len(embeddings)} vectors of {len(embeddings[0])} dimensions each!\n")

# Calculate semantic similarities (the model outputs unit-length vectors,
# so a plain dot product equals cosine similarity)
sim_1_2 = np.dot(embeddings[0], embeddings[1])
sim_1_3 = np.dot(embeddings[0], embeddings[2])

print(" Semantic Similarity Analysis:")
print("="*50)
print(f"'Dogs allowed' ←→ 'Pets permitted'")
print(f"Similarity: {sim_1_2:.3f} (Very Related! )\n")

print(f"'Dogs allowed' ←→ 'Remote work'")
print(f"Similarity: {sim_1_3:.3f} (Not Related )\n")

# Visualization
print(" Similarity Scale:")
print("0.0  1.0")
print(f"     Remote {'' * int(sim_1_3*20)}")
print(f"     Pets   {'' * int(sim_1_2*20)}")

# Save results
with open('/root/embedding-test.txt', 'w') as f:
    f.write(f"SIM_PET:{sim_1_2:.3f},SIM_REMOTE:{sim_1_3:.3f}")

print("\n You've unlocked semantic understanding!")

  2. Run it: python test_embeddings.py
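
One detail worth knowing: the sentence-transformers build of all-MiniLM-L6-v2 includes a normalization layer, so encode() returns unit-length vectors, which is why the plain dot product above works as cosine similarity. A quick check:

# norm_check.py - confirm vectors are unit-length, so dot product == cosine similarity
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
vec = model.encode("Dogs are allowed in the office on Fridays")
print(f"L2 norm: {np.linalg.norm(vec):.4f}")  # ~1.0000 -> already normalized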

Task 6: Feed the AI Brain

Purpose: Process ALL documents using chunking (Task 4) and embeddings (Task 5) into database (Task 3)

Steps:
  1. Create ingest_documents.py:

import os
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path

print("TECHCORP KNOWLEDGE INGESTION SYSTEM")
print("="*50)

# Initialize systems
print("Connecting to AI Brain (from Task 3)...")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")

print("Loading Semantic Processor (from Task 5)...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print("All systems online!\n")

# Process documents
print("Beginning knowledge transfer...")
doc_count = 0
total_chunks = 0

for category in Path('/root/techcorp-docs').iterdir():
    if category.is_dir():
        print(f"\nProcessing {category.name}:")

        for doc in category.glob('*.md'):
            print(f"  {doc.name}", end="")

            with open(doc, 'r') as f:
                content = f.read()

            # Apply chunking strategy from Task 4: 500-char chunks with 100-char overlap (stride 400)
            chunks = [content[i:i+500] for i in range(0, len(content), 400)]

            for i, chunk in enumerate(chunks):
                doc_id = f"{doc.stem}_{i}"
                # Apply embedding from Task 5!
                embedding = model.encode(chunk).tolist()

                # Store in database from Task 3!
                collection.add(
                    ids=[doc_id],
                    embeddings=[embedding],
                    documents=[chunk],
                    metadatas=[{"file": doc.name, "category": category.name}]
                )
                total_chunks += 1

            doc_count += 1
            print(f" ({len(chunks)} chunks)")

print("\n" + "="*50)
print(f"INGESTION COMPLETE!")
print(f"Statistics:")
print(f"   • Documents processed: {doc_count}")
print(f"   • Knowledge chunks: {total_chunks}")
print(f"   • AI IQ increased: +{doc_count*10} points")
print(f"\nValue delivered: $500K in searchable knowledge!")

# Save results
with open('/root/ingest-complete.txt', 'w') as f:
    f.write(f"DOCS:{doc_count},CHUNKS:{collection.count()}")

  2. Run it: python ingest_documents.py
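
Note that re-running the script against an existing collection may fail or warn about duplicate IDs, depending on your ChromaDB version, since collection.add does not overwrite. Chroma's upsert takes the same arguments and replaces matching IDs instead. A drop-in replacement for the storage step above (a sketch; the surrounding loop variables are from the script):

# Re-runnable variant of the storage step: upsert overwrites chunks with the same ID
collection.upsert(
    ids=[doc_id],
    embeddings=[embedding],
    documents=[chunk],
    metadatas=[{"file": doc.name, "category": category.name}]
)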

Task 7: Activate Semantic Search Superpowers

Semantic Search in Action
"Can I bring my dog to work?"

Vector Encoding → [0.23, -0.45, 0.67, ...]

Searching 384D Space...

Top Results (by meaning, not keywords!):

  1. pet-policy.md (95% match) "Dogs allowed on Fridays..."
  2. employee-handbook.md (67% match) "Office policies include..."
  3. benefits.md (23% match) "Health benefits for..."

Search time: 0.003 seconds

Purpose: Build semantic search that understands MEANING, not just keywords

Steps:

  1. Create test_search.py:
import chromadb
from sentence_transformers import SentenceTransformer

print(" TECHCORP SEMANTIC SEARCH ENGINE")
print("="*50)

# Initialize
print(" Connecting to Knowledge Base...")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")

print(" Loading AI Understanding...")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(" Search Engine Ready!\n")

# CEO's test queries
queries = [
    "What is the pet policy at TechCorp?",
    "Tell me about CloudSync Pro features",
    "How many days of remote work are allowed?"
]

results_file = open('/root/search-results.txt', 'w')

for query in queries:
    print(f" Query: '{query}'")
    print("-" * 50)
    results_file.write(f"QUERY:{query}\n")

    # Convert question to vector
    query_embedding = model.encode(query).tolist()

    # Semantic search!
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )

    # Display results
    print(" Top Results (by semantic similarity):")
    for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
        relevance = 100 - (i * 15)  # Simulated relevance
        print(f"\n  {i+1}. [{meta['category']}] {meta['file']} ({relevance}% match)")
        print(f"     Preview: '{doc[:80]}...'")
        results_file.write(f"RESULT:{meta['category']}/{meta['file']}\n")

    print("\n" + "="*50 + "\n")

results_file.close()

print(" SEARCH TEST COMPLETE!")
print(" Notice: Found 'pet policy' even when searching 'bring my dog'!")
print(" This is the power of semantic understanding!")

  2. Run it: python test_search.py
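
The match percentages printed above are simulated (relevance = 100 - i*15). Chroma's query response also includes real distances, and with the cosine space configured in Task 3, similarity is simply 1 - distance. A minimal variant of the display loop using real scores:

# Real relevance scores instead of the simulated ones (cosine space from Task 3)
results = collection.query(query_embeddings=[query_embedding], n_results=3)
for doc, meta, dist in zip(results['documents'][0],
                           results['metadatas'][0],
                           results['distances'][0]):
    similarity = 1 - dist  # cosine distance -> cosine similarity
    print(f"[{meta['category']}] {meta['file']} ({similarity:.0%} match)")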

Task 8: Complete RAG Pipeline Test

Complete RAG Pipeline Flow

  1. RETRIEVAL: "Benefits?" → [0.3,-0.2,...] → Top 3 docs
  2. AUGMENTATION: Context + Question → prompt engineering ("Based on: [docs]... Answer: [question]")
  3. GENERATION: LLM + Context → accurate answer ("TechCorp offers healthcare, 401k...")

Total time: < 1 second | Accuracy: 100%

Purpose: Test all three phases of your RAG pipeline working together

Steps:

  1. Create test_rag_pipeline.py:
import chromadb
from sentence_transformers import SentenceTransformer
import openai  # installed in Task 1; the generation phase below is simulated, so it isn't called yet

print(" TECHCORP RAG PIPELINE TEST")
print("="*50)

# Initialize all systems
print(" Initializing RAG Components...")
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')
print(" All systems operational!\n")

def test_rag_pipeline(question):
    """Test the complete RAG Pipeline"""

    print(f" Question: '{question}'")
    print("-" * 50)

    # 1. RETRIEVAL PHASE
    print("\n PHASE 1: RETRIEVAL")
    print("  Converting question to vector...")
    query_embedding = model.encode(question).tolist()
    print("  Searching knowledge base...")

    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )

    print(f"   Found {len(results['documents'][0])} relevant documents!")

    # 2. AUGMENTATION PHASE
    print("\n PHASE 2: AUGMENTATION")
    print("  Preparing context for AI...")
    context = "\n\n".join(results['documents'][0])

    # 3. GENERATION PHASE (Simulated)
    print("\n PHASE 3: GENERATION")
    print("  AI processing with context...")

    # Simulated response
    if "benefits" in question.lower():
        answer = "Based on TechCorp documents: Employees enjoy comprehensive health insurance, 401k matching up to 6%, unlimited PTO, and professional development budgets."
    else:
        answer = f"Based on the retrieved TechCorp documents, here's the answer to '{question}'..."

    print("   Response generated!")

    return {
        'question': question,
        'sources_used': len(results['documents'][0]),
        'answer': answer
    }

# Test the pipeline
print("\n" + "="*50)
print(" TESTING COMPLETE PIPELINE")
print("="*50)

test_question = "What are the benefits of working at TechCorp?"
result = test_rag_pipeline(test_question)

print("\n" + "="*50)
print(" PIPELINE RESULTS")
print("="*50)
print(f" Question: {result['question']}")
print(f" Sources Used: {result['sources_used']} documents")
print(f" Answer: {result['answer']}")

# Performance metrics
print("\n PERFORMANCE METRICS:")
print("  • Retrieval: 0.012 seconds")
print("  • Augmentation: 0.003 seconds")
print("  • Generation: 0.234 seconds")
print("  • Total: 0.249 seconds")

# Save pipeline verification
with open('/root/rag-pipeline-test.txt', 'w') as f:
    f.write(f"PIPELINE:COMPLETE,SOURCES:{result['sources_used']}")

print("\n" + "="*50)
print(" SUCCESS! RAG Pipeline Working!")
print("="*50)


  2. Run it: python test_rag_pipeline.py
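
The generation phase above is simulated so the test runs without an API key. If you want real generation, a minimal sketch using the openai client installed in Task 1 might look like this (assumes OPENAI_API_KEY is set in the environment; the model name is only an example):

# real_generation.py - sketch of a real GENERATION phase (assumes OPENAI_API_KEY is set)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_answer(question, context):
    """Augment the prompt with retrieved context, then ask the LLM."""
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name - substitute whatever you use
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content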

Task 9: Launch Your AI Assistant

Purpose: Deploy and interact with your complete RAG system via web interface
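
The lab's final step isn't shown, but a minimal Flask wrapper over the pipeline gives you the web interface this task describes. This is a sketch under the same assumptions as the earlier tasks (ChromaDB at ./chroma_db, collection techcorp_docs, generation stubbed out); the /ask route, port, and app.py name are illustrative choices, not the lab's actual solution:

# app.py - minimal web interface over the RAG pipeline (sketch)
from flask import Flask, request, jsonify
import chromadb
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_collection("techcorp_docs")
model = SentenceTransformer('all-MiniLM-L6-v2')

@app.route("/ask", methods=["POST"])
def ask():
    data = request.get_json(silent=True) or {}
    question = data.get("question", "")
    # RETRIEVAL: embed the question and find the closest chunks
    query_embedding = model.encode(question).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=3)
    # AUGMENTATION: join the retrieved chunks into a context block
    context = "\n\n".join(results['documents'][0])
    # GENERATION: stubbed here - plug in the OpenAI call from Task 8 for real answers
    answer = f"Based on TechCorp documents:\n{context[:500]}"
    return jsonify({"question": question, "answer": answer,
                    "sources": [m['file'] for m in results['metadatas'][0]]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Try it with: curl -X POST localhost:5000/ask -H "Content-Type: application/json" -d '{"question": "What is the pet policy?"}'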

