Phase 8 has been building to this.
Tokenization. Embeddings. Transformers. BERT. GPT. HuggingFace. Fine-tuning. Vector search. RAG. Chatbot architecture. OpenAI API. Claude API.
Fourteen posts of foundations.
Now you ship something real.
This capstone builds DocuMind: an AI research assistant that lets users upload any PDF or text document and have intelligent conversations about it. It retrieves relevant passages, cites sources, maintains conversation context, and deploys as a shareable web application.
It is not the most complex system ever built. It is exactly complex enough to require every concept from Phase 8, and simple enough to build completely in one post.
What You Are Building
DocuMind: AI Research Assistant
Features:
✓ Upload PDFs and text files
✓ Automatic chunking and embedding
✓ Semantic search over uploaded documents
✓ Multi-turn conversation with memory
✓ Source citations on every answer
✓ Multi-document support
✓ Deployable Streamlit app
Tech stack:
- sentence-transformers (embeddings)
- chromadb (vector store)
- anthropic / openai (LLM)
- streamlit (UI)
- pypdf (PDF parsing)
- word-window splitter in document_processor.py (chunking; langchain-text-splitters is an optional alternative)
Project Structure
documind/
├── app.py # Streamlit application entry point
├── config.py # Configuration and constants
├── document_processor.py # PDF/text loading, chunking, embedding
├── knowledge_base.py # ChromaDB vector store wrapper
├── conversation.py # Multi-turn conversation management
├── llm_client.py # Pluggable LLM backend (Claude or OpenAI)
├── rag_engine.py # Retrieval-augmented generation pipeline
├── requirements.txt # All dependencies
└── README.md # Setup and usage instructions
config.py
# config.py
from dataclasses import dataclass


@dataclass
class Config:
    # Embedding model
    EMBED_MODEL: str = "all-MiniLM-L6-v2"
    EMBED_DIM: int = 384
    # Chunking
    CHUNK_SIZE: int = 512
    CHUNK_OVERLAP: int = 64
    # Retrieval
    TOP_K_RETRIEVE: int = 5
    TOP_K_RERANK: int = 3
    MIN_SCORE: float = 0.25
    # Conversation
    MAX_HISTORY_TURNS: int = 8
    # LLM
    LLM_PROVIDER: str = "anthropic"  # "anthropic" or "openai"
    ANTHROPIC_MODEL: str = "claude-3-5-haiku-20241022"
    OPENAI_MODEL: str = "gpt-4o-mini"
    MAX_TOKENS: int = 800
    TEMPERATURE: float = 0.2
    # ChromaDB
    CHROMA_PATH: str = "./chroma_db"
    COLLECTION_NAME: str = "documind"


CONFIG = Config()
document_processor.py
# document_processor.py
import re
from pathlib import Path
from typing import List, Dict
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str
    page: int
    chunk_idx: int
    char_start: int


def load_text_file(filepath: str) -> str:
    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        return f.read()


def load_pdf(filepath: str) -> List[Dict]:
    """Return list of {text, page} dicts."""
    try:
        import pypdf
        pages = []
        with open(filepath, "rb") as f:
            reader = pypdf.PdfReader(f)
            for i, page in enumerate(reader.pages):
                text = page.extract_text() or ""
                if text.strip():
                    pages.append({"text": text, "page": i + 1})
        return pages
    except ImportError:
        raise ImportError("Install: pip install pypdf")
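def clean_text(text: str) -> str:
    # Collapse runs of spaces and tabs but keep newlines, so the second
    # pattern can still reduce three or more line breaks to a single
    # paragraph break (collapsing all whitespace first would erase them).
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"(\n\s*){3,}", "\n\n", text)
    return text.strip()


def chunk_text(text: str, source: str, page: int = 1,
               chunk_size: int = 512, overlap: int = 64) -> List[Chunk]:
    """Sliding word-window splitting with overlap (sizes are in words)."""
    text = clean_text(text)
    words = text.split()
    chunks = []
    idx = 0
    step = max(1, chunk_size - overlap)  # avoid an endless loop if overlap >= chunk_size
    while idx < len(words):
        chunk_words = words[idx:idx + chunk_size]
        chunk_str = " ".join(chunk_words)  # named chunk_str so it does not shadow chunk_text
        if len(chunk_str.strip()) > 50:
            chunks.append(Chunk(
                text = chunk_str,
                source = source,
                page = page,
                chunk_idx = len(chunks),
                char_start = idx  # note: a word offset, not a character offset
            ))
        idx += step
    return chunks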
def process_document(filepath: str,
                     chunk_size: int = 512,
                     overlap: int = 64) -> List[Chunk]:
    """Load and chunk any supported document type."""
    path = Path(filepath)
    all_chunks = []
    if path.suffix.lower() == ".pdf":
        pages = load_pdf(filepath)
        for page_data in pages:
            chunks = chunk_text(
                page_data["text"],
                source = path.name,
                page = page_data["page"],
                chunk_size = chunk_size,
                overlap = overlap
            )
            all_chunks.extend(chunks)
    elif path.suffix.lower() in [".txt", ".md", ".rst"]:
        text = load_text_file(filepath)
        chunks = chunk_text(
            text,
            source = path.name,
            page = 1,
            chunk_size = chunk_size,
            overlap = overlap
        )
        all_chunks.extend(chunks)
    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")
    print(f" Processed {path.name}: {len(all_chunks)} chunks")
    return all_chunks
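To make the overlap arithmetic concrete, here is a small standalone sketch of where each window starts. The helper name `window_starts` is made up for illustration and is not part of DocuMind; it only mirrors the `idx += step` loop in `chunk_text`.

```python
from typing import List


def window_starts(n_words: int, chunk_size: int = 512, overlap: int = 64) -> List[int]:
    """Word offsets at which each chunk begins (hypothetical helper)."""
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")
    # Same progression as the while loop in chunk_text: 0, step, 2*step, ...
    return list(range(0, n_words, step))


starts = window_starts(1000)
print(starts)  # [0, 448, 896]
```

For a 1,000-word page the third window covers only words 896 to 999, so the last chunk of a page is usually shorter than the others.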
knowledge_base.py
# knowledge_base.py
import uuid
from typing import List, Dict, Optional
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from document_processor import Chunk


class KnowledgeBase:
    def __init__(self, embed_model: str, persist_dir: str, collection_name: str):
        self.embedder = SentenceTransformer(embed_model)
        self.client = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(
            name = collection_name,
            metadata = {"hnsw:space": "cosine"}
        )
        self._source_cache = {}

    def add_chunks(self, chunks: List[Chunk]) -> int:
        if not chunks:
            return 0
        texts = [c.text for c in chunks]
        embeddings = self.embedder.encode(texts, show_progress_bar=True).tolist()
        ids = [str(uuid.uuid4()) for _ in chunks]
        metadatas = [
            {"source": c.source,
             "page": c.page,
             "chunk_idx": c.chunk_idx}
            for c in chunks
        ]
        self.collection.add(
            ids = ids,
            embeddings = embeddings,
            documents = texts,
            metadatas = metadatas
        )
        for chunk in chunks:
            self._source_cache[chunk.source] = True
        return len(chunks)
    def retrieve(self, query: str, top_k: int = 5,
                 min_score: float = 0.25,
                 filter_source: Optional[str] = None) -> List[Dict]:
        query_emb = self.embedder.encode([query]).tolist()
        where = {"source": filter_source} if filter_source else None
        results = self.collection.query(
            query_embeddings = query_emb,
            n_results = top_k,
            where = where,
            include = ["documents", "metadatas", "distances"]
        )
        retrieved = []
        for text, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        ):
            # Cosine distance is 1 - cosine similarity, so invert it
            score = 1 - dist
            if score >= min_score:
                retrieved.append({
                    "text": text,
                    "source": meta["source"],
                    "page": meta["page"],
                    "score": round(score, 4)
                })
        return retrieved
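    def get_sources(self) -> List[str]:
        # Read source names from the stored metadata rather than the
        # in-memory cache: the collection is persistent, so documents
        # indexed in an earlier run must still be listed after a restart.
        metadatas = self.collection.get(include=["metadatas"])["metadatas"]
        return sorted({m["source"] for m in metadatas})

    def document_count(self) -> int:
        return self.collection.count()

    def clear(self):
        self.client.delete_collection(self.collection.name)
        self.collection = self.client.get_or_create_collection(
            name=self.collection.name,
            metadata={"hnsw:space": "cosine"}
        )
        self._source_cache = {}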
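The `score = 1 - dist` step in `retrieve` relies on the collection being configured for cosine distance. The sketch below reproduces the same scoring and `min_score` filtering in plain Python on toy 2D vectors. The vectors and the helper names (`cosine_similarity`, `rank_by_score`) are invented for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_score(query_vec: Sequence[float],
                  docs: Dict[str, Sequence[float]],
                  min_score: float = 0.25) -> List[Tuple[str, float]]:
    """Score every doc, drop those below min_score, sort best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in docs.items()]
    kept = [(name, score) for name, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumed data): "a" points the same way as the query,
# "b" is 45 degrees off, "c" is orthogonal, "d" points the opposite way.
toy_docs = {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
ranked = rank_by_score([1.0, 0.0], toy_docs)
print([name for name, _ in ranked])  # ['a', 'b']
```

Raising `MIN_SCORE` trades recall for precision: a stricter threshold sends fewer, more relevant passages to the model, but makes the fallback answer more likely.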
llm_client.py
# llm_client.py
import os
from abc import ABC, abstractmethod
from typing import List, Dict


class LLMClient(ABC):
    @abstractmethod
    def complete(self, system: str, messages: List[Dict],
                 max_tokens: int = 800, temperature: float = 0.2) -> str:
        pass


class ClaudeClient(LLMClient):
    def __init__(self, model: str = "claude-3-5-haiku-20241022"):
        import anthropic
        self.client = anthropic.Anthropic(
            api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.model = model

    def complete(self, system, messages, max_tokens=800, temperature=0.2):
        response = self.client.messages.create(
            model=self.model, max_tokens=max_tokens,
            temperature=temperature, system=system, messages=messages
        )
        return response.content[0].text


class OpenAIClient(LLMClient):
    def __init__(self, model: str = "gpt-4o-mini"):
        import openai
        self.client = openai.OpenAI(
            api_key=os.environ.get("OPENAI_API_KEY"))
        self.model = model

    def complete(self, system, messages, max_tokens=800, temperature=0.2):
        # OpenAI takes the system prompt as the first message
        all_msgs = [{"role": "system", "content": system}] + messages
        response = self.client.chat.completions.create(
            model=self.model, messages=all_msgs,
            max_tokens=max_tokens, temperature=temperature
        )
        return response.choices[0].message.content


def get_llm_client(provider: str = "anthropic", **kwargs) -> LLMClient:
    if provider == "anthropic":
        return ClaudeClient(**kwargs)
    elif provider == "openai":
        return OpenAIClient(**kwargs)
    raise ValueError(f"Unknown provider: {provider}")
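The clients above let any API exception propagate to the caller. If you want retries on top, one possible shape is a thin wrapper around anything that exposes `complete`. This is a sketch under assumptions: `RetryingClient` and `FlakyClient` are hypothetical names, and the wrapper catches every `Exception` for brevity, where real code would narrow that to the SDK's transient errors.

```python
import time
from typing import Callable, Dict, List


class RetryingClient:
    """Retries complete() with exponential backoff (hypothetical helper)."""

    def __init__(self, inner, max_attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep):
        self.inner = inner
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep  # injectable so tests do not actually wait

    def complete(self, system: str, messages: List[Dict], **kwargs) -> str:
        for attempt in range(1, self.max_attempts + 1):
            try:
                return self.inner.complete(system, messages, **kwargs)
            except Exception:
                if attempt == self.max_attempts:
                    raise  # out of attempts: surface the last error
                # Wait 1x, 2x, 4x ... base_delay before the next attempt
                self.sleep(self.base_delay * 2 ** (attempt - 1))


class FlakyClient:
    """Stand-in backend that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def complete(self, system, messages, **kwargs):
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError("temporary failure")
        return "ok"


delays: List[float] = []
client = RetryingClient(FlakyClient(failures=2), sleep=delays.append)
result = client.complete("system prompt", [{"role": "user", "content": "hi"}])
print(result, delays)  # ok [1.0, 2.0]
```

Because the wrapper only depends on the `complete` signature, it could wrap either `ClaudeClient` or `OpenAIClient` without changes to `RAGEngine`.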
rag_engine.py
# rag_engine.py
from typing import List, Dict, Tuple
from knowledge_base import KnowledgeBase
from llm_client import LLMClient

SYSTEM_PROMPT = """You are DocuMind, an AI research assistant.
Your job is to answer questions based ONLY on the provided document context.
Rules:
- Use only information from the provided context
- Cite sources using [1], [2] etc. at the end of relevant sentences
- If the answer is not in the context, say exactly: "I couldn't find information about this in the uploaded documents."
- Be precise and concise
- If multiple documents are relevant, synthesize them coherently
- Format responses with markdown when helpful"""


def build_context_block(retrieved: List[Dict]) -> str:
    parts = []
    for i, doc in enumerate(retrieved, 1):
        parts.append(
            f"[{i}] Source: {doc['source']} (page {doc['page']}, "
            f"relevance: {doc['score']:.0%})\n{doc['text']}"
        )
    return "\n\n".join(parts)


def build_rag_message(user_question: str,
                      retrieved: List[Dict]) -> str:
    context = build_context_block(retrieved)
    return (
        f"Context from documents:\n\n{context}\n\n"
        f"Question: {user_question}"
    )
class RAGEngine:
    def __init__(self, kb: KnowledgeBase, llm: LLMClient,
                 top_k: int = 5, min_score: float = 0.25):
        self.kb = kb
        self.llm = llm
        self.top_k = top_k
        self.min_score = min_score

    def answer(self, question: str,
               history: List[Dict] = None) -> Tuple[str, List[Dict]]:
        """
        Returns (answer_text, retrieved_docs)
        history: previous conversation messages in API format
        """
        retrieved = self.kb.retrieve(
            question, top_k=self.top_k, min_score=self.min_score)
        messages = list(history or [])
        if retrieved:
            rag_content = build_rag_message(question, retrieved)
            messages.append({"role": "user", "content": rag_content})
        else:
            # Nothing passed the score threshold: send the bare question and
            # let the system prompt's fallback rule apply
            messages.append({"role": "user", "content": question})
        answer = self.llm.complete(
            system = SYSTEM_PROMPT,
            messages = messages,
            max_tokens = 800,
            temperature = 0.2
        )
        return answer, retrieved
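`build_context_block` numbers the passages `[1]`, `[2]`, ... in retrieval order, and the system prompt asks the model to cite with the same markers. If you want to show only the passages the answer actually cites, a minimal sketch could parse those markers back out. `cited_docs` is a hypothetical helper and the sample answer and passages are invented. It assumes the model follows the `[n]` format, which is not guaranteed.

```python
import re
from typing import Dict, List


def cited_docs(answer: str, retrieved: List[Dict]) -> List[Dict]:
    """Return the retrieved passages whose [n] marker appears in the answer."""
    numbers = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    # Markers are 1-based; ignore numbers that point past the retrieved list
    return [retrieved[n - 1] for n in numbers if 1 <= n <= len(retrieved)]


# Invented sample data in the same shape KnowledgeBase.retrieve returns
sample_retrieved = [
    {"source": "a.pdf", "page": 1, "score": 0.81, "text": "..."},
    {"source": "b.pdf", "page": 4, "score": 0.64, "text": "..."},
    {"source": "a.pdf", "page": 9, "score": 0.52, "text": "..."},
]
sample_answer = "Attention replaces recurrence [1]. Training took three days [3]. See also [7]."

cited = cited_docs(sample_answer, sample_retrieved)
print([(d["source"], d["page"]) for d in cited])  # [('a.pdf', 1), ('a.pdf', 9)]
```

The out-of-range `[7]` is dropped silently here. Logging such markers instead is a cheap way to spot answers that cite passages the model never saw.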
conversation.py
# conversation.py
from dataclasses import dataclass, field
from typing import List, Dict
import time


@dataclass
class Turn:
    question: str
    answer: str
    sources: List[str]
    timestamp: float = field(default_factory=time.time)


class Conversation:
    def __init__(self, max_turns: int = 8):
        self.turns: List[Turn] = []
        self.max_turns: int = max_turns

    def add(self, question: str, answer: str, sources: List[str]):
        self.turns.append(Turn(question, answer, sources))

    def get_api_history(self) -> List[Dict]:
        """Return last max_turns as API message format."""
        recent = self.turns[-self.max_turns:]
        messages = []
        for turn in recent:
            messages.append({"role": "user", "content": turn.question})
            messages.append({"role": "assistant", "content": turn.answer})
        return messages

    def clear(self):
        self.turns = []

    def __len__(self):
        return len(self.turns)
app.py — The Streamlit Interface
# app.py
import streamlit as st
import tempfile
import os
from pathlib import Path
from config import CONFIG
from document_processor import process_document
from knowledge_base import KnowledgeBase
from llm_client import get_llm_client
from rag_engine import RAGEngine
from conversation import Conversation

st.set_page_config(
    page_title = "DocuMind - AI Research Assistant",
    page_icon = "📚",
    layout = "wide"
)


@st.cache_resource
def init_system():
    kb = KnowledgeBase(
        embed_model = CONFIG.EMBED_MODEL,
        persist_dir = CONFIG.CHROMA_PATH,
        collection_name = CONFIG.COLLECTION_NAME
    )
    llm = get_llm_client(
        provider = CONFIG.LLM_PROVIDER,
        model = (CONFIG.ANTHROPIC_MODEL
                 if CONFIG.LLM_PROVIDER == "anthropic"
                 else CONFIG.OPENAI_MODEL)
    )
    engine = RAGEngine(kb, llm, top_k=CONFIG.TOP_K_RETRIEVE)
    return kb, engine


kb, engine = init_system()
if "conversation" not in st.session_state:
    st.session_state.conversation = Conversation(CONFIG.MAX_HISTORY_TURNS)
if "messages_display" not in st.session_state:
    st.session_state.messages_display = []
with st.sidebar:
    st.title("📚 DocuMind")
    st.markdown("Upload documents to start asking questions.")
    st.divider()
    uploaded = st.file_uploader(
        "Upload documents",
        type = ["pdf", "txt", "md"],
        accept_multiple_files = True
    )
    if uploaded:
        # Streamlit reruns this script on every interaction, so skip files
        # that are already indexed instead of embedding them again.
        indexed = set(kb.get_sources())
        for f in uploaded:
            if f.name in indexed:
                continue
            with st.spinner(f"Processing {f.name}..."):
                with tempfile.NamedTemporaryFile(
                    delete=False, suffix=Path(f.name).suffix
                ) as tmp:
                    tmp.write(f.read())
                    tmp_path = tmp.name
                try:
                    chunks = process_document(tmp_path, CONFIG.CHUNK_SIZE, CONFIG.CHUNK_OVERLAP)
                    for chunk in chunks:
                        # Cite the uploaded file name, not the temp file name
                        chunk.source = f.name
                    added = kb.add_chunks(chunks)
                    st.success(f"✓ {f.name}: {added} chunks indexed")
                except Exception as e:
                    st.error(f"Error: {e}")
                finally:
                    os.unlink(tmp_path)
    sources = kb.get_sources()
    if sources:
        st.divider()
        st.subheader(f"📄 Documents ({len(sources)})")
        for src in sources:
            st.markdown(f"• {src}")
    st.divider()
    col1, col2 = st.columns(2)
    with col1:
        st.metric("Chunks", kb.document_count())
    with col2:
        st.metric("Turns", len(st.session_state.conversation))
    if st.button("🗑 Clear conversation"):
        st.session_state.conversation = Conversation(CONFIG.MAX_HISTORY_TURNS)
        st.session_state.messages_display = []
        st.rerun()
st.title("DocuMind - AI Research Assistant")
if not sources:
    st.info("👈 Upload a document in the sidebar to get started.")
else:
    # Replay the stored conversation on every rerun
    for msg in st.session_state.messages_display:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])
            if msg.get("sources"):
                with st.expander("📎 Sources"):
                    for src in msg["sources"]:
                        st.markdown(
                            f"**{src['source']}** (page {src['page']}) - "
                            f"relevance {src['score']:.0%}\n\n> {src['text'][:200]}..."
                        )
    if question := st.chat_input("Ask anything about your documents..."):
        st.session_state.messages_display.append(
            {"role": "user", "content": question})
        with st.chat_message("user"):
            st.markdown(question)
        with st.chat_message("assistant"):
            with st.spinner("Searching and generating answer..."):
                history = st.session_state.conversation.get_api_history()
                answer, retrieved = engine.answer(question, history=history)
            st.markdown(answer)
            if retrieved:
                with st.expander(f"📎 {len(retrieved)} source(s) used"):
                    for i, src in enumerate(retrieved, 1):
                        st.markdown(
                            f"**[{i}] {src['source']}** - page {src['page']} "
                            f"(relevance: {src['score']:.0%})"
                        )
                        st.caption(src["text"][:300] + "...")
                        st.divider()
        st.session_state.conversation.add(
            question = question,
            answer = answer,
            sources = [r["source"] for r in retrieved]
        )
        st.session_state.messages_display.append({
            "role": "assistant",
            "content": answer,
            "sources": retrieved
        })
requirements.txt
anthropic>=0.25.0
openai>=1.0.0
streamlit>=1.32.0
sentence-transformers>=2.7.0
chromadb>=0.4.24
pypdf>=4.0.0
langchain-text-splitters>=0.0.1
Deployment
# Local
streamlit run app.py
# Streamlit Community Cloud (free)
# 1. Push to GitHub
# 2. Go to share.streamlit.io
# 3. Connect repo → select app.py
# 4. Add secrets:
# ANTHROPIC_API_KEY = "sk-ant-..."
# or
# OPENAI_API_KEY = "sk-..."
# 5. Deploy
# Docker (for self-hosting)
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]
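Both clients read their key with `os.environ.get`, which returns `None` when the variable is missing, so a misconfigured deployment may only fail once the first question is asked. A fail-fast check at startup is one way to surface that earlier. This is a sketch: `require_api_key` and `KEY_FOR_PROVIDER` are hypothetical names, and the key in the demo is a dummy value.

```python
import os
from typing import Mapping

# Environment variable each provider's client reads (taken from llm_client.py)
KEY_FOR_PROVIDER = {
    "anthropic": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
}


def require_api_key(provider: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the provider's API key or raise a readable error (hypothetical helper)."""
    try:
        var = KEY_FOR_PROVIDER[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider}") from None
    key = env.get(var, "").strip()
    if not key:
        raise RuntimeError(
            f"{var} is not set; add it to your environment or deployment secrets")
    return key


# Demo with an explicit mapping instead of the real environment
print(require_api_key("openai", {"OPENAI_API_KEY": "sk-test"}))  # sk-test
```

Calling it once near the top of `app.py`, for example `require_api_key(CONFIG.LLM_PROVIDER)`, would turn a missing secret into an immediate, readable error.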
Evaluation: Did It Work?
def evaluate_documind(engine, test_cases):
    """
    test_cases: [{"question": str, "expected_keyword": str, "source": str}]
    """
    results = []
    for case in test_cases:
        answer, retrieved = engine.answer(case["question"])
        keyword_found = case["expected_keyword"].lower() in answer.lower()
        # Checks that the expected source was retrieved; it does not parse
        # the [n] citation markers in the answer text.
        source_cited = any(case["source"] in r["source"] for r in retrieved)
        # True when the model answered instead of returning the fallback
        # sentence. This measures coverage, not hallucination.
        answered = "couldn't find" not in answer.lower()
        results.append({
            "question": case["question"][:40],
            "keyword_found": keyword_found,
            "source_cited": source_cited,
            "answered": answered,
        })
    print("\nEvaluation Results:")
    print(f"{'Question':<42} {'Keyword':>8} {'Source':>7} {'Answered':>9}")
    print("=" * 70)
    for r in results:
        print(f"{r['question']:<42} "
              f"{'✓' if r['keyword_found'] else '✗':>8} "
              f"{'✓' if r['source_cited'] else '✗':>7} "
              f"{'✓' if r['answered'] else '✗':>9}")
    # max(..., 1) avoids a ZeroDivisionError on an empty test set
    acc = sum(r["keyword_found"] and r["source_cited"] for r in results) / max(len(results), 1)
    print(f"\nOverall accuracy: {acc:.0%}")
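Before spending API calls, it can help to dry-run the pass/fail logic against canned answers. The sketch below uses a stub in place of `RAGEngine` and applies the same two checks that feed the accuracy figure. `StubEngine`, `passes`, and all the sample questions and answers are invented for illustration.

```python
from typing import Dict, List, Tuple

FALLBACK = "I couldn't find information about this in the uploaded documents."


class StubEngine:
    """Stands in for RAGEngine and returns canned (answer, retrieved) pairs."""

    def __init__(self, canned: Dict[str, Tuple[str, List[Dict]]]):
        self.canned = canned

    def answer(self, question: str, history=None) -> Tuple[str, List[Dict]]:
        return self.canned.get(question, (FALLBACK, []))


def passes(case: Dict, engine) -> bool:
    """Keyword present in the answer AND expected source among retrieved chunks."""
    answer, retrieved = engine.answer(case["question"])
    keyword_found = case["expected_keyword"].lower() in answer.lower()
    source_found = any(case["source"] in r["source"] for r in retrieved)
    return keyword_found and source_found


# Invented test cases in the format evaluate_documind expects
test_cases = [
    {"question": "What optimizer was used?", "expected_keyword": "Adam", "source": "paper.pdf"},
    {"question": "Who funded the work?", "expected_keyword": "NSF", "source": "paper.pdf"},
]
engine = StubEngine({
    "What optimizer was used?": (
        "The model was trained with Adam [1].",
        [{"source": "paper.pdf", "page": 3, "score": 0.71, "text": "..."}],
    ),
})

outcomes = [passes(case, engine) for case in test_cases]
print(outcomes)  # [True, False]
```

Keyword matching is a coarse signal: it rewards an answer that mentions the word in the wrong context and penalizes a correct paraphrase, so treat the accuracy figure as a smoke test.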
Reference Links
print("Everything you need to go further:")
print()
refs = {
    "Core Libraries": [
        ("ChromaDB docs", "docs.trychroma.com"),
        ("Sentence Transformers", "sbert.net/docs/quickstart.html"),
        ("Streamlit docs", "docs.streamlit.io"),
        ("pypdf", "pypdf.readthedocs.io"),
        ("Anthropic Python SDK", "github.com/anthropics/anthropic-sdk-python"),
        ("OpenAI Python SDK", "github.com/openai/openai-python"),
    ],
    "Advanced RAG Patterns": [
        ("LlamaIndex RAG guide", "docs.llamaindex.ai/en/stable/use_cases/q_and_a"),
        ("LangChain RAG tutorial", "python.langchain.com/docs/use_cases/question_answering"),
        ("RAGAS evaluation framework", "docs.ragas.io"),
        ("Pinecone RAG guide", "pinecone.io/learn/retrieval-augmented-generation"),
    ],
    "Deployment": [
        ("Streamlit Community Cloud", "share.streamlit.io"),
        ("Render (free tier)", "render.com"),
        ("Railway", "railway.app"),
        ("HuggingFace Spaces", "huggingface.co/spaces"),
    ],
    "Making It Production-Grade": [
        ("LangSmith (LLM observability)", "smith.langchain.com"),
        ("Arize Phoenix (tracing)", "phoenix.arize.com"),
        ("Weights & Biases (logging)", "wandb.ai/site/llm"),
        ("TruLens (RAG evaluation)", "trulens.org"),
    ],
}
for category, links in refs.items():
    print(f" {category}:")
    for name, url in links:
        print(f" • {name:<40} {url}")
    print()
What Makes This Portfolio-Worthy
DocuMind demonstrates every Phase 8 concept in a working product:
Tokenization + Embeddings: chunking text then encoding with sentence-transformers. The foundation of everything.
Vector search: ChromaDB with cosine similarity. Semantic retrieval, not keyword matching.
RAG: retrieved context injected into the LLM prompt with source citations.
Conversation memory: multi-turn context carried through the session, limited to the most recent MAX_HISTORY_TURNS (8) turns.
LLM API: pluggable Claude or OpenAI backend behind a single LLMClient interface.
Deployment: live, shareable URL in minutes via Streamlit Community Cloud.
Add a strong README with a demo GIF and this belongs in the first 10 seconds of a portfolio review.
Phase 8 Complete
You now know how language models work from the inside out.
Tokenization → Embeddings → Attention → Transformer → BERT → GPT → HuggingFace → Fine-tuning → Vector Search → RAG → Chatbot → APIs → Deployed Application.
Phase 9 starts next: MLOps. How do you take a model and make it reliable, monitored, versioned, and scalable in production? Docker, FastAPI, CI/CD, model monitoring, A/B testing. The difference between a model that works and a product that works.
Try This
Build DocuMind completely. Not a modified version. This exact system.
Upload your university notes, a book you are reading, your company documentation, or the papers from this series. Ask it questions. Push it to GitHub. Deploy to Streamlit Cloud. Share the URL with someone.
Then extend it with one feature of your choice:
- Multi-language support (detect language, respond in same language)
- Document comparison mode (ask questions that compare two documents)
- Export conversation as PDF
- Integration with a second vector store for a persistent company knowledge base
- User authentication with per-user document storage
Your choice. Make it yours.