Akhilesh

Posted on May 20

90. Phase 8 Capstone: Build a Full AI Application

#ai #productivity #beginners #aiapp

Phase 8 has been building to this.

Tokenization. Embeddings. Transformers. BERT. GPT. HuggingFace. Fine-tuning. Vector search. RAG. Chatbot architecture. OpenAI API. Claude API.

Fourteen posts of foundations.

Now you ship something real.

This capstone builds DocuMind: an AI research assistant that lets users upload any PDF or text document and have intelligent conversations about it. It retrieves relevant passages, cites sources, maintains conversation context, and deploys as a shareable web application.

It is not the most complex system ever built. It is exactly complex enough to require every concept from Phase 8, and simple enough to build completely in one post.

What You Are Building

DocuMind: AI Research Assistant

Features:
  ✓ Upload PDFs and text files
  ✓ Automatic chunking and embedding
  ✓ Semantic search over uploaded documents
  ✓ Multi-turn conversation with memory
  ✓ Source citations on every answer
  ✓ Multi-document support
  ✓ Deployable Streamlit app

Tech stack:
  - sentence-transformers  (embeddings)
  - chromadb               (vector store)
  - anthropic / openai     (LLM)
  - streamlit              (UI)
  - pypdf                  (PDF parsing)
  - langchain text splitter (optional; the chunker below is a plain word-window splitter)

Project Structure

documind/
├── app.py                  # Streamlit application entry point
├── config.py               # Configuration and constants
├── document_processor.py   # PDF/text loading, chunking, embedding
├── knowledge_base.py       # ChromaDB vector store wrapper
├── conversation.py         # Multi-turn conversation management
├── llm_client.py           # Pluggable LLM backend (Claude or OpenAI)
├── rag_engine.py           # Retrieval-augmented generation pipeline
├── requirements.txt        # All dependencies
└── README.md               # Setup and usage instructions

config.py

# config.py
from dataclasses import dataclass

@dataclass
class Config:
    # Embedding model
    EMBED_MODEL:      str   = "all-MiniLM-L6-v2"
    EMBED_DIM:        int   = 384

    # Chunking
    CHUNK_SIZE:       int   = 512
    CHUNK_OVERLAP:    int   = 64

    # Retrieval
    TOP_K_RETRIEVE:   int   = 5
    TOP_K_RERANK:     int   = 3
    MIN_SCORE:        float = 0.25

    # Conversation
    MAX_HISTORY_TURNS: int  = 8

    # LLM
    LLM_PROVIDER:     str   = "anthropic"     # "anthropic" or "openai"
    ANTHROPIC_MODEL:  str   = "claude-3-5-haiku-20241022"
    OPENAI_MODEL:     str   = "gpt-4o-mini"
    MAX_TOKENS:       int   = 800
    TEMPERATURE:      float = 0.2

    # ChromaDB
    CHROMA_PATH:      str   = "./chroma_db"
    COLLECTION_NAME:  str   = "documind"

CONFIG = Config()

document_processor.py

# document_processor.py
import re
from pathlib import Path
from typing import List, Dict
from dataclasses import dataclass

@dataclass
class Chunk:
    text:       str
    source:     str
    page:       int
    chunk_idx:  int
    char_start: int

def load_text_file(filepath: str) -> str:
    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        return f.read()

def load_pdf(filepath: str) -> List[Dict]:
    """Return list of {text, page} dicts."""
    try:
        import pypdf
        pages = []
        with open(filepath, "rb") as f:
            reader = pypdf.PdfReader(f)
            for i, page in enumerate(reader.pages):
                text = page.extract_text() or ""
                if text.strip():
                    pages.append({"text": text, "page": i + 1})
        return pages
    except ImportError:
        raise ImportError("Install: pip install pypdf")

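def clean_text(text: str) -> str:
    # Collapse runs of spaces/tabs first. Collapsing all whitespace here
    # would erase the newlines that the next line is meant to normalize.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"(\n\s*){3,}", "\n\n", text)
    return text.strip()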

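def chunk_text(text: str, source: str, page: int = 1,
               chunk_size: int = 512, overlap: int = 64) -> List[Chunk]:
    """Sliding word-window splitting with overlap (sizes are in words)."""
    # A step of zero or less would make the loop below run forever.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    text    = clean_text(text)
    words   = text.split()
    chunks  = []
    idx     = 0
    step    = chunk_size - overlap

    while idx < len(words):
        chunk_words = words[idx:idx + chunk_size]
        chunk_str   = " ".join(chunk_words)   # renamed so it no longer shadows chunk_text()

        if len(chunk_str.strip()) > 50:
            chunks.append(Chunk(
                text      = chunk_str,
                source    = source,
                page      = page,
                chunk_idx = len(chunks),
                char_start= idx   # note: a word offset, despite the field name
            ))
        idx += step

    return chunks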

def process_document(filepath: str,
                     chunk_size: int = 512,
                     overlap: int    = 64) -> List[Chunk]:
    """Load and chunk any supported document type."""
    path = Path(filepath)
    all_chunks = []

    if path.suffix.lower() == ".pdf":
        pages = load_pdf(filepath)
        for page_data in pages:
            chunks = chunk_text(
                page_data["text"],
                source     = path.name,
                page       = page_data["page"],
                chunk_size = chunk_size,
                overlap    = overlap
            )
            all_chunks.extend(chunks)

    elif path.suffix.lower() in [".txt", ".md", ".rst"]:
        text   = load_text_file(filepath)
        chunks = chunk_text(
            text,
            source     = path.name,
            page       = 1,
            chunk_size = chunk_size,
            overlap    = overlap
        )
        all_chunks.extend(chunks)

    else:
        raise ValueError(f"Unsupported file type: {path.suffix}")

    print(f"  Processed {path.name}: {len(all_chunks)} chunks")
    return all_chunks
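
To see how `chunk_size` and `overlap` interact, here is a minimal standalone sketch of the same word-window idea. `window_chunks` is a hypothetical helper, not part of DocuMind, and it works on a bare list of words rather than `Chunk` objects.

```python
from typing import List

def window_chunks(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size words forward by (chunk_size - overlap)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    return [words[i:i + chunk_size] for i in range(0, len(words), step)]

# Ten toy "words", windows of 4 with 1 word shared between neighbours
words  = [f"w{i}" for i in range(10)]
chunks = window_chunks(words, chunk_size=4, overlap=1)

for c in chunks:
    print(c)
```

Each window starts 3 words after the previous one, so the last word of one chunk is the first word of the next. With the defaults in config.py (512 and 64) the step is 448 words. The final window can be very short; the real `chunk_text` drops any chunk of 50 characters or fewer.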

knowledge_base.py

# knowledge_base.py
import uuid
from typing import List, Dict, Optional
from sentence_transformers import SentenceTransformer
import chromadb
from document_processor import Chunk

class KnowledgeBase:
    def __init__(self, embed_model: str, persist_dir: str, collection_name: str):
        self.embedder   = SentenceTransformer(embed_model)
        self.client     = chromadb.PersistentClient(path=persist_dir)
        self.collection = self.client.get_or_create_collection(
            name     = collection_name,
            metadata = {"hnsw:space": "cosine"}
        )
        self._source_cache = {}

    def add_chunks(self, chunks: List[Chunk]) -> int:
        if not chunks:
            return 0

        texts      = [c.text for c in chunks]
        embeddings = self.embedder.encode(texts, show_progress_bar=True).tolist()

        ids        = [str(uuid.uuid4()) for _ in chunks]
        metadatas  = [
            {"source":    c.source,
             "page":      c.page,
             "chunk_idx": c.chunk_idx}
            for c in chunks
        ]

        self.collection.add(
            ids        = ids,
            embeddings = embeddings,
            documents  = texts,
            metadatas  = metadatas
        )

        for chunk in chunks:
            self._source_cache[chunk.source] = True

        return len(chunks)

    def retrieve(self, query: str, top_k: int = 5,
                 min_score: float = 0.25,
                 filter_source: Optional[str] = None) -> List[Dict]:
        query_emb  = self.embedder.encode([query]).tolist()
        where      = {"source": filter_source} if filter_source else None

        results = self.collection.query(
            query_embeddings = query_emb,
            n_results        = top_k,
            where            = where,
            include          = ["documents", "metadatas", "distances"]
        )

        retrieved = []
        for text, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        ):
            score = 1 - dist
            if score >= min_score:
                retrieved.append({
                    "text":   text,
                    "source": meta["source"],
                    "page":   meta["page"],
                    "score":  round(score, 4)
                })

        return retrieved

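    def get_sources(self) -> List[str]:
        # The Chroma collection persists on disk but this cache does not,
        # so rebuild it from stored metadata after a restart.
        if not self._source_cache and self.collection.count() > 0:
            metas = self.collection.get(include=["metadatas"])["metadatas"]
            self._source_cache = {m["source"]: True for m in metas}
        return list(self._source_cache.keys())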

    def document_count(self) -> int:
        return self.collection.count()

    def clear(self):
        self.client.delete_collection(self.collection.name)
        self.collection = self.client.get_or_create_collection(
            name=self.collection.name,
            metadata={"hnsw:space": "cosine"}
        )
        self._source_cache = {}
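
The `score = 1 - dist` line in `retrieve` assumes the collection uses cosine space, where the reported distance is 1 minus cosine similarity. A minimal pure-Python sketch of that relationship follows; the two-dimensional vectors are made up for illustration and stand in for real 384-dimensional embeddings.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot    = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumed convention for a cosine-space index: distance = 1 - similarity
    return 1.0 - cosine_similarity(a, b)

query          = [1.0, 0.0]
same_direction = [2.0, 0.0]   # parallel to the query, different length
orthogonal     = [0.0, 3.0]   # unrelated direction

score_same = 1 - cosine_distance(query, same_direction)
score_orth = 1 - cosine_distance(query, orthogonal)
print(score_same, score_orth)
```

Vector length does not matter, only direction. A `min_score` of 0.25 therefore discards chunks whose embeddings point in a mostly unrelated direction to the query.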

llm_client.py

# llm_client.py
import os
from abc import ABC, abstractmethod
from typing import List, Dict

class LLMClient(ABC):
    @abstractmethod
    def complete(self, system: str, messages: List[Dict],
                 max_tokens: int = 800, temperature: float = 0.2) -> str:
        pass

class ClaudeClient(LLMClient):
    def __init__(self, model: str = "claude-3-5-haiku-20241022"):
        import anthropic
        self.client = anthropic.Anthropic(
            api_key=os.environ.get("ANTHROPIC_API_KEY"))
        self.model  = model

    def complete(self, system, messages, max_tokens=800, temperature=0.2):
        response = self.client.messages.create(
            model=self.model, max_tokens=max_tokens,
            temperature=temperature, system=system, messages=messages
        )
        return response.content[0].text

class OpenAIClient(LLMClient):
    def __init__(self, model: str = "gpt-4o-mini"):
        import openai
        self.client = openai.OpenAI(
            api_key=os.environ.get("OPENAI_API_KEY"))
        self.model  = model

    def complete(self, system, messages, max_tokens=800, temperature=0.2):
        all_msgs = [{"role": "system", "content": system}] + messages
        response = self.client.chat.completions.create(
            model=self.model, messages=all_msgs,
            max_tokens=max_tokens, temperature=temperature
        )
        return response.choices[0].message.content

def get_llm_client(provider: str = "anthropic", **kwargs) -> LLMClient:
    if provider == "anthropic":
        return ClaudeClient(**kwargs)
    elif provider == "openai":
        return OpenAIClient(**kwargs)
    raise ValueError(f"Unknown provider: {provider}")
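
Because every backend sits behind the same `complete` method, you can swap in a fake client for offline testing with no API key. The sketch below repeats the abstract interface so it runs on its own; `EchoClient` is a hypothetical stand-in, not part of DocuMind.

```python
from abc import ABC, abstractmethod
from typing import Dict, List

class LLMClient(ABC):
    # Same interface as llm_client.py, repeated so this sketch is standalone
    @abstractmethod
    def complete(self, system: str, messages: List[Dict],
                 max_tokens: int = 800, temperature: float = 0.2) -> str:
        pass

class EchoClient(LLMClient):
    """Hypothetical offline backend: echoes the most recent user message."""
    def complete(self, system, messages, max_tokens=800, temperature=0.2):
        last_user = next(
            (m["content"] for m in reversed(messages) if m["role"] == "user"),
            ""
        )
        return f"[echo] {last_user}"

client = EchoClient()
reply  = client.complete(
    system   = "You are a test double.",
    messages = [{"role": "user", "content": "What is RAG?"}]
)
print(reply)
```

Anything that accepts an `LLMClient`, such as the `RAGEngine` in the next file, should work with a stub like this, which keeps retrieval tests free and deterministic.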

rag_engine.py

# rag_engine.py
from typing import List, Dict, Tuple
from knowledge_base import KnowledgeBase
from llm_client import LLMClient

SYSTEM_PROMPT = """You are DocuMind, an AI research assistant.
Your job is to answer questions based ONLY on the provided document context.

Rules:
- Use only information from the provided context
- Cite sources using [1], [2] etc. at the end of relevant sentences
- If the answer is not in the context, say exactly: "I couldn't find information about this in the uploaded documents."
- Be precise and concise
- If multiple documents are relevant, synthesize them coherently
- Format responses with markdown when helpful"""

def build_context_block(retrieved: List[Dict]) -> str:
    parts = []
    for i, doc in enumerate(retrieved, 1):
        parts.append(
            f"[{i}] Source: {doc['source']} (page {doc['page']}, "
            f"relevance: {doc['score']:.0%})\n{doc['text']}"
        )
    return "\n\n".join(parts)

def build_rag_message(user_question: str,
                       retrieved: List[Dict]) -> str:
    context = build_context_block(retrieved)
    return (
        f"Context from documents:\n\n{context}\n\n"
        f"Question: {user_question}"
    )

class RAGEngine:
    def __init__(self, kb: KnowledgeBase, llm: LLMClient,
                 top_k: int = 5, min_score: float = 0.25):
        self.kb        = kb
        self.llm       = llm
        self.top_k     = top_k
        self.min_score = min_score

    def answer(self, question: str,
               history: List[Dict] = None) -> Tuple[str, List[Dict]]:
        """
        Returns (answer_text, retrieved_docs)
        history: previous conversation messages in API format
        """
        retrieved = self.kb.retrieve(
            question, top_k=self.top_k, min_score=self.min_score)

        messages = list(history or [])

        if retrieved:
            rag_content = build_rag_message(question, retrieved)
            messages.append({"role": "user", "content": rag_content})
        else:
            messages.append({"role": "user", "content": question})

        answer = self.llm.complete(
            system    = SYSTEM_PROMPT,
            messages  = messages,
            max_tokens= 800,
            temperature= 0.2
        )

        return answer, retrieved
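
To make the prompt shape concrete, the sketch below rebuilds the final user message for one made-up retrieved passage. The formatting logic is copied from `build_context_block` and `build_rag_message` above so the snippet runs on its own; the file name, page, score, and text are invented for illustration.

```python
from typing import Dict, List

def build_context_block(retrieved: List[Dict]) -> str:
    parts = []
    for i, doc in enumerate(retrieved, 1):
        parts.append(
            f"[{i}] Source: {doc['source']} (page {doc['page']}, "
            f"relevance: {doc['score']:.0%})\n{doc['text']}"
        )
    return "\n\n".join(parts)

def build_rag_message(user_question: str, retrieved: List[Dict]) -> str:
    context = build_context_block(retrieved)
    return f"Context from documents:\n\n{context}\n\nQuestion: {user_question}"

# One invented retrieval result, in the shape KnowledgeBase.retrieve returns
sample = [{
    "source": "notes.pdf",
    "page":   3,
    "score":  0.8123,
    "text":   "Attention weighs every token against every other token.",
}]

prompt = build_rag_message("What is attention?", sample)
print(prompt)
```

The `[1]` label in the context block is what the system prompt asks the model to cite. Only the current turn carries a context block: the history passed in holds the bare questions and answers, so earlier passages are not resent.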

conversation.py

# conversation.py
from dataclasses import dataclass, field
from typing import List, Dict
import time

@dataclass
class Turn:
    question:  str
    answer:    str
    sources:   List[str]
    timestamp: float = field(default_factory=time.time)

class Conversation:
    def __init__(self, max_turns: int = 8):
        self.turns:     List[Turn] = []
        self.max_turns: int        = max_turns

    def add(self, question: str, answer: str, sources: List[str]):
        self.turns.append(Turn(question, answer, sources))

    def get_api_history(self) -> List[Dict]:
        """Return last max_turns as API message format."""
        recent = self.turns[-self.max_turns:]
        messages = []
        for turn in recent:
            messages.append({"role": "user",      "content": turn.question})
            messages.append({"role": "assistant",  "content": turn.answer})
        return messages

    def clear(self):
        self.turns = []

    def __len__(self):
        return len(self.turns)
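
The history window is easiest to see with more turns than the limit. This is a minimal sketch of the same slicing that `get_api_history` does, using plain tuples instead of `Turn` objects; `last_turns_as_messages` is a hypothetical helper for illustration.

```python
from typing import Dict, List, Tuple

def last_turns_as_messages(turns: List[Tuple[str, str]],
                           max_turns: int) -> List[Dict]:
    """Keep only the most recent max_turns (question, answer) pairs."""
    messages = []
    for question, answer in turns[-max_turns:]:
        messages.append({"role": "user",      "content": question})
        messages.append({"role": "assistant", "content": answer})
    return messages

# Ten toy turns, but only the last eight are replayed to the LLM
turns    = [(f"q{i}", f"a{i}") for i in range(1, 11)]
messages = last_turns_as_messages(turns, max_turns=8)

print(len(messages))
print(messages[0])
```

Turns 1 and 2 fall out of the window, so the model no longer sees them. One caveat of this slicing: `max_turns=0` does not disable history, because `turns[-0:]` returns the whole list.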

app.py — The Streamlit Interface

# app.py
import streamlit as st
import tempfile
import os
from pathlib import Path

from config import CONFIG
from document_processor import process_document
from knowledge_base import KnowledgeBase
from llm_client import get_llm_client
from rag_engine import RAGEngine
from conversation import Conversation

st.set_page_config(
    page_title = "DocuMind — AI Research Assistant",
    page_icon  = "📚",
    layout     = "wide"
)

@st.cache_resource
def init_system():
    kb  = KnowledgeBase(
        embed_model     = CONFIG.EMBED_MODEL,
        persist_dir     = CONFIG.CHROMA_PATH,
        collection_name = CONFIG.COLLECTION_NAME
    )
    llm = get_llm_client(
        provider = CONFIG.LLM_PROVIDER,
        model    = (CONFIG.ANTHROPIC_MODEL
                    if CONFIG.LLM_PROVIDER == "anthropic"
                    else CONFIG.OPENAI_MODEL)
    )
    engine = RAGEngine(kb, llm, top_k=CONFIG.TOP_K_RETRIEVE,
                       min_score=CONFIG.MIN_SCORE)
    return kb, engine

kb, engine = init_system()

if "conversation" not in st.session_state:
    st.session_state.conversation = Conversation(CONFIG.MAX_HISTORY_TURNS)
if "messages_display" not in st.session_state:
    st.session_state.messages_display = []

with st.sidebar:
    st.title("📚 DocuMind")
    st.markdown("Upload documents to start asking questions.")
    st.divider()

    uploaded = st.file_uploader(
        "Upload documents",
        type    = ["pdf", "txt", "md"],
        accept_multiple_files = True
    )

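    if "processed_files" not in st.session_state:
        st.session_state.processed_files = set()

    if uploaded:
        for f in uploaded:
            # Streamlit reruns this script on every interaction. Skip files
            # already indexed in this session so chunks are not duplicated.
            file_key = (f.name, f.size)
            if file_key in st.session_state.processed_files:
                continue

            with st.spinner(f"Processing {f.name}..."):
                with tempfile.NamedTemporaryFile(
                    delete=False, suffix=Path(f.name).suffix
                ) as tmp:
                    tmp.write(f.getvalue())
                    tmp_path = tmp.name

                try:
                    chunks = process_document(tmp_path, CONFIG.CHUNK_SIZE, CONFIG.CHUNK_OVERLAP)
                    for c in chunks:
                        c.source = f.name   # cite the uploaded name, not the temp file's
                    added  = kb.add_chunks(chunks)
                    st.session_state.processed_files.add(file_key)
                    st.success(f"✓ {f.name}: {added} chunks indexed")
                except Exception as e:
                    st.error(f"Error: {e}")
                finally:
                    os.unlink(tmp_path)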

    sources = kb.get_sources()
    if sources:
        st.divider()
        st.subheader(f"📄 Documents ({len(sources)})")
        for src in sources:
            st.markdown(f"• {src}")

    st.divider()
    col1, col2 = st.columns(2)
    with col1:
        st.metric("Chunks", kb.document_count())
    with col2:
        st.metric("Turns", len(st.session_state.conversation))

    if st.button("🗑 Clear conversation"):
        st.session_state.conversation = Conversation(CONFIG.MAX_HISTORY_TURNS)
        st.session_state.messages_display = []
        st.rerun()

st.title("DocuMind — AI Research Assistant")

if not sources:
    st.info("👈 Upload a document in the sidebar to get started.")
else:
    for msg in st.session_state.messages_display:
        with st.chat_message(msg["role"]):
            st.markdown(msg["content"])
            if msg.get("sources"):
                with st.expander("📎 Sources"):
                    for src in msg["sources"]:
                        st.markdown(
                            f"**{src['source']}** (page {src['page']}) — "
                            f"relevance {src['score']:.0%}\n\n> {src['text'][:200]}..."
                        )

    if question := st.chat_input("Ask anything about your documents..."):
        st.session_state.messages_display.append(
            {"role": "user", "content": question})
        with st.chat_message("user"):
            st.markdown(question)

        with st.chat_message("assistant"):
            with st.spinner("Searching and generating answer..."):
                history = st.session_state.conversation.get_api_history()
                answer, retrieved = engine.answer(question, history=history)

            st.markdown(answer)

            if retrieved:
                with st.expander(f"📎 {len(retrieved)} source(s) used"):
                    for i, src in enumerate(retrieved, 1):
                        st.markdown(
                            f"**[{i}] {src['source']}** — page {src['page']} "
                            f"(relevance: {src['score']:.0%})"
                        )
                        st.caption(src["text"][:300] + "...")
                        st.divider()

        st.session_state.conversation.add(
            question = question,
            answer   = answer,
            sources  = [r["source"] for r in retrieved]
        )
        st.session_state.messages_display.append({
            "role":    "assistant",
            "content": answer,
            "sources": retrieved
        })

requirements.txt

anthropic>=0.25.0
openai>=1.0.0
streamlit>=1.32.0
sentence-transformers>=2.7.0
chromadb>=0.4.24
pypdf>=4.0.0
langchain-text-splitters>=0.0.1

Deployment

# Local
streamlit run app.py

# Streamlit Community Cloud (free)
# 1. Push to GitHub
# 2. Go to share.streamlit.io
# 3. Connect repo → select app.py
# 4. Add secrets:
#    ANTHROPIC_API_KEY = "sk-ant-..."
#    or
#    OPENAI_API_KEY = "sk-..."
# 5. Deploy

# Docker (for self-hosting): save the lines below as a file named Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.port=8501", "--server.address=0.0.0.0"]

Evaluation: Did It Work?

def evaluate_documind(engine, test_cases):
    """
    test_cases: [{"question": str, "expected_keyword": str, "source": str}]
    """
    if not test_cases:
        print("No test cases provided.")
        return []

    results = []
    for case in test_cases:
        answer, retrieved = engine.answer(case["question"])

        keyword_found = case["expected_keyword"].lower() in answer.lower()
        # True when the expected file is among the retrieved chunks. This
        # checks retrieval, not whether the answer text cites it.
        source_retrieved = any(case["source"] in r["source"] for r in retrieved)
        # True when the model did not fall back to the refusal phrase from
        # SYSTEM_PROMPT. It does not detect hallucination.
        answered = "couldn't find" not in answer.lower()

        results.append({
            "question":         case["question"][:40],
            "keyword_found":    keyword_found,
            "source_retrieved": source_retrieved,
            "answered":         answered,
        })

    print("\nEvaluation Results:")
    print(f"{'Question':<42} {'Keyword':>8} {'Source':>7} {'Answered':>9}")
    print("=" * 70)
    for r in results:
        print(f"{r['question']:<42} "
              f"{'✓' if r['keyword_found'] else '✗':>8} "
              f"{'✓' if r['source_retrieved'] else '✗':>7} "
              f"{'✓' if r['answered'] else '✗':>9}")

    acc = sum(r["keyword_found"] and r["source_retrieved"] for r in results) / len(results)
    print(f"\nOverall accuracy: {acc:.0%}")
    return results
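
Here is a minimal sketch of what test cases look like and how the accuracy number comes out. To keep it runnable without API keys it uses `StubEngine`, a hypothetical stand-in that returns canned answers in the same `(answer, retrieved)` shape as `RAGEngine.answer`. The questions, file names, and answers are invented.

```python
from typing import Dict, List, Tuple

class StubEngine:
    """Hypothetical stand-in for RAGEngine: canned answers, no API calls."""
    def __init__(self, canned: Dict[str, Tuple[str, List[Dict]]]):
        self.canned = canned

    def answer(self, question: str, history=None) -> Tuple[str, List[Dict]]:
        return self.canned[question]

canned = {
    "What is the chunk size?": (
        "The default chunk size is 512 words [1].",
        [{"source": "config_notes.txt", "page": 1, "score": 0.71,
          "text": "CHUNK_SIZE defaults to 512."}],
    ),
    "Who wrote the report?": (
        "I couldn't find information about this in the uploaded documents.",
        [],
    ),
}

test_cases = [
    {"question": "What is the chunk size?",
     "expected_keyword": "512", "source": "config_notes.txt"},
    {"question": "Who wrote the report?",
     "expected_keyword": "Akira", "source": "report.pdf"},
]

engine = StubEngine(canned)
hits   = []
for case in test_cases:
    answer, retrieved = engine.answer(case["question"])
    keyword_found = case["expected_keyword"].lower() in answer.lower()
    source_found  = any(case["source"] in r["source"] for r in retrieved)
    hits.append(keyword_found and source_found)

accuracy = sum(hits) / len(hits)
print(f"Accuracy: {accuracy:.0%}")
```

In a real run you would build `test_cases` from questions whose answers you already know in your own documents, then pass them with the real engine to `evaluate_documind`. Keyword matching is a coarse check, so treat the number as a smoke test rather than a quality score.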

Reference Links

print("Everything you need to go further:")
print()

refs = {
    "Core Libraries": [
        ("ChromaDB docs",              "docs.trychroma.com"),
        ("Sentence Transformers",      "sbert.net/docs/quickstart.html"),
        ("Streamlit docs",             "docs.streamlit.io"),
        ("pypdf",                      "pypdf.readthedocs.io"),
        ("Anthropic Python SDK",       "github.com/anthropics/anthropic-sdk-python"),
        ("OpenAI Python SDK",          "github.com/openai/openai-python"),
    ],
    "Advanced RAG Patterns": [
        ("LlamaIndex RAG guide",       "docs.llamaindex.ai/en/stable/use_cases/q_and_a"),
        ("LangChain RAG tutorial",     "python.langchain.com/docs/use_cases/question_answering"),
        ("RAGAS evaluation framework", "docs.ragas.io"),
        ("Pinecone RAG guide",         "pinecone.io/learn/retrieval-augmented-generation"),
    ],
    "Deployment": [
        ("Streamlit Community Cloud",  "share.streamlit.io"),
        ("Render (free tier)",         "render.com"),
        ("Railway",                    "railway.app"),
        ("HuggingFace Spaces",         "huggingface.co/spaces"),
    ],
    "Making It Production-Grade": [
        ("LangSmith (LLM observability)", "smith.langchain.com"),
        ("Arize Phoenix (tracing)",    "phoenix.arize.com"),
        ("Weights & Biases (logging)", "wandb.ai/site/llm"),
        ("TruLens (RAG evaluation)",   "trulens.org"),
    ],
}

for category, links in refs.items():
    print(f"  {category}:")
    for name, url in links:
        print(f"    • {name:<40} {url}")
    print()

What Makes This Portfolio-Worthy

DocuMind demonstrates every Phase 8 concept in a working product:

Tokenization + Embeddings: chunking text then encoding with sentence-transformers. The foundation of everything.

Vector search: ChromaDB with cosine similarity. Semantic retrieval, not keyword matching.

RAG: retrieved context injected into the LLM prompt with source citations.

Conversation memory: the most recent MAX_HISTORY_TURNS turns (8 by default) are replayed to the LLM on every request.

LLM API: pluggable Claude or OpenAI backend behind a single LLMClient interface.

Deployment: live, shareable URL in minutes via Streamlit Community Cloud.

Add a strong README with a demo GIF and this belongs in the first 10 seconds of a portfolio review.

Phase 8 Complete

You now know how language models work from the inside out.

Tokenization → Embeddings → Attention → Transformer → BERT → GPT → HuggingFace → Fine-tuning → Vector Search → RAG → Chatbot → APIs → Deployed Application.

Phase 9 starts next: MLOps. How do you take a model and make it reliable, monitored, versioned, and scalable in production? Docker, FastAPI, CI/CD, model monitoring, A/B testing. The difference between a model that works and a product that works.

Try This

Build DocuMind completely. Not a modified version. This exact system.

Upload your university notes, a book you are reading, your company documentation, or the papers from this series. Ask it questions. Push it to GitHub. Deploy to Streamlit Cloud. Share the URL with someone.

Then extend it with one feature of your choice:

Multi-language support (detect language, respond in same language)
Document comparison mode (ask questions that compare two documents)
Export conversation as PDF
Integration with a second vector store for a persistent company knowledge base
User authentication with per-user document storage

Your choice. Make it yours.
