You have spent eight posts in Phase 8 learning the theory.
Tokenization. Embeddings. Attention. Transformers. BERT. GPT. HuggingFace. Fine-tuning. Vector search. RAG.
Now you build the thing that uses all of it.
A chatbot with:
- Conversation memory (it remembers what you said earlier in the conversation)
- RAG over a knowledge base (it knows your specific documents)
- A clean chat interface (Streamlit, deployable in minutes)
- Support for OpenAI, Claude, and local models
By the end of this post you will have a working AI assistant you can deploy and use. Not a toy. Not a demo. Something real.
Architecture Overview
print("Complete Chatbot Architecture:")
print()
print("Layer 1: User Interface (Streamlit)")
print(" Web chat interface, message history display, file upload")
print()
print("Layer 2: Conversation Management")
print(" Store and format message history")
print(" Maintain context window (last N turns)")
print(" Handle system prompt injection")
print()
print("Layer 3: Retrieval (optional RAG)")
print(" Detect if query needs knowledge base lookup")
print(" Retrieve relevant documents")
print(" Inject into prompt as context")
print()
print("Layer 4: LLM Backend (pluggable)")
print(" OpenAI GPT-3.5/GPT-4")
print(" Anthropic Claude")
print(" Local Ollama (Llama 3, Mistral, etc.)")
print()
print("The layers are independent. Swap any one without changing the others.")
Layer 1: Conversation Management
from dataclasses import dataclass, field
from typing import List, Optional, Dict
import json
import time
@dataclass
class Message:
    role: str
    content: str
    timestamp: float = field(default_factory=time.time)
    metadata: Dict = field(default_factory=dict)


class ConversationMemory:
    """
    Manages conversation history with a sliding context window.
    Caps the number of turns sent to the model, which helps stay
    within its context limit (it does not count tokens).
    """

    def __init__(self, max_turns: int = 10, system_prompt: str = ""):
        self.messages: List[Message] = []
        self.max_turns = max_turns
        self.system_prompt = system_prompt

    def add(self, role: str, content: str, metadata: Optional[Dict] = None):
        self.messages.append(Message(
            role=role,
            content=content,
            metadata=metadata or {}
        ))

    def get_history(self) -> List[Dict]:
        """Return the last max_turns turns (user + assistant pairs) in API format."""
        # One turn is two messages, so keep the last max_turns * 2 messages.
        recent = self.messages[-self.max_turns * 2:]
        return [{"role": m.role, "content": m.content} for m in recent]

    def build_messages(self) -> List[Dict]:
        """Build the complete messages list for the API call."""
        messages = []
        if self.system_prompt:
            messages.append({"role": "system", "content": self.system_prompt})
        messages.extend(self.get_history())
        return messages

    def clear(self):
        self.messages = []

    def export(self) -> str:
        return json.dumps([{
            "role": m.role,
            "content": m.content,
            "timestamp": m.timestamp
        } for m in self.messages], indent=2)

    def __len__(self):
        return len(self.messages)

system_prompt = """You are a helpful AI assistant for our company.
You have access to our internal knowledge base through the context provided.
Always be precise, cite your sources when using knowledge base information,
and say 'I don't have that information' when you don't know something."""
memory = ConversationMemory(max_turns=10, system_prompt=system_prompt)
memory.add("user", "What is our Q3 revenue?")
memory.add("assistant", "According to the Q3 Financial Report, revenue was $4.2M, up 23%.")
memory.add("user", "What drove the subscription growth?")
print("Conversation history:")
for msg in memory.get_history():
    print(f" [{msg['role']:<10}]: {msg['content'][:60]}")
print(f"\nContext window: {len(memory)} messages stored, last {memory.max_turns*2} sent to API")
print()
print("Full messages for API:")
for msg in memory.build_messages():
    print(f" role={msg['role']:<12} content={msg['content'][:50]}...")
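A turn-based window limits how many messages are sent, but a few very long messages can still overflow the model's context. One way to guard against that is to trim by an approximate token budget instead. The sketch below is only an illustration: `estimate_tokens` and `trim_to_budget` are hypothetical helpers that are not part of the class above, and the "about 4 characters per token" rule is a rough assumption for English text. A real tokenizer for your model would give more accurate counts.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Rough estimate (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def trim_to_budget(messages: List[Dict], max_tokens: int) -> List[Dict]:
    """Keep system messages plus the newest messages that fit in max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    # System messages are always kept, so subtract their cost first.
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept: List[Dict] = []
    for m in reversed(rest):  # walk from newest to oldest
        cost = estimate_tokens(m["content"])
        if cost > budget:
            break  # stop at the first message that does not fit
        kept.append(m)
        budget -= cost

    return system + kept[::-1]  # restore chronological order


history = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "a" * 40},
    {"role": "assistant", "content": "b" * 40},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_to_budget(history, max_tokens=25)
print([m["role"] for m in trimmed])
```

With this budget the oldest user message is dropped while the system prompt and the two newest messages survive. The output of `build_messages()` could be passed through a function like this before calling the backend.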
Layer 2: RAG Integration
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re


class KnowledgeBase:
    """Simple in-memory knowledge base with semantic search."""

    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.documents = []
        self.embeddings = None

    def add_documents(self, docs: List[Dict]):
        self.documents.extend(docs)
        texts = [d["text"] for d in docs]
        new_emb = self.model.encode(texts, show_progress_bar=False)
        # Stack new embeddings under the existing ones so row i matches documents[i].
        self.embeddings = (new_emb if self.embeddings is None
                           else np.vstack([self.embeddings, new_emb]))

    def retrieve(self, query: str, top_k: int = 3,
                 min_score: float = 0.3) -> List[Dict]:
        if not self.documents:
            return []
        query_emb = self.model.encode([query])
        scores = cosine_similarity(query_emb, self.embeddings)[0]
        # Highest-scoring documents first, then drop anything below min_score.
        indices = np.argsort(scores)[::-1][:top_k]
        return [
            {**self.documents[i], "score": float(scores[i])}
            for i in indices
            if scores[i] >= min_score
        ]

    def needs_retrieval(self, query: str) -> bool:
        """Simple heuristic: retrieve if query seems specific."""
        specific_patterns = [
            r"\$", r"\d+%", r"Q[1-4]", r"how much", r"what is the",
            r"when did", r"who is", r"policy", r"price", r"cost",
            r"support", r"refund", r"plan", r"feature"
        ]
        # re.IGNORECASE handles case, so the query does not need lowercasing.
        return any(re.search(p, query, re.IGNORECASE)
                   for p in specific_patterns)

kb = KnowledgeBase()
kb.add_documents([
{"text": "Q3 2024 revenue: $4.2M (+23% YoY). Subscriptions: $3.1M (+31%).",
"source": "Q3 Financial Report"},
{"text": "Refund policy: 30 days for physical products. Digital products non-refundable.",
"source": "Customer Policy"},
{"text": "Premium plan: $49/month. Unlimited API calls. Priority support.",
"source": "Pricing"},
{"text": "System requirements: Python 3.8+, torch>=2.0, 8GB VRAM for fine-tuning.",
"source": "Technical Docs"},
{"text": "CEO: Sarah Chen (since 2023). CTO: James Park (since 2019). HQ: San Francisco.",
"source": "Company Overview"},
{"text": "Support: email support@company.com. Premium: 2hr response. Free: 24hr.",
"source": "Support Guide"},
])
test_queries = [
("What was Q3 revenue?", True),
("Tell me a joke", False),
("What does the premium plan cost?", True),
("How are you today?", False),
]
print("Retrieval trigger detection:")
for query, expected in test_queries:
    detected = kb.needs_retrieval(query)
    match = "✓" if detected == expected else "✗"
    print(f" {match} '{query}' → needs_retrieval={detected}")
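The keyword heuristic above matches substrings, so a pattern like `plan` also fires on "planet". If that becomes a problem, one option is to add word boundaries and compile the patterns once. The version below is a sketch under that assumption: the standalone `needs_retrieval` function and the `_TRIGGER` name are hypothetical, and the keyword list simply mirrors the one in `KnowledgeBase`.

```python
import re

# Same keywords as the heuristic above, with \b word boundaries added
# and an optional plural "s" on the single-word triggers.
_TRIGGER = re.compile(
    r"\$|\d+%|\bq[1-4]\b"
    r"|\bhow much\b|\bwhat is the\b|\bwhen did\b|\bwho is\b"
    r"|\b(?:policy|price|cost|support|refund|plan|feature)s?\b",
    re.IGNORECASE,
)


def needs_retrieval(query: str) -> bool:
    """Return True if the query contains any trigger as a whole word."""
    return _TRIGGER.search(query) is not None


for q in ["What was Q3 revenue?", "Explain the planet Mars", "Tell me a joke"]:
    print(f"{q!r}: {needs_retrieval(q)}")
```

This is still a keyword heuristic, so it will misfire on sentences like "I support that idea". Retrieving on every query and relying on `min_score` to filter weak matches is a common alternative.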
Layer 3: LLM Backend (Pluggable)
from abc import ABC, abstractmethod
class LLMBackend(ABC):
    @abstractmethod
    def complete(self, messages: List[Dict], **kwargs) -> str:
        pass


class OpenAIBackend(LLMBackend):
    def __init__(self, model="gpt-3.5-turbo", api_key=None):
        import openai
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model

    def complete(self, messages, temperature=0.7, max_tokens=800, **kwargs):
        response = self.client.chat.completions.create(
            model=self.model, messages=messages,
            temperature=temperature, max_tokens=max_tokens
        )
        return response.choices[0].message.content


class AnthropicBackend(LLMBackend):
    def __init__(self, model="claude-3-haiku-20240307", api_key=None):
        import anthropic
        self.client = anthropic.Anthropic(api_key=api_key)
        self.model = model

    def complete(self, messages, max_tokens=800, **kwargs):
        # Anthropic takes the system prompt as a separate argument,
        # not as a message with role "system".
        system = next((m["content"] for m in messages if m["role"] == "system"), "")
        chat_msgs = [m for m in messages if m["role"] != "system"]
        params = {"model": self.model, "max_tokens": max_tokens, "messages": chat_msgs}
        if system:
            params["system"] = system
        response = self.client.messages.create(**params)
        return response.content[0].text


class OllamaBackend(LLMBackend):
    def __init__(self, model="llama3", base_url="http://localhost:11434"):
        import requests
        self.model = model
        self.base_url = base_url
        self.requests = requests

    def complete(self, messages, **kwargs):
        # Flatten the chat history into a single prompt for /api/generate.
        prompt = "\n".join([f"{m['role'].upper()}: {m['content']}" for m in messages])
        res = self.requests.post(
            f"{self.base_url}/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False},
            timeout=120
        )
        res.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        return res.json()["response"]


class MockBackend(LLMBackend):
    """For testing without API keys."""

    def complete(self, messages, **kwargs):
        last_user = next((m["content"] for m in reversed(messages)
                          if m["role"] == "user"), "")
        # A "[1]" marker means the chatbot injected numbered RAG context.
        if "[1]" in " ".join(m["content"] for m in messages):
            return ("Based on the provided context [1], the Q3 revenue was $4.2M, "
                    "representing a 23% increase year-over-year.")
        return f"I understand you asked: '{last_user[:50]}'. This is a mock response."

print("Available LLM backends:")
backends = {
    "OpenAI": "gpt-3.5-turbo, gpt-4-turbo: best quality, paid API",
    "Anthropic": "claude-3-haiku (fast), claude-3-sonnet (balanced): paid API",
    "Ollama": "llama3, mistral, phi3: free, runs locally, GPU recommended",
    "Mock": "for testing without API keys",
}
for name, desc in backends.items():
    print(f" {name:<12}: {desc}")
Layer 4: The Chatbot Engine
class Chatbot:
    """
    Complete chatbot combining memory, RAG, and LLM backends.
    Drop-in replacement: swap backend without changing anything else.
    """

    def __init__(self, llm: LLMBackend, knowledge_base: Optional[KnowledgeBase] = None,
                 system_prompt: str = "", max_turns: int = 10):
        self.llm = llm
        self.kb = knowledge_base
        self.memory = ConversationMemory(max_turns, system_prompt)
        self.stats = {"turns": 0, "retrievals": 0, "tokens_est": 0}

    def chat(self, user_message: str,
             use_rag: bool = True,
             verbose: bool = False) -> str:
        self.memory.add("user", user_message)
        self.stats["turns"] += 1

        # Step 1: retrieve supporting documents if the query looks specific.
        context_docs = []
        if use_rag and self.kb and self.kb.needs_retrieval(user_message):
            context_docs = self.kb.retrieve(user_message, top_k=3)
            self.stats["retrievals"] += 1
            if verbose and context_docs:
                print(f" [RAG] Retrieved {len(context_docs)} documents:")
                for doc in context_docs:
                    print(f" [{doc['score']:.3f}] {doc['source']}: {doc['text'][:50]}...")

        # Step 2: build the prompt. build_messages() returns fresh dicts, so the
        # injected context changes only this request; memory keeps the raw question.
        messages = self.memory.build_messages()
        if context_docs:
            context_text = "\n".join(
                f"[{i+1}] {doc['source']}: {doc['text']}"
                for i, doc in enumerate(context_docs)
            )
            messages[-1]["content"] = (
                f"Context from knowledge base:\n{context_text}\n\n"
                f"User question: {user_message}"
            )

        # Step 3: call the backend and store the answer with its sources.
        response = self.llm.complete(messages)
        self.memory.add("assistant", response,
                        metadata={"retrieved_sources": [d["source"] for d in context_docs]})
        return response

    def reset(self):
        self.memory.clear()
        print("Conversation cleared.")

    def get_stats(self):
        return self.stats

bot = Chatbot(
llm = MockBackend(),
knowledge_base = kb,
system_prompt = system_prompt,
max_turns = 10
)
print("\nChatbot conversation demo:")
print("=" * 60)
conversations = [
"What was our Q3 revenue growth?",
"That's great! What drove the subscription increase?",
"How much does the premium plan cost?",
"Can I get a refund on a digital product?",
"Who is the CEO of the company?",
]
for turn, message in enumerate(conversations, 1):
    print(f"\nUser [{turn}]: {message}")
    response = bot.chat(message, verbose=True)
    print(f"Bot: {response[:150]}...")
print(f"\nConversation stats: {bot.get_stats()}")
The Streamlit Interface
streamlit_app = '''# chatbot_app.py
import streamlit as st
from chatbot import Chatbot, KnowledgeBase, OpenAIBackend, AnthropicBackend

st.set_page_config(page_title="AI Assistant", layout="wide")

SYSTEM_PROMPT = """You are a helpful AI assistant.
Answer questions based on the provided context when available.
Always cite sources. Be concise and accurate."""

@st.cache_resource
def load_knowledge_base():
    # Cached once per server process and shared by all sessions.
    # Only the read-only knowledge base belongs here.
    kb = KnowledgeBase()
    kb.add_documents([
        {"text": "Q3 2024 revenue: $4.2M (+23% YoY).", "source": "Q3 Report"},
        {"text": "Premium plan: $49/month unlimited API.", "source": "Pricing"},
    ])
    return kb

# One Chatbot per browser session. Caching the Chatbot itself with
# st.cache_resource would share one conversation memory across all users.
if "bot" not in st.session_state:
    llm = OpenAIBackend(model="gpt-3.5-turbo", api_key=st.secrets["OPENAI_API_KEY"])
    st.session_state.bot = Chatbot(llm=llm, knowledge_base=load_knowledge_base(),
                                   system_prompt=SYSTEM_PROMPT)
bot = st.session_state.bot

if "messages" not in st.session_state:
    st.session_state.messages = []

st.title("AI Assistant")
st.markdown("Ask me anything about our company, products, or policies.")

col1, col2 = st.columns([4, 1])
with col2:
    if st.button("Clear conversation", type="secondary"):
        bot.reset()
        st.session_state.messages = []
        st.rerun()

# Replay the stored conversation on every rerun.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

if prompt := st.chat_input("Ask a question..."):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)
    with st.chat_message("assistant"):
        with st.spinner("Thinking..."):
            response = bot.chat(prompt, verbose=False)
        st.write(response)
    st.session_state.messages.append({"role": "assistant", "content": response})

with st.sidebar:
    st.header("Settings")
    st.metric("Conversation turns", bot.stats["turns"])
    st.metric("RAG retrievals", bot.stats["retrievals"])
    st.divider()
    st.caption("Powered by LLM + RAG")
'''
print("Streamlit chatbot app:")
print()
print("Save above code as chatbot_app.py")
print()
print("Run:")
print(" streamlit run chatbot_app.py")
print()
print("Deploy (free):")
print(" 1. Push code to GitHub")
print(" 2. Go to share.streamlit.io")
print(" 3. Connect your repo")
print(" 4. Add secrets (API keys) in Streamlit dashboard")
print(" 5. Deploy — live URL in 2 minutes")
Production Considerations
print("Checklist Before Shipping a Chatbot to Real Users:")
print()
checklist = {
"Safety & Guardrails": [
"Add content moderation (OpenAI Moderation API or nemo-guardrails)",
"Implement rate limiting per user",
"Block prompt injection attempts",
"Log conversations for review (with user consent)",
"Add a fallback when LLM fails",
],
"Quality": [
"Test with adversarial inputs ('ignore all previous instructions')",
"Validate RAG retrieval quality on sample queries",
"Set appropriate temperature (0.1-0.3 for factual, 0.7 for creative)",
"Add source citations to every RAG answer",
"Test the 'I don't know' case — does the bot refuse gracefully?",
],
"Performance": [
"Cache embeddings at startup, not per query",
"Use streaming responses for long answers (feels faster to users)",
"Use async if handling multiple users simultaneously",
"Monitor latency: retrieval + LLM combined should be <3 seconds",
],
"Infrastructure": [
"Store conversation history in a database (not in-memory)",
"Use environment variables for all API keys (never hardcode)",
"Set up error monitoring (Sentry or similar)",
"Add usage tracking to control costs",
],
}
for category, items in checklist.items():
    print(f" {category}:")
    for item in items:
        print(f" □ {item}")
    print()
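The checklist item "Add a fallback when LLM fails" can be sketched as a small wrapper that tries several backends in order. This is an illustration, not a full retry policy: `complete_with_fallback`, `flaky_backend`, `echo_backend`, and `DEFAULT_REPLY` are hypothetical names introduced here, and each backend is modeled as a plain callable that takes the messages list (for example, the `complete` method of one of the backend classes).

```python
import logging
from typing import Callable, Dict, List, Sequence

logger = logging.getLogger("chatbot.fallback")

# Assumption: a backend is any callable that maps a messages list to reply text.
Completer = Callable[[List[Dict]], str]

DEFAULT_REPLY = "Sorry, I can't answer right now. Please try again in a moment."


def complete_with_fallback(backends: Sequence[Completer],
                           messages: List[Dict],
                           default: str = DEFAULT_REPLY) -> str:
    """Try each backend in order; return a canned reply if all of them fail."""
    for complete in backends:
        try:
            return complete(messages)
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            logger.warning("Backend %r failed: %s", complete, exc)
    return default


def flaky_backend(messages: List[Dict]) -> str:
    """Stand-in for a backend that is currently down."""
    raise TimeoutError("simulated outage")


def echo_backend(messages: List[Dict]) -> str:
    """Stand-in for a working backend."""
    return f"echo: {messages[-1]['content']}"


msgs = [{"role": "user", "content": "hello"}]
print(complete_with_fallback([flaky_backend, echo_backend], msgs))  # falls through to echo
print(complete_with_fallback([flaky_backend], msgs))                # canned reply
```

In the chatbot from this post, the list could hold something like the primary API backend's `complete` followed by a local one, so an outage degrades quality instead of breaking the conversation.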
Reference Links and Cheat Sheets
print("Essential Reference Links for This Post:")
print()
references = {
"Official Documentation": [
("OpenAI Chat API", "platform.openai.com/docs/guides/chat"),
("Anthropic Messages API", "docs.anthropic.com/en/api/messages"),
("Streamlit Chat Elements", "docs.streamlit.io/develop/api-reference/chat"),
("LangChain Conversation", "python.langchain.com/docs/modules/memory"),
("Ollama API reference", "github.com/ollama/ollama/blob/main/docs/api.md"),
],
"Cheat Sheets": [
("OpenAI Python SDK cheatsheet", "github.com/openai/openai-python"),
("Streamlit component cheatsheet", "docs.streamlit.io/develop/quick-reference/cheat-sheet"),
("Sentence Transformers guide", "sbert.net/docs/quickstart.html"),
("ChromaDB quickstart", "docs.trychroma.com/getting-started"),
],
"Tutorials Worth Reading": [
("Build a RAG chatbot (LangChain)", "python.langchain.com/docs/use_cases/question_answering"),
("Chat with your data (DeepLearning.AI)", "learn.deeplearning.ai/langchain-chat-with-your-data"),
("Build with Claude (Anthropic)", "docs.anthropic.com/en/docs/build-with-claude"),
("OpenAI Cookbook (examples)", "cookbook.openai.com"),
],
"Production Frameworks": [
("LangChain (agent orchestration)", "python.langchain.com"),
("LlamaIndex (data framework)", "docs.llamaindex.ai"),
("Haystack (RAG pipelines)", "haystack.deepset.ai/tutorials"),
("Semantic Kernel (Microsoft)", "learn.microsoft.com/semantic-kernel"),
],
"Vector Databases": [
("ChromaDB (simple, open source)", "docs.trychroma.com"),
("Pinecone (managed, production)", "docs.pinecone.io"),
("Qdrant (open source + cloud)", "qdrant.tech/documentation"),
("Weaviate (open source)", "weaviate.io/developers/weaviate"),
("FAISS (local, Facebook)", "github.com/facebookresearch/faiss/wiki"),
],
}
for category, links in references.items():
    print(f" {category}:")
    for name, url in links:
        print(f" • {name:<40} {url}")
    print()
A Resource Worth Reading
Fullstack Deep Learning's "LLM Bootcamp" at fullstackdeeplearning.com covers building production LLM applications end to end, including evaluation, deployment, and monitoring. The 2023 lectures are freely available as YouTube videos, and they cover many of the production considerations in this post in much greater depth. Search "Fullstack Deep Learning LLM Bootcamp 2023."
DeepLearning.AI's short course "Building Systems with the ChatGPT API" by Isa Fulford and Andrew Ng is free on their platform and covers exactly this content: multi-turn conversations, RAG, evaluation, and safety. 2-3 hours. Search "DeepLearning.AI building systems ChatGPT API."
Simon Willison's blog at simonwillison.net covers LLM applications from a pragmatic, skeptical practitioner's perspective. His posts on prompt injection, evaluation, and what LLMs cannot do are essential counterbalance to the hype. Search "Simon Willison LLM blog."
Try This
Create a folder called my_chatbot/ with three files.
knowledge_base.py: Load at least 30 real documents (company docs, Wikipedia articles, any domain you know). Chunk them properly. Build the vector store. Test retrieval with 10 queries. Measure and print recall metrics.
chatbot.py: Implement the Chatbot class with at least two interchangeable backends (one API-based, one local via Ollama). Test multi-turn conversation: ask a question, follow up referencing the previous answer, verify context is maintained.
app.py: Build the Streamlit interface. Add the sidebar with stats. Add a "clear conversation" button. Handle loading states properly.
Deploy to Streamlit Community Cloud (free). Share the URL.
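For the "recall metrics" step in knowledge_base.py, one simple measure is recall@k: for each test query, the fraction of its known-relevant documents that show up in the top k results, averaged over queries. The sketch below assumes you have hand-labeled which document IDs are relevant for each query. `recall_at_k` and the sample IDs are hypothetical and only show the calculation.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 3) -> float:
    """Mean fraction of each query's relevant docs found in its top-k results."""
    scores = []
    for query, gold in relevant.items():
        if not gold:
            continue  # skip queries with no labeled relevant documents
        top_k = set(retrieved.get(query, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Ranked document IDs returned by the retriever for each test query (made-up data).
retrieved = {
    "q3 revenue": ["finance", "pricing", "overview"],
    "refund": ["pricing", "support", "policy"],
    "ceo": ["support", "pricing", "finance"],
}
# Hand-labeled relevant document IDs for the same queries (made-up data).
relevant = {
    "q3 revenue": {"finance"},
    "refund": {"policy", "support"},
    "ceo": {"overview"},
}

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(retrieved, relevant, k):.2f}")
```

To use it with the `KnowledgeBase` from this post, you would give each document an ID field and collect the IDs from `retrieve()` for every test query.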
What's Next
You have a working chatbot. The next two posts cover the API clients you will use to power it: the OpenAI API with all its features (streaming, function calling, embeddings) and the Anthropic Claude API with its unique capabilities. Then the Phase 8 capstone project.