Welcome back to our climate chatbot series! In Lesson 1, we built a basic chatbot with Gemini AI. In this tutorial, we're improving it with a real knowledge base by implementing a simplified version of a technique called Retrieval-Augmented Generation (RAG). Your chatbot will now learn from actual climate reports, sustainability articles, and climate research papers!
By the end of this tutorial, your chatbot won't just rely on Gemini's training data: it will have its own knowledge base that you control.
What We're Building Today
New Features:
PDF Document Loading: Upload and process climate PDFs (IPCC reports, research papers)
Web Article Integration: Automatically fetch content from climate websites
Smart Document Chunking: Break large documents into manageable pieces
Source Citation: Bot cites specific documents when answering
Knowledge Base Display: Show loaded documents in sidebar
Getting Started
Prerequisites
- Completed Day 1 tutorial (basic chatbot working)
- Same setup: Python, Streamlit, Gemini API key
- 45-60 minutes to implement
Installing Our New Dependencies
Before we begin coding, let's understand what each new library does and why we need it:
# Install additional packages for document processing
pip install pypdf langchain-text-splitters requests beautifulsoup4
**Library Breakdown:**
- pypdf: extracts text content from PDF files.
- langchain-text-splitters: breaks large documents into smaller, manageable pieces.
- requests: downloads web pages and handles HTTP requests.
- beautifulsoup4: parses HTML and extracts clean text from web pages.
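If you want to sanity-check these installs before building anything, the short script below exercises each library once. It's a minimal sketch, not part of the chatbot: sample.pdf is a placeholder for any text-based PDF you have locally, and the NASA URL is just an example page.
# check_libraries.py - optional smoke test for the new dependencies
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# pypdf: read a local PDF and count its pages (replace sample.pdf with any text-based PDF)
reader = PdfReader("sample.pdf")
print("PDF pages:", len(reader.pages))

# requests + beautifulsoup4: download a page and reduce it to plain text
response = requests.get("https://climate.nasa.gov/what-is-climate-change/", timeout=10)
soup = BeautifulSoup(response.content, "html.parser")
page_text = soup.get_text()

# langchain-text-splitters: cut the text into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(page_text)
print("Chunks created:", len(chunks))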
Setting Up Your Project:
# Create the lesson folder
mkdir lesson2-knowledge-base
cd lesson2-knowledge-base
Setting Up Your Project Files
Open your project folder in VS Code (or your preferred code editor) and create these files:
- app.py
- knowledge_base.py
- requirements.txt
Your folder structure should look like this:
climate-chatbot-series/
├── lesson1-basic-chatbot/
│   ├── app.py
│   └── requirements.txt
└── lesson2-knowledge-base/        # ← This lesson
    ├── app.py                     # Main application
    ├── knowledge_base.py          # Document processing system
    └── requirements.txt           # Updated dependencies
Step 1: Document Processing
Before we can build a chatbot that understands research papers or articles, we need a way to process documents into smaller, searchable pieces.
We will build this logic inside our knowledge_base.py file. It will handle loading PDFs, fetching online articles, splitting text into chunks, and keeping track of all sources.
Here's the complete code you can copy into knowledge_base.py:
# knowledge_base.py
import os
import requests
from pypdf import PdfReader
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter
import streamlit as st
from typing import List, Dict
import hashlib
class ClimateKnowledgeBase:
    def __init__(self):
        self.documents = []
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
        )

    def load_pdf(self, file_path: str, source_name: str = None) -> List[Dict]:
        """Load and process a PDF file"""
        try:
            if source_name is None:
                source_name = os.path.basename(file_path)

            # Read PDF
            reader = PdfReader(file_path)
            full_text = ""
            for page in reader.pages:
                full_text += page.extract_text() + "\n"

            # Split into chunks
            chunks = self.text_splitter.split_text(full_text)

            # Create document objects
            documents = []
            for i, chunk in enumerate(chunks):
                doc = {
                    "content": chunk,
                    "source": source_name,
                    "source_type": "PDF",
                    "chunk_id": f"{source_name}_chunk_{i}",
                    "page_count": len(reader.pages)
                }
                documents.append(doc)

            self.documents.extend(documents)
            return documents

        except Exception as e:
            st.error(f"Error loading PDF {file_path}: {e}")
            return []
    def load_web_article(self, url: str, source_name: str = None) -> List[Dict]:
        """Load and process a web article"""
        try:
            if source_name is None:
                source_name = url.split("//")[1].split("/")[0]  # Extract domain

            # Fetch webpage
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()

            # Parse HTML
            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove unwanted elements
            for element in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
                element.decompose()

            # Extract text content
            text = soup.get_text()

            # Clean up text
            lines = [line.strip() for line in text.splitlines()]
            text = ' '.join([line for line in lines if line])

            # Split into chunks
            chunks = self.text_splitter.split_text(text)

            # Create document objects
            documents = []
            for i, chunk in enumerate(chunks):
                doc = {
                    "content": chunk,
                    "source": source_name,
                    "source_type": "Web Article",
                    "url": url,
                    "chunk_id": f"{source_name}_chunk_{i}"
                }
                documents.append(doc)

            self.documents.extend(documents)
            return documents

        except Exception as e:
            st.error(f"Error loading web article {url}: {e}")
            return []
    def search_documents(self, query: str, max_results: int = 3) -> List[Dict]:
        """Simple keyword search through documents"""
        query_lower = query.lower()
        scored_docs = []

        for doc in self.documents:
            content_lower = doc["content"].lower()

            # Simple scoring based on keyword matches
            score = 0
            query_words = query_lower.split()
            for word in query_words:
                if len(word) > 3:  # Only count longer words
                    score += content_lower.count(word)

            if score > 0:
                doc_copy = doc.copy()
                doc_copy["relevance_score"] = score
                scored_docs.append(doc_copy)

        # Sort by relevance and return top results
        scored_docs.sort(key=lambda x: x["relevance_score"], reverse=True)
        return scored_docs[:max_results]

    def get_document_stats(self) -> Dict:
        """Get statistics about loaded documents"""
        if not self.documents:
            return {"total_chunks": 0, "sources": 0, "types": []}

        sources = set(doc["source"] for doc in self.documents)
        types = set(doc["source_type"] for doc in self.documents)

        return {
            "total_chunks": len(self.documents),
            "sources": len(sources),
            "source_list": list(sources),
            "types": list(types)
        }


# Predefined climate knowledge sources
CLIMATE_SOURCES = {
    "IPCC AR6 Summary": "https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC_AR6_WGI_SPM.pdf",
    "NASA Climate Change": "https://climate.nasa.gov/what-is-climate-change/",
    "EPA Climate Indicators": "https://www.epa.gov/climate-indicators",
    "NOAA Climate Science": "https://www.climate.gov/news-features/understanding-climate/climate-change-snow-and-ice",
    "IEA Energy Transition": "https://www.iea.org/topics/energy-transitions"
}
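Before wiring this class into Streamlit, you can try it from a plain Python script. The sketch below assumes you have a text-based climate PDF saved at documents/sample_report.pdf (the path and source name are placeholders); note that the error paths call st.error, so this quick check only covers the happy path.
# try_knowledge_base.py - optional standalone check of the class above
from knowledge_base import ClimateKnowledgeBase

kb = ClimateKnowledgeBase()

# Load and chunk a local PDF (placeholder path - use any text-based PDF you have)
kb.load_pdf("documents/sample_report.pdf", "Sample Climate Report")

# Keyword search over the stored chunks
for doc in kb.search_documents("emissions", max_results=3):
    print(doc["source"], "| score:", doc["relevance_score"], "|", doc["content"][:80])

# Summary of what's loaded
print(kb.get_document_stats())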
Step 2: Build the Main Application
Now that we have our knowledge_base.py for processing and storing documents, the next step is to create the main application file app.py.
This file will act as the entry point of our chatbot. It will import the knowledge base we built earlier, provide a simple Streamlit interface so users can upload PDFs or paste article links, display stats about the loaded knowledge base, and run a basic keyword search over the loaded documents.
Here's the complete code for app.py:
# app.py - Enhanced with Document Loading
import os
from dotenv import load_dotenv
load_dotenv()
import streamlit as st
import google.generativeai as genai
from knowledge_base import ClimateKnowledgeBase, CLIMATE_SOURCES
# Configure the Gemini API
API_KEY = os.getenv("GEMINI_API_KEY")
genai.configure(api_key=API_KEY)
# Configure Streamlit page
st.set_page_config(
    page_title="Climate Helper v3 - Knowledge Enhanced",
    layout="wide"
)

# Initialize knowledge base in session state
if "knowledge_base" not in st.session_state:
    st.session_state.knowledge_base = ClimateKnowledgeBase()

if "messages" not in st.session_state:
    st.session_state.messages = [
        {
            "role": "assistant",
            "content": "Hi! I'm your enhanced climate helper with access to real climate documents and research! Upload PDFs or let me fetch climate articles to give you more accurate, source-backed answers. What would you like to learn about?"
        }
    ]
def create_knowledge_enhanced_prompt(user_question: str, relevant_docs: list) -> tuple:
    """Create a prompt that includes relevant document context"""
    context = ""
    sources_used = []

    if relevant_docs:
        context = "\n\nRELEVANT CLIMATE KNOWLEDGE:\n"
        for i, doc in enumerate(relevant_docs, 1):
            context += f"\nSource {i} ({doc['source']}):\n{doc['content']}\n"
            sources_used.append(doc['source'])

    prompt = f"""You are a helpful climate and sustainability expert. Answer the user's question using your knowledge and the provided climate documents.

USER QUESTION: {user_question}
{context}

Please provide a helpful, accurate response that:
1. Answers the user's question thoroughly
2. References the provided sources when relevant
3. Cites specific source names when using information from documents
4. Is educational and encouraging
5. Suggests related topics they might want to explore

If using information from the provided sources, please mention them like: "According to [Source Name]..." or "The [Document Name] indicates..."
"""
    return prompt, sources_used
def display_knowledge_base_sidebar():
    """Display knowledge base management in sidebar"""
    with st.sidebar:
        st.markdown("## Knowledge Base")

        # Show current stats
        stats = st.session_state.knowledge_base.get_document_stats()
        if stats["total_chunks"] > 0:
            st.success(f"Loaded: {stats['total_chunks']} chunks from {stats['sources']} sources")

            with st.expander("Loaded Sources"):
                for source in stats["source_list"]:
                    st.write(f"• {source}")
        else:
            st.info("No documents loaded yet. Add some below!")

        st.markdown("### Upload PDF Documents")
        uploaded_file = st.file_uploader(
            "Upload climate reports, research papers, etc.",
            type=['pdf'],
            help="Upload IPCC reports, climate research papers, or sustainability documents"
        )

        if uploaded_file is not None:
            # Save uploaded file
            os.makedirs("documents", exist_ok=True)
            file_path = os.path.join("documents", uploaded_file.name)
            with open(file_path, "wb") as f:
                f.write(uploaded_file.getvalue())

            # Process the PDF
            with st.spinner(f"Processing {uploaded_file.name}..."):
                documents = st.session_state.knowledge_base.load_pdf(file_path, uploaded_file.name)
            if documents:
                st.success(f"Loaded {len(documents)} chunks from {uploaded_file.name}")
                st.rerun()

        st.markdown("### Load Web Articles")

        # Predefined sources
        st.markdown("#### Quick Add Climate Sources:")
        selected_source = st.selectbox(
            "Choose a climate source",
            [""] + list(CLIMATE_SOURCES.keys())
        )

        if st.button("Load Selected Source") and selected_source:
            with st.spinner(f"Loading {selected_source}..."):
                url = CLIMATE_SOURCES[selected_source]
                documents = st.session_state.knowledge_base.load_web_article(url, selected_source)
            if documents:
                st.success(f"Loaded {len(documents)} chunks from {selected_source}")
                st.rerun()

        # Custom URL input
        st.markdown("#### Add Custom Article:")
        custom_url = st.text_input("Enter article URL")
        custom_name = st.text_input("Source name (optional)")

        if st.button("Load Custom Article") and custom_url:
            with st.spinner("Loading article..."):
                documents = st.session_state.knowledge_base.load_web_article(
                    custom_url,
                    custom_name if custom_name else None
                )
            if documents:
                st.success(f"Loaded {len(documents)} chunks")
                st.rerun()

        # Clear knowledge base
        if st.button("Clear Knowledge Base"):
            st.session_state.knowledge_base = ClimateKnowledgeBase()
            st.success("Knowledge base cleared!")
            st.rerun()
def friendly_wrap_with_sources(raw_text: str, sources_used: list) -> str:
    """Enhanced friendly wrapper that includes source information"""
    sources_section = ""
    if sources_used:
        unique_sources = list(set(sources_used))
        sources_section = "\n\n**Sources used in this response:**\n"
        for source in unique_sources:
            sources_section += f"• {source}\n"

    return f"Great question!\n\n{raw_text.strip()}{sources_section}\n\nWould you like me to elaborate on any part of this, or do you have other climate questions?"
def display_messages():
    """Display all messages in the chat"""
    for msg in st.session_state.messages:
        author = "user" if msg["role"] == "user" else "assistant"
        with st.chat_message(author):
            st.write(msg["content"])
# Create main layout
col1, col2 = st.columns([2, 1])

with col1:
    st.title("Climate Helper v3.0")
    st.subheader("Enhanced with Real Climate Knowledge!")

    # Display messages
    display_messages()

    # Handle new user input
    prompt = st.chat_input("Ask me about climate topics (I can now reference documents!)...")

    if prompt:
        # Add user message to history
        st.session_state.messages.append({"role": "user", "content": prompt})

        # Show user message
        with st.chat_message("user"):
            st.write(prompt)

        # Show thinking indicator
        with st.chat_message("assistant"):
            placeholder = st.empty()

            # Search for relevant documents
            with st.spinner("Searching knowledge base..."):
                relevant_docs = st.session_state.knowledge_base.search_documents(prompt, max_results=3)

            if relevant_docs:
                placeholder.write("Found relevant documents, generating response...")
            else:
                placeholder.write("Thinking... (no specific documents found, using general knowledge)")

            try:
                # Create enhanced prompt with document context
                enhanced_prompt, sources_used = create_knowledge_enhanced_prompt(prompt, relevant_docs)

                # Generate response with Gemini
                model = genai.GenerativeModel('gemini-1.5-flash')
                response = model.generate_content(enhanced_prompt)
                answer = response.text
                friendly_answer = friendly_wrap_with_sources(answer, sources_used)
            except Exception as e:
                friendly_answer = f"I'm sorry, I encountered an error: {e}. Please try asking your question again."

            # Replace placeholder with actual response
            placeholder.write(friendly_answer)

        # Add assistant response to history
        st.session_state.messages.append({"role": "assistant", "content": friendly_answer})

        # Show relevant document chunks if found
        if relevant_docs:
            with st.expander(f"View {len(relevant_docs)} relevant document excerpts"):
                for i, doc in enumerate(relevant_docs, 1):
                    st.markdown(f"**Source {i}: {doc['source']}** (Score: {doc['relevance_score']})")
                    st.markdown(f"```\n{doc['content'][:300]}...\n```")
                    st.markdown("---")

with col2:
    display_knowledge_base_sidebar()

# Footer
st.markdown("---")
stats = st.session_state.knowledge_base.get_document_stats()
st.markdown(f"**Knowledge Enhanced**: {stats['total_chunks']} document chunks loaded from {stats['sources']} sources")
Step 3: Updated Requirements
Update your requirements.txt for this lesson:
# Core Streamlit and AI
streamlit==1.28.0
python-dotenv==1.0.0
google-generativeai==0.3.0
# Document Processing Dependencies
pypdf==3.17.0 # PDF text extraction
langchain-text-splitters==0.2.0 # Intelligent text chunking
requests==2.31.0 # Web content fetching
beautifulsoup4==4.12.0 # HTML parsing and cleaning
Step 4: Testing Your Enhanced Chatbot
Installation and Setup:
# Navigate to lesson folder
cd lesson2-knowledge-base
# Install dependencies
pip install -r requirements.txt
# Run the enhanced chatbot
streamlit run app.py
Testing Workflow:
- Start with Web Sources: Choose "NASA Climate Change" from the dropdown, then click "Load Selected Source"
- Upload a PDF: Find any climate PDF online and upload it
- Ask Specific Questions:
  - "What does NASA say about greenhouse gases?"
  - "According to the uploaded report, what are the main climate impacts?"
- Check Citations: Verify responses include source names
- View Document Excerpts: Expand the document sections to see original text
Expected Behavior:
- Responses include phrases like "According to NASA Climate Change..."
- Sidebar shows loaded document statistics
- Document excerpts appear below responses
- Knowledge persists during the chat session
Understanding the Architecture
The RAG Pipeline:
User Question → Document Search → Relevant Chunks → Enhanced Prompt → AI Response with Citations (a condensed code sketch of this loop follows the component list below)
Key Components:
- Document Ingestion: PDF + Web → Clean Text
- Text Chunking: Large Text → Manageable Pieces
- Storage: In-Memory Document Store
- Retrieval: Keyword-Based Search
- Generation: AI with Context + Sources
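Condensed into one script, the loop looks roughly like this. It is a sketch, not new functionality: it reuses knowledge_base.py, expects GEMINI_API_KEY in your .env file, and inlines a simplified version of the prompt construction that app.py handles with create_knowledge_enhanced_prompt.
# rag_pipeline.py - condensed sketch of the loop app.py runs for each question
import os
import google.generativeai as genai
from dotenv import load_dotenv
from knowledge_base import ClimateKnowledgeBase, CLIMATE_SOURCES

load_dotenv()
genai.configure(api_key=os.getenv("GEMINI_API_KEY"))

# 1-3. Ingestion, chunking, storage: done once when a document is loaded
kb = ClimateKnowledgeBase()
kb.load_web_article(CLIMATE_SOURCES["NASA Climate Change"], "NASA Climate Change")

# 4. Retrieval: keyword search over the in-memory store
question = "What does NASA say about greenhouse gases?"
relevant_docs = kb.search_documents(question, max_results=3)

# 5. Generation: context-rich prompt answered by Gemini
#    (app.py builds this with create_knowledge_enhanced_prompt; simplified here)
context = "\n\n".join(f"Source ({d['source']}):\n{d['content']}" for d in relevant_docs)
prompt = (
    "You are a helpful climate and sustainability expert. "
    "Cite the source names when you use them.\n\n"
    f"USER QUESTION: {question}\n\nRELEVANT CLIMATE KNOWLEDGE:\n{context}"
)
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.generate_content(prompt).text)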
Why This Approach Works:
- Scalable: Add as many documents as you need
- Transparent: Users see which sources inform responses
- Accurate: The AI can draw on current, domain-specific information
- Educational: Promotes learning from authoritative sources
Troubleshooting
Common Issues and Solutions:
PDF not loading?
- Ensure PDF contains text (not scanned images)
- Try smaller files first (< 10MB)
- Check that pypdf is installed correctly: pip show pypdf
Web articles not loading?
- Some sites block automated requests
- Try different URLs (government sites usually work well)
- Check internet connection
"No relevant documents found"
- Use specific keywords from your uploaded documents
- Verify documents actually loaded (check sidebar stats)
- Try broader search terms
Slow performance?
- Large documents take time to process
- Consider reducing the chunk size in knowledge_base.py (see the snippet below)
- Limit simultaneous document loading
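One practical knob is the splitter configuration in ClimateKnowledgeBase.__init__. Smaller chunks are faster to create and search but carry less context each; the values below are only a suggested starting point, not tuned numbers.
# knowledge_base.py - inside ClimateKnowledgeBase.__init__
# Smaller chunks = faster processing and search, less context per chunk
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # down from 1000
    chunk_overlap=100,   # down from 200
    length_function=len,
)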
What's Next?
We've successfully built a knowledge-enhanced chatbot that can learn from real climate documents and cite its sources! This is the foundation of professional RAG systems.
Future Enhancement Ideas:
- Document Categories: Organize by topic (mitigation, adaptation, science)
- Automatic Updates: Scheduled fetching of new reports
- Quality Scoring: Rate source reliability
- Multi-language Support: Process documents in different languages
Get the Complete Code
GitHub Repository: https://github.com/your-username/climate-chatbot-series
- Includes sample climate documents for testing
Questions? Connect with me on LinkedIn - I'd love to see what climate knowledge you add to your bot!
Conclusion
You've transformed your basic chatbot into a specialized climate knowledge system that can:
- Process PDF reports and web articles
- Search through loaded content
- Provide source-backed responses
- Show transparent document usage
This is RAG (Retrieval-Augmented Generation) in action, in its most basic form! Your chatbot now learns from authoritative climate sources and can provide more accurate, up-to-date information than general AI models alone.
In the next lesson, we'll add semantic search and advanced retrieval techniques. For now, try adding your own climate documents to the knowledge base and see how the chatbot responds!