Timmy Dahunsi
Teach Your Free AI Chatbot with Reports & Web Data (RAG Basics)

Welcome back to our climate chatbot series! In Lesson 1, we built a basic chatbot with Gemini AI. In this tutorial, we're giving it a real knowledge base using a simplified version of a technique called Retrieval-Augmented Generation (RAG). Your chatbot will now learn from actual climate reports, sustainability articles, and climate research papers!

By the end of this tutorial, your chatbot won't just rely on Gemini's training data; it will have its own knowledge base that you control.

What We're Building Today

New Features:

PDF Document Loading: Upload and process climate PDFs (IPCC reports, research papers)

Web Article Integration: Automatically fetch content from climate websites

Smart Document Chunking: Break large documents into manageable pieces

Source Citation: Bot cites specific documents when answering

Knowledge Base Display: Show loaded documents in sidebar

Getting Started

Prerequisites

  • Completed Day 1 tutorial (basic chatbot working)
  • Same setup: Python, Streamlit, Gemini API key
  • 45-60 minutes to implement

Installing Our New Dependencies

Before we begin coding, let's understand what each new library does and why we need it:

# Install additional packages for document processing
pip install pypdf langchain-text-splitters requests beautifulsoup4

**Library Breakdown:**

  1. pypdf - It extracts text content from PDF files.

  2. langchain-text-splitters - It breaks large documents into smaller, manageable pieces.

  3. requests - It downloads web pages and handles HTTP requests.

  4. beautifulsoup4 - It parses HTML and extracts clean text from web pages.
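If you'd like to see each library in isolation before we wire them together, here's a minimal standalone sketch; the PDF file name and the URL are placeholders, so swap in any text-based PDF and climate article you like:

# quick_tour.py - tiny standalone demo of the four libraries (file name and URL are placeholders)
import requests
from bs4 import BeautifulSoup
from pypdf import PdfReader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# pypdf: pull the raw text out of a PDF
reader = PdfReader("sample.pdf")  # any text-based (not scanned) PDF
pdf_text = "".join(page.extract_text() or "" for page in reader.pages)

# requests + beautifulsoup4: download a page and reduce it to plain text
html = requests.get("https://climate.nasa.gov/what-is-climate-change/", timeout=10).text
page_text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)

# langchain-text-splitters: cut long text into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(pdf_text + "\n" + page_text)
print(f"Produced {len(chunks)} chunks")
if chunks:
    print("First chunk preview:", chunks[0][:80])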

Setting Up Your Project:

# Create the lesson folder
mkdir lesson2-knowledge-base
cd lesson2-knowledge-base

Setting Up Your Project Files

Open your project folder in VS Code (or your preferred code editor) and create these files:

  • app.py
  • knowledge_base.py
  • requirements.txt

Your folder structure should look like this:

climate-chatbot-series/
├── lesson1-basic-chatbot/
│   ├── app.py
│   └── requirements.txt
├── lesson2-knowledge-base/          # ← This lesson
│   ├── app.py                       # Main application
│   ├── knowledge_base.py            # Document processing system
│   └── requirements.txt             # Updated dependencies

Step 1: Document Processing

Before we can build a chatbot that understands research papers or articles, we need a way to process documents into smaller, searchable pieces.
We'll build this logic inside knowledge_base.py; it will handle loading PDFs, fetching online articles, splitting text into chunks, and keeping track of all sources.

Here’s the complete code you can copy into knowledge_base.py:

# knowledge_base.py
import os
import requests
from pypdf import PdfReader
from bs4 import BeautifulSoup
from langchain_text_splitters import RecursiveCharacterTextSplitter
import streamlit as st
from typing import List, Dict
import hashlib

class ClimateKnowledgeBase:
    def __init__(self):
        self.documents = []
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
        )

    def load_pdf(self, file_path: str, source_name: str = None) -> List[Dict]:
        """Load and process a PDF file"""
        try:
            if source_name is None:
                source_name = os.path.basename(file_path)

            # Read PDF
            reader = PdfReader(file_path)
            full_text = ""

            for page in reader.pages:
                full_text += page.extract_text() + "\n"

            # Split into chunks
            chunks = self.text_splitter.split_text(full_text)

            # Create document objects
            documents = []
            for i, chunk in enumerate(chunks):
                doc = {
                    "content": chunk,
                    "source": source_name,
                    "source_type": "PDF",
                    "chunk_id": f"{source_name}_chunk_{i}",
                    "page_count": len(reader.pages)
                }
                documents.append(doc)

            self.documents.extend(documents)
            return documents

        except Exception as e:
            st.error(f"Error loading PDF {file_path}: {e}")
            return []

    def load_web_article(self, url: str, source_name: str = None) -> List[Dict]:
        """Load and process a web article"""
        try:
            if source_name is None:
                source_name = url.split("//")[1].split("/")[0]  # Extract domain

            # Fetch webpage
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            }
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()

            # Parse HTML
            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove unwanted elements
            for element in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
                element.decompose()

            # Extract text content
            text = soup.get_text()

            # Clean up text
            lines = [line.strip() for line in text.splitlines()]
            text = ' '.join([line for line in lines if line])

            # Split into chunks
            chunks = self.text_splitter.split_text(text)

            # Create document objects
            documents = []
            for i, chunk in enumerate(chunks):
                doc = {
                    "content": chunk,
                    "source": source_name,
                    "source_type": "Web Article",
                    "url": url,
                    "chunk_id": f"{source_name}_chunk_{i}"
                }
                documents.append(doc)

            self.documents.extend(documents)
            return documents

        except Exception as e:
            st.error(f"Error loading web article {url}: {e}")
            return []

    def search_documents(self, query: str, max_results: int = 3) -> List[Dict]:
        """Simple keyword search through documents"""
        query_lower = query.lower()
        scored_docs = []

        for doc in self.documents:
            content_lower = doc["content"].lower()

            # Simple scoring based on keyword matches
            score = 0
            query_words = query_lower.split()

            for word in query_words:
                if len(word) > 3:  # Only count longer words
                    score += content_lower.count(word)

            if score > 0:
                doc_copy = doc.copy()
                doc_copy["relevance_score"] = score
                scored_docs.append(doc_copy)

        # Sort by relevance and return top results
        scored_docs.sort(key=lambda x: x["relevance_score"], reverse=True)
        return scored_docs[:max_results]

    def get_document_stats(self) -> Dict:
        """Get statistics about loaded documents"""
        if not self.documents:
            return {"total_chunks": 0, "sources": 0, "types": []}

        sources = set(doc["source"] for doc in self.documents)
        types = set(doc["source_type"] for doc in self.documents)

        return {
            "total_chunks": len(self.documents),
            "sources": len(sources),
            "source_list": list(sources),
            "types": list(types)
        }

# Predefined climate knowledge sources
CLIMATE_SOURCES = {
    "IPCC AR6 Summary": "https://www.ipcc.ch/report/ar6/wg1/downloads/report/IPCC_AR6_WGI_SPM.pdf",
    "NASA Climate Change": "https://climate.nasa.gov/what-is-climate-change/",
    "EPA Climate Indicators": "https://www.epa.gov/climate-indicators",
    "NOAA Climate Science": "https://www.climate.gov/news-features/understanding-climate/climate-change-snow-and-ice",
    "IEA Energy Transition": "https://www.iea.org/topics/energy-transitions"
}
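Before wiring this into Streamlit, you can sanity-check the class from a plain Python script. Here's a minimal sketch that reuses one of the predefined sources (running it outside Streamlit only prints a harmless warning if an error handler fires):

# test_kb.py - quick sanity check for ClimateKnowledgeBase (run with: python test_kb.py)
from knowledge_base import ClimateKnowledgeBase, CLIMATE_SOURCES

kb = ClimateKnowledgeBase()

# Load one predefined web source and confirm it was chunked
docs = kb.load_web_article(CLIMATE_SOURCES["NASA Climate Change"], "NASA Climate Change")
print(f"Loaded {len(docs)} chunks")

# Run the keyword search and peek at the best matches
for doc in kb.search_documents("greenhouse gases", max_results=2):
    print(doc["source"], doc["relevance_score"], doc["content"][:100])

print(kb.get_document_stats())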

Step 2: Build the Main Application

Now that we have our knowledge_base.py for processing and storing documents, the next step is to create the main application file app.py.

This file will act as the entry point of our chatbot. It will import the knowledge base we built earlier, provide a simple Streamlit interface so users can upload PDFs or paste article links, display stats about the loaded knowledge base, and run basic keyword search over the documents.

Here’s the complete code for app.py:

# app.py - Enhanced with Document Loading
import os
from dotenv import load_dotenv
load_dotenv()

import streamlit as st
import google.generativeai as genai
from knowledge_base import ClimateKnowledgeBase, CLIMATE_SOURCES

# Configure the Gemini API
API_KEY = os.getenv("GEMINI_API_KEY")
genai.configure(api_key=API_KEY)

# Configure Streamlit page
st.set_page_config(
    page_title="Climate Helper v3 - Knowledge Enhanced", 
    layout="wide"
)

# Initialize knowledge base in session state
if "knowledge_base" not in st.session_state:
    st.session_state.knowledge_base = ClimateKnowledgeBase()

if "messages" not in st.session_state:
    st.session_state.messages = [
        {
            "role": "assistant", 
            "content": "Hi! I'm your enhanced climate helper with access to real climate documents and research! Upload PDFs or let me fetch climate articles to give you more accurate, source-backed answers. What would you like to learn about?"
        }
    ]

def create_knowledge_enhanced_prompt(user_question: str, relevant_docs: list) -> tuple:
    """Create a prompt that includes relevant document context"""

    context = ""
    sources_used = []

    if relevant_docs:
        context = "\n\nRELEVANT CLIMATE KNOWLEDGE:\n"
        for i, doc in enumerate(relevant_docs, 1):
            context += f"\nSource {i} ({doc['source']}):\n{doc['content']}\n"
            sources_used.append(doc['source'])

    prompt = f"""You are a helpful climate and sustainability expert. Answer the user's question using your knowledge and the provided climate documents.

USER QUESTION: {user_question}
{context}

Please provide a helpful, accurate response that:
1. Answers the user's question thoroughly
2. References the provided sources when relevant
3. Cites specific source names when using information from documents
4. Is educational and encouraging
5. Suggests related topics they might want to explore

If using information from the provided sources, please mention them like: "According to [Source Name]..." or "The [Document Name] indicates..."
"""

    return prompt, sources_used

def display_knowledge_base_sidebar():
    """Display knowledge base management in sidebar"""
    with st.sidebar:
        st.markdown("## πŸ“š Knowledge Base")

        # Show current stats
        stats = st.session_state.knowledge_base.get_document_stats()

        if stats["total_chunks"] > 0:
            st.success(f"πŸ“Š Loaded: {stats['total_chunks']} chunks from {stats['sources']} sources")

            with st.expander("πŸ“‹ Loaded Sources"):
                for source in stats["source_list"]:
                    st.write(f"β€’ {source}")
        else:
            st.info("No documents loaded yet. Add some below!")

        st.markdown("### πŸ“„ Upload PDF Documents")
        uploaded_file = st.file_uploader(
            "Upload climate reports, research papers, etc.",
            type=['pdf'],
            help="Upload IPCC reports, climate research papers, or sustainability documents"
        )

        if uploaded_file is not None:
            # Save uploaded file
            os.makedirs("documents", exist_ok=True)
            file_path = os.path.join("documents", uploaded_file.name)

            with open(file_path, "wb") as f:
                f.write(uploaded_file.getvalue())

            # Process the PDF
            with st.spinner(f"Processing {uploaded_file.name}..."):
                documents = st.session_state.knowledge_base.load_pdf(file_path, uploaded_file.name)
                if documents:
                    st.success(f"βœ… Loaded {len(documents)} chunks from {uploaded_file.name}")
                    st.rerun()

        st.markdown("### 🌐 Load Web Articles")

        # Predefined sources
        st.markdown("#### Quick Add Climate Sources:")
        selected_source = st.selectbox(
            "Choose a climate source",
            [""] + list(CLIMATE_SOURCES.keys())
        )

        if st.button("πŸ“₯ Load Selected Source") and selected_source:
            with st.spinner(f"Loading {selected_source}..."):
                url = CLIMATE_SOURCES[selected_source]
                documents = st.session_state.knowledge_base.load_web_article(url, selected_source)
                if documents:
                    st.success(f"βœ… Loaded {len(documents)} chunks from {selected_source}")
                    st.rerun()

        # Custom URL input
        st.markdown("#### Add Custom Article:")
        custom_url = st.text_input("Enter article URL")
        custom_name = st.text_input("Source name (optional)")

        if st.button("πŸ“₯ Load Custom Article") and custom_url:
            with st.spinner("Loading article..."):
                documents = st.session_state.knowledge_base.load_web_article(
                    custom_url, 
                    custom_name if custom_name else None
                )
                if documents:
                    st.success(f"βœ… Loaded {len(documents)} chunks")
                    st.rerun()

        # Clear knowledge base
        if st.button("πŸ—‘οΈ Clear Knowledge Base"):
            st.session_state.knowledge_base = ClimateKnowledgeBase()
            st.success("Knowledge base cleared!")
            st.rerun()

def friendly_wrap_with_sources(raw_text: str, sources_used: list) -> str:
    """Enhanced friendly wrapper that includes source information"""

    sources_section = ""
    if sources_used:
        unique_sources = list(set(sources_used))
        sources_section = f"\n\nπŸ“š **Sources used in this response:**\n"
        for source in unique_sources:
            sources_section += f"β€’ {source}\n"

    return f"Great question! 🌱\n\n{raw_text.strip()}{sources_section}\n\nWould you like me to elaborate on any part of this, or do you have other climate questions?"

def display_messages():
    """Display all messages in the chat"""
    for msg in st.session_state.messages:
        author = "user" if msg["role"] == "user" else "assistant"
        with st.chat_message(author):
            st.write(msg["content"])

# Create main layout
col1, col2 = st.columns([2, 1])

with col1:
    st.title("🌱 Climate Helper v3.0")
    st.subheader("Enhanced with Real Climate Knowledge!")

    # Display messages
    display_messages()

    # Handle new user input
    prompt = st.chat_input("Ask me about climate topics (I can now reference documents!)...")

    if prompt:
        # Add user message to history
        st.session_state.messages.append({"role": "user", "content": prompt})

        # Show user message
        with st.chat_message("user"):
            st.write(prompt)

        # Show thinking indicator
        with st.chat_message("assistant"):
            placeholder = st.empty()

            # Search for relevant documents
            with st.spinner("Searching knowledge base..."):
                relevant_docs = st.session_state.knowledge_base.search_documents(prompt, max_results=3)

            if relevant_docs:
                placeholder.write("πŸ” Found relevant documents, generating response...")
            else:
                placeholder.write("πŸ€” Thinking... (no specific documents found, using general knowledge)")

            try:
                # Create enhanced prompt with document context
                enhanced_prompt, sources_used = create_knowledge_enhanced_prompt(prompt, relevant_docs)

                # Generate response with Gemini
                model = genai.GenerativeModel('gemini-1.5-flash')
                response = model.generate_content(enhanced_prompt)

                answer = response.text
                friendly_answer = friendly_wrap_with_sources(answer, sources_used)

            except Exception as e:
                friendly_answer = f"I'm sorry, I encountered an error: {e}. Please try asking your question again."

            # Replace placeholder with actual response
            placeholder.write(friendly_answer)

            # Add assistant response to history
            st.session_state.messages.append({"role": "assistant", "content": friendly_answer})

            # Show relevant document chunks if found
            if relevant_docs:
                with st.expander(f"πŸ“„ View {len(relevant_docs)} relevant document excerpts"):
                    for i, doc in enumerate(relevant_docs, 1):
                        st.markdown(f"**Source {i}: {doc['source']}** (Score: {doc['relevance_score']})")
                        st.markdown(f"```
{% endraw %}
\n{doc['content'][:300]}...\n
{% raw %}
```")
                        st.markdown("---")

with col2:
    display_knowledge_base_sidebar()

# Footer
st.markdown("---")
stats = st.session_state.knowledge_base.get_document_stats()
st.markdown(f"πŸ’‘ **Knowledge Enhanced**: {stats['total_chunks']} document chunks loaded from {stats['sources']} sources")

Step 3: Updated Requirements

Update your requirements.txt for this lesson:

# Core Streamlit and AI
streamlit==1.28.0
python-dotenv==1.0.0
google-generativeai==0.3.0

# Document Processing Dependencies
pypdf==3.17.0              # PDF text extraction
langchain-text-splitters==0.2.0  # Intelligent text chunking
requests==2.31.0           # Web content fetching
beautifulsoup4==4.12.0     # HTML parsing and cleaning

Step 4: Testing Your Enhanced Chatbot

Installation and Setup:

# Navigate to lesson folder
cd lesson2-knowledge-base

# Install dependencies
pip install -r requirements.txt

# Run the enhanced chatbot
streamlit run app.py

Testing Workflow:

  1. Start with Web Sources: Choose "NASA Climate Change" from the dropdown → Click "Load Selected Source"
  2. Upload a PDF: Find any climate PDF online and upload it
  3. Ask Specific Questions:
    • "What does NASA say about greenhouse gases?"
    • "According to the uploaded report, what are the main climate impacts?"
  4. Check Citations: Verify responses include source names
  5. View Document Excerpts: Expand the document sections to see original text

Expected Behavior:

  • Responses include phrases like "According to NASA Climate Change..."
  • Sidebar shows loaded document statistics
  • Document excerpts appear below responses
  • Knowledge persists during the chat session

Understanding the Architecture

The RAG Pipeline:

User Question → Document Search → Relevant Chunks → Enhanced Prompt → AI Response with Citations

Key Components:

  1. Document Ingestion: PDF + Web → Clean Text
  2. Text Chunking: Large Text → Manageable Pieces
  3. Storage: In-Memory Document Store
  4. Retrieval: Keyword-Based Search
  5. Generation: AI with Context + Sources
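Condensed to its essentials, the loop above is just three calls. This is a sketch that reuses the functions and objects already defined in app.py and knowledge_base.py:

# The RAG loop in miniature (names come from the files above)
relevant_docs = st.session_state.knowledge_base.search_documents(user_question, max_results=3)  # Retrieval
enhanced_prompt, sources_used = create_knowledge_enhanced_prompt(user_question, relevant_docs)  # Augmentation
answer = genai.GenerativeModel('gemini-1.5-flash').generate_content(enhanced_prompt).text      # Generation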

Why This Approach Works:

  • Scalable: Add unlimited documents
  • Transparent: Users see which sources inform responses
  • Accurate: The AI has access to the latest information you load
  • Educational: Promotes learning from authoritative sources

Troubleshooting

Common Issues and Solutions:

PDF not loading?

  • Ensure the PDF contains selectable text, not scanned images (the snippet below shows a quick check)
  • Try smaller files first (< 10MB)
  • Check that pypdf installed correctly: pip show pypdf
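A quick way to check whether pypdf can read your file at all (replace the path with your own PDF):

# check_pdf.py - prints how much text pypdf can extract from the first page
from pypdf import PdfReader

reader = PdfReader("documents/your-report.pdf")  # placeholder path
first_page_text = reader.pages[0].extract_text() or ""
print(f"{len(reader.pages)} pages, {len(first_page_text)} characters extracted from page 1")
# 0 characters usually means the PDF is a scanned image and needs OCR first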

Web articles not loading?

  • Some sites block automated requests
  • Try different URLs (government sites usually work well)
  • Check internet connection

"No relevant documents found"

  • Use specific keywords from your uploaded documents
  • Verify documents actually loaded (check sidebar stats)
  • Try broader search terms

Slow performance?

  • Large documents take time to process
  • Consider reducing chunk size in knowledge_base.py (see the snippet below)
  • Limit simultaneous document loading
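For example, you could shrink the splitter settings in ClimateKnowledgeBase.__init__; the values below are only a suggestion, since smaller chunks search faster but carry less context each:

# knowledge_base.py - inside ClimateKnowledgeBase.__init__
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # down from 1000
    chunk_overlap=100,   # down from 200
    length_function=len,
)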

What's Next?

We've successfully built a knowledge-enhanced chatbot that can learn from real climate documents and cite its sources! This is the foundation of professional RAG systems.

Future Enhancement Ideas:

  • Document Categories: Organize by topic (mitigation, adaptation, science)
  • Automatic Updates: Scheduled fetching of new reports
  • Quality Scoring: Rate source reliability
  • Multi-language Support: Process documents in different languages

Get the Complete Code

GitHub Repository: https://github.com/your-username/climate-chatbot-series

  • Includes sample climate documents for testing

Questions? Connect with me on LinkedIn - I'd love to see what climate knowledge you add to your bot!

Conclusion

You've transformed your basic chatbot into a specialized climate knowledge system that can:

  • Process PDF reports and web articles
  • Search through loaded content
  • Provide source-backed responses
  • Show transparent document usage

This is the core idea of RAG (Retrieval-Augmented Generation) in action! Your chatbot now learns from authoritative climate sources and can provide more accurate, up-to-date information than general AI models alone.

In the next lesson, we'll add semantic search and advanced retrieval techniques. For now, try adding your own climate documents to the knowledge base and see how the chatbot responds!
