Navas Herbert

πŸ€– AI Web Scraper & Q&A

A powerful web scraping tool that combines intelligent content extraction with AI-powered question answering. Built with Streamlit, LangChain, and Ollama for local AI processing.

πŸš€ Features

  • Smart Web Scraping: Automatically extracts content from any URL using multiple fallback methods
  • AI-Powered Q&A: Ask questions about scraped content and get intelligent responses
  • Local AI Processing: Uses Ollama for privacy-focused, offline AI processing
  • Multiple Scraping Methods (a brief sketch follows this list):
    • Selenium WebDriver for JavaScript-heavy sites
    • Simple HTTP requests for basic HTML pages
  • Interactive Chat Interface: Real-time conversation with the scraped content
  • Content Chunking: Intelligent text splitting for better context retrieval
  • Source Citations: See exactly which parts of the content were used to answer your questions
  • Error Recovery: Robust error handling with graceful fallbacks
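
The Selenium-then-HTTP fallback mentioned above boils down to trying a headless Chrome session first and dropping back to a plain request when the driver fails. A minimal sketch of that idea, assuming Selenium, BeautifulSoup, and webdriver-manager are installed (function names here are illustrative, not the exact code in ai_scraper.py):

# Sketch of the Selenium-first, requests-fallback scraping idea.
# Function names are illustrative; ai_scraper.py may differ in detail.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

def scrape_with_selenium(url: str) -> str:
    options = Options()
    options.add_argument("--headless=new")  # no browser window
    service = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=service, options=options)
    try:
        driver.get(url)
        html = driver.page_source
    finally:
        driver.quit()
    return BeautifulSoup(html, "html.parser").get_text(separator="\n", strip=True)

def scrape_with_requests(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser").get_text(separator="\n", strip=True)

def scrape(url: str) -> str:
    try:
        return scrape_with_selenium(url)   # JavaScript-heavy sites
    except Exception:
        return scrape_with_requests(url)   # graceful fallback for plain HTML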

πŸ›  Tech Stack

  • Frontend: Streamlit
  • AI/LLM: Ollama (llama3.2)
  • Web Scraping: Selenium WebDriver, BeautifulSoup
  • Text Processing: LangChain
  • Vector Store: In-memory vector storage
  • Embeddings: Ollama embeddings for semantic search
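
The last three items work together at indexing time: scraped text is split into chunks, embedded with Ollama, and kept in an in-memory vector store. A minimal sketch of that flow, assuming the langchain-ollama and langchain-text-splitters packages (the chunk sizes are placeholders, not necessarily the values used in ai_scraper.py):

# Sketch of the chunk -> embed -> store pipeline; chunk sizes are assumptions.
from langchain_core.documents import Document
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def index_page(text: str, url: str) -> InMemoryVectorStore:
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents([Document(page_content=text, metadata={"source": url})])
    store = InMemoryVectorStore(embedding=OllamaEmbeddings(model="llama3.2"))
    store.add_documents(chunks)
    return store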

πŸ“‹ Prerequisites

Before running this application, make sure you have:

  1. Python 3.8+ installed
  2. Ollama installed and running
  3. Chrome browser installed (for Selenium)

πŸ”§ Installation

1. Clone the Repository

git clone <your-repo-url>
cd ai-scraper

2. Install Python Dependencies

pip install -r requirements.txt

3. Install and Setup Ollama

On Windows/Mac/Linux:

# Install Ollama from https://ollama.ai
# Then pull the required model
ollama pull llama3.2

Start Ollama Service:

ollama serve

4. Verify Installation

# Check if Ollama is running
curl http://localhost:11434/api/tags

# Check if llama3.2 model is installed
ollama list

πŸš€ Usage

Starting the Application

streamlit run ai_scraper.py

The app will open in your browser at http://localhost:8501

How to Use

  1. Enter a URL in the input field (e.g., https://example.com)
  2. Click "Load & Process URL" to scrape and index the content
  3. Wait for processing - you'll see progress indicators
  4. Ask questions in the chat interface about the scraped content (the retrieval step is sketched after this list)
  5. View sources - expand the sources section to see which content was used
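
Under the hood, each question triggers a similarity search over the indexed chunks, and the top matches are handed to the model as context; the same matches back the source citations. A minimal sketch of that answer step, reusing the store from the indexing sketch above (the prompt wording is an assumption, not the exact prompt in ai_scraper.py):

# Sketch of retrieval-augmented answering; the prompt text is illustrative.
from langchain_ollama import OllamaLLM

def answer(question: str, store):
    # returns (answer_text, source_documents)
    sources = store.similarity_search(question, k=4)   # most relevant chunks
    context = "\n\n".join(doc.page_content for doc in sources)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer_text = OllamaLLM(model="llama3.2").invoke(prompt)
    return answer_text, sources                        # sources feed the citations view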

Example Workflows

Scraping a News Article

1. Enter: https://example-news-site.com/article
2. Wait for "Documents indexed successfully!"
3. Ask: "What is the main topic of this article?"
4. Ask: "Who are the key people mentioned?"

Analyzing Documentation

1. Enter: https://docs.example.com/api-guide
2. Wait for processing
3. Ask: "How do I authenticate with this API?"
4. Ask: "What are the rate limits?"

βš™οΈ Configuration

Environment Variables (Optional)

# Set custom Ollama host
export OLLAMA_HOST=http://localhost:11434

# Set custom model
export OLLAMA_MODEL=llama3.2
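
If you wire these variables into the script yourself, it would look roughly like this (a sketch only; whether ai_scraper.py reads them exactly this way is an assumption):

# Sketch: passing the optional env vars through to the Ollama clients.
import os
from langchain_ollama import OllamaEmbeddings, OllamaLLM

host = os.getenv("OLLAMA_HOST", "http://localhost:11434")
model_name = os.getenv("OLLAMA_MODEL", "llama3.2")

embeddings = OllamaEmbeddings(model=model_name, base_url=host)
model = OllamaLLM(model=model_name, base_url=host)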

Customizing the AI Model

You can use different Ollama models by changing the model name in the code:

# In ai_scraper.py (OllamaEmbeddings and OllamaLLM come from langchain_ollama), change:
embeddings = OllamaEmbeddings(model="llama3.2")
model = OllamaLLM(model="llama3.2")

# To:
embeddings = OllamaEmbeddings(model="llama2")  # or another model
model = OllamaLLM(model="llama2")

Available models:

  • llama3.2 (recommended)
  • llama2
  • mistral
  • codellama
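
Whichever model you choose, pull it in Ollama before launching the app, for example:

ollama pull mistral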

πŸ” Troubleshooting

Common Issues

Segmentation Fault

  • Cause: Chrome/Selenium driver issues
  • Solution: The app automatically handles this with fallback methods

"Ollama not found"

# Check if Ollama is running
ollama serve

# Check if model is installed
ollama pull llama3.2

Chrome Driver Issues

# The app downloads the Chrome driver automatically via webdriver-manager
# If issues persist, upgrade it manually:
pip install --upgrade webdriver-manager

Empty Content

  • Cause: Website blocks automated scraping
  • Solution: Try different URLs or check the website's robots.txt
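
A quick way to check whether a page permits automated access is Python's built-in robots.txt parser (a standalone check, not part of ai_scraper.py):

# Standalone robots.txt check using only the standard library.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url: str, user_agent: str = "*") -> bool:
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser(urljoin(root, "/robots.txt"))
    parser.read()   # fetches and parses robots.txt
    return parser.can_fetch(user_agent, url)

print(is_allowed("https://example.com/article"))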

Slow Processing

  • Cause: Large pages or complex content
  • Solutions:
    • Use more specific URLs
    • Wait for processing to complete
    • Consider using a more powerful model

Performance Tips

  1. Use specific URLs rather than homepages
  2. Close unused browser tabs to free memory
  3. Use headless mode (already enabled)
  4. Clear chat history regularly for better performance

πŸ”’ Privacy & Security

  • Local Processing: All AI processing happens locally with Ollama
  • No Data Sent to Cloud: Your scraped content stays on your machine
  • Secure Scraping: Respects robots.txt and rate limits
  • No Persistent Storage: Data is only stored in memory during the session

🀝 Contributing

Contributions are welcome! Here's how to contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Setup

# Clone and set up the development environment
git clone https://github.com/Navashub/AI-Agents.git
cd AI-Agents/ai-scraper
pip install -r requirements.txt




πŸ“ˆ Roadmap

  • [ ] Multi-language Support - Support for more Ollama models
  • [ ] PDF Scraping - Add PDF document processing
  • [ ] Batch Processing - Process multiple URLs at once
  • [ ] Export Functionality - Export Q&A sessions
  • [ ] Advanced Filtering - Content filtering and preprocessing
  • [ ] API Mode - REST API for programmatic access
  • [ ] Docker Support - Containerized deployment
  • [ ] Cloud Deployment - Deploy to cloud platforms

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • Ollama - For providing excellent local AI capabilities
  • LangChain - For the powerful document processing framework
  • Streamlit - For the amazing web app framework
  • Selenium - For robust web scraping capabilities

πŸ“ž Support

If you encounter any issues or have questions:

  1. Check the Troubleshooting section
  2. Search existing GitHub Issues
  3. Create a new issue with:
    • Your operating system
    • Python version
    • Error message (if any)
    • Steps to reproduce

🌟 Show Your Support

If this project helped you, please consider:

  • ⭐ Starring the repository
  • πŸ”„ Sharing it with others
  • πŸ› Reporting bugs
  • πŸ’‘ Suggesting new features

Happy Scraping! πŸŽ‰

Built with Python, Streamlit, and Ollama.
