A powerful web scraping tool that combines intelligent content extraction with AI-powered question answering. Built with Streamlit, LangChain, and Ollama for local AI processing.
## Features
- Smart Web Scraping: Automatically extracts content from any URL using multiple fallback methods
- AI-Powered Q&A: Ask questions about scraped content and get intelligent responses
- Local AI Processing: Uses Ollama for privacy-focused, offline AI processing
- Multiple Scraping Methods:
  - Selenium WebDriver for JavaScript-heavy sites
  - Simple HTTP requests for basic HTML pages
- Interactive Chat Interface: Real-time conversation with the scraped content
- Content Chunking: Intelligent text splitting for better context retrieval
- Source Citations: See exactly which parts of the content were used to answer your questions
- Error Recovery: Robust error handling with graceful fallbacks
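The multiple-fallback scraping described above can be sketched roughly as follows. These function names are illustrative, not the actual ones in `ai_scraper.py`, and the Selenium path is only stubbed out:

```python
# Sketch of the fallback idea: try Selenium for JS-heavy pages,
# fall back to a plain HTTP request for static HTML.
import requests
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Strip scripts/styles/navigation and collapse whitespace into plain text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


def scrape_selenium(url: str) -> str:
    """Placeholder for the Selenium path (needs Chrome and a webdriver)."""
    raise NotImplementedError("Selenium path not shown in this sketch")


def scrape_simple(url: str, timeout: int = 10) -> str:
    """Plain HTTP fetch; sufficient for static HTML pages."""
    resp = requests.get(url, timeout=timeout,
                        headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    return extract_text(resp.text)


def scrape_with_fallback(url: str) -> str:
    """Try the Selenium path first; fall back to plain requests on any error."""
    try:
        return scrape_selenium(url)
    except Exception:
        return scrape_simple(url)
```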
## Tech Stack
- Frontend: Streamlit
- AI/LLM: Ollama (llama3.2)
- Web Scraping: Selenium WebDriver, BeautifulSoup
- Text Processing: LangChain
- Vector Store: In-memory vector storage
- Embeddings: Ollama embeddings for semantic search
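The "Content Chunking" feature listed above relies on overlapping splits so that context survives chunk borders. A toy sliding-window splitter shows the idea; the app itself uses LangChain's text splitters, so this is a stand-in sketch, not the actual code:

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into overlapping chunks.

    Each chunk shares `overlap` characters with the previous one, so a
    sentence cut at a boundary is still fully present in one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```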
## Prerequisites
Before running this application, make sure you have:
- Python 3.8+ installed
- Ollama installed and running
- Chrome browser installed (for Selenium)
## Installation

### 1. Clone the Repository

```bash
git clone <your-repo-url>
cd ai-scraper
```

### 2. Install Python Dependencies

```bash
pip install -r requirements.txt
```

### 3. Install and Set Up Ollama

On Windows/Mac/Linux:

```bash
# Install Ollama from https://ollama.ai
# Then pull the required model
ollama pull llama3.2
```

Start the Ollama service:

```bash
ollama serve
```

### 4. Verify the Installation

```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# Check if the llama3.2 model is installed
ollama list
```
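If you prefer checking from Python, a small helper can probe the same `/api/tags` endpoint. This is a convenience sketch, not part of the app:

```python
import requests


def ollama_running(host: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an Ollama server responds at `host`."""
    try:
        resp = requests.get(f"{host}/api/tags", timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        # Connection refused, DNS failure, or timeout: treat as "not running".
        return False
```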
## Usage

### Starting the Application

```bash
streamlit run ai_scraper.py
```

The app will open in your browser at http://localhost:8501.
### How to Use

1. Enter a URL in the input field (e.g., `https://example.com`)
2. Click "Load & Process URL" to scrape and index the content
3. Wait for processing - you'll see progress indicators
4. Ask questions in the chat interface about the scraped content
5. View sources - expand the sources section to see which content was used
### Example Workflows

#### Scraping a News Article

1. Enter: `https://example-news-site.com/article`
2. Wait for "Documents indexed successfully!"
3. Ask: "What is the main topic of this article?"
4. Ask: "Who are the key people mentioned?"

#### Analyzing Documentation

1. Enter: `https://docs.example.com/api-guide`
2. Wait for processing
3. Ask: "How do I authenticate with this API?"
4. Ask: "What are the rate limits?"
## Configuration

### Environment Variables (Optional)

```bash
# Set a custom Ollama host
export OLLAMA_HOST=http://localhost:11434

# Set a custom model
export OLLAMA_MODEL=llama3.2
```
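Inside the app, these variables could be read with sensible defaults. This helper is a sketch following the variable names above; the app may read its configuration differently:

```python
import os


def ollama_config():
    """Read Ollama settings from the environment, falling back to defaults."""
    host = os.environ.get("OLLAMA_HOST", "http://localhost:11434")
    model = os.environ.get("OLLAMA_MODEL", "llama3.2")
    return host, model
```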
### Customizing the AI Model

You can use a different Ollama model by changing the model name in the code:

```python
# In ai_scraper.py, change:
embeddings = OllamaEmbeddings(model="llama3.2")
model = OllamaLLM(model="llama3.2")

# To:
embeddings = OllamaEmbeddings(model="llama2")  # or another model
model = OllamaLLM(model="llama2")
```

Available models:

- `llama3.2` (recommended)
- `llama2`
- `mistral`
- `codellama`
## Troubleshooting

### Common Issues

#### Segmentation Fault

- Cause: Chrome/Selenium driver issues
- Solution: The app automatically handles this with fallback methods

#### "Ollama not found"

```bash
# Start the Ollama service if it is not running
ollama serve

# Pull the model if it is not installed
ollama pull llama3.2
```

#### Chrome Driver Issues

```bash
# The app downloads the Chrome driver automatically.
# If issues persist, upgrade the driver manager manually:
pip install --upgrade webdriver-manager
```

#### Empty Content

- Cause: The website blocks automated scraping
- Solution: Try different URLs or check the website's robots.txt

#### Slow Processing

- Cause: Large pages or complex content
- Solutions:
  - Use more specific URLs
  - Wait for processing to complete
  - Consider using a more powerful model
### Performance Tips
- Use specific URLs rather than homepages
- Close unused browser tabs to free memory
- Use headless mode (already enabled)
- Clear chat history regularly for better performance
## Privacy & Security
- Local Processing: All AI processing happens locally with Ollama
- No Data Sent to Cloud: Your scraped content stays on your machine
- Secure Scraping: Respects robots.txt and rate limits
- No Persistent Storage: Data is only stored in memory during the session
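The in-memory, session-only storage can be pictured as a tiny vector store. This is a toy stand-in for LangChain's in-memory store, not the app's actual code: nothing is written to disk, so everything disappears when the process ends.

```python
import math


class TinyVectorStore:
    """Toy in-memory vector store: data lives only for the session."""

    def __init__(self):
        self._docs = []  # list of (embedding vector, text) pairs

    def add(self, vector, text):
        self._docs.append((vector, text))

    def search(self, query, k=1):
        """Return the k texts whose vectors are most similar to `query`."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self._docs, key=lambda d: cosine(query, d[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]
```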
## Contributing
Contributions are welcome! Here's how to contribute:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
### Development Setup

```bash
# Clone the repository and set up the development environment
git clone https://github.com/Navashub/AI-Agents
cd AI-Agents/ai-scraper
pip install -r requirements.txt
```
## Roadmap
- [ ] Multi-language Support - Support for more Ollama models
- [ ] PDF Scraping - Add PDF document processing
- [ ] Batch Processing - Process multiple URLs at once
- [ ] Export Functionality - Export Q&A sessions
- [ ] Advanced Filtering - Content filtering and preprocessing
- [ ] API Mode - REST API for programmatic access
- [ ] Docker Support - Containerized deployment
- [ ] Cloud Deployment - Deploy to cloud platforms
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Ollama - For providing excellent local AI capabilities
- LangChain - For the powerful document processing framework
- Streamlit - For the amazing web app framework
- Selenium - For robust web scraping capabilities
## Support

If you encounter any issues or have questions:

- Check the Troubleshooting section
- Search existing GitHub Issues
- Create a new issue with:
  - Your operating system
  - Python version
  - The error message (if any)
  - Steps to reproduce
## Show Your Support

If this project helped you, please consider:

- Starring the repository
- Sharing it with others
- Reporting bugs
- Suggesting new features
Happy Scraping!

Built with Python, Streamlit, and Ollama.