I Built an AI-Powered Web Scraper That Understands Plain English — Here's How
No XPaths. No CSS selectors. No regex nightmares. Just tell it what you want, and it figures out the rest.
1. Introduction — The Problem I Was Solving
If you've done any web scraping, you know the pain:
- You write a perfectly working scraper today.
- The website changes its layout tomorrow.
- Your scraper breaks. You fix it. It breaks again.
The real frustration isn't writing the scraper — it's maintaining it. Every time a site updates its HTML structure, you're back to inspecting elements, rewriting selectors, and re-testing.
I kept thinking: What if the scraper could just understand what I want, the same way I'd explain it to a person?
That's what led me to build this project. Instead of writing brittle rules about where data lives in the HTML, I wanted to just ask a question in plain English and get an answer back.
2. Solution — What I Built
I built a Python AI Web Scraper that combines two powerful tools:
- ScrapeGraphAI — a Python library that uses LLMs to intelligently scrape websites
- Groq + Llama 3.1 — a fast, free AI model that reads and understands web page content
The result? You give it a URL and a question like:
"What is the purpose of this website?"
And it returns the answer — no HTML parsing, no CSS selectors, no fragile scraping rules.
Tech Stack at a glance:
- Python 3.x
- scrapegraphai: the AI scraping framework
- playwright: for browser-based page rendering
- Groq API: free LLM inference (Llama 3.1 8B model)
3. Code Explanation — Step by Step
Here's the complete main.py (it's beautifully short):
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "your-groq-api-key-here",
        "model": "groq/llama-3.1-8b-instant",
    },
    "verbose": True,
    "headless": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract purpose of this website: https://qaul.ai/",
    source="https://qaul.ai/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)
Let's break it down piece by piece.
Step 1 — Import the SmartScraperGraph
from scrapegraphai.graphs import SmartScraperGraph
SmartScraperGraph is the core class from the ScrapeGraphAI library. Think of it as a smart agent that knows how to visit a webpage, read its content, and answer your question using an LLM.
Step 2 — Configure the AI Model
graph_config = {
    "llm": {
        "api_key": "your-groq-api-key-here",
        "model": "groq/llama-3.1-8b-instant",
    },
    "verbose": True,
    "headless": True,
}
This config dictionary has three entries:
- llm: tells ScrapeGraphAI which AI model to use. Here we're using groq/llama-3.1-8b-instant, which is fast and available for free via the Groq API.
- verbose: True, which prints what the scraper is doing at each step. This is great for debugging.
- headless: True, which runs the browser in the background (no visible window). This is standard for production scrapers.
💡 Why Groq? Groq offers blazing-fast LLM inference with a generous free tier. Llama 3.1 8B is more than capable for most scraping tasks.
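Hard-coding the key in main.py makes it easy to commit by accident. One option is to read it from an environment variable instead. The sketch below is a minimal version of that idea; the variable name GROQ_API_KEY and the helper build_graph_config are assumptions for illustration, not part of the original project.

```python
import os


def build_graph_config(env_var: str = "GROQ_API_KEY") -> dict:
    """Build the same config as main.py, reading the key from the environment.

    The env var name is an assumption; use whatever name fits your setup.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fail early with a clear message instead of a confusing API error later
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {
        "llm": {
            "api_key": api_key,
            "model": "groq/llama-3.1-8b-instant",
        },
        "verbose": True,
        "headless": True,
    }
```

With this in place, you would export the key in your shell before running the script and pass build_graph_config() as the config argument.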
Step 3 — Define Your Scraping Task
smart_scraper_graph = SmartScraperGraph(
    prompt="Extract purpose of this website: https://qaul.ai/",
    source="https://qaul.ai/",
    config=graph_config
)
This is the magic part. You provide:
- prompt: your question in plain English. This is what you'd normally express as complex CSS selectors or XPath queries. Here, you just ask.
- source: the URL of the website to scrape.
- config: the configuration we defined above.
Step 4 — Run It and Print the Result
result = smart_scraper_graph.run()
print(result)
Calling .run() triggers the entire pipeline:
- Playwright opens the webpage (headless browser)
- The page content is extracted
- The content is sent to the Llama model via Groq
- The model reads it and answers your prompt
- The result is returned as a Python dictionary
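Since the result comes back as a dictionary, a plain print can get hard to read once the answer has nested fields. A small sketch of pretty-printing it as JSON follows. The sample dictionary is made up for illustration; the real keys depend on your prompt and on what the model returns.

```python
import json


def format_result(result: dict) -> str:
    """Render a result dictionary as indented JSON for easier reading."""
    # default=str is a safety net in case a value is not JSON-serializable
    return json.dumps(result, indent=2, ensure_ascii=False, default=str)


# Hypothetical stand-in for the value returned by smart_scraper_graph.run()
sample = {"purpose": "Example description of the site", "topics": ["a", "b"]}
print(format_result(sample))
```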
4. How to Use — Setup & Instructions
Getting this running on your machine takes about 5 minutes.
Prerequisites
- Python 3.8 or higher
- A free Groq API key (get one at console.groq.com)
Step 1 — Clone the Repository
git clone https://github.com/zahidkhan-xen/ai-powered-web-scraper
cd ai-powered-web-scraper
Step 2 — Create a Virtual Environment
# macOS/Linux
python3 -m venv venv
source venv/bin/activate
# Windows
python -m venv venv
venv\Scripts\activate
Step 3 — Install Dependencies
pip install scrapegraphai
playwright install
⚠️ playwright install downloads the browser binaries needed to render web pages. This may take a minute.
Step 4 — Add Your Groq API Key
Open main.py and replace the placeholder with your actual key:
"api_key": "paste-your-groq-key-here",
Step 5 — Run the Scraper
python main.py
You'll see verbose output as the scraper works, and the final result printed at the end.
Customizing Your Scrape
To scrape a different site or extract different information, just change these two lines in main.py:
smart_scraper_graph = SmartScraperGraph(
    prompt="What products are listed on this page?",  # ← your question
    source="https://your-target-website.com/",  # ← your URL
    config=graph_config
)
More prompt examples you can try:
# Extract pricing information
prompt="List all pricing plans and their features"
# Get contact details
prompt="What is the contact email and phone number on this page?"
# Summarize an article
prompt="Summarize this article in 3 bullet points"
# Extract job listings
prompt="List all job titles and their locations"
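If you want to try several of these prompts against the same page, one way is to wrap the constructor call in a small loop. This is a sketch, not part of the original main.py: run_prompts and its scraper_cls parameter are hypothetical names, and the parameter exists only so the loop can be tried with a stand-in class before spending API calls.

```python
def run_prompts(source, prompts, config, scraper_cls=None):
    """Ask several plain-English questions about one page.

    Returns a dict mapping each prompt to whatever .run() returned for it.
    """
    if scraper_cls is None:
        # Imported lazily so the helper can be tested without the library installed
        from scrapegraphai.graphs import SmartScraperGraph
        scraper_cls = SmartScraperGraph

    results = {}
    for prompt in prompts:
        # Same three arguments as in main.py, one graph per question
        graph = scraper_cls(prompt=prompt, source=source, config=config)
        results[prompt] = graph.run()
    return results
```

Each prompt builds its own graph here, so the page is fetched and sent to the model once per question. That keeps the code simple at the cost of extra requests.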
5. Conclusion
What I love about this approach is how it flips the traditional scraping paradigm on its head. Instead of you adapting your code to the website's structure, the AI adapts to the website for you.
This is still an early-stage project — a single main.py — but it proves a powerful concept. Some directions you could take it further:
- Add a command-line interface so you can pass URLs and prompts as arguments
- Save results to JSON or CSV for downstream processing
- Loop over multiple URLs to scrape at scale
- Add error handling for pages that fail to load
- Swap in different models (OpenAI, Anthropic, etc.) depending on task complexity
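To make the first few ideas concrete, here is a rough sketch that combines a command-line interface, a loop over several URLs, basic error handling, and JSON output. All function names, the --out flag, and the GROQ_API_KEY environment variable are assumptions for illustration. The only ScrapeGraphAI usage is the same SmartScraperGraph call shown in main.py.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse a prompt, one or more URLs, and an output path."""
    parser = argparse.ArgumentParser(description="Ask one question about several pages.")
    parser.add_argument("prompt", help="question in plain English")
    parser.add_argument("urls", nargs="+", help="one or more page URLs")
    parser.add_argument("--out", default="results.json", help="where to write the results")
    return parser.parse_args(argv)


def scrape_with_graph(url, prompt):
    """Run one SmartScraperGraph with the same settings as main.py."""
    import os
    from scrapegraphai.graphs import SmartScraperGraph  # lazy import

    config = {
        "llm": {
            "api_key": os.environ["GROQ_API_KEY"],  # assumed env var name
            "model": "groq/llama-3.1-8b-instant",
        },
        "verbose": True,
        "headless": True,
    }
    return SmartScraperGraph(prompt=prompt, source=url, config=config).run()


def scrape_all(urls, prompt, scrape_one=scrape_with_graph):
    """Scrape each URL in turn; one failure does not stop the rest."""
    results = {}
    for url in urls:
        try:
            results[url] = {"ok": True, "data": scrape_one(url, prompt)}
        except Exception as exc:  # broad on purpose: record the error and move on
            results[url] = {"ok": False, "error": str(exc)}
    return results


def main(argv=None, scrape_one=scrape_with_graph):
    args = parse_args(argv)
    results = scrape_all(args.urls, args.prompt, scrape_one)
    with open(args.out, "w", encoding="utf-8") as fh:
        json.dump(results, fh, indent=2, ensure_ascii=False, default=str)
    return results
```

To use it as a script, call main() at the bottom of the file and pass the prompt first, followed by the URLs. The URLs are processed one after another; anything beyond a handful of pages would probably also need rate limiting, which this sketch leaves out.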
Traditional scrapers break when websites change. This one just needs a better question. That's a fundamentally more resilient way to build.
GitHub: zahidkhan-xen/ai-powered-web-scraper
If you found this useful, consider giving the repo a ⭐ — it helps others find it too!