DEV Community

zahidkhan-xen



I Built an AI-Powered Web Scraper That Understands Plain English — Here's How

No XPaths. No CSS selectors. No regex nightmares. Just tell it what you want, and it figures out the rest.


1. Introduction — The Problem I Was Solving

If you've done any web scraping, you know the pain:

  • You write a perfectly working scraper today.
  • The website changes its layout tomorrow.
  • Your scraper breaks. You fix it. It breaks again.

The real frustration isn't writing the scraper — it's maintaining it. Every time a site updates its HTML structure, you're back to inspecting elements, rewriting selectors, and re-testing.

I kept thinking: What if the scraper could just understand what I want, the same way I'd explain it to a person?

That's what led me to build this project. Instead of writing brittle rules about where data lives in the HTML, I wanted to just ask a question in plain English and get an answer back.


2. Solution — What I Built

I built a Python AI Web Scraper that combines two powerful tools:

  • ScrapeGraphAI: a Python library that uses LLMs to scrape websites intelligently
  • Groq + Llama 3.1: Groq's fast inference API, which has a free tier, serving the Llama 3.1 model that reads and understands web page content

The result? You give it a URL and a question like:

"What is the purpose of this website?"

And it returns the answer — no HTML parsing, no CSS selectors, no fragile scraping rules.

Tech Stack at a glance:

  • Python 3.x
  • scrapegraphai — the AI scraping framework
  • playwright — for browser-based page rendering
  • Groq API — free LLM inference (Llama 3.1 8B model)

3. Code Explanation — Step by Step

Here's the complete main.py (it's beautifully short):

from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "api_key": "your-groq-api-key-here",
        "model": "groq/llama-3.1-8b-instant",
    },
    "verbose": True,
    "headless": True,
}

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract purpose of this website: https://qaul.ai/",
    source="https://qaul.ai/",
    config=graph_config
)

result = smart_scraper_graph.run()
print(result)

Let's break it down piece by piece.


Step 1 — Import the SmartScraperGraph

from scrapegraphai.graphs import SmartScraperGraph

SmartScraperGraph is the core class from the ScrapeGraphAI library. Think of it as a smart agent that knows how to visit a webpage, read its content, and answer your question using an LLM.


Step 2 — Configure the AI Model

graph_config = {
    "llm": {
        "api_key": "your-groq-api-key-here",
        "model": "groq/llama-3.1-8b-instant",
    },
    "verbose": True,
    "headless": True,
}

This config dictionary has three top-level keys:

  • llm — tells ScrapeGraphAI which AI model to use. Here we're using groq/llama-3.1-8b-instant, which is fast and available for free via the Groq API.
  • verbose: True — prints what the scraper is doing at each step, great for debugging.
  • headless: True — runs the browser in the background (no visible window), which is standard for production scrapers.

💡 Why Groq? Groq offers blazing-fast LLM inference with a generous free tier. Llama 3.1 8B is more than capable for most scraping tasks.


Step 3 — Define Your Scraping Task

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract purpose of this website: https://qaul.ai/",
    source="https://qaul.ai/",
    config=graph_config
)

This is the magic part. You provide:

  • prompt — your question in plain English. This is what you'd normally express as complex CSS selectors or XPath queries. Here, you just ask.
  • source — the URL of the website to scrape.
  • config — the configuration we defined above.
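As a minimal sketch of how those three arguments fit together, the helper below keeps the question and the URL separate and checks the URL before any browser is launched. The name build_task is hypothetical and not part of ScrapeGraphAI; it only returns the keyword arguments described above.

```python
from urllib.parse import urlparse


def build_task(question: str, url: str, config: dict) -> dict:
    """Return keyword arguments for SmartScraperGraph (hypothetical helper)."""
    parsed = urlparse(url)
    # Reject anything that is not an absolute http(s) URL.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if not question.strip():
        raise ValueError("the prompt must not be empty")
    return {"prompt": question.strip(), "source": url, "config": config}


if __name__ == "__main__":
    task = build_task(
        "What is the purpose of this website?",
        "https://qaul.ai/",
        {"verbose": True, "headless": True},
    )
    print(task)
    # With ScrapeGraphAI installed, the dict can be unpacked into the constructor:
    # smart_scraper_graph = SmartScraperGraph(**task)
```

Because the URL already travels in source, the prompt here contains only the question.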

Step 4 — Run It and Print the Result

result = smart_scraper_graph.run()
print(result)

Calling .run() triggers the entire pipeline:

  1. Playwright opens the webpage (headless browser)
  2. The page content is extracted
  3. The content is sent to the Llama model via Groq
  4. The model reads it and answers your prompt
  5. The result is returned as a Python dictionary
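To make steps 2 to 5 concrete, here is a rough, standard-library-only sketch of the same idea. It is not ScrapeGraphAI's actual implementation: step 1 (rendering the page with Playwright) is left out, and ask_llm is a stand-in for whatever function sends text to the model. The names TextExtractor, html_to_text, and answer_from_page are made up for illustration.

```python
from html.parser import HTMLParser
from typing import Callable


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Step 2: reduce raw HTML to its readable text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def answer_from_page(html: str, question: str, ask_llm: Callable[[str], str]) -> dict:
    text = html_to_text(html)
    # Step 3: the page text and the question travel to the model together.
    llm_prompt = f"Page content:\n{text}\n\nQuestion: {question}"
    # Steps 4 and 5: the model answers, and the answer comes back in a dict.
    return {"answer": ask_llm(llm_prompt)}


if __name__ == "__main__":
    sample_html = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><h1>Acme</h1><p>We sell anvils.</p></body></html>"
    )

    def fake_llm(prompt: str) -> str:
        # Stub so the sketch runs without an API key.
        return f"(stub) received {len(prompt)} characters"

    print(answer_from_page(sample_html, "What does this site sell?", fake_llm))
```

The point of the sketch is that no selector appears anywhere: the only site-specific input is the question.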

4. How to Use — Setup & Instructions

Getting this running on your machine takes about 5 minutes.

Prerequisites

  • Python 3.x
  • A Groq API key (the free tier is enough)


Step 1 — Clone the Repository

git clone https://github.com/zahidkhan-xen/ai-powered-web-scraper
cd ai-powered-web-scraper

Step 2 — Create a Virtual Environment

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate

Step 3 — Install Dependencies

pip install scrapegraphai
playwright install

⚠️ playwright install downloads the browser binaries needed to render web pages. This may take a minute.

Step 4 — Add Your Groq API Key

Open main.py and replace the placeholder with your actual key:

"api_key": "paste-your-groq-key-here",
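If you'd rather not keep a real key in a file that might end up on GitHub, one option is to read it from an environment variable instead. This is a sketch, not what the repo's main.py does: the function name load_graph_config is hypothetical, and GROQ_API_KEY is simply the variable name assumed here.

```python
import os


def load_graph_config(env_var: str = "GROQ_API_KEY") -> dict:
    """Build the same config as main.py, taking the key from the environment."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of a confusing API error later.
        raise RuntimeError(f"set the {env_var} environment variable first")
    return {
        "llm": {
            "api_key": api_key,
            "model": "groq/llama-3.1-8b-instant",
        },
        "verbose": True,
        "headless": True,
    }
```

After exporting the variable in your shell, graph_config = load_graph_config() can replace the hard-coded dictionary.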

Step 5 — Run the Scraper

python main.py

You'll see verbose output as the scraper works, and the final result printed at the end.


Customizing Your Scrape

To scrape a different site or extract different information, just change these two lines in main.py:

smart_scraper_graph = SmartScraperGraph(
    prompt="What products are listed on this page?",   # ← your question
    source="https://your-target-website.com/",          # ← your URL
    config=graph_config
)

More prompt examples you can try:

# Extract pricing information
prompt="List all pricing plans and their features"

# Get contact details
prompt="What is the contact email and phone number on this page?"

# Summarize an article
prompt="Summarize this article in 3 bullet points"

# Extract job listings
prompt="List all job titles and their locations"

5. Conclusion

What I love about this approach is how it flips the traditional scraping paradigm on its head. Instead of you adapting your code to the website's structure, the AI adapts to the website for you.

This is still an early-stage project — a single main.py — but it proves a powerful concept. Some directions you could take it further:

  • Add a CLI so you can pass URLs and prompts as arguments
  • Save results to JSON or CSV for downstream processing
  • Loop over multiple URLs to scrape at scale
  • Add error handling for pages that fail to load
  • Swap in different models (OpenAI, Anthropic, etc.) depending on task complexity
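As a starting point, the first four ideas could be combined roughly like this. Treat it as a sketch under assumptions, not part of the repo: scrape_many, scrape_with_scrapegraphai, main, and the --out flag are all hypothetical names, and the adapter assumes the key lives in a GROQ_API_KEY environment variable. The scraper is passed in as a function so the loop and the error handling can be tried without an API key.

```python
import argparse
import json
from typing import Callable, Iterable


def scrape_many(urls: Iterable[str], prompt: str, scrape: Callable[[str, str], dict]) -> list:
    """Run scrape(prompt, url) for each URL, recording failures instead of stopping."""
    results = []
    for url in urls:
        try:
            results.append({"url": url, "ok": True, "data": scrape(prompt, url)})
        except Exception as exc:  # one page failing to load should not abort the batch
            results.append({"url": url, "ok": False, "error": str(exc)})
    return results


def scrape_with_scrapegraphai(prompt: str, url: str) -> dict:
    """Adapter around the main.py code; needs scrapegraphai and GROQ_API_KEY."""
    import os
    from scrapegraphai.graphs import SmartScraperGraph  # imported only when used

    config = {
        "llm": {
            "api_key": os.environ["GROQ_API_KEY"],
            "model": "groq/llama-3.1-8b-instant",
        },
        "verbose": False,
        "headless": True,
    }
    return SmartScraperGraph(prompt=prompt, source=url, config=config).run()


def main(argv=None, scrape=scrape_with_scrapegraphai) -> int:
    parser = argparse.ArgumentParser(description="Ask one question of several pages.")
    parser.add_argument("prompt", help="question in plain English")
    parser.add_argument("urls", nargs="+", help="one or more page URLs")
    parser.add_argument("--out", default="results.json", help="JSON output path")
    args = parser.parse_args(argv)

    results = scrape_many(args.urls, args.prompt, scrape)
    with open(args.out, "w", encoding="utf-8") as fh:
        # default=str keeps the dump from failing on values JSON cannot encode.
        json.dump(results, fh, indent=2, ensure_ascii=False, default=str)
    # Return 1 if any page failed, so shell scripts can react.
    return 0 if all(r["ok"] for r in results) else 1
```

To use it as a script, call raise SystemExit(main()) under an if __name__ == "__main__": guard, then run it with a prompt followed by one or more URLs.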

Traditional scrapers break when websites change. This one just needs a better question. That's a fundamentally more resilient way to build.


GitHub: zahidkhan-xen/ai-powered-web-scraper

If you found this useful, consider giving the repo a ⭐ — it helps others find it too!
