DEV Community

Siddhesh Surve
Siddhesh Surve

Posted on

🕸️ I Just Deleted My Scraper Boilerplate: Meet the "One-Liner" Crawler

Stop writing for loops to crawl websites. Seriously. It’s 2026, and we just got the upgrade XPath was missing for 20 years.

If you have ever built a web scraper, you know the Drill of Pain™:

  1. Write a Python script.
  2. Import requests and BeautifulSoup.
  3. Fetch a URL.
  4. Parse the HTML.
  5. Write a loop to find links.
  6. Recursively call the function (and accidentally DDoS the site because you forgot a time.sleep).

It’s imperative, it’s messy, and it’s hard to read 6 months later.

But a new open-source tool just dropped on GitHub, and it is effectively "SQL for the Web."

Meet wxpath.

🤯 The Paradigm Shift: Crawling Inside the Query

Usually, XPath is just a selector language. You download the page first, then you use XPath to pick the title.

wxpath changes the rules. It adds a custom url(...) operator directly into the XPath language. This means your selector doesn't just find data—it travels to new pages to get it.

The Old Way (Python + Requests + Parsel)

# 🤮 The Boilerplate Nightmare
import requests
from parsel import Selector

def crawl(url):
    response = requests.get(url)
    sel = Selector(text=response.text)

    # Extract data
    title = sel.xpath("//h1/text()").get()

    # Find next link and repeat...
    next_page = sel.xpath("//a[@class='next']/@href").get()
    if next_page:
        crawl(next_page) # Hope you handle recursion depth!

Enter fullscreen mode Exit fullscreen mode

The wxpath Way

# 😍 The "One-Liner" (Declarative)
import wxpath

# Fetch the URL, find the link, FETCH THAT LINK, and return the title
query = "url('https://site.com')/url(//a[@class='next']/@href)//h1/text()"

results = wxpath.query(query)

Enter fullscreen mode Exit fullscreen mode

Do you see that? The logic of navigation is embedded in the query.

⚡ Feature Spotlight: The "Deep Crawl" Operator (///)

The killer feature isn't just fetching one page. It's the /// operator.

In standard XPath, // means "search anywhere in this document."
In wxpath, /// means "crawl anywhere in this network."

If you want to crawl all links on a Wikipedia page, and then for each of those pages, extract the title and the first paragraph, you don't need a loop. You need a map.

Scraping a Knowledge Graph in one expression:

url('https://en.wikipedia.org/wiki/Web_scraping')
///url(//div[@id='bodyContent']//a/@href)  /map {                                     'title': //h1/text(),
    'summary': (//p)[1]/text(),
    'source_url': base-uri(.)
}

Enter fullscreen mode Exit fullscreen mode

This single expression:

  1. Fetches the seed URL.
  2. Finds all links in the body.
  3. Concurrently crawls those links.
  4. Returns a clean JSON-like object (Map) for each result.

🛠️ Why This is "Viral" Tech

We are moving away from Imperative Coding (telling the computer how to do it) to Declarative Coding (telling the computer what you want).

We saw this with React (UI). We saw this with GraphQL (APIs). Now we are seeing it with wxpath (Scraping).

Key Features that make it production-ready:

  • XPath 3.1 Support: It supports modern Maps and Arrays (/map{...}), so your output is ready for JSON serialization immediately.
  • Async Under the Hood: It uses aiohttp to fetch pages concurrently, even though the query looks linear.
  • Politeness Built-in: It respects robots.txt and handles rate limiting automatically, so you don't become "that guy" who crashes a server.

🚀 How to Try It

It’s a Python library, so installation is standard:

pip install wxpath

Enter fullscreen mode Exit fullscreen mode

Then run the CLI to test a query instantly:

wxpath "url('https://news.ycombinator.com')//a[@class='storylink']/text()"

Enter fullscreen mode Exit fullscreen mode

🔮 The Verdict

Is this the end of Scrapy? Probably not for massive, enterprise-scale crawls. But for the 90% of use cases where you just need to "grab data from these linked pages," wxpath is a cheat code.

It turns a 50-line script into a 3-line query. That is the kind of efficiency we live for.

Star the repo while it's hot:
👉 github.com/rodricios/wxpath

Top comments (0)