DEV Community

Charles

Web Scraping for Data Science: How to Build Datasets Without Writing Spaghetti Code

Every data scientist hits this wall: you find a great data source on the web, but it sits behind pagination, JavaScript-rendered content, or, worst of all, a CAPTCHA.

The Problem

Traditional scraping for data science looks like this:

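import requests
from bs4 import BeautifulSoup
import time  # typically used to sleep between requests

# Fetch the page; fail fast on stalled connections and HTTP errors
response = requests.get('https://example.com/data', timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')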
# ... 50 lines of fragile selector logic ...
# ... oh wait, the page uses JS rendering ...
# ... and now I'm blocked ...

The API-First Approach

Modern web scraping APIs move the infrastructure headaches, such as JavaScript rendering, behind a single HTTP request:

curl -X POST https://api.xcrawl.com/v1/scrape \
  -H "x-api-key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/data", "js_render": true}'
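For data scientists working in Python, the same call can be sketched with the standard library alone. The endpoint, the `x-api-key` header, and the `url` / `js_render` fields come from the curl example; the `Content-Type` header and the assumption that the response body is JSON are guesses, and `build_scrape_request` and `scrape` are illustrative helper names, not part of any XCrawl SDK.

```python
import json
import urllib.request

API_URL = "https://api.xcrawl.com/v1/scrape"  # endpoint taken from the curl example


def build_scrape_request(target_url: str, api_key: str,
                         js_render: bool = True) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the curl example."""
    payload = json.dumps({"url": target_url, "js_render": js_render}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "x-api-key": api_key,
            # Assumption: the API expects the body to be declared as JSON.
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape(target_url: str, api_key: str, timeout: float = 30.0) -> dict:
    """Send the request and decode the response, assuming it is JSON."""
    request = build_scrape_request(target_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or unit-test the payload without spending API credits.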

Building a Real Dataset: 1000 GitHub Repos

Here's how to build that dataset with XCrawl in three steps: search, scrape and extract, then export.

Search

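// Note: `await` needs an async context; in a CommonJS script, run these calls inside an async function.
const { XcrawlScraper } = require('xcrawl-scraper');
const client = new XcrawlScraper({ apiKey: 'YOUR_KEY' });
// Ask for up to 100 search results restricted to github.com
const results = await client.search({ query: 'site:github.com topics', count: 100 });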
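The search call returns 100 results, while the goal is 1000 repos, so some form of batching and de-duplication is likely needed. The sketch below shows only that collection logic in Python. How XCrawl pages its results is not stated in this post, so `fetch_page` is a hypothetical callable you would implement yourself; the only thing assumed about each result is a `url` field, mirroring `results[0].url`.

```python
from typing import Callable, Dict, List

# Hypothetical signature: fetch_page(page_number) returns a list of result
# dicts, each carrying a "url" key. It is a stand-in, not an XCrawl function.
FetchPage = Callable[[int], List[Dict[str, str]]]


def collect_unique_urls(fetch_page: FetchPage, target: int,
                        max_pages: int = 50) -> List[str]:
    """Gather up to `target` distinct URLs, stopping early on an empty page."""
    seen = set()
    urls: List[str] = []
    for page in range(max_pages):
        batch = fetch_page(page)
        if not batch:  # no more results available
            break
        for item in batch:
            # Normalise trailing slashes so the same repo is not counted twice
            url = item.get("url", "").rstrip("/")
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
                if len(urls) >= target:
                    return urls
    return urls
```

The `max_pages` cap keeps a misbehaving source from looping forever when it keeps returning duplicates.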

Scrape & Extract

// Scrape the first search result and extract the fields named in the schema
const data = await client.scrape({
  url: results[0].url,
  js_render: true,
  extraction: { mode: 'llm', schema: { repo_name: 'string', stars: 'number' } }
});
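LLM-based extraction can return a missing field or a string where a number was requested, so it is worth checking records before they enter a dataset. The following is a minimal Python sketch that checks records against the same two-field schema (`repo_name` as a string, `stars` as a number). It assumes each extracted record arrives as a plain dict; the actual response shape is not shown in this post, and `validate_record` and `split_valid` are illustrative names.

```python
from typing import Any, Dict, List, Tuple

# Mirrors the schema passed to the extraction step: repo_name string, stars number
SCHEMA = {"repo_name": str, "stars": (int, float)}


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field, expected in SCHEMA.items():
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {field}: {type(value).__name__}")
    return problems


def split_valid(records: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Partition records into (valid, rejected) based on validate_record."""
    valid, rejected = [], []
    for record in records:
        (rejected if validate_record(record) else valid).append(record)
    return valid, rejected
```

Keeping the rejected records around, rather than silently dropping them, makes it easier to spot pages where extraction systematically fails.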

Export

xcrawl search "site:github.com" --count 1000 --output repos.csv
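If the records were collected in code rather than through the CLI, the export step can be done with Python's built-in `csv` module. This is a sketch under the assumption that each record is a dict with the `repo_name` and `stars` fields from the extraction schema; `write_repos_csv` is an illustrative helper, and the column order is an arbitrary choice.

```python
import csv
import io
from typing import Any, Dict, Iterable, TextIO

FIELDNAMES = ["repo_name", "stars"]  # column order for the CSV


def write_repos_csv(records: Iterable[Dict[str, Any]], out: TextIO) -> int:
    """Write de-duplicated records as CSV and return the number of rows written."""
    # extrasaction="ignore" drops any keys that are not in FIELDNAMES
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    seen = set()
    written = 0
    for record in records:
        name = record.get("repo_name")
        if not name or name in seen:  # skip unnamed and duplicate repos
            continue
        seen.add(name)
        writer.writerow(record)
        written += 1
    return written
```

To write to disk, open the target with `open("repos.csv", "w", newline="", encoding="utf-8")` and pass the file object as `out`; the `newline=""` argument is what the `csv` documentation recommends to avoid blank lines on Windows.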

Why This Matters

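| Approach    | Time         | Lines of Code | Maintenance          |
|-------------|--------------|---------------|----------------------|
| DIY Scraper | 2-4 hours    | 100-200       | High (breaks weekly) |
| API-First   | 5-10 minutes | 10-20         | None                 |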

Get started with XCrawl API at dash.xcrawl.com