Stop Getting Blocked: Recon Your Target Website Before Scraping It

#python #showdev #tooling #webscraping

The problem

You spend hours writing a scraper, run it, and immediately get a 403.
Or you build it with requests, only to realize the site needs JavaScript to render.

I got tired of this loop, so I built scrapalyser, a Python library that scans
any website before you write a single line of scraper code.

Install

pip install scrapalyser

Usage

import scrapalyser

report = scrapalyser.scan(
    url="https://example.com",
    output="txt",
    lang="en",
)

What it detects

Anti-bot: Cloudflare, DataDome, PerimeterX, Akamai, Kasada, reCAPTCHA, hCaptcha
Tech stack: React, Vue, Angular, Next.js, Nuxt, WordPress, Shopify...
JS required: so you know if requests is enough or if you need Playwright
API endpoints: via CSP headers, inline scripts, or XHR interception (Playwright mode)
robots.txt & sitemap
Login wall: form, redirect, button, OAuth
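
To give a feel for the robots.txt & sitemap item, here is a minimal standalone sketch using only Python's standard library. It is not scrapalyser's own code; the helper name `parse_robots`, the sample rules, and the `my-scraper` user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def parse_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, used only for this example.
SAMPLE = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

rules = parse_robots(SAMPLE)

# Is a given path allowed for our (made-up) user agent?
admin_allowed = rules.can_fetch("my-scraper", "https://example.com/admin/users")
products_allowed = rules.can_fetch("my-scraper", "https://example.com/products")

# Sitemap URLs declared in the file (None when there are none).
sitemaps = rules.site_maps()
```

Checking this before writing the scraper tells you which paths the site asks crawlers to skip, and the sitemap is often the cheapest way to enumerate URLs.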

Two engines

curl_cffi (default): fast, no browser, one HTTP request.
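
With a single HTTP response and no browser, whether a page needs JavaScript has to be inferred from the raw HTML. The sketch below is one rough heuristic for that, assuming that a page with scripts but almost no visible text is a client-rendered shell. It is not how scrapalyser decides; `probably_needs_js` and the 200-character threshold are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script>, <style> and <noscript>, and count scripts."""

    SKIPPED = ("script", "style", "noscript")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.script_count = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
            if tag == "script":
                self.script_count += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def probably_needs_js(html: str, min_text_chars: int = 200) -> bool:
    """Guess that a page is a JS shell: it has scripts but very little text."""
    parser = _VisibleText()
    parser.feed(html)
    visible = " ".join(parser.chunks)
    return parser.script_count > 0 and len(visible) < min_text_chars


# A typical single-page-app shell versus a server-rendered page.
spa_shell = (
    "<html><head><title>App</title></head><body>"
    '<div id="root"></div><script src="/app.js"></script>'
    "</body></html>"
)
static_page = "<html><body><p>" + "word " * 100 + "</p></body></html>"
```

A heuristic like this will misjudge some pages (a short static page with an analytics script, for example), so treat the result as a hint rather than a verdict.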

playwright: full browser with XHR interception and screenshot.

report = scrapalyser.scan(
    url="https://example.com",
    engine="playwright",
    headless=False,
    screenshot="capture.png",
)
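
If you are curious what XHR interception looks like in general, here is a minimal sketch written directly against Playwright's sync API. It is my own illustration, not scrapalyser's internals; `endpoint_key` and `collect_api_endpoints` are hypothetical names, and it assumes you have run `pip install playwright` and `playwright install chromium`.

```python
from typing import List
from urllib.parse import urlsplit

# Playwright labels background calls made by page scripts as "xhr" or "fetch".
API_RESOURCE_TYPES = {"xhr", "fetch"}


def endpoint_key(url: str) -> str:
    """Drop the query string and fragment so repeated calls collapse to one entry."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def collect_api_endpoints(url: str) -> List[str]:
    """Load the page in headless Chromium and record XHR/fetch endpoints."""
    # Imported here so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    seen = set()

    def on_request(request):
        if request.resource_type in API_RESOURCE_TYPES:
            seen.add(endpoint_key(request.url))

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.on("request", on_request)
        # Wait until the network goes quiet so late API calls are captured.
        page.goto(url, wait_until="networkidle")
        browser.close()

    return sorted(seen)
```

Calling `collect_api_endpoints("https://example.com")` returns a sorted list of the endpoints the page called while loading, which is often a better scraping target than the rendered HTML.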

If you get blocked

If the site returns a 403 or a captcha page, the report tells you right away
which anti-bot system blocked you, and every other field returns "blocked by antibot".
No guessing.
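
To make the idea concrete, here is a rough sketch of how a blocked response can be attributed to a vendor from its status code, headers, cookies, and body. The markers are a handful of commonly seen ones, not an exhaustive list, and this is not scrapalyser's detection logic; `identify_antibot` and the `SIGNATURES` table are names I made up for the example.

```python
from typing import Mapping, Optional

# Status codes that commonly accompany a block or challenge page.
BLOCK_STATUSES = {403, 429, 503}

# (vendor, markers searched in headers/cookies, markers searched in the body)
SIGNATURES = [
    ("Cloudflare", ("cf-ray", "cf-mitigated"), ("challenges.cloudflare.com",)),
    ("DataDome", ("x-datadome", "datadome="), ("captcha-delivery.com",)),
    ("PerimeterX", ("_px3=", "_pxhd="), ("px-captcha",)),
    ("Akamai", ("_abck=", "ak_bmsc="), ()),
    ("reCAPTCHA", (), ("g-recaptcha", "google.com/recaptcha")),
    ("hCaptcha", (), ("h-captcha", "hcaptcha.com")),
]


def identify_antibot(
    status: int, headers: Mapping[str, str], body: str = ""
) -> Optional[str]:
    """Return a vendor name if the response looks blocked, else None."""
    header_blob = "\n".join(f"{k}: {v}" for k, v in headers.items()).lower()
    body_blob = body.lower()

    looks_blocked = status in BLOCK_STATUSES or "captcha" in body_blob
    if not looks_blocked:
        return None

    for vendor, header_markers, body_markers in SIGNATURES:
        if any(m in header_blob for m in header_markers):
            return vendor
        if any(m in body_blob for m in body_markers):
            return vendor
    return "unknown"
```

For example, `identify_antibot(403, {"Server": "cloudflare", "CF-RAY": "abc-CDG"})` returns `"Cloudflare"`, while a normal 200 response with no captcha markup returns `None`. The status check matters: plenty of sites sit behind Cloudflare as a CDN without ever blocking you, so the headers alone prove nothing.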
