vivek
I built my own website crawler because SEO tools were too restrictive

The problem

I run nerdyelectronics.com, a tech blog. Every time I published a batch of posts or changed my site structure, I'd go through the same painful cycle:

  1. Make changes
  2. Wait for Google Search Console to recrawl (days)
  3. Discover broken links and issues... days later
  4. Fix them
  5. Wait again

I tried the usual tools. Screaming Frog is powerful but feels like enterprise software for a simple job. SaaS crawlers charge per page or per scan. I just wanted to know: what's broken on my site right now?

So I built CrawlyCat.

What it does

CrawlyCat crawls your website and reports:

  • HTTP 4xx/5xx errors
  • Redirect chains
  • Missing or bad <title> and meta descriptions
  • Missing or duplicate <h1> tags
  • Internal broken links
  • External link inventory

Nothing revolutionary — but it runs locally, has no limits, and takes about 30 seconds to set up.
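The on-page checks in that list are straightforward to express with BeautifulSoup. A minimal sketch of what such a checker can look like — `check_page` is my own name for illustration, not CrawlyCat's actual API:

```python
from bs4 import BeautifulSoup

def check_page(html: str) -> list[str]:
    """Flag a missing/empty <title>, missing meta description, and
    missing or duplicate <h1> tags. Hypothetical helper, sketch only."""
    soup = BeautifulSoup(html, "html.parser")
    issues = []

    title = soup.find("title")
    if title is None or not title.get_text(strip=True):
        issues.append("missing or empty <title>")

    meta = soup.find("meta", attrs={"name": "description"})
    if meta is None or not meta.get("content", "").strip():
        issues.append("missing meta description")

    h1s = soup.find_all("h1")
    if len(h1s) == 0:
        issues.append("missing <h1>")
    elif len(h1s) > 1:
        issues.append(f"duplicate <h1> ({len(h1s)} found)")

    return issues

print(check_page("<html><head><title>Hi</title></head>"
                 "<body><h1>A</h1><h1>B</h1></body></html>"))
# → ['missing meta description', 'duplicate <h1> (2 found)']
```

Because both crawl modes hand back plain HTML, a checker like this can run unchanged on either one.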

The architecture

Two crawling modes:

Browser mode (default): Uses Playwright with headless Chromium. This handles JavaScript-rendered pages and even bypasses Cloudflare challenges using playwright-stealth. Slower, but necessary for modern sites.

Fast mode: Uses httpx for raw HTTP requests. About 10x faster, great for static sites or blogs. No JS rendering.

Both modes use BeautifulSoup for HTML parsing, share the same issue-detection logic, and respect robots.txt automatically.
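The robots.txt handling needs nothing beyond the standard library. A sketch of the kind of check both modes can share (`build_robots` is my name, not the project's):

```python
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a matcher. Sketch only --
    CrawlyCat fetches the file itself before crawling."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rules = build_robots("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("CrawlyCat", "https://example.com/private/page"))  # → False (skipped)
print(rules.can_fetch("CrawlyCat", "https://example.com/blog/post"))     # → True (crawled)
```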

Three interfaces

This is where CrawlyCat gets interesting. Most crawlers give you one interface. CrawlyCat has three:

CLI — for scripting, CI pipelines, and cron jobs:

```
python -m crawler --url https://example.com --max-pages 200 --html-out report.html
```
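For cron, the same invocation drops in directly. A hypothetical crontab entry — the paths and URL are placeholders, not anything the project prescribes:

```crontab
# Crawl every morning at 06:00 and keep an HTML report (placeholder paths)
0 6 * * * cd /home/me/crawlycat && python -m crawler --url https://example.com --max-pages 200 --html-out /home/me/reports/daily.html
```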

GUI — tkinter desktop app with live progress and status tabs:

```
crawlycat gui
```

Web UI — Flask app with Server-Sent Events for real-time updates:

```
crawlycat web
```

The Web UI is my favorite. It shows a live crawl log, tabbed status views (by HTTP code, by issue type), and a summary panel. No npm, no node — just Flask serving a single HTML template.
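Server-Sent Events need no client-side library at all, which is what keeps the frontend this small. A minimal sketch of what the Flask side of an SSE stream can look like — the route name and payloads are my own, not CrawlyCat's:

```python
from flask import Flask, Response

app = Flask(__name__)

def sse_format(data: str) -> str:
    # SSE wire format: each event is "data: <payload>" followed by a blank line
    return f"data: {data}\n\n"

@app.route("/events")
def events():
    def stream():
        # The real app would yield live crawl progress here;
        # these are placeholder events for illustration.
        for msg in ("crawled /", "crawled /about", "done"):
            yield sse_format(msg)
    return Response(stream(), mimetype="text/event-stream")
```

On the browser side, `new EventSource("/events")` is enough to receive the stream — no npm, matching the post's no-build-step approach.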

Tradeoffs I made

Browser mode vs. Fast mode: I could have forced everything through Playwright, but that's 10x slower. For a static blog, httpx is plenty. For a React SPA behind Cloudflare, you need the browser. So I kept both and let the user choose.

No sitemap seeding (yet): Right now CrawlyCat discovers pages by following links from the start URL. Sitemap support is on the roadmap.

robots.txt is always respected: You can't disable this. If a page is disallowed, it's skipped. I didn't want to build a tool that encourages ignoring robots.txt.

Local only: No cloud, no accounts, no telemetry. Your crawl data stays on your machine.

Results on my own site

I ran CrawlyCat against nerdyelectronics.com and found:

  • Broken internal links from old posts that referenced moved pages
  • Redirect chains from HTTP → HTTPS → www → non-www
  • Pages missing meta descriptions
  • A few posts with duplicate H1 tags

All things I wouldn't have caught until Search Console flagged them days later.
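Redirect chains like that HTTP → HTTPS → www hop sequence are easy to surface once you stop auto-following redirects. A dependency-free sketch — the fetcher is injected so the logic runs without a network, which is my structuring choice, not necessarily CrawlyCat's:

```python
from typing import Callable, Optional

def redirect_chain(url: str,
                   fetch: Callable[[str], tuple[int, Optional[str]]],
                   max_hops: int = 10) -> list[str]:
    """Follow Location headers manually and return every URL visited.
    A chain with more than 2 entries means extra hops worth collapsing."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in (301, 302, 307, 308) or location is None:
            break
        chain.append(location)
    return chain

# Simulated hops, no network: http -> https -> www -> final 200
hops = {
    "http://example.com/":      (301, "https://example.com/"),
    "https://example.com/":     (301, "https://www.example.com/"),
    "https://www.example.com/": (200, None),
}
print(redirect_chain("http://example.com/", lambda u: hops[u]))
# → ['http://example.com/', 'https://example.com/', 'https://www.example.com/']
```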

Try it

```
git clone https://github.com/bhageria/crawlycat.git
cd crawlycat
pip install -r requirements.txt
python -m playwright install chromium
python -m crawler web
```

That's it. Open http://127.0.0.1:5000, paste your URL, and hit crawl.

GitHub: github.com/bhageria/crawlycat


CrawlyCat is open source under GPL v3. Contributions welcome.
