vivek
I built my own website crawler because SEO tools were too restrictive

The problem

I run nerdyelectronics.com, a tech blog. Every time I published a batch of posts or changed my site structure, I'd go through the same painful cycle:

  1. Make changes
  2. Wait for Google Search Console to recrawl (days)
  3. Discover broken links and issues... days later
  4. Fix them
  5. Wait again

I tried the usual tools. Screaming Frog is powerful but feels like enterprise software for a simple job. SaaS crawlers charge per page or per scan. I just wanted to know: what's broken on my site right now?

So I built CrawlyCat.

What it does

CrawlyCat crawls your website and reports:

  • HTTP 4xx/5xx errors
  • Redirect chains
  • Missing or bad <title> and meta descriptions
  • Missing or duplicate <h1> tags
  • Internal broken links
  • External link inventory

Nothing revolutionary — but it runs locally, has no limits, and takes about 30 seconds to set up.
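The on-page checks in that list are straightforward to express with BeautifulSoup. A minimal sketch of what such a checker can look like — `check_page` is my own name for illustration, not CrawlyCat's actual API:

```python
from bs4 import BeautifulSoup

def check_page(html: str) -> list[str]:
    """Flag a missing/empty <title>, missing meta description, and
    missing or duplicate <h1> tags. Hypothetical helper, sketch only."""
    soup = BeautifulSoup(html, "html.parser")
    issues = []

    title = soup.find("title")
    if title is None or not title.get_text(strip=True):
        issues.append("missing or empty <title>")

    meta = soup.find("meta", attrs={"name": "description"})
    if meta is None or not meta.get("content", "").strip():
        issues.append("missing meta description")

    h1s = soup.find_all("h1")
    if len(h1s) == 0:
        issues.append("missing <h1>")
    elif len(h1s) > 1:
        issues.append(f"duplicate <h1> ({len(h1s)} found)")

    return issues

print(check_page("<html><head><title>Hi</title></head>"
                 "<body><h1>A</h1><h1>B</h1></body></html>"))
# → ['missing meta description', 'duplicate <h1> (2 found)']
```

Because both crawl modes hand back plain HTML, a checker like this can run unchanged on either one.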

The architecture

Two crawling modes:

Browser mode (default): Uses Playwright with headless Chromium. This handles JavaScript-rendered pages and even bypasses Cloudflare challenges using playwright-stealth. Slower, but necessary for modern sites.

Fast mode: Uses httpx for raw HTTP requests. About 10x faster, great for static sites or blogs. No JS rendering.

Both modes use BeautifulSoup for HTML parsing, share the same issue-detection logic, and respect robots.txt automatically.
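The robots.txt handling needs nothing beyond the standard library. A sketch of the kind of check both modes can share (`build_robots` is my name, not the project's):

```python
from urllib.robotparser import RobotFileParser

def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse a robots.txt body into a matcher. Sketch only --
    CrawlyCat fetches the file itself before crawling."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp

rules = build_robots("User-agent: *\nDisallow: /private/\n")
print(rules.can_fetch("CrawlyCat", "https://example.com/private/page"))  # → False (skipped)
print(rules.can_fetch("CrawlyCat", "https://example.com/blog/post"))     # → True (crawled)
```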

Three interfaces

This is where CrawlyCat gets interesting. Most crawlers give you one interface. CrawlyCat has three:

CLI — for scripting, CI pipelines, and cron jobs:

```
python -m crawler --url https://example.com --max-pages 200 --html-out report.html
```
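For cron, the same invocation drops in directly. A hypothetical crontab entry — the paths and URL are placeholders, not anything the project prescribes:

```crontab
# Crawl every morning at 06:00 and keep an HTML report (placeholder paths)
0 6 * * * cd /home/me/crawlycat && python -m crawler --url https://example.com --max-pages 200 --html-out /home/me/reports/daily.html
```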

GUI — tkinter desktop app with live progress and status tabs:

```
crawlycat gui
```

Web UI — Flask app with Server-Sent Events for real-time updates:

```
crawlycat web
```

The Web UI is my favorite. It shows a live crawl log, tabbed status views (by HTTP code, by issue type), and a summary panel. No npm, no node — just Flask serving a single HTML template.
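Server-Sent Events need no client-side library at all, which is what keeps the frontend this small. A minimal sketch of what the Flask side of an SSE stream can look like — the route name and payloads are my own, not CrawlyCat's:

```python
from flask import Flask, Response

app = Flask(__name__)

def sse_format(data: str) -> str:
    # SSE wire format: each event is "data: <payload>" followed by a blank line
    return f"data: {data}\n\n"

@app.route("/events")
def events():
    def stream():
        # The real app would yield live crawl progress here;
        # these are placeholder events for illustration.
        for msg in ("crawled /", "crawled /about", "done"):
            yield sse_format(msg)
    return Response(stream(), mimetype="text/event-stream")
```

On the browser side, `new EventSource("/events")` is enough to receive the stream — no npm, matching the post's no-build-step approach.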

Tradeoffs I made

Browser mode vs. Fast mode: I could have forced everything through Playwright, but that's 10x slower. For a static blog, httpx is plenty. For a React SPA behind Cloudflare, you need the browser. So I kept both and let the user choose.

No sitemap seeding (yet): Right now CrawlyCat discovers pages by following links from the start URL. Sitemap support is on the roadmap.

robots.txt is always respected: You can't disable this. If a page is disallowed, it's skipped. I didn't want to build a tool that encourages ignoring robots.txt.

Local only: No cloud, no accounts, no telemetry. Your crawl data stays on your machine.

Results on my own site

I ran CrawlyCat against nerdyelectronics.com and found:

  • Broken internal links from old posts that referenced moved pages
  • Redirect chains from HTTP → HTTPS → www → non-www
  • Pages missing meta descriptions
  • A few posts with duplicate H1 tags

All things I wouldn't have caught until Search Console flagged them days later.
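Redirect chains like that HTTP → HTTPS → www hop sequence are easy to surface once you stop auto-following redirects. A dependency-free sketch — the fetcher is injected so the logic runs without a network, which is my structuring choice, not necessarily CrawlyCat's:

```python
from typing import Callable, Optional

def redirect_chain(url: str,
                   fetch: Callable[[str], tuple[int, Optional[str]]],
                   max_hops: int = 10) -> list[str]:
    """Follow Location headers manually and return every URL visited.
    A chain with more than 2 entries means extra hops worth collapsing."""
    chain = [url]
    for _ in range(max_hops):
        status, location = fetch(chain[-1])
        if status not in (301, 302, 307, 308) or location is None:
            break
        chain.append(location)
    return chain

# Simulated hops, no network: http -> https -> www -> final 200
hops = {
    "http://example.com/":      (301, "https://example.com/"),
    "https://example.com/":     (301, "https://www.example.com/"),
    "https://www.example.com/": (200, None),
}
print(redirect_chain("http://example.com/", lambda u: hops[u]))
# → ['http://example.com/', 'https://example.com/', 'https://www.example.com/']
```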

Try it

```
git clone https://github.com/bhageria/crawlycat.git
cd crawlycat
pip install -r requirements.txt
python -m playwright install chromium
python -m crawler web
```

That's it. Open http://127.0.0.1:5000, paste your URL, and hit crawl.

GitHub: github.com/bhageria/crawlycat


CrawlyCat is open source under GPL v3. Contributions welcome.
