The problem
I run nerdyelectronics.com, a tech blog. Every time I published a batch of posts or changed my site structure, I'd go through the same painful cycle:
- Make changes
- Wait for Google Search Console to recrawl (days)
- Discover broken links and issues... days later
- Fix them
- Wait again
I tried the usual tools. Screaming Frog is powerful but feels like enterprise software for a simple job. SaaS crawlers charge per page or per scan. I just wanted to know: what's broken on my site right now?
So I built CrawlyCat.
What it does
CrawlyCat crawls your website and reports:
- HTTP 4xx/5xx errors
- Redirect chains
- Missing or bad <title> and meta descriptions
- Missing or duplicate <h1> tags
- Internal broken links
- External link inventory
Nothing revolutionary — but it runs locally, has no limits, and takes about 30 seconds to set up.
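The on-page checks above boil down to a few BeautifulSoup queries. Here's a minimal sketch of that kind of detection logic (function name and issue messages are mine, not CrawlyCat's actual internals):

```python
from bs4 import BeautifulSoup

def find_issues(html: str) -> list[str]:
    """Flag missing/empty <title>, missing meta description, and <h1> problems."""
    soup = BeautifulSoup(html, "html.parser")
    issues = []

    title = soup.find("title")
    if title is None or not title.get_text(strip=True):
        issues.append("missing or empty <title>")

    meta = soup.find("meta", attrs={"name": "description"})
    if meta is None or not meta.get("content", "").strip():
        issues.append("missing meta description")

    h1s = soup.find_all("h1")
    if len(h1s) == 0:
        issues.append("missing <h1>")
    elif len(h1s) > 1:
        issues.append(f"duplicate <h1> ({len(h1s)} found)")

    return issues
```

Because both crawling modes hand off parsed HTML to the same checks, logic like this only has to exist once.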
The architecture
Two crawling modes:
Browser mode (default): Uses Playwright with headless Chromium. This handles JavaScript-rendered pages and even bypasses Cloudflare challenges using playwright-stealth. Slower, but necessary for modern sites.
Fast mode: Uses httpx for raw HTTP requests. About 10x faster, great for static sites or blogs. No JS rendering.
Both modes use BeautifulSoup for HTML parsing, share the same issue-detection logic, and respect robots.txt automatically.
Three interfaces
This is where CrawlyCat gets interesting. Most crawlers give you one interface. CrawlyCat has three:
CLI — for scripting, CI pipelines, and cron jobs:
```shell
python -m crawler --url https://example.com --max-pages 200 --html-out report.html
```
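For a recurring check, the same command drops straight into cron. A hypothetical crontab entry (paths are illustrative):

```shell
# Run every Monday at 03:00, writing a fresh HTML report
0 3 * * 1 cd /home/me/crawlycat && python -m crawler --url https://example.com --max-pages 200 --html-out report.html
```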
GUI — a tkinter desktop app with live progress and status tabs.
Web UI — a Flask app with Server-Sent Events for real-time updates, launched with python -m crawler web.
The Web UI is my favorite. It shows a live crawl log, tabbed status views (by HTTP code, by issue type), and a summary panel. No npm, no node — just Flask serving a single HTML template.
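The SSE side really does need nothing beyond Flask: a generator streamed with the text/event-stream MIME type. A minimal sketch of the pattern (route name and event payloads are illustrative, not CrawlyCat's actual endpoints):

```python
import json
import queue

from flask import Flask, Response

app = Flask(__name__)
events: "queue.Queue[dict]" = queue.Queue()  # the crawler thread puts progress dicts here

@app.route("/stream")
def stream() -> Response:
    def generate():
        while True:
            event = events.get()  # block until the crawler reports progress
            yield f"data: {json.dumps(event)}\n\n"
            if event.get("done"):
                break
    return Response(generate(), mimetype="text/event-stream")
```

On the page, a plain `new EventSource("/stream")` listener is enough to drive the live log and tab counters; no build step required.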
Tradeoffs I made
Browser mode vs. Fast mode: I could have forced everything through Playwright, but that's 10x slower. For a static blog, httpx is plenty. For a React SPA behind Cloudflare, you need the browser. So I kept both and let the user choose.
No sitemap seeding (yet): Right now CrawlyCat discovers pages by following links from the start URL. Sitemap support is on the roadmap.
robots.txt is always respected: You can't disable this. If a page is disallowed, it's skipped. I didn't want to build a tool that encourages ignoring robots.txt.
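Honoring robots.txt takes only the standard library. A sketch of the kind of check involved (in the real crawler the text would be fetched from the site's /robots.txt; here it is passed in directly to keep the sketch self-contained):

```python
from urllib.robotparser import RobotFileParser

def make_robots_checker(robots_txt: str, user_agent: str = "CrawlyCat"):
    """Build a can_fetch(url) predicate from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)
```

Any URL the predicate rejects simply never enters the crawl queue, which is why there is no flag to turn the behavior off.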
Local only: No cloud, no accounts, no telemetry. Your crawl data stays on your machine.
Results on my own site
I ran CrawlyCat against nerdyelectronics.com and found:
- Broken internal links from old posts that referenced moved pages
- Redirect chains from HTTP → HTTPS → www → non-www
- Pages missing meta descriptions
- A few posts with duplicate H1 tags
All things I wouldn't have caught until Search Console flagged them days later.
Try it
```shell
git clone https://github.com/bhageria/crawlycat.git
cd crawlycat
pip install -r requirements.txt
python -m playwright install chromium
python -m crawler web
```
That's it. Open http://127.0.0.1:5000, paste your URL, and hit crawl.
GitHub: github.com/bhageria/crawlycat
CrawlyCat is open source under GPL v3. Contributions welcome.