Introducing AlgoCrawl: A Dual-Depth Web Crawler for Penetration Testing
I’m excited to share AlgoCrawl, a new open-source web crawler designed specifically for penetration testing and in-depth website analysis. Built with TypeScript and powered by Playwright (yes, it runs a visible, non-headless browser by default!), AlgoCrawl takes crawling further by combining two complementary techniques.
What Makes AlgoCrawl Unique?
Dual-Depth Crawling
Unlike traditional crawlers that only follow links between pages, AlgoCrawl introduces a dual-depth approach:
Page Depth: The crawler follows every link, form submission, and button click to explore the site page by page.
Dynamic Depth: On every page, the crawler simulates user interactions—clicking on dynamic elements and submitting forms—to capture content that loads via AJAX or other JavaScript-driven methods. This ensures that even content hidden behind dynamic user interfaces isn’t missed.
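The interplay of the two limits can be pictured as nested traversals: page depth bounds link-following between pages, while dynamic depth bounds rounds of in-page interaction. Here is a minimal sketch of that idea; the `Page` interface, function names, and depth parameters are illustrative, not AlgoCrawl’s actual API:

```typescript
// Abstracts a loaded page: `links()` returns statically navigable URLs,
// `interact()` simulates one round of clicks/form submissions and returns
// any URLs revealed by AJAX or other JavaScript-driven loading.
interface Page {
  url: string;
  links(): string[];
  interact(): string[];
}

// Breadth-first crawl with two independent depth limits.
function crawl(
  load: (url: string) => Page,
  start: string,
  pageDepth: number,
  dynamicDepth: number,
): string[] {
  const visited = new Set<string>();
  const queue: Array<{ url: string; depth: number }> = [
    { url: start, depth: 0 },
  ];

  while (queue.length > 0) {
    const { url, depth } = queue.shift()!;
    if (visited.has(url) || depth > pageDepth) continue;
    visited.add(url);

    const page = load(url);
    const found = new Set(page.links());
    // Dynamic depth: repeat interaction rounds to surface content
    // that only appears after simulated user actions.
    for (let round = 0; round < dynamicDepth; round++) {
      for (const u of page.interact()) found.add(u);
    }
    for (const u of found) queue.push({ url: u, depth: depth + 1 });
  }
  return [...visited];
}
```

With `dynamicDepth` set to 0 this degenerates into a traditional link-only crawler, which is exactly the content a single-depth tool would miss.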
Built for Penetration Testing
By mimicking real user behavior and capturing dynamic content, AlgoCrawl is ideal for:
Security Assessments: Uncovering potential vulnerabilities in web applications.
Performance Testing: Monitoring how sites react under various simulated interactions.
Comprehensive Analysis: Creating a detailed map of both static and dynamic content for further examination.
Key Features
Playwright Automation: Leverages Playwright for efficient browser automation.
Customizable Configuration: Tweak settings such as crawl depth, timeouts, and proxy options using a simple JSON file.
Optimized Element Interaction: Uses multiple click strategies (standard, JavaScript-triggered, and position-based) to handle a variety of interactive elements.
Robust Error Handling: Comprehensive logging and retry mechanisms to ensure reliable crawling in complex environments.
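The multi-strategy clicking above is essentially a fallback chain: try each strategy in order and stop at the first that succeeds. A rough sketch of the pattern, assuming hypothetical strategy names and shapes (not the project’s real interfaces):

```typescript
// One named way of clicking an element; may throw on failure.
type ClickStrategy = { name: string; click: (el: unknown) => Promise<void> };

// Try strategies in order; return the name of the first that succeeds.
async function clickWithFallback(
  el: unknown,
  strategies: ClickStrategy[],
): Promise<string> {
  let lastError: unknown;
  for (const s of strategies) {
    try {
      await s.click(el);
      return s.name; // first successful strategy wins
    } catch (err) {
      lastError = err; // record and fall through to the next strategy
    }
  }
  throw new Error(`all click strategies failed: ${String(lastError)}`);
}
```

With Playwright, the three strategies the post mentions would plausibly map to a standard `locator.click()`, a JavaScript-triggered click via `evaluate()`, and a coordinate-based `page.mouse.click(x, y)`.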
Getting Started
AlgoCrawl requires Node.js (v18 or later). To try it out:
Clone the repository:
git clone https://github.com/algorime/AlgoCrawl.git
cd AlgoCrawl
Install dependencies:
npm install
Run the crawler:
npm start
You can also run in development mode with auto-reloading via npm run dev.
All configuration is managed in config/default.json, making it easy to tailor the crawler to your specific needs.
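To give a feel for what such a file might look like, here is a hypothetical fragment; the key names below are purely illustrative, and the authoritative schema is the config/default.json shipped in the repository:

```json
{
  "startUrl": "https://example.com",
  "pageDepth": 3,
  "dynamicDepth": 2,
  "timeoutMs": 30000,
  "proxy": null
}
```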
Current Status & Roadmap
The project is already packed with robust features:
Core functionality with dual-depth crawling.
Enhanced AJAX handling and performance optimizations.
Comprehensive error tracking and logging.
Future plans include:
More advanced form interactions.
Deeper JavaScript event simulation.
Improved concurrent page handling.
If you’re interested in web security or just love innovative crawling solutions, I invite you to check out the project, contribute, and share your feedback!
Happy crawling and secure coding!