Today is the day we move Crawlee for Python out of beta 🥳 In this post I will summarize the decisions and development that went into this process. I hope it will inform — and even validate — your product decisions.
What is Crawlee for Python?
To begin with, a quick intro: Crawlee for Python is an open-source web-scraping library that provides you with a full toolkit to build, run, and scale your own scrapers. You can deploy Crawlee for Python to extract financial data and identify cases of corporate wage theft, to track major global polluters, or to compare the prices of competing e-commerce retailers. The only real limit is your creativity.
Why was it in beta?
Crawlee for Python was released in beta in July 2024. A couple of years earlier, we had released Crawlee for JS, which had better capabilities from the outset as a result of the tooling that was available for us to build with.
We knew we ultimately wanted both versions of Crawlee to have equivalent capabilities, but the Python tooling available to us at the time was too young and untested in real scenarios. We believed we could reach the point where Crawlee for Python was equivalent to (or better than!) its JS counterpart but weren’t ready to include untested tooling in the build. After working hard to launch, we made the decision to release Crawlee for Python in beta.
This decision gave us breathing room to collect real-world feedback and resolve the issues that inevitably come with new code. Getting the product out in beta allowed devs to engage with it, and gave us the chance to fix reported bugs, implement requested features, and address performance issues. It also helped us identify new issues with the library that we could incorporate into a more complete build.
What did we do to prepare for v1?
We used the time Crawlee for Python was in beta to improve more than just the library. We created several benchmarks to track the performance of a wide range of typical crawlers, allowing us to compare Crawlee for Python with other options available to developers, including Crawlee for JS.
We re-implemented most of the capabilities of Crawlee for JS in our v1 of Crawlee for Python. The key features we added are listed below, with short code sketches after the list:
- Unified storage client system, resulting in less duplication, better extensibility, and a cleaner developer experience. It also opens the door for the community to build and share their own storage client implementations tailored to specific databases or cloud storage providers.
- Adaptive Playwright crawler, which makes your crawls faster and cheaper, while still allowing you to reliably handle complex, dynamic websites. In practice, you get the best of both worlds: speed on simple pages and robustness on modern, JavaScript-heavy sites.
- New default HTTP client (`ImpitHttpClient`, powered by the Impit library), which delivers fewer false positives, more resilient crawls, and less need for complicated workarounds. Impit is also developed as an open-source project by Apify, so you can dive into the internals or contribute improvements yourself. You can also create your own instance, configure it to your needs (e.g. enable HTTP/3 or choose a specific browser profile), and pass it into your crawler.
- Sitemap request loader, making it easier to start large-scale crawls where sitemaps already provide full coverage of the site.
- Robots exclusion standard support, which not only helps you build ethical and responsible crawlers, but can also save time and bandwidth by skipping disallowed or irrelevant pages.
- Fingerprinting so that each crawler run looks like a real browser on a real device, making it less likely that you’ll get blocked. Using fingerprinting in Crawlee is straightforward: create a fingerprint generator with your desired options and pass it to the crawler.
- OpenTelemetry instrumentation, allowing you to monitor real-time dashboards or analyze traces to understand crawler performance, and making it easier to integrate Crawlee into existing monitoring pipelines.
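To make these features concrete, here are a few minimal sketches. All URLs and option values are illustrative, and parameter names may differ slightly between versions, so treat these as starting points and check the docs. First, swapping in a different storage client; this assumes the crawler constructor accepts a `storage_client` argument and uses the in-memory implementation:

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext
from crawlee.storage_clients import MemoryStorageClient


async def main() -> None:
    # Keep queues and datasets in memory instead of on disk.
    crawler = HttpCrawler(storage_client=MemoryStorageClient())

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Visited {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```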
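The adaptive crawler decides per page whether plain HTTP plus a static parser is enough or a real browser is needed; a single handler serves both cases. A minimal sketch using the BeautifulSoup static-parser variant:

```python
import asyncio

from crawlee.crawlers import (
    AdaptivePlaywrightCrawler,
    AdaptivePlaywrightCrawlingContext,
)


async def main() -> None:
    # Uses cheap HTTP + BeautifulSoup where possible, Playwright where needed.
    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser(
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # The same handler runs for static and browser-rendered pages alike.
        context.log.info(f'Processing {context.request.url}')
        await context.push_data({'url': context.request.url})

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```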
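To configure the new default HTTP client yourself, create an `ImpitHttpClient` instance and pass it to the crawler. The `http3` and `browser` options below mirror the capabilities described above, but the exact option names are an assumption worth verifying:

```python
from crawlee.crawlers import HttpCrawler
from crawlee.http_clients import ImpitHttpClient

# Impersonate Firefox and allow HTTP/3 (assumed option names).
client = ImpitHttpClient(http3=True, browser='firefox')

crawler = HttpCrawler(http_client=client)
```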
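A sitemap-driven crawl can skip link discovery entirely. This sketch assumes `SitemapRequestLoader` accepts an HTTP client and can be bridged to a writable request queue via a `to_tandem()` helper:

```python
import asyncio

from crawlee.crawlers import HttpCrawler
from crawlee.http_clients import ImpitHttpClient
from crawlee.request_loaders import SitemapRequestLoader


async def main() -> None:
    # Feed the crawl directly from the sitemap (no link discovery needed).
    loader = SitemapRequestLoader(
        sitemap_urls=['https://crawlee.dev/sitemap.xml'],
        http_client=ImpitHttpClient(),
    )

    # Bridge the read-only loader to a writable request queue (assumed helper).
    request_manager = await loader.to_tandem()

    crawler = HttpCrawler(request_manager=request_manager)
    await crawler.run()


asyncio.run(main())
```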
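Respecting the robots exclusion standard should be no more than a constructor flag:

```python
from crawlee.crawlers import HttpCrawler

# Fetch and honour each site's robots.txt; disallowed URLs are skipped.
crawler = HttpCrawler(respect_robots_txt_file=True)
```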
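Fingerprinting follows the pattern described above: build a generator with your desired options and hand it to a browser-based crawler. The option names here reflect our reading of the docs and may differ in your version:

```python
from crawlee.crawlers import PlaywrightCrawler
from crawlee.fingerprint_suite import (
    DefaultFingerprintGenerator,
    HeaderGeneratorOptions,
    ScreenOptions,
)

# Generate realistic, internally consistent browser fingerprints.
fingerprint_generator = DefaultFingerprintGenerator(
    header_options=HeaderGeneratorOptions(browsers=['chrome']),
    screen_options=ScreenOptions(min_width=1280),
)

crawler = PlaywrightCrawler(fingerprint_generator=fingerprint_generator)
```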
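Finally, OpenTelemetry follows the usual pattern: configure a tracer provider with the OTel SDK, then instrument the crawler. The `CrawlerInstrumentor` import below is an assumption based on the v1 docs, so confirm the exact API in the telemetry guide:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

from crawlee.otel import CrawlerInstrumentor  # assumed import path

# Standard OpenTelemetry setup; swap ConsoleSpanExporter for an OTLP exporter
# to ship traces to your collector.
provider = TracerProvider(resource=Resource.create({'service.name': 'my-crawler'}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

# Instrument Crawlee so crawler activity is emitted as spans.
CrawlerInstrumentor().instrument()
```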
How did we know it was ready to move out of beta?
Once we had implemented the features above and tested them against as many real-world cases as we could, we had a version of Crawlee for Python that lived up to our initial vision for it. We feel now is the time: Crawlee for Python is ready to sit alongside its JS counterpart as a full web-scraping and automation library.
Are you ready to check it out? Visit our repo and give us a star ⭐
What happens next?
We will continue to build upon Crawlee for Python v1 as a rolling release. Although we are happy to be launching the full version, we know we can make further improvements. This is where you come in!
We’d love your feedback. Try it out, and don’t hesitate to open an issue to:
- Report any bugs or weird behaviour you encounter;
- Ask questions if something isn’t clear, so we can improve the docs;
- Or request new features you’d like to see.
Thanks for reading!
We hope this summary gave you some insight into the process and decisions that go into moving a product out of beta. We would love to hear from you if you have questions about Crawlee for Python (or JS!), the tooling we have implemented, or anything else we have discussed in this article.
Give it a run and share your issues with us so we can continue to refine this tool and adapt it to suit all your web-crawling and automation needs.
Top comments (1)
I’m particularly excited about the adaptive Playwright crawler because it makes dealing with dynamic sites much simpler without volumes of custom retry or stealth logic. The new fingerprinting system has been effective for me, particularly with rotating proxies and human-imitating navigation patterns, reducing blocks substantially. I’ve also combined the OpenTelemetry support with Grafana/Prometheus, which gives me better visibility into latency and error spikes. I’d like to extend the unified storage client to send results directly to cloud warehouses such as BigQuery, which would simplify analytics workflows. Overall, Crawlee seems a solid addition to existing stacks today and is well-positioned to replace patchwork solutions for large-scale, dynamic web crawling in the future.