Collecting data from Amazon can be challenging, especially when it must be done regularly and at scale. The main obstacles are IP address blocks and CAPTCHAs, which are triggered by a high volume of similar requests and by inconsistencies in browser fingerprints. To reduce the risk of blocking, users have to keep finding proxies that are not yet banned while staying within traffic limits.
Inconsistent fingerprints are a problem because Amazon uses several security techniques to detect bots: it analyzes browser data and compares it with the user's operating system. Behavioral patterns also matter, so the automation logic must be updated regularly to imitate real users. Managing the entire automation stack involves more than writing and running a script. It includes automating launch schedules, creating auxiliary subsystems for handling proxies and browser data, and implementing monitoring mechanisms to ensure the system runs smoothly.
This article compares Surfsky with other popular browser automation and scraping services:
- ScrapingBee
- Browserless v.1 (Cloud Subscription)
Let’s evaluate these services based on their performance and features.
Price
We will not focus on price: Surfsky is currently in the Alpha stage, so comparing it with services that are already on the market would not be meaningful. For reference, here is the pricing of ScrapingBee and Browserless.
Metric | ScrapingBee | Browserless | Surfsky |
---|---|---|---|
Real price per 1,000 requests | $6 | $3.2 | Free now |
Proxies
High-quality proxies are essential for web scraping. Different types of proxies can be useful for solving various problems, so we investigated the available proxy options in the scraping services.
Feature | ScrapingBee | Browserless | Surfsky |
---|---|---|---|
Proxy rotation | ✅ | ✅ | ✅ |
Built-in residential proxies | ✅ | ✅ | ✅ |
HTTP support | ✅ | ✅ | ✅ |
SOCKS5 support | ❌ | ❌ | ✅ |
SSH support | ❌ | ❌ | ✅ |
OpenVPN support | ❌ | ❌ | ✅ |
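Proxy rotation can be sketched in a few lines. The example below is a minimal illustration and not how any of the compared services implement it: the `ProxyRotator` class and the `*.example` proxy endpoints are hypothetical. It cycles through a pool in round-robin order and skips proxies that have been marked as banned.

```python
from itertools import cycle
from typing import Iterator, List, Optional, Set


class ProxyRotator:
    """Round-robin proxy rotation that skips proxies marked as banned."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._cycle: Iterator[str] = cycle(self._proxies)
        self._banned: Set[str] = set()

    def mark_banned(self, proxy: str) -> None:
        """Exclude a proxy from future rotation, e.g. after a block or CAPTCHA."""
        self._banned.add(proxy)

    def next_proxy(self) -> Optional[str]:
        """Return the next usable proxy, or None if every proxy is banned."""
        # Look at each proxy at most once per call to avoid an endless loop.
        for _ in range(len(self._proxies)):
            candidate = next(self._cycle)
            if candidate not in self._banned:
                return candidate
        return None


# Hypothetical endpoints, for illustration only.
rotator = ProxyRotator([
    "http://proxy-a.example:8080",
    "socks5://proxy-b.example:1080",
    "http://proxy-c.example:8080",
])
first = rotator.next_proxy()
rotator.mark_banned("socks5://proxy-b.example:1080")
second = rotator.next_proxy()  # proxy-b is skipped
third = rotator.next_proxy()   # rotation wraps around to proxy-a
```

A real setup would also track traffic limits per proxy and retire proxies based on response codes, which this sketch leaves out.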
Fingerprinting
Purposefully altering the browser fingerprint can help evade anti-bot systems. However, most of the currently available solutions only offer a partial remedy to this problem. For instance, ScrapingBee recommends utilizing the 'Stealth Proxy' feature, which simply falsifies HTTP headers. On the other hand, Browserless employs the open-source library puppeteer-extra-plugin-stealth, but unfortunately, it is ineffective against advanced anti-bot systems.
Basic browser fingerprint management includes:
- User-agent spoofing;
- Modifying HTTP headers;
- Hiding --enable-automation, --headless flags;
- Overriding API permissions;
- Spoofing navigator, iframe, WebGL, media, window, chrome objects and properties.
It is important to note that object modification is performed using JavaScript and can be easily detected. Additionally, basic techniques often leave inconsistencies and non-standard property values, which anti-bot systems compare against the values expected for the user's browser and operating system.
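To make the consistency check concrete, here is a simplified sketch of the kind of comparison an anti-bot system might run. It is an assumption for illustration, not Amazon's actual logic: the `find_inconsistencies` function is hypothetical, and the `EXPECTED_PLATFORMS` mapping is a small, non-exhaustive sample of `navigator.platform` values.

```python
from typing import Dict, List, Set

# Illustrative, non-exhaustive mapping from an OS token in the User-Agent
# to plausible navigator.platform values. "Android" is listed before
# "Linux" because Android User-Agents also contain the word "Linux".
EXPECTED_PLATFORMS: Dict[str, Set[str]] = {
    "Windows NT": {"Win32"},
    "Macintosh": {"MacIntel"},
    "Android": {"Linux armv8l", "Linux aarch64"},
    "Linux": {"Linux x86_64"},
}


def find_inconsistencies(user_agent: str, platform: str, webdriver: bool) -> List[str]:
    """Return human-readable reasons why a fingerprint looks automated."""
    issues: List[str] = []
    if webdriver:
        issues.append("navigator.webdriver is true")
    if "HeadlessChrome" in user_agent:
        issues.append("User-Agent contains HeadlessChrome")
    for token, platforms in EXPECTED_PLATFORMS.items():
        if token in user_agent:
            if platform not in platforms:
                issues.append(
                    f"platform {platform!r} does not match OS token {token!r}"
                )
            break  # only the first matching OS token is checked
    return issues


# A spoofed Windows User-Agent running on a Linux server is flagged.
spoofed_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
spoofed = find_inconsistencies(spoofed_ua, platform="Linux x86_64", webdriver=False)
consistent = find_inconsistencies(spoofed_ua, platform="Win32", webdriver=False)
```

Changing only the User-Agent string leaves exactly this kind of mismatch behind, which is why header-level spoofing alone is easy to catch.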
Advanced browser fingerprint management includes not only basic techniques, but also some more advanced methods:
- WebGL Vendor info and image hash;
- Canvas fingerprinting;
- IP, DNS hardening;
- Audio and video fingerprinting;
- WebRTC hardening;
- Font fingerprinting;
- Geolocation API spoofing;
- SSL/TLS fingerprinting;
- HTML5 features spoofing;
- Extended JavaScript Browser information spoofing.
It is important to note that advanced fingerprint management is implemented differently from the basic techniques: instead of patching objects with JavaScript, it modifies the browser kernel itself.
High-quality, consistent fingerprint substitution always relies on low-level changes and real user fingerprints. Surfsky is designed to counter the most complex anti-bot systems.
Services like PixelScan and CreepJS are designed to detect and highlight inconsistencies in digital fingerprints. We used these services for the test.
Capability | ScrapingBee | Browserless | Surfsky |
---|---|---|---|
Basic browser fingerprinting | ✅ | ✅ | ✅ |
Advanced browser fingerprinting | ❌ | ❌ | ✅ |
Real GPU canvas rendering | ❌ | ❌ | ✅ |
PixelScan passing | ❌ | ❌ | ✅ |
CreepJS passing | ❌ | ✅ | ✅ |
Convenience
It is hard to evaluate such things as convenience, but here are the main aspects we’ve focused on:
- Support should be available at least in the form of a communication channel with the development team.
- An easy-to-use admin panel should include features such as managing subscriptions and creating requests.
- Easy-to-use documentation should include code examples for a quick start and clear manuals.
Criterion | ScrapingBee | Browserless | Surfsky |
---|---|---|---|
Support | ✅ | ✅ | ✅ |
Easy-to-use admin panel | ✅ | ✅ | ❌ |
Easy-to-use documentation | ✅ | ✅ | ✅ |
Features
Here are other parameters we decided to compare:
- Concurrent requests. The number of requests that can run at the same time defines the request limit. If you exceed this limit, a service may respond with rate limiting, errors, or blocked access.
- Scaling is an indicator of a service's ability to dynamically adjust server capacity, either increasing or decreasing it. The service needs to be able to adapt to the dynamic variation in the number of client requests.
- Chrome Debug Protocol (officially the Chrome DevTools Protocol, or CDP) is a set of tools and APIs for interacting with the Google Chrome/Chromium browser. It is used for debugging, automation, and flexible low-level control of the browser. Its functions are described in the specification here: https://chromedevtools.github.io/devtools-protocol/.
- The ability to take a screenshot of the page after it has loaded or after other trigger actions.
- Executing JavaScript scripts on a browser page. It can be useful for debugging purposes, monitoring status, ensuring accessibility, and verifying the correct display of page content.
- Profile management means being able to save your browser's current state. Imagine you're automating tasks on LinkedIn, which usually requires logged-in accounts to perform actions. If you need to perform actions with these accounts regularly, you'll need a system that can save cookies, local storage, session storage, history, passwords, bookmarks, service workers, and page caches. Developing this synchronization yourself can be difficult and time-consuming, especially if the necessary software interface is not available. Surfsky solves this problem by automatically saving and restoring browser sessions.
- Scraping API is not strictly necessary, but it is a useful tool that aims to simplify scraping work. It is an API that enables you to perform basic actions with the browser, such as opening a page, retrieving HTML, taking screenshots, and more. This tool can be beneficial when you don't have the opportunity or need to develop a script using browser interaction frameworks.
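To give a feel for what direct protocol access looks like, the sketch below builds the JSON messages that a client sends over a protocol WebSocket for three of the features listed above: opening a page, executing JavaScript, and taking a screenshot. `Page.navigate`, `Runtime.evaluate`, and `Page.captureScreenshot` are methods from the public protocol specification. The `CdpMessageBuilder` class and `decode_screenshot` helper are hypothetical names, and the example assumes a WebSocket that is already attached to a page target, so no connection code is shown.

```python
import base64
import json
from itertools import count


class CdpMessageBuilder:
    """Builds protocol commands with unique, increasing message ids."""

    def __init__(self) -> None:
        self._ids = count(1)

    def command(self, method: str, **params: object) -> str:
        # Every command is a JSON object with an id, a method name, and params.
        return json.dumps(
            {"id": next(self._ids), "method": method, "params": params}
        )


def decode_screenshot(response_text: str) -> bytes:
    """Extract image bytes from a Page.captureScreenshot response.

    The protocol returns the image as a base64 string in result.data.
    """
    response = json.loads(response_text)
    return base64.b64decode(response["result"]["data"])


builder = CdpMessageBuilder()
navigate = builder.command("Page.navigate", url="https://www.amazon.com/gp/goldbox")
evaluate = builder.command(
    "Runtime.evaluate", expression="document.title", returnByValue=True
)
screenshot = builder.command("Page.captureScreenshot", format="png")
```

Frameworks such as Puppeteer and Playwright send messages of this shape for you, so raw protocol access mainly matters when you need a command those frameworks do not expose.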
Feature | ScrapingBee | Browserless | Surfsky |
---|---|---|---|
Concurrent requests | 200+ on business+ plan | up to 1000 | unlimited by default |
Scaling | Upon support request | Premium subscription | Infinite by default |
Chrome Debug Protocol | ❌ | ❌ | ✅ |
Screenshot | ✅ | ✅ | ✅ |
JavaScript rendering | ✅ | ✅ | ✅ |
Profile management | ❌ | ❌ | ✅ |
Scraping API | ✅ | ❌ | ✅ |
Performance
Here are some of the measurements we used to evaluate the performance of the tools:
- Page load time. We considered a page loaded when the 'DOMContentLoaded' event was triggered. This benchmark used the built-in proxies provided by each service. We used Amazon's Today's Deals as an example: https://www.amazon.com/gp/goldbox. The fastest service received 3 points, the second-fastest received 2 points, and the slowest received 1 point.
- Scraping Google SERP (search results for 100 random keywords), Amazon Today's Deals, and eBay search results (100 random keywords). In this setup, each service started 20 instances simultaneously. The fastest service received 3 points, the second-fastest received 2 points, and the slowest received 1 point.
Each test was run 3 times at 1-minute intervals.
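The throughput measurement can be approximated with a small harness like the one below. This is a sketch under stated assumptions, not the script used for the results in this article: the 20 simultaneous instances are modeled as 20 worker threads, `fake_fetch` is a stand-in for a real call to a scraping service, and the function names and example URLs are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Callable, List


def measure_throughput(
    fetch: Callable[[str], object], urls: List[str], workers: int = 20
) -> float:
    """Fetch all URLs with a fixed number of workers; return pages per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fetch, urls))  # consume the iterator to wait for all pages
    elapsed = time.perf_counter() - start
    return len(urls) / elapsed


def run_benchmark(
    fetch: Callable[[str], object],
    urls: List[str],
    runs: int = 3,
    pause_seconds: float = 60.0,
    workers: int = 20,
) -> float:
    """Repeat the measurement with a pause between runs; return the average."""
    results = []
    for run in range(runs):
        if run > 0:
            time.sleep(pause_seconds)
        results.append(measure_throughput(fetch, urls, workers))
    return mean(results)


def fake_fetch(url: str) -> str:
    """Placeholder for a real request; sleeps to simulate network latency."""
    time.sleep(0.01)
    return url


urls = [f"https://example.com/search?q=keyword{i}" for i in range(100)]
# pause_seconds is set to 0 here only so that the example finishes quickly.
average = run_benchmark(fake_fetch, urls, runs=3, pause_seconds=0.0)
print(f"{average:.2f} page/sec")
```

Averaging over several spaced-out runs smooths out short-lived effects such as a slow proxy or a temporary rate limit.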
Test | ScrapingBee | Browserless | Surfsky |
---|---|---|---|
Page load time (avg) | 6.1 sec | 4.7 sec | 3.4 sec |
Google SERP (avg) | 1.79 page/sec | 0.95 page/sec | 2.71 page/sec |
Amazon Today's Deals (avg) | 1.18 page/sec | 0.97 page/sec | 3.32 page/sec |
eBay search results | 1.15 page/sec | 0.78 page/sec | 1.36 page/sec |
Performance Score | 7 points | 5 points | 12 points |
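The Performance Score row follows directly from the 3/2/1 point rule. The short sketch below recomputes it from the values in the table above; the function name and data layout are our own, and ties are not handled because none occur in these measurements.

```python
from typing import Dict, Tuple

# Values copied from the table above. The boolean tells whether a higher
# value is better (throughput) or a lower one is (page load time).
MEASUREMENTS: Dict[str, Tuple[Dict[str, float], bool]] = {
    "Page load time (sec)": (
        {"ScrapingBee": 6.1, "Browserless": 4.7, "Surfsky": 3.4}, False),
    "Google SERP (page/sec)": (
        {"ScrapingBee": 1.79, "Browserless": 0.95, "Surfsky": 2.71}, True),
    "Amazon Today's Deals (page/sec)": (
        {"ScrapingBee": 1.18, "Browserless": 0.97, "Surfsky": 3.32}, True),
    "eBay search results (page/sec)": (
        {"ScrapingBee": 1.15, "Browserless": 0.78, "Surfsky": 1.36}, True),
}

POINTS = (3, 2, 1)  # best, second-best, worst


def performance_scores(
    measurements: Dict[str, Tuple[Dict[str, float], bool]]
) -> Dict[str, int]:
    """Award 3/2/1 points per test and sum them per service."""
    totals: Dict[str, int] = {}
    for values, higher_is_better in measurements.values():
        # Sort service names from best to worst for this test.
        ranked = sorted(values, key=values.get, reverse=higher_is_better)
        for service, points in zip(ranked, POINTS):
            totals[service] = totals.get(service, 0) + points
    return totals


scores = performance_scores(MEASUREMENTS)
print(scores)
```

The totals come out to 7 points for ScrapingBee, 5 for Browserless, and 12 for Surfsky, matching the table.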
In conclusion, we invite you to join the Surfsky Alpha test! The team is currently looking for feedback, and now you have a chance to contribute to the product backlog. Let's work together to create the best web scraping tool! Visit https://surfsky.io/ for more details.
We've added a demo, so now you can easily check out the parsing capabilities of our cloud browser: open any website and get its contents as HTML and a screenshot.