<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Surfsky</title>
    <description>The latest articles on DEV Community by Surfsky (@surfsky).</description>
    <link>https://dev.to/surfsky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1153095%2Fffe5a68d-d0bd-4930-8371-533ddcde93de.jpg</url>
      <title>DEV Community: Surfsky</title>
      <link>https://dev.to/surfsky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/surfsky"/>
    <language>en</language>
    <item>
      <title>How to set up fast mass parsing of prices on Amazon bypassing its anti-bot system</title>
      <dc:creator>Surfsky</dc:creator>
      <pubDate>Wed, 13 Mar 2024 13:53:34 +0000</pubDate>
      <link>https://dev.to/surfsky/how-to-set-up-fast-mass-parsing-of-prices-on-amazon-bypassing-its-anti-bot-system-1adl</link>
      <guid>https://dev.to/surfsky/how-to-set-up-fast-mass-parsing-of-prices-on-amazon-bypassing-its-anti-bot-system-1adl</guid>
      <description>&lt;p&gt;Hello, this is Pavel from the Surfsky team. While working on the project, I learned about several very common cases in which developers face bans and blocks and cannot bypass them using standard tools like undetected-chromedriver, Puppeteer, or Selenium. If you've tried everything, but the results still leave a lot of room for improvement, read on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Disclaimer: we solved the problem of slow and inefficient parsing of Amazon using our own service, but this article also describes general principles of how anti-bots work, which will be useful when using other tools too.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Part 1. The problem: data is not being properly collected&lt;/h2&gt;

&lt;p&gt;A major seller of electronics on Amazon faced difficulties obtaining up-to-date product information. The company that supplied them with data was constantly late and delivered incomplete datasets, arguing that Amazon had improved its anti-bot systems and that the complexity of web data parsing had significantly increased. As a result, the customer was losing up to 15% of profits compared to previous periods.&lt;/p&gt;

&lt;p&gt;The tasks that the client wanted to solve were as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Get data about products and price dynamics from the keepa.com price tracker.&lt;/li&gt;
&lt;li&gt;Select the most relevant products and start tracking them, monitoring changes in price, discounts, ratings, and number of reviews. The number of analyzed products was just over 80,000.&lt;/li&gt;
&lt;li&gt;Web parsing of extended product information: detailed product descriptions, user reviews, visual materials, photos and videos, and a list of competitors.&lt;/li&gt;
&lt;li&gt;Data analysis and processing: calculation and visualization of statistical indicators, generation of reports, and application of statistical price prediction models for future periods.&lt;/li&gt;
&lt;li&gt;Business decision-making: based on the obtained data, perform listing optimization, plan promotions and discounts, and make decisions on purchasing new products.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Part 2. Audit and problem analysis&lt;/h2&gt;

&lt;p&gt;The client's requirements were as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parsing prices of &lt;strong&gt;80,000 products daily&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Providing a fresh dataset &lt;strong&gt;daily by 15:00 UTC&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The data must be &lt;strong&gt;consistent and in sufficient volume&lt;/strong&gt;, without gaps, errors, or inaccuracies.&lt;/li&gt;
&lt;li&gt;The data collection process must &lt;strong&gt;be controllable and predictable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;client requested help with integrating the solution&lt;/strong&gt; into their technology stack, as well as competent and prompt technical support.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Since we had already worked with Amazon, we knew in advance where to start and how to solve these tasks optimally. Traditional parsing methods cannot keep up with the constantly changing protection and bot-recognition algorithms Amazon uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does Amazon (among others) detect automation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network level.&lt;/strong&gt; Amazon analyzes network requests, spam score of the IP address, and packet headers. Here, the type of proxies being used, their quality, and their correct application and rotation are crucially important: we need to ensure their effective use and prevent overspending.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser fingerprinting.&lt;/strong&gt; This is a method of collecting information about the browser and device in use to build a device fingerprint, which includes the browser type, version, and languages; screen resolution; window size; Canvas; WebGL; fonts; audio; and much more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emulating user actions.&lt;/strong&gt; Amazon's anti-bot system analyzes user behavior and blocks sessions that look automated. To bypass it, you have to emulate the actions of real users: delays, cursor movements, rhythmic keystrokes, random pauses, and irregular behavior patterns.&lt;/p&gt;
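&lt;p&gt;As a rough illustration (our own sketch, not Surfsky's actual implementation), a human-like cursor trajectory can be approximated with a quadratic Bézier curve plus jittered timing:&lt;/p&gt;

```python
import random

def bezier_point(p0, p1, p2, t):
    """Quadratic Bezier interpolation between start, control, and end points."""
    x = (1 - t) ** 2 * p0[0] + 2 * (1 - t) * t * p1[0] + t ** 2 * p2[0]
    y = (1 - t) ** 2 * p0[1] + 2 * (1 - t) * t * p1[1] + t ** 2 * p2[1]
    return x, y

def human_cursor_path(start, end, steps=30):
    """Return (x, y, pause) tuples approximating a human mouse movement."""
    # A random control point bows the trajectory so it is never a straight line.
    ctrl = ((start[0] + end[0]) / 2 + random.uniform(-100, 100),
            (start[1] + end[1]) / 2 + random.uniform(-100, 100))
    path = []
    for i in range(steps + 1):
        t = i / steps
        x, y = bezier_point(start, ctrl, end, t)
        # Irregular pauses between micro-moves mimic human motor noise.
        path.append((x, y, random.uniform(0.005, 0.03)))
    return path

path = human_cursor_path((0, 0), (640, 400))
```

&lt;p&gt;Each (x, y, pause) point would then be fed to whatever mouse-move call your automation framework exposes (for example, CDP's &lt;code&gt;Input.dispatchMouseEvent&lt;/code&gt;), sleeping for the pause between events.&lt;/p&gt;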

&lt;p&gt;Specific measures must be taken at each of these levels to prevent anti-bot systems from identifying automation. A satisfactory result can be achieved in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build your own custom solution and maintain its infrastructure yourself;&lt;/li&gt;
&lt;li&gt;Use paid services like Apify, ScrapingBee, or Browserless;&lt;/li&gt;
&lt;li&gt;Combine high-quality proxies, captcha solvers, and multi-accounting browsers;&lt;/li&gt;
&lt;li&gt;Use standard browsers in headless mode with anti-detection patches;&lt;/li&gt;
&lt;li&gt;Other options of varying complexity are also available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why did we use our own solution, Surfsky?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the network level, Surfsky uses high-quality residential and mobile proxies from various geolocations and subnets. In each case, we analyze which proxies are more suitable for the task at hand and provide customization options for each client. On our servers, the network level is integrated natively and allows us to avoid leaks, through which Amazon might learn the real IP address and detect spoofing.&lt;/p&gt;

&lt;p&gt;Surfsky uses only real browser fingerprints collected from actual user devices. Currently, there are more than 10,000 of them and they are constantly updated. Moreover, spoofing is performed at the lowest possible level of the browser kernel.&lt;/p&gt;

&lt;p&gt;If you also would like to start using Surfsky, go to our &lt;a href="https://surfsky.io/"&gt;website&lt;/a&gt; and order a demo. If you've read up to this point, then our service can definitely solve the problems you might be facing.&lt;/p&gt;

&lt;h2&gt;Part 3: Finding a Solution&lt;/h2&gt;

&lt;p&gt;First, we developed a proof of concept: scripts that launch our browser in multi-threaded mode to parse data from product pages. Recall that our client needs to collect data on 80,000 products per day, which is &lt;strong&gt;about 3,300 targeted requests per hour&lt;/strong&gt;. During the development phase, we always plan for growth in request volume, so we account for at least a 10% increase: our solution must scale to at least 88,000 products per day. To deliver the collected data to the client by 15:00 UTC daily, we also need at least 4 hours of buffer time, leaving 20 hours for collection. That gives a minimum of 88,000/20 = &lt;strong&gt;4,400 requests per hour&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Next, our task was to find the optimal number of requests processed in our cloud cluster. We allocated and reserved a pool of residential proxies (mix worldwide). Instead of average values, we use percentiles for more accurate calculations. A product page in the 95th percentile takes up ~8-10 MB. The full page load time through a proxy is 6 seconds. Parsing and saving data to the database take 2 seconds.&lt;/p&gt;
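&lt;p&gt;A quick illustration of why percentiles matter here (synthetic numbers, not our real measurements): page sizes tend to have a heavy tail, so the 95th percentile, which sizes the worst case you must routinely handle, sits well above the mean:&lt;/p&gt;

```python
import random
import statistics

random.seed(7)
# Synthetic page sizes in MB: mostly small pages with a heavy tail of large ones.
page_sizes = [random.lognormvariate(1.0, 0.6) for _ in range(10_000)]

mean_size = statistics.fmean(page_sizes)
# statistics.quantiles with n=100 returns the 1st..99th percentiles.
p95 = statistics.quantiles(page_sizes, n=100)[94]

print(f"mean: {mean_size:.1f} MB, p95: {p95:.1f} MB")
```

&lt;p&gt;Budgeting proxy traffic against the mean would underprovision for exactly the pages most likely to time out.&lt;/p&gt;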

&lt;p&gt;All in all, we have 20 hours available daily to collect data on at least 88,000 products. With 6 seconds to load a page through a proxy and 2 seconds to parse and save it, each request takes about 8 seconds, so one browser instance handles roughly 450 requests per hour. To sustain 4,400 requests per hour, &lt;strong&gt;a minimum of 10 browser instances is required&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Having completed our calculations, we launched a series of proof-of-concept tests for 8 hours to collect detailed statistics. It turned out that about 20% of requests returned a captcha. To solve this, we emulated user actions, including realistic cursor movements and page scrolling. We also connected automatic captcha recognition, after which we ran our tests again. As a result, &lt;strong&gt;the request time increased from 8 to 21 seconds&lt;/strong&gt;, but we reduced the number of erroneous responses, which gives us a more stable and predictable result in the long term. Thus, we increased &lt;strong&gt;the number of browser instances from 10 to 27&lt;/strong&gt;.&lt;/p&gt;
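&lt;p&gt;The sizing math above can be reproduced in a few lines (our own sanity check, using the figures from this article):&lt;/p&gt;

```python
import math

PRODUCTS_PER_DAY = 88_000   # 80,000 plus the 10% growth margin
COLLECTION_HOURS = 20       # 24 h minus the 4-hour delivery buffer
required_per_hour = PRODUCTS_PER_DAY / COLLECTION_HOURS   # 4,400 requests/hour

def instances_needed(seconds_per_request):
    """Browser instances required to sustain the hourly request rate."""
    per_instance_per_hour = 3600 / seconds_per_request
    return math.ceil(required_per_hour / per_instance_per_hour)

print(instances_needed(8))   # 6 s page load + 2 s parsing -> 10
print(instances_needed(21))  # with emulation and captcha solving -> 26
```

&lt;p&gt;The computed minimum after adding user emulation is 26 instances; provisioning 27 leaves a small amount of operational headroom above that floor.&lt;/p&gt;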

&lt;h2&gt;Part 4: Integration&lt;/h2&gt;

&lt;p&gt;This is the shortest part: after developing our technical solution, we focused on its integration into the client's infrastructure. Throughout the integration, we provided constant support to the client. The entire process took 12 days, 4 of which were spent on auditing and developing the optimal solution.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Using &lt;a href="https://surfsky.io/"&gt;Surfsky&lt;/a&gt; allowed us to solve the client's problems: we were able to obtain accurate data on time. The cost of support decreased, and the overall efficiency of the client's operations improved.&lt;/p&gt;

&lt;p&gt;Using anti-detection technologies has become a must for web browser automation tasks. This is because anti-bot systems are continuously improving, getting better every year at identifying automated behavior and browser characteristics. Anti-bots use algorithms and behavioral pattern analysis, and bypassing them is becoming increasingly difficult.&lt;/p&gt;

&lt;p&gt;At Surfsky, we have created and continuously maintain the most up-to-date and powerful solution for web automation. For this, we use a multi-accounting anti-detect browser and all currently available cutting-edge tools to counteract all attempts at identifying browser automation. Our approach allows our users to bypass complex anti-bot systems while maintaining a high degree of anonymity and security.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>database</category>
      <category>python</category>
    </item>
    <item>
      <title>Surfsky 0.6.2: Alpha testing completed, CAPTCHA solver implemented, performance enhanced</title>
      <dc:creator>Surfsky</dc:creator>
      <pubDate>Thu, 15 Feb 2024 12:02:06 +0000</pubDate>
      <link>https://dev.to/surfsky/surfsky-062-alpha-testing-completed-captcha-solver-implemented-performance-enhanced-21lo</link>
      <guid>https://dev.to/surfsky/surfsky-062-alpha-testing-completed-captcha-solver-implemented-performance-enhanced-21lo</guid>
      <description>&lt;p&gt;We've made many updates to Surfsky recently, but the biggest news is that we're done with alpha testing and ready for our first customers! To start using Surfsky, simply &lt;a href="https://surfsky.io/request-demo"&gt;request a demo&lt;/a&gt; or reach out via our &lt;a href="https://surfsky.io/contact-us"&gt;contact form&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's a quick summary of the latest key updates in Surfsky version 0.6.2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Updated the browser kernel to 121.4. We keep the kernel up-to-date and do not lag behind the official Chrome release schedule, blending in with the crowd of regular users of this browser.&lt;/li&gt;
&lt;li&gt;We've created a feature that can solve Cloudflare Turnstile, hCaptcha, and reCAPTCHA v2/v3/enterprise. At present, this tool can be utilized in client code. Soon, we'll upgrade it to solve these automatically on our server.&lt;/li&gt;
&lt;li&gt;Browser starts up faster now. We've improved how it scales.&lt;/li&gt;
&lt;li&gt;Added a graceful browser closure, which involves sending closure signals to APIs and processes so they can complete their work correctly, close connections, and free up resources before fully closing the instance.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>webscraping</category>
      <category>browser</category>
      <category>python</category>
      <category>startup</category>
    </item>
    <item>
      <title>Amazon Scraping Tools: Features Comparison</title>
      <dc:creator>Surfsky</dc:creator>
      <pubDate>Wed, 20 Dec 2023 08:36:45 +0000</pubDate>
      <link>https://dev.to/surfsky/amazon-scraping-tools-features-comparison-18ee</link>
      <guid>https://dev.to/surfsky/amazon-scraping-tools-features-comparison-18ee</guid>
      <description>&lt;p&gt;Collecting data from Amazon can be challenging, especially when it needs to be done regularly and on a large scale. The main issues when working with Amazon include IP address blocks and CAPTCHAs. These problems occur because of the high volume of requests, which are often similar, and inconsistencies in browser fingerprints. To reduce the risk of blocking, users need to experiment with finding unbanned proxies while also avoiding exceeding traffic limits.&lt;/p&gt;

&lt;p&gt;Inconsistent fingerprints can be a problem because Amazon uses different security techniques to detect bots. They analyze browser data and compare it with the user's operating system. Behavioral patterns also matter. It's important to regularly update the automation logic to imitate real users. Managing the entire automation stack involves more than just scripting and running it. It includes automating launch schedules, creating auxiliary subsystems for handling proxies and browser data, and implementing monitoring mechanisms to ensure the system runs smoothly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This article compares Surfsky with other popular browser automation and scraping services:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapingbee&lt;/li&gt;
&lt;li&gt;Browserless v.1 (Cloud Subscription)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s evaluate these services based on their performance and features.&lt;/p&gt;

&lt;h2&gt;Price&lt;/h2&gt;

&lt;p&gt;We will not focus on this because Surfsky is currently in the Alpha stage, so it is not useful to compare it with services that are already on the market. However, let’s mention the pricing of ScrapingBee and Browserless. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ScrapingBee&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Surfsky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Real price for 1000 requests, $&lt;/td&gt;
&lt;td&gt;$6&lt;/td&gt;
&lt;td&gt;$3.2&lt;/td&gt;
&lt;td&gt;Free now&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Proxies&lt;/h2&gt;

&lt;p&gt;High-quality proxies are essential for web scraping. Different types of proxies can be useful for solving various problems, so we investigated the available proxy options in the scraping services.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ScrapingBee&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Surfsky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Proxy rotation&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Built-in residential proxies&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SOCKS5 support&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSH support&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenVPN support&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Fingerprinting&lt;/h2&gt;

&lt;p&gt;Purposefully altering the browser fingerprint can help evade anti-bot systems. However, most of the currently available solutions only offer a partial remedy to this problem. For instance, ScrapingBee recommends utilizing the 'Stealth Proxy' feature, which simply falsifies HTTP headers. On the other hand, Browserless employs the open-source library puppeteer-extra-plugin-stealth, but unfortunately, it is ineffective against advanced anti-bot systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Basic browser fingerprint management&lt;/strong&gt; includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User-agent spoofing;&lt;/li&gt;
&lt;li&gt;Modifying HTTP headers;&lt;/li&gt;
&lt;li&gt;Hiding --enable-automation, --headless flags;&lt;/li&gt;
&lt;li&gt;Overriding API permissions;&lt;/li&gt;
&lt;li&gt;Spoofing navigator, iframe, WebGL, media, window, chrome objects and properties.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is important to note that these object modifications are performed in JavaScript and can be easily detected. In addition, basic techniques often produce inconsistencies and non-standard property values, which anti-bot systems compare against the values expected for the user's browser and operating system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced browser fingerprint management&lt;/strong&gt; includes not only basic techniques, but also some more advanced methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WebGL Vendor info and image hash;&lt;/li&gt;
&lt;li&gt;Canvas fingerprinting;&lt;/li&gt;
&lt;li&gt;IP, DNS hardening;&lt;/li&gt;
&lt;li&gt;Audio and video fingerprinting;&lt;/li&gt;
&lt;li&gt;WebRTC hardening;&lt;/li&gt;
&lt;li&gt;Font fingerprinting;&lt;/li&gt;
&lt;li&gt;Geolocation API spoofing;&lt;/li&gt;
&lt;li&gt;SSL/TLS fingerprinting;&lt;/li&gt;
&lt;li&gt;HTML5 features spoofing;&lt;/li&gt;
&lt;li&gt;Extended JavaScript Browser information spoofing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is important to note that advanced fingerprint management is implemented differently from the basic approach: instead of JavaScript overrides, the substitutions are made in the browser kernel itself.&lt;/p&gt;

&lt;p&gt;High-quality, consistent fingerprint spoofing always relies on low-level substitutions and real user fingerprints. Surfsky is designed to counter the most complex anti-bot systems.&lt;/p&gt;

&lt;p&gt;Services like PixelScan and CreepJS are designed to detect and highlight inconsistencies in digital fingerprints. We used these services for the test.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ScrapingBee&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Surfsky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Basic browser fingerprinting&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced browser fingerprinting&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real GPU canvas rendering&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PixelScan passing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CreepJS passing&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Convenience&lt;/h2&gt;

&lt;p&gt;It is hard to evaluate such things as convenience, but here are the main aspects we’ve focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support should be available at least in the form of a communication channel with the development team.&lt;/li&gt;
&lt;li&gt;An easy-to-use admin panel should include features such as managing subscriptions and creating requests.&lt;/li&gt;
&lt;li&gt;Easy-to-use documentation should include code examples for a quick start and clear manuals.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ScrapingBee&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Surfsky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Support&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Easy-to-use admin panel&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Easy-to-use documentation&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Features&lt;/h2&gt;

&lt;p&gt;Here are other parameters we decided to compare:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of &lt;strong&gt;concurrent requests&lt;/strong&gt; that can be executed per unit of time determines the request limit. If you exceed this limit, services may apply various sanctions such as rate-limiting, returning errors, or blocking access.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling&lt;/strong&gt; is an indicator of a service's ability to dynamically adjust server capacity, either increasing or decreasing it. The service needs to be able to adapt to the dynamic variation in the number of client requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome DevTools Protocol (CDP)&lt;/strong&gt; is a set of tools and APIs for interacting with the Google Chrome/Chromium browser for debugging and flexible automation. Its functions are described in the specification: &lt;a href="https://chromedevtools.github.io/devtools-protocol/"&gt;https://chromedevtools.github.io/devtools-protocol/&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The ability to take a &lt;strong&gt;screenshot&lt;/strong&gt; of the page after it has loaded or after other trigger actions.&lt;/li&gt;
&lt;li&gt;Executing &lt;strong&gt;JavaScript scripts&lt;/strong&gt; on a browser page. It can be useful for debugging purposes, monitoring status, ensuring accessibility, and verifying the correct display of page content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Profile management&lt;/strong&gt; means being able to save your browser's current state. Imagine you're automating tasks on LinkedIn, which usually requires logged-in accounts to perform actions. If you need to regularly perform actions with these accounts, you'll need a system that can save cookies, local storage, session storage, history, passwords, bookmarks, service workers, and page caches. It can be difficult and time-consuming to develop synchronization for this purpose, especially if the necessary software interface is not available. Surfsky solves this problem by providing auto save and restore browser session functionality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scraping API&lt;/strong&gt; is not strictly necessary, but it is a useful tool that aims to simplify scraping work. It is an API that enables you to perform basic actions with the browser, such as opening a page, retrieving HTML, taking screenshots, and more. This tool can be beneficial when you don't have the opportunity or need to develop a script using browser interaction frameworks.&lt;/li&gt;
&lt;/ul&gt;
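&lt;p&gt;To make the profile-management point concrete, here is a minimal, framework-agnostic sketch (our illustration, not Surfsky's API) of persisting and restoring cookies, the simplest slice of a browser profile:&lt;/p&gt;

```python
import json
from pathlib import Path

def save_session(cookies, path):
    """Persist a list of cookie dicts (name/value/domain/...) to disk."""
    Path(path).write_text(json.dumps(cookies, indent=2))

def restore_session(path):
    """Load the cookies back; the caller re-injects them into a fresh context."""
    return json.loads(Path(path).read_text())

# Hypothetical logged-in LinkedIn session cookie.
cookies = [{"name": "li_at", "value": "example-token", "domain": ".linkedin.com"}]
save_session(cookies, "profile_cookies.json")
assert restore_session("profile_cookies.json") == cookies
```

&lt;p&gt;A real profile store must also cover local storage, session storage, service workers, and page caches, which is exactly why hand-rolling this synchronization is time-consuming.&lt;/p&gt;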

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ScrapingBee&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Surfsky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Concurrent requests&lt;/td&gt;
&lt;td&gt;200+ on business+ plan&lt;/td&gt;
&lt;td&gt;up to 1000&lt;/td&gt;
&lt;td&gt;unlimited by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling&lt;/td&gt;
&lt;td&gt;Upon support request&lt;/td&gt;
&lt;td&gt;Premium subscription&lt;/td&gt;
&lt;td&gt;Infinite by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chrome DevTools Protocol&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screenshot&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;JavaScript rendering&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Profile management&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scraping API&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;Performance&lt;/h2&gt;

&lt;p&gt;Here are some of the measurements we used to evaluate the performance of the tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Page load time&lt;/strong&gt;. Consider the operation completed when the ‘DOMContentLoaded’ event is triggered. This benchmark will utilize the built-in proxies provided by the service itself. We'll use Amazon's Today's Deals as an example: &lt;a href="https://www.amazon.com/gp/goldbox"&gt;https://www.amazon.com/gp/goldbox&lt;/a&gt;. The fastest service received 3 points, the second-fastest received 2 points, and the slowest received 1 point.&lt;/li&gt;
&lt;li&gt;Scraping Google SERP (search results for 100 random keywords), Amazon Today's Deals, eBay search results (100 random keywords). In this setup, each participant will start 20 instances simultaneously, and each test will be run 3 times. The fastest service received 3 points, the second-fastest received 2 points, and the slowest received 1 point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each test will be run 3 times at 1-minute intervals.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;ScrapingBee&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Surfsky&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Page load time (avg)&lt;/td&gt;
&lt;td&gt;6.1 sec&lt;/td&gt;
&lt;td&gt;4.7 sec&lt;/td&gt;
&lt;td&gt;3.4 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google SERP (avg)&lt;/td&gt;
&lt;td&gt;1.79 page/sec&lt;/td&gt;
&lt;td&gt;0.95 page/sec&lt;/td&gt;
&lt;td&gt;2.71 page/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Amazon today’s deals (avg)&lt;/td&gt;
&lt;td&gt;1.18 page/sec&lt;/td&gt;
&lt;td&gt;0.97 page/sec&lt;/td&gt;
&lt;td&gt;3.32 page/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ebay search results&lt;/td&gt;
&lt;td&gt;1.15 page/sec&lt;/td&gt;
&lt;td&gt;0.78 page/sec&lt;/td&gt;
&lt;td&gt;1.36 page/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance Score&lt;/td&gt;
&lt;td&gt;7 points&lt;/td&gt;
&lt;td&gt;5 points&lt;/td&gt;
&lt;td&gt;12 points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
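&lt;p&gt;The 3/2/1 scoring can be reproduced directly from the measured rates in the table, remembering that lower is better for load time and higher is better for throughput:&lt;/p&gt;

```python
# Benchmark results copied from the tables above.
results = {
    "page_load_s":  {"ScrapingBee": 6.1,  "Browserless": 4.7,  "Surfsky": 3.4},
    "google_serp":  {"ScrapingBee": 1.79, "Browserless": 0.95, "Surfsky": 2.71},
    "amazon_deals": {"ScrapingBee": 1.18, "Browserless": 0.97, "Surfsky": 3.32},
    "ebay_search":  {"ScrapingBee": 1.15, "Browserless": 0.78, "Surfsky": 1.36},
}

scores = {svc: 0 for svc in ("ScrapingBee", "Browserless", "Surfsky")}
for metric, values in results.items():
    # Rank ascending for load time (fastest first), descending for pages/sec.
    ranked = sorted(values, key=values.get, reverse=(metric != "page_load_s"))
    for points, svc in zip((3, 2, 1), ranked):
        scores[svc] += points

print(scores)  # {'ScrapingBee': 7, 'Browserless': 5, 'Surfsky': 12}
```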

&lt;p&gt;As a conclusion, we invite you to join the Surfsky Alpha test! The team is currently looking for feedback, and now you have a chance to contribute to the product backlog. Let's work together to create the best web scraping tool! Visit &lt;a href="https://surfsky.io/"&gt;https://surfsky.io/&lt;/a&gt; for more details.&lt;/p&gt;

&lt;p&gt;We've added a demo, so now you can easily check out the parsing capabilities of our cloud browser: open any website and get its contents as HTML and a screenshot.&lt;/p&gt;


&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
      &lt;div class="c-embed__cover"&gt;
        &lt;a href="https://surfsky.io/demo" class="c-link s:max-w-50 align-middle" rel="noopener noreferrer"&gt;
          &lt;img alt="" src="https://res.cloudinary.com/practicaldev/image/fetch/s--7syUYKNF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://framerusercontent.com/images/dxz9vwLU65klCmzfaWkoU9OqUCQ.jpg" height="420" class="m-0" width="800"&gt;
        &lt;/a&gt;
      &lt;/div&gt;
    &lt;div class="c-embed__body"&gt;
      &lt;h2 class="fs-xl lh-tight"&gt;
        &lt;a href="https://surfsky.io/demo" rel="noopener noreferrer" class="c-link"&gt;
          Surfsky Cloud Browser Free Scraping Demo
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;p class="truncate-at-3"&gt;
          Headless cloud browser based on Chromium and equipped with advanced fingerprint spoofing technologies.
        &lt;/p&gt;
      &lt;div class="color-secondary fs-s flex items-center"&gt;
          &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://res.cloudinary.com/practicaldev/image/fetch/s--YknZr0Z2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://framerusercontent.com/images/oGcN0qFE90AfOeniiSKU2TBHi40.svg" width="33" height="32"&gt;
        surfsky.io
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
      <category>webscraping</category>
      <category>database</category>
      <category>startup</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Surfsky 0.5.0: added Scraping API, launch speed up by 30%</title>
      <dc:creator>Surfsky</dc:creator>
      <pubDate>Thu, 30 Nov 2023 10:36:14 +0000</pubDate>
      <link>https://dev.to/surfsky/surfsky-050-added-scraping-api-launch-speed-up-by-30-k2p</link>
      <guid>https://dev.to/surfsky/surfsky-050-added-scraping-api-launch-speed-up-by-30-k2p</guid>
      <description>&lt;p&gt;The Surfsky team continues improving our cloud-based headless browser. Recent updates include a significant number of improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We have introduced a new feature that allows users to scrape websites without writing any code using the &lt;a href="https://docs.surfsky.io/docs/scraping/api"&gt;Scraping API&lt;/a&gt;. To do so, simply launch a one-time profile and specify the necessary URL and parsing parameters. Surfsky will then provide the HTML content and optionally a screenshot of the specified URL. Additionally, we have added the Scraping API to Rapid API, which can be found &lt;a href="https://rapidapi.com/Surfsky/api/surfsky-scraping-api"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Surfsky browser launch is now at least 30% faster thanks to instance preloading, which improves overall productivity and efficiency.&lt;/li&gt;
&lt;li&gt;We have updated the Surfsky browser kernel to Chromium 119. We strive to release Chromium kernel updates as soon as they become available, and our update release schedule is synchronized with global Chrome update statistics.&lt;/li&gt;
&lt;li&gt;We have fixed bugs related to SSH proxies, eliminating random connection errors when launching the browser with SSH.&lt;/li&gt;
&lt;li&gt;We have added a large pool of residential proxies. Now our users can run Surfsky without the need for their own proxies, thanks to our improved proxy infrastructure.&lt;/li&gt;
&lt;li&gt;We have improved the SSH proxy pre-check functionality, making it significantly faster and more reliable.&lt;/li&gt;
&lt;li&gt;We have fixed a bug related to stopping a persistent profile by using the stop method.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Surfsky alpha test continues! Register at &lt;a href="https://surfsky.io/"&gt;the Surfsky website&lt;/a&gt; and provide your feedback to the project team. Let us know about any difficulties, errors, issues, or wishes you may have, and we will continue developing the best web automation tool tailored to your needs.&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>browser</category>
      <category>news</category>
      <category>python</category>
    </item>
    <item>
      <title>How to bypass Cloudflare protection when web scraping</title>
      <dc:creator>Surfsky</dc:creator>
      <pubDate>Thu, 21 Sep 2023 13:58:29 +0000</pubDate>
      <link>https://dev.to/surfsky/how-to-bypass-cloudflare-protection-when-web-scraping-pa</link>
      <guid>https://dev.to/surfsky/how-to-bypass-cloudflare-protection-when-web-scraping-pa</guid>
      <description>&lt;p&gt;We are the &lt;a href="https://surfsky.io/"&gt;Surfsky&lt;/a&gt; team, and today we would like to tell you about our product and how it can be useful to you. It is currently in alpha: you can try it for free, and in return we ask you to share your feedback.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Surfsky?
&lt;/h2&gt;

&lt;p&gt;Surfsky is a cloud-based browser. It launches in the cloud and offers an interface for connecting automation libraries and frameworks such as Puppeteer and Playwright. It also has a convenient web inspector that lets you see what is happening on a page, which helps you write your data-extraction logic correctly.&lt;/p&gt;

&lt;p&gt;But the main feature is advanced fingerprint spoofing to bypass security systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why would I need such a browser?
&lt;/h2&gt;

&lt;p&gt;As you have probably guessed already: for data extraction. In an ideal world, every service would come with an API for data collection. In the real one, not every service offers such an API, and those that do may limit the data it provides, its speed, or its terms of use. To get around these restrictions, information is often collected from publicly available pages instead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a browser to collect publicly available information automatically&lt;/strong&gt;. This is because the services you’d like to scrape are often built as single-page applications, that is, websites whose content is partially or completely rendered with JavaScript. To display such a web page correctly and in full, it has to be opened in a browser.&lt;/p&gt;

&lt;p&gt;To make collecting information from multiple services easier for you, &lt;strong&gt;we have created a browser that launches in the cloud at your request&lt;/strong&gt;. Using it, you can easily extract all the necessary data from the pages that interest you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why not use a regular browser?
&lt;/h2&gt;

&lt;p&gt;One of the most common problems encountered when web scraping is getting banned, restricted, or blocked. Services use special systems to identify bots and unauthorized data access, and consequently block page rendering: they either cover the requested page with a CAPTCHA or don’t show it at all.&lt;/p&gt;

&lt;p&gt;In our browser we have done everything to prevent websites from thinking that you’re a bot, so that you are always able to get the necessary information. We collect digital fingerprints of real users, analyze them, and use them in our browser to spoof parameters checked by security systems. &lt;/p&gt;

&lt;p&gt;Such checks include, but are not limited to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;(in)consistency of the GPU data and its parameters;&lt;/li&gt;
&lt;li&gt;(in)consistency of system fonts;&lt;/li&gt;
&lt;li&gt;(in)consistency of the network connection and geolocation data;&lt;/li&gt;
&lt;li&gt;presence of automation tools.&lt;/li&gt;
&lt;/ul&gt;
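&lt;p&gt;As an illustration of the last bullet, one well-known check a site can run in the page is as trivial as reading &lt;code&gt;navigator.webdriver&lt;/code&gt;, which vanilla headless automation sets to &lt;code&gt;true&lt;/code&gt;. A sketch of the detection side (not Surfsky code), with the navigator object parameterized for clarity:&lt;/p&gt;

```javascript
// Sketch of a trivial bot check a website might run in the page:
// unmodified headless automation exposes navigator.webdriver === true.
function looksAutomated(nav) {
  // nav stands in for the page's navigator object
  return nav.webdriver === true
}
```

&lt;p&gt;Real anti-bot systems combine dozens of such signals, which is why patching any single one of them is rarely enough.&lt;/p&gt;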

&lt;p&gt;Changing the browser fingerprint may sometimes not be enough for sufficient concealment, so we have added support for several proxy types: HTTP, HTTPS, SOCKS5, and SSH, as well as support for the OpenVPN protocol.&lt;/p&gt;

&lt;p&gt;As an example, let us open the website &lt;code&gt;https://nowsecure.nl&lt;/code&gt;, which uses Cloudflare defenses, in both headless Chrome running on a server and Surfsky.&lt;/p&gt;

&lt;p&gt;The server result:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7w4n1z7c837dr223nns.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7w4n1z7c837dr223nns.png" alt="The server result" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surfsky:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw39cix7jypq3wjs4ffyc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw39cix7jypq3wjs4ffyc.png" alt="Surfsky:" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is how easily Surfsky can bypass Cloudflare’s defensive measures.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works: some examples
&lt;/h2&gt;

&lt;p&gt;Let’s see how we can launch our browser in the cloud and connect to it using JavaScript (Node.js).&lt;/p&gt;

&lt;p&gt;After registering you will receive an API access token that you will use to authenticate your requests. To launch the browser, run a single request specifying the proxy you would like the browser to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const axios = require('axios')

const BROWSER_API = axios.create({
    baseURL: 'api-public.surfsky.io',
    timeout: 100000,
})

const { wsEndpoint } = BROWSER_API.post(
    '/profiles/one_time',
  { proxy: 'http.your-favourite-proxy.com' },
  { headers: { 'X-Cloud-Api-Token': API_TOKEN } }
).then((r) =&amp;gt; {
  return { wsEndpoint: r.data.ws_url }
})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And this is how, with just one request, you get a browser fully ready to work. Let’s open amazon.com and take a screenshot of it using the Playwright framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { chromium } = require('playwright')

const browser = await chromium.connectOverCDP(wsEndpoint)
const page = await browser.newPage()

await page.goto('https://amazon.com')
await page.screenshot({ path: 'screen.png' })

await browser.close()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have easily launched the browser, opened the page that we wanted, and got its screenshot.&lt;/p&gt;

&lt;p&gt;Using our service you can launch as many browser instances as you need from the same device, without any restrictions, and work with them as if they were running on your own system.&lt;/p&gt;

&lt;p&gt;If you already have browser automation code, you can simply replace the browser-launch step with a single HTTP request, and everything will keep working exactly as before; no other changes are necessary.&lt;/p&gt;
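&lt;p&gt;That substitution can be wrapped in a small helper so only one line of your existing script changes. A minimal sketch using Node 18+’s built-in &lt;code&gt;fetch&lt;/code&gt; instead of axios, so it needs no extra dependency; the endpoint path and &lt;code&gt;ws_url&lt;/code&gt; field mirror the example above:&lt;/p&gt;

```javascript
// Minimal sketch: wrap the one-time-profile request in a helper so that
// existing automation code only swaps its launch line for this call.
const PROFILE_PATH = '/profiles/one_time'

function profilePayload(proxy) {
  // Body of the launch request: just the proxy this profile should use
  return { proxy }
}

async function launchCloudBrowser(apiToken, proxy) {
  const res = await fetch('https://api-public.surfsky.io' + PROFILE_PATH, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Cloud-Api-Token': apiToken,
    },
    body: JSON.stringify(profilePayload(proxy)),
  })
  const data = await res.json()
  return data.ws_url // pass this straight to chromium.connectOverCDP(...)
}
```

&lt;p&gt;Your script’s only change is obtaining the endpoint from this helper instead of calling a local launch.&lt;/p&gt;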




&lt;p&gt;We thank you for reading! We hope that you are now interested in trying Surfsky out. We are currently in the alpha testing stage, and our team will be happy to receive any constructive feedback. You can register to take part in the alpha testing here: &lt;a href="https://surfsky.io/"&gt;https://surfsky.io/&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We hope to see you onboard soon!&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>playwright</category>
      <category>python</category>
      <category>data</category>
    </item>
  </channel>
</rss>
