How to set up fast mass parsing of prices on Amazon bypassing its anti-bot system

Hello, this is Pavel from the Surfsky team. While working on this project, I came across several very common cases in which developers run into bans and blocks they cannot bypass with standard tools like undetected-chromedriver, Puppeteer, or Selenium. If you've tried everything but the results still leave a lot of room for improvement, read on.

Disclaimer: we solved the problem of slow and inefficient parsing of Amazon using our own service, but this article also describes general principles of how anti-bots work, which will be useful when using other tools too.

Part 1. The problem: data is not being properly collected

A major seller of electronics on Amazon was struggling to obtain up-to-date product information. The company that supplied them with data was constantly late and delivered incomplete datasets, arguing that Amazon had improved its anti-bot systems and that the complexity of web data parsing had increased significantly. As a result, the customer started losing up to 15% of profits compared to previous periods.

The tasks that the client wanted to solve were as follows:

  • Get data about products and price dynamics from the keepa.com price tracker.
  • Select the most relevant products and start tracking them, monitoring changes in price, discounts, ratings, and number of reviews. The number of analyzed products was just over 80,000.
  • Web parsing of extended product information: detailed product descriptions, user reviews, visual materials (photos and videos), and a list of competitors.
  • Data analysis and processing: calculation and visualization of statistical indicators, generation of reports, and application of statistical price prediction models for future periods.
  • Business decision-making: based on the obtained data, perform listing optimization, plan promotions and discounts, and make decisions on purchasing new products.

Part 2. Audit and problem analysis

The client's requirements were as follows:

  • Parsing prices of 80,000 products daily.
  • Providing a fresh dataset daily by 15:00 UTC.
  • The data must be consistent and in sufficient volume, without gaps, errors, or inaccuracies.
  • The data collection process must be controllable and predictable.
  • Help with integrating the solution into the client's technology stack, as well as competent and prompt technical support.

Since we had already worked with Amazon, we knew in advance where to start and how to solve these tasks optimally. Traditional parsing methods are not enough to cope with the constantly changing protection and bot-recognition algorithms used by Amazon.

How does Amazon (among others) detect automation?

Network level. Amazon analyzes network requests, spam score of the IP address, and packet headers. Here, the type of proxies being used, their quality, and their correct application and rotation are crucially important: we need to ensure their effective use and prevent overspending.
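
To make the network level concrete, here is a minimal sketch of proxy rotation with retries in Python; the PROXY_POOL endpoints are placeholders, not real proxies:

```python
import random
import requests

# Hypothetical pool of residential proxy endpoints (placeholders, not real).
PROXY_POOL = [
    "http://user:pass@res-proxy-1.example.com:8000",
    "http://user:pass@res-proxy-2.example.com:8000",
    "http://user:pass@res-proxy-3.example.com:8000",
]

def fetch_with_rotation(url: str, retries: int = 3) -> requests.Response:
    """Rotate through proxies until one request succeeds."""
    last_error = None
    for _ in range(retries):
        proxy = random.choice(PROXY_POOL)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=15,
            )
        except requests.RequestException as exc:
            last_error = exc  # this proxy failed; rotate to another one
    raise last_error
```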

Browser fingerprinting. This is a method of collecting information about the browser and device in use to build a device fingerprint, which includes browser type, version, and languages; screen resolution; window size; Canvas; WebGL; fonts; audio; and much more.
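
To see which of these signals your own automated browser exposes, you can query them directly. Here's a minimal sketch with Selenium (assuming Chrome and chromedriver are installed locally):

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")

# A few of the signals that typically go into a device fingerprint.
signals = driver.execute_script("""
    return {
        userAgent: navigator.userAgent,
        languages: navigator.languages,
        screen: [screen.width, screen.height],
        window: [innerWidth, innerHeight],
        webdriver: navigator.webdriver  // `true` gives automation away
    };
""")
print(signals)
driver.quit()
```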

Behavioral analysis. Amazon's anti-bot system analyzes user actions and tries to block bots. To bypass it, you need to emulate the actions of real users: delays, cursor movements, rhythmic keystrokes, random pauses, and irregular behavior patterns.
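
Here is a rough sketch of such emulation using Selenium's ActionChains; the step counts and timings are illustrative, not values taken from a real behavioral profile:

```python
import random
import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.amazon.com")

# Start from the page body, then move the cursor in small, irregular steps.
actions = ActionChains(driver)
actions.move_to_element(driver.find_element(By.TAG_NAME, "body"))
for _ in range(random.randint(5, 12)):
    actions.move_by_offset(random.randint(-40, 40), random.randint(-25, 25))
    actions.pause(random.uniform(0.05, 0.3))
actions.perform()

# Scroll in chunks with random pauses, like a human skimming the page.
for _ in range(random.randint(3, 6)):
    driver.execute_script("window.scrollBy(0, arguments[0]);",
                          random.randint(200, 600))
    time.sleep(random.uniform(0.5, 2.0))

driver.quit()
```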

Specific measures must be taken at each of these levels to prevent anti-bot systems from identifying automation. A satisfactory result can be achieved in several ways:

  • Build your own custom solution and maintain its infrastructure yourself.
  • Use paid services like Apify, Scrapingbee, or Browserless.
  • Combine high-quality proxies, captcha solvers, and multi-accounting browsers.
  • Use standard browsers in headless mode with anti-detection patches (see the sketch after this list).
  • Other options of varying complexity are also available.
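
To illustrate the headless-with-patches option, here is a minimal sketch using undetected-chromedriver, mentioned at the start of the article; the product URL is a placeholder:

```python
# pip install undetected-chromedriver
import undetected_chromedriver as uc

PRODUCT_URL = "https://www.amazon.com/dp/..."  # placeholder product page

options = uc.ChromeOptions()
options.add_argument("--lang=en-US")

# undetected-chromedriver patches Chrome so that common automation
# markers (e.g. navigator.webdriver) are not exposed to the page.
driver = uc.Chrome(options=options, headless=True)
driver.get(PRODUCT_URL)
print(driver.title)
driver.quit()
```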

Why did we use our own solution, Surfsky?

At the network level, Surfsky uses high-quality residential and mobile proxies from various geolocations and subnets. In each case, we analyze which proxies are more suitable for the task at hand and provide customization options for each client. On our servers, the network level is integrated natively and allows us to avoid leaks, through which Amazon might learn the real IP address and detect spoofing.

Surfsky uses only real browser fingerprints collected from actual user devices. Currently, there are more than 10,000 of them and they are constantly updated. Moreover, spoofing is performed at the lowest possible level of the browser kernel.

If you would like to start using Surfsky too, go to our website and order a demo. If you've read this far, our service can definitely solve the problems you might be facing.

Part 3. Finding a solution

First, we developed a proof-of-concept: scripts that launch our browser in multi-threaded mode to parse data from product pages. Let’s recall that our client needs to collect data on 80,000 products per day, which is about 3,400 targeted requests per hour. During the development phase, we always plan for the possibility of an increase in request volume, so we account for at least a 10% increase. Thus, our solution must be scalable to at least 88,000 products per day. To deliver collected data to the client by 15:00 UTC daily, we need to have at least 4 hours of buffer time. We calculated a minimum of 88,000/20 = 4,400 requests per hour.

Next, our task was to find the optimal number of requests processed in our cloud cluster. We allocated and reserved a pool of residential proxies (a worldwide mix). Instead of averages, we used percentiles for more accurate calculations: a product page in the 95th percentile weighs ~8-10 MB, a full page load through a proxy takes 6 seconds, and parsing plus saving the data to the database takes another 2 seconds.

All in all, we have 20 hours available daily to collect data on at least 88,000 products, i.e. 4,400 requests per hour. At ~8 seconds per request, one browser instance handles about 450 requests per hour, so a minimum of 4,400 / 450 ≈ 10 browser instances is required.
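
The same back-of-the-envelope math in a few lines of Python, so the plan is easy to re-check when volumes change:

```python
import math

PRODUCTS_PER_DAY = 80_000
GROWTH_MARGIN = 1.10          # plan for at least a 10% increase in volume
HOURS_AVAILABLE = 24 - 4      # keep a 4-hour buffer before 15:00 UTC
SECONDS_PER_REQUEST = 6 + 2   # p95 page load via proxy + parse/save

target_per_day = PRODUCTS_PER_DAY * GROWTH_MARGIN       # 88,000 products
requests_per_hour = target_per_day / HOURS_AVAILABLE    # 4,400 requests/h
per_instance_per_hour = 3600 / SECONDS_PER_REQUEST      # 450 requests/h
instances = math.ceil(requests_per_hour / per_instance_per_hour)

print(requests_per_hour, per_instance_per_hour, instances)  # 4400.0 450.0 10
```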

Having completed our calculations, we launched a series of proof-of-concept tests for 8 hours to collect detailed statistics. It turned out that about 20% of requests returned a captcha. To solve this, we emulated user actions, including realistic cursor movements and page scrolling, and connected automatic captcha recognition, after which we ran our tests again. As a result, the request time increased from 8 to 21 seconds, but we reduced the number of erroneous responses, which gives a more stable and predictable result in the long term. At ~21 seconds per request, one instance handles about 170 requests per hour, so covering 4,400 requests per hour now takes at least 26 instances; we increased the number of browser instances from 10 to 27.
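
For the curious, here is a heavily simplified sketch of how such a multi-instance run can be orchestrated; the captcha handling and data extraction are stubbed out, and in the real setup each instance also gets its own proxy and fingerprint:

```python
from concurrent.futures import ThreadPoolExecutor

from selenium import webdriver

INSTANCES = 27  # from the calculation above

def make_driver() -> webdriver.Chrome:
    # In the real setup each instance gets its own proxy and fingerprint;
    # a plain Chrome stands in for that here.
    return webdriver.Chrome()

def parse_product(driver: webdriver.Chrome, url: str) -> None:
    driver.get(url)
    if "captcha" in driver.page_source.lower():
        # Hook for the automatic captcha solver (not shown here).
        raise RuntimeError(f"captcha on {url}")
    # ... extract price, rating, reviews and save them to the database ...

def worker(urls: list[str]) -> None:
    driver = make_driver()
    try:
        for url in urls:
            try:
                parse_product(driver, url)
            except RuntimeError:
                continue  # in the real pipeline: log and re-queue the URL
    finally:
        driver.quit()

def run(all_urls: list[str]) -> None:
    # One chunk of URLs per browser instance, processed in parallel.
    chunks = [all_urls[i::INSTANCES] for i in range(INSTANCES)]
    with ThreadPoolExecutor(max_workers=INSTANCES) as pool:
        list(pool.map(worker, chunks))
```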

Part 4. Integration

This is the shortest part: after developing our technical solution, we focused on its integration into the client's infrastructure. Throughout the integration, we provided constant support to the client. The entire process took 12 days, 4 of which were spent on auditing and developing the optimal solution.

Conclusion

Using Surfsky allowed us to solve the client's problems: we were able to obtain accurate data on time. The cost of support decreased, and the overall efficiency of the client's operations improved.

Using anti-detection technologies has become a must for browser automation tasks: anti-bot systems are continuously improving, getting better every year at identifying automated behavior and browser characteristics through algorithms and behavioral pattern analysis, so bypassing them is becoming increasingly difficult.

At Surfsky, we have created and continuously maintain an up-to-date and powerful solution for web automation. For this, we use a multi-accounting anti-detect browser and the cutting-edge tools currently available to counteract attempts at identifying browser automation. Our approach allows our users to bypass complex anti-bot systems while maintaining a high degree of anonymity and security.
