For e-commerce aggregator sites, keeping information up to date is critical. Otherwise their main advantage disappears: the ability to see the most relevant data in one place.
This problem is solved with web scraping: a dedicated software crawler walks through a list of target sites, parses the needed information from their pages, and uploads it to the aggregator.
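As a rough illustration, here is a minimal crawler sketch in Python using the requests and BeautifulSoup libraries. The product URLs and the PRICE_SELECTOR are hypothetical placeholders, since every target site has its own markup; a real crawler would also need error handling, politeness delays, and storage.

```python
# Minimal aggregator-crawler sketch: fetch a list of product pages and
# pull out the price from each one.
import requests
from bs4 import BeautifulSoup

# Hypothetical target pages and CSS selector; real sites will differ.
PRODUCT_URLS = [
    "https://shop-one.example/item/123",
    "https://shop-two.example/item/456",
]
PRICE_SELECTOR = ".product-price"

def scrape_price(url: str) -> str | None:
    """Download one product page and extract the price text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    node = soup.select_one(PRICE_SELECTOR)
    return node.get_text(strip=True) if node else None

if __name__ == "__main__":
    for url in PRODUCT_URLS:
        print(url, "->", scrape_price(url))
```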
The problem is that the owners of the sites aggregators take data from are often reluctant to give such easy access. That is understandable: if the prices from an online store end up on an aggregator and turn out to be higher than those of the competitors listed there, the business loses customers.
How sites counter scraping
Therefore, the owners of such sites often resist scraping, that is, they try to prevent their data from being downloaded. They can identify requests sent by crawler bots through their IP addresses: crawlers usually run on so-called server (datacenter) IP addresses, which are easy to recognize and block.
In addition, instead of blocking requests, another tactic is common: identified bots are shown deliberately distorted information. For example, product prices are inflated or deflated, or product descriptions are altered.
A frequently cited example is airline ticket prices. Airlines and travel agencies quite often return different results for the same flights depending on the IP address behind the search. A real case: searching for a flight from Miami to London on the same date from IP addresses in Eastern Europe and in Asia produces different results.
The search from the Eastern European IP address and the search from the Asian IP address returned prices for the same flight that differed by $76, which is a lot. For an aggregator site there is nothing worse: if it shows incorrect information, users will stop using it. And if a product has one price on the aggregator but a different one after clicking through to the seller's site, that also hurts the project's reputation.
Solution: using residential proxies
To avoid these problems when scraping data for aggregation, you can use residential proxies. Server (datacenter) IP addresses are allocated to hosting providers, and it is easy to tell that an address belongs to a particular provider's pool: every IP maps to an ASN (Autonomous System Number) that carries this information.
There are many services for ASN lookups, and they are often integrated with anti-bot systems that block crawlers or manipulate the data returned to their requests.
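As a hedged illustration of how such a check works, the snippet below uses the third-party ipwhois package (pip install ipwhois) to look up the ASN behind an address. An ASN description naming a hosting or cloud provider is a strong signal of a datacenter IP; the exact description strings depend on the registry data.

```python
# Sketch of an ASN lookup: anti-bot systems use similar registry data to
# separate hosting-provider (datacenter) IPs from residential ones.
from ipwhois import IPWhois

def describe_ip(ip: str) -> str:
    """Return the ASN and its registered description for an IP address."""
    result = IPWhois(ip).lookup_rdap()
    return f"{ip}: AS{result['asn']} ({result['asn_description']})"

# A well-known datacenter address: typically resolves to Google's ASN.
print(describe_ip("8.8.8.8"))
```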
Residential IP addresses help bypass such systems. Internet service providers issue these addresses to home subscribers, and they are recorded as residential in the relevant registry databases. Special residential proxy services let you route traffic through such addresses; 2captcha residential proxy is one of them.
Requests that aggregator crawlers send from residential IP addresses look as if they come from ordinary users in a particular region. And nobody blocks ordinary visitors: for an online store, they are potential customers.
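To make this concrete, here is a minimal sketch of sending crawler requests through a residential proxy with Python's requests library. The proxy host, port, and credentials are placeholders, not real 2captcha endpoints; take the actual connection parameters from your provider's dashboard. Many rotating proxy services expose a single gateway endpoint that changes the exit IP between requests, but check your provider's documentation.

```python
# Sketch of routing crawler traffic through a residential proxy.
import requests

# Placeholder credentials and gateway; replace with your provider's values.
PROXY_URL = "http://USERNAME:PASSWORD@proxy.example.com:9999"

session = requests.Session()
session.proxies = {"http": PROXY_URL, "https": PROXY_URL}

# The target site sees the residential exit IP, not the crawler's server IP.
response = session.get("https://httpbin.org/ip", timeout=10)
print(response.json())  # shows the external IP the request came from
```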
As a result, using rotating residential proxies from 2captcha lets aggregator sites collect accurate data while avoiding blocks and parsing failures.