<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Metrow</title>
    <description>The latest articles on DEV Community by Metrow (@himetrow).</description>
    <link>https://dev.to/himetrow</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F941568%2F6529063a-9797-4213-81de-9099e5f7054d.jpeg</url>
      <title>DEV Community: Metrow</title>
      <link>https://dev.to/himetrow</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/himetrow"/>
    <language>en</language>
    <item>
      <title>How to Avoid CAPTCHA</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Mon, 02 Jan 2023 12:35:38 +0000</pubDate>
      <link>https://dev.to/himetrow/how-to-avoid-captcha-2ecl</link>
      <guid>https://dev.to/himetrow/how-to-avoid-captcha-2ecl</guid>
      <description>&lt;p&gt;Regular internet users sometimes run into an annoying pop-up called a CAPTCHA. Solving a Google reCAPTCHA when you’re simply browsing the internet is easy. When it comes to bots, however, getting the same pop-up can be devastating as they are intended to stop automation software.&lt;/p&gt;

&lt;p&gt;Triggering CAPTCHAs is often the worst thing that can happen to a bot. Its user will have to either solve it manually, forget about automation, or use a CAPTCHA solving service. Luckily, there are some ways to avoid CAPTCHAs entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is CAPTCHA?
&lt;/h2&gt;

&lt;p&gt;CAPTCHA, or &lt;em&gt;Completely Automated Public Turing test to tell Computers and Humans Apart&lt;/em&gt;, is an anti-bot solution commonly implemented on websites. There have been various iterations of it over the years, but the most common one in use now is Google reCAPTCHA.&lt;/p&gt;

&lt;p&gt;Early versions of the test had users type in scrambled words, solve equations, or do other mundane tasks. Over time, sophisticated bots such as search engine spiders learned to solve CAPTCHAs, making the test less useful.&lt;/p&gt;

&lt;p&gt;Google took it upon themselves to create new CAPTCHA types that would be significantly harder for automation software to solve. Nowadays, the tests no longer ask you to solve an equation or type in a word or two. Trigger a CAPTCHA today and you’ll be met with various picture-based puzzles.&lt;/p&gt;

&lt;p&gt;These puzzles aren’t accidental either. Google uses them to train machine learning and AI models to recognize pictures. Since users most commonly have to identify things like trains, planes, and bridges, the models can learn to do so as well.&lt;/p&gt;

&lt;p&gt;Additionally, there are a few other types of CAPTCHAs in use. There’s the aptly named “Invisible CAPTCHA”, which does exactly what the name says. You can visit websites where it’s integrated and never see it.&lt;/p&gt;

&lt;p&gt;Invisible CAPTCHAs track your mouse movements, clicking patterns, and other data to decide whether the person browsing is a human or a bot. Since bots, at least back in the day, would have highly unusual patterns (instant mouse movements, no scrolling, quick browsing), the invisible CAPTCHA could catch them without burdening a regular user.&lt;/p&gt;

&lt;p&gt;Another common method that might be classified as a CAPTCHA is the honeypot. Honeypots are links or other elements hidden via CSS in the website’s source code, invisible to the user. Bots, however, find them with ease, and once they click one, they are presented with a CAPTCHA.&lt;/p&gt;

&lt;p&gt;Finally, there are sound-based CAPTCHAs. While these are generally added to some of the regular tests for those with vision impairments, sometimes you’ll get a purely audio CAPTCHA. These will often have you type in numbers or words according to the sound file.&lt;/p&gt;

&lt;h2&gt;
  
  
  6 ways to avoid CAPTCHAs
&lt;/h2&gt;

&lt;p&gt;Note that none of these methods are mutually exclusive, so use as many of them in combination as you can. Using all of them at once will minimize the number of CAPTCHAs you get while using bots.&lt;/p&gt;

&lt;p&gt;There are also services that will solve CAPTCHAs for you automatically. These, however, aren’t usually worth the hassle. Some of the methods here will let you get past the test even once it triggers, making automatic CAPTCHA-solving services less useful than you might think.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Change user-agents
&lt;/h2&gt;

&lt;p&gt;If you use any web scraping solution that you’ve created yourself, it will likely have some default user-agent (UA). Since it is sent automatically with each request, it’s something that can be used to track your activity.&lt;/p&gt;

&lt;p&gt;Additionally, some default user-agents are often blocked outright by websites, since they’re a dead giveaway that someone’s using a bot. So, &lt;a href="https://developers.whatismybrowser.com/useragents/explore/" rel="noopener noreferrer"&gt;get a list of legitimate user-agents&lt;/a&gt; and implement them on a rotating basis. Experiment to find out how often you should switch user-agents so that CAPTCHAs never trigger at all.&lt;/p&gt;
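&lt;p&gt;As a rough sketch of the idea (the user-agent strings below are illustrative examples, not a curated list), rotation can be as simple as attaching a randomly chosen UA to each request:&lt;/p&gt;

```python
import random
import urllib.request

# Illustrative user-agent strings; in practice, load a longer,
# regularly updated list from a trusted source.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:107.0) Gecko/20100101 Firefox/107.0",
]

def build_request(url):
    # Attach a randomly picked user-agent so consecutive requests
    # don't all share one telltale default UA.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)

req = build_request("https://example.com")
print(req.get_header("User-agent"))
```

&lt;p&gt;Rotate more aggressively if CAPTCHAs still appear; the right switching frequency varies per site.&lt;/p&gt;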

&lt;h2&gt;
  
  
  2. Use rotating proxies
&lt;/h2&gt;

&lt;p&gt;Your IP address is another way that you can get tracked by most websites. If the same IP address keeps sending connection requests en masse, they know you’re botting or using other automation software.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://metrow.com/rotating-residential-proxies/" rel="noopener noreferrer"&gt;Rotating proxies &lt;/a&gt;are the solution to the issue. They give you a pool of IP addresses that can be changed after every request. Additionally, rotating proxies often come from devices that are located in regular households (as opposed to business-grade servers), so the connection seems genuine and legitimate.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Randomize request delay
&lt;/h2&gt;

&lt;p&gt;Sending requests at a constant rate without any delay is the oldest trick in the book, and websites are wise to it. If you keep changing pages or visiting URLs at set intervals that never vary, it’s clearly a bot doing things for you.&lt;/p&gt;

&lt;p&gt;As such, one of the most important and easiest methods to avoid CAPTCHAs is to add randomization to request times. If you’ve coded your own scraper, adding randomization is a piece of cake.&lt;/p&gt;
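&lt;p&gt;A sketch of what that randomization might look like with nothing but the standard library (the base and jitter values are arbitrary; tune them to the target site):&lt;/p&gt;

```python
import random
import time

def polite_sleep(base=2.0, jitter=3.0):
    # Wait somewhere between base and base + jitter seconds, so
    # requests never land at a fixed, machine-like interval.
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay

# Tiny values here just to keep the demo fast.
waited = polite_sleep(base=0.01, jitter=0.02)
print(round(waited, 3))
```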

&lt;h2&gt;
  
  
  4. Avoid direct links
&lt;/h2&gt;

&lt;p&gt;Another way websites frequently discover bots is that bots most often go through a set library of URLs. People, however, will usually visit the homepage and then browse around somewhat randomly. As such, many websites have implemented homepage cookies that help them discover bots.&lt;/p&gt;

&lt;p&gt;So, try to collect URLs on the go instead of constantly going through direct links on websites. It helps if you also use a headless browser to collect cookies along the way instead of simply sending direct requests to the website.&lt;/p&gt;
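&lt;p&gt;A minimal sketch of collecting URLs on the go with Python’s standard library: parse each page you fetch and queue the links it exposes, instead of working from a fixed list (the sample page is inlined here for illustration):&lt;/p&gt;

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    # Gathers every href on a page so the crawl frontier grows
    # organically, the way a human browsing session would.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# "\u003c" and "\u003e" decode to HTML angle brackets at runtime.
page = "\u003ca href='/pricing'\u003ePricing\u003c/a\u003e \u003ca href='/blog'\u003eBlog\u003c/a\u003e"
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/pricing', '/blog']
```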

&lt;h2&gt;
  
  
  5. Render JavaScript
&lt;/h2&gt;

&lt;p&gt;Piggybacking off the last method, it’s usually helpful to render JavaScript if you’re using a browser instead of direct requests. Nowadays, websites have a ton of content hidden in JavaScript. Most of it is useful to the user, so websites eye anyone who blocks JavaScript with suspicion.&lt;/p&gt;

&lt;p&gt;While it may increase load times a bit, rendering JavaScript elements will also reduce the likelihood of receiving a CAPTCHA. It becomes a balancing act: you have to pick which matters more, speed or avoiding CAPTCHAs.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Avoid honeypots
&lt;/h2&gt;

&lt;p&gt;Avoiding honeypots can be simpler than it seems at first glance. Since these elements have to be invisible to regular users, they will often carry the hidden attribute or inline styles such as display: none or visibility: hidden.&lt;/p&gt;

&lt;p&gt;So, check the elements and source code of the page. Pay special attention to the URLs, since these will often hold the honeypot. If there’s a URL that’s hidden or styled to be invisible, you can be almost certain it’s a honeypot.&lt;/p&gt;
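&lt;p&gt;A rough sketch of such a check in Python, treating links with a hidden attribute or invisible inline styles as honeypot candidates (real pages can also hide elements through external stylesheets, which this simple check won’t catch):&lt;/p&gt;

```python
from html.parser import HTMLParser

def looks_hidden(attrs):
    # Flag a link if it carries the hidden attribute or an inline
    # style that makes it invisible to a human visitor.
    attrs = dict(attrs)
    style = (attrs.get("style") or "").replace(" ", "").lower()
    if "hidden" in attrs:
        return True
    return "display:none" in style or "visibility:hidden" in style

class SafeLinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.safe = []
        self.traps = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if looks_hidden(attrs):
                self.traps.append(href)
            else:
                self.safe.append(href)

# "\u003c" and "\u003e" decode to HTML angle brackets at runtime.
page = ("\u003ca href='/real'\u003eok\u003c/a\u003e"
        "\u003ca href='/trap' style='display: none'\u003ex\u003c/a\u003e")
c = SafeLinkCollector()
c.feed(page)
print(c.safe, c.traps)  # ['/real'] ['/trap']
```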

</description>
      <category>mentalhealth</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Best Tools to Monitor Website Changes</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Wed, 21 Dec 2022 13:22:36 +0000</pubDate>
      <link>https://dev.to/himetrow/best-tools-to-monitor-website-changes-3nn8</link>
      <guid>https://dev.to/himetrow/best-tools-to-monitor-website-changes-3nn8</guid>
      <description>&lt;p&gt;Website monitoring is a process that helps track website changes and test its performance, functions, and availability in different locations. This process has many use cases. Companies monitor pages to spot security issues or for market research, while e-commerce businesses may monitor competitor sites for pricing and product changes, brand mentions, etc.&lt;/p&gt;

&lt;p&gt;To track website changes, companies and individuals use various tools. These tools offer different features, such as choosing a particular area on a site that needs tracking, monitoring website availability and performance in various locations, sending change alerts, and others. &lt;/p&gt;

&lt;p&gt;We listed the 13 best tools to monitor website changes. In the list, you’ll find monitoring tools’ descriptions, their features, and pricing information. Website monitoring is an essential part of ensuring your business runs smoothly, so choosing the right tool can define the success of your monitoring efforts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Site 24/7
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KHEFKgVv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7lur5anccjmdeok5zlh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KHEFKgVv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7lur5anccjmdeok5zlh8.png" alt="Site 24/7" width="880" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Site 24/7 provides website and performance monitoring for DevOps and IT operations. It’s an all-in-one monitoring solution that covers website, server, application performance, network, and cloud performance monitoring, among other things. The software has a large user base, including well-known enterprises such as NASA, Ford, and SAP. &lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instant issue detection alerts via SMS, email, or a phone call&lt;/li&gt;
&lt;li&gt;Covers over 110 locations worldwide&lt;/li&gt;
&lt;li&gt;Strong focus on user experience analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Site 24/7 offers a free 30-day trial. Their site monitoring Starter package starts at $9/month when billed annually. The smallest pricing plan covers ten websites or servers, 110 locations, and 500MB of logs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Hexowatch
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--puu5cmrY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dfv3do77pax5rx8ee2p8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--puu5cmrY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dfv3do77pax5rx8ee2p8.png" alt="Hexowatch" width="880" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hexowatch covers 12 different monitoring options, including visual, HTML code, keyword, content, and source code monitoring. It allows setting monitoring preferences based on frequency, sensitivity, location, and types of alerts you’d like to receive. The tool supports a number of software integrations, including Telegram, Slack, Google, and others.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change alerts via email, Slack, Telegram, or Zapier&lt;/li&gt;
&lt;li&gt;Monitors HTML code, keywords, and visuals&lt;/li&gt;
&lt;li&gt;Quick web data extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Hexowatch has a limited free version and a 30-day money-back guarantee. Paid plans start at $14.99/month. The Standard plan covers 2 000 checks a month and includes an API key and limited software integrations for receiving alerts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Distill.io
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KkjiQz5v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/22ydx0d3h5d4xhbnm6li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KkjiQz5v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/22ydx0d3h5d4xhbnm6li.png" alt="Distill.io" width="880" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Distill.io allows selecting parts of a website you wish to track. The tool allows setting custom monitoring frequency and maintaining a tracking history. All monitors can be managed via a single dashboard, and data can be extracted into a PDF. This website change monitoring software has over 400 000 users.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows selecting specific elements you wish to monitor&lt;/li&gt;
&lt;li&gt;Has a Google Chrome extension&lt;/li&gt;
&lt;li&gt;Alerts available via phone push notifications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Distill.io offers a free version for testing and non-essential use. The smallest paid plan costs $15/month and is the best option for individual use. This plan covers 30 000 checks a month and supports three devices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Versionista
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UQdQgbnX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pfsbwgulcklrep2ilzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UQdQgbnX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9pfsbwgulcklrep2ilzc.png" alt="Versionista" width="880" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Versionista offers website change tracking at scale. It allows monitoring changes to HTML, PDFs, and dynamic content. It can auto-crawl sites to detect changes. The software sends users reports by providing color-coded comparisons that help spot additions or deletions on a website. Versionista’s web change intelligence offers solutions for various businesses.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provides detailed monitoring summaries via email&lt;/li&gt;
&lt;li&gt;Allows team collaborations within an organization&lt;/li&gt;
&lt;li&gt;Filters irrelevant content&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Versionista’s pricing depends on how many URLs you wish to monitor for changes. You can submit 5 URLs for free and get 465 free crawls. 400 URLs and 37 200 full browser crawls a month cost $99.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sken.io
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9Q8bp0zN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9d0vdzv8kg9ochn9x93i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9Q8bp0zN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9d0vdzv8kg9ochn9x93i.png" alt="Sken.io" width="880" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sken.io helps automate repetitive tasks and save time. The website change monitor is helpful for e-commerce competitor monitoring since it reports price changes and product availability. Apart from the usual web change tracking, the tool can also monitor legislation changes to help companies with law compliance. &lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows selecting a specific website part for monitoring&lt;/li&gt;
&lt;li&gt;Has a mobile application for Android&lt;/li&gt;
&lt;li&gt;Google Chrome extension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Sken.io offers a 14-day free trial that covers 140 free checks. The Basic plan is $3/month and delivers 500 checks. All paid plans include priority support, email, mobile and push alert notifications, data synchronisation across different devices, and a few other features.&lt;/p&gt;

&lt;h2&gt;
  
  
  ChangeTower
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--k7l3Qgw0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1y2kuy8qwvlw98pjstms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k7l3Qgw0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1y2kuy8qwvlw98pjstms.png" alt="ChangeTower" width="880" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ChangeTower is a powerful platform for website change detection and archiving. Users only need to submit a URL of the website they wish to monitor. ChangeTower then crawls the website and takes a full-page screenshot. The snapshot is automatically saved in the archive for future change tracking. ChangeTower’s clients include KPMG, Salesforce, Disney, and Cloudflare.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customizable criteria for alerts&lt;/li&gt;
&lt;li&gt;Detailed website change alerts&lt;/li&gt;
&lt;li&gt;Stores time-based code snapshots&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;ChangeTower offers a free version that covers six checks per day and one month of data archival. Paid plans start from $9 a month and cover 50 checks for up to 500 dynamic URLs. All plans include web content, HTML, and keyword monitoring. &lt;/p&gt;

&lt;h2&gt;
  
  
  Pagescreen.io
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q1a0VfDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gra5thknfndrtg8zg58p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q1a0VfDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gra5thknfndrtg8zg58p.png" alt="Pagescreen.io" width="640" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pagescreen.io automates website change detection, monitoring, and alerts. The software allows team collaboration and can send alerts to multiple people. Team members can also discuss and comment over screenshots, which is a helpful and time-saving feature. The solution can be integrated with various workflows and systems, such as Google Drive, Slack, Dropbox, and others.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles large-scale web monitoring&lt;/li&gt;
&lt;li&gt;Supports numerous integrations&lt;/li&gt;
&lt;li&gt;Allows team collaborations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Pagescreen.io offers a 14-day free trial that covers five monitors and 1 000 screen captures. The smallest plan is $15/month and supports unlimited URLs and six archives. All the plans offer a one-hour minimum frequency between captures for single monitoring. &lt;/p&gt;

&lt;h2&gt;
  
  
  Deep Web Monitor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--teGAgVel--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fxa2t710ntxcj1fvob59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--teGAgVel--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fxa2t710ntxcj1fvob59.png" alt="Deep Web Monitor" width="880" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Deep Web Monitor allows monitoring password-protected pages. This is one of the monitoring tools that access websites with real browsers, submit forms, and click on links. The software also allows selecting a specific area on a website you wish to monitor. Visual comparison enables users to monitor graphic information, not only texts.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual change comparison&lt;/li&gt;
&lt;li&gt;Different business and individual plans&lt;/li&gt;
&lt;li&gt;Allows customizing check interval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Deep Web Monitor offers four different subscription plans. The Beginner plan is $10/month and covers ten monitors and a 30-day history. All plans come with a 14-day free trial for testing the tool and its features. &lt;/p&gt;

&lt;h2&gt;
  
  
  Pagecrawl.io
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dk14arJO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gra695gt7zeup15ajzh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dk14arJO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gra695gt7zeup15ajzh7.png" alt="Pagecrawl.io" width="880" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pagecrawl.io can track changes on dynamic and JavaScript-heavy pages. Its JavaScript-enabled browser can capture web content from all types of websites and provide screenshots. Pagecrawl.io also supports proxies, and you can provide your own list of IPs, which lets you source them from a reliable provider.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works on password-protected pages&lt;/li&gt;
&lt;li&gt;Automatically bypasses Cloudflare bot detection&lt;/li&gt;
&lt;li&gt;Supports number extraction, eliminating text or symbols&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Pagecrawl.io has a free-forever version. This plan can track changes on up to 16 websites with a one-day monitoring frequency. The free version allows using your own proxies. Paid plans start from $8/month.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wachete
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yx2o8dTO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sj3o6b57hb6689kjy9zk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yx2o8dTO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sj3o6b57hb6689kjy9zk.png" alt="Wachete" width="880" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wachete provides sample watches so you can check them out and see what the software can be used for. This monitoring tool can help track web changes, job offers, product prices and availability. It also sends alerts via email, Slack, Teams, or on its mobile app. The tool can be used for website audits, competitor tracking, website health checks, and other cases.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mobile app for Apple, Android, and Windows OS&lt;/li&gt;
&lt;li&gt;Web browser extensions and add-ons for Chrome, Firefox, and Edge&lt;/li&gt;
&lt;li&gt;Provides a web change history of up to 12 months&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Wachete has a free version that can monitor five pages and performs a check every 24 hours. The free version doesn’t support dynamic pages. Paid plans start at $4.90/month and allow monitoring up to 50 pages and one dynamic page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trackly
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ubBqcFgm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/82auy7lrw7snwzizqpak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ubBqcFgm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/82auy7lrw7snwzizqpak.png" alt="Trackly" width="880" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trackly sends users an email when it detects a change in the targeted website. The email contains highlighted website changes so users can spot right away what was changed. Users can control what inconsistencies they’d like to be informed about, and the software monitors websites every hour. &lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly change monitoring&lt;/li&gt;
&lt;li&gt;Customizable change tracking&lt;/li&gt;
&lt;li&gt;Email alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Trackly’s pricing depends on how many websites you wish to monitor. Tracking up to three pages is free, but it only monitors these sites once a day or a week. If you wish to have the websites monitored hourly, the smallest subscription covers 20 websites and costs $9/month. Trackly also has a free 30-day trial.&lt;/p&gt;

&lt;h2&gt;
  
  
  OnWebChange
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KsAEB44y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6b66dpk2conzvzb3uey9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KsAEB44y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6b66dpk2conzvzb3uey9.png" alt="OnWebChange" width="880" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re looking for constant change detection, OnWebChange offers tracking your targeted websites every five minutes. The tool allows selecting a particular part on the website you want to monitor. You can even choose multiple sections. It allows setting the tracking frequency, how you wish to receive notifications, and adding logic filters.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks changes on images, PDFs, and text&lt;/li&gt;
&lt;li&gt;Works on any public website&lt;/li&gt;
&lt;li&gt;5-minute monitoring frequency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;OnWebChange has a free plan that monitors websites every 24 hours. The free version can perform up to 30 checks a month across three websites. Paid plans start from $0.89 and go up to $8.99. This is one of the cheapest web monitoring tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualping
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qqnjgw26--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v8qo6j33msw7ppx69gxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qqnjgw26--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v8qo6j33msw7ppx69gxr.png" alt="Visualping" width="880" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Visualping has over 2M users worldwide, including Fortune 500 companies. The software sends an email alert with a screenshot when it detects a change in a site. They also have a convenient dashboard that provides more detailed information about the changes. The business version of the tool allows team collaborations and provides team training.&lt;/p&gt;

&lt;p&gt;Main features: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Email alerts with screenshots&lt;/li&gt;
&lt;li&gt;Team collaboration and training&lt;/li&gt;
&lt;li&gt;Integrates with Slack and Teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Price:&lt;/p&gt;

&lt;p&gt;Visualping has a free version that covers up to five pages a day. The Starter plan costs $10/month and can monitor websites hourly. This plan covers up to 1 000 checks a month. You can also choose to pay per use or get billed yearly and get a discount.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxies for Monitoring Websites
&lt;/h2&gt;

&lt;p&gt;Website change monitoring works by sending continuous GET requests to the target website. This means that the target receives many requests very quickly. Sending such requests from a single IP address will get it blocked in no time. To avoid that, monitoring tools have to use proxies. &lt;/p&gt;
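&lt;p&gt;At its core, the loop described above can be sketched as hashing each GET response body and comparing it to the last stored fingerprint (a toy Python sketch; commercial tools add visual diffs, element selection, and scheduling on top):&lt;/p&gt;

```python
import hashlib

def fingerprint(content):
    # Hash the page body; if the hash differs, the page changed.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()

def changed(previous_hash, new_content):
    return fingerprint(new_content) != previous_hash

# In a real monitor, new_content would be the body of each fresh
# GET request to the watched page.
baseline = fingerprint("price: $10")
print(changed(baseline, "price: $10"))  # False
print(changed(baseline, "price: $12"))  # True
```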

&lt;p&gt;Some monitoring tools allow integrating your own proxy list. This is a helpful feature because you can get proxies from your preferred provider and ensure the most efficient monitoring. &lt;/p&gt;

&lt;p&gt;Depending on your needs and the use case, you may want to use &lt;a href="https://metrow.com/residential-proxies/"&gt;residential proxies&lt;/a&gt; for tracking website changes. These proxies are more reliable than other types of IPs because they’re harder to detect. They can also cover any location in the world with precise geo-targeting, so you can see how your website appears in different areas around the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Website changes matter because they can inform you about security breaches, changes on competitors’ websites, and new offers. Various businesses track their own and competitors’ websites to gather valuable information that is later used to improve the user experience or is integrated into a business strategy.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>proxies</category>
      <category>tools</category>
    </item>
    <item>
      <title>How to Crawl a Website Without Getting Blocked: 17 Tips</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Fri, 16 Dec 2022 13:59:17 +0000</pubDate>
      <link>https://dev.to/himetrow/how-to-crawl-a-website-without-getting-blocked-17-tips-c20</link>
      <guid>https://dev.to/himetrow/how-to-crawl-a-website-without-getting-blocked-17-tips-c20</guid>
      <description>&lt;p&gt;Collecting publicly available data is hardly possible without &lt;a href="https://metrow.com/blog/web-scraping-vs-web-crawling/"&gt;web crawling and web scraping&lt;/a&gt;. The first one indexes web pages and discovers target URLs, while the second can extract the data for your needs. Unfortunately, these processes can unintentionally harm websites, so they set up defenses against them.&lt;/p&gt;

&lt;p&gt;You must be prepared to tackle anti-scraping measures and know how to crawl a website without getting blocked. Otherwise, your data collection process can become a nightmare and require more resources than necessary. The list below covers the most common tips and a few extra measures on how to avoid bans while web scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Follow the rules in robots.txt
&lt;/h2&gt;

&lt;p&gt;The robots.txt file, which implements the robots exclusion protocol, is the standard way for websites to set the rules of how web crawlers and other bots should behave while visiting. Most importantly, it specifies how frequently bots can send requests to the website and which pages should not be accessed.&lt;/p&gt;

&lt;p&gt;Checking the Robots.txt protocol is one of the first things to do when web scraping. You can find it by adding “/robots.txt” to the web address. It will show you whether you can crawl the website. Some exclusion protocols restrict all bots from entering. Others allow only major search engines, such as Google.&lt;/p&gt;

&lt;p&gt;Even if the website allows scraping, there will be rules outlined in Robots.txt. These rules are in place to keep the website performing well for regular visitors and protect personal or copyrighted data from being collected.&lt;/p&gt;

&lt;p&gt;Most website defenses are made specifically to prevent bots from violating exclusion rules. If you don’t follow them, anti-scraping defenses will see your bot as a threat that can slow down or even crash the website, which can lead to bans.&lt;/p&gt;
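&lt;p&gt;As a minimal sketch, Python’s standard library can parse these rules for you. The URL and rules below are placeholders, not any real site’s file:&lt;/p&gt;

```python
# Honoring robots.txt with Python's standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
# rp.read() would download the real file; here we parse sample rules directly.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
])

rp.can_fetch("my-bot", "https://example.com/public/page")   # allowed
rp.can_fetch("my-bot", "https://example.com/private/page")  # disallowed
rp.crawl_delay("my-bot")                                    # minimum delay in seconds
```

&lt;p&gt;Respecting both the disallowed paths and the crawl delay keeps your bot within the site’s stated rules.&lt;/p&gt;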

&lt;h2&gt;
  
  
  Use proxies
&lt;/h2&gt;

&lt;p&gt;Tracking IP addresses is one of the most common anti-scraping measures. It allows websites to assign different metrics to individual visitors and discover bots. Many other anti-scraping implementations are possible only if the website can track IP addresses. Therefore, web scraping is almost impossible without switching between IP addresses to appear as multiple different visitors.&lt;/p&gt;

&lt;p&gt;Additionally, using a proxy server enables you to bypass geo-restrictions and access content that would otherwise be unavailable in your country. Some websites change their content to visitors according to their IP location, but a proxy will allow you to access data from other regions as well.&lt;/p&gt;

&lt;p&gt;Other means of changing your IP address, such as VPN or Tor browser, are inferior to proxies when it comes to data collection. They are hard to integrate with most tools, have poor IP selection, and the practice might even be discouraged by providers.&lt;/p&gt;
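&lt;p&gt;With Python’s standard library, routing traffic through a proxy takes only a few lines. The proxy address and credentials below are placeholders:&lt;/p&gt;

```python
# Sending all urllib traffic through a proxy server.
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)
# From here on, urllib.request.urlopen(...) connects via the proxy.
```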

&lt;h2&gt;
  
  
  Rotate residential IP addresses
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://metrow.com/blog/rotate-ip-address/"&gt;IP rotation&lt;/a&gt; is when a proxy server automatically assigns different IP addresses after a set interval or for every new connection. Switching between proxies manually takes time and might put you at higher risk as sending too many requests without rotating will get your IP banned.&lt;/p&gt;

&lt;p&gt;Rotating residential proxies is the safest choice for avoiding bans while web scraping. Such IP addresses originate from physical devices with connections verified by internet service providers. It will allow your crawlers and scrapers to blend in better with regular visitors lowering the chance of bans significantly.&lt;/p&gt;
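&lt;p&gt;Rotating residential proxy services usually switch the exit IP for you on the server side. A client-side sketch of the same round-robin idea, with placeholder addresses, looks like this:&lt;/p&gt;

```python
# Round-robin proxy rotation on the client side.
from itertools import cycle

PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]
pool = cycle(PROXIES)

def next_proxy():
    """Each call returns the next proxy, wrapping around at the end of the list."""
    return next(pool)
```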

&lt;h2&gt;
  
  
  Limit the traffic you create
&lt;/h2&gt;

&lt;p&gt;Crawlers and scrapers can put enormous pressure on a website's server, sending requests and creating traffic at the pace of thousands of humans. So much traffic from a single visitor is a clear sign it's a bot. Sometimes websites might allow it, but you should still mind the traffic you create. &lt;/p&gt;

&lt;p&gt;If you don't respect the website's limitations, your actions can appear as an attack, trying to take down the website instead of peacefully collecting data. Needless to say, you will lose your progress and will get the IP address banned.&lt;/p&gt;

&lt;p&gt;Different approaches to limiting the traffic created by a scraper bot exist and might vary according to the scraping software. However, delaying requests and crawling during off-peak hours are the most common.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delay requests
&lt;/h2&gt;

&lt;p&gt;Web scrapers can send requests constantly, without any breaks or time for thinking, while humans, even the fastest ones, have to pause for a few seconds between requests. No delay between requests is an obvious sign of a bot for any website monitoring visitors' behavior.&lt;/p&gt;

&lt;p&gt;However, a delay between requests is not enough as they also must be randomized. For example, sending requests exactly every 5 seconds or minutes will seem suspicious. No ordinary user would browse a website like that. Spread the requests at random intervals as it is the best way to imitate human activity.&lt;/p&gt;

&lt;p&gt;If you see that your crawler is slowing down, it might mean that your requests are too frequent and you should add more delay. Usually, robots.txt specifies the minimum delay, but don't use exactly the same value if you want to blend in better.&lt;/p&gt;
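&lt;p&gt;A sketch of such randomized delays, assuming a hypothetical Crawl-delay of 10 seconds as the base value:&lt;/p&gt;

```python
# Randomized, never-identical delays between requests.
import random
import time

def jittered_delay(base):
    """The robots.txt minimum plus random jitter, so intervals never repeat exactly."""
    return base + random.uniform(1.0, 5.0)

def polite_sleep(base=10):
    """Pause before the next request for a randomized interval above the minimum."""
    time.sleep(jittered_delay(base))
```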

&lt;h2&gt;
  
  
  Choose the right time for scraping
&lt;/h2&gt;

&lt;p&gt;It might seem that the best time for collecting is when it's convenient for you. Unfortunately, it often overlaps with peak hours of websites as most visitors tend to flock at around the same time. It is best to avoid scraping during such hours, and not only because it can slow down your progress.&lt;/p&gt;

&lt;p&gt;Putting pressure on the server when ordinary users are also visiting in great numbers can lead to lowered performance or even a website crash. Such performance hits might negatively affect the regular user’s experience.&lt;/p&gt;

&lt;p&gt;The best time to scrape might not be the most convenient for you, but it can reduce the number of IP blocks significantly. A great way to solve this problem is to test the site at different hours and schedule your web scraper only when the server performs at its best.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid predictable crawling patterns
&lt;/h2&gt;

&lt;p&gt;Ordinary visitors browse websites somewhat randomly as they look for information in a trial-and-error fashion. Bots already know the website's structure and can go straight to the needed data. Such crawling patterns are recognizable, and some websites even employ advanced AI technologies to catch them more efficiently.&lt;/p&gt;

&lt;p&gt;Changing how your crawler bot travels around the website will help you appear more as a regular user and reduce the probability of bans. Include random actions, such as scrolling and visiting unrelated links or pages that aren't your target. Mix these actions on every visit to the website like an ordinary user would. &lt;/p&gt;

&lt;p&gt;This tip goes together with randomizing the delays between your requests, as even the most natural browsing pattern will seem suspicious if done too quickly. But while delays can be random, changing the crawling pattern will require you to study the website's layout and find natural ways of browsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid scraping behind logins
&lt;/h2&gt;

&lt;p&gt;Collecting data protected by a log-in creates a few issues for crawlers. On most websites, logging in requires registering an account and agreeing to the Terms and Conditions. Most of the time, these explicitly state that the use of bots isn't allowed or is restricted to some degree. Similarly, they will forbid data scraping, which would cause you to break an agreement and, in turn, the law.&lt;/p&gt;

&lt;p&gt;A lot of information behind a log-in isn't public either. Scraping copyrighted or personal information could even be against the law. It’s best to avoid scraping data behind logins entirely. An exception could be if the data is available without logging in with something additional being displayed if you do so. Then, it is likely that the publicly visible data would be fair game.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deal with honeypots
&lt;/h2&gt;

&lt;p&gt;Honeypot traps are security mechanisms that protect websites from online threats while identifying bots, including crawlers and scrapers. Spider honeypots are specifically designed to catch web crawlers and work by placing links only visible to them, usually hidden within the code.&lt;/p&gt;

&lt;p&gt;If your crawler visits such a link, the website will know it is a bot, and an IP ban will follow. However, implementing efficient honeypot traps requires a lot of knowledge and resources. Most websites can't afford sophisticated decoy systems and use simple tricks that are easy to notice.&lt;/p&gt;

&lt;p&gt;Before following any link with your web crawler, check that it has proper visibility for regular users. Usually, honeypots have hidden visibility in CSS, so it is better to set the scraper to avoid links with background color, “visibility: hidden”, “display: none” or similar properties.&lt;/p&gt;
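&lt;p&gt;A simple heuristic can filter out links whose inline style hides them. This is only a sketch: real honeypots may also hide links through CSS classes, which a string check like this won’t catch:&lt;/p&gt;

```python
# Flagging links whose inline style suggests a hidden honeypot.
HIDDEN_MARKERS = ("display:none", "visibility:hidden", "opacity:0")

def looks_hidden(style):
    """True if an inline style string appears to hide the element."""
    compact = style.replace(" ", "").lower()
    return any(marker in compact for marker in HIDDEN_MARKERS)
```

&lt;p&gt;Run the check on every candidate link’s style attribute before letting the crawler follow it.&lt;/p&gt;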

&lt;h2&gt;
  
  
  Optimize HTTP headers
&lt;/h2&gt;

&lt;p&gt;HTTP request headers provide crucial context for servers when transferring data online. They determine what data has to be sent over and how it must be organized. Most web scraping tools do not have any headers by default, while browsers send multiple ones to tailor their requests.&lt;/p&gt;

&lt;p&gt;Websites can use information in HTTP headers to tell whether the request is coming from a regular visitor or a bot. You should optimize your headers to make your web crawler blend in with other users and avoid bans. Here are some of the most common HTTP request headers to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The Referer header&lt;/strong&gt; (spelled with one “r” in the HTTP specification) tells the website about the last page you visited.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Accept header&lt;/strong&gt; instructs the server about the type of media the client accepts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Accept-Language header&lt;/strong&gt; informs the server about the client's preferred language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Accept-Encoding header&lt;/strong&gt; communicates the acceptable data compression format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Host header&lt;/strong&gt; contains the domain name and port number of the target server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You should set these headers to look like ones from regular internet users. While optimizing these and other HTTP headers will reduce the risk of bans, one header - the User-Agent - is essential and deserves additional attention.&lt;/p&gt;
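&lt;p&gt;A sketch of browser-like headers with Python’s standard library. The values mirror a desktop Firefox and are examples, not requirements:&lt;/p&gt;

```python
# Attaching browser-like headers to a request.
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate",
    "Referer": "https://www.google.com/",
}

req = urllib.request.Request("https://example.com/", headers=HEADERS)
# urllib.request.urlopen(req) would now send the request with these headers.
```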

&lt;h2&gt;
  
  
  Change User-Agents
&lt;/h2&gt;

&lt;p&gt;The User-Agent header provides the server with information about the operating system and browser used for sending the request. Every regular internet user has a User-Agent set so that websites would load properly. Here's what a User-Agent header could look like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If there is no user agent set, the website might not send you the data. The User-Agent is the first header checked when detecting bots. It is not enough to set one user agent for your web scraper and forget about it. You should use only the most popular user agents and change them frequently. &lt;/p&gt;

&lt;p&gt;Using widespread user agents will allow you to blend in better, and frequently changing them will create an impression of multiple different users. If you look for more tips about optimizing user-agent and other HTTP headers, check our article on the topic &lt;a href="https://metrow.com/blog/http-headers-for-web-scraping/"&gt;here&lt;/a&gt;.&lt;/p&gt;
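&lt;p&gt;Rotating user agents can be as simple as drawing from a pool per request. The strings below are examples; a real pool should track current browser releases:&lt;/p&gt;

```python
# Picking a random User-Agent for each request.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15",
]

def random_user_agent():
    """A different, but always realistic, User-Agent per request."""
    return random.choice(USER_AGENTS)
```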

&lt;h2&gt;
  
  
  Don’t scrape images
&lt;/h2&gt;

&lt;p&gt;Images on websites make them pleasant to the eye and easier to navigate, but this only applies to humans, not bots. Images should be avoided while web scraping, as they are a burden for your bot and can create unnecessary risks.&lt;/p&gt;

&lt;p&gt;The graphical content of a website can take a lot of space, meaning you will need to store it somewhere if you download it. If they aren't your target, the web scraping process will be much faster without them. Additionally, you might even save some money in terms of bandwidth.&lt;/p&gt;

&lt;p&gt;Images can also be copyrighted, and collecting them from a website might be a serious infringement. In case you need any images, make sure to double-check whether you can extract them. But if they are not necessary, don't scrape them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use a unique browser fingerprint
&lt;/h2&gt;

&lt;p&gt;Browser fingerprinting identifies the visitor's device by creating a digital signature, also known as a fingerprint. Using Transmission Control Protocol (TCP) and other means of communication with a browser, websites can identify the browser type, operating system, screen resolution and even hardware.&lt;/p&gt;

&lt;p&gt;When you visit a website, your browser reveals these parameters, and they help track you even if cookies are disabled. Bots leave fingerprints too, so it is an increasingly popular method of restricting web crawling and scraping. The number of bans can grow significantly without good anti-fingerprinting measures.&lt;/p&gt;

&lt;p&gt;Besides using proxies and common user agents, there are two main strategies to fight browser fingerprinting. One is to decrease the uniqueness of your browser. Uninstall any add-ons and use only up-to-date versions of the most popular browsers, such as Google Chrome or Mozilla Firefox. If it looks like millions of other browsers, used by many other people, it won’t be possible to track you through fingerprinting.&lt;/p&gt;

&lt;p&gt;A nuclear option is to disable JavaScript with an extension such as Ghostery or NoScript. It will make browser fingerprinting almost impossible, but more and more websites these days do not load properly without JavaScript enabled, so you’d likely lose out on some data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don’t scrape unnecessary JavaScript elements
&lt;/h2&gt;

&lt;p&gt;Checking whether the visitor's browser uses JavaScript is one of the easiest measures against bots websites implement. They just send some JavaScript elements and see if the visitor can load them. If they can't, it is most likely a bot as almost all regular browsers have JavaScript enabled.&lt;/p&gt;

&lt;p&gt;It may seem then that you should always render all JavaScript elements. However, dynamic website elements can be a tough nut to crack for a web scraper. They are hard to load, depend upon specific user interactions, and might add a lot of unnecessary traffic. &lt;/p&gt;

&lt;p&gt;Therefore, it is best to avoid JavaScript elements when you can. But if the checks are fierce, or if you want to scrape some dynamic elements, you will have to render at least the necessary ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use a headless browser
&lt;/h2&gt;

&lt;p&gt;A headless browser is one without any graphical user interface (GUI). Instead of navigating through websites with buttons and search bars, you can use commands to access the content. Because of their speed, headless browsers are used for testing websites and automating tasks, but they are also beneficial for web scraping.&lt;/p&gt;

&lt;p&gt;They allow easier access to dynamic elements and can pass JavaScript checks. Some of the most popular web browsers, such as Google Chrome or Mozilla Firefox, have headless modes. A more advanced headless browsing experience can be achieved with tools like Selenium, HtmlUnit, Puppeteer or Playwright. &lt;/p&gt;

&lt;p&gt;Setting up a headless browser isn't for everyone. You need to be familiar with at least basic commands and have some development knowledge to get them running properly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solve CAPTCHAs
&lt;/h2&gt;

&lt;p&gt;Sometimes you do everything correctly, and the website still wants to block you. CAPTCHA, or &lt;em&gt;Completely Automated Public Turing Test to Tell Computers and Humans Apart&lt;/em&gt;, is one of the signs that the website might be looking for an excuse to ban your IP address.&lt;/p&gt;

&lt;p&gt;CAPTCHAs work by providing a test that only a human can solve. Most frequently, it's an image-based task asking visitors to select pictures according to a provided theme. If the test is unsolved in a given timeframe, the visitor is denied access and considered a bot. An IP ban might follow after several unsolved CAPTCHAs.&lt;/p&gt;

&lt;p&gt;It is always better to &lt;a href="https://metrow.com/blog/how-to-avoid-captcha/"&gt;avoid CAPTCHAs&lt;/a&gt;, but it is not the end of the world if you get some. Most web scrapers either solve some CAPTCHAs themselves or can work together with CAPTCHA solving services. With good API integration, such services can solve these tests rather quickly.&lt;/p&gt;

&lt;p&gt;However, success isn't always guaranteed, and the price of CAPTCHA solving services may rise exponentially if you receive too many tests. It is more efficient to use every other tool first and keep solving CAPTCHAs as a backup plan.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrape Google cache
&lt;/h2&gt;

&lt;p&gt;If nothing else seems to work and you are getting banned, there is one last trick to try. Instead of crawling and scraping the website itself, you can scrape the data on Google cache. Simply paste "&lt;a href="http://webcache.googleusercontent.com/search?q=cache:"&gt;http://webcache.googleusercontent.com/search?q=cache:&lt;/a&gt;" and write the original URL you intended to visit after the colon.&lt;/p&gt;

&lt;p&gt;Even if the website has strong defenses and blocks most scrapers, it wants to be visible on Google. It is the source of most internet traffic, after all. Therefore, it will allow Googlebot, and there will be a copy of the website in their archive.&lt;/p&gt;

&lt;p&gt;Unfortunately, it isn’t a perfect solution. Less popular websites are rarely updated, so some frequently changing content can be missing. Other websites even block the Googlebot from crawling, so you won't be able to find them. But if you can locate the needed website, scraping Google cache will make things easier.&lt;/p&gt;
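&lt;p&gt;Building the cached URL is just string concatenation, as in this sketch:&lt;/p&gt;

```python
# Turning a target URL into its Google cache counterpart.
CACHE_PREFIX = "http://webcache.googleusercontent.com/search?q=cache:"

def cache_url(target):
    """Prefix the original URL with the Google cache endpoint."""
    return CACHE_PREFIX + target

cache_url("https://example.com/products")
```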

&lt;h2&gt;
  
  
  Frequently Asked Questions
&lt;/h2&gt;

&lt;h2&gt;
  
  
  How do I stop being blocked from web scraping?
&lt;/h2&gt;

&lt;p&gt;Web scraping tools, just like crawlers, employ bots to visit websites. Both processes trigger the same defensive measures. In short, the most effective methods are following the rules outlined in Robots.txt, using proxies and optimizing the user agent along with other HTTP headers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is it legal to crawl a website?
&lt;/h2&gt;

&lt;p&gt;Crawling and scraping websites is generally legal, but there are some exceptions you should know. The first one is copyrighted content. Such data is often marked in some way, for example, with the copyright symbol ©. Keep an eye out for these signs to avoid getting in trouble.&lt;/p&gt;

&lt;p&gt;Personal information is also off limits unless the owner explicitly states their consent. On social media sites, for instance, personal data is saved behind log-ins, so it is better to avoid crawling and scraping there altogether.&lt;/p&gt;

&lt;p&gt;Accessing publicly available, copyright-free, and non-personal data is the safest option. Always consider consulting a legal professional. Another great idea is to check out our in-depth &lt;a href="https://metrow.com/blog/is-web-scraping-legal/"&gt;blog post&lt;/a&gt; about this topic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can you be banned from scraping?
&lt;/h2&gt;

&lt;p&gt;You can get your IP address banned while scraping large amounts of data without any precautions. In some cases, websites can block registered accounts and browser fingerprints too. Luckily, no ban is permanent as you can change the IP addresses with a proxy and start all over again.&lt;/p&gt;

&lt;p&gt;So neither your web scraper nor the devices you use can be permanently shut out of web scraping. &lt;/p&gt;

&lt;h2&gt;
  
  
  Can websites detect scraping?
&lt;/h2&gt;

&lt;p&gt;Many websites can detect web scrapers and crawlers if you do not take any precautions. For example, crawling without a proxy and ignoring robots.txt will most likely arouse suspicion. There are some indications signaling that your actions have been detected.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased number of CAPTCHAs&lt;/li&gt;
&lt;li&gt;Slow website loading speed&lt;/li&gt;
&lt;li&gt;Frequent &lt;a href="https://metrow.com/blog/proxy-errors/"&gt;HTTP errors&lt;/a&gt; - 404 Not Found, 403 Forbidden, 503 Service Unavailable, and others&lt;/li&gt;
&lt;li&gt;IP address block&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The tips covered here will give you a head start against most websites, but anti-scraping tools will continue to evolve. Be sure to update your knowledge about what &lt;a href="https://metrow.com/blog/what-is-web-scraping/"&gt;web scraping&lt;/a&gt; is and the industry best practices. But no matter how much data collection changes, following Robots.txt, switching user agents, and using proxies will remain the most effective measures.&lt;/p&gt;

</description>
      <category>crawling</category>
      <category>proxies</category>
      <category>tips</category>
    </item>
    <item>
      <title>How to Test Proxies</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Mon, 12 Dec 2022 09:47:39 +0000</pubDate>
      <link>https://dev.to/himetrow/how-to-test-proxies-hen</link>
      <guid>https://dev.to/himetrow/how-to-test-proxies-hen</guid>
      <description>&lt;p&gt;Regardless of how you acquired your proxies, you should test them. Whether you bought the IPs from a proxy server provider or found a proxy list online, you should make sure they work properly. &lt;/p&gt;

&lt;p&gt;Most proxy providers offer a money-back guarantee if you’re not happy with the product, but it only lasts for a short period. That’s why you should check the proxies as soon as you buy them.&lt;/p&gt;

&lt;p&gt;Location, speed, and proxy uptime are the main factors to consider when testing. Testing will help you ensure the proxies are compatible with your software and allow targeting the locations that you need. Testing uptime will also help you avoid disappointment when you start working on data-gathering projects, whether it’s social media scraping or other tasks.&lt;/p&gt;

&lt;p&gt;You can test proxies with free or paid tools, IP checkers, or online IP databases. Different tools offer various features that will help you ensure your proxies are working correctly. &lt;/p&gt;

&lt;p&gt;Below you’ll find a list of different ways to test proxies, including free and paid proxy testers.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Online IP checkers
&lt;/h2&gt;

&lt;p&gt;The main function of a proxy is shielding your actual IP address and routing your internet traffic through a proxy server. This means your IP address should change when you connect to a proxy server. You can use an online IP checker to find out whether that’s the case.&lt;/p&gt;

&lt;p&gt;Most online IP checkers are free and they can identify whether you’re using a proxy or connecting through an actual IP address. To ensure this works properly, you need to know your original IP address. Once you visit an online IP checker’s site, it will display your IP address, and you should be able to recognize whether it’s a proxy IP address or your actual IP.&lt;/p&gt;

&lt;p&gt;Online IP checkers can also reveal essential information about your proxy, such as location, HTTP headers, and software that you’re using. &lt;/p&gt;

&lt;p&gt;While most IP checkers can identify whether you’re using a proxy or an actual IP address, in general, online IP checkers are pretty limited. If you want more robust proxy testing, you should use more advanced tools.&lt;/p&gt;
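&lt;p&gt;The comparison itself is trivial once you have both addresses. A sketch follows; fetch_reported_ip is a hypothetical helper that uses api.ipify.org as an example echo service, though any IP checker works:&lt;/p&gt;

```python
# Verifying a proxy by comparing your real IP with the one an echo service reports.
import json
import urllib.request

def fetch_reported_ip(opener=None):
    """Ask an IP echo service which address it sees; pass an opener
    configured with your proxy to test the proxied route."""
    opener = opener or urllib.request.build_opener()
    with opener.open("https://api.ipify.org?format=json", timeout=10) as resp:
        return json.loads(resp.read().decode())["ip"]

def proxy_is_working(real_ip, reported_ip):
    """The proxy works if the checker sees an address other than your own."""
    return reported_ip != real_ip
```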

&lt;h2&gt;
  
  
  2. IP databases
&lt;/h2&gt;

&lt;p&gt;IP databases contain lists of IP addresses and information related to them. You can use this data to learn about the IP address location, whether it’s connected to a domain, and the type of proxy, if it is one. &lt;/p&gt;

&lt;p&gt;Some proxies can be blacklisted from specific targets. They can be blocked by the proxy provider or by the target website. If you acquire free proxies, they are the most likely to be blacklisted on most of the popular sites. &lt;/p&gt;

&lt;p&gt;Testing proxies on an IP database will help you determine if your proxies have been blocked on any target websites. For example, IP2Location Database allows checking proxies to see if their IP addresses have been blacklisted. You can test a small number of IPs for free or purchase a paid plan if you need to verify even more.&lt;/p&gt;

&lt;p&gt;In general, an IP database can help you learn about your proxy location, associated domain name, if your IP is associated with one, and usage type, which identifies whether your IP comes from a data center or if it’s a residential proxy. &lt;/p&gt;

&lt;h2&gt;
  
  
  3. FOGLDN Proxy Tester
&lt;/h2&gt;

&lt;p&gt;FOGLDN Proxy Tester is a completely free proxy tester. This tool is great for monitoring proxy speed, as it gives direct ping times to any website in the world. With this tool, you’ll see how much time it takes to connect to a specific website. Additionally, FOGLDN Proxy Tester works with any type of proxy.&lt;/p&gt;

&lt;p&gt;Users often turn to FOGLDN Proxy Tester if they want to test the speed of their sneaker proxies since the tool monitors the latency of proxies. Sneaker copping requires ultra fast proxies, so this data can determine whether proxies are good enough for the task.&lt;/p&gt;

&lt;p&gt;FOGLDN Proxy Tester is easy to use. The tool works on devices with both Mac and Windows operating systems. All you need to do is download the software, set it up on your device and add a proxy list. Then, enter the URL of your target site and test the proxies.&lt;/p&gt;

&lt;p&gt;While FOGLDN Proxy Tester is a great choice for proxy speed testing, it also has some limitations. The tool doesn’t provide any information about the proxy type or IP address location. If your test fails, you won’t get much data about it, so you won’t know why exactly it failed. You also can’t control how many requests to send and at what intervals.&lt;/p&gt;

&lt;p&gt;If you want a thorough proxy check, use FOGLDN Proxy Tester together with other IP checkers, which can identify the proxy location, type, and other relevant information. &lt;/p&gt;

&lt;h2&gt;
  
  
  4. Hidemy.name
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F251s48mzpxpox9i80dj1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F251s48mzpxpox9i80dj1.jpg" alt="Hidemy.name" width="300" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hidemy.name is a free proxy checker powered by smart algorithms that can check your proxy performance and share various information about your IP addresses. The tool can determine your proxy type, location (country and city), speed, and level of anonymity. &lt;/p&gt;

&lt;p&gt;The tool features four degrees of anonymity:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No Anonymity&lt;/strong&gt; - you have zero protection over your identity. The server knows your actual IP address and that you’re using a proxy to shield it.&lt;br&gt;
&lt;strong&gt;Low Anonymity&lt;/strong&gt; - your actual IP address is hidden but the server knows you’re using a proxy.&lt;br&gt;
&lt;strong&gt;Average Anonymity&lt;/strong&gt; - the target server is aware you’re connecting through a proxy. The server sees an IP address and thinks it’s yours, but it’s not your accurate IP. &lt;br&gt;
&lt;strong&gt;High Anonymity&lt;/strong&gt; - the server cannot see your actual IP address, and it doesn’t know you’re connecting through a proxy.&lt;/p&gt;

&lt;p&gt;This proxy checker verifies proxy lists that it has gathered from private and public IP databases and various websites across the world. Hidemy.name is an advanced tool that allows you to check multiple proxy features and export the data that you need. &lt;/p&gt;

&lt;h2&gt;
  
  
  5. Proxy Verifier
&lt;/h2&gt;

&lt;p&gt;Proxy Verifier is a web-based IP checker. This tool doesn’t require installing any software. You can simply visit their site and check your IPs. To add to the legitimacy, this open-source project was initiated by Yahoo Developer’s Network. &lt;/p&gt;

&lt;p&gt;The main downside to this proxy checker is that it only supports HTTP proxies. However, contrary to other proxy testers, this tool checks both incoming and outgoing proxy traffic by simulating both client and proxy server devices. &lt;/p&gt;

&lt;p&gt;Proxy Verifier allows checking multiple proxies simultaneously. It also features a convenient and easy-to-use dashboard. So if you want to test HTTP proxies, this proxy checker may be the one to go to.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. InfoByIP
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatzponnqasnv17lhih5z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fatzponnqasnv17lhih5z.png" alt="InfoByIP" width="664" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;InfoByIP features a free proxy checker. The tool can verify the availability and anonymity of your proxies, and it supports HTTP, SOCKS4 and SOCKS5 proxies.&lt;/p&gt;

&lt;p&gt;The tool is web-based, which means you won’t need to install their software. All you need to do is enter your proxy in IP:Port format, choose its type, and click “Check Proxy”.&lt;/p&gt;

&lt;p&gt;InfoByIP is simple to use but also quite limited. However, if you only want to learn about proxy availability and anonymity, then this free checker will work just fine.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Angry IP Scanner
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpapo95csqoys5redoam4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpapo95csqoys5redoam4.png" alt="Angry IP Scanner" width="493" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Angry IP Scanner is an open-source network scanner that works across different platforms. It scans IP addresses, ports, and local networks. The tool is free, and it has a number of features.&lt;/p&gt;

&lt;p&gt;This tool can be used to test a proxy’s uptime, determine its location, port, etc. It features a command-line interface, which makes the tool easy to use. One of the main advantages of Angry IP Scanner is that it allows exporting data in various formats. &lt;/p&gt;

&lt;p&gt;The tool doesn’t require any installation, but you need to download it. According to Angry IP Scanner’s website, the tool has been downloaded more than 29 million times and is used by enterprises of various sizes, government agencies, and banks. &lt;/p&gt;

&lt;p&gt;Regardless of whether that is true or a marketing trick, the tool does its job and is easy to use.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. Nmap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvncdjr3g5pok13barcqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvncdjr3g5pok13barcqg.png" alt="Nmap" width="800" height="203"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nmap is a free and well-documented network mapping tool that can be used as a proxy checker. To test proxies, you’ll need to provide their IP addresses and ports, and the tool will return information about your proxies. You’ll be able to test your proxy speed and anonymity. &lt;/p&gt;

&lt;p&gt;The tool features advanced network mapping techniques and can handle large proxy networks. Nmap works on all the most popular operating systems and also supports less frequently used ones such as Amiga, NetBSD, Sun OS, etc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Users test proxies to learn about proxy speed, find out what locations they come from, and make sure they have a high enough uptime. Some proxy server providers encourage proxy testing and offer a money-back guarantee if their proxies don’t meet your standards. That’s one reason to test proxies, but a more important one is that testing helps your web scraping projects run more successfully. &lt;/p&gt;

&lt;p&gt;You can test proxies with a free or paid proxy checker. You’ll come across various tools online that will help you determine your proxy speed and anonymity. More advanced tools will let you test multiple proxies at once, while other tools will even indicate the degree of anonymity your proxies provide.&lt;/p&gt;

&lt;p&gt;Hence, testing proxies is easy when you find a good proxy tester. And it’s absolutely worth it, as IP performance can define how smoothly your web scraping projects will run.&lt;/p&gt;

</description>
      <category>sideprojects</category>
    </item>
    <item>
      <title>What is Browser Fingerprinting?</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Thu, 08 Dec 2022 13:33:42 +0000</pubDate>
      <link>https://dev.to/himetrow/what-is-browser-fingerprinting-1mn7</link>
      <guid>https://dev.to/himetrow/what-is-browser-fingerprinting-1mn7</guid>
      <description>&lt;p&gt;Browser fingerprinting is a method of online tracking and data collection using the information web browsers provide with their requests. Websites use it to increase security, detect bots, and compile visitors' digital identities, called “fingerprints”.&lt;/p&gt;

&lt;p&gt;It was named after the technique of human identification from our fingertips. But real-life fingerprints are only used to identify and charge criminals, while browser fingerprinting can track every visitor without regard to their intentions or consent.&lt;/p&gt;

&lt;p&gt;In this article, we will cover what is browser fingerprinting - how it works, whether it is legal, what techniques websites use, and how you can minimize the risks of unveiling your online identity through browser-based fingerprints.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does browser fingerprinting work?
&lt;/h2&gt;

&lt;p&gt;When you connect to a website, your browser sends an HTTP request specifying what data is needed and in what form. When the server responds, the website is loaded. Developers can insert additional JavaScript code asking the browser to provide software and hardware parameters.&lt;/p&gt;

&lt;p&gt;Usually, the information is needed to load the website correctly and adapt the experience to your device. But in the case of browser fingerprinting, these parameters are used to compile a digital identity of a visitor.&lt;/p&gt;

&lt;p&gt;The collected data can include anything from browser type to exact hardware specifications. Every parameter serves as a data point used for identification. More data means more accurate identification, as it becomes less likely that two users share the same browser fingerprint.&lt;/p&gt;

&lt;p&gt;With the exception of cookies, most of the data collected for user tracking doesn't require express consent. Fingerprinting techniques are integrated into the websites' own code, so disabling those scripts would break most sites, which lets websites avoid asking whether visitors agree to browser fingerprinting.&lt;/p&gt;

&lt;p&gt;The uniqueness of the visitor's software and hardware setup, however, can only be measured by having a library of already collected fingerprints. So, every visitor's fingerprint is compared with thousands of others to determine its uniqueness and track actions across the web.&lt;/p&gt;

&lt;p&gt;Collecting browser fingerprinting data in bulk is possible because all the parameters are hashed into a unique string of numbers and letters that identifies each user. Such an ID is used to tell whether a visitor has connected before, whether they have changed their parameters, and to raise red flags against unwanted users.&lt;/p&gt;

&lt;p&gt;In most cases, the complete browser fingerprint hash consists of two main parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browser hash.&lt;/strong&gt; Relies on visitor's browser data - browser type and version, installed plugins, User Agent, operating system, screen resolution, fonts used, etc. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Device hash.&lt;/strong&gt; Relies on the information about the hardware - CPU, GPU, MAC address, serial number and more. Such ID applies to the hardware configuration of the device.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these browser fingerprinting parts can fully identify visitors by themselves. Browser data can be shared by multiple visitors, and identical hardware can produce overlapping device hashes. Combining both parts, as well as adding cookie and IP address data, creates a higher chance for unique identification.&lt;/p&gt;
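&lt;p&gt;The hashing step described above can be sketched in a few lines of Python. The parameter names are hypothetical stand-ins for the dozens of data points a real fingerprinting script would collect:&lt;/p&gt;

```python
import hashlib
import json

def fingerprint_hash(parameters: dict) -> str:
    """Hash collected parameters into a short, stable visitor ID."""
    # Serialize with sorted keys so the same setup always yields the same hash.
    canonical = json.dumps(parameters, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Hypothetical data points; real scripts collect far more.
browser_part = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "screen": "1920x1080",
    "fonts": ["Arial", "Calibri", "Segoe UI"],
}
device_part = {"cpu_cores": 8, "gpu": "ANGLE (NVIDIA, ...)"}

# Combining both parts narrows the pool of visitors who share an ID.
visitor_id = fingerprint_hash({**browser_part, **device_part})
print(visitor_id)  # same setup -> same ID on every visit
```

&lt;p&gt;Because the serialization is deterministic, the same setup produces the same ID on every visit, while changing a single parameter yields a different one.&lt;/p&gt;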

&lt;p&gt;You can check how your web browser stands against tracking. Non-profit organizations, such as the Electronic Frontier Foundation, try to raise awareness of the threat that browser fingerprinting poses. They have built a browser testing tool for ordinary internet users to check if they can be tracked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is browser fingerprinting legal?
&lt;/h2&gt;

&lt;p&gt;Unfortunately, there aren't any effective legal tools against browser fingerprinting, so companies can legally engage in such a practice. At the time of writing, fingerprints are still considered public data, but there are some promising developments.&lt;/p&gt;

&lt;p&gt;In the European Union, the General Data Protection Regulation (GDPR) defines cookies as personal data as long as they are used for identification. Companies can process such data only once you provide your consent or the company has a legitimate interest (e.g., to prevent financial fraud).&lt;/p&gt;

&lt;p&gt;The United States doesn't have any national laws protecting its consumers from tracking, although there are initiatives in some states, such as the California Consumer Privacy Act. Much like GDPR, these laws only regulate cookies and do not address the issue of browser fingerprinting fully.&lt;/p&gt;

&lt;p&gt;A step further against online tracking is the EU's ePrivacy directive that expands on GDPR by specifying the definition of cookies and consent pop-ups. More importantly, the ePrivacy directive will address browser fingerprinting by regulating it similarly to cookies.&lt;/p&gt;

&lt;p&gt;The promised EU regulation will allow distinguishing between legitimate and illegitimate uses of browser fingerprinting. Hopefully, this law will come into effect soon, and other countries will follow its practices. As of now, websites can legally track users for a variety of reasons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Browser fingerprinting use cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Securing accounts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Websites are tasked with storing a lot of our personal data. If such data were leaked, a lot of damage could be caused to consumers. Account takeover (ATO) attacks are less likely when websites use browser fingerprinting as they can check the user's ID and implement extra verification measures if needed. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Targeted advertising&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A majority of websites use browser fingerprinting for targeted advertising. It goes a step further than traditional ads by personalizing content to fit the traits and interests of a group or an individual consumer. Such practice has some controversial applications. &lt;/p&gt;

&lt;p&gt;Ecommerce sites or ticket websites can change prices based on factors from your browser fingerprint. It can deny consumers access to fair prices and discriminate against certain groups or regions. Luckily, however, it’s mostly used to provide advertisements that would drive revenue for the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data brokers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data brokers are companies that process and profit from online data. They categorize enormous amounts of information from different sources and compile profiles of users and companies. Datasets are then sold to those who find them interesting.&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;Cybersecurity *&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some cybersecurity measures are possible only after fingerprinting visitors, as websites may differentiate incoming traffic and act against malicious traffic. For example, Distributed Denial-of-Service (DDoS) attacks aim to overload the server until it crashes and cannot serve regular users. These attacks are performed by bots, so differentiating them from real visitors can help reduce their impact.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the techniques used for fingerprinting?
&lt;/h2&gt;

&lt;p&gt;There is no single technique that websites use for fingerprinting. Instead, they exploit different aspects of how browsers work to create a unique fingerprint. We’ve outlined some of the more popular methods below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Browser sniffing
&lt;/h2&gt;

&lt;p&gt;Browser detection (or browser sniffing) techniques were created to determine parameters required for correctly loading web pages. For example, if a user visits from a mobile browser, a mobile version of the website should be loaded. &lt;/p&gt;

&lt;p&gt;The primary source of browser information is HTTP headers. The User-Agent header is the main one because it states the browser and OS to the server with every request. However, websites frequently collect other related data, such as browser history and installed extensions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Canvas
&lt;/h2&gt;

&lt;p&gt;Canvas fingerprinting is a powerful technique that uses HTML5 to force your browser into drawing an image. Depending on the hardware and software used, the task is completed differently. So it is possible to identify GPU(s), font settings, drivers, browsers and operating systems.&lt;/p&gt;

&lt;p&gt;The images the visitor's device is asked to draw require few resources and aren't visible in the website's interface. Usually it is something simple, such as a two-dimensional blank rectangle. Parameters for the canvas fingerprint are collected by applying overlays, anti-aliasing filters, fonts, and so on.&lt;/p&gt;

&lt;h2&gt;
  
  
  WebGL
&lt;/h2&gt;

&lt;p&gt;Web Graphics Library (WebGL for short) is a JavaScript API used for rendering graphic elements, usually interactive three-dimensional ones. Originally used for creating complex visualizations without additional plugins, WebGL can also help websites to identify and track visitors. &lt;/p&gt;

&lt;p&gt;Similarly to canvas fingerprinting, it rests on inferring device parameters from rendering images that aren't visible to the user. The technique is widespread, as all major browsers support WebGL, and it provides much the same parameters for fingerprinting as the HTML5 canvas.&lt;/p&gt;

&lt;h2&gt;
  
  
  Audio fingerprinting
&lt;/h2&gt;

&lt;p&gt;Instead of relying on how your GPU draws images, websites can instruct your audio card to play a sound. Sending a low-frequency note to the device reveals how it processes such a task. No audible tune is played to the visitor, so the website does not need sound permissions for audio fingerprinting.&lt;/p&gt;

&lt;p&gt;It is enough for the website to detect device-specific parameters used to process sound-related tasks. Such information reveals audio hardware, drivers and other specifications required for audio fingerprinting. All of them are later used to build a unique fingerprint of your setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Device fingerprinting
&lt;/h2&gt;

&lt;p&gt;Browser fingerprint is not limited to the machine you use to browse the web. It can also include IDs of media devices that are connected to it. Gadgets like headphones, microphones, and internal parts, such as video and audio cards, can be added to the device fingerprint for identifying the visitor.&lt;/p&gt;

&lt;p&gt;Media device fingerprinting is rare, as the visitor must grant access to their hardware. Most users don’t want to give websites such access, especially when it comes to cameras and microphones. Therefore, media device fingerprinting is implemented by services that require a lot of access to function (e.g., video conferencing platforms).&lt;/p&gt;

&lt;p&gt;Device fingerprinting can also refer to a broader term applied when no browser is involved. For example, mobile app developers use SDKs (Software Development Kits) supplied by OS developers for data collection. Device specifications are relatively easy to fingerprint when apps or programs are used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware benchmarking
&lt;/h2&gt;

&lt;p&gt;Websites can conduct benchmark tests to assess the hardware that the visitor uses. Most commonly, websites run cryptographic algorithms. Differences in performance may reveal the CPU model and other details.&lt;/p&gt;
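&lt;p&gt;A minimal sketch of such a benchmark, assuming nothing more than the standard library: timing a repeated hashing workload yields a number that varies between machines and can serve as one more data point.&lt;/p&gt;

```python
import hashlib
import time

def crypto_benchmark(rounds: int = 20000) -> float:
    """Time a repeated SHA-256 workload; relative timing differences
    between machines hint at the underlying CPU. Purely illustrative."""
    payload = b"benchmark-me"
    start = time.perf_counter()
    for _ in range(rounds):
        payload = hashlib.sha256(payload).digest()
    return time.perf_counter() - start

elapsed = crypto_benchmark()
print(f"{elapsed:.4f}s")  # varies with CPU model and load
```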

&lt;p&gt;Browser APIs can also be used to assess the state of the device's battery, namely its capacity and charge level. This is especially effective for fingerprinting older systems, as their batteries tend to have a unique wear pattern that can be used to ID them.&lt;/p&gt;

&lt;p&gt;Another rarely used but intrusive method rests on observing clock skew - the phenomenon of electrical signals arriving at different components at slightly different times, partly due to temperature changes. Measuring the clock skew range can tell a lot about the device and its software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cookies
&lt;/h2&gt;

&lt;p&gt;Strictly speaking, collecting cookies isn’t enough for browser fingerprinting, and the practice is losing relevance due to regulations and privacy settings. Still, cookies supplement fingerprinting as the first step in identifying users.&lt;/p&gt;

&lt;p&gt;Cookies are a client-side identification method based on small bits of data stored in the visitor's browser. Every time you visit a website, it sends you a cookie, and once you return, it can recognize you.&lt;/p&gt;

&lt;p&gt;Cookies were created for authentication and personalization purposes, but now they are also used for user tracking and advertising. They can help websites document your actions, and advertisers can even place their own third-party cookies for targeting purposes.&lt;/p&gt;
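&lt;p&gt;The mechanism can be sketched with Python's standard library; the cookie name below is made up for illustration:&lt;/p&gt;

```python
from http.cookies import SimpleCookie

# First visit: the server attaches an identifier via a Set-Cookie header.
set_cookie_header = "visitor_id=abc123; Path=/; Max-Age=31536000"

# Return visit: the browser sends the cookie back; the site recognizes the ID.
jar = SimpleCookie()
jar.load(set_cookie_header)
print(jar["visitor_id"].value)  # -> abc123
```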

&lt;h2&gt;
  
  
  IP address detection
&lt;/h2&gt;

&lt;p&gt;Monitoring IP addresses is another method that reinforces browser fingerprinting. An IP address is a string of numbers that uniquely identifies every device on the internet. Browsers send your IP address with their requests so the server knows where to return the data.&lt;/p&gt;

&lt;p&gt;The IP address also unveils your approximate geo-location and the name of the internet service provider. Such information enables targeting the website's content accordingly, for example, adapting the language to the visitor's country.&lt;/p&gt;

&lt;p&gt;However, IP addresses can be used for adjusting product prices and limiting availability to certain markets. Also, IP address blocks are the most common method of restricting access to the internet for specific subnets or locations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is cross-browser fingerprinting?
&lt;/h2&gt;

&lt;p&gt;Cross-browser fingerprinting is a method of tracking visitors across multiple browsers. Just like single-browser fingerprinting, it utilizes operating system and hardware detection by asking browsers to perform various tasks. The data is then shared with other websites or added to an online fingerprint library.&lt;/p&gt;

&lt;p&gt;Website owners and advertisers may share visitor data because it makes their browser fingerprinting efforts more effective. Cross-browser fingerprinting is justified as a security measure because it helps websites detect account breaches. However, it violates privacy by bringing unwanted personalization even if the visitor tries to avoid it by changing browsers.&lt;/p&gt;

&lt;p&gt;Since fingerprinting does not rely on data stored in the browser, internet users lose any say in whether they want to be tracked and whether websites should know where they visited previously. Cross-browser tracking is one of the main reasons to stop browser fingerprinting.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I stop browser fingerprinting?
&lt;/h2&gt;

&lt;p&gt;There is no ultimate solution to stop browser fingerprinting. The most effective ones require sacrificing the quality of your browsing experience, as some websites may limit usability or deny access entirely. But anti-fingerprinting techniques don't have to be perfect to make fingerprinting not worth the effort.&lt;/p&gt;

&lt;h2&gt;
  
  
  Disable JavaScript
&lt;/h2&gt;

&lt;p&gt;Disabling JavaScript in your browser settings is the most straightforward and most effective measure against browser fingerprinting. It restricts cookies, WebGL and canvas fingerprinting techniques, as well as most other APIs used to detect your device's parameters.&lt;/p&gt;

&lt;p&gt;Unfortunately, you will quickly notice that disabling JavaScript renders most websites unusable. You will lose speed and most functionality, or the websites will simply break. Few websites can do without JavaScript these days, and those that can are unlikely to track their visitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decrease browser uniqueness
&lt;/h2&gt;

&lt;p&gt;Disabling JavaScript is too extreme for most, so we turn to browser uniqueness. The more unique your browser is, the easier it is to distinguish it from the rest of the internet traffic. If you use an unusual browser that hasn't been updated in a while, that fact alone might be enough to create your fingerprint.&lt;/p&gt;

&lt;p&gt;Using a common browser, such as Google Chrome, is a good start but far from the only factor that can lower the uniqueness of your browser fingerprint. Here are some steps you can take:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Uninstall unnecessary extensions.&lt;/strong&gt; Browser add-ons can add useful functions, but it's better to delete those you can do without. Every plugin inserts additional code, which makes you stand out more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use ordinary language preferences.&lt;/strong&gt; Browsers can ask websites to load in a specific language. Most of the internet is in English, so it's best to keep your language settings to English to blend in.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Clear cookies and browsing history.&lt;/strong&gt; While it isn't an extraordinary measure against fingerprinting, regularly deleting cookies and browser history will make it harder to track you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Change privacy settings.&lt;/strong&gt; Major browsers block basic tracking methods by default, but it isn't enough. Sending a "Do Not Track" request and restricting permissions and ads takes a couple of clicks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use incognito mode.&lt;/strong&gt; It’s debatable whether incognito mode helps against browser fingerprinting. Still, if you don’t trust a website, it is better to disable at least some trackers with incognito mode.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scan against viruses.&lt;/strong&gt; Keeping your device safe should be a no-brainer. If you need an additional reason to fight malware, know that it can make it extremely easy to fingerprint you.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Use privacy add-ons
&lt;/h2&gt;

&lt;p&gt;Installing privacy plugins to your browser is another easy way to disable trackers and, with correct settings, decrease uniqueness. Ad blockers, such as Adblock Plus, can disable trackers and make the internet more visually pleasant and safe.&lt;/p&gt;

&lt;p&gt;More privacy-focused add-ons, like Trace, NoScript and Privacy Badger, can block aggressive trackers and scripts automatically. They also help manage your privacy settings manually. If you trust a website, you can add it to a whitelist and access it without blocking anything.&lt;/p&gt;

&lt;p&gt;The downside of using many plugins is that they might contribute to your browser's uniqueness. Don't overuse them and run fingerprint tests to know which ones are actually helping to fight browser fingerprinting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install anti-fingerprinting browser
&lt;/h2&gt;

&lt;p&gt;Instead of looking for add-ons, switching to a more privacy-focused browser might be a better option. Most such browsers disable trackers and block ads by default. They even take a step further by fighting browser fingerprinting directly. &lt;/p&gt;

&lt;p&gt;Brave is a free and open-source Chromium-based web browser that aims to increase security without decreasing usability. It protects from browser fingerprinting by randomizing the parameters you send while surfing the web.&lt;/p&gt;

&lt;p&gt;Brave makes your fingerprint unique for every browsing session and every website. Such an approach doesn't limit usability and is fairly effective against fingerprinting. Changing Brave's settings can provide even more protection but will limit the usability. &lt;/p&gt;

&lt;p&gt;Most other browsers and plugins are based on generalization - aiming to make all fingerprints identical to one another. To achieve it, they must mask the unique attributes of your browser, which does limit the usability of websites. However, it is usually more effective than randomization. &lt;/p&gt;

&lt;p&gt;Tor browser was the first to take the fight against fingerprinting seriously. It changes users' IP addresses, aggressively limits JavaScript, and generalizes your timezone along with many other parameters needed to collect fingerprints. Tor may be the best protection against browser fingerprinting, but you will sacrifice speed and convenience.&lt;/p&gt;

&lt;p&gt;However, ease of use isn’t the primary reason to choose Tor. It is an open-source, volunteer-based project that aims to fight internet restrictions and help activists around the globe. Stopping browser fingerprinting is just a part of this battle.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rotate your IP address
&lt;/h2&gt;

&lt;p&gt;Changing your IP address is one of the first lines of defense against browser fingerprinting. Even if you use a browser with randomization or privacy add-ons, the IP address from which the requests are sent will remain the same, so your browser fingerprint will be easy to compile.&lt;/p&gt;

&lt;p&gt;While there are many ways to change your IP address, rotating residential proxies are the most effective and versatile option. Residential IPs originate from ordinary internet service providers and physical devices, which allows them to blend in with the rest of the internet traffic better. &lt;/p&gt;

&lt;p&gt;Rotating such proxies means that every request, or every chosen time interval, gets a new residential IP address assigned automatically. In combination with other anti-fingerprinting measures, this makes tracking your actions extremely hard.&lt;/p&gt;

&lt;p&gt;Other methods of changing your IP address are usually applicable only for basic browsing tasks, while residential proxies can bring more to the table. They are easiest to integrate with specialized software (e.g., web scrapers and automation software) while not compromising speed and location targeting capabilities.&lt;/p&gt;
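&lt;p&gt;As a rough sketch, here is how a rotating gateway is typically wired into Python scraping code. The hostname, port, and credentials below are hypothetical placeholders, not a real endpoint:&lt;/p&gt;

```python
def proxy_config(user: str, password: str, host: str, port: int) -> dict:
    """Build a proxies mapping in the format expected by common HTTP
    clients such as requests. With a rotating residential gateway, each
    request through the same endpoint exits from a different IP."""
    gateway = f"http://{user}:{password}@{host}:{port}"
    return {"http": gateway, "https": gateway}

# Hypothetical gateway address, for illustration only.
proxies = proxy_config("user", "pass", "rotating.gateway.example", 8000)
# e.g. requests.get(url, proxies=proxies)  # new exit IP on each call
```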

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Browser fingerprinting is one of the newest and biggest threats to online privacy. With enough data points, websites can compile your fingerprint and identify you without your consent and across different browsers. Fingerprinting has some positive applications, but, in most cases, you don't want websites tracking you.&lt;/p&gt;

&lt;p&gt;Although there is no guaranteed solution to browser fingerprinting, it isn't a lost cause. More sophisticated anti-fingerprinting measures might appear in the future. For now, using a browser with randomization, decreasing uniqueness, and rotating residential IP addresses are the best tools available and will likely remain effective.&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>What Are The Most Common User Agents for Web Scraping</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Mon, 05 Dec 2022 08:56:52 +0000</pubDate>
      <link>https://dev.to/himetrow/what-are-the-most-common-user-agents-for-web-scraping-461c</link>
      <guid>https://dev.to/himetrow/what-are-the-most-common-user-agents-for-web-scraping-461c</guid>
      <description>&lt;p&gt;User agents are an important part of web scraping. In order to gather accurate and relevant information from the web, user agent strings have to be set right. User agents can define what information a target website sends to the user and how the content is displayed. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://metrow.com/use-cases/data-gathering/seo/" rel="noopener noreferrer"&gt;Data gathering for SEO&lt;/a&gt;, marketing, competitor monitoring, and other business cases requires careful preparation. Getting proxies, setting user agents, and bypassing blocks are essential for successful web scraping. &lt;/p&gt;

&lt;p&gt;Find out what a user agent is and why it’s so important for web scraping. Learn about the most common user agents and their types depending on different devices. After reading this article, you’ll be able to set your user agents and get the most accurate and relevant data that you need for your business.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a User Agent?
&lt;/h2&gt;

&lt;p&gt;Every browser has a user agent. It represents the user on the internet by providing information such as the browser, operating system, device type, and software versions. Providing this information manually every time you connect would be highly inefficient, so the browser sends it automatically with every request.&lt;/p&gt;

&lt;p&gt;For example, a user agent can look like this:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;User agents help websites adapt their content to different web browsers and operating systems.&lt;/p&gt;
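&lt;p&gt;A server can read platform hints straight out of that string. The naive sketch below is illustrative only; production sites rely on dedicated parsing libraries:&lt;/p&gt;

```python
import re

def classify_user_agent(ua: str) -> dict:
    """Naive sketch of server-side user agent sniffing: read the platform
    and browser version hints out of the string."""
    os_name = "Other"
    if "Windows NT" in ua:
        os_name = "Windows"
    elif "Android" in ua:          # checked before Linux: Android UAs contain "Linux"
        os_name = "Android"
    elif "Mac OS X" in ua:
        os_name = "macOS"
    elif "Linux" in ua:
        os_name = "Linux"
    match = re.search(r"Chrome/(\d+)", ua)
    return {"os": os_name, "mobile": "Mobile" in ua,
            "chrome_major": match.group(1) if match else None}

ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36")
print(classify_user_agent(ua))
# {'os': 'Windows', 'mobile': False, 'chrome_major': '105'}
```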

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb701zfwp5ea9f5l2x8n1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb701zfwp5ea9f5l2x8n1.jpg" alt="User_Agent" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Are User Agents Important?
&lt;/h2&gt;

&lt;p&gt;User agents are very important because they set browsers apart from each other and ensure that each user gets the content displayed correctly. &lt;/p&gt;

&lt;p&gt;User agent strings are included in the HTTP request headers your browser sends when it connects to a website. Identifying users based on their user agents allows websites to serve different versions of content through the same URL address. &lt;/p&gt;

&lt;p&gt;For example, once you enter a URL, the web server checks your user agent and provides you with the appropriate website. If you want to access the same site via your mobile device, you don’t have to enter a different URL. The same URL gives you different versions of a website on both mobile and computer browsers.&lt;/p&gt;

&lt;p&gt;To give you another example of why user agents are important, think about different image formats. A website can provide images in PNG and GIF formats and display one or the other depending on the user agent. The GIF version will be displayed to users with MS Internet Explorer versions that cannot show PNG images, while the PNG version will be displayed in more recent browsers.&lt;/p&gt;

&lt;p&gt;In short, without user agents, users wouldn’t get the content they expect.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are the Different Types of User Agents?
&lt;/h2&gt;

&lt;p&gt;User agent strings enable website servers to identify the devices (among other things) requesting online content. The user agent tells the website what device is visiting the site, and this information is then used to determine what content should be returned. &lt;/p&gt;

&lt;p&gt;Here’s a user agent list for different device types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Android User Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm01s4t1dt9gsmmmns38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm01s4t1dt9gsmmmns38.png" alt="Android User Agents" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Android mobile user agents depend on the device model. Android devices include Samsung, Sony, Nexus, and other phones running the Android OS. Since Android is based on the Linux kernel, the user agent will always contain Linux. For example, a user agent for a Samsung Galaxy S22 will look like this:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Linux; Android 12; SM-S906N Build/QP1A.190711.020; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/80.0.3987.119 Mobile Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;iPhone User Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucay23r45h2h3iaborqp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucay23r45h2h3iaborqp.png" alt="iPhone User Agents" width="216" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apple passes different information through its user agents. Unlike Android devices, Apple doesn’t use version numbers that would allow differentiating between different iPhone models. Here’s an example of an iPhone 13 Pro Max user agent:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (iPhone14,3; U; CPU iPhone OS 15_0 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19A346 Safari/602.1&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MS Windows User Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Microsoft Windows mobile devices also have their own user agents. For example, this could be the user agent of a Microsoft Lumia 650:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Windows Phone 10.0; Android 6.0.1; Microsoft; RM-1152) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Mobile Safari/537.36 Edge/15.15254&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tablet User Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tablet user agents depend on the OS your device is using and the model of your tablet. For instance, the user agent for a Sony Xperia Z4 tablet could look like this:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Linux; Android 6.0.1; SGP771 Build/32.2.A.0.253; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/52.0.2743.98 Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Desktop User Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81vai0ry2ijqdef2e1bp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F81vai0ry2ijqdef2e1bp.png" alt="Desktop User Agents&amp;lt;br&amp;gt;
" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Desktop user agents come in a myriad of combinations, depending on the device, OS, browser, and so on. Here’s what a user agent for a macOS-based computer connecting via Safari might look like:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  List of Most Common User Agents
&lt;/h2&gt;

&lt;p&gt;For web scraping, you’ll want to use the most common user agents. A server may return different information to one user agent string than to another. Here are the latest and most common user agents for different web browsers and operating systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome on Windows 10 User Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is currently the most popular user agent:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome on macOS User Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most common user agent on macOS:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome on Linux User Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The latest and most popular Linux user agent:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome on Android User Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The latest Android user agent with Chrome browser is as follows:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.5249.79 Mobile Safari/537.36&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These are currently the most common user agents, but the list may change. You can find a dynamic list of the most popular user agents online.&lt;/p&gt;
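&lt;p&gt;The rotation idea can be sketched in a few lines of Python. This is a minimal, standard-library-only illustration (the &lt;em&gt;random_headers&lt;/em&gt; helper is hypothetical, not part of any scraping library), using the user agent strings listed above:&lt;/p&gt;

```python
# A minimal sketch of user agent rotation for scraping (standard library only).
# The strings below are the common user agents listed above.
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

&lt;p&gt;Calling the helper before each request, rather than once per session, spreads your traffic across several browser identities.&lt;/p&gt;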

&lt;h2&gt;
  
  
  How To Change the User Agent?
&lt;/h2&gt;

&lt;p&gt;You can change your user agent string to make your requests appear to come from a different browser or device than the one you’re actually using. The exact steps depend on your browser. &lt;/p&gt;

&lt;p&gt;We’ll tell you how to change your user agent string on Chrome, the most commonly used web browser.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Click on the menu in your Chrome browser&lt;/li&gt;
&lt;li&gt;Go to “More tools” and then “Developer tools”&lt;/li&gt;
&lt;li&gt;In the Console tab, click on the menu (if you can’t see the Console, click on the menu and select “Show console”)&lt;/li&gt;
&lt;li&gt;Pick “Network conditions” and look for the “User agent” option&lt;/li&gt;
&lt;li&gt;Uncheck the “Select automatically” box and choose a user agent from the list&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you’re not happy with the list, you can set a custom user agent. However, your custom user agent string will only apply while “Developer tools” is open, and only to your current tab.&lt;/p&gt;
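&lt;p&gt;The DevTools approach only changes what your own browser sends. When making requests from a script, you can set the header directly. A minimal sketch with Python’s standard library (the URL is a placeholder):&lt;/p&gt;

```python
# Setting a custom User-Agent on a request with Python's standard library.
# The URL and user agent string here are illustrative.
import urllib.request

ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
req = urllib.request.Request("https://example.com/", headers={"User-Agent": ua})

# urllib stores header names capitalized, hence "User-agent" when reading back.
print(req.get_header("User-agent"))
```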

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Every internet user with a device and a browser has a user agent. This string helps the target website identify the client and return content adapted to it, such as the right image formats, language, and layout.&lt;/p&gt;

&lt;p&gt;When &lt;a href="https://metrow.com/use-cases/data-gathering/web-scraping/" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, you often need to collect information from various locations and for different devices. This information can be used for SEO, e-commerce purposes, or competitor monitoring. Customizing your user agent can help you get the data you need.&lt;/p&gt;

&lt;p&gt;You can change your user agent on Chrome by editing your “Developer tools” settings and either choosing a suggested user agent string or customizing it. &lt;/p&gt;

&lt;p&gt;User agents depend on device types and can vary greatly depending on your browser, language settings, operating system, software, etc. When gathering information from the web, you may want to use the most common user agents to make sure you get accurate and relevant information. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>discuss</category>
    </item>
    <item>
      <title>How to Find Someone's IP Address on TikTok</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Mon, 28 Nov 2022 10:39:45 +0000</pubDate>
      <link>https://dev.to/himetrow/how-to-find-someones-ip-address-on-tiktok-582l</link>
      <guid>https://dev.to/himetrow/how-to-find-someones-ip-address-on-tiktok-582l</guid>
      <description>&lt;p&gt;There are several ways to find someone’s IP address on TikTok. Some of them require some technical knowledge and knowing your way around computers and the TikTok app. Others require you to simply copy a suitable URL from the TikTok app and paste it into an IP grabber.&lt;/p&gt;

&lt;p&gt;We’ll go through the two primary methods of capturing IP addresses from TikTok and how they work. We’ll also give you the tools needed to prevent IP tracking if others were to do the same to you.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Use an IP grabber&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;There are tons of IP grabber applications out there, but Grabify is likely the simplest and most popular option. It’s also extremely easy to use due to the intuitive interface.&lt;/p&gt;

&lt;p&gt;Grabify works by creating a URL with a tracking code that’s used to collect data from people who click on it. As such, you’ll have to have a conversation or any other interaction with the person whose TikTok IP address you want to capture.&lt;/p&gt;

&lt;p&gt;Getting to the conversation is the hardest part. Once you’ve established a connection with the person, you can go to Grabify and input a URL. It can be anything the person would be interested in. After that, click “Create URL”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdyv9qoq0s0ro3x0wszv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvdyv9qoq0s0ro3x0wszv.png" alt="Grabify IP Logger" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You’ll have to agree to Grabify’s Privacy Policy and Terms of Service. Doing so will bring you to a new page with a changed URL. Click “Copy” to get the new URL to your clipboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bc0wn00nj4qoqtxybib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bc0wn00nj4qoqtxybib.png" alt="Grabify New URL" width="800" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, all it takes to grab someone’s TikTok IP address is getting them to click the link. The Grabify IP logger will do its magic. You’ll be able to see all IP addresses that have clicked on the link, with some additional information as a bonus.&lt;/p&gt;

&lt;p&gt;Additionally, if you close out of the tab, you can get back to the same IP tracking page by using the code assigned to the URL. If you input it into the home page, it will bring you back to the Link Information page.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F212gkwm821cvxqyhswgi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F212gkwm821cvxqyhswgi.png" alt="Grabify Tracking code" width="800" height="257"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqvjoyn0oqt2y2furlln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqqvjoyn0oqt2y2furlln.png" alt="Grabify IP Logger_Tracking code" width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s all it takes to use a TikTok IP finder!&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Using Command Prompt&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A more complicated way of doing things is to use internal network tracking software stored in each OS. We’ll be going through the Windows version of using Command Prompt (CMD) to track someone’s TikTok IP address.&lt;/p&gt;

&lt;p&gt;Note that this method only works when using TikTok on a PC; it won’t work on Android or iOS devices.&lt;/p&gt;

&lt;p&gt;First, you will need to open up Command Prompt, which can be done by using the Search function (clicking the Windows button also works) and typing in “cmd.exe”. Press enter to get the app started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz1rt1qn7gxmodohutmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxz1rt1qn7gxmodohutmu.png" alt="Command Prompt" width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then you’ll need to close as many apps as possible, except for CMD and TikTok. You can do most of the closing through Task Manager by quitting processes or by simply exiting out of applications the regular way.&lt;/p&gt;

&lt;p&gt;Once that is completed, you’ll need to get the person whose IP addresses you want to track on a call. You can also try connecting to an online chat, but it usually works best with video chats or calls.&lt;/p&gt;

&lt;p&gt;After that is done, type in “netstat -an” and you’ll receive a list of IP addresses to choose from. One of them will be the one you’re looking for. It works best if you have as few open applications as possible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx11auq1p7xy70byai683.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx11auq1p7xy70byai683.png" alt="Command Prompt_netstat -an" width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why would I need to find someone's IP address on TikTok?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most commonly, you’ll want to grab someone’s IP address if you want to verify their location or report them for some sort of misdemeanor. For example, someone might be pretending to be someone else and attempting to scam you. With an IP, you can check their location to verify claims.&lt;/p&gt;

&lt;p&gt;Additionally, if you suspect someone’s account might have been hacked, you could try grabbing the IP address. If the IP and location have changed significantly, it’s likely that another person is using the account, although this wouldn’t be conclusive proof.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How can I prevent my IP being tracked?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;While there are no methods for preventing IP tracking entirely, you can easily spoof your address. There are several tools that can be used to hide your IP, depending on how much anonymity is needed.&lt;/p&gt;

&lt;p&gt;One of the easier ways is to use a VPN. If you already have one, enabling it while having conversations or calls with others will change your IP address to the one of the server.&lt;/p&gt;

&lt;p&gt;Unfortunately, VPNs are notoriously slow. Video and audio calls might struggle to keep up, significantly reducing quality of service. Additionally, a truly savvy person might be able to figure out you’re using a VPN, since many VPN IP addresses are publicly known.&lt;/p&gt;

&lt;p&gt;The best option is to use a proxy service, namely &lt;a href="https://metrow.com/residential-proxies/" rel="noopener noreferrer"&gt;residential proxies&lt;/a&gt;. These are IPs acquired from household devices (just like yours), so the addresses aren’t public and are not associated with any company. Additionally, they will likely not reduce connection speeds in any way, allowing you to maintain high quality calls and TikTok usage.&lt;/p&gt;

&lt;p&gt;Since these IPs belong to regular household devices, even if someone were to track you, all they’d get is a realistic looking address and location. They might be tricked into thinking that’s your real IP address and location.&lt;/p&gt;

&lt;p&gt;Finally, it’s important to note that both of these options also prevent TikTok from logging your IP address. The platform, however, might use other ways to track you, so the protection isn’t as foolproof as against another regular user.&lt;/p&gt;

</description>
      <category>github</category>
      <category>githubcopilot</category>
      <category>productivity</category>
      <category>developers</category>
    </item>
    <item>
      <title>HTTP Headers for Web Scraping</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Tue, 22 Nov 2022 08:53:51 +0000</pubDate>
      <link>https://dev.to/himetrow/http-headers-for-web-scraping-4hcg</link>
      <guid>https://dev.to/himetrow/http-headers-for-web-scraping-4hcg</guid>
      <description>&lt;p&gt;Businesses benefit from collecting data and monitoring their competitors, which they often do through web scraping. While web scraping is crucial in making data-driven decisions, it can also be challenging. Improper use might cause you to get your IP address blocked or receive poor-quality data. &lt;/p&gt;

&lt;p&gt;There are many ways to optimize scraping operations. An often overlooked one is using HTTP headers. We will cover all you need to know about them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are HTTP Headers?
&lt;/h2&gt;

&lt;p&gt;HTTP (the Hypertext Transfer Protocol) is the foundation of data exchange on the web. It provides a standardized way for devices to communicate with each other while transferring data, defining how the client’s (e.g., a browser’s) request is constructed and how the server needs to respond. &lt;/p&gt;

&lt;p&gt;HTTP headers are invisible to end-users but are a part of every online data exchange. They enable the client and the server to send additional information within the request or a response. The data is organized according to HTTP headers, of which we can distinguish two primary types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Request headers&lt;/strong&gt; inform the server about the requested data or the client. For example, request headers can indicate the format of data the client needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response headers&lt;/strong&gt; carry information about the response or the server. For example, a response header can indicate the format of the data the server returns.&lt;/li&gt;
&lt;/ul&gt;
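&lt;p&gt;To make the request side concrete, here is roughly what a raw HTTP/1.1 request looks like with a few request headers attached. Real clients assemble this text for you; the sketch below builds it by hand for illustration:&lt;/p&gt;

```python
# Assembling a raw HTTP/1.1 request by hand to show where request headers live.
# Real clients (browsers, HTTP libraries) construct this text for you.
request_headers = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Accept": "text/html",
}

lines = ["GET / HTTP/1.1"]
for name, value in request_headers.items():
    lines.append(f"{name}: {value}")
raw_request = "\r\n".join(lines) + "\r\n\r\n"  # a blank line ends the header block

print(raw_request)
```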

&lt;p&gt;Besides enabling communication, HTTP headers bring various other benefits, such as optimizing performance, providing multilingual content, helping troubleshoot connection problems, and increasing security. For web servers, the latter means restricting bot-like actions that can overload the server. Unfortunately, web scraping often gets lumped in with bots, meaning bans are frequent. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is the role of HTTP Headers in Web Scraping?
&lt;/h2&gt;

&lt;p&gt;HTTP headers are essential in ensuring a smooth browsing experience for ordinary users. They inform the server what device is connecting to it and what data is needed. Therefore, looking for suspicious HTTP headers and blocking the related IPs is one of the most popular anti-scraping measures.&lt;/p&gt;

&lt;p&gt;If you want your web scraper to blend in and avoid blocks, HTTP headers must appear as if coming from regular internet users. Any issues or discrepancies can arouse suspicion, and the server may suspect that you are using a bot. &lt;/p&gt;

&lt;p&gt;HTTP headers may also help you take a step further and mask your bot as a new user for some requests. Sending too many of them as one user will alarm the server, so you should rotate between multiple request headers that don't stand out from the rest. &lt;/p&gt;

&lt;p&gt;HTTP headers also play a crucial role in defining the quality of data you retrieve. Incorrectly setting them up may result in poor data quality or a significant increase in the traffic needed for web scraping. &lt;/p&gt;

&lt;p&gt;In short, optimizing the most important headers decreases the chances of IP blocks and increases data quality. There are many HTTP headers, but you do not need to know them all - it is enough to start with the ones most relevant to web scraping. &lt;/p&gt;

&lt;h2&gt;
  
  
  What are the most important HTTP Headers for Web Scraping?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User-Agent header&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The User-Agent request header informs the server what browser and operating system version the client is using. Such information helps the server decide what layout to use and how to present data. It is the first obstacle you must bypass, because websites filter out requests with uncommon User-Agent headers.&lt;/p&gt;

&lt;p&gt;The most frequent mistake here is sending too many requests with the same User-Agent HTTP header. It raises suspicion for the server because regular internet users don’t send as many requests as bots do. Make sure to imitate multiple regular User-Agent headers and use the most popular ones to blend in.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Referer header&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Referer header (the misspelling of “referrer” is historical) informs the web server about the last web page visited before sending the request. Regular users rarely jump to random websites. Instead, they move from one website (e.g., a search engine) to another, and such navigation is reflected in the Referer header.&lt;/p&gt;

&lt;p&gt;It is an important but often overlooked HTTP header that allows imitating users better. Your scraper bot should reflect a reputable website as a source. Using a popular search engine could be a great option.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Referer: &lt;a href="http://www.google.com/"&gt;http://www.google.com/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cookie header&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cookie HTTP header includes stored cookies. The server sends these small blocks of data to the client and expects them back with the next request. Cookies enable the server to identify users and remember their actions (for example, log-in information or the content of a shopping cart). &lt;/p&gt;

&lt;p&gt;Visitors' privacy settings can block the cookie header, so it is optional. However, cookies are still advantageous when web scraping. If you use them correctly, you can mimic regular user behavior better or tell the server that you are a new user. Not using cookies properly might raise some red flags for the server.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cookie: sessionToken=123abc&lt;/em&gt;&lt;/p&gt;
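&lt;p&gt;A Cookie request header is just the stored name/value pairs joined together. A quick sketch with illustrative values:&lt;/p&gt;

```python
# Building a Cookie request header from stored name/value pairs.
# The cookie names and values are illustrative.
stored_cookies = {"sessionToken": "123abc", "lang": "en"}

cookie_header = "; ".join(f"{name}={value}" for name, value in stored_cookies.items())
print("Cookie:", cookie_header)  # Cookie: sessionToken=123abc; lang=en
```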

&lt;p&gt;&lt;strong&gt;Accept-language header&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The Accept-Language header informs the web server about the language the client prefers. This HTTP header is used to set the language of the webpage if the server can’t determine it by other means, such as the URL or the IP address location.&lt;/p&gt;

&lt;p&gt;Therefore, the Accept-Language header should align with all other information while web scraping. When it doesn’t correspond to the IP address or to the language requested in the URL, you risk your scraper bot getting banned. In addition, the Accept-Language header can help you appear more like a local visitor.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Accept-Language: en-US&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accept header&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Accept request header notifies the web server about the media types the client expects and can understand. The client lists the text, image, or video types it accepts, and the server uses content negotiation to select one.&lt;/p&gt;

&lt;p&gt;Choosing a suitable media type and ensuring a swift process makes for faster communication and better data delivery. Additionally, submitting unusual accept headers might arouse suspicion.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Accept: text/html, image/jxr&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Accept-Encoding header&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Accept-Encoding request header informs the web server which compression algorithms the client accepts, i.e., how the data may be compressed before being sent from the web server to the client. Usually, it’s a popular format such as gzip or Brotli.&lt;/p&gt;

&lt;p&gt;When the server compresses the data, less traffic is used and the client receives the data faster. Web servers aren’t always able to compress responses, but you should still send this HTTP header to reduce traffic and speed up data delivery while web scraping.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Accept-Encoding: gzip&lt;/em&gt;&lt;/p&gt;
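&lt;p&gt;When the server honors &lt;em&gt;Accept-Encoding: gzip&lt;/em&gt;, the body travels compressed and the client decompresses it. A small standard-library round trip shows why this saves traffic:&lt;/p&gt;

```python
# Round trip demonstrating why Accept-Encoding saves traffic:
# the server compresses the body, the client decompresses it.
import gzip

body = b"repeated page content " * 200
compressed = gzip.compress(body)            # what travels over the wire
decompressed = gzip.decompress(compressed)  # what the client works with

print(len(body), len(compressed))  # the compressed payload is far smaller
```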

&lt;p&gt;&lt;strong&gt;Host header&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Host HTTP header specifies the target server’s domain name and, optionally, its port number. If the port number is missing, the default is used (80 for HTTP URLs). The Host header is mandatory in all HTTP/1.1 requests.&lt;/p&gt;

&lt;p&gt;When there is no host header, or if it contains more than one host, your connection will be unsuccessful. Access will also be denied if this header is incorrect. Luckily, nowadays the header is almost always configured automatically, so leaving it be is enough. You can try tinkering with it, though.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Host: metrow.com&lt;/em&gt;&lt;/p&gt;
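&lt;p&gt;Putting the headers above together, a scraper’s request might carry a set like the following. All values are illustrative and should be tuned to the client you are imitating:&lt;/p&gt;

```python
# A browser-like request header set combining the headers discussed above.
# All values are illustrative; match them to the client you are imitating.
scrape_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
    "Referer": "https://www.google.com/",
    "Accept": "text/html",
    "Accept-Language": "en-US",
    "Accept-Encoding": "gzip",
    "Host": "metrow.com",
}

# Quick consistency check: these are exactly the headers covered above.
for name in ("User-Agent", "Referer", "Accept", "Accept-Language", "Accept-Encoding", "Host"):
    assert name in scrape_headers
```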

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Optimizing HTTP headers will surely improve your web scraping process, and you can start with the ones we mentioned here. Just remember that it is only a part of the puzzle. Other tools, such as proxies for web scraping, are equally necessary. With the right steps, no data will be out of your reach.&lt;/p&gt;

</description>
      <category>scraping</category>
      <category>beginners</category>
      <category>headers</category>
      <category>proxy</category>
    </item>
    <item>
      <title>What is MAP Monitoring?</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Fri, 18 Nov 2022 09:42:20 +0000</pubDate>
      <link>https://dev.to/himetrow/what-is-map-monitoring-508j</link>
      <guid>https://dev.to/himetrow/what-is-map-monitoring-508j</guid>
      <description>&lt;p&gt;MAP (Minimum Advertised Price) monitoring is the process of using automation software to find out whether retailers are following the guidelines established by the manufacturer. MAP violations can negatively affect the brand value by changing the perception of the consumer.&lt;/p&gt;

&lt;p&gt;By not following the suggested MAP price, ecommerce and retail vendors can make the manufacturer seem like a cheap brand that produces low quality products. Since nowadays there are so many retail and ecommerce stores, manually performing MAP monitoring is nearly impossible.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MAP?
&lt;/h2&gt;

&lt;p&gt;Minimum advertised price defines the lowest possible price at which a product can be sold. Manufacturers in some countries can set the minimum advertised price in contracts they have with vendors. Not following MAP might have some negative consequences to both parties.&lt;/p&gt;

&lt;p&gt;There are many arguments for why minimum advertised pricing should be implemented. One of them is based on brand perception - some brands produce high-quality products that are known to sport a hefty price tag. Retailers selling below the MAP price could change that perception, putting the manufacturer and the retailer at odds.&lt;/p&gt;

&lt;p&gt;Another argument stems from fairness. A minimum advertised price gives retailers a standardized price floor to compete on, making prices across the market more equal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do e-commerce suppliers need MAP monitoring?
&lt;/h2&gt;

&lt;p&gt;MAP monitoring serves several goals at once. By keeping track of prices across the market, suppliers can ensure fair competition and protect brand reputation. Additionally, it gives suppliers more control over the entire value chain.&lt;/p&gt;

&lt;p&gt;Other reasons to include MAP monitoring may include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fair competition across all distribution channels&lt;/li&gt;
&lt;li&gt;Protection of margins&lt;/li&gt;
&lt;li&gt;Control of the pricing&lt;/li&gt;
&lt;li&gt;Prevention of underpricing&lt;/li&gt;
&lt;li&gt;Allowing smaller companies to compete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, price monitoring tools allow companies to ensure that the agreements they have signed are being held up by both parties. There are a lot of pressures on ecommerce and retail vendors that might cause them to violate MAP, so keeping tabs on pricing data is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to monitor MAP?
&lt;/h2&gt;

&lt;p&gt;MAP monitoring is performed through web scraping tools that allow users to automatically collect data from online sources. These can also be called price monitoring tools as they essentially perform the same task.&lt;/p&gt;

&lt;p&gt;Price monitoring solutions are usually bots that automatically go through some URLs and download the data stored within. In these cases, the scripts run through ecommerce store product pages to collect all the price data available.&lt;/p&gt;

&lt;p&gt;Whenever web scraping is employed, proxies become a necessity. Even if a supplier has just cause to do monitoring for brand reputation protection purposes, most websites will have automatic anti-automation software systems in place. Since the website cannot know beforehand that it’s the supplier, they will often ban the offending IP address.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://metrow.com/residential-proxies/"&gt;Residential proxies&lt;/a&gt; are most often used to circumvent these issues. They allow users to connect to websites and servers through them and forward all requests on their behalf. Proxies then return the acquired data back to the user while only revealing its own IP address.&lt;/p&gt;

&lt;p&gt;As such, proxies make IP bans largely toothless: a user can switch to a new IP whenever one gets banned and continue collecting data uninterrupted.&lt;/p&gt;

&lt;p&gt;Automated MAP monitoring tools will often include &lt;a href="https://metrow.com/pricing"&gt;proxies&lt;/a&gt; either by default or allow users to integrate them easily. Running both of these tools in combination will allow businesses to collect enough pricing information on a continuous basis from as many retailers as necessary.&lt;/p&gt;

&lt;p&gt;That data would then be stored either in the cloud or locally and prepared for analysis. A MAP violation is easy to notice, as even the simplest data analysis software can create filters. All it takes is filtering for product prices below MAP to see whether there have been any violations.&lt;/p&gt;
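&lt;p&gt;The filtering step itself is trivial once the scraped prices are collected. A minimal sketch with made-up listings and a made-up MAP value:&lt;/p&gt;

```python
# Flag MAP violations by filtering scraped listings below the minimum advertised price.
# The listings and the MAP value are made up for illustration.
MAP_PRICE = 49.99

listings = [
    {"retailer": "Store A", "price": 52.00},
    {"retailer": "Store B", "price": 47.50},
    {"retailer": "Store C", "price": 49.99},
]

def map_violations(items, map_price):
    """Return listings advertised below the minimum advertised price."""
    return [item for item in items if item["price"] < map_price]

print(map_violations(listings, MAP_PRICE))  # [{'retailer': 'Store B', 'price': 47.5}]
```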

&lt;p&gt;Finally, with such data, manufacturers would be able to issue warnings to retailers who might not be following the minimum advertised price policy. As a result, having retailers follow contractual obligations is made much easier through the use of MAP monitoring software.&lt;/p&gt;

&lt;p&gt;Minimum advertised price monitoring might not be the most popular &lt;a href="https://metrow.com/use-cases/data-gathering/web-scraping/"&gt;web scraping use case&lt;/a&gt;, however, it’s one of the most important ones for manufacturers. Due to the proliferation of ecommerce and retail vendors online, manually following each of them to see if they are following MAP is nearly impossible.&lt;/p&gt;

&lt;p&gt;Luckily, nowadays there are plenty of web scraping solutions that can double as price monitoring software. With these tools, performing MAP monitoring automatically becomes a breeze as long as the provider can handle the volume of data extraction.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>proxy</category>
      <category>monitoring</category>
      <category>scraping</category>
    </item>
    <item>
      <title>What is a Headless Browser?</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Mon, 14 Nov 2022 15:42:46 +0000</pubDate>
      <link>https://dev.to/himetrow/what-is-a-headless-browser-1hj6</link>
      <guid>https://dev.to/himetrow/what-is-a-headless-browser-1hj6</guid>
      <description>&lt;p&gt;A headless browser is a web browser without graphic elements, a so-called Graphical User Interface (GUI). Headless browsers are run via command line or network communication. &lt;/p&gt;

&lt;p&gt;Running without the GUI means that a headless browser can still function as a regular browser and perform various tasks, such as uploading documents, presenting data, and contacting target websites. The main difference is that headless browsing happens at the backend and doesn’t involve any graphic display, such as pictures, icons, or similar visual elements.&lt;/p&gt;

&lt;p&gt;Headless browsers are mainly used for automated tests, data extraction, and various headless testing tasks. The main benefit of a headless browser is its ability to run quickly and without interfering with the front end of the target website. However, headless browsers also have some limitations that you’ll learn about in this article.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Headless Browser Used for?
&lt;/h2&gt;

&lt;p&gt;A headless browser has multiple use cases. We listed and explained the main ones:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Collection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Headless browsers are often used for data collection because they can help extract specific data points from the target site. They make &lt;a href="https://metrow.com/use-cases/data-gathering/web-scraping/"&gt;web scraping&lt;/a&gt; more efficient since headless browsers don’t need to load graphic elements. &lt;/p&gt;

&lt;p&gt;With a headless browser, you don’t need to write request chains one by one, and you can run JavaScript. &lt;/p&gt;
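&lt;p&gt;As a concrete example, Chrome itself can be driven headlessly from the command line. The sketch below only builds the command; &lt;em&gt;--headless&lt;/em&gt; and &lt;em&gt;--dump-dom&lt;/em&gt; are real Chrome flags, but the binary name varies by platform, so treat it as a template:&lt;/p&gt;

```python
# Build a command that asks Chrome to render a page headlessly and print its DOM.
# --headless and --dump-dom are Chrome's own flags; the binary name varies by OS
# (e.g. "google-chrome" on Linux, "chrome" on Windows).
import subprocess

def headless_dump_cmd(url, chrome_binary="google-chrome"):
    """Return the argv list for dumping a page's rendered DOM headlessly."""
    return [chrome_binary, "--headless", "--disable-gpu", "--dump-dom", url]

cmd = headless_dump_cmd("https://example.com/")
print(" ".join(cmd))
# To actually run it (requires Chrome installed):
# html = subprocess.run(cmd, capture_output=True, text=True).stdout
```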

&lt;p&gt;&lt;strong&gt;Layout Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A headless browser can help test website layout to ensure the front-end of a website looks as planned. Developers and designers use headless browsers to automate layout screenshots, test specific element color selections, AJAX execution, and perform various layout checks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automation Testing&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Developers use headless browsers to automate software maintenance and perform quality assurance tasks. For example, headless browser automation allows checking if submission forms on web apps work as intended.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Monitoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Headless browsers can help monitor website performance. Websites without the GUI load much faster and allow tasks without the user interface interaction to be tested via the command line. Pages don’t have to be refreshed manually, which saves time and effort. &lt;/p&gt;

&lt;h2&gt;
  
  
  What are the Advantages and Disadvantages of a Headless Browser?
&lt;/h2&gt;

&lt;p&gt;When it comes to the pros of headless browsers, here are the main advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speed.&lt;/strong&gt; Headless browsers are much quicker than regular browsers. They can process pages faster because they skip rendering and painting visual elements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency.&lt;/strong&gt; This is especially relevant for data extraction. Headless browsers can help collect specific data points from the target websites, for example, pricing data on an e-commerce website.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplicity.&lt;/strong&gt; Developers can save time by using headless browsers because they allow utilizing command lines. For example, this can be applied when testing code changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, headless browsers also have a few drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast loading can make it harder to notice and debug various issues.&lt;/li&gt;
&lt;li&gt;Headless browsing only reveals backend issues, which means that it may not catch potential issues at the front end of a website.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What are the Most Popular Headless Browsers?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w8I0_-HL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64nqpnysbdj911d1kyuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w8I0_-HL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/64nqpnysbdj911d1kyuw.png" alt="Image description" width="880" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A headless browser can be used for task automation&lt;/p&gt;

&lt;p&gt;Developers often test various headless browsers for different tasks because some of them perform better than others in specific testing cases. Trying out different browsers allows developers to find the right combination of tools for their tasks. &lt;/p&gt;

&lt;p&gt;Here are the most popular web browsers for headless use and their primary features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mozilla Firefox is a well-known web browser that has a headless mode. Headless Firefox is often paired with the Selenium framework and works well for headless testing.&lt;/li&gt;
&lt;li&gt;Google Chrome supports headless mode from version 59 onwards. In headless mode, the browser can create PDFs, capture screenshots, or print the Document Object Model (DOM).&lt;/li&gt;
&lt;li&gt;HtmlUnit is a GUI-less browser written in Java with JavaScript support. It can be used for website automation since it reduces the manual effort behind various user interactions with a website. For example, it can automatically test submission forms and website redirects.&lt;/li&gt;
&lt;/ul&gt;
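
&lt;p&gt;The Chrome tasks mentioned above (PDFs, screenshots, DOM dumps) can be run straight from the command line. A minimal sketch, assuming Chrome 59+ is installed and available as google-chrome; the output paths and URL are placeholders:&lt;/p&gt;

```shell
# Print a page to PDF
google-chrome --headless --disable-gpu --print-to-pdf=page.pdf https://example.com

# Capture a screenshot at a given viewport size
google-chrome --headless --disable-gpu --screenshot=page.png --window-size=1280,800 https://example.com

# Dump the rendered DOM (after JavaScript execution) to stdout
google-chrome --headless --disable-gpu --dump-dom https://example.com
```
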

&lt;h2&gt;
  
  
  What is Headless Browser Testing?
&lt;/h2&gt;

&lt;p&gt;Headless browsers are mainly used for web page testing. The reason for that is a headless browser’s ability to understand HTML content and interpret it as any regular web browser would. Headless browsers can perform tests without the usual GUI, which allows the software to test various website components while skipping visual element rendering.&lt;/p&gt;

&lt;p&gt;Testing in a headless environment enables users to run tests faster than with a regular browser. However, headless browser testing has a few limitations that are worth mentioning.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the Limitations of Headless Browser Testing?
&lt;/h2&gt;

&lt;p&gt;Headless browsing for website testing comes with a few risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Headless browsers highlight the bugs that appear in headless mode. Regular users will rarely run into these issues since they surf the web in a typical browser environment. It’s important to remember this limitation because some developers end up solving issues that are irrelevant to regular users.&lt;/li&gt;
&lt;li&gt;Headless testing loads websites quickly, which is often an advantage. However, fast loading can make it difficult to locate elements that fail inconsistently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These limitations can be mitigated by staying aware of the potential issues. Many users run their tests in several different browsers, which helps spot a wider range of bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now that you’ve learned what a headless browser is, you can see how it can make developers’ lives easier. These browsers run in the backend without interfering with the front end of a website and can quickly test various website features, such as redirects and submission forms.&lt;/p&gt;

&lt;p&gt;The main headless browser use cases include data extraction, automation, headless testing, and performance monitoring. There are various types of headless browsers, including the most popular regular browsers in a headless mode, such as Mozilla Firefox and Google Chrome. &lt;/p&gt;

&lt;p&gt;While these browsers have a number of benefits, they also have a few limitations that users should be aware of. For example, it’s essential to keep in mind that a headless browser only interacts with the backend of a website. Hence, it only flags bugs that are relevant to the backend and that do not necessarily affect the user experience.&lt;/p&gt;

</description>
      <category>headlessbrowser</category>
      <category>proxy</category>
      <category>webscraping</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Bypass Geo-Blocking without a VPN?</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Mon, 07 Nov 2022 09:26:19 +0000</pubDate>
      <link>https://dev.to/himetrow/how-to-bypass-geo-blocking-without-a-vpn-3f0</link>
      <guid>https://dev.to/himetrow/how-to-bypass-geo-blocking-without-a-vpn-3f0</guid>
      <description>&lt;p&gt;Virtual Private Network (VPN) has long been one of the most popular tools for accessing geo-restricted content. However, this tool is not the only solution on the market. Some users overpay just because they’re not aware of other options.&lt;/p&gt;

&lt;p&gt;You can bypass geo-blocking and access content that’s unavailable in your geographical location quickly and conveniently without using a VPN. But before we introduce the options, let’s review what geo-blocking is, why various media services use it, and whether it’s legal to bypass geo-restrictions. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Geo-Blocking?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--79dWd1LJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zzdsv4fq0v5q7iokfbdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--79dWd1LJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zzdsv4fq0v5q7iokfbdc.png" alt="Geo-Blocking" width="880" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Geo-blocking is a practice of blocking certain online content based on the user’s geographical location. The location is determined by identifying the user’s IP address, which is assigned to every internet-connected device by the Internet Service Provider (ISP). &lt;/p&gt;

&lt;p&gt;An IP address can pinpoint the user’s location rather accurately, sometimes down to the street level. The way to bypass geo-blocking is to shield the original IP address, and that’s exactly what a VPN does. However, a VPN is not the only way to hide an IP address.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Why is Geo-Blocking Used?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The main geo-blocking reasons are internet censorship and licensing agreements. Online broadcasting and streaming companies often have copyright agreements for their content that differ from country to country. &lt;/p&gt;

&lt;p&gt;For example, a popular online streaming service like Netflix may have the copyright to stream a certain film in the US, but it would require a separate agreement to broadcast the same movie in the UK. In some countries, the same film may not even have the legal rights to be streamed due to various censorship requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Is Bypassing Geo-Blocking Illegal?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Various countries have different laws when it comes to geo-restricted content. In general, it’s not illegal to bypass geo-blocking, but it may go against the terms and conditions of a service you’re using. &lt;/p&gt;

&lt;p&gt;However, if you’re bypassing geo-restrictions to access illegal content, then it may be considered a crime.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Access Geo-Restricted Content without a VPN?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A VPN is certainly not the only way to access geo-restricted content. Here are other effective ways that can help access internet content and services from various locations without leaving your home:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Use Proxy Servers&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Proxies work as intermediaries between the internet user and their target website. They shield the original IP address of the internet user and allow connecting to a website anonymously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://metrow.com/residential-proxies/"&gt;Residential proxies&lt;/a&gt; can come from any location in the world. If your proxy service provider has a large proxy pool, they can offer residential IPs even from the most remote corners. Meanwhile, datacenter proxies come from data centers, which can also be located around the world, but they cover far fewer locations than residential proxies. &lt;/p&gt;

&lt;p&gt;Proxy servers have many use cases and shielding an IP address is just one of them. They can also act as a firewall and add a layer of security by filtering content. &lt;/p&gt;

&lt;p&gt;However, free proxies from questionable sources may bring more harm than good. They’re shared between many users at the same time and may already be blocked on popular targets. Hence, it’s important to choose a reliable proxy service provider that offers high-quality proxies.&lt;/p&gt;
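
&lt;p&gt;In code, routing traffic through a proxy can look like this minimal sketch, assuming the Python requests library is installed; the proxy address and credentials are placeholders:&lt;/p&gt;

```python
# Sketch only: send a request through a proxy so the target site sees the
# proxy's IP address instead of yours. The proxy URL below is a placeholder.
import requests

PROXIES = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

def get_via_proxy(url, timeout=10):
    # requests routes both HTTP and HTTPS traffic through the proxy
    return requests.get(url, proxies=PROXIES, timeout=timeout)
```
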

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Install the Tor Browser&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tor is a popular internet browser created to let people browse the internet privately, without tracking, surveillance, or censorship. The browser is free to download and works on all the main operating systems, such as Windows, macOS, Linux, and Android. &lt;/p&gt;

&lt;p&gt;The Tor browser protects users by blocking the most famous plugins that can be manipulated into revealing the user’s IP address. The Tor Project was created to ensure people have free and secure access to the open web.&lt;/p&gt;

&lt;p&gt;Bypassing geo-blocking is one of the Tor browser’s features. It redirects user traffic through three servers, which also adds a robust layer of security against cybercriminals accessing sensitive data. &lt;/p&gt;

&lt;p&gt;However, the Tor browser can significantly slow down the internet connection. This can lead to slow loading and video buffering, so if you’re looking for a fast loading speed, keep in mind that Tor may not be able to offer that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Use SmartDNS&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you’re looking to hide your IP address and don’t want any extra security features, SmartDNS may be the best solution for you. &lt;/p&gt;

&lt;p&gt;SmartDNS diverts all your DNS queries through a remote server. Contrary to other tools, this one doesn’t encrypt traffic or change your IP address. Instead, SmartDNS shields your DNS, so the target doesn’t receive any information about your geographical location.&lt;/p&gt;

&lt;p&gt;Using SmartDNS has a few advantages. It doesn’t trigger automatic protection systems that could be activated by accessing sites from an unusual IP address. It also preserves your access to sites that are tied to your real IP address.&lt;/p&gt;

&lt;p&gt;Another advantage is that SmartDNS doesn’t slow down your internet connection, so you can access and consume content that requires fast loading.&lt;/p&gt;

&lt;p&gt;If you choose SmartDNS, make sure to research providers before purchasing their services. Some VPN services offer SmartDNS as an additional feature, but keep in mind that SmartDNS and a VPN cannot be used simultaneously. &lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Geo-blocking is a common issue that internet users run into. Various media streaming services and broadcasters have to restrict their content based on geographical locations due to copyright and licensing agreements, as well as censorship.&lt;/p&gt;

&lt;p&gt;While many users buy a VPN service to access geographically restricted content, that’s not the only solution. Proxy servers can also help bypass geo-blocking and access content available in any &lt;a href="https://metrow.com/faq/proxy-location/"&gt;location&lt;/a&gt;. In addition to that, proxies can also enhance the security and privacy of their users.&lt;/p&gt;

&lt;p&gt;The Tor browser is another solution that can help. It was created to ensure people can browse the internet securely and without being tracked. However, this browser is slow, which may be an issue when trying to load various geo-restricted sites or download their content.&lt;/p&gt;

&lt;p&gt;SmartDNS is a solution that solely covers the issue of geo-restrictions. It doesn't add any extra benefits for its users, such as improved security or anonymity. For that job, &lt;a href="https://metrow.com/"&gt;proxies&lt;/a&gt; are the best choice.&lt;/p&gt;

</description>
      <category>proxy</category>
      <category>geoblocking</category>
      <category>beginners</category>
      <category>vpn</category>
    </item>
    <item>
      <title>IP Rotation: How to Rotate an IP Address?</title>
      <dc:creator>Metrow</dc:creator>
      <pubDate>Tue, 25 Oct 2022 07:24:34 +0000</pubDate>
      <link>https://dev.to/himetrow/ip-rotation-how-to-rotate-an-ip-address-2cee</link>
      <guid>https://dev.to/himetrow/ip-rotation-how-to-rotate-an-ip-address-2cee</guid>
      <description>&lt;p&gt;One of the main techniques for reducing web scraper block rates is rotating IP addresses. This practice can increase success rates and make various data collection jobs much easier. Companies use rotating proxies for different use cases, including SEO monitoring and data analysis.&lt;/p&gt;

&lt;p&gt;This article will explain what an IP address rotation is and how it works. You’ll also find information about different IP rotation methods and learn what proxies are the best when you need rotating IPs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is IP Address Rotation?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IP address rotation means changing the IP address assigned to a device at random or scheduled intervals. &lt;/p&gt;

&lt;p&gt;An Internet Service Provider (ISP) assigns a single IP address to a device when a connection to the ISP is active. In case of a disconnection or reconnection, the ISP assigns the next available IP address. A rotating IP is also called a dynamic IP address.&lt;/p&gt;

&lt;p&gt;ISPs usually have many IP addresses at their disposal. When one user disconnects, the user’s latest IP address is returned to the pool of IP addresses, and the following available IP address is assigned to the user. IP rotation ensures that the existing ISP’s resources are being used at an optimal rate. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rotating proxy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Similarly to ISPs rotating user IP addresses, users can rotate their own IPs whenever needed by choosing rotating proxies. Rotating proxies come from a proxy server and enable users to have more control over various tasks, for example, web crawling or data scraping. &lt;/p&gt;

&lt;p&gt;To rotate proxies, a proxy server assigns a new IP from a pool of proxies for every connection or at set intervals. This means that if you're sending a hundred requests to any targets, you can get a hundred different IP addresses. &lt;/p&gt;

&lt;p&gt;Sending requests from different IP addresses helps appear as multiple organic users coming from multiple locations rather than bots. Proxy rotation reduces the chances of web scrapers getting blocked while gathering public data from the web. &lt;/p&gt;
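
&lt;p&gt;On the client side, per-request rotation can be sketched as cycling through a proxy pool so each request carries the next address; the proxy URLs below are placeholders:&lt;/p&gt;

```python
# Sketch only: rotate through a pool of proxies, one per request.
# The proxy URLs are placeholders for addresses from a real provider.
from itertools import cycle

PROXY_POOL = [
    "http://203.0.113.1:8080",
    "http://203.0.113.2:8080",
    "http://203.0.113.3:8080",
]

_rotation = cycle(PROXY_POOL)

def next_proxies():
    """Return a proxies mapping for the next request, advancing the rotation."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}
```

&lt;p&gt;Each call hands back the next address in the pool, wrapping around once the pool is exhausted, so consecutive requests never reuse the same IP until the whole pool has been used.&lt;/p&gt;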

&lt;p&gt;&lt;strong&gt;Why Rotate IP Address?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Companies and individual users rotate IP addresses for numerous reasons. However, the main reason for rotating IPs is to avoid IP blocks. Sending too many requests to the target from the same IP address can flag the IP as suspicious and get it blocked. This may interrupt or completely stop important data gathering jobs. &lt;/p&gt;

&lt;p&gt;Some of the most popular use cases for rotating proxies are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;SEO monitoring — companies that work with &lt;a href="https://metrow.com/use-cases/data-gathering/seo/"&gt;SEO monitoring&lt;/a&gt; use rotating IPs to check keyword rankings in different locations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data analysis — gathering public data helps companies make data-driven business decisions, track various trends, and monitor competitors. This data is collected by utilising rotating proxies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Geo-restrictions — one of the main reasons why people and companies use proxies is to bypass geo-restrictions. Rotating IPs allow bypassing these restrictions without getting blocked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://metrow.com/use-cases/data-gathering/web-scraping/"&gt;Web scraping&lt;/a&gt; — gathering data with a single static IP address is the shortest way to getting that IP blocked. IP rotation can ensure that scraping jobs don’t get interrupted.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Web crawling — similarly to web scraping, crawling also requires constant IP rotation to avoid IP blocks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Methods for Rotating IP Address&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;IP rotation has various methods that depend on how long you wish to keep the same IP address for. Here are the main IP rotation methods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pre-configured rotation&lt;/strong&gt; — Pre-configured or pre-set rotation allows rotating IPs at set intervals. Once the specified time passes, the user gets a new IP address assigned to them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Specific rotation&lt;/strong&gt; — Once the user’s connection to the ISP ends or refreshes, the ISP assigns a new rotating IP address to the user.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Random IP rotation&lt;/strong&gt; — Every request that the user sends comes from a new randomly assigned IP address. The user has no control over what IP gets assigned to them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Burst IP rotation&lt;/strong&gt; — ISPs rotate the IP addresses after every specific number of requests. For example, if the settings specify that IP should rotate after ten requests, the 11th request will be sent from a different IP. And then, after another ten requests, the IP will be changed again.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
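
&lt;p&gt;As an illustration of the burst method, this hypothetical client-side sketch keeps one IP for a fixed number of requests before moving to the next; the IPs and class name are placeholders:&lt;/p&gt;

```python
# Sketch only: burst IP rotation — keep one IP for `burst` requests, then
# switch to the next one in the list. IP addresses are placeholders.
class BurstRotator:
    def __init__(self, ips, burst=10):
        self.ips = list(ips)
        self.burst = burst   # requests to send before rotating
        self.count = 0       # requests sent so far

    def current_ip(self):
        # Integer division groups requests into bursts of `burst`
        return self.ips[(self.count // self.burst) % len(self.ips)]

    def mark_sent(self):
        self.count += 1

rotator = BurstRotator(["198.51.100.1", "198.51.100.2"], burst=10)
```

&lt;p&gt;With burst=10, requests 1–10 use the first IP and the 11th switches to the second, matching the example above.&lt;/p&gt;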

&lt;p&gt;&lt;strong&gt;Residential Proxies for Flawless IP Rotation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Residential proxies are the best choice for proxy rotation. These IPs appear as regular internet users because they come from an internet service provider. Users can connect to the residential proxy server and rotate proxies at various intervals or with every request.&lt;/p&gt;

&lt;p&gt;When it comes to rotating proxies, one of the main advantages of residential IPs is the size of the proxy pool. You can choose a proxy provider with a large pool of residential proxies and never worry about getting the same IP twice unless you want to. &lt;/p&gt;

&lt;p&gt;For example, Metrow has over 10M &lt;a href="https://metrow.com/residential-proxies/"&gt;residential proxies&lt;/a&gt; coming from everywhere around the world. This means that if you need to access content that’s only available in a very specific geographical area, you can find multiple IP addresses that will get you access to that location.&lt;/p&gt;

&lt;p&gt;Residential proxies are much harder to block because they appear as regular internet users, and target websites don’t want to block organic traffic. Rotating residential IPs can therefore ensure a much higher success rate than other types of proxies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dynamic IP addresses change with every new connection or at set intervals. This means that an internet service provider assigns the user a new IP address from a pool of multiple IPs.&lt;/p&gt;

&lt;p&gt;IP address rotation can help in various cases. Companies rotate IP addresses for SEO monitoring, public data gathering, web crawling, and other tasks. Proxy rotation ensures that users can access data that may not be available in their region due to geo-restrictions. Another reason for using rotating proxies is to reduce IP block rates.&lt;/p&gt;

&lt;p&gt;The best solution for rotating IPs is residential proxies. These proxies come from an internet service provider and appear as organic internet users. Therefore, they’re less likely to get blocked. &lt;/p&gt;

&lt;p&gt;When choosing a rotating proxy, make sure you work with a proxy provider that has a large proxy pool. Only then can you expect to get reliable IPs from various locations around the world. &lt;/p&gt;

</description>
      <category>iprotation</category>
      <category>ipadress</category>
      <category>proxy</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
