How to Unblock Websites When Scraping With Scrapeless Web Unlocker?

Web scraping is now widely used for data analysis, market research, and, of course, competitor analysis. However, you will inevitably run into crawling obstacles such as IP blocking, complex JavaScript rendering, and CAPTCHA verification. One key concern users often raise is: is scraping websites legal?

And how do you avoid being detected and blocked when performing automated work?

This article will show you the most effective and time-saving methods.

Start reading now!

Why do I keep getting blocked when scraping websites?

Before we jump into web scraping tips, you first need to know the common anti-bot measures you may encounter when scraping web data.

If you keep running into blocks, check the following 8 aspects:

1️⃣ Excessive Requests from the Same IP
Websites monitor traffic patterns and may block IP addresses that make too many requests in a short period. This is often done to prevent scraping and abuse.

2️⃣ IP Blacklisting
If your scraping activity is flagged as suspicious, the website may blacklist your IP. This can happen if you repeatedly access the site from the same IP address or use identifiable patterns of behavior that resemble a bot.

3️⃣ Use of Captcha
Many websites use CAPTCHA challenges to distinguish between human users and bots. If your scraper triggers a CAPTCHA challenge, it may be blocked until the CAPTCHA is solved.

4️⃣ JavaScript Rendering
Websites with complex JavaScript might hide or dynamically generate content. Traditional scraping methods struggle with this, resulting in incomplete or failed scrapes.

This is one of the most basic reasons websites block your scraper. How do you overcome the JavaScript rendering challenge? Don't worry, we will get to a solution later.

5️⃣ User-Agent Detection
Websites often check the "User-Agent" string to see if the request is coming from a real browser or a bot. Scraping tools that don't properly mimic a real browser can be detected and blocked.

6️⃣ Rate Limiting and Session Expiry
Websites may limit the number of requests you can make within a certain time frame, and your session may expire after a certain number of actions. Repeatedly hammering the website can lead to temporary or permanent blocking (a retry-with-backoff sketch follows this list).

7️⃣ Fingerprinting
Modern websites use browser fingerprinting techniques to detect automated scraping. This method tracks unique patterns like screen resolution, timezone, and other browser characteristics, making it easier for websites to identify and block scraping tools.

8️⃣ Geo-Blocking
Some websites restrict access based on the geographic location of the IP address. If you're scraping from a region that is not allowed access, you may encounter blocks.
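
As promised under point 6️⃣, here is a minimal sketch (not from the Scrapeless docs) of how a scraper can react to rate limiting: back off and retry whenever the server answers with HTTP 429. The URL and delay values are placeholders.

import time
import requests

def fetch_with_backoff(url, max_retries=5, base_delay=2):
    """Retry a request with exponential backoff whenever the server rate-limits us (HTTP 429)."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code != 429:
            return response
        # Honor Retry-After when present (assuming it holds seconds), otherwise back off exponentially
        wait = float(response.headers.get("Retry-After", base_delay * (2 ** attempt)))
        print(f"Rate limited, sleeping {wait:.0f}s before retry {attempt + 1}")
        time.sleep(wait)
    raise RuntimeError("Still rate-limited after all retries")

# Example usage against a harmless test endpoint
print(fetch_with_backoff("https://httpbin.io/get").status_code)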

Scrapeless Web Unlocker - The Best Solution for Scraping Websites

Scrapeless is not only a leading website unblocker but also a comprehensive web scraping solution.

As a powerful web unblocker, Scrapeless provides users with simplified and efficient HTML extraction services. With its advanced proxy selection technology and automatic unblocking mechanism, Scrapeless can easily bypass complex anti-crawler protection and help users quickly obtain the required data.

Why should we choose Scrapeless Web Unlocker?

⚙️ Efficient JavaScript rendering (JSRender)

Scrapeless's JSRender technology uses an advanced browser simulation rendering engine that can handle dynamic content loading in web pages in real-time. It is particularly suitable for modern websites that require JavaScript to generate content, such as dynamic pages, single-page applications (SPAs), etc.

Compared to traditional crawler tools, Scrapeless's JSRender can render complex data generated by JavaScript in a short time, which is very important for crawling content that requires interaction or dynamic updates (such as product detail pages on e-commerce websites). For example, when scraping product pages from Shopee, Amazon, and Lazada, Scrapeless can load and extract all dynamic data (such as price, inventory, reviews, etc.) without missing any important information.

🧩 IP Ban Bypassing
Scrapeless provides a built-in intelligent proxy pool that can automatically switch IPs to ensure a stable access experience. The proxy pool intelligently selects high-quality IP resources, so that even in large-scale crawling, it can still effectively bypass the website's IP blocking and restrictions, ensuring that the crawling task proceeds smoothly.

Users do not need to make any additional configurations. We ensure the highest degree of automation, saving a lot of development time and effort. Users can focus on business logic without worrying about IP blocking.

⚔️ Automatic CAPTCHA Solver
Scrapeless features an integrated CAPTCHA solver capable of handling image CAPTCHAs, text CAPTCHAs, and Google reCAPTCHA challenges. This ensures uninterrupted scraping sessions without manual intervention.

For those wondering whether scraping websites is legal: the answer depends on the website's terms of service and the nature of the data collection. While publicly available information is often fair game, ethical and legal considerations should always be factored in when conducting web scraping.

Scrapeless simplifies the process by automating bypass mechanisms, allowing businesses and developers to focus on extracting valuable insights efficiently.

How to use Scrapeless Web Unlocker?

  • Step 1. Log in to Scrapeless.
  • Step 2. Enter the "Web Unlocker".

Web Unlocker

  • Step 3. Configure the operation panel on the left according to your needs:

Configure the operation

  • Step 4. After filling in your target URL, Scrapeless will automatically crawl the content for you. You can see the crawling results in the result display box on the right. Select the language you need: Python, Golang, or Node.js, then click the icon in the upper right corner to copy the result.

select the language

This ensures that you can access any public website without interruption. It supports various crawling methods, excels at rendering JavaScript, and bypasses anti-bot protections, giving you the tools to browse the web effectively.

Or you can use our sample code below to integrate into your own project effectively:

  • Url: Target website
  • Method: Request method
  • Redirect: Whether to allow redirection
  • Headers: Custom request header fields

Python:

import requests
import json

url = "https://api.scrapeless.com/api/v1/unlocker/request"

payload = json.dumps({
   "actor": "unlocker.webunlocker",
   "input": {
      "url": "https://httpbin.io/get",
      "proxy_country": "US",
      "type": "",
      "redirect": False,
      "method": "GET",
      "request_id": "",
      "extractor": ""
   }
})
headers = {
   'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(response.text)

JavaScript:

var myHeaders = new Headers();
myHeaders.append("Content-Type", "application/json");

var raw = JSON.stringify({
   "actor": "unlocker.webunlocker",
   "input": {
      "url": "https://httpbin.io/get",
      "proxy_country": "US",
      "type": "",
      "redirect": false,
      "method": "GET",
      "request_id": "",
      "extractor": ""
   }
});

var requestOptions = {
   method: 'POST',
   headers: myHeaders,
   body: raw,
   redirect: 'follow'
};

fetch("https://api.scrapeless.com/api/v1/unlocker/request", requestOptions)
   .then(response => response.text())
   .then(result => console.log(result))
   .catch(error => console.log('error', error));

Go:

package main

import (
   "fmt"
   "strings"
   "net/http"
   "io/ioutil"
)

func main() {

   url := "https://api.scrapeless.com/api/v1/unlocker/request"
   method := "POST"

   payload := strings.NewReader(`{
    "actor": "unlocker.webunlocker",
    "input": {
        "url": "https://httpbin.io/get",
        "proxy_country": "US",
        "type": "",
        "redirect": false,
        "method": "GET",
        "request_id": "",
        "extractor": ""
    }
}`)

   client := &http.Client {
   }
   req, err := http.NewRequest(method, url, payload)

   if err != nil {
      fmt.Println(err)
      return
   }
   req.Header.Add("Content-Type", "application/json")

   res, err := client.Do(req)
   if err != nil {
      fmt.Println(err)
      return
   }
   defer res.Body.Close()

   body, err := ioutil.ReadAll(res.Body)
   if err != nil {
      fmt.Println(err)
      return
   }
   fmt.Println(string(body))
}

Alternative Solutions to Avoid Getting Blocked

1. IP Rotation

The first way a website detects a web crawler is by checking its IP address and tracking how it interacts with the site. If the server sees a strange pattern of behavior from "that user", or an impossibly high request frequency, it can block that IP address from accessing the website again.

To avoid sending all requests through the same IP address, you can use an IP rotation service (like Scrapeless's rotating residential proxies) to route your requests through a pool of proxies, hiding your real IP address while crawling the website. This will allow you to crawl most websites without being blocked.

Why use residential proxies? Because websites with stricter protection inspect your proxies more aggressively. A residential proxy makes your crawler's identity look more like a real user's, which makes your scraping efforts more stable.

Ultimately, using IP rotation, your crawler can make requests appear to come from different users and mimic the normal behavior of online traffic.
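
If you would rather wire this up yourself, here is a minimal sketch using Python's requests library; the proxy URLs below are placeholders, so substitute the endpoints your proxy provider gives you.

import random
import requests

# Placeholder proxy endpoints - substitute the ones supplied by your proxy provider
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_via_random_proxy(url):
    """Send each request through a randomly chosen proxy so traffic is spread across IPs."""
    proxy = random.choice(PROXY_POOL)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)

response = fetch_via_random_proxy("https://httpbin.io/ip")
print(response.text)  # Shows which exit IP the target site saw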

When using Scrapeless Proxies, our intelligent IP rotation system will leverage years of statistical analysis and machine learning to rotate your proxies as needed from data center, residential, and mobile proxy pools to ensure a 99.99% success rate.

Get your special Rotating Proxies for Free now!

2. Use a headless browser

The hardest websites to scrape may check subtle signals like web fonts, extensions, browser cookies, and JavaScript execution to determine whether a request is coming from a real user.

To scrape these websites, you may need to deploy your own headless browser (or let Scrapeless Scraping Browser do it for you!).

Headless browsers let you write a program that controls a real web browser, identical to the one a real user would use, which makes your traffic far harder to tell apart from genuine visits.
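
As a small illustration (assuming Playwright for Python is installed via pip install playwright and playwright install chromium, which is outside the scope of this article), the sketch below drives a real headless Chromium instance, lets it execute the page's JavaScript, and then reads the fully rendered HTML:

from playwright.sync_api import sync_playwright

# Launch a real (headless) Chromium, render the page, and grab the final HTML
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    html = page.content()  # HTML after JavaScript has executed
    browser.close()

print(html[:500])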

3. CAPTCHA solver provider

CAPTCHA solvers are third-party services that are commonly used for this. They work by using technologies like Optical Character Recognition (OCR), machine learning, or human solvers to solve CAPTCHA challenges automatically, allowing scraping bots to get past these blocks.

These tools enable continuous, automated scraping website activities by preventing disruptions caused by CAPTCHA verification. By mimicking human-like behavior or solving CAPTCHAs in real-time, they help avoid detection as bots and maintain a smooth scraping process.

However, it's important to consider the ethical and legal implications of using such tools, as they may violate website terms of service and privacy policies. Is scraping websites legal? It depends on the jurisdiction and the website's terms, so always ensure compliance with local laws and regulations. These services also tend to be priced on the higher side.
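
The usual integration pattern is: detect the CAPTCHA, submit it to the solving service, poll for the solution token, and send that token along with your request. The sketch below only illustrates that flow; the solver endpoint, its routes, and its response fields are hypothetical placeholders, not any real provider's API, so follow your provider's documentation instead.

import time
import requests

SOLVER_API = "https://captcha-solver.example.com/solve"   # hypothetical solver endpoint
TARGET_URL = "https://example.com/protected-page"         # placeholder target

def solve_recaptcha(site_key, page_url):
    """Submit a reCAPTCHA job to a (hypothetical) solver service and poll for the token."""
    job = requests.post(SOLVER_API, json={"site_key": site_key, "page_url": page_url}).json()
    while True:
        status = requests.get(f"{SOLVER_API}/{job['id']}").json()  # hypothetical polling route
        if status["state"] == "done":
            return status["token"]
        time.sleep(5)

# Send the solved token back in the field reCAPTCHA-protected forms expect
token = solve_recaptcha("SITE_KEY_FROM_PAGE", TARGET_URL)
response = requests.post(TARGET_URL, data={"g-recaptcha-response": token})
print(response.status_code)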

4. Set the real user agent

Setting a real User-Agent is a common method to avoid detection during scraping website activities. Websites often use the User-Agent header in requests to identify whether they come from a real user browser or an automated bot. By spoofing or randomizing the User-Agent, scraping scripts can appear as if they are coming from a regular user, reducing the chances of being detected.

How to implement it:

  • Spoof a Real Browser's User-Agent

Use common browser User-Agent strings (like Chrome, Firefox, Safari, etc.) to mimic the behavior of real users. For example, in Python, you can set a typical browser User-Agent header using the requests library:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
response = requests.get('https://example.com', headers=headers)

  • Rotate User-Agent Dynamically

Rotate different User-Agent strings to avoid detection by using a proxy pool or an API (such as random-user-agent). This makes it harder for websites to recognize scraping patterns based on a single User-Agent.
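
A simple way to do this without any external service is to keep a small pool of common User-Agent strings and pick one at random for each request; a minimal sketch follows (the strings below are only examples and should be refreshed periodically):

import random
import requests

# A few example User-Agent strings; in practice, keep this list larger and up to date
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]

response = requests.get(
    "https://httpbin.io/user-agent",
    headers={"User-Agent": random.choice(USER_AGENTS)},  # a different UA on each run
)
print(response.text)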

5. Combine Other Headers

Besides the User-Agent, you can also spoof other headers (like Referer, Accept-Language, etc.) to further mimic a real browser's request.

Generally, it's best to make it look like it's being accessed from Google.

You can do this with a header: "Referer": "https://www.google.com/"

This can make it even more challenging for websites to distinguish between automated requests and genuine user interactions.
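
Building on the earlier requests example, a fuller header set might look like the sketch below; adjust the values to match the browser profile you are imitating.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Referer": "https://www.google.com/",   # pretend we arrived from a Google search
    "Accept-Language": "en-US,en;q=0.9",    # a realistic browser language preference
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

response = requests.get("https://example.com", headers=headers)
print(response.status_code)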

6. Set a random interval between crawling requests

Websites often detect scraping activity based on the frequency and regularity of requests. If requests come in too quickly or in a predictable pattern, it's easier for websites to identify and block the scraper. By introducing random delays between requests, you can make your scraping behavior appear more natural.

You can use the time.sleep() function in Python to introduce delays between requests. By setting a random interval, you can vary the time between each request to make the behavior less predictable.

import time
import random
import requests

# Send a request with a random delay between 1 to 3 seconds
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://example.com'

for i in range(10):
    response = requests.get(url, headers=headers)
    print(f"Request {i+1} Status: {response.status_code}")
    time.sleep(random.uniform(1, 3))  # Random sleep between 1 and 3 seconds

7. Avoid Honeypot Traps

Many websites use invisible links to detect web crawlers, as only bots will follow them.

To avoid detection, check whether a link has the CSS properties "display: none" or "visibility: hidden." If either is set, do not visit the link! Following it would expose your crawler, letting the server identify your request attributes and block you.

Honeypots are a common method used by websites to spot crawlers, so be sure to perform this check on every page you scrape.

Moreover, some advanced webmasters might also set the link color to white (or match the background color), so it's worth checking for properties like "color: #fff;" or "color: #ffffff" to ensure the link remains invisible.
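
With BeautifulSoup, a simple filter over a page's links might look like the following sketch; note it only catches inline styles, so links hidden via CSS classes or external stylesheets would need extra handling.

import requests
from bs4 import BeautifulSoup

# Inline-style markers that usually indicate a hidden (honeypot) link
HIDDEN_MARKERS = ("display: none", "display:none",
                  "visibility: hidden", "visibility:hidden",
                  "color: #fff", "color:#fff", "color: #ffffff", "color:#ffffff")

html = requests.get("https://example.com").text
soup = BeautifulSoup(html, "html.parser")

safe_links = []
for a in soup.find_all("a", href=True):
    style = (a.get("style") or "").lower()
    # Skip links hidden via inline CSS - likely honeypots meant only for bots
    if any(marker in style for marker in HIDDEN_MARKERS):
        continue
    safe_links.append(a["href"])

print(safe_links)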

8. Clear the Browser Cache

To avoid getting blocked while scraping, it's important to clear the browser's cache and cookies between sessions, as stored state from previous interactions can help websites link your requests together and flag suspicious scraping activity. Here is one way to handle it:

Puppeteer: Use the Network.clearBrowserCache and Network.clearBrowserCookies commands, exposed through a Chrome DevTools Protocol session, to clear the cache and cookies between scraping sessions:

const puppeteer = require('puppeteer');

(async () => {
   const browser = await puppeteer.launch();
   const page = await browser.newPage();

   // Clear cache and cookies via the Chrome DevTools Protocol
   const client = await page.target().createCDPSession();
   await client.send('Network.clearBrowserCache');
   await client.send('Network.clearBrowserCookies');

   await browser.close();
})();

Time to Make Crawling Simple and Efficient!

Scrapeless Web Unlocker is a powerful tool that integrates an intelligent proxy pool, efficient JavaScript rendering (JSRender), and automatic CAPTCHA processing, designed to solve common problems in scraping websites. Scrapeless makes complex scraping tasks simple, efficient, and reliable.

If you want to break through the limitations of scraping and improve efficiency, whether it is dealing with complex dynamic pages or large-scale scraping tasks, Scrapeless Web Unlocker is your trustworthy solution.
