<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Scrapfly</title>
    <description>The latest articles on DEV Community by Scrapfly (@scrapfly_dev).</description>
    <link>https://dev.to/scrapfly_dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1062587%2Ff7fc57f0-9e47-4530-8449-8ad7085558df.jpg</url>
      <title>DEV Community: Scrapfly</title>
      <link>https://dev.to/scrapfly_dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/scrapfly_dev"/>
    <language>en</language>
    <item>
      <title>How to Optimize Oxylabs Proxies</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Thu, 08 May 2025 10:21:04 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/how-to-optimize-oxylabs-proxies-3a03</link>
      <guid>https://dev.to/scrapfly_dev/how-to-optimize-oxylabs-proxies-3a03</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1o3741yi2efbk6gba71.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1o3741yi2efbk6gba71.webp" alt="How to Optimize Oxylabs Proxies: A Complete Guide with Python and Scrapfly" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Proxies are indispensable tools for web scraping, data aggregation, and maintaining online anonymity. By routing internet traffic through intermediary servers, proxies mask users' IP addresses and facilitate access to geo-restricted content. Among the myriad of proxy providers, Oxylabs stands out for its robust infrastructure and extensive proxy pool.&lt;/p&gt;

&lt;p&gt;However, effectively leveraging Oxylabs proxies necessitates a clear understanding of their setup and optimization techniques. This guide delves into the essentials of Oxylabs proxies, from account creation to bandwidth optimization using Python and Scrapfly's Proxy Saver.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Proxies and Their Importance
&lt;/h2&gt;

&lt;p&gt;Proxies act as intermediaries between a user's device and the internet, playing a pivotal role in scenarios requiring anonymity, bypassing geo-restrictions, or managing multiple accounts. The primary types of proxies include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Datacenter Proxies&lt;/strong&gt;: Not affiliated with Internet Service Providers (ISPs), offering high speed and cost-effectiveness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Residential Proxies&lt;/strong&gt;: Sourced from real users' devices, providing higher anonymity and a lower likelihood of being blocked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ISP Proxies&lt;/strong&gt;: Combining the benefits of datacenter and residential proxies, offering both speed and legitimacy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Utilizing proxies is crucial for tasks like web scraping, where accessing large volumes of data without being blocked is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Oxylabs
&lt;/h2&gt;

&lt;p&gt;Oxylabs is a premium proxy service provider offering a vast pool of residential, datacenter, and mobile proxies. With over 100 million IPs globally, Oxylabs caters to businesses requiring reliable and scalable proxy solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Oxylabs Free Trial
&lt;/h3&gt;

&lt;p&gt;Oxylabs provides a free trial for its residential and datacenter proxies, allowing users to test their services before committing. This trial is particularly beneficial for businesses evaluating proxy solutions for their specific needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Your Oxylabs Proxy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Account Creation
&lt;/h3&gt;

&lt;p&gt;Begin by visiting &lt;a href="https://oxylabs.io" rel="noopener noreferrer"&gt;Oxylabs&lt;/a&gt; and signing up using your business email. Complete the verification process as prompted.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Accessing the Dashboard
&lt;/h3&gt;

&lt;p&gt;Upon successful registration, log in to your Oxylabs dashboard. Navigate through the dashboard to manage your proxies and monitor usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Generating Proxy Credentials
&lt;/h3&gt;

&lt;p&gt;Select the type of proxy (residential or datacenter) you wish to use. Choose your authentication method: either username/password or IP whitelisting. Note down your proxy endpoint and port for configuration.&lt;/p&gt;
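
&lt;p&gt;For example, with username/password authentication, those pieces combine into a single proxy URL. Here is a minimal sketch using placeholder credentials (substitute the real values from your dashboard):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Placeholder credentials - copy the real values from your Oxylabs dashboard
username = "USERNAME"
password = "PASSWORD"
endpoint = "dc.oxylabs.io"  # datacenter endpoint; yours may differ
port = 8000

proxy_url = f"http://{username}:{password}@{endpoint}:{port}"
print(proxy_url)  # http://USERNAME:PASSWORD@dc.oxylabs.io:8000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;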

&lt;h3&gt;
  
  
  4. Testing Your Proxy
&lt;/h3&gt;

&lt;p&gt;To verify your proxy setup, you can use the following cURL command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -k --proxy http://USERNAME:PASSWORD@dc.oxylabs.io:8000 https://httpbin.dev/anything

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command sends a request through the Oxylabs proxy and returns the response, confirming successful configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fetching Data Using Oxylabs Proxies
&lt;/h2&gt;

&lt;p&gt;Once your proxy is set up, you can use it to fetch data from websites. Here's an example using Python's &lt;code&gt;requests&lt;/code&gt; library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

url = "https://example.com/product-page"

headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept-Encoding": "gzip, deflate",
}

proxies = {
    "http": "http://username:password@dc.oxylabs.io:8000",
    "https": "http://username:password@dc.oxylabs.io:8000",
}

response = requests.get(url, headers=headers, proxies=proxies)
print(response.text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script fetches the content of the specified URL through the Oxylabs proxy, using headers to mimic a regular browser request.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Reduce Bandwidth Usage with Oxylabs Proxies
&lt;/h2&gt;

&lt;p&gt;Optimizing bandwidth usage is crucial when dealing with large-scale data scraping. Here are several techniques to minimize bandwidth consumption, each explained with a short rationale and example.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Set Lightweight Request Headers
&lt;/h3&gt;

&lt;p&gt;Send minimal request headers that ask the server for plain, compressed HTML rather than rich media formats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These headers request compressed, text-only output, shrinking responses from servers that honor content negotiation.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Use HEAD Requests for Validation
&lt;/h3&gt;

&lt;p&gt;HEAD requests are ideal when you only need to check if a page exists, as they return headers without a full page download.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = requests.head("https://example.com/page", proxies=proxies, headers=headers)
print("Status code:", response.status_code)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This avoids downloading the entire response body, saving bandwidth while confirming availability.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Disable Loading of Images and Scripts
&lt;/h3&gt;

&lt;p&gt;Blocking media and JavaScript resources can significantly reduce page load times and bandwidth usage when scraping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
prefs = {
    "profile.managed_default_content_settings.images": 2,
    "profile.managed_default_content_settings.javascript": 2
}
options.add_experimental_option("prefs", prefs)

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.page_source)
driver.quit()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This loads the page without images or script execution, drastically reducing the payload size.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Limit Pagination
&lt;/h3&gt;

&lt;p&gt;Instead of scraping thousands of pages, limit the number of pages to avoid excessive data retrieval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for page in range(1, 6):
    url = f"https://example.com/products?page={page}"
    response = requests.get(url, headers=headers, proxies=proxies)
    print(f"Page {page} status:", response.status_code)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Limiting pagination helps manage total request volume and reduces unnecessary bandwidth consumption.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Extract Specific Content
&lt;/h3&gt;

&lt;p&gt;Parse only the content you need from HTML responses to avoid processing or storing irrelevant data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from lxml import html

tree = html.fromstring(response.content)
titles = tree.xpath('//h2[@class="product-title"]/text()')
print(titles)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach focuses on extracting specific fields, improving memory efficiency and speed.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Utilize Query Parameters
&lt;/h3&gt;

&lt;p&gt;Take advantage of API or URL parameters to narrow results and minimize the returned dataset size.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;url = "https://example.com/api/search?query=laptop&amp;amp;limit=5"
response = requests.get(url, headers=headers, proxies=proxies)
print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This limits the server's response to a small, relevant subset, which is ideal for lean scraping operations.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Disable Redirects
&lt;/h3&gt;

&lt;p&gt;Avoid following multiple redirects, especially those used by CDNs and tracking systems, to cut down on extra HTTP requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;response = requests.get("https://example.com", headers=headers, proxies=proxies, allow_redirects=False)
print(response.status_code, response.headers.get("Location"))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This saves both time and bandwidth by halting at the initial response instead of continuing through redirection chains.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Implement Timeouts
&lt;/h3&gt;

&lt;p&gt;Set a short timeout to quickly drop stalled or slow requests that would otherwise waste bandwidth and delay scraping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    response = requests.get("https://example.com", headers=headers, proxies=proxies, timeout=5)
    print(response.status_code)
except requests.exceptions.Timeout:
    print("Request timed out")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures your scraping pipeline remains responsive and doesn't hang on slow-loading pages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhancing Proxy Efficiency with Scrapfly Proxy Saver
&lt;/h2&gt;

&lt;p&gt;Scrapfly's Proxy Saver is a middleware solution designed to optimize proxy usage by reducing bandwidth and improving stability. It offers features like automatic caching, fingerprint impersonation, and blocking of unnecessary resources.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

params = {
    "url": "https://example.com",
    "proxy": "oxylabs",
    "country": "us",
    "block_assets": "true"
}

headers = {"X-API-Key": "your_scrapfly_api_key"}

response = requests.get("https://api.scrapfly.io/scrape", headers=headers, params=params)
print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup routes your request through Scrapfly's Proxy Saver, which then uses your Oxylabs proxy, applying optimizations to reduce bandwidth usage.&lt;/p&gt;





&lt;h2&gt;
  
  
  Comparing Oxylabs and Bright Data
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Oxylabs&lt;/th&gt;
&lt;th&gt;Bright Data&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IP Pool&lt;/td&gt;
&lt;td&gt;100M+&lt;/td&gt;
&lt;td&gt;72M+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Free Trial&lt;/td&gt;
&lt;td&gt;5 datacenter IPs&lt;/td&gt;
&lt;td&gt;Limited usage quota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bandwidth Control&lt;/td&gt;
&lt;td&gt;Manual + Scrapfly Integration&lt;/td&gt;
&lt;td&gt;Requires proxy manager&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard UX&lt;/td&gt;
&lt;td&gt;Modern and intuitive&lt;/td&gt;
&lt;td&gt;Advanced but more complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Developer Tools&lt;/td&gt;
&lt;td&gt;Simple proxy strings, API docs&lt;/td&gt;
&lt;td&gt;Proxy Manager, APIs, CLI tools&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both providers are powerful, but Oxylabs' straightforward setup and compatibility with tools like Scrapfly make it an excellent choice for efficient, high-scale scraping.&lt;/p&gt;

&lt;p&gt;You can read our Bright Data optimization guide for a detailed walkthrough on tuning their proxies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/how-to-reduce-your-bright-data-bandwidth-usage/" rel="noopener noreferrer"&gt;How to Reduce Your Bright Data Bandwidth Usage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Learn the most effective ways to reduce Bright Data costs with bandwidth-saving techniques and streamlined proxy settings.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/how-to-reduce-your-bright-data-bandwidth-usage/" rel="noopener noreferrer" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v6j2mlnz0rriv5ww4mk.webp" alt="How to Reduce Your Bright Data Bandwidth Usage" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;
&lt;h3&gt;
  
  
  What’s the best way to test Oxylabs proxies?
&lt;/h3&gt;

&lt;p&gt;You can use tools like cURL or Python scripts to confirm connectivity. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -k --proxy http://USERNAME:PASSWORD@dc.oxylabs.io:8000 https://httpbin.dev/anything

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command routes your request through an Oxylabs datacenter proxy and shows your proxied IP in the response.&lt;/p&gt;
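
&lt;p&gt;The same check can be scripted in Python with the &lt;code&gt;requests&lt;/code&gt; library. This is a sketch with placeholder credentials; it simply reports whatever status the proxy returns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Placeholder credentials - substitute your own Oxylabs username and password
proxy_url = "http://USERNAME:PASSWORD@dc.oxylabs.io:8000"
proxies = {"http": proxy_url, "https": proxy_url}

def check_proxy(url="https://httpbin.dev/anything"):
    """Send one request through the proxy and report the outcome."""
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        return response.status_code, response.text
    except requests.RequestException as exc:
        return None, f"proxy check failed: {exc}"

status, body = check_proxy()
print(status)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A 200 status confirms the proxy is working; an authentication error (407) or connection failure points to bad credentials or a missing IP whitelist entry.&lt;/p&gt;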

&lt;h3&gt;
  
  
  Does reducing bandwidth affect data accuracy?
&lt;/h3&gt;

&lt;p&gt;Not when done correctly. Lightweight headers and content stubbing remove only non-essential assets such as ads, images, and scripts, leaving the core data intact.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I combine Oxylabs and Scrapfly in the same project?
&lt;/h3&gt;

&lt;p&gt;Yes, Scrapfly Proxy Saver acts as a proxy wrapper, allowing you to route Oxylabs traffic through their optimization layer for better efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this guide, you learned how to set up and optimize Oxylabs proxies for efficient web scraping. We explored the different types of proxies and how to configure them using Python and cURL. To reduce bandwidth, we covered eight practical strategies including lightweight headers, pagination control, asset blocking, and more.&lt;/p&gt;

&lt;p&gt;Finally, we introduced Scrapfly Proxy Saver, a powerful tool to enhance proxy performance through smart routing, fingerprint spoofing, and bandwidth optimization—integrating seamlessly with your Oxylabs setup.&lt;/p&gt;

&lt;p&gt;Whether you’re scraping thousands of product listings or just experimenting with proxy management, these best practices will help you stay efficient, cost-effective, and scalable.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Reduce Your Bright Data Bandwidth Usage</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Fri, 02 May 2025 13:33:12 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/how-to-reduce-your-bright-data-bandwidth-usage-o8n</link>
      <guid>https://dev.to/scrapfly_dev/how-to-reduce-your-bright-data-bandwidth-usage-o8n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v6j2mlnz0rriv5ww4mk.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6v6j2mlnz0rriv5ww4mk.webp" alt="How to Reduce Your Bright Data Bandwidth Usage" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://brightdata.com/" rel="noopener noreferrer"&gt;Bright Data&lt;/a&gt; is a top-tier proxy provider—but its bandwidth costs can escalate quickly if not carefully managed. Whether you're scraping product pages, monitoring SEO trends, or extracting social media data, excessive proxy traffic can burn through your budget. That’s why learning to monitor, optimize, and enhance your proxy setup is vital to efficient operations.&lt;/p&gt;

&lt;p&gt;This guide will walk you through reducing your Bright Data bandwidth usage by first optimizing proxy requests using plain Python, and then showing how to supercharge efficiency using &lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;Scrapfly Proxy Saver&lt;/a&gt;. We'll cover everything from understanding Bright Data's proxy types, to tuning your scripts, to applying advanced optimizations with minimal configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding and Creating a Bright Data Proxy
&lt;/h2&gt;

&lt;p&gt;Bright Data proxies come in several types—residential, datacenter, ISP, and mobile—each tailored for different scraping environments. Residential proxies mimic real users by routing requests through real devices, offering high stealth. Datacenter proxies offer better performance at a lower cost but are more detectable.&lt;/p&gt;

&lt;p&gt;To start using a Bright Data proxy, you first create a proxy zone; its credentials then form a proxy URL in the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://brd-customer-USERNAME-zone-ZONENAME:PASSWORD@brd.superproxy.io:PORT

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Steps to Create a Proxy Zone:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Log in to your &lt;a href="https://brightdata.com/" rel="noopener noreferrer"&gt;Bright Data dashboard&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Proxy Zones&lt;/strong&gt; and click &lt;strong&gt;Add Zone&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Choose the desired proxy type: Residential, Datacenter, ISP, or Mobile.&lt;/li&gt;
&lt;li&gt;Customize parameters such as rotation strategy, country targeting, and session persistence.&lt;/li&gt;
&lt;li&gt;Copy the generated credentials and use them in your scraping scripts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These proxy zones determine how your traffic is routed and how you're billed for bandwidth and requests. Understanding the differences between each type helps you choose the most cost-effective and appropriate one for your scraping goals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Bright Data Proxies in Python
&lt;/h2&gt;

&lt;p&gt;After creating your zone, you’ll receive a formatted proxy URL. You can use this with Python's standard &lt;code&gt;urllib&lt;/code&gt; module for basic requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import urllib.request

proxy = 'http://brd-customer-USERNAME-zone-ZONENAME:PASSWORD@brd.superproxy.io:22225'
url = 'https://scrapfly.io/proxy-saver'

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
)

try:
    response = opener.open(url)
    print(response.read().decode())
except Exception as e:
    print(f"Error: {e}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setup ensures that all HTTP and HTTPS requests are routed through your configured Bright Data proxy. However, each request will include full page payloads, images, and headers—leading to significant bandwidth usage if not controlled.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reducing Bandwidth in Python
&lt;/h2&gt;

&lt;p&gt;Python gives you granular control over your requests. Here's how you can reduce overhead before reaching for external tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reuse Connections with Sessions
&lt;/h3&gt;

&lt;p&gt;Using a &lt;code&gt;requests.Session()&lt;/code&gt; object maintains a persistent connection across multiple requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

session = requests.Session()
session.proxies.update({
    'http': proxy,
    'https': proxy
})

for url in ['https://scrapfly.io/proxy-saver', 'https://scrapfly.io/blog/how-to-optimize-proxies/']:
    response = session.get(url)
    print(len(response.content))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This significantly reduces connection establishment time and redundant TCP handshakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Request Less Data
&lt;/h3&gt;

&lt;p&gt;You don’t need every byte the server sends. Customize headers to request only HTML and enable compression:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html",
    "Accept-Encoding": "gzip"
}

response = session.get("https://scrapfly.io/proxy-saver", headers=headers)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cache Static Responses
&lt;/h3&gt;

&lt;p&gt;If you're visiting static or semi-static pages, cache responses locally:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os, hashlib

def get_cached_response(url):
    filename = f"/tmp/{hashlib.md5(url.encode()).hexdigest()}.cache"
    if os.path.exists(filename):
        with open(filename, 'rb') as f:
            return f.read()
    response = session.get(url)
    with open(filename, 'wb') as f:
        f.write(response.content)
    return response.content

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Caching can reduce bandwidth by up to 90% when working with rarely updated pages.&lt;/p&gt;
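
&lt;p&gt;To illustrate the pattern in isolation, here is a self-contained sketch that replaces the network call with a stub and counts real fetches; the second lookup for the same URL is served from disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib, os, tempfile

cache_dir = tempfile.mkdtemp()  # fresh directory so the demo starts cold
fetch_count = 0

def fake_fetch(url):
    """Stand-in for session.get(url).content - counts real downloads."""
    global fetch_count
    fetch_count += 1
    return b"page body for " + url.encode()

def get_cached(url):
    filename = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest() + ".cache")
    if os.path.exists(filename):
        with open(filename, "rb") as f:
            return f.read()
    content = fake_fetch(url)
    with open(filename, "wb") as f:
        f.write(content)
    return content

first = get_cached("https://example.com/static-page")
second = get_cached("https://example.com/static-page")
print(fetch_count)  # 1 - the second call was served from disk

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;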

&lt;h2&gt;
  
  
  Supercharge with Scrapfly Proxy Saver
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;Scrapfly Proxy Saver&lt;/a&gt; automates bandwidth-saving strategies without touching your codebase. It functions as a middleware between your scraping script and Bright Data, applying smart compression, routing, and stubbing on the fly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unlock Bandwidth &amp;amp; Latency Efficiency with Proxy Saver
&lt;/h3&gt;

&lt;p&gt;Proxy Saver is designed for scale. Its optimizations deliver more value as your traffic grows. Even simple scraping tasks benefit from reduced costs and faster responses.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Connection reuse to reduce TCP overhead&lt;/li&gt;
&lt;li&gt;Global public caching of common content&lt;/li&gt;
&lt;li&gt;Redirection and CORS caching&lt;/li&gt;
&lt;li&gt;Automatic blocking of telemetry and ad scripts&lt;/li&gt;
&lt;li&gt;Stubbing for large media like images and CSS&lt;/li&gt;
&lt;li&gt;Optimized TLS handshake and TCP connection pooling&lt;/li&gt;
&lt;li&gt;DNS pre-warming for quick domain resolution&lt;/li&gt;
&lt;li&gt;Failover and retry logic for higher reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these features are activated by default, but you can fine-tune behavior using parameters in the proxy username.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Integration with Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

proxy = {
    'http': 'http://proxyId-abc123-Timeout-10-FpImpersonate-chrome_win_130@proxy-saver.scrapfly.io:3333',
    'https': 'http://proxyId-abc123-Timeout-10-FpImpersonate-chrome_win_130@proxy-saver.scrapfly.io:3333'
}

response = requests.get('https://httpbin.dev/anything', proxies=proxy, verify=False)
print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuration Options
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;proxyId&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Required ID from your dashboard&lt;/td&gt;
&lt;td&gt;&lt;code&gt;proxyId-abc123&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;Timeout&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Request timeout in seconds&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Timeout-10&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;FpImpersonate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fingerprint of a real browser&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FpImpersonate-chrome_win_130&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DisableImageStub&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load full images instead of 1x1 pixel&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DisableImageStub-True&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DisableCssStub&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load real CSS files&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DisableCssStub-True&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;allowRetry&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable automatic retry on failure&lt;/td&gt;
&lt;td&gt;&lt;code&gt;allowRetry-False&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;intermediateResourceMaxSize&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Max resource size in MB&lt;/td&gt;
&lt;td&gt;&lt;code&gt;intermediateResourceMaxSize-4&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Combine multiple settings like: &lt;code&gt;proxyId-xyz-FpImpersonate-chrome_win_130-Timeout-8&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Passing Parameters to Bright Data
&lt;/h3&gt;

&lt;p&gt;Use the &lt;code&gt;|&lt;/code&gt; separator to pass downstream proxy config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;proxyId-abc123|country-us:API_KEY@proxy-saver.scrapfly.io:3333

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows full control over Scrapfly optimization and Bright Data zone behavior simultaneously.&lt;/p&gt;
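
&lt;p&gt;As a sketch, the combined username (hypothetical &lt;code&gt;proxyId&lt;/code&gt; and API key shown) plugs straight into a standard &lt;code&gt;requests&lt;/code&gt; proxy configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Hypothetical proxyId and API key - copy the real values from your Scrapfly dashboard
username = "proxyId-abc123-Timeout-10|country-us"
proxy_url = f"http://{username}:API_KEY@proxy-saver.scrapfly.io:3333"
proxies = {"http": proxy_url, "https": proxy_url}

try:
    response = requests.get("https://httpbin.dev/anything", proxies=proxies, verify=False, timeout=10)
    print(response.status_code)
except requests.RequestException as exc:
    print(f"request failed: {exc}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;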

&lt;h3&gt;
  
  
  Special Note on Rotating IPs
&lt;/h3&gt;

&lt;p&gt;If you're using Bright Data with session rotation, enable the "Rotating Proxy" mode in Scrapfly’s dashboard to ensure traffic patterns are preserved and connection optimizations are adjusted accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Proxy Types
&lt;/h2&gt;

&lt;p&gt;Choosing the right proxy type is just as important as using it efficiently. Each scraping scenario benefits from different proxy capabilities, and making the right selection can greatly impact your results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Residential Proxies
&lt;/h3&gt;

&lt;p&gt;Residential proxies use IP addresses provided by ISPs and linked to physical locations. They offer excellent authenticity and are ideal for accessing geo-blocked or sensitive content. However, they tend to be more expensive and should be used judiciously.&lt;/p&gt;

&lt;p&gt;You can check out our article about residential proxies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/top-5-residential-proxy-providers/" rel="noopener noreferrer"&gt;Top 5 Residential Proxy Providers for Web Scraping&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Comparison of top residential proxy providers for web scraping. Blocking rates, performance and general overview of what makes a good proxy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/top-5-residential-proxy-providers/" rel="noopener noreferrer" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Ftop-5-residential-proxy-providers_banner_light.svg" alt="Top 5 Residential Proxy Providers for Web Scraping" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Datacenter Proxies
&lt;/h3&gt;

&lt;p&gt;Datacenter proxies originate from cloud-based data centers. They are fast and cost-effective but easier to detect. They work well for non-sensitive, high-volume tasks where occasional blocks are tolerable.&lt;/p&gt;

&lt;p&gt;You can check out our article about datacenter proxies:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/the-best-datacenter-proxies-in-2025-a-complete-guide/" rel="noopener noreferrer"&gt;The Best Datacenter Proxies in 2025: A Complete Guide&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Explore the best datacenter proxies for 2025 including IPRoyal, shared vs dedicated options, and how to buy unlimited bandwidth proxies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/the-best-datacenter-proxies-in-2025-a-complete-guide/" rel="noopener noreferrer" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fdatacenter-proxies-white-bg.svg" alt="The Best Datacenter Proxies in 2025: A Complete Guide" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  How do I create a Bright Data proxy?
&lt;/h3&gt;

&lt;p&gt;You create a zone in the dashboard, select your proxy type, and configure settings like geo-targeting and session duration.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does Scrapfly Proxy Saver reduce bandwidth?
&lt;/h3&gt;

&lt;p&gt;It compresses data, stubs static content, and caches responses, which can cut data transfer by up to 30%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use Proxy Saver with Bright Data?
&lt;/h3&gt;

&lt;p&gt;Yes. Just plug your Bright Data proxy into the Proxy Saver dashboard and route traffic through Scrapfly's endpoint.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Controlling proxy bandwidth usage is crucial for keeping scraping operations efficient and affordable. Start by optimizing your Bright Data usage with smart Python practices—like connection reuse, selective content fetching, and local caching. Then, amplify those gains using Scrapfly Proxy Saver’s powerful middleware that automates compression, fingerprint impersonation, connection reuse, and more.&lt;/p&gt;

&lt;p&gt;Whether you’re scraping a few pages or handling millions of requests per day, these techniques ensure your proxy usage remains fast, efficient, and cost-effective.&lt;/p&gt;

</description>
      <category>proxies</category>
    </item>
    <item>
      <title>What is Rate Limiting? Everything You Need to Know</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Fri, 02 May 2025 13:33:03 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/what-is-rate-limiting-everything-you-need-to-know-1mli</link>
      <guid>https://dev.to/scrapfly_dev/what-is-rate-limiting-everything-you-need-to-know-1mli</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2up1b1j7jlryhcntbtz.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn2up1b1j7jlryhcntbtz.webp" alt="What is Rate Limiting? Everything You Need to Know" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rate limiting is a vital concept in APIs, web services, and application development. It controls how many requests a user or system can make to a resource within a set time frame, helping ensure system stability, fair access, and protection against abuse like spam or denial-of-service attacks.&lt;/p&gt;

&lt;p&gt;For both developers and beginners, understanding rate limiting is key to building secure and scalable systems. In this guide, we’ll cover what rate limiting is, why it matters, how it works, common algorithms, practical examples, and tips for implementing it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an IP Address?
&lt;/h2&gt;

&lt;p&gt;Before diving deeper into rate limiting, it is essential to understand what an IP (Internet Protocol) address is, as rate limiting often involves tracking IPs. An IP address is a unique identifier assigned to each device connected to a network that uses the Internet Protocol for communication. Think of it like a mailing address for your computer or smartphone.&lt;/p&gt;

&lt;p&gt;There are two main types of IP addresses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;IPv4:&lt;/strong&gt; The most commonly used format, consisting of four groups of numbers separated by dots (e.g., 192.168.1.1).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IPv6:&lt;/strong&gt; A newer format designed to accommodate the growing number of internet devices, using eight groups of hexadecimal numbers separated by colons (e.g., 2001:0db8:85a3:0000:0000:8a2e:0370:7334).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details, check out our article:&lt;/p&gt;


&lt;p&gt;What is the difference between IPv4 vs IPv6 in web scraping?&lt;/p&gt;

&lt;p&gt;IPv4 and IPv6 are two competing Internet Protocol versions that have different advantages when it comes to web scraping. Here's what they are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy46h0mj2uai2cjyplnbp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy46h0mj2uai2cjyplnbp.png" alt="What is the difference between IPv4 vs IPv6 in web scraping?" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://scrapfly.io/blog/ipv4-vs-ipv6-in-web-scraping/" rel="noopener noreferrer"&gt;https://scrapfly.io/blog/ipv4-vs-ipv6-in-web-scraping/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;IP addresses allow devices to find and communicate with each other across networks. In rate limiting, systems often monitor requests based on IP addresses to identify and control the source of traffic.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why is Rate Limiting Important?
&lt;/h2&gt;

&lt;p&gt;Without rate limiting, systems are vulnerable to overwhelming traffic that can slow down or crash services. Here are some essential reasons why rate limiting is important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Prevents Abuse:&lt;/strong&gt; Stops malicious users from spamming or overloading systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ensures Fair Use:&lt;/strong&gt; Guarantees that no single user can monopolize resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protects System Stability:&lt;/strong&gt; Maintains predictable and reliable service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhances Security:&lt;/strong&gt; Acts as a defensive mechanism against DDoS attacks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that you understand its importance, let’s dive into how rate limiting actually works.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Does Rate Limiting Work?
&lt;/h2&gt;

&lt;p&gt;Rate limiting monitors the number of requests from a user, IP address, or API key over a given time window (e.g., 100 requests per minute). If the threshold is exceeded, the system responds with an error code, often HTTP 429 (Too Many Requests).&lt;/p&gt;

&lt;p&gt;Rate limiters can be implemented at different layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Application Layer:&lt;/strong&gt; Code-level checks within the application.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway Layer:&lt;/strong&gt; Dedicated gateways like Kong, Apigee, or AWS API Gateway.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network Layer:&lt;/strong&gt; Firewalls and load balancers limiting by IP addresses.&lt;/li&gt;
&lt;/ul&gt;
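&lt;p&gt;On the client side, a well-behaved consumer should back off when it receives a 429 instead of hammering the endpoint. The sketch below uses only the Python standard library and honors the &lt;code&gt;Retry-After&lt;/code&gt; header; the retry count and fallback delay are illustrative choices, not fixed conventions:&lt;/p&gt;

```python
import time
import urllib.request
from urllib.error import HTTPError

def retry_after_seconds(headers, default=1.0):
    """Parse the seconds form of a Retry-After header, with a fallback."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        return float(value)
    except ValueError:
        return default  # the HTTP-date form is also valid; ignored here

def fetch_with_backoff(url, max_attempts=3):
    """Fetch a URL, sleeping as instructed whenever the server answers 429."""
    for attempt in range(max_attempts):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code != 429:
                raise  # other errors are not rate-limit related
            time.sleep(retry_after_seconds(err.headers))
    raise RuntimeError("rate limit not lifted after retries")
```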
&lt;h2&gt;
  
  
  Common Rate Limiting Algorithms
&lt;/h2&gt;

&lt;p&gt;Understanding different rate limiting algorithms helps developers choose the best strategy for their application. Here are a few common ones:&lt;/p&gt;
&lt;h3&gt;
  
  
  Token Bucket
&lt;/h3&gt;

&lt;p&gt;The token bucket algorithm adds tokens to a bucket at a fixed rate, up to a set capacity. Each request "spends" a token: if a token is available, the request is allowed; otherwise it is rejected until the bucket refills.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Simple Python simulation of Token Bucket
class TokenBucket:
    def __init__ (self, capacity, refill_rate):
        self.capacity = capacity
        self.tokens = capacity
        self.refill_rate = refill_rate

    def allow_request(self):
        if self.tokens &amp;gt; 0:
            self.tokens -= 1
            return True
        return False

    def refill(self):
        self.tokens = min(self.capacity, self.tokens + self.refill_rate)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, &lt;code&gt;allow_request&lt;/code&gt; checks if tokens are available, and &lt;code&gt;refill&lt;/code&gt; simulates token regeneration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leaky Bucket
&lt;/h3&gt;

&lt;p&gt;The leaky bucket algorithm treats incoming requests like water poured into a bucket with a small hole at the bottom. Water (requests) leaks at a constant rate, regardless of the inflow rate. If too much water is poured at once and the bucket overflows, incoming requests are discarded. This method ensures a consistent, controlled output rate, smoothing traffic bursts and preventing system overload.&lt;/p&gt;
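&lt;p&gt;A minimal Python sketch of the leaky bucket idea, in the same spirit as the token bucket simulation above (the capacity and leak rate are illustrative):&lt;/p&gt;

```python
import time

class LeakyBucket:
    """Requests fill the bucket; it drains ("leaks") at a constant rate."""
    def __init__(self, capacity, leak_rate):
        self.capacity = capacity    # maximum queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.water = 0.0
        self.last_leak = time.monotonic()

    def _leak(self):
        # drain water proportionally to the time elapsed since the last check
        now = time.monotonic()
        self.water = max(0.0, self.water - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now

    def allow_request(self):
        self._leak()
        if self.capacity - self.water >= 1:  # room for one more request?
            self.water += 1
            return True
        return False  # bucket overflow: request discarded
```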

&lt;h3&gt;
  
  
  Fixed Window
&lt;/h3&gt;

&lt;p&gt;Fixed window rate limiting divides time into equal segments (like 1-minute windows). It counts the number of requests in the current window and blocks requests that exceed the limit. For instance, a limit of 1000 requests per minute resets at the beginning of every minute. Although simple to implement, it may allow traffic spikes at window boundaries, causing short-term bursts.&lt;/p&gt;
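&lt;p&gt;A fixed window counter can be sketched in a few lines of Python (the limit and window length are illustrative):&lt;/p&gt;

```python
import time

class FixedWindow:
    """Count requests per time window; reset when the window rolls over."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.count = 0

    def allow_request(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now  # new window: reset the counter
            self.count = 0
        if self.count >= self.limit:
            return False  # over the limit for this window
        self.count += 1
        return True
```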

&lt;h3&gt;
  
  
  Sliding Window Log
&lt;/h3&gt;

&lt;p&gt;Sliding window log is a more accurate but resource-intensive method. It keeps a timestamped log of every request and continuously checks how many requests occurred within a moving time frame (e.g., the last 60 seconds). When a new request arrives, the system purges old timestamps and decides based on the updated log. This provides smoother traffic management and avoids sudden spikes seen in fixed windows.&lt;/p&gt;
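&lt;p&gt;The sliding window log can be sketched with a timestamp queue; old entries are purged on every request (the limit and window length are illustrative):&lt;/p&gt;

```python
import time
from collections import deque

class SlidingWindowLog:
    """Keep one timestamp per request; purge entries older than the window."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps of recent requests, oldest first

    def allow_request(self):
        now = time.monotonic()
        # drop timestamps that have slid out of the window
        while self.log and now - self.log[0] >= self.window:
            self.log.popleft()
        if len(self.log) >= self.limit:
            return False
        self.log.append(now)
        return True
```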

&lt;h2&gt;
  
  
  Practical Examples of Rate Limiting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example 1: API Usage
&lt;/h3&gt;

&lt;p&gt;A public API like GitHub's API uses rate limiting to prevent abuse. For instance, unauthenticated users might be limited to 60 requests per hour, while authenticated users can have higher limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2: Login Systems
&lt;/h3&gt;

&lt;p&gt;Login endpoints implement rate limiting to prevent brute-force attacks. For instance, a system might allow 5 login attempts per IP address every 10 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices for Implementing Rate Limiting
&lt;/h2&gt;

&lt;p&gt;Implementing rate limiting effectively requires thoughtful planning to balance user experience, system performance, and security. Below are some best practices to guide you.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Return Clear Error Messages:&lt;/strong&gt; Include “Retry-After” headers when blocking requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Different Limits for Different Users:&lt;/strong&gt; Offer higher limits for authenticated or premium users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and Alerts:&lt;/strong&gt; Track rate limit events and trigger alerts if thresholds are consistently exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful Degradation:&lt;/strong&gt; Allow limited access instead of outright blocking whenever possible.&lt;/li&gt;
&lt;/ul&gt;
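&lt;p&gt;To illustrate the first practice, here is one way a server might build a 429 response with a &lt;code&gt;Retry-After&lt;/code&gt; hint; the JSON body shape is our own choice for the example, not a standard:&lt;/p&gt;

```python
import time

def rate_limit_response(window_start, window_seconds):
    """Build a 429 response advertising when the client may retry.

    Returns (status, headers, body); window_start comes from the limiter.
    """
    remaining = window_start + window_seconds - time.monotonic()
    retry_after = max(1, int(remaining) + 1)  # round up, never advertise 0
    headers = {"Retry-After": str(retry_after), "Content-Type": "application/json"}
    body = '{"error": "rate limit exceeded", "retry_after": %d}' % retry_after
    return 429, headers, body
```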

&lt;h2&gt;
  
  
  Rate Limiting in Different Industries
&lt;/h2&gt;

&lt;p&gt;Rate limiting plays a critical role across many industries, ensuring that applications remain stable, secure, and efficient under varying loads. Different industries apply rate limiting strategies based on their unique operational needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E-commerce:&lt;/strong&gt; In e-commerce, rate limiting protects checkout and payment APIs to prevent fraud and service degradation during major sales events like Black Friday.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Financial Services:&lt;/strong&gt; Banks and financial institutions use rate limiting to secure sensitive transaction endpoints, prevent fraud, and comply with regulatory requirements such as PSD2 or PCI-DSS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Social Media Platforms:&lt;/strong&gt; Social media networks like Twitter and Instagram aggressively apply rate limiting to curb bots, reduce scraping activities, and maintain platform health.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gaming Industry:&lt;/strong&gt; Online games use rate limiting to ensure fairness in gameplay and protect their servers from bot attacks and spam requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare Applications:&lt;/strong&gt; Healthcare systems implement rate limiting to control access to sensitive patient data, ensuring compliance with standards like HIPAA and minimizing risks of system overload.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges of Rate Limiting
&lt;/h2&gt;

&lt;p&gt;While rate limiting is powerful, it can introduce challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;False Positives:&lt;/strong&gt; Legitimate users might get blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling Issues:&lt;/strong&gt; Managing rate limits across distributed systems can be complex.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Frustration:&lt;/strong&gt; Overly aggressive limits can degrade the user experience.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solutions include adaptive rate limits, user-specific thresholds, and clear communication through error messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tools and Libraries for Rate Limiting
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Redis:&lt;/strong&gt; Often used for storing counters and implementing rate limits efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NGINX:&lt;/strong&gt; Built-in modules for HTTP rate limiting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Envoy Proxy:&lt;/strong&gt; Offers dynamic rate limiting via external services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Libraries:&lt;/strong&gt; &lt;code&gt;express-rate-limit&lt;/code&gt; for Node.js or &lt;code&gt;django-ratelimit&lt;/code&gt; for Django.&lt;/li&gt;
&lt;/ul&gt;
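&lt;p&gt;As an example of the Redis approach, the classic pattern is one counter per key using INCR and EXPIRE. The sketch below uses an in-memory stand-in exposing the same two calls; in production you would pass a &lt;code&gt;redis.Redis()&lt;/code&gt; client instead, since it exposes the same &lt;code&gt;incr&lt;/code&gt;/&lt;code&gt;expire&lt;/code&gt; methods:&lt;/p&gt;

```python
class InMemoryCounter:
    """Stand-in for a Redis client: exposes the incr/expire subset we use."""
    def __init__(self):
        self.counts = {}

    def incr(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key, seconds):
        pass  # a real store would schedule key deletion here

def allow_request(counter, key, limit, window_seconds=60):
    """Fixed-window limit using the classic INCR + EXPIRE pattern."""
    count = counter.incr(key)
    if count == 1:
        counter.expire(key, window_seconds)  # first hit opens the window
    return limit >= count
```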

&lt;h2&gt;
  
  
  Proxies at ScrapFly
&lt;/h2&gt;

&lt;p&gt;ScrapFly &lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;Proxy Saver&lt;/a&gt; is a proxy middleware that optimizes your existing proxy connections—cutting bandwidth usage, reducing failure rates, and adding advanced smart caching and fingerprinting layers to any proxy source.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started" rel="noopener noreferrer"&gt;Bandwidth reduction&lt;/a&gt; – stub unnecessary image, font, and CSS requests to save up to 30% in data costs.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started#cache" rel="noopener noreferrer"&gt;Smart page caching&lt;/a&gt; – speed up repeated requests with automatic page, redirect, and CORS caching.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started#fingerprint" rel="noopener noreferrer"&gt;Fingerprint restoration&lt;/a&gt; – impersonate real browsers or restore original proxy fingerprints to avoid detection.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started#retry" rel="noopener noreferrer"&gt;Automatic retries and healing&lt;/a&gt; – built-in logic fixes bad headers, errors, and retryable failures.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started#integration" rel="noopener noreferrer"&gt;Simple integration&lt;/a&gt; – plug into your proxy stack via a single dashboard with full parameter forwarding support.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started#protocol" rel="noopener noreferrer"&gt;Full protocol support&lt;/a&gt; – works with HTTP, HTTPS, HTTP2, and SOCKS5 connections for maximum compatibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="What is Rate Limiting? Everything You Need to Know" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Proxy Saver: supercharge any proxy provider with middleware performance boosts.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;Try Proxy Saver For FREE!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs/proxy-saver/getting-started" rel="noopener noreferrer"&gt;Read Proxy Saver Docs&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;p&gt;Below are quick answers to common questions about rate limiting.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a 429 Error?
&lt;/h3&gt;

&lt;p&gt;A 429 Error means "Too Many Requests." It indicates that the user has sent too many requests in a given amount of time and has hit the rate limit.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Can I Bypass Rate Limits?
&lt;/h3&gt;

&lt;p&gt;Bypassing rate limits is generally unethical and discouraged. Instead, consider applying for higher usage quotas or optimizing your application's request patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can Rate Limiting Be Dynamic?
&lt;/h3&gt;

&lt;p&gt;Yes, dynamic rate limiting adjusts thresholds based on server load, user tiers, or other runtime parameters to offer flexible control.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Rate limiting is an essential tool for any developer working with APIs, web services, or scalable applications. It ensures system stability, fairness, and security. By understanding the different algorithms, real-world applications, challenges, and best practices, you can implement effective rate-limiting strategies in your projects.&lt;/p&gt;

&lt;p&gt;Now that you have a clear understanding of what rate limiting is and how to implement it, you can build more reliable and secure systems.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Optimize Proxies</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Thu, 24 Apr 2025 14:46:04 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/how-to-optimize-proxies-3b5p</link>
      <guid>https://dev.to/scrapfly_dev/how-to-optimize-proxies-3b5p</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmpgf5ni8iug4s5qsx28.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmpgf5ni8iug4s5qsx28.webp" alt="How to Optimize Proxies" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjka64pwpdg5fz95uvsw.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjka64pwpdg5fz95uvsw.webp" alt="How to Optimize Proxies" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whether you're scraping websites, managing multiple accounts, or protecting your privacy, using proxies efficiently can be the difference between success and constant frustration. Knowing how to optimize proxies isn't just a technical necessity—it's a strategic advantage for developers.&lt;/p&gt;

&lt;p&gt;In this article, we'll explore the key techniques to optimize proxy use, compare proxies with VPNs for clarity, and show you how tools like Scrapfly Proxy Saver can save you time and resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does It Mean to Optimize Proxies?
&lt;/h2&gt;

&lt;p&gt;Optimizing proxies means configuring and using them in a way that maximizes speed, maintains anonymity, and reduces costs. This involves selecting the right proxy types, managing sessions properly, and understanding your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Type of Proxy
&lt;/h2&gt;

&lt;p&gt;There are different types of proxies, each with specific advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/blog/the-best-datacenter-proxies-in-2025-a-complete-guide/" rel="noopener noreferrer"&gt;&lt;strong&gt;Datacenter Proxies&lt;/strong&gt;&lt;/a&gt;: Fast and affordable but easier to detect.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/blog/top-5-residential-proxy-providers/" rel="noopener noreferrer"&gt;&lt;strong&gt;Residential Proxies&lt;/strong&gt;&lt;/a&gt;: Harder to block and better for anonymity but more expensive.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/blog/mobile-vs-residential-proxies-whats-the-difference/" rel="noopener noreferrer"&gt;&lt;strong&gt;Mobile Proxies&lt;/strong&gt;&lt;/a&gt;: Offer the highest anonymity but often come with limitations in speed and availability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Selecting the right proxy depends on your specific needs—whether it's speed, cost-efficiency, or stealth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Setup for Maximum Speed
&lt;/h2&gt;

&lt;p&gt;Speed optimization starts with minimizing latency and ensuring stability. Here's a sample setup using a proxy in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

proxies = {
    'http': 'http://user:pass@proxyserver:port',
    'https': 'http://user:pass@proxyserver:port'
}

response = requests.get('https://httpbin.org/ip', proxies=proxies)
print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we configure HTTP and HTTPS requests to route through a proxy. With a pool of such proxies, you can distribute requests across IP addresses and avoid rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Maintaining Anonymity
&lt;/h2&gt;

&lt;p&gt;To maintain anonymity while using proxies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rotate proxies frequently.&lt;/li&gt;
&lt;li&gt;Use user-agent strings that mimic real browsers.&lt;/li&gt;
&lt;li&gt;Avoid predictable patterns in request behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These practices help prevent detection and blocking by websites.&lt;/p&gt;
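&lt;p&gt;These practices can be combined into a small helper. The proxy URLs and user-agent strings below are placeholders for your own pool:&lt;/p&gt;

```python
import itertools
import random

# Placeholder pool entries -- replace with your provider's real endpoints.
PROXIES = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

_proxy_cycle = itertools.cycle(PROXIES)  # round-robin over the pool

def next_request_config():
    """Rotate proxies round-robin and randomize the user agent per request."""
    proxy = next(_proxy_cycle)
    return {
        "proxies": {"http": proxy, "https": proxy},
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
    }
```

&lt;p&gt;Each call can then be passed straight to &lt;code&gt;requests.get(url, **next_request_config())&lt;/code&gt;.&lt;/p&gt;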

&lt;h2&gt;
  
  
  Keeping Costs Under Control
&lt;/h2&gt;

&lt;p&gt;Bandwidth costs and proxy rates can add up quickly. To reduce expenses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use datacenter proxies for high-volume, low-risk scraping.&lt;/li&gt;
&lt;li&gt;Reserve residential proxies for complex or sensitive targets.&lt;/li&gt;
&lt;li&gt;Implement intelligent request throttling to reduce unnecessary usage.&lt;/li&gt;
&lt;/ul&gt;
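&lt;p&gt;Request throttling can be as simple as enforcing a minimum delay between outgoing requests. A minimal sketch, with illustrative interval values:&lt;/p&gt;

```python
import random
import time

class Throttler:
    """Enforce a minimum delay (plus random jitter) between requests."""
    def __init__(self, min_interval=1.0, jitter=0.5):
        self.min_interval = min_interval  # seconds between requests
        self.jitter = jitter              # extra random delay to avoid patterns
        self.last_request = 0.0

    def wait(self):
        # sleep just long enough to honor the interval since the last request
        elapsed = time.monotonic() - self.last_request
        delay = self.min_interval + random.uniform(0, self.jitter) - elapsed
        if delay > 0:
            time.sleep(delay)
        self.last_request = time.monotonic()
```

&lt;p&gt;Calling &lt;code&gt;wait()&lt;/code&gt; before each proxied request keeps traffic smooth and avoids paying for bursts that get blocked.&lt;/p&gt;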

&lt;h2&gt;
  
  
  Proxy vs. VPN: A Quick Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;VPN&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Faster&lt;/td&gt;
&lt;td&gt;Slightly slower due to encryption&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anonymity&lt;/td&gt;
&lt;td&gt;Depends on proxy type&lt;/td&gt;
&lt;td&gt;High, but centralized&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use Case&lt;/td&gt;
&lt;td&gt;Scraping, automation, SEO tools&lt;/td&gt;
&lt;td&gt;General browsing, streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Variable (can be low)&lt;/td&gt;
&lt;td&gt;Often subscription-based&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For more detail, check out our article:&lt;/p&gt;


&lt;p&gt;Proxy vs VPN: In-Depth Comparison&lt;/p&gt;

&lt;p&gt;Explore the proxy vs vpn debate with insights on key differences, benefits, limitations and alternatives. Discover when to choose a proxy or VPN.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fproxy-vs-vpn_banner_light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fproxy-vs-vpn_banner_light.svg" alt="Proxy vs VPN: In-Depth Comparison" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://scrapfly.io/blog/proxy-vs-vpn/" rel="noopener noreferrer"&gt;https://scrapfly.io/blog/proxy-vs-vpn/&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Proxy in Web Scraping
&lt;/h2&gt;

&lt;p&gt;Proxies play a pivotal role in web scraping, acting as intermediaries that mask your IP address, rotate identities, and help access region-restricted or rate-limited data sources. Whether you're working on small scripts or enterprise-scale data pipelines, proxies ensure that your scraping operations remain anonymous and uninterrupted.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why Proxies Matter in Web Scraping
&lt;/h3&gt;

&lt;p&gt;Using a proxy allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Avoid IP bans by rotating through multiple addresses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Access geo-specific content by routing requests through different countries.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stay under the radar with residential or mobile IPs that mimic real user behavior.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Integrating proxies effectively helps ensure scalability, reliability, and compliance in web scraping tasks.&lt;/p&gt;

&lt;p&gt;Now let’s look at how to improve proxy usage by reducing resource load.&lt;/p&gt;
&lt;h2&gt;
  
  
  Blocking Resource Loading in Web Scraping Tools
&lt;/h2&gt;

&lt;p&gt;Blocking unnecessary resources like images and media files can significantly speed up your web scraping process and save proxy bandwidth. Here's how you can do it in different libraries:&lt;/p&gt;
&lt;h3&gt;
  
  
  Selenium
&lt;/h3&gt;

&lt;p&gt;First, install Selenium:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install selenium

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use Chrome options to disable images or combine Selenium with &lt;code&gt;mitmproxy&lt;/code&gt; for advanced filtering:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--blink-settings=imagesEnabled=false')
chrome_options.add_experimental_option(
    "prefs", {"profile.managed_default_content_settings.images": 2}
)
driver = webdriver.Chrome(options=options, chrome_options=chrome_options)
driver.get("https://www.example.com")
driver.quit()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or block specific resource types using &lt;code&gt;mitmproxy&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;First, install mitmproxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mitmproxy


# Save as block.py and run with mitmproxy -s block.py
from mitmproxy import http
BLOCK_RESOURCE_EXTENSIONS = ['.gif', '.jpg', '.jpeg', '.png', '.webp']
def request(flow: http.HTTPFlow) -&amp;gt; None:
    if any(flow.request.pretty_url.endswith(ext) for ext in BLOCK_RESOURCE_EXTENSIONS):
        flow.response = http.Response.make(404, b"Blocked", {"Content-Type": "text/html"})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more, read the full guide:&lt;/p&gt;


&lt;p&gt;Web Scraping with Selenium and Python Tutorial + Example Project&lt;/p&gt;

&lt;p&gt;Selenium and Python tutorial for web scraping dynamic, javascript powered websites using a headless Chrome webdriver. Real life example project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fweb-scraping-with-selenium-and-python_banner_light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fweb-scraping-with-selenium-and-python_banner_light.svg" alt="Web Scraping with Selenium and Python Tutorial + Example Project" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://scrapfly.io/blog/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;https://scrapfly.io/blog/web-scraping-with-selenium-and-python/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Playwright
&lt;/h3&gt;

&lt;p&gt;First, install Playwright:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install playwright
playwright install

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Intercept requests and block unwanted resources by type or keyword:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from playwright.sync_api import sync_playwright

def intercept_route(route):
    if route.request.resource_type in ['image', 'media']:
        return route.abort()
    return route.continue_()

with sync_playwright() as pw:
    browser = pw.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", intercept_route)
    page.goto("https://www.example.com")
    browser.close()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more, read the full guide:&lt;/p&gt;


&lt;p&gt;Web Scraping with Playwright and Python&lt;/p&gt;

&lt;p&gt;Playwright is the new, big browser automation toolkit - can it be used for web scraping? In this introduction article, we'll take a look how can we use Playwright and Python to scrape dynamic websites.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fweb-scraping-with-playwright-and-python_banner_light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fweb-scraping-with-playwright-and-python_banner_light.svg" alt="Web Scraping with Playwright and Python" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/" rel="noopener noreferrer"&gt;https://scrapfly.io/blog/web-scraping-with-playwright-and-python/&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Puppeteer
&lt;/h3&gt;

&lt;p&gt;First, install Puppeteer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Puppeteer enables blocking based on resource type or matching URLs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer');
(async () =&amp;gt; {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);

  page.on('request', request =&amp;gt; {
    if (['image', 'media'].includes(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });

  await page.goto('https://www.example.com');
  await browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For more, read the full guide:&lt;/p&gt;


&lt;p&gt;How to Web Scrape with Puppeteer and NodeJS in 2025&lt;/p&gt;

&lt;p&gt;Puppeteer and nodejs tutorial (javascript) for web scraping dynamic web pages and web apps. Tips and tricks, best practices and example project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fweb-scraping-with-puppeteer-and-nodejs_banner_light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fweb-scraping-with-puppeteer-and-nodejs_banner_light.svg" alt="How to Optimize Proxies" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
](&lt;a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/" rel="noopener noreferrer"&gt;https://scrapfly.io/blog/web-scraping-with-playwright-and-python/&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrapfly Proxy Saver
&lt;/h2&gt;

&lt;p&gt;Scrapfly Proxy Saver is a middleware solution designed to enhance your existing proxy setup by optimizing bandwidth usage, improving stability, and providing advanced fingerprinting capabilities. It acts as a man-in-the-middle (MITM) service, offering a suite of features tailored for developers and data professionals.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bandwidth Optimization&lt;/strong&gt; : By stubbing unnecessary resources like images and CSS, Proxy Saver can reduce bandwidth consumption by up to 30%.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Caching&lt;/strong&gt; : Leverage Scrapfly's CDN to automatically cache results, redirects, and CORS, enhancing response times and reducing redundant requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fingerprint Impersonation&lt;/strong&gt; : Choose from a pool of real web browser profiles to mimic genuine user behavior, aiding in bypassing proxy detection mechanisms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Stability&lt;/strong&gt; : Proxy Saver improves connection stability by automatically retrying failed requests and resolving common proxy issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Integration&lt;/strong&gt; : Supports integration with platforms like Python and TypeScript, ensuring flexibility across different development environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;Proxy Saver is versatile and caters to various industries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AI Training&lt;/strong&gt; : Reduce bandwidth usage and improve response times when working with data-intensive websites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance&lt;/strong&gt; : Efficiently proxy to compliance sources, ensuring data integrity and reduced overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;eCommerce&lt;/strong&gt; : Enhance stability when accessing e-commerce platforms, ensuring consistent data retrieval.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Financial Services&lt;/strong&gt; : Optimize bandwidth and response times when interfacing with financial data sources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fraud Detection&lt;/strong&gt; : Improve response times and reduce bandwidth usage in fraud detection systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;To utilize Proxy Saver:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Create a Proxy Saver Instance&lt;/strong&gt; : Access the Scrapfly dashboard and set up a new Proxy Saver instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configure Your Proxy&lt;/strong&gt; : Attach your existing proxy connection to the Proxy Saver instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt; : Use the standard &lt;code&gt;username:password&lt;/code&gt; scheme, where the username is &lt;code&gt;proxyId-XXX&lt;/code&gt; (your proxy ID) and the password is your API key.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Configuration&lt;/strong&gt; : Utilize parameters like &lt;code&gt;Timeout-10&lt;/code&gt; to set timeouts or &lt;code&gt;FpImpersonate-chrome_win_130&lt;/code&gt; to impersonate specific browser fingerprints.&lt;/li&gt;
&lt;/ol&gt;
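As an illustration of steps 3 and 4, the connection string can be assembled like any standard HTTP proxy URL. All values below are placeholders, and whether configuration parameters are appended to the username with dashes as shown should be confirmed against the Proxy Saver documentation:

```python
import urllib.request

# Placeholders -- copy the real proxy ID, API key, and gateway address
# from your Proxy Saver dashboard.
API_KEY = "YOUR_SCRAPFLY_API_KEY"
username = "proxyId-123-Timeout-10-FpImpersonate-chrome_win_130"
proxy_url = f"http://{username}:{API_KEY}@gateway.example.com:8000"

# A Proxy Saver instance behaves like any HTTP proxy endpoint
handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
opener = urllib.request.build_opener(handler)
```

From here, every request made through `opener` is routed via the Proxy Saver instance.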

&lt;h3&gt;
  
  
  Pricing
&lt;/h3&gt;

&lt;p&gt;Proxy Saver operates on a pay-as-you-go model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base Rate&lt;/strong&gt; : $0.2 per GB of bandwidth used.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional Features&lt;/strong&gt; : Fingerprint impersonation incurs an extra $0.1 per GB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Monitor your usage and billing details directly from the Proxy Saver dashboard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;Try Proxy Saver&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs" rel="noopener noreferrer"&gt;More on Scrapfly&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Caching Proxy Strategies
&lt;/h2&gt;

&lt;p&gt;Caching is a powerful technique to boost the efficiency of proxy usage. By avoiding redundant data requests, developers can significantly reduce costs and improve speed, especially in large-scale scraping projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Use Caching with Proxies?
&lt;/h3&gt;

&lt;p&gt;Caching in proxy workflows ensures that data retrieval is not only faster but also more economical. By storing commonly accessed responses, you can greatly minimize redundant traffic and API load.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reduce Bandwidth Costs&lt;/strong&gt; : Avoid fetching the same data multiple times, which is especially useful with paid proxies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improve Speed&lt;/strong&gt; : Cached data loads faster, reducing wait times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhance Stability&lt;/strong&gt; : Reduces the volume of live requests sent through proxies, minimizing potential failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How to Implement Caching
&lt;/h3&gt;

&lt;p&gt;There are multiple layers at which caching can be implemented, each offering unique advantages. Whether you're working locally or integrating with a proxy service, there are effective solutions to fit your needs.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local Caching&lt;/strong&gt; : Use tools like &lt;code&gt;requests-cache&lt;/code&gt; in Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy-Level Caching&lt;/strong&gt; : Leverage built-in features in services like Scrapfly Proxy Saver that offer CDN caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Strategies&lt;/strong&gt; : Develop logic that checks for cached responses before querying external sites.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import requests_cache

requests_cache.install_cache('demo_cache', backend='sqlite', expire_after=180)
response = requests.get('https://example.com/data')
print(response.from_cache) # Indicates if response was cached

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
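The custom-strategy option can be sketched with a small in-memory cache that checks for a fresh entry before any live (proxied) request goes out. This is a minimal illustration, not a production cache:

```python
import time

class SimpleCache:
    """Minimal in-memory cache keyed by URL, with a per-entry TTL."""

    def __init__(self, ttl: float = 180.0):
        self.ttl = ttl
        self._store = {}  # url -> (timestamp, body)

    def get(self, url: str):
        entry = self._store.get(url)
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None  # missing or expired

    def put(self, url: str, body: bytes) -> None:
        self._store[url] = (time.time(), body)

def fetch(url: str, cache: SimpleCache, downloader) -> tuple:
    """Return (body, from_cache); only call the downloader on a cache miss."""
    cached = cache.get(url)
    if cached is not None:
        return cached, True  # served locally, no proxy bandwidth spent
    body = downloader(url)  # e.g., a requests.get(...) through your proxy
    cache.put(url, body)
    return body, False
```

Here `downloader` stands in for whatever proxied request function you already use; only misses and expired entries consume proxy bandwidth.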



&lt;p&gt;Now that you understand how caching can boost proxy efficiency, let’s move on to common questions developers have.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can proxies handle JavaScript-heavy sites?
&lt;/h3&gt;

&lt;p&gt;Yes, proxies can be used with JavaScript-heavy websites, but you'll need to use headless browsers or frameworks like Puppeteer and Playwright that support JavaScript rendering. Proxies ensure traffic routing while these tools manage dynamic content loading.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are there free proxies worth using?
&lt;/h3&gt;

&lt;p&gt;Free proxies exist and may work for basic or low-risk tasks, but they often suffer from issues like slow speeds, instability, or a high chance of being blocked. For reliable performance, it's recommended to use paid or vetted proxy services.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I test if a proxy is working?
&lt;/h3&gt;

&lt;p&gt;You can test proxies by sending a request to a service like &lt;code&gt;httpbin.org/ip&lt;/code&gt; or using proxy checker tools. If the IP in the response matches your proxy and no errors occur, the proxy is functioning correctly.&lt;/p&gt;
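That check can be scripted with the standard library alone. A minimal sketch, where the proxy URL and expected IP are placeholders you would supply:

```python
import json
import urllib.request

def visible_ip(httpbin_body: bytes) -> str:
    """Extract the caller's IP from an httpbin.org/ip response body."""
    return json.loads(httpbin_body)["origin"]

def check_proxy(proxy_url: str, expected_ip: str, timeout: float = 10.0) -> bool:
    """Fetch httpbin.org/ip through the proxy and compare the visible IP."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open("https://httpbin.org/ip", timeout=timeout) as resp:
        return visible_ip(resp.read()) == expected_ip
```

If `check_proxy` returns `True`, traffic is leaving through the proxy's IP rather than your own.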

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To optimize proxies effectively, you need to select the appropriate proxy type, fine-tune your technical implementation for speed, and practice cost-efficient usage. By understanding the differences between proxies and VPNs, and using tools like Scrapfly Proxy Saver, developers can significantly improve their workflow and performance.&lt;/p&gt;

</description>
      <category>proxies</category>
    </item>
    <item>
      <title>How to Build an MCP Server in Python: A Complete Guide</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Fri, 18 Apr 2025 10:26:55 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/how-to-build-an-mcp-server-in-python-a-complete-guide-28bp</link>
      <guid>https://dev.to/scrapfly_dev/how-to-build-an-mcp-server-in-python-a-complete-guide-28bp</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy4gkqq2bb02gy45tgo3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqy4gkqq2bb02gy45tgo3.png" alt="How to Build an MCP Server in Python: A Complete Guide" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb99egg5a9gjlopklx6u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb99egg5a9gjlopklx6u8.png" alt="How to Build an MCP Server in Python: A Complete Guide" width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building an &lt;a href="https://scrapfly.io/blog/what-is-mcp-understanding-the-model-context-protocol/" rel="noopener noreferrer"&gt;MCP (Model Context Protocol)&lt;/a&gt; server allows your applications to interact directly with large language models by exposing custom tools, resources, and prompts. Whether you're building a plugin-like system for LLMs or enabling external AI integrations, the MCP server serves as a crucial bridge.&lt;/p&gt;

&lt;p&gt;In this guide, we'll walk through how to build a simple MCP server in Python using a calculator example.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Model Context Protocol (MCP)?
&lt;/h2&gt;

&lt;p&gt;The model context protocol (MCP) is an open standard developed to let external tools, APIs, or plugins communicate with large language models (LLMs). An MCP server is a program you run locally or remotely that LLMs (like Claude or those in Cursor) can connect to and call defined functions, query resources, or use prompt templates.&lt;/p&gt;

&lt;p&gt;In MCP, there are three key components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tools: Functions that can be called by the model.&lt;/li&gt;
&lt;li&gt;Resources: Static or dynamic files or data the model can request.&lt;/li&gt;
&lt;li&gt;Prompts: Templated messages that guide the model's output.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details about MCP, check out our article:&lt;/p&gt;


&lt;p&gt;What Is MCP? Understanding the Model Context Protocol&lt;/p&gt;

&lt;p&gt;What is MCP? Learn how the Model Context Protocol powers tools like Copilot Studio by giving AI models access to real-time, structured context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuussipf5yo4y6h4we32o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuussipf5yo4y6h4we32o.webp" alt="How to Build an MCP Server in Python: A Complete Guide" width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
](&lt;a href="https://scrapfly.io/blog/what-is-mcp-understanding-the-model-context-protocol/" rel="noopener noreferrer"&gt;https://scrapfly.io/blog/what-is-mcp-understanding-the-model-context-protocol/&lt;/a&gt;)&lt;/p&gt;
&lt;h2&gt;
  
  
  Understanding the Basics of MCP Communication
&lt;/h2&gt;

&lt;p&gt;Before diving into code, it’s essential to understand how models interact with your server. MCP servers operate over transports like &lt;code&gt;stdio&lt;/code&gt;, &lt;code&gt;http&lt;/code&gt;, or &lt;code&gt;websocket&lt;/code&gt;. A host like Cursor will send JSON-based requests, and your server responds with tool results, prompt content, or resource data.&lt;/p&gt;

&lt;p&gt;This design allows the model to dynamically call your tools or read your files just like a plugin system.&lt;/p&gt;
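To make the wire format concrete, here is a hedged sketch of the kind of JSON-RPC 2.0 message a host sends when calling a tool; the exact fields are governed by the MCP specification:

```python
import json

# Sketch of an MCP tool-call request: JSON-RPC 2.0, one JSON object
# per message when sent over the stdio transport.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "add", "arguments": {"a": 7, "b": 5}},
}
wire_message = json.dumps(request)
```

The server replies with a matching `id` and a `result` payload containing the tool's output.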
&lt;h2&gt;
  
  
  Why Use MCP Instead of Other APIs?
&lt;/h2&gt;

&lt;p&gt;MCP is purpose-built for LLMs. Unlike REST APIs that require explicit engineering effort to query, MCP integrates directly with model interfaces. Your functions become accessible as if the model "knew" how to call them.&lt;/p&gt;

&lt;p&gt;This makes it ideal for prototyping, teaching, internal tools, and research-driven interfaces.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting Up Your Python Environment
&lt;/h2&gt;

&lt;p&gt;First, ensure you have Python 3.10 or later installed. Then, create a virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m venv mcp-env
source mcp-env/bin/activate # On Windows: mcp-env\Scripts\activate

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates an isolated environment for your project, helping avoid conflicts with other Python packages.&lt;/p&gt;

&lt;p&gt;Install the MCP SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mcp "mcp[cli]"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;mcp&lt;/code&gt; package provides the server framework and CLI utilities. The &lt;code&gt;[cli]&lt;/code&gt; extra installs additional command-line tools.&lt;/p&gt;

&lt;p&gt;To verify installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see the installed version number, confirming a successful setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Your First MCP Server (Calculator Example)
&lt;/h2&gt;

&lt;p&gt;Let’s start with a basic calculator tool that adds two numbers. Create a file named &lt;code&gt;calculator.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mcp.server.fastmcp import FastMCP # Import FastMCP, the quickstart server base

mcp = FastMCP("Calculator Server") # Initialize an MCP server instance with a descriptive name

@mcp.tool() # Register a function as a callable tool for the model
def add(a: int, b: int) -&amp;gt; int:
    """Add two numbers and return the result."""
    return a + b # Simple arithmetic logic

if __name__ == "__main__":
    mcp.run(transport="stdio") # Run the server, using standard input/output for communication

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script defines a minimal MCP server with one tool, &lt;code&gt;add&lt;/code&gt;. The &lt;code&gt;@mcp.tool()&lt;/code&gt; decorator tells the MCP framework that this function should be available to connected LLMs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Context and Advanced Tools
&lt;/h2&gt;

&lt;p&gt;MCP tools can go beyond simple math—they can access the internet, return rich media like images, and be written asynchronously. Here are a few examples to extend your calculator-themed server with more functionality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Body Mass Index (BMI) Calculator Tool
&lt;/h3&gt;

&lt;p&gt;This tool calculates BMI, which is a useful health-related metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@mcp.tool()
def calculate_bmi(weight_kg: float, height_m: float) -&amp;gt; float:
    """Calculate BMI given weight in kg and height in meters"""
    return round(weight_kg / (height_m ** 2), 2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This fits well into a calculator suite for health and fitness features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Live Exchange Rate Fetcher (Async)
&lt;/h3&gt;

&lt;p&gt;Here’s how to add a tool that fetches live currency exchange rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import httpx

@mcp.tool()
async def get_exchange_rate(from_currency: str, to_currency: str) -&amp;gt; str:
    """Fetch current exchange rate from one currency to another."""
    async with httpx.AsyncClient() as client:
        response = await client.get(
            f"https://api.exchangerate-api.com/v4/latest/{from_currency}"
        )
        rates = response.json().get("rates", {})
        rate = rates.get(to_currency)
        if rate:
            return f"1 {from_currency} = {rate} {to_currency}"
        return "Unable to fetch exchange rate."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can make your calculator server useful for travelers and finance apps.&lt;/p&gt;

&lt;h3&gt;
  
  
  Previewing Image-Based Calculations
&lt;/h3&gt;

&lt;p&gt;You can also process images using the built-in &lt;code&gt;Image&lt;/code&gt; class. For instance, previewing a graph or bill snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mcp.server.fastmcp import Image
from PIL import Image as PILImage

@mcp.tool()
def generate_thumbnail(image_path: str) -&amp;gt; Image:
    """Generate a thumbnail for a provided image (e.g., bill or graph)."""
    img = PILImage.open(image_path)
    img.thumbnail((120, 120))
    return Image(data=img.tobytes(), format="png")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This could be used when the LLM is reviewing visual data like receipts or chart screenshots.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Context to Track Progress
&lt;/h3&gt;

&lt;p&gt;Some tasks, like parsing multiple calculation files, may take time. MCP provides a &lt;code&gt;Context&lt;/code&gt; object to manage progress and logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mcp.server.fastmcp import Context

@mcp.tool()
async def batch_process(files: list[str], ctx: Context) -&amp;gt; str:
    """Simulate batch calculation from uploaded files with progress feedback."""
    for i, file in enumerate(files):
        ctx.info(f"Processing file {file}")
        await ctx.report_progress(i + 1, len(files))
        data, mime_type = await ctx.read_resource(f"file://{file}")
    return "Batch processing complete"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives both you and the model transparency into what's happening behind the scenes—ideal for long-running tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding More Tools
&lt;/h2&gt;

&lt;p&gt;Expand your calculator with subtraction, multiplication, and division:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@mcp.tool()
def subtract(a: int, b: int) -&amp;gt; int:
    """Subtract the second number from the first."""
    return a - b

@mcp.tool()
def multiply(a: int, b: int) -&amp;gt; int:
    """Multiply two numbers."""
    return a * b

@mcp.tool()
def divide(a: float, b: float) -&amp;gt; float:
    """Divide the first number by the second. Raises error on division by zero."""
    if b == 0:
        raise ValueError("Division by zero")
    return a / b

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each tool is explicitly documented. If an LLM queries available tools, it will see these docstrings, helping it understand the correct usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Organizing Your MCP Project
&lt;/h2&gt;

&lt;p&gt;For beginners, it helps to structure your code into folders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp-calculator/
├── calculator.py
├── tools/
│ └── arithmetic.py
├── prompts/
│ └── templates.txt
└── docs/
    └── usage.txt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then import and register each tool separately. This improves maintainability and scaling.&lt;/p&gt;
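One way to wire that up, sketched under the assumption that tool functions live in `tools/arithmetic.py` as plain functions (the registration calls are shown commented out so the snippet stays self-contained):

```python
# tools/arithmetic.py -- hypothetical layout: plain functions, no MCP imports
def add(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

def subtract(a: int, b: int) -> int:
    """Subtract the second number from the first."""
    return a - b

# calculator.py would then register the imported functions; since
# @mcp.tool() is an ordinary decorator, it can be applied as a call:
#
#   from mcp.server.fastmcp import FastMCP
#   from tools.arithmetic import add, subtract
#
#   mcp = FastMCP("Calculator Server")
#   mcp.tool()(add)
#   mcp.tool()(subtract)
```

Keeping tool logic free of MCP imports also makes it trivial to unit-test each function on its own.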

&lt;h2&gt;
  
  
  Exposing Resources
&lt;/h2&gt;

&lt;p&gt;In MCP, resources can be either static files or dynamic responses. Here's how to define a dynamic resource using the &lt;code&gt;@mcp.resource()&lt;/code&gt; decorator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Add a dynamic greeting resource
@mcp.resource("calculator://greet/{name}")
def calculator_greeting(name: str) -&amp;gt; str:
    """Get a personalized greeting"""
    return f"Hello, {name}! Ready to calculate something today?"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This makes the URI &lt;code&gt;calculator://greet/{name}&lt;/code&gt; (e.g., &lt;code&gt;calculator://greet/Alice&lt;/code&gt;) available to the model. When the model queries this resource, the function will execute and return the corresponding greeting.&lt;/p&gt;

&lt;p&gt;You can also use static resources like text files for documentation or data.&lt;/p&gt;

&lt;p&gt;Create a &lt;code&gt;docs&lt;/code&gt; folder and add a file named &lt;code&gt;usage.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;This MCP server can perform basic arithmetic functions. Use tools like add, subtract, multiply, and divide.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And use:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@mcp.resource("usage://guide")
def get_usage() -&amp;gt; str:
    with open("docs/usage.txt") as f:
        return f.read()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns the content of the file when requested by the model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Prompts
&lt;/h2&gt;

&lt;p&gt;In MCP, prompts can also be defined using functions decorated with &lt;code&gt;@mcp.prompt()&lt;/code&gt;. This allows for dynamic, conditional, and reusable prompt generation.&lt;/p&gt;

&lt;p&gt;Here’s an example that combines all four operations into a single prompt function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@mcp.prompt()
def calculator_prompt(a: float, b: float, operation: str) -&amp;gt; str:
    """Prompt for a calculation and return the result."""
    if operation == "add":
        return f"The result of adding {a} and {b} is {add(a, b)}"
    elif operation == "subtract":
        return f"The result of subtracting {b} from {a} is {subtract(a, b)}"
    elif operation == "multiply":
        return f"The result of multiplying {a} and {b} is {multiply(a, b)}"
    elif operation == "divide":
        try:
            return f"The result of dividing {a} by {b} is {divide(a, b)}"
        except ValueError as e:
            return str(e)
    else:
        return "Invalid operation. Please choose add, subtract, multiply, or divide."

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function returns a human-readable summary for any supported operation. The model can invoke it with arguments, and receive consistent, contextual output.&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Server
&lt;/h2&gt;

&lt;p&gt;To run the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp run path/to/calculator.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can connect it with a tool like Cursor or Claude Desktop. In Cursor:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Cursor &lt;code&gt;Settings&lt;/code&gt; &amp;gt; &lt;code&gt;MCP&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Click &lt;code&gt;Add new global MCP server&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cursor stores MCP server definitions in a JSON format. You can add your server manually to the configuration like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "mcpServers": {
      "local-mcp": {
        "command": "python",
        "args": [
          "path/to/your/local/mcp/calculator.py"
        ]
      }
    }
 }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once connected, Cursor will automatically detect your tools and show them under your server listing.&lt;/p&gt;

&lt;p&gt;You should see something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ob245qbx669lskdltay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2ob245qbx669lskdltay.png" alt="How to Build an MCP Server in Python: A Complete Guide" width="800" height="132"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then test it with natural language prompts like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Add 7 and 5&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;What's 12 divided by 4?&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where and How to Test Your MCP Server
&lt;/h2&gt;

&lt;p&gt;The easiest way to test your MCP server is by running it with the MCP CLI tool, which includes a local dashboard for interacting with your prompts and tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the Dev Dashboard
&lt;/h3&gt;

&lt;p&gt;Use the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mcp dev ./calculator.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command launches a local dev dashboard. Once the server is running, it will open a browser window where you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;View all registered tools and prompts&lt;/li&gt;
&lt;li&gt;Test each prompt by filling in parameters&lt;/li&gt;
&lt;li&gt;See the results and any errors in real time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The dashboard will look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2gjrk4fne29307ewtux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2gjrk4fne29307ewtux.png" alt="How to Build an MCP Server in Python: A Complete Guide" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes &lt;code&gt;mcp dev&lt;/code&gt; the most straightforward way to test and iterate on your MCP server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Use Cases for MCP Servers
&lt;/h2&gt;

&lt;p&gt;MCP servers are useful beyond experiments—they help connect your code to language models in practical, meaningful ways.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal tools: Let models assist with tasks like running calculations, generating reports, or querying databases.&lt;/li&gt;
&lt;li&gt;Customer support bots: Provide models access to live data, documentation, or helpdesk tools.&lt;/li&gt;
&lt;li&gt;Education: Build interactive learning aids for math, science, or coding using model-driven prompts and tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These examples show how MCP can add model interaction to everyday workflows, making your applications smarter and more interactive.&lt;/p&gt;

&lt;h2&gt;
  
  
  Power-up with Scrapfly
&lt;/h2&gt;

&lt;p&gt;Scrapfly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt;, and &lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;proxy saver&lt;/a&gt; APIs for data collection at scale. Each product is equipped with an automatic bypass for any anti-bot system, and we achieve this by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Maintaining a fleet of real, reinforced web browsers with real fingerprint profiles.&lt;/li&gt;
&lt;li&gt;Millions of self-healing proxies of the highest possible trust score.&lt;/li&gt;
&lt;li&gt;Constantly evolving and adapting to new anti-bot systems.&lt;/li&gt;
&lt;li&gt;Introducing &lt;a href="https://scrapfly.io/proxy-saver" rel="noopener noreferrer"&gt;Proxy Saver&lt;/a&gt; – a performance-boosting middleware that reduces bandwidth, improves stability, and adds fingerprint support to your own proxies.&lt;/li&gt;
&lt;li&gt;We've been doing this publicly since 2020 with the best bypass on the market!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="How to Build an MCP Server in Python: A Complete Guide" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/register" rel="noopener noreferrer"&gt;Try for FREE!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs" rel="noopener noreferrer"&gt;More on Scrapfly&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Have questions about building or using an MCP server? Here are quick answers to some of the most common ones:&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I create a dynamic resource?
&lt;/h3&gt;

&lt;p&gt;Use the &lt;code&gt;@mcp.resource()&lt;/code&gt; decorator with a dynamic path. For example, &lt;code&gt;calculator://greet/{name}&lt;/code&gt; allows models to access personalized data. The function will be called with the provided parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  What types can MCP tools return?
&lt;/h3&gt;

&lt;p&gt;MCP tools can return basic data types like strings, numbers, lists, and even binary media using the &lt;code&gt;Image&lt;/code&gt; class. The return type should match the expected use by the model or interface calling it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use async functions in my tools?
&lt;/h3&gt;

&lt;p&gt;Yes, MCP fully supports &lt;code&gt;async def&lt;/code&gt; functions. These are useful for non-blocking operations like fetching data from APIs or processing large files without stalling your server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This guide covered how to build an MCP server in Python using a calculator app as a clear, hands-on example. We explored the core components of the Model Context Protocol: tools, resources, and prompts, and how each one allows large language models to interact dynamically with your code. You learned how to implement basic and advanced tools including asynchronous functions, handle user input with custom prompts, and expose data through both static and dynamic resources.&lt;/p&gt;

&lt;p&gt;We also showed how to test your server locally using &lt;code&gt;mcp dev&lt;/code&gt;, connect it to an LLM interface like Cursor, and structure your project for real-world development. Whether you're building internal utilities, educational tools, or LLM-driven interfaces, MCP provides a lightweight yet powerful standard to bridge AI and software.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>What Is MCP? Understanding the Model Context Protocol</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Mon, 14 Apr 2025 16:26:38 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/what-is-mcp-understanding-the-model-context-protocol-c9l</link>
      <guid>https://dev.to/scrapfly_dev/what-is-mcp-understanding-the-model-context-protocol-c9l</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuussipf5yo4y6h4we32o.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuussipf5yo4y6h4we32o.webp" alt="What Is MCP? Understanding the Model Context Protocol" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2u8bkiobn4042ksspk9.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2u8bkiobn4042ksspk9.webp" alt="What Is MCP? Understanding the Model Context Protocol" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The world of AI development is evolving fast, and with that evolution comes the need for more structured, scalable communication between models and their tools or environments. Enter the Model Context Protocol (MCP): a modern, standardized way for large language models (LLMs) like GPT-4 to interact with external data, tools, and APIs within a secure and modular ecosystem. If you’ve heard of Copilot Studio or MCP servers, you’re already partway to understanding how this framework works in practice.&lt;/p&gt;

&lt;p&gt;If you're building your own AI assistant or developing context-aware applications, MCP ensures your models access the right information at the right time with built-in support for security, modularity, and scalability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is MCP?
&lt;/h2&gt;

&lt;p&gt;The Model Context Protocol (MCP) is a communication framework that allows large language models (LLMs) to dynamically interact with real-time, structured data. Instead of relying solely on static prompts or pre-trained information, MCP equips models with access to live tools, APIs, and services that are contextually relevant to a user’s query.&lt;/p&gt;

&lt;p&gt;MCP separates the AI model from backend logic and data, making systems more modular and easier to scale. This allows developers to reuse tools across different models and platforms while simplifying the integration of real-time data into AI-driven workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Does MCP Work?
&lt;/h2&gt;

&lt;p&gt;MCP operates by managing the lifecycle of tool registration, context gathering, and inference orchestration. Here's a breakdown of how it functions:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqigdcp2gzw1r8wxfigb.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxqigdcp2gzw1r8wxfigb.webp" alt="What Is MCP? Understanding the Model Context Protocol" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Tool Registration&lt;/strong&gt;: APIs and services register themselves with an MCP server, declaring what they can do and what inputs they need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Compilation&lt;/strong&gt;: When a query is submitted, the MCP system gathers all the necessary context from these registered tools. This might include user profiles, historical actions, or data from third-party systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Invocation&lt;/strong&gt;: The compiled context is passed into the model as part of the inference request, allowing the model to generate a response that’s highly contextual and personalized.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture ensures that models remain lightweight and focused on reasoning, while the MCP infrastructure handles the complexities of data access, validation, and selection.&lt;/p&gt;
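&lt;p&gt;The three-step lifecycle above can be sketched in plain Python. This is an illustrative simulation with a hypothetical registry, not the actual MCP SDK:&lt;/p&gt;

```python
# Plain-Python sketch of the MCP lifecycle: a hypothetical tool registry,
# not the real MCP SDK.

tools = {}

def register_tool(name, description, handler):
    # 1. Tool registration: a service declares what it can do
    tools[name] = {"description": description, "handler": handler}

def compile_context(query):
    # 2. Context compilation: gather declarations from registered tools
    return [f"{name}: {spec['description']}" for name, spec in tools.items()]

def invoke_model(query):
    # 3. Model invocation: the query plus compiled context go to the LLM
    return {"query": query, "context": compile_context(query)}

register_tool("get_weather", "Current weather for a city", lambda city: "sunny")
request = invoke_model("What is the weather in Paris?")
```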

&lt;h2&gt;
  
  
  How Were AI Systems Handling Context Before MCP?
&lt;/h2&gt;

&lt;p&gt;Before the advent of MCP, AI developers had to rely on homegrown solutions for tool orchestration and context injection. These were often brittle and difficult to scale, typically involving hardcoded prompts or ad hoc plugin-style architectures.&lt;/p&gt;

&lt;p&gt;For instance:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tools were added to LLMs using manual function calling, often lacking dynamic discovery or selection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Context (like user profile or recent history) was flattened into prompts, leading to bloated, inefficient token usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Security and permissions had to be custom coded, exposing vulnerabilities.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach created significant friction, especially for enterprises building conversational interfaces or automated workflows across multiple services.&lt;/p&gt;

&lt;h2&gt;
  
  
  How To Get Started with MCP
&lt;/h2&gt;

&lt;p&gt;The easiest way to begin using the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is to explore the official MCP documentation and open-source repositories. Organizations like &lt;strong&gt;Anthropic&lt;/strong&gt; have already laid the groundwork, providing detailed specifications and SDKs for languages like Python, TypeScript, and Java. Whether you're building an agent from scratch or plugging into an existing system like &lt;strong&gt;Copilot Studio&lt;/strong&gt;, the setup process is fairly straightforward.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up an MCP Server
&lt;/h3&gt;

&lt;p&gt;Your first move is to deploy or install an MCP server connected to the tools or data sources you want your model to access. Anthropic and other contributors offer a library of pre-built MCP servers for popular services like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google Drive, Gmail, and Calendar&lt;/li&gt;
&lt;li&gt;Slack (chat and file APIs)&lt;/li&gt;
&lt;li&gt;GitHub and Git repos&lt;/li&gt;
&lt;li&gt;SQL databases like Postgres&lt;/li&gt;
&lt;li&gt;Web browsers and automation tools like Puppeteer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can usually get started by cloning the server’s repo, installing dependencies, and configuring credentials (like API keys or tokens). Many setups are as easy as running a single CLI command.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Connect the MCP Client
&lt;/h3&gt;

&lt;p&gt;Once your MCP server is live, it’s time to wire it into your LLM or agent framework. If you’re using a hosted AI platform like Claude Desktop or Copilot Studio, this may involve entering the server address into a settings UI. For developers building their own tools, the MCP SDK allows you to instantiate a client, register your tool server’s endpoint, and start interacting with it programmatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Enable and Extend the Capabilities
&lt;/h3&gt;

&lt;p&gt;After setup, your MCP-aware model client will automatically discover the registered tools and enhance its abilities—adding new function calls, prompt templates, or dynamic context inputs based on what the server provides. You don’t need to hand-craft every interaction; the model knows how to interpret what’s available and use it as needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Invoke and Iterate
&lt;/h3&gt;

&lt;p&gt;Now you're ready to test it out. Ask your AI model to perform a task that requires tool usage (e.g., “Summarize the latest emails from my team” or “Find the last edited Google Doc”). Watch the MCP logs to verify that your requests are reaching the server and that responses are flowing back. You’ll see real-time interaction between the LLM and your toolset—fully mediated by MCP.&lt;/p&gt;

&lt;p&gt;Now that you know how to get started, let’s look at why this protocol matters so much for modern AI development.&lt;/p&gt;

&lt;h2&gt;
  
  
  How MCP Can Be Used in Web Scraping
&lt;/h2&gt;

&lt;p&gt;Web scraping is a valuable method for gathering real-time data, but integrating that data into AI systems often requires complex transformations. With the Model Context Protocol (MCP), scraping tools can act as context providers, delivering structured data directly to language models at inference time.&lt;/p&gt;

&lt;p&gt;This approach simplifies how scraped content—like product listings, headlines, or reviews—is used by AI. Instead of embedding raw HTML into prompts, the MCP server formats scraped data into a standardized context format. When the model is prompted, it receives clean, relevant context without extra processing.&lt;/p&gt;
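&lt;p&gt;For instance, a scraping-backed MCP server might reduce raw listings to a compact context string before handing them to the model. The helper and data below are hypothetical, purely to illustrate the shape of the transformation:&lt;/p&gt;

```python
# Hypothetical helper: turn scraped records into a compact, model-ready
# context string instead of embedding raw HTML in the prompt.

def listings_to_context(listings):
    lines = [f"- {item['title']}: ${item['price']}" for item in listings]
    return "Product listings:\n" + "\n".join(lines)

scraped = [
    {"title": "Mechanical Keyboard", "price": 89},
    {"title": "USB-C Hub", "price": 35},
]
context = listings_to_context(scraped)
```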

&lt;h2&gt;
  
  
  Power-up using Scrapfly
&lt;/h2&gt;

&lt;p&gt;ScrapFly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, and &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt; APIs for data collection at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/anti-scraping-protection" rel="noopener noreferrer"&gt;Anti-bot protection bypass&lt;/a&gt; - scrape web pages without blocking!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;Rotating residential proxies&lt;/a&gt; - prevent IP address and geographic blocks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-rendering" rel="noopener noreferrer"&gt;JavaScript rendering&lt;/a&gt; - scrape dynamic web pages through cloud browsers.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-scenario" rel="noopener noreferrer"&gt;Full browser automation&lt;/a&gt; - control browsers to scroll, input and click on objects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/getting-started#api_param_format" rel="noopener noreferrer"&gt;Format conversion&lt;/a&gt; - scrape as HTML, JSON, Text, or Markdown.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/sdk/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/sdk/typescript" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; SDKs, as well as &lt;a href="https://scrapfly.io/docs/sdk/scrapy" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/integration/getting-started" rel="noopener noreferrer"&gt;no-code tool integrations&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="What Is MCP? Understanding the Model Context Protocol" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/register" rel="noopener noreferrer"&gt;Try for FREE!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs" rel="noopener noreferrer"&gt;More on Scrapfly&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Popular MCP Servers You Should Know
&lt;/h2&gt;

&lt;p&gt;MCP supports a growing list of tool integrations, making it easy to connect AI models with real-world services. Here are some of the most popular MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Drive, Slack, GitHub, and Postgres&lt;/strong&gt; – Access files, chats, code, and databases directly from your AI agent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fetch and Puppeteer&lt;/strong&gt; – For scraping and reading web content in real-time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stripe, Spotify, Todoist&lt;/strong&gt; – Manage payments, playlists, and tasks through natural language.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker and Kubernetes&lt;/strong&gt; – Let your model interact with DevOps environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools show how flexible MCP is—whether you're building AI for research, automation, or everyday productivity. You can explore more at &lt;a href="https://modelcontextprotocol.io/examples" rel="noopener noreferrer"&gt;modelcontextprotocol.io/examples&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Below are quick answers to common questions about the Model Context Protocol (MCP) and its uses.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I build my own MCP server?
&lt;/h3&gt;

&lt;p&gt;Yes! MCP is designed to be modular and developer-friendly. You can build a custom server using the open-source SDKs, often by wrapping an existing tool or API in a standardized format.&lt;/p&gt;

&lt;h3&gt;
  
  
  What languages does MCP support?
&lt;/h3&gt;

&lt;p&gt;The official MCP SDKs are available in several languages, including Python, TypeScript, Java, and Kotlin, with more language support expected as the ecosystem grows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is MCP secure for enterprise use?
&lt;/h3&gt;

&lt;p&gt;Yes. MCP supports role-aware access control and data filtering, ensuring users only access the context they’re authorized to see—making it suitable for secure, enterprise-scale applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; represents a major shift in how AI systems access and use external data. It enables tools, APIs, and documents to act as live context providers for large language models, replacing hardcoded prompts and brittle plugins with a standardized, scalable system. Whether you're integrating file systems, web scraping tools, or enterprise databases, MCP allows your AI to operate with richer, real-time awareness.&lt;/p&gt;

&lt;p&gt;We explored how MCP works, how to get started, and how it's being used across fields like DevOps, productivity, and even scraping. With growing community support and a robust library of MCP servers, it's never been easier to build smarter, context-aware AI applications.&lt;/p&gt;

</description>
      <category>ai</category>
    </item>
    <item>
      <title>Build a Proxy API: Rotate Proxies and Save Bandwidth</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Mon, 31 Mar 2025 23:13:48 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/build-a-proxy-api-rotate-proxies-and-save-bandwidth-188e</link>
      <guid>https://dev.to/scrapfly_dev/build-a-proxy-api-rotate-proxies-and-save-bandwidth-188e</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob87ohcdmammgzsh6dyq.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fob87ohcdmammgzsh6dyq.webp" alt="Build a Proxy API: Rotate Proxies and Save Bandwidth" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhjyxha3i84lduy9t6p4.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhjyxha3i84lduy9t6p4.webp" alt="Build a Proxy API: Rotate Proxies and Save Bandwidth" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;APIs can consume significant bandwidth, especially when multiple clients or services are fetching the same resources repeatedly. One way to reduce this overhead is by using a &lt;strong&gt;proxy API&lt;/strong&gt; – an intermediary that sits between your application and external APIs or websites. A proxy API can cache responses and filter out unnecessary data, saving bandwidth and speeding up requests for all clients that use it.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll walk through building a simple API proxy in Python using mitmproxy, a powerful open-source MITM (man-in-the-middle) proxy tool. By rotating proxies on each request and caching responses, our proxy will help avoid IP blocks and reduce duplicate data transfers. We’ll also configure it to drop unwanted resources (like images or styles) to further conserve bandwidth. Let’s dive into the benefits of rotating proxies and how to set up this bandwidth-saving proxy tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Proxy API and Why Is It Useful?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;proxy API&lt;/strong&gt; is a server that forwards API requests from clients to external services. Instead of directly contacting the target API or website, your application sends requests through the proxy, which may modify requests, manage authentication, caching, or IP rotation, and then returns responses to your application.&lt;/p&gt;

&lt;p&gt;Proxy APIs offer several key benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy:&lt;/strong&gt; Conceal your application's IP address.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized control:&lt;/strong&gt; Simplify logging, rate limiting, and caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency:&lt;/strong&gt; Reduce bandwidth usage and improve response reliability through caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility:&lt;/strong&gt; Easily manage API rate limits and bypass IP restrictions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these advantages helps highlight the essential features needed for building an effective proxy API, which we'll explore next.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features of a Good Proxy API
&lt;/h2&gt;

&lt;p&gt;Not all proxies are created equal. A good proxy API for bandwidth saving and web scraping tasks should include a few important features out of the box. Below are some key features and why they matter:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Purpose &amp;amp; Benefit&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Proxy Rotation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Use a pool of IP proxies and rotate them on each request. This prevents any single proxy from being overused and getting blocked, ensuring higher availability and fewer captchas or bans. It also distributes traffic load across multiple IPs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Response Caching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Store responses (e.g., API results or webpage content) and serve them for identical requests. Caching avoids redundant downloads of the same data, significantly saving bandwidth and improving response times for repeated queries.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content Filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Drop or ignore unnecessary resource requests like images, CSS, or ads. By filtering out these non-critical assets, the proxy saves bandwidth and focuses on the data that your application actually needs (e.g. HTML or API JSON).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;HTTPS Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Intercept and handle HTTPS traffic by using trusted certificates. Full HTTPS support ensures even secure API calls can be proxied, while still allowing the proxy to inspect and cache their content.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These features combined make a proxy API both efficient and resilient. Proxy rotation keeps your scraping or API consumption low-profile and much harder to block. Caching and filtering keep it light on bandwidth. In the next section, we'll build our own proxy API step by step with Python and mitmproxy, incorporating each of these features.&lt;/p&gt;

&lt;p&gt;Now that we've identified what we need (rotation, caching, filtering, etc.), it's time to get our hands dirty and build the proxy API with these capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Set Up mitmproxy in Python
&lt;/h2&gt;

&lt;p&gt;To build our proxy API, we'll use mitmproxy, a Python-based intercepting proxy. Mitmproxy can be scripted with Python addons to modify requests and responses on the fly. First, let's install mitmproxy and create a basic addon script to ensure everything is wired up correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install mitmproxy via pip if you haven't already
$ pip install mitmproxy

# (Optional) Verify the installation by checking the version
$ mitmproxy --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we'll set up a simple mitmproxy addon in Python. Create a file (for example, &lt;code&gt;proxy_tool.py&lt;/code&gt;) and add a basic class that will handle proxy events. For now, we'll just log each request to confirm our proxy is intercepting traffic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mitmproxy import http

class BandwidthSaver:
    def request(self, flow: http.HTTPFlow):
        # Log each incoming request URL (for debugging purposes)
        print("Request URL:", flow.request.pretty_url)

# Register the addon with mitmproxy
addons = [BandwidthSaver()]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this snippet, we import mitmproxy's &lt;code&gt;http&lt;/code&gt; module and define a class &lt;code&gt;BandwidthSaver&lt;/code&gt; with a &lt;code&gt;request&lt;/code&gt; method. Mitmproxy will call &lt;code&gt;request()&lt;/code&gt; for every HTTP request passing through the proxy. Here we simply print the URL of the request (&lt;code&gt;flow.request.pretty_url&lt;/code&gt;) to the console. The last line registers our class as a mitmproxy addon.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running the proxy:&lt;/strong&gt; To test this setup, run mitmproxy (or its console-less variant &lt;code&gt;mitmdump&lt;/code&gt;) with the addon script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mitmdump -s proxy_tool.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example Output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[20:01:58.311] Loading script proxy_tool.py
[20:01:58.311] HTTP(S) proxy listening at *:8080.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, mitmproxy listens on &lt;strong&gt;localhost:8080&lt;/strong&gt; as an HTTP proxy. Configure your application or browser to use &lt;code&gt;localhost:8080&lt;/code&gt; as the HTTP/HTTPS proxy and perform a request (for example, open a webpage or make an API call). You should see the request URLs being printed by our script. This confirms the proxy is intercepting requests successfully.&lt;/p&gt;
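&lt;p&gt;For example, a Python client can be pointed at the proxy like this, using only the standard library. The actual request is commented out since it requires &lt;code&gt;mitmdump -s proxy_tool.py&lt;/code&gt; to be running:&lt;/p&gt;

```python
import urllib.request

# Route all HTTP/HTTPS traffic through the local mitmproxy instance
proxy = urllib.request.ProxyHandler({
    "http": "http://localhost:8080",
    "https": "http://localhost:8080",
})
opener = urllib.request.build_opener(proxy)

# Uncomment once the proxy is running:
# html = opener.open("http://example.com").read()
```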

&lt;p&gt;With mitmproxy installed and our basic addon logging requests, we have the foundation ready. Next, we'll ensure HTTPS traffic can be handled by our proxy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Enable HTTPS by Installing mitmproxy’s Certificate
&lt;/h2&gt;

&lt;p&gt;Modern APIs and websites mostly use HTTPS. For our proxy API to inspect and cache those requests, we need to enable HTTPS interception. Mitmproxy does this by acting as a "man-in-the-middle" with its own Certificate Authority (CA). We must install mitmproxy's CA certificate on the client system so that it trusts the proxy for HTTPS connections.&lt;/p&gt;

&lt;p&gt;First, start mitmproxy (or mitmdump) to generate the necessary certificates if not already done:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mitmproxy # Start the proxy; it will generate a CA cert on first run

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While mitmproxy is running, open a web browser (or use your device) and visit &lt;a href="http://mitm.it" rel="noopener noreferrer"&gt;&lt;strong&gt;http://mitm.it&lt;/strong&gt;&lt;/a&gt;. This special page provides instructions to download and install the mitmproxy CA certificate for various platforms (Windows, macOS, Linux, Android, iOS). &lt;strong&gt;Install the certificate&lt;/strong&gt; according to your environment. This typically involves trusting a new CA in your system or browser settings.&lt;/p&gt;

&lt;p&gt;Once the certificate is installed, your system will treat the mitmproxy as a trusted authority. This means mitmproxy can decrypt HTTPS traffic between clients and servers, allowing our addon to read and modify those requests and responses. &lt;strong&gt;HTTPS support is now enabled&lt;/strong&gt; for our proxy API.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Only install the mitmproxy certificate on devices or environments you control for development or scraping. It gives the proxy power to intercept secure communications, which should be used responsibly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With the proxy set up and HTTPS enabled, we can proceed to implement the core features of our bandwidth-saving proxy API. Next, we'll add functionality to &lt;strong&gt;rotate upstream proxies&lt;/strong&gt; for each request.&lt;/p&gt;

&lt;p&gt;Now that secure traffic can flow through our proxy, we’re ready to enhance it with proxy rotation for better IP diversity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Rotate Proxies Randomly on Each Request
&lt;/h2&gt;

&lt;p&gt;One major benefit of a proxy API is the ability to hide the client's IP address. We can take this further by rotating through a list of upstream proxy servers on every request. By doing so, each request appears to come from a different IP—helping avoid rate limits or bans on the target service. Mitmproxy supports forwarding requests to an upstream proxy, which we can control in our script.&lt;/p&gt;

&lt;p&gt;Let's update our addon to choose a random proxy for each request. Suppose we have a list of proxy server addresses (IP:port or host:port). We’ll configure mitmproxy to use one by default and then override it per request in our script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
from mitmproxy import http

class BandwidthSaver:
    # List of upstream proxy servers to rotate through
    upstream_proxies = [
        "203.0.113.10:3128",
        "198.51.100.23:3128",
        "203.0.113.47:3128",
        # ... add as many proxies (IP:port or host:port) as you have
    ]

    def request(self, flow: http.HTTPFlow):
        # Pick a random upstream proxy for this request
        proxy_address = random.choice(self.upstream_proxies)
        host, port = proxy_address.split(":")
        # In upstream mode, tell mitmproxy to use the chosen proxy
        if flow.live:
            flow.live.change_upstream_proxy_server((host, int(port)))

        # (Optional) Log which proxy was chosen for debugging
        print(f"→ Rotating via proxy: {proxy_address} for {flow.request.host}")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we added an &lt;code&gt;upstream_proxies&lt;/code&gt; list to our class containing proxy server addresses (you would replace these example IPs with actual proxies you have access to). In the &lt;code&gt;request&lt;/code&gt; method, we use Python's &lt;code&gt;random.choice&lt;/code&gt; to select a proxy from the list for each incoming request. The &lt;code&gt;flow.live.change_upstream_proxy_server((host, port))&lt;/code&gt; call tells mitmproxy to forward the current request through that upstream proxy.&lt;/p&gt;

&lt;p&gt;A couple of important notes for this to work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Start mitmproxy in upstream mode&lt;/strong&gt;: when launching mitmdump or mitmproxy, pass a default upstream proxy via the &lt;code&gt;--mode&lt;/code&gt; option, for example &lt;code&gt;mitmdump -s proxy_tool.py --mode upstream:http://203.0.113.10:3128&lt;/code&gt;. Our addon then overrides this default per request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTPS requests&lt;/strong&gt;: the rotation logic above works for both HTTP and HTTPS requests now that we've installed the certificate. Mitmproxy decrypts each HTTPS request, then re-encrypts it as it forwards it to the chosen upstream proxy.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With proxy rotation in place, every request through our proxy API will emerge from a random IP address. This helps distribute traffic and avoid IP-based blocking. For example, if you're scraping a website that limits one request per second per IP, using five rotating proxies could effectively allow ~5 requests per second without triggering blocks.&lt;/p&gt;

&lt;p&gt;At this stage, our proxy API is forwarding requests through random proxies, enhancing anonymity and reliability. Next, we'll implement response caching to reuse results and save more bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Cache Responses to Save Bandwidth
&lt;/h2&gt;

&lt;p&gt;Caching is a crucial feature for saving bandwidth. If multiple clients request the same resource through our proxy API, there's no need to fetch it from the origin server every time – we can return a stored copy. Let's add a simple cache to our proxy using a Python dictionary to store responses.&lt;/p&gt;

&lt;p&gt;We'll cache responses by URL. When a request comes in, the addon will first check if we have a cached response for that URL. If yes, it will immediately return the cached data without forwarding the request to the internet. If not, it will proceed normally (possibly using a rotated proxy upstream), and then save the response for next time.&lt;/p&gt;

&lt;p&gt;Here's how we can integrate caching into our &lt;code&gt;BandwidthSaver&lt;/code&gt; addon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from mitmproxy import http
import random

class BandwidthSaver:
    upstream_proxies = [
        "203.0.113.10:3128",
        "198.51.100.23:3128",
        "203.0.113.47:3128",
        # ... (same proxy list as before)
    ]
    # Initialize an in-memory cache (dictionary)
    cache = {}

    def request(self, flow: http.HTTPFlow):
        # 1. If this URL was seen before and cached, serve it from cache
        if flow.request.pretty_url in self.cache:
            cached_resp = self.cache[flow.request.pretty_url]
            # Create a response directly from cache without contacting upstream
            flow.response = http.HTTPResponse.make(
                cached_resp["status_code"], # e.g. 200
                cached_resp["content"], # cached raw content (bytes)
                cached_resp["headers"] # cached headers
            )
            return # respond from cache, no need to forward request

        # 2. Not cached: pick a random proxy as in Step 3
        proxy_address = random.choice(self.upstream_proxies)
        host, port = proxy_address.split(":")
        if flow.live:
            flow.live.change_upstream_proxy_server((host, int(port)))
        # (The request will now be forwarded to the origin through the chosen proxy)

    def response(self, flow: http.HTTPFlow):
        # After receiving a response from origin, cache it for future requests
        url = flow.request.pretty_url
        if url not in self.cache:
            self.cache[url] = {
                "status_code": flow.response.status_code,
                "content": flow.response.content, # raw bytes of the response body
                "headers": dict(flow.response.headers)
            }
            # (Now the next request for the same URL will hit the cache)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break down the caching logic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We added a class attribute &lt;code&gt;cache&lt;/code&gt; as a dictionary to store responses by URL. In a real scenario, you might want a more robust cache with size limits or expiration, but this simple dict will do for demonstration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache check in &lt;code&gt;request&lt;/code&gt;:&lt;/strong&gt; Before forwarding a request, we check if &lt;code&gt;flow.request.pretty_url&lt;/code&gt; (the full URL as a string) exists in our &lt;code&gt;cache&lt;/code&gt;. If it does, we retrieve the cached data and use &lt;code&gt;http.HTTPResponse.make(...)&lt;/code&gt; to create a synthetic response. We supply the cached status code, content, and headers. Setting &lt;code&gt;flow.response&lt;/code&gt; in the request phase like this short-circuits the request – the client will get the response immediately from our proxy, and mitmproxy will not forward the request to the upstream server.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Saving in &lt;code&gt;response&lt;/code&gt;:&lt;/strong&gt; If the request was not cached, it went out to the origin (through a proxy). In the &lt;code&gt;response&lt;/code&gt; handler, we take the newly received response and store it in the cache dict. We use the same URL as key. We store the status code, the content (which is a bytes object for the body), and the headers (converted to a regular dict for simplicity). Next time the same URL is requested, the &lt;code&gt;request&lt;/code&gt; method will find it in &lt;code&gt;cache&lt;/code&gt; and return this data.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With caching enabled, repeated requests for the same resource will be served from the proxy API's memory instead of the network. This saves bandwidth because the data travels only once from the external server; subsequent requests get the data from the local cache. It also reduces latency for those requests since returning data from memory is faster than making a network round-trip.&lt;/p&gt;

&lt;p&gt;For example, if client A requests &lt;code&gt;https://api.example.com/data?id=123&lt;/code&gt; and then client B (or even A again) requests the same URL, the second request will get an instant cached response. No outgoing proxy usage or internet bandwidth is needed for the second call.&lt;/p&gt;
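&lt;p&gt;The caching behaviour can be sketched without mitmproxy at all. The following minimal dictionary-cache helper (the function names are illustrative, not part of the addon) demonstrates why the second request never touches the network:&lt;/p&gt;

```python
# Minimal sketch of the URL-keyed cache pattern used by the addon.
cache = {}

def fetch(url, fetcher):
    """Return (response, from_cache); call `fetcher` only on a cache miss."""
    if url in cache:
        return cache[url], True
    response = fetcher(url)  # stands in for the network call via the proxy
    cache[url] = response
    return response, False

# Simulate two clients requesting the same URL: only the first one "fetches".
calls = []
def fake_fetcher(url):
    calls.append(url)
    return {"status_code": 200, "content": b"data"}

first, hit1 = fetch("https://api.example.com/data?id=123", fake_fetcher)
second, hit2 = fetch("https://api.example.com/data?id=123", fake_fetcher)
print(hit1, hit2, len(calls))  # False True 1
```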

&lt;p&gt;Now our proxy API rotates proxies and caches responses, making it efficient and fast for repeated requests. Next, we'll add a final touch: filtering out unnecessary requests to conserve even more bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Drop Unnecessary Requests (Stylesheets &amp;amp; Images)
&lt;/h2&gt;

&lt;p&gt;When proxying web content (as opposed to pure API JSON), browsers often try to fetch images, stylesheets, scripts, and other assets. In a scraping context, these usually aren't needed – they just waste bandwidth. Our proxy API can proactively drop such requests. Even for API use cases, there might be certain endpoints or file types you know are extraneous. By filtering them out, the proxy saves the client from downloading useless data.&lt;/p&gt;

&lt;p&gt;We'll update the &lt;code&gt;request&lt;/code&gt; method in our addon to identify requests for common static asset types (like images and CSS) and short-circuit them with an empty response. This should happen before the caching check or proxy forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    def request(self, flow: http.HTTPFlow):
        # 0. Filter out unwanted asset types to save bandwidth
        if flow.request.pretty_url.endswith((".png", ".jpg", ".jpeg", ".gif", ".css", ".js")):
            # Return an empty 204 No Content response for these requests
            flow.response = http.HTTPResponse.make(204, b"", {})
            return

        # 1. Serve from cache if available (as implemented in Step 4)
        if flow.request.pretty_url in self.cache:
            cached_resp = self.cache[flow.request.pretty_url]
            flow.response = http.HTTPResponse.make(
                cached_resp["status_code"],
                cached_resp["content"],
                cached_resp["headers"]
            )
            return

        # 2. Otherwise, rotate proxy and forward (from Step 3)
        proxy_address = random.choice(self.upstream_proxies)
        host, port = proxy_address.split(":")
        if flow.live:
            flow.live.change_upstream_proxy_server((host, int(port)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new addition here is the first &lt;code&gt;if&lt;/code&gt; block: it checks the URL's suffix against a tuple of file extensions for images (&lt;code&gt;.png, .jpg, .jpeg, .gif&lt;/code&gt;), stylesheets (&lt;code&gt;.css&lt;/code&gt;), and scripts (&lt;code&gt;.js&lt;/code&gt;). You can adjust this list based on what you consider "unnecessary" for your scenario. If a match is found, we immediately set &lt;code&gt;flow.response&lt;/code&gt; to an HTTP 204 (No Content) with an empty body. A 204 status tells the client that the request succeeded but there's no content to load. We then &lt;code&gt;return&lt;/code&gt; without forwarding the request further. The result is that, for example, if a webpage tries to load a large &lt;code&gt;.png&lt;/code&gt; image, our proxy will respond with nothing (saving the bandwidth that would have been used to download the image).&lt;/p&gt;
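&lt;p&gt;One caveat: a plain &lt;code&gt;endswith&lt;/code&gt; check on the full URL misses assets requested with query strings (e.g. &lt;code&gt;/logo.png?v=2&lt;/code&gt;). A slightly more robust variant (a hypothetical helper, not part of the addon above) matches extensions against the parsed URL path instead:&lt;/p&gt;

```python
from urllib.parse import urlparse

STATIC_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js")

def should_drop(url: str) -> bool:
    """Match static-asset extensions on the URL path, ignoring query strings."""
    return urlparse(url).path.lower().endswith(STATIC_EXTENSIONS)

print(should_drop("https://example.com/logo.png?v=2"))    # True
print(should_drop("https://api.example.com/data?id=123")) # False
```

&lt;p&gt;Inside the addon, the first &lt;code&gt;if&lt;/code&gt; block would then call this helper on &lt;code&gt;flow.request.pretty_url&lt;/code&gt; instead of the raw suffix check.&lt;/p&gt;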

&lt;p&gt;After adding this filter, the rest of the logic remains the same: we check the cache, and if not cached, we forward the request through a rotated proxy. The &lt;code&gt;response&lt;/code&gt; handler also remains as implemented in Step 4 (caching any new responses). We typically don't need to cache the dropped items since we never fetch them in the first place.&lt;/p&gt;

&lt;p&gt;With this final step, our proxy API tool is quite complete. It rotates among multiple upstream proxies, caches responses to reuse data, and blocks superfluous asset requests. All these measures contribute to &lt;strong&gt;substantial bandwidth savings&lt;/strong&gt; and can speed up your data fetching pipelines.&lt;/p&gt;

&lt;p&gt;To run the full proxy with all features combined, use the script and start mitmproxy as before. For instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ mitmdump --mode upstream:http://203.0.113.10:3128 -s proxy_tool.py

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Remember to update the &lt;code&gt;upstream_proxies&lt;/code&gt; list in the script with proxies you have. Also ensure your clients are configured to use the mitmproxy server (e.g., &lt;code&gt;HTTP_PROXY&lt;/code&gt; environment variable or browser proxy settings pointing to &lt;code&gt;localhost:8080&lt;/code&gt;). Once running, your proxy API will handle incoming requests according to the logic we implemented.&lt;/p&gt;
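&lt;p&gt;For command-line clients, the proxy settings can be supplied through environment variables (assuming mitmproxy is listening on its default &lt;code&gt;localhost:8080&lt;/code&gt;):&lt;/p&gt;

```shell
# Point HTTP clients at the local mitmproxy instance
export HTTP_PROXY=http://localhost:8080
export HTTPS_PROXY=http://localhost:8080
# HTTPS clients must also trust mitmproxy's CA certificate, e.g. with curl:
curl --proxy http://localhost:8080 --cacert ~/.mitmproxy/mitmproxy-ca-cert.pem https://httpbin.org/ip
```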

&lt;p&gt;We have now built a functional proxy API that can be used as a drop-in bandwidth-saving layer for web scraping or API consumption.&lt;/p&gt;


&lt;h2&gt;
  
  
  Proxies at ScrapFly
&lt;/h2&gt;

&lt;p&gt;ScrapFly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, and &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt; APIs for data collection at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/anti-scraping-protection" rel="noopener noreferrer"&gt;Anti-bot protection bypass&lt;/a&gt; - scrape web pages without blocking!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;Rotating residential proxies&lt;/a&gt; - prevent IP address and geographic blocks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-rendering" rel="noopener noreferrer"&gt;JavaScript rendering&lt;/a&gt; - scrape dynamic web pages through cloud browsers.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-scenario" rel="noopener noreferrer"&gt;Full browser automation&lt;/a&gt; - control browsers to scroll, input and click on objects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/getting-started#api_param_format" rel="noopener noreferrer"&gt;Format conversion&lt;/a&gt; - scrape as HTML, JSON, Text, or Markdown.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/sdk/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/sdk/typescript" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; SDKs, as well as &lt;a href="https://scrapfly.io/docs/sdk/scrapy" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/integration/getting-started" rel="noopener noreferrer"&gt;no-code tool integrations&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="Build a Proxy API: Rotate Proxies and Save Bandwidth" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/register" rel="noopener noreferrer"&gt;Try for FREE!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs" rel="noopener noreferrer"&gt;More on Scrapfly&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What is a Proxy API?
&lt;/h3&gt;

&lt;p&gt;A proxy API is a server that forwards requests from clients to external services, optionally modifying requests and responses (e.g., adding caching or authentication). It helps hide client details, enforce policies, and aggregate data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why use rotating proxies in a proxy API?
&lt;/h3&gt;

&lt;p&gt;Rotating proxies distribute requests across different IPs, preventing rate-limiting and bans when scraping or accessing restricted APIs. This ensures reliability and higher request volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does caching in a proxy API save bandwidth?
&lt;/h3&gt;

&lt;p&gt;Caching stores responses locally on the proxy. Subsequent identical requests use cached responses rather than fetching again from external services, significantly reducing bandwidth usage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we built a bandwidth-saving proxy API from scratch using Python and mitmproxy. We started by setting up mitmproxy and enabling HTTPS interception so that we could handle secure traffic. Then we added proxy rotation, allowing each request to exit through a different IP address to avoid rate limits and blocking. Next, we implemented a simple in-memory cache to store responses and serve repeated requests without re-downloading data. We also introduced a filtering mechanism to drop unnecessary resources like images and styles, conserving bandwidth further.&lt;/p&gt;

</description>
      <category>proxies</category>
      <category>api</category>
    </item>
    <item>
      <title>The Best Datacenter Proxies in 2025: A Complete Guide</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Tue, 25 Mar 2025 12:53:35 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/the-best-datacenter-proxies-in-2025-a-complete-guide-209n</link>
      <guid>https://dev.to/scrapfly_dev/the-best-datacenter-proxies-in-2025-a-complete-guide-209n</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fdatacenter-proxies-white-bg.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fdatacenter-proxies-white-bg.svg" alt="The Best Datacenter Proxies in 2025: A Complete Guide" width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
Datacenter proxies are a top choice for web scraping, automation, and online anonymity thanks to their speed, low cost, and scalability. Whether you're scaling a scraping operation, grabbing limited Supreme drops, or testing with free data center proxies, there's a setup for every use case.&lt;/p&gt;

&lt;p&gt;In this article, we'll dive into what datacenter proxies are, how they compare to residential and mobile proxies, and which services currently offer the best bang for your buck, highlighting top providers like &lt;a href="https://iproyal.com/" rel="noopener noreferrer"&gt;IPRoyal&lt;/a&gt; and &lt;a href="https://razorproxy.com/" rel="noopener noreferrer"&gt;Razorproxy&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Are Datacenter Proxies?
&lt;/h2&gt;

&lt;p&gt;Datacenter proxies are IP addresses provided by servers hosted in data centers. Unlike residential proxies, which are tied to real devices and ISPs, these proxies are not associated with a physical location or end user. This makes them incredibly fast and reliable, but also more detectable by websites with strong anti-bot systems.&lt;/p&gt;
&lt;h3&gt;
  
  
  What Are the Types of Datacenter Proxies?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Shared datacenter proxies&lt;/strong&gt; are used by multiple clients simultaneously. They're more affordable and ideal for low-risk tasks like general scraping or data gathering. However, since the IPs are shared, there's a higher chance of detection and IP bans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedicated datacenter proxies&lt;/strong&gt; are assigned to a single user, offering greater stability, speed, and security. They're the better choice for sensitive tasks like account creation, sneaker botting, or large-scale scraping that require consistent performance.&lt;/p&gt;

&lt;p&gt;Now that you understand what they are, let’s explore why and when you should choose datacenter proxies over other types.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Choose Datacenter Proxies?
&lt;/h2&gt;

&lt;p&gt;Datacenter proxies stand out for their speed, affordability, and scalability. They can be deployed in bulk, making them ideal for handling high-volume, data-intensive tasks. Whether you're running a scraping operation or automating e-commerce tasks, these proxies offer the performance you need at a fraction of the cost of residential or mobile proxies.&lt;/p&gt;

&lt;p&gt;Here are some key use cases where datacenter proxies shine:&lt;/p&gt;
&lt;h3&gt;
  
  
  Web Scraping at Scale
&lt;/h3&gt;

&lt;p&gt;Providers like IPRoyal and Razorproxy offer flexible, high-performance plans that support thousands of requests per second. This makes datacenter proxies a top choice for collecting data at scale without worrying about bandwidth limitations or speed bottlenecks.&lt;/p&gt;
&lt;h3&gt;
  
  
  Bypassing Geo-Restrictions
&lt;/h3&gt;

&lt;p&gt;Even though they aren't linked to real user devices, datacenter proxies with EU or other regional IPs can still help access location-specific content. They're especially useful for testing localized versions of websites or gathering data from different regions for compliance and market research.&lt;/p&gt;
&lt;h3&gt;
  
  
  E-commerce Automation
&lt;/h3&gt;

&lt;p&gt;Speed and reliability are crucial when it comes to automated purchasing or monitoring stock. That’s why datacenter proxies are considered the best proxies for Supreme drops and similar high-stakes e-commerce scenarios: they keep your bots running fast and smoothly under pressure.&lt;/p&gt;

&lt;p&gt;Whether you’re looking to buy dedicated proxy access, try out free data center proxies, or go for unlimited bandwidth proxies, there’s a setup that fits your use case and budget.&lt;/p&gt;

&lt;p&gt;Now that we've covered the benefits, let’s break down the differences between shared and dedicated datacenter proxies.&lt;/p&gt;
&lt;h2&gt;
  
  
  Shared vs Dedicated Datacenter Proxies
&lt;/h2&gt;

&lt;p&gt;When choosing the best datacenter proxies, it’s important to understand the difference between &lt;strong&gt;shared&lt;/strong&gt; and &lt;strong&gt;dedicated&lt;/strong&gt; options. Each serves different needs, and the right choice depends on your use case, budget, and performance requirements.&lt;/p&gt;
&lt;h3&gt;
  
  
  Shared Datacenter Proxies
&lt;/h3&gt;

&lt;p&gt;Shared datacenter proxies are IP addresses that are used simultaneously by multiple users. Because of this, they’re much more affordable, making them a great entry-level option for individuals or small teams working on non-sensitive tasks.&lt;/p&gt;

&lt;p&gt;These proxies are well-suited for activities like keyword tracking, market research, or scraping publicly available data where speed and full anonymity aren’t critical. However, since several users may be using the same IP at once, websites can flag or block them more easily, especially if abuse is detected.&lt;/p&gt;

&lt;p&gt;They’re a solid choice if you're looking to experiment with free data center proxies or run lightweight projects where occasional IP bans are acceptable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Dedicated Datacenter Proxies
&lt;/h3&gt;

&lt;p&gt;Dedicated datacenter proxies are reserved for a single user, giving you full control and exclusive access to the IP. This greatly reduces the risk of bans, ensures more consistent performance, and allows you to manage session-based tasks more effectively.&lt;/p&gt;

&lt;p&gt;Because of their reliability and clean IP history, dedicated proxies are the preferred option for high-stakes applications like sneaker botting (such as copping drops from Supreme), login automation, social media account management, or solving captchas. They're also the go-to for businesses that need to run high-frequency scraping without interruptions.&lt;/p&gt;

&lt;p&gt;While they come at a higher cost, the stability, speed, and security they offer make them well worth the investment, especially when paired with unlimited bandwidth proxies from providers like IPRoyal or Razorproxy.&lt;/p&gt;

&lt;p&gt;Here’s a side-by-side comparison to help you visualize the key differences:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Shared Proxies&lt;/th&gt;
&lt;th&gt;Dedicated Proxies&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Performance&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Higher&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Risk of IP Ban&lt;/td&gt;
&lt;td&gt;Higher (shared usage)&lt;/td&gt;
&lt;td&gt;Lower (exclusive access)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ideal For&lt;/td&gt;
&lt;td&gt;General scraping, SEO, research&lt;/td&gt;
&lt;td&gt;Sneaker bots, login automation, account creation&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now that you have a clearer understanding of both options, let's explore how to choose a good proxy provider that matches your goals.&lt;/p&gt;
&lt;h2&gt;
  
  
  What Makes a Good Proxy Provider?
&lt;/h2&gt;

&lt;p&gt;Selecting the right datacenter proxy provider is critical, especially if you're running high-volume scraping, automation, or e-commerce operations. Here are the key factors to consider when evaluating good proxy websites:&lt;/p&gt;
&lt;h3&gt;
  
  
  Speed and Bandwidth
&lt;/h3&gt;

&lt;p&gt;Performance is a top priority. If you're handling large-scale scraping or time-sensitive tasks, speed is essential. Look for providers that offer unlimited bandwidth to ensure consistent throughput without the risk of data limits or throttling.&lt;/p&gt;
&lt;h3&gt;
  
  
  Reliability and Uptime
&lt;/h3&gt;

&lt;p&gt;A dependable provider should offer at least 99.9% uptime. This ensures your operations run smoothly without frequent interruptions. Providers like Razorproxy are known for publishing infrastructure performance metrics, which is a good indicator of reliability and transparency.&lt;/p&gt;
&lt;h3&gt;
  
  
  IP Pool Diversity
&lt;/h3&gt;

&lt;p&gt;Access to a broad and diverse IP pool allows you to target multiple regions effectively. Look for providers offering EU datacenter proxies and other global locations, which can be valuable for tasks like localized testing, geo-restricted content access, or international market research.&lt;/p&gt;
&lt;h3&gt;
  
  
  Support and Developer Features
&lt;/h3&gt;

&lt;p&gt;If you're managing large or complex projects, advanced features can make a big difference. Look for providers that offer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy rotation support&lt;/li&gt;
&lt;li&gt;API access for automation and monitoring&lt;/li&gt;
&lt;li&gt;Detailed usage statistics&lt;/li&gt;
&lt;li&gt;Flexible authentication options (IP whitelisting or user/password)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These capabilities provide better control, scalability, and efficiency, especially in technical or enterprise-level use cases.&lt;/p&gt;

&lt;p&gt;Before choosing a provider, consider starting with a trial or a low-commitment plan to evaluate real-world performance. This helps ensure the service meets your specific needs before scaling up.&lt;/p&gt;
&lt;h2&gt;
  
  
  Best Datacenter Proxy Providers in 2025
&lt;/h2&gt;

&lt;p&gt;Choosing the right provider depends on your specific needs, whether you're testing the waters with free proxies, running high-frequency scrapers, or managing automation at scale.&lt;/p&gt;

&lt;p&gt;Below are some of the top-rated datacenter proxy providers for 2025, along with a comparison table to help guide your decision.&lt;/p&gt;
&lt;h3&gt;
  
  
  IPRoyal
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://iproyal.com/" rel="noopener noreferrer"&gt;IPRoyal&lt;/a&gt; is known for its affordability, user-friendly dashboard, and flexible plans. It's an excellent choice for individuals or small teams looking to get started with free data center proxies or scale gradually. Their shared and dedicated datacenter proxies are competitively priced, making them accessible for a wide range of users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Entry-level users, budget-conscious scraping, and small-scale automation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Razorproxy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//razorproxy.com"&gt;Razorproxy&lt;/a&gt; delivers premium performance with high-speed, low-latency proxies designed for serious use cases. It supports buy dedicated proxy access with unlimited bandwidth, making it a top pick for users running high-frequency scraping, sneaker bots, or e-commerce monitoring tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Power users, large-scale scraping, and automation with zero tolerance for downtime.&lt;/p&gt;
&lt;h3&gt;
  
  
  Comparison Table: Top Datacenter Proxy Providers in 2025
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;th&gt;Bandwidth&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Price Level&lt;/th&gt;
&lt;th&gt;Free Trial/Test&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;IPRoyal&lt;/td&gt;
&lt;td&gt;Beginners, small projects&lt;/td&gt;
&lt;td&gt;Unlimited (on paid plans)&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;Affordable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Offers free data center proxies and flexible upgrades&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Razorproxy&lt;/td&gt;
&lt;td&gt;High-performance, automation&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;Very fast&lt;/td&gt;
&lt;td&gt;Moderate-High&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Great for real-time scraping, bots, and low latency use&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For more details on scraping strategies, proxy rotation, and privacy tools, check out this related article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/how-to-use-web-scaping-for-rag-applications/" rel="noopener noreferrer"&gt;How to Use TOR for Web Scraping&lt;/a&gt;&lt;br&gt;
Learn about web scraping using Tor as a proxy and rotating proxy server by randomly changing the IP address with HTTP or SOCKS.&lt;/p&gt;

&lt;p&gt;Now that you know the top providers and how they compare, let’s look at how to integrate datacenter proxies into your workflow with a quick code example.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to Use Datacenter Proxies in Code
&lt;/h2&gt;

&lt;p&gt;Using a datacenter proxy in your code is straightforward. Below is a simple Python example using the &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; library to send traffic through a proxy server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Set up your proxy credentials and address
proxy = {
    'http': 'http://username:password@203.0.113.10:8080',
    'https': 'http://username:password@203.0.113.10:8080'
}

# Make a GET request using the proxy
response = requests.get('https://httpbin.org/ip', proxies=proxy)

print(response.text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we configured both HTTP and HTTPS traffic to go through a datacenter proxy. This is a basic way to test if your proxy is working. For larger projects, consider rotating proxies or using session management.&lt;/p&gt;
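&lt;p&gt;Rotation can be layered on top of this pattern with a small helper. Here is a minimal sketch (the pool addresses and the helper name are illustrative placeholders, not a specific provider's endpoints):&lt;/p&gt;

```python
import random

# Illustrative placeholder pool; substitute endpoints from your provider
PROXY_POOL = [
    "http://username:password@203.0.113.10:8080",
    "http://username:password@203.0.113.11:8080",
    "http://username:password@203.0.113.12:8080",
]

def pick_proxy(pool):
    """Choose a random proxy and build the mapping that `requests` expects."""
    proxy = random.choice(pool)
    return {"http": proxy, "https": proxy}

# Each request then exits through a freshly picked proxy:
# requests.get("https://httpbin.org/ip", proxies=pick_proxy(PROXY_POOL))
```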

&lt;h2&gt;
  
  
  Proxies With ScrapFly
&lt;/h2&gt;

&lt;p&gt;ScrapFly is a web scraping API with &lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;residential proxies from over 50 countries&lt;/a&gt;, which helps avoid IP throttling and blocking while enabling scraping from almost any geographical location.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="The Best Datacenter Proxies in 2025: A Complete Guide" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ScrapFly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, and &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt; APIs for data collection at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/anti-scraping-protection" rel="noopener noreferrer"&gt;Anti-bot protection bypass&lt;/a&gt; - scrape web pages without blocking!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;Rotating residential proxies&lt;/a&gt; - prevent IP address and geographic blocks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-rendering" rel="noopener noreferrer"&gt;JavaScript rendering&lt;/a&gt; - scrape dynamic web pages through cloud browsers.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-scenario" rel="noopener noreferrer"&gt;Full browser automation&lt;/a&gt; - control browsers to scroll, input and click on objects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/getting-started#api_param_format" rel="noopener noreferrer"&gt;Format conversion&lt;/a&gt; - scrape as HTML, JSON, Text, or Markdown.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/sdk/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/sdk/typescript" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; SDKs, as well as &lt;a href="https://scrapfly.io/docs/sdk/scrapy" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/integration/getting-started" rel="noopener noreferrer"&gt;no-code tool integrations&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is how we can use ScrapFly proxies to avoid web scraping blocks. All we have to do is select a proxy pool and enable the &lt;code&gt;asp&lt;/code&gt; argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapfly import ScrapeConfig, ScrapflyClient, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

api_response: ScrapeApiResponse = scrapfly.scrape(
    ScrapeConfig(
        # target website URL
        url="https://www.leboncoin.fr",
        # select a proxy pool (residential or datacenter)
        proxy_pool="public_residential_pool",
        # Set the proxy location to a specific country
        country="FR",
        # JavaScript rendering, similar to headless browsers
        render_js=True,
        # Bypass anti scraping protection
        asp=True
    )
)
# Print the website's status code
print(api_response.upstream_status_code)
# 200

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://scrapfly.io/register" rel="noopener noreferrer"&gt;Try for FREE!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs" rel="noopener noreferrer"&gt;More on Scrapfly&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Below are quick answers to common questions about datacenter proxies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Are datacenter proxies safe to use?
&lt;/h3&gt;

&lt;p&gt;Yes, when purchased from reliable providers like IPRoyal or Razorproxy, datacenter proxies are safe and legal for most use cases like scraping and automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can I use free datacenter proxies?
&lt;/h3&gt;

&lt;p&gt;You can, but free proxies often come with slower speeds, limited uptime, and higher ban rates. Use them for testing or low-risk activities only.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are good proxy websites to buy from?
&lt;/h3&gt;

&lt;p&gt;Some good proxy websites include IPRoyal, Razorproxy, Smartproxy, and Bright Data. Choose based on your budget and technical needs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Datacenter proxies remain a powerful tool in 2025 for scraping, automation, and online tasks that demand speed and scalability. They’re cost-effective, fast, and easy to deploy, especially for users who don’t need the complexity or price tag of residential proxies.&lt;/p&gt;

&lt;p&gt;Whether you're using shared proxies for basic scraping or dedicated proxies for high-stakes automation, providers like IPRoyal and Razorproxy offer flexible plans to suit every need.&lt;/p&gt;

</description>
      <category>proxies</category>
    </item>
    <item>
      <title>A Comprehensive Guide to TikTok API</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Thu, 20 Mar 2025 22:54:18 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/a-comprehensive-guide-to-tiktok-api-49f6</link>
      <guid>https://dev.to/scrapfly_dev/a-comprehensive-guide-to-tiktok-api-49f6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fguide-to-tiktok-api_banner_light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fguide-to-tiktok-api_banner_light.svg" alt="A Comprehensive Guide to TikTok API" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.tiktok.com" rel="noopener noreferrer"&gt;TikTok&lt;/a&gt; has rapidly grown into one of the most popular social media platforms, attracting users and businesses alike. As the platform’s reach expands, so does the demand for data and insights.&lt;/p&gt;

&lt;p&gt;TikTok offers several official APIs that provide developers with tools to integrate TikTok functionalities into their applications and access TikTok data. This guide explores the available TikTok APIs, their use cases, and alternative methods for obtaining TikTok data.&lt;/p&gt;

&lt;p&gt;Legal Disclaimer and Precautions&lt;/p&gt;

&lt;p&gt;This tutorial covers popular web scraping techniques for education. Interacting with public servers requires diligence and respect, and here's a good summary of what not to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Do not scrape at rates that could damage the website.&lt;/li&gt;
&lt;li&gt;  Do not scrape data that's not available publicly.&lt;/li&gt;
&lt;li&gt;  Do not store PII of EU citizens who are protected by GDPR.&lt;/li&gt;
&lt;li&gt;  Do not repurpose the &lt;em&gt;entire&lt;/em&gt; public datasets which can be illegal in some countries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapfly does not offer legal advice, but these are good general rules to follow in web scraping; for anything more, you should consult a lawyer.&lt;/p&gt;
&lt;h2&gt;
  
  
  What TikTok APIs Are Available?
&lt;/h2&gt;

&lt;p&gt;TikTok offers several APIs designed to serve different needs. Below, each API is explained in detail:&lt;/p&gt;
&lt;h3&gt;
  
  
  TikTok Login Kit
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.tiktok.com/doc/login-kit-overview" rel="noopener noreferrer"&gt;TikTok Login Kit&lt;/a&gt; allows users to log in to third-party applications using their TikTok credentials. This API simplifies authentication and helps developers personalize user experiences by integrating TikTok accounts seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Enables TikTok-based authentication.&lt;/li&gt;
&lt;li&gt;  Access to basic user profile information.&lt;/li&gt;
&lt;li&gt;  Secure token-based login.&lt;/li&gt;
&lt;li&gt;  Customizable user consent flow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
An e-commerce app uses the TikTok Login Kit to allow users to log in with their TikTok credentials. This streamlines registration and enables the app to offer personalized recommendations based on the user's TikTok activity.&lt;/p&gt;
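&lt;p&gt;The first step of that login flow can be sketched in Python. This is a minimal illustration, assuming the v2 authorize endpoint from TikTok's Login Kit documentation; the client key, redirect URI, and state value below are placeholders, and the URL is only constructed, never requested:&lt;/p&gt;

```python
from urllib.parse import urlencode

def build_login_url(client_key, redirect_uri, state):
    """Build the Login Kit authorization URL a user is first redirected to."""
    params = {
        "client_key": client_key,      # issued when the app is registered
        "scope": "user.info.basic",    # basic profile scope
        "response_type": "code",       # request an authorization code
        "redirect_uri": redirect_uri,  # must match the registered URI
        "state": state,                # CSRF-protection token
    }
    return "https://www.tiktok.com/v2/auth/authorize/?" + urlencode(params)

# placeholder values; a real app would use its registered credentials
print(build_login_url("CLIENT_KEY", "https://example.com/callback", "xyz123"))
```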
&lt;h3&gt;
  
  
  TikTok Share Kit
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.tiktok.com/doc/share-kit-ios-quickstart-v2" rel="noopener noreferrer"&gt;TikTok Share Kit&lt;/a&gt; enables users to share content from third-party apps directly to TikTok. It supports sharing videos, hashtags, captions, and more, making content creation and engagement more fluid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Share video content directly to TikTok.&lt;/li&gt;
&lt;li&gt;  Pre-fill hashtags and captions.&lt;/li&gt;
&lt;li&gt;  Multi-platform compatibility (iOS, Android).&lt;/li&gt;
&lt;li&gt;  Real-time content preview.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A video editing app integrates the Share Kit to allow users to post their edited videos to TikTok with pre-filled trending hashtags and captions, increasing user engagement and app retention.&lt;/p&gt;
&lt;h3&gt;
  
  
  Content Posting API
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.tiktok.com/doc/content-posting-api-get-started" rel="noopener noreferrer"&gt;The Content Posting API&lt;/a&gt; provides functionality for developers to automate the uploading of videos to TikTok. This API is particularly useful for businesses managing multiple accounts or scheduling content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Automate video uploads.&lt;/li&gt;
&lt;li&gt;  Schedule posts for optimal times.&lt;/li&gt;
&lt;li&gt;  Upload as drafts or publish directly.&lt;/li&gt;
&lt;li&gt;  Multi-account management.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A digital marketing agency uses the Content Posting API to manage TikTok campaigns for multiple clients, ensuring posts are published during peak engagement times.&lt;/p&gt;
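&lt;p&gt;The first request of that upload flow might be assembled as below. The endpoint path and field names follow TikTok's Content Posting API documentation as best understood here, so treat them as assumptions; the token, size, and title are placeholders, and the request is only built, not sent:&lt;/p&gt;

```python
import json

def build_video_init_request(access_token, video_size, title):
    """Assemble (without sending) the video-init request that starts an upload."""
    body = {
        "post_info": {
            "title": title,
            "privacy_level": "SELF_ONLY",  # keep the upload visible only to the author
        },
        "source_info": {
            "source": "FILE_UPLOAD",
            "video_size": video_size,
            "chunk_size": video_size,      # single-chunk upload
            "total_chunk_count": 1,
        },
    }
    return {
        "url": "https://open.tiktokapis.com/v2/post/publish/video/init/",
        "headers": {
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json; charset=UTF-8",
        },
        "body": json.dumps(body),
    }

request = build_video_init_request("ACCESS_TOKEN", 1048576, "My first API upload")
print(request["url"])
```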
&lt;h3&gt;
  
  
  Data Portability API
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.tiktok.com/doc/data-portability-api-get-started" rel="noopener noreferrer"&gt;The Data Portability API&lt;/a&gt; facilitates the transfer of user data between TikTok and third-party applications, ensuring compliance with data privacy regulations and enhancing user control over their information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Secure data transfer for user requests.&lt;/li&gt;
&lt;li&gt;  GDPR compliance for European users.&lt;/li&gt;
&lt;li&gt;  Access to user activity history and account data.&lt;/li&gt;
&lt;li&gt;  Token-based authentication for secure interactions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A fitness app allows users to import their TikTok activity data through the Data Portability API to create customized workout challenges based on trending TikTok fitness trends.&lt;/p&gt;
&lt;h3&gt;
  
  
  Display API
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.tiktok.com/doc/display-api-overview" rel="noopener noreferrer"&gt;The Display API&lt;/a&gt; allows developers to get basic TikTok profile info and content, such as videos and user feeds. This API is ideal for showcasing TikTok trends and enhancing content discoverability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Read a user's profile info (open id, avatar, display name, ...).&lt;/li&gt;
&lt;li&gt;  Read a user's public videos on TikTok.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A news website displays trending TikTok videos on its homepage using the Display API, providing readers with engaging, real-time multimedia content.&lt;/p&gt;
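&lt;p&gt;A request to the Display API's user-info endpoint can be prepared like so. The endpoint path and field names follow TikTok's Display API documentation; the access token is a placeholder, and the request is constructed but deliberately not sent:&lt;/p&gt;

```python
import urllib.request

def build_user_info_request(access_token):
    """Prepare (but do not send) a Display API user-info request."""
    fields = "open_id,union_id,avatar_url,display_name"  # basic profile fields
    return urllib.request.Request(
        "https://open.tiktokapis.com/v2/user/info/?fields=" + fields,
        headers={"Authorization": "Bearer " + access_token},
    )

request = build_user_info_request("ACCESS_TOKEN")
print(request.full_url)
```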
&lt;h3&gt;
  
  
  Research API
&lt;/h3&gt;

&lt;p&gt;TikTok’s &lt;a href="https://developers.tiktok.com/doc/about-research-api" rel="noopener noreferrer"&gt;Research API&lt;/a&gt; is a specialized tool for academic and market research. It enables researchers to access anonymized and aggregated data for studying user behavior, trends, and platform dynamics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Access anonymized user data.&lt;/li&gt;
&lt;li&gt;  Analyze hashtag performance and trends.&lt;/li&gt;
&lt;li&gt;  Retrieve aggregated data on specific topics.&lt;/li&gt;
&lt;li&gt;  Compliance with data privacy laws.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
A university research team uses the Research API to study the impact of TikTok challenges on adolescent mental health, utilizing anonymized data to maintain privacy.&lt;/p&gt;
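&lt;p&gt;A Research API video query could be assembled as below. The query grammar (boolean operators, field names, YYYYMMDD date strings) follows TikTok's Research API documentation, but the hashtag, dates, and token here are illustrative placeholders and nothing is actually sent:&lt;/p&gt;

```python
import json

def build_research_query(access_token, hashtag, start_date, end_date):
    """Assemble a Research API video query for a single hashtag."""
    body = {
        "query": {
            "and": [
                {
                    "operation": "IN",
                    "field_name": "hashtag_name",
                    "field_values": [hashtag],
                },
            ],
        },
        "start_date": start_date,  # YYYYMMDD
        "end_date": end_date,      # YYYYMMDD
        "max_count": 100,          # results per page
    }
    return {
        "url": "https://open.tiktokapis.com/v2/research/video/query/"
               "?fields=id,like_count,hashtag_names",
        "headers": {
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }

query = build_research_query("ACCESS_TOKEN", "fitness", "20240101", "20240131")
print(query["url"])
```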
&lt;h3&gt;
  
  
  Commercial Content API
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://developers.tiktok.com/doc/commercial-content-api-getting-started" rel="noopener noreferrer"&gt;Commercial Content API&lt;/a&gt; is considered part for the research tools tiktok offer. It provides access to public advertiser data.&lt;/p&gt;

&lt;p&gt;For example, you can query the TikTok ads created in Italy between January 2, 2021 and January 9, 2021 with the keyword "coffee".&lt;/p&gt;
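&lt;p&gt;That example query might be assembled like this. The endpoint path and filter names are taken from TikTok's Commercial Content API documentation as best understood here, so treat them as assumptions; the token is a placeholder, and the request is only built, never sent:&lt;/p&gt;

```python
import json

def build_ad_query(access_token):
    """Assemble the 'coffee ads in Italy' query described above (not sent)."""
    body = {
        "filters": {
            # dates as YYYYMMDD strings
            "ad_published_date_range": {"min": "20210102", "max": "20210109"},
            "country_code": "IT",
        },
        "search_term": "coffee",
        "max_count": 20,  # results per page
    }
    return {
        "url": "https://open.tiktokapis.com/v2/research/adlib/ad/query/"
               "?fields=ad.id,ad.first_shown_date",
        "headers": {
            "Authorization": "Bearer " + access_token,
            "Content-Type": "application/json",
        },
        "body": json.dumps(body),
    }

print(build_ad_query("ACCESS_TOKEN")["url"])
```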
&lt;h3&gt;
  
  
  TikTok Business API
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://business-api.tiktok.com/portal" rel="noopener noreferrer"&gt;TikTok Business API&lt;/a&gt; is a robust tool for brands and advertisers. It allows businesses to create, manage, and optimize ad campaigns, providing comprehensive tools for audience targeting and analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Campaign creation and management.&lt;/li&gt;
&lt;li&gt;  Audience segmentation and targeting.&lt;/li&gt;
&lt;li&gt;  Performance tracking and reporting.&lt;/li&gt;
&lt;li&gt;  Automated ad placement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
An e-commerce retailer uses the TikTok Business API to launch a series of targeted ad campaigns, optimizing for conversions based on real-time performance data.&lt;/p&gt;

&lt;p&gt;However, this API is strictly geared towards business-related functionalities. It does not provide access to raw public TikTok data. For those needs, researchers often turn to the TikTok Research API.&lt;/p&gt;
&lt;h2&gt;
  
  
  TikTok API Alternative – Web Scraping
&lt;/h2&gt;

&lt;p&gt;For those unable to access TikTok’s official Research API, web scraping can serve as an alternative. This method involves extracting data directly from TikTok’s publicly accessible web pages using automated scripts. While effective, it comes with several caveats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Legal Risks&lt;/strong&gt;: Scraping may violate TikTok’s terms of service and could result in legal consequences.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Technical Challenges&lt;/strong&gt;: TikTok employs anti-scraping mechanisms such as CAPTCHA and dynamic content loading, which require advanced techniques to bypass.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ethical Considerations&lt;/strong&gt;: Respecting user privacy and adhering to ethical data collection practices is paramount.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At Scrapfly, we are dedicated to providing developers with all the resources they need to reach their scraping goals. Check out our &lt;a href="https://scrapfly.io/blog/how-to-scrape-tiktok-python-json/" rel="noopener noreferrer"&gt;comprehensive guide on scraping TikTok&lt;/a&gt; as well as our example &lt;a href="https://github.com/scrapfly/scrapfly-scrapers/tree/main/tiktok-scraper" rel="noopener noreferrer"&gt;TikTok scraper&lt;/a&gt; using Scrapfly's APIs on GitHub.&lt;/p&gt;
&lt;h2&gt;
  
  
  Power Up TikTok Scraping with Scrapfly
&lt;/h2&gt;

&lt;p&gt;Since TikTok employs anti-scraping mechanisms to prevent web scrapers from accessing its data, scraping TikTok with traditional approaches is unlikely to succeed.&lt;/p&gt;

&lt;p&gt;ScrapFly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, and &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt; APIs for data collection at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://scrapfly.io/docs/scrape-api/anti-scraping-protection" rel="noopener noreferrer"&gt;Anti-bot protection bypass&lt;/a&gt; - scrape web pages without blocking!&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;Rotating residential proxies&lt;/a&gt; - prevent IP address and geographic blocks.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://scrapfly.io/docs/scrape-api/javascript-rendering" rel="noopener noreferrer"&gt;JavaScript rendering&lt;/a&gt; - scrape dynamic web pages through cloud browsers.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://scrapfly.io/docs/scrape-api/javascript-scenario" rel="noopener noreferrer"&gt;Full browser automation&lt;/a&gt; - control browsers to scroll, input and click on objects.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started#api_param_format" rel="noopener noreferrer"&gt;Format conversion&lt;/a&gt; - scrape as HTML, JSON, Text, or Markdown.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://scrapfly.io/docs/sdk/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/sdk/typescript" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; SDKs, as well as &lt;a href="https://scrapfly.io/docs/sdk/scrapy" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/integration/getting-started" rel="noopener noreferrer"&gt;no-code tool integrations&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a simple Python script that uses Scrapfly's Python SDK to scrape TikTok posts and parse them into JSON data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;jmespath&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;urllib.parse&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;urlencode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;quote&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;urlparse&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parse_qs&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;loguru&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scrapfly&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ScrapeConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScrapflyClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ScrapeApiResponse&lt;/span&gt;

&lt;span class="n"&gt;SCRAPFLY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ScrapflyClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_SCRAPFLY_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;BASE_CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;# bypass tiktok.com web scraping blocking
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;asp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="c1"&gt;# set the proxy country to US
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;parse_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ScrapeApiResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;parse hidden post data from HTML&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xpath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//script[@id=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;__UNIVERSAL_DATA_FOR_REHYDRATION__&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]/text()&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;post_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__DEFAULT_SCOPE__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;webapp.video-detail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;itemInfo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;itemStruct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;parsed_post_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jmespath&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;{
        id: id,
        desc: desc,
        createTime: createTime,
        video: video.{duration: duration, ratio: ratio, cover: cover, playAddr: playAddr, downloadAddr: downloadAddr, bitrate: bitrate},
        author: author.{id: id, uniqueId: uniqueId, nickname: nickname, avatarLarger: avatarLarger, signature: signature, verified: verified},
        stats: stats,
        locationCreated: locationCreated,
        diversificationLabels: diversificationLabels,
        suggestedWords: suggestedWords,
        contents: contents[].{textExtra: textExtra[].{hashtagName: hashtagName}}
        }&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;post_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;parsed_post_data&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_posts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;scrape tiktok posts data from their URLs&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;to_scrape&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;ScrapeConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;BASE_CONFIG&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;SCRAPFLY&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concurrent_scrape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_scrape&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;post_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scraped &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; posts from post pages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;To wrap up our intro to the official TikTok APIs, here are some frequently asked questions that we might not have covered in the article:&lt;/p&gt;

&lt;h4&gt;
  
  
  Can I access TikTok’s Research API as an individual?
&lt;/h4&gt;

&lt;p&gt;Typically, no. Access is restricted to accredited researchers or organizations that meet specific criteria.&lt;/p&gt;

&lt;h4&gt;
  
  
  What are the requirements to access the TikTok APIs?
&lt;/h4&gt;

&lt;p&gt;Accessing TikTok APIs typically requires developers to create an account on the TikTok Developers Portal and register their application. Some APIs, like the Research API, require additional approval, which includes meeting criteria such as being an accredited researcher or organization, demonstrating a legitimate purpose, and adhering to privacy guidelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  Can I use the APIs to fetch public TikTok data?
&lt;/h4&gt;

&lt;p&gt;Most TikTok APIs, such as the Business API and Research API, are not designed to provide raw public data. Access to public data via the Research API is limited to approved researchers and comes with strict data privacy and usage guidelines. Developers looking for public data may consider alternatives like web scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;TikTok’s APIs offer a range of tools for developers, businesses, and researchers, but access and functionality are often limited by stringent requirements. For those looking to extract large-scale data, the Research API is the most suitable option but comes with access restrictions.&lt;/p&gt;

&lt;p&gt;Web scraping remains an alternative, albeit with significant risks and limitations. Understanding these tools and their boundaries is essential for making informed decisions about TikTok data integration and analysis.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>GPT Crawler: The AI Training Data Collection Guide</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Thu, 20 Mar 2025 22:41:47 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/gpt-crawler-the-ai-training-data-collection-guide-3j77</link>
      <guid>https://dev.to/scrapfly_dev/gpt-crawler-the-ai-training-data-collection-guide-3j77</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fgpt-crawler-white-bg-1.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fgpt-crawler-white-bg-1.svg" alt="GPT Crawler: The AI Training Data Collection Guide" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;GPT Crawler is a powerful, specialized tool designed to automate web data collection specifically for training large language models (LLMs) like ChatGPT. In today's AI development landscape, high-quality training data is essential, but obtaining it can be challenging and time-consuming.&lt;/p&gt;

&lt;p&gt;This guide provides a comprehensive walkthrough of GPT Crawler's capabilities, showing AI developers and researchers how to efficiently gather diverse, contextually-rich web content to enhance their language models' performance.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is GPT Crawler?
&lt;/h2&gt;

&lt;p&gt;GPT Crawler distinguishes itself from traditional web scraping tools by focusing specifically on AI training data collection. Unlike general-purpose scrapers, GPT Crawler was built from the ground up with machine learning requirements in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Features of GPT Crawler
&lt;/h2&gt;

&lt;p&gt;GPT Crawler has gained popularity among AI developers due to its powerful capabilities that streamline the data collection process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intelligent Content Extraction
&lt;/h3&gt;

&lt;p&gt;Intelligent content extraction is a core feature of GPT Crawler, enabling it to extract relevant text and metadata from web pages effectively. Key capabilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic parsing&lt;/strong&gt; that understands document structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content quality assessment&lt;/strong&gt; to filter low-value text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata preservation&lt;/strong&gt; for better context understanding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-format support&lt;/strong&gt; including HTML, JavaScript-rendered content, and PDFs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's look at how GPT Crawler handles content extraction in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability and Performance
&lt;/h3&gt;

&lt;p&gt;GPT Crawler is designed to handle large-scale data collection tasks efficiently. It offers features that ensure optimal performance and scalability, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed crawling architecture&lt;/strong&gt; for handling large-scale data collection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate limiting and politeness controls&lt;/strong&gt; to respect website resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpoint and resume capabilities&lt;/strong&gt; for long-running crawl jobs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource-efficient operation&lt;/strong&gt; even on modest hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at how these features translate to practical implementation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up GPT Crawler
&lt;/h2&gt;

&lt;p&gt;Getting started with GPT Crawler requires some basic setup. Here's a straightforward process to begin collecting web data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;p&gt;To install GPT Crawler, you will need to clone the repository and install the necessary dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/builderio/gpt-crawler
$ cd gpt-crawler
$ npm install

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will set up the project and install the required packages. Next, you'll need to configure the crawler for your specific data collection needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Configuration
&lt;/h3&gt;

&lt;p&gt;Creating a crawl configuration file is essential for defining what and how you'll crawl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# config.ts
import { Config } from "./src/config";

export const defaultConfig: Config = {
  url: "https://web-scraping.dev/products",
  match: "https://web-scraping.dev/product/**",
  maxPagesToCrawl: 10,
  outputFileName: "output.json",
  maxTokens: 2000000,
};

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the &lt;code&gt;config.ts&lt;/code&gt; file, you define what and how the crawler collects: &lt;code&gt;url&lt;/code&gt; is the starting point of the crawl, &lt;code&gt;match&lt;/code&gt; is a glob pattern selecting which URLs to follow, &lt;code&gt;maxPagesToCrawl&lt;/code&gt; caps the number of pages visited, and &lt;code&gt;outputFileName&lt;/code&gt; names the file where the extracted data will be saved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Running Your First Crawl
&lt;/h3&gt;

&lt;p&gt;With the configuration set up, you can start crawling with just one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ npm run start

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output of the crawler run:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
INFO PlaywrightCrawler: Starting the crawler.
INFO PlaywrightCrawler: Crawling: Page 1 / 10 - URL: https://web-scraping.dev/products...
INFO PlaywrightCrawler: Crawling: Page 2 / 10 - URL: https://web-scraping.dev/product/1...
...
INFO PlaywrightCrawler: Crawling: Page 9 / 10 - URL: https://web-scraping.dev/product/1?variant=orange-large...
INFO PlaywrightCrawler: Crawling: Page 10 / 10 - URL: https://web-scraping.dev/product/1?variant=cherry-small...
INFO PlaywrightCrawler: Crawler reached the maxRequestsPerCrawl limit of 10 requests and will shut down soon. Requests that are in progress will be allowed to finish.
INFO PlaywrightCrawler: Crawling: Page 11 / 10 - URL: https://web-scraping.dev/product/1?variant=cherry-medium...
INFO PlaywrightCrawler: Earlier, the crawler reached the maxRequestsPerCrawl limit of 10 requests and all requests that were in progress at that time have now finished. In total, the crawler processed 11 requests and will shut down.
Found 11 files to combine...
Wrote 11 items to output-1.json

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will start the crawler, and you'll see the progress as it extracts content from the specified URLs. Once the crawl is complete, the extracted data will be saved to the output file you specified in the configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Run with CLI Only
&lt;/h3&gt;

&lt;p&gt;You can also run the crawler with CLI only without the need for a configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ npm run start -- --url https://web-scraping.dev/products --match https://web-scraping.dev/product/** --maxPagesToCrawl 10 --outputFileName output.json --maxTokens 2000000

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command will start the crawler with the specified parameters directly from the command line. It's a convenient way to run the crawler without needing to create a configuration file.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common Challenges and Solutions
&lt;/h2&gt;

&lt;p&gt;When working with GPT Crawler, you may encounter several challenges. Here are practical solutions to the most common issues:&lt;/p&gt;

&lt;h3&gt;
  
  
  Rate Limiting and Blocking
&lt;/h3&gt;

&lt;p&gt;Websites often implement rate limiting and may block IP addresses that send too many requests. To avoid this, consider the following strategies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Implement adaptive rate limiting&lt;/strong&gt; that responds to server response times&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rotate user agents&lt;/strong&gt; to appear less like an automated system&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use proxy rotation&lt;/strong&gt; for large-scale crawling projects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add random delays&lt;/strong&gt; between requests to mimic human browsing patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By implementing these strategies, you can reduce the risk of being rate-limited or blocked while crawling websites.&lt;/p&gt;
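
&lt;p&gt;The random-delay and adaptive rate limiting ideas above can be sketched in Python. This is a minimal illustration, not a GPT Crawler setting; the scaling factor and defaults are arbitrary assumptions:&lt;/p&gt;

```python
import random

def adaptive_delay(last_response_time, base_delay=1.0, jitter=0.5):
    """Pick a wait time that grows with the server's response time
    and adds random jitter to mimic human browsing patterns."""
    # A slow response suggests server load, so back off proportionally;
    # the 2x factor here is an illustrative choice
    delay = base_delay + last_response_time * 2
    return delay + random.uniform(0, jitter)
```

&lt;p&gt;Calling &lt;code&gt;time.sleep(adaptive_delay(elapsed))&lt;/code&gt; between requests spaces them unevenly while still reacting to how loaded the server appears to be.&lt;/p&gt;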

&lt;h3&gt;
  
  
  Content Quality Control
&lt;/h3&gt;

&lt;p&gt;Some web pages contain low-quality or irrelevant content that can negatively impact your training data. To address this, consider the following approaches:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Filter by content length&lt;/strong&gt; to avoid short, low-value pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement language detection&lt;/strong&gt; to focus on content in specific languages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use keyword relevance scoring&lt;/strong&gt; to prioritize topical content&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Detect and skip duplicate or near-duplicate content&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Following these strategies will help you maintain a high-quality dataset for AI training.&lt;/p&gt;
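
&lt;p&gt;The length and duplicate filters above can be sketched in a few lines of Python. The &lt;code&gt;content&lt;/code&gt; field and the threshold are assumptions for illustration:&lt;/p&gt;

```python
import hashlib

def filter_quality(pages, min_length=200):
    """Drop pages that are too short and pages whose normalized
    text exactly duplicates content already seen."""
    seen = set()
    kept = []
    for page in pages:
        text = " ".join(page["content"].split())  # normalize whitespace
        if len(text) < min_length:
            continue  # short, low-value page
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of earlier content
        seen.add(digest)
        kept.append(page)
    return kept
```

&lt;p&gt;For near-duplicate (rather than exact) detection you would swap the hash for a similarity measure such as MinHash, at the cost of more bookkeeping.&lt;/p&gt;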

&lt;h4&gt;
  
  
  Cleaning Extracted Data
&lt;/h4&gt;

&lt;p&gt;Extracted data may contain unwanted elements like ads, navigation links, or boilerplate text. To clean the data effectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import re

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove non-alphanumeric characters
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    # Add more cleaning operations as needed

    return text

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This Python function uses regular expressions to clean the extracted text by removing URLs, non-alphanumeric characters, and extra whitespace. You can customize this function further based on your specific data cleaning requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparing Crawled Data for AI Training
&lt;/h2&gt;

&lt;p&gt;Once you've collected your data, proper formatting is crucial for effective AI training:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clean and normalize text&lt;/strong&gt; to remove inconsistencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apply tokenization&lt;/strong&gt; compatible with your target LLM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure the data&lt;/strong&gt; in the format required by your training pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create train/validation splits&lt;/strong&gt; for proper model evaluation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a simple example of preparing the collected data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import random
from sklearn.model_selection import train_test_split

# Load the crawled data
with open("training_data.jsonl", "r") as f:
    data = [json.loads(line) for line in f]

# Basic text cleaning
cleaned_data = []
for item in data:
    text = item["content"]
    # Remove excessive whitespace
    text = " ".join(text.split())
    # Other cleaning operations...

    cleaned_data.append({
        "text": text,
        "metadata": item["metadata"]
    })

# Create train/validation split
train_data, val_data = train_test_split(cleaned_data, test_size=0.1, random_state=42)

# Save in a format suitable for LLM training
with open("train_data.jsonl", "w") as f:
    for item in train_data:
        f.write(json.dumps(item) + "\n")

with open("val_data.jsonl", "w") as f:
    for item in val_data:
        f.write(json.dumps(item) + "\n")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above Python script, we load the crawled data, clean the text content, and create a train/validation split. Finally, we save the cleaned data in a format suitable for training an LLM.&lt;/p&gt;

&lt;p&gt;For a comprehensive guide on the difference between the &lt;code&gt;json&lt;/code&gt; and &lt;code&gt;jsonl&lt;/code&gt; file formats, check out our article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/jsonl-vs-json/" rel="noopener noreferrer"&gt;JSONL vs JSON&lt;/a&gt;: Learn the differences between JSON and JSONLines, their use cases, and efficiency. Why JSONLines excels in web scraping and real-time processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPT Crawler vs. Alternative Tools
&lt;/h2&gt;

&lt;p&gt;GPT Crawler offers unique advantages for AI training data collection, but it's essential to consider how it compares to alternative tools. Here's a comparison of GPT Crawler with other popular web scraping and data collection tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;GPT Crawler&lt;/th&gt;
&lt;th&gt;Scrapy&lt;/th&gt;
&lt;th&gt;Beautiful Soup&lt;/th&gt;
&lt;th&gt;Playwright&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Focus&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI training data&lt;/td&gt;
&lt;td&gt;General web scraping&lt;/td&gt;
&lt;td&gt;HTML parsing&lt;/td&gt;
&lt;td&gt;Browser automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JavaScript Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Requires add-ons&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ease of Setup&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Complex&lt;/td&gt;
&lt;td&gt;Simple&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Content Quality Filtering&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Advanced&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Token Counting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Learning Curve&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Steep&lt;/td&gt;
&lt;td&gt;Gentle&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;GPT Crawler's focus on AI training data collection, built-in JavaScript support, and content quality filtering set it apart from other tools. While &lt;a href="https://scrapfly.io/blog/web-scraping-with-scrapy/" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/blog/web-scraping-with-python-beautifulsoup/" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt; are more general-purpose web scraping tools, &lt;a href="https://scrapfly.io/blog/web-scraping-with-playwright-and-python/" rel="noopener noreferrer"&gt;Playwright&lt;/a&gt; offers browser automation capabilities similar to GPT Crawler.&lt;/p&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;Now, let's address some common questions about GPT Crawler:&lt;/p&gt;

&lt;h3&gt;
  
  
  Is GPT Crawler open source?
&lt;/h3&gt;

&lt;p&gt;Yes, GPT Crawler is available as an open-source project under the MIT license. This allows developers to freely use, modify, and contribute to the codebase while building their own specialized data collection solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does GPT Crawler compare to Scrapy?
&lt;/h3&gt;

&lt;p&gt;GPT Crawler is specifically optimized for AI training data collection with built-in semantic processing and quality filtering, while Scrapy is a more general-purpose web scraping framework. GPT Crawler requires less configuration for AI-specific tasks but has fewer customization options than Scrapy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can GPT Crawler handle content behind login pages?
&lt;/h3&gt;

&lt;p&gt;Yes, GPT Crawler supports authenticated crawling through its browser automation features. You can configure login credentials and actions in the browser settings to access content that requires authentication before collection begins.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;GPT Crawler represents a significant advancement in specialized data collection for AI training. By focusing on high-quality, contextually-relevant content extraction, it addresses many of the challenges faced by AI researchers and developers in gathering suitable training data.&lt;/p&gt;

&lt;p&gt;Whether you're building a domain-specific model or enhancing an existing LLM with specialized knowledge, GPT Crawler provides the tools needed to efficiently collect and process web data for AI training purposes.&lt;/p&gt;

&lt;p&gt;As the field of AI continues to evolve, tools like GPT Crawler will play an increasingly important role in helping developers access the high-quality data needed to train the next generation of language models.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webcrawling</category>
    </item>
    <item>
      <title>How to Choose the Best Proxy Unblocker?</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Fri, 14 Mar 2025 23:01:58 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/how-to-choose-the-best-proxy-unblocker-3b7j</link>
      <guid>https://dev.to/scrapfly_dev/how-to-choose-the-best-proxy-unblocker-3b7j</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fproxy-unblocker-white-bg.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2F2025%2F03%2Fproxy-unblocker-white-bg.webp" alt="How to Choose the Best Proxy Unblocker?" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The internet is full of restrictions, from workplace firewalls to geo-blocked websites. If you've seen &lt;em&gt;"You seem to be using an unblocker or proxy"&lt;/em&gt;, you know the frustration. Luckily, proxy unblockers help bypass these barriers with ease.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll break down everything you need to know about proxies, including how they change your IP, why websites block access, and how you can save bandwidth with a proxy saver.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Proxy Unblocker?
&lt;/h2&gt;

&lt;p&gt;A proxy unblocker is a tool that allows users to bypass website restrictions by routing their internet traffic through a different IP address. Proxies act as intermediaries between your device and the website you’re trying to access. Instead of connecting directly to a site, your request goes through a proxy server, which then fetches the content on your behalf.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Proxy?
&lt;/h3&gt;

&lt;p&gt;At its core, a proxy is any server that acts as a gateway between your device and the internet. There are different types of proxies, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Proxy:&lt;/strong&gt; A protocol designed for web traffic, handling only HTTP/HTTPS requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDP Proxies:&lt;/strong&gt; Designed for connectionless traffic, often used for gaming, VoIP, and real-time streaming applications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residential Proxies:&lt;/strong&gt; These use IP addresses assigned by ISPs to real homes. They are harder to detect because they mimic genuine user traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a detailed explanation, check out our blog:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/introduction-to-proxies-in-web-scraping/" rel="noopener noreferrer"&gt;The Complete Guide To Using Proxies For Web Scraping&lt;/a&gt;: Introduction to proxy usage in web scraping. What types of proxies are there? How to evaluate proxy providers and avoid common issues.&lt;/p&gt;

&lt;p&gt;Now that we understand proxies, let's explore how they actually change your IP address and help you unblock sites.&lt;/p&gt;
&lt;h2&gt;
  
  
  How Proxies Change IP
&lt;/h2&gt;

&lt;p&gt;Websites track and identify users through their &lt;strong&gt;IP addresses&lt;/strong&gt;. If a website blocks an IP, that device can no longer access the site. A proxy works by replacing your real IP address with a new one, making it appear as if your request is coming from a different location.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example: Changing IP with a Proxy
&lt;/h3&gt;

&lt;p&gt;Here’s a simple Python example using &lt;a href="https://pypi.org/project/requests/" rel="noopener noreferrer"&gt;requests&lt;/a&gt; and a proxy to access a blocked site:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

# Define the proxy
proxies = {
    "http": "http://your-proxy-ip:port",
    "https": "https://your-proxy-ip:port"
}

# Make a request through the proxy
response = requests.get("https://web-scraping.dev/", proxies=proxies)

# Print the response
print(response.text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we use a proxy server to fetch the webpage instead of directly connecting. If the website had blocked our real IP, this method would help us regain access.&lt;/p&gt;
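
&lt;p&gt;Extending the snippet above, rotating through several proxies spreads requests across IPs so no single address accumulates too much traffic. A minimal sketch (the proxy addresses are placeholders, not real endpoints):&lt;/p&gt;

```python
import itertools

# Placeholder proxy endpoints - substitute your provider's addresses
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
]
_proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    """Return a requests-style proxies dict using the next IP in the pool."""
    proxy = next(_proxy_cycle)
    return {"http": proxy, "https": proxy}

# Each call picks a new IP, e.g.:
# requests.get("https://web-scraping.dev/", proxies=next_proxies())
```

&lt;p&gt;A round-robin cycle is the simplest policy; production scrapers often also retire proxies that start returning blocks.&lt;/p&gt;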

&lt;p&gt;Now that we know how proxies change IPs, let's talk about their costs and efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxy Costs
&lt;/h2&gt;

&lt;p&gt;Using a proxy service isn’t always free. While free proxy unblockers exist, they come with risks such as slow speeds, data logging, or unreliable connections. On the other hand, premium proxies offer stability and security but at a cost.&lt;/p&gt;

&lt;h3&gt;
  
  
  Free vs. Paid Proxies
&lt;/h3&gt;

&lt;p&gt;Free proxies may be tempting, but they often come with slow speeds, security risks, and limits. Paid proxies offer better performance, security, and reliability. Here's a quick comparison:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Free Proxy&lt;/th&gt;
&lt;th&gt;Paid Proxy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Fast &amp;amp; Reliable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Security&lt;/td&gt;
&lt;td&gt;Risky (may log data)&lt;/td&gt;
&lt;td&gt;High (no logging)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IP Rotation&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Frequent changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Usage Limits&lt;/td&gt;
&lt;td&gt;Often restricted&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If cost is a concern, proxy savers can help reduce expenses by optimizing bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Save Proxy Bandwidth with Proxy Saver
&lt;/h2&gt;

&lt;p&gt;A proxy saver helps you optimize proxy usage, reducing data consumption and costs. Many users burn through proxies too quickly, leading to unnecessary expenses. By implementing a few smart techniques, you can extend the life of your proxies and improve efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Reduce Proxy Bandwidth Usage
&lt;/h3&gt;

&lt;p&gt;Efficient proxy usage helps reduce costs and improve performance. Here are some effective ways to save bandwidth:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache Responses:&lt;/strong&gt; Store frequently accessed pages locally to reduce proxy requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Compression:&lt;/strong&gt; Enable gzip compression to shrink data size and speed up loading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Requests:&lt;/strong&gt; Fetch only necessary elements (e.g., text or API data) instead of full pages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using these strategies helps maximize proxy efficiency. Now, let’s explore why websites block users.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Causes Blocking?
&lt;/h2&gt;

&lt;p&gt;Websites block users for various reasons, often to enforce security, prevent abuse, or comply with regional restrictions. Understanding these blocking methods can help you choose the best way to bypass them.&lt;/p&gt;

&lt;h3&gt;
  
  
  IP-Based Blocking
&lt;/h3&gt;

&lt;p&gt;Websites track users by their &lt;strong&gt;IP addresses&lt;/strong&gt;. If too many requests come from the same IP, the site may flag it as suspicious and block access. This often happens with shared proxies or when trying to reach blocked websites from restricted locations.&lt;/p&gt;

&lt;h3&gt;
  
  
  User-Agent Blocking
&lt;/h3&gt;

&lt;p&gt;Some sites restrict access based on &lt;strong&gt;browser user-agent data&lt;/strong&gt;, which identifies the browser and device you’re using. Automated scripts or outdated user-agents can trigger blocks, preventing access to certain websites.&lt;/p&gt;
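
&lt;p&gt;A common mitigation is to rotate the user-agent on each request. A minimal sketch (the user-agent strings below are illustrative examples of realistic desktop browsers):&lt;/p&gt;

```python
import random

# Illustrative pool of realistic desktop user-agent strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:123.0) Gecko/20100101 Firefox/123.0",
]

def random_headers():
    """Build request headers with a randomly chosen user-agent, so
    consecutive requests do not share an identical browser fingerprint."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

&lt;p&gt;Pass the result as &lt;code&gt;headers=random_headers()&lt;/code&gt; on each request; keeping the pool current matters, since outdated strings are themselves a block signal.&lt;/p&gt;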

&lt;h3&gt;
  
  
  Geo-Restrictions
&lt;/h3&gt;

&lt;p&gt;Streaming services, news sites, and other platforms often block content based on &lt;strong&gt;geolocation&lt;/strong&gt;. If a website isn’t available in your country, using a &lt;strong&gt;residential proxy&lt;/strong&gt; or &lt;strong&gt;VPN&lt;/strong&gt; can help bypass these restrictions.&lt;/p&gt;

&lt;p&gt;Here’s a table summarizing the main causes of website blocking and possible solutions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Blocking Method&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;How It Works&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IP-Based Blocking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Too many requests from the same IP trigger a block.&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;rotating or residential proxies&lt;/strong&gt; to change IPs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User-Agent Blocking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sites restrict access based on browser/device data.&lt;/td&gt;
&lt;td&gt;Use &lt;strong&gt;updated user-agents&lt;/strong&gt; or &lt;strong&gt;browser emulation.&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Geo-Restrictions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Content is blocked based on location.&lt;/td&gt;
&lt;td&gt;Use a &lt;strong&gt;VPN&lt;/strong&gt; or &lt;strong&gt;residential proxy&lt;/strong&gt; to bypass restrictions.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Understanding these blocking methods helps you choose the best way to regain access. Now, let’s explore other ways to unblock websites.&lt;/p&gt;

&lt;h2&gt;
  
  
  Other Ways to Unblock Websites
&lt;/h2&gt;

&lt;p&gt;While proxies are effective tools for bypassing website restrictions, several alternative methods offer unique advantages depending on your specific needs. Here's a comprehensive overview of other powerful unblocking techniques:&lt;/p&gt;

&lt;h3&gt;
  
  
  VPNs – Complete Network Protection
&lt;/h3&gt;

&lt;p&gt;Virtual Private Networks provide a more comprehensive solution than standard proxies by encrypting your entire internet connection. Unlike proxies that only redirect specific web requests, VPNs route all your device's traffic through an encrypted tunnel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;End-to-end encryption protects all internet traffic&lt;/li&gt;
&lt;li&gt;Device-level protection covers all applications&lt;/li&gt;
&lt;li&gt;Stronger security protocols (OpenVPN, WireGuard, IKEv2)&lt;/li&gt;
&lt;li&gt;Kill switch features prevent data leaks if connection drops&lt;/li&gt;
&lt;li&gt;No-logs policies with premium providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;VPNs are ideal for privacy-conscious users, accessing geo-restricted streaming services, and protecting sensitive communications on public networks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Smart DNS Services – Optimized for Streaming
&lt;/h3&gt;

&lt;p&gt;Smart DNS solutions specifically target geo-restrictions while maintaining full connection speeds:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No encryption overhead results in faster connections than VPNs&lt;/li&gt;
&lt;li&gt;Selective routing only redirects DNS queries for geo-blocked sites&lt;/li&gt;
&lt;li&gt;Wide device compatibility including smart TVs and gaming consoles&lt;/li&gt;
&lt;li&gt;Easier configuration with no software installation on many platforms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These services work particularly well for streaming enthusiasts who prioritize speed and device compatibility over comprehensive privacy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Browser Extensions – Lightweight Solutions
&lt;/h3&gt;

&lt;p&gt;Modern browser extensions offer convenient and targeted unblocking capabilities:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One-click activation for immediate access to blocked content&lt;/li&gt;
&lt;li&gt;Selective protection lets you choose which sites use the proxy&lt;/li&gt;
&lt;li&gt;Resource efficiency with minimal performance impact&lt;/li&gt;
&lt;li&gt;Additional privacy features like tracker blocking and cookie management&lt;/li&gt;
&lt;li&gt;WebRTC leak protection prevents accidental IP exposure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Extensions work best for casual users who need occasional access to blocked websites without complex setup procedures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced DNS Techniques
&lt;/h3&gt;

&lt;p&gt;DNS-based solutions provide a simple yet effective approach to bypassing certain restrictions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Public DNS servers (Google 8.8.8.8, Cloudflare 1.1.1.1) bypass ISP DNS blocks&lt;/li&gt;
&lt;li&gt;DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) encrypt DNS requests&lt;/li&gt;
&lt;li&gt;Network-wide protection with solutions like Pi-hole&lt;/li&gt;
&lt;li&gt;Can overcome basic censorship with minimal configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These techniques are perfect for bypassing basic ISP-level blocking with minimal performance impact.&lt;/p&gt;
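
&lt;p&gt;As an example of these techniques, Cloudflare's public resolver exposes a DNS-over-HTTPS endpoint that returns JSON and can be queried directly. The sketch below builds such a query with &lt;code&gt;requests&lt;/code&gt;; the helper names are ours:&lt;/p&gt;

```python
import requests

DOH_ENDPOINT = "https://cloudflare-dns.com/dns-query"

def build_doh_request(name, record_type="A"):
    """Assemble the URL, query params, and headers for a DoH JSON query."""
    params = {"name": name, "type": record_type}
    headers = {"accept": "application/dns-json"}  # ask for the JSON format
    return DOH_ENDPOINT, params, headers

def doh_resolve(name, record_type="A"):
    """Resolve a hostname over HTTPS, bypassing the ISP's DNS resolver."""
    url, params, headers = build_doh_request(name, record_type)
    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()
    # Resolved addresses live in each answer record's "data" field
    return [answer["data"] for answer in response.json().get("Answer", [])]
```

&lt;p&gt;Because the lookup travels over HTTPS, an ISP that blocks or rewrites plain port-53 DNS traffic cannot see or interfere with the query.&lt;/p&gt;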

&lt;h3&gt;
  
  
  SSH Tunneling – Technical but Powerful
&lt;/h3&gt;

&lt;p&gt;For tech-savvy users, SSH tunneling offers high security with low visibility:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates SOCKS proxies through encrypted SSH connections&lt;/li&gt;
&lt;li&gt;Enables port forwarding for specific applications&lt;/li&gt;
&lt;li&gt;Lower detection risk compared to commercial VPN services&lt;/li&gt;
&lt;li&gt;Highly customizable for specific technical requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SSH tunneling is best suited for technical users requiring secure access to specific services or applications.&lt;/p&gt;
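
&lt;p&gt;The SOCKS-over-SSH setup above boils down to one command: &lt;code&gt;ssh -N -D &amp;lt;port&amp;gt; user@host&lt;/code&gt;, where &lt;code&gt;-D&lt;/code&gt; opens a local dynamic (SOCKS) forward and &lt;code&gt;-N&lt;/code&gt; skips running a remote shell. A small Python sketch that assembles it (the helper name and hostnames are illustrative):&lt;/p&gt;

```python
import shlex

def ssh_socks_command(user, host, local_port=1080):
    """Build the ssh command that opens a local SOCKS proxy.
    -N: do not execute a remote command (tunnel only)
    -D: dynamic application-level port forwarding (SOCKS)"""
    return shlex.split(f"ssh -N -D {local_port} {user}@{host}")

# Run it with subprocess.Popen(...), then point your client at
# socks5://127.0.0.1:1080 for the lifetime of the tunnel.
```
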

&lt;h3&gt;
  
  
  Tor Network – Maximum Anonymity
&lt;/h3&gt;

&lt;p&gt;The Tor network provides the highest level of anonymity through multi-layered routing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Onion routing sends traffic through multiple encrypted relay points&lt;/li&gt;
&lt;li&gt;Global relay network makes tracking extremely difficult&lt;/li&gt;
&lt;li&gt;Provides access to .onion sites not available on the regular internet&lt;/li&gt;
&lt;li&gt;Built-in protection against various tracking techniques&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a detailed explanation, check out our blog:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/how-to-use-tor-for-web-scraping/" rel="noopener noreferrer"&gt;How to Use Tor For Web Scraping&lt;/a&gt;: Learn about web scraping using Tor as a proxy and rotating proxy server by randomly changing the IP address with HTTP or SOCKS.&lt;/p&gt;

&lt;p&gt;Tor is especially valuable for users requiring maximum anonymity or accessing content in heavily censored regions.&lt;/p&gt;
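&lt;p&gt;For a taste of how this looks in practice, a running Tor daemon exposes a SOCKS5 proxy (on port 9050 by default, 9150 for the Tor Browser bundle) that Python requests can use directly. This is a minimal sketch assuming a local Tor installation:&lt;/p&gt;

```python
import requests

# Tor's default SOCKS port is 9050 (9150 when using the Tor Browser bundle)
TOR_PROXY = "socks5h://127.0.0.1:9050"

def tor_session():
    # socks5h makes DNS resolution happen inside Tor as well, preventing DNS leaks
    session = requests.Session()
    session.proxies = {"http": TOR_PROXY, "https": TOR_PROXY}
    return session

session = tor_session()
print(session.proxies["https"])
# With a local Tor daemon running, this would report the exit node's IP:
# print(session.get("https://check.torproject.org/api/ip", timeout=30).json())
```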
&lt;h3&gt;
  
  
  Mobile-Specific Solutions
&lt;/h3&gt;

&lt;p&gt;Mobile devices have unique options for bypassing restrictions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using mobile data instead of WiFi can bypass local network restrictions&lt;/li&gt;
&lt;li&gt;eSIM services provide remote cellular connections from different countries&lt;/li&gt;
&lt;li&gt;Operator-specific apps may include free access to certain content&lt;/li&gt;
&lt;li&gt;Tethering through alternative connections can bypass device-specific limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These approaches are particularly helpful for mobile users dealing with WiFi restrictions or needing country-specific access.&lt;/p&gt;

&lt;p&gt;While these methods work for general browsing, web scraping often requires more advanced solutions. Let’s explore how Scrapfly can help.&lt;/p&gt;
&lt;h2&gt;
  
  
  Unblock Scraping with Scrapfly
&lt;/h2&gt;

&lt;p&gt;Bypassing anti-bot systems, while possible, is often very difficult - let Scrapfly do it for you!&lt;/p&gt;

&lt;p&gt;ScrapFly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, and &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt; APIs for data collection at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/anti-scraping-protection" rel="noopener noreferrer"&gt;Anti-bot protection bypass&lt;/a&gt; - scrape web pages without blocking!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;Rotating residential proxies&lt;/a&gt; - prevent IP address and geographic blocks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-rendering" rel="noopener noreferrer"&gt;JavaScript rendering&lt;/a&gt; - scrape dynamic web pages through cloud browsers.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-scenario" rel="noopener noreferrer"&gt;Full browser automation&lt;/a&gt; - control browsers to scroll, input and click on objects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/getting-started#api_param_format" rel="noopener noreferrer"&gt;Format conversion&lt;/a&gt; - scrape as HTML, JSON, Text, or Markdown.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/sdk/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/sdk/typescript" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; SDKs, as well as &lt;a href="https://scrapfly.io/docs/sdk/scrapy" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/integration/getting-started" rel="noopener noreferrer"&gt;no-code tool integrations&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="How to Choose the Best Proxy Unblocker?" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's how we can scrape data without being blocked using ScrapFly. All we have to do is enable the &lt;code&gt;asp&lt;/code&gt; parameter, select the proxy pool (datacenter or residential), and proxy country:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="Your ScrapFly API key")

response: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
   url="the target website URL",
   # select a proxy pool
   proxy_pool="public_residential_pool",
   # select the proxy country
   country="us",
   # enable the ASP to bypass any website's blocking
   asp=True,
   # enable JS rendering, similar to headless browsers
   render_js=True,
))

# get the page HTML content
print(response.scrape_result['content'])
# use the built-in parsel selector
selector = response.selector

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;To wrap this guide up, let's take a look at some frequently asked questions regarding proxy unblockers.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the difference between a VPN and a proxy?
&lt;/h3&gt;

&lt;p&gt;A VPN encrypts your entire connection, making it more secure, while a proxy only changes your IP for specific requests. VPNs are better for privacy, while proxies are useful for bypassing simple blocks.&lt;/p&gt;

&lt;p&gt;For a detailed comparison, check out our blog:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/proxy-vs-vpn/" rel="noopener noreferrer"&gt;Proxy vs VPN: Which One Should You Use?&lt;/a&gt; - understand the key differences between proxies and VPNs, their use cases, and which one is best for your needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do some proxies stop working?
&lt;/h3&gt;

&lt;p&gt;Websites constantly update their blocking methods. If a proxy stops working, it may be blacklisted, detected, or overloaded. Switching to a rotating or residential proxy can help.&lt;/p&gt;
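&lt;p&gt;The rotation idea itself is simple to sketch: cycle through a pool of proxies so that a single blacklisted IP does not stop the whole job. The proxy URLs below are hypothetical placeholders; in practice they come from your provider:&lt;/p&gt;

```python
import itertools

# A hypothetical pool of proxy URLs - replace with real ones from your provider
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

proxy_cycle = itertools.cycle(PROXY_POOL)

def next_proxies():
    # Rotate to the next proxy for each request so one blocked
    # IP only affects a fraction of the traffic
    proxy = next(proxy_cycle)
    return {"http": proxy, "https": proxy}

for _ in range(4):
    print(next_proxies()["https"])  # cycles back to proxy1 on the 4th call
```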

&lt;h3&gt;
  
  
  Are free proxies safe to use?
&lt;/h3&gt;

&lt;p&gt;Free proxies come with risks like slow speeds, data logging, and security threats. For safer browsing, use a trusted paid proxy or VPN.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Choosing the right proxy unblocker is essential for bypassing website restrictions. Proxies work by changing your IP, while proxy savers help optimize bandwidth. Websites block users based on IP, geo-location, and user agents, but alternatives like VPNs, DNS changes, and browser extensions can also help. For web scraping, specialized tools like Scrapfly offer more advanced solutions.&lt;/p&gt;

</description>
      <category>scraperblocking</category>
      <category>proxies</category>
    </item>
    <item>
      <title>Guide To Google Image Search API and Alternatives</title>
      <dc:creator>Scrapfly</dc:creator>
      <pubDate>Wed, 12 Mar 2025 18:03:43 +0000</pubDate>
      <link>https://dev.to/scrapfly_dev/guide-to-google-image-search-api-and-alternatives-13db</link>
      <guid>https://dev.to/scrapfly_dev/guide-to-google-image-search-api-and-alternatives-13db</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fguide-to-google-image-search-api-and-alternatives_banner_light.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fguide-to-google-image-search-api-and-alternatives_banner_light.svg" alt="Guide To Google Image Search API and Alternatives" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Google Image Search API allows developers to integrate Google Image Search functionality into their applications. This API provides access to a vast collection of images indexed by Google, enabling users to search for images based on various criteria such as keywords, image type, and more.&lt;/p&gt;

&lt;p&gt;Whether you're building an image search feature, creating a visual recognition tool, or developing content analysis software, this guide will help you understand your options for programmatically accessing image search functionality.&lt;/p&gt;


&lt;h2&gt;
  
  
  Is There an Official Google Image Search API?
&lt;/h2&gt;

&lt;p&gt;Google previously provided a dedicated Image Search API as part of its AJAX Search API suite, but this service was deprecated in 2011. Since then, developers looking for official Google-supported methods to access image search results have had limited options.&lt;/p&gt;

&lt;p&gt;However, Google does offer a partial solution through its &lt;a href="https://developers.google.com/custom-search/v1/introduction" rel="noopener noreferrer"&gt;Custom Search JSON API&lt;/a&gt;, which can be configured to include image search results. This requires setting up a Custom Search Engine (CSE) and limiting it to image search, but it comes with significant limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quota restrictions&lt;/strong&gt; : The free tier is limited to 100 queries per day&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Commercial use fees&lt;/strong&gt; : Usage beyond the free tier requires payment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited results&lt;/strong&gt; : Each query returns a maximum of 10 images per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Restricted customization&lt;/strong&gt; : Fewer filtering options compared to the original Image Search API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers needing more robust image search capabilities, exploring alternative services is often necessary.&lt;/p&gt;
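&lt;p&gt;For reference, an image query against the Custom Search JSON API looks like the sketch below. The endpoint and parameter names (&lt;code&gt;key&lt;/code&gt;, &lt;code&gt;cx&lt;/code&gt;, &lt;code&gt;q&lt;/code&gt;, &lt;code&gt;searchType=image&lt;/code&gt;, &lt;code&gt;num&lt;/code&gt;) follow Google's documentation; the key and CSE ID are placeholders:&lt;/p&gt;

```python
import requests

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_image_search(query, api_key, cse_id, num=10):
    # The API caps num at 10 results per request - one of the limitations noted above
    return {
        "key": api_key,
        "cx": cse_id,
        "q": query,
        "searchType": "image",
        "num": min(num, 10),
    }

def image_search(query, api_key, cse_id, num=10):
    params = build_image_search(query, api_key, cse_id, num)
    resp = requests.get(API_ENDPOINT, params=params, timeout=10)
    resp.raise_for_status()
    # Each result item carries the direct image URL under "link"
    return [item["link"] for item in resp.json().get("items", [])]

params = build_image_search("mountain landscape", "YOUR_API_KEY", "YOUR_CSE_ID", num=25)
print(params["num"])  # 10 - requests beyond the cap are clamped
```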

&lt;h2&gt;
  
  
  Google Image Search Alternatives
&lt;/h2&gt;

&lt;p&gt;While Google does not provide an official Image Search API, there are several alternatives available:&lt;/p&gt;

&lt;h3&gt;
  
  
  Bing Image Search API
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.microsoft.com/en-us/bing/apis/bing-image-search-api" rel="noopener noreferrer"&gt;Microsoft's Bing Image Search API&lt;/a&gt; provides a comprehensive solution for integrating image search capabilities into applications. Part of the Azure Cognitive Services suite, this API offers advanced search features and returns detailed metadata about images.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

subscription_key = "YOUR_SUBSCRIPTION_KEY"
search_url = "https://api.bing.microsoft.com/v7.0/images/search"
search_term = "mountain landscape"

headers = {"Ocp-Apim-Subscription-Key": subscription_key}
params = {"q": search_term, "count": 10, "offset": 0, "mkt": "en-US", "safeSearch": "Moderate"}

response = requests.get(search_url, headers=headers, params=params)
response.raise_for_status()
search_results = response.json()

# Process the results
for image in search_results["value"]:
    print(f"URL: {image['contentUrl']}")
    print(f"Name: {image['name']}")
    print(f"Size: {image['width']}x{image['height']}")
    print("---")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we're sending a request to the Bing Image Search API with our search term and additional parameters. The API returns a JSON response containing image URLs, names, and dimensions, which we can then process according to our application's needs.&lt;/p&gt;

&lt;p&gt;The Bing API offers competitive pricing with a free tier that includes 1,000 transactions per month, making it accessible for small projects and testing before scaling.&lt;/p&gt;

&lt;h3&gt;
  
  
  DuckDuckGo Image Search
&lt;/h3&gt;

&lt;p&gt;DuckDuckGo doesn't offer an official API for image search, but it's worth noting that their image search results are primarily powered by Bing's search engine. For developers looking for a more privacy-focused approach, some have created unofficial wrappers around DuckDuckGo's search functionality.&lt;/p&gt;

&lt;p&gt;Since this method relies on web scraping, you should have prior knowledge of it. If you're interested in learning more about web scraping and best practices, check out our article.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/blog/everything-to-know-about-web-scraping-python/" rel="noopener noreferrer"&gt;Everything to Know to Start Web Scraping in Python Today&lt;/a&gt; - ultimate modern intro to web scraping using Python: how to scrape data using HTTP or headless browsers, parse it using AI, and scale and deploy.&lt;/p&gt;

&lt;p&gt;Now, let's move on to the example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_duckduckgo_images():
    # Start Playwright in a context manager to ensure clean-up
    with sync_playwright() as p:
        # Launch the Chromium browser in non-headless mode for visual debugging
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()

        # Navigate to DuckDuckGo image search for 'python'
        page.goto("https://duckduckgo.com/?q=python&amp;amp;iax=images&amp;amp;ia=images")

        # Wait until the images load by waiting for the image selector to appear
        page.wait_for_selector(".tile--img__img")

        # Get the fully rendered page content including dynamically loaded elements
        content = page.content()

        # Parse the page content using BeautifulSoup for easier HTML traversal
        soup = BeautifulSoup(content, "html.parser")
        images = soup.find_all("img")

        # Loop through the first three images only
        for image in images[:3]:
            # Safely extract the 'src' attribute with a default message if not found
            src = image.get("src", "No src found")
            # Safely extract the 'alt' attribute with a default message if not found
            alt = image.get("alt", "No alt text")
            print(src) # Print the image source URL
            print(alt) # Print the image alt text
            print("---------------------------------")

        # Close the browser after the scraping is complete
        browser.close()

scrape_duckduckgo_images()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Example Output
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
//external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse3.mm.bing.net%2Fth%3Fid%3DOIP.jrcuppJ7JfrVrpa9iKnnnAHaHa%26pid%3DApi&amp;amp;f=1&amp;amp;ipt=a11d9de5b863682e82564114f090c443350005fe945cfdfdba2ca1a05a43fa2b&amp;amp;ipo=images
Advanced Python Tutorials - Real Python
---------------------------------
//external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse2.mm.bing.net%2Fth%3Fid%3DOIP.Po6Ot_fcf7ya7xkrOL27hQHaES%26pid%3DApi&amp;amp;f=1&amp;amp;ipt=156829965359c98ab2bbc69fb73e2a4963284ff665c83887d6278d6cecc08841&amp;amp;ipo=images
¿Para qué sirve Python?
---------------------------------
//external-content.duckduckgo.com/iu/?u=https%3A%2F%2Ftse4.mm.bing.net%2Fth%3Fid%3DOIP._zLHmRNYHt-KYwYC8cC3RwHaHa%26pid%3DApi&amp;amp;f=1&amp;amp;ipt=04bdcfc11eee3ef4e96bf7d1b47230633b7c936363cf0c9f86c5dfa2e6fb4f32&amp;amp;ipo=images
¿Qué es Python y por qué debes aprender

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we're using Playwright to load DuckDuckGo's search page with parameters that trigger the image search interface, then parsing the rendered HTML with BeautifulSoup. However, this approach relies on browser automation and web scraping rather than a supported API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can Google Images be Scraped?
&lt;/h2&gt;

&lt;p&gt;Scraping Google Images is technically possible and can be a good approach when API options don't meet your specific requirements. However, several technical obstacles make it a complex and often unreliable approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Blocks Bots Aggressively&lt;/strong&gt; : Google actively detects and blocks automated scraping, requiring constant evasion tactics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless Browsers Required&lt;/strong&gt; : Running Selenium or Puppeteer in headless mode is usually necessary to mimic real users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Page Structure Changes Frequently&lt;/strong&gt; : Google updates its layout and elements, breaking scrapers that rely on fixed XPath or CSS selectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High Resource Consumption&lt;/strong&gt; : Running Selenium-based automation in a full browser environment significantly increases CPU and memory usage compared to API-based solutions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For many applications, using an official API from Bing or another provider is a more sustainable approach. However, for specific use cases or when other options aren't viable, let's explore some effective scraping techniques.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrapfly Web Scraping API
&lt;/h2&gt;

&lt;p&gt;ScrapFly provides &lt;a href="https://scrapfly.io/docs/scrape-api/getting-started" rel="noopener noreferrer"&gt;web scraping&lt;/a&gt;, &lt;a href="https://scrapfly.io/docs/screenshot-api/getting-started" rel="noopener noreferrer"&gt;screenshot&lt;/a&gt;, and &lt;a href="https://scrapfly.io/docs/extraction-api/getting-started" rel="noopener noreferrer"&gt;extraction&lt;/a&gt; APIs for data collection at scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/anti-scraping-protection" rel="noopener noreferrer"&gt;Anti-bot protection bypass&lt;/a&gt; - scrape web pages without blocking!&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/proxy" rel="noopener noreferrer"&gt;Rotating residential proxies&lt;/a&gt; - prevent IP address and geographic blocks.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-rendering" rel="noopener noreferrer"&gt;JavaScript rendering&lt;/a&gt; - scrape dynamic web pages through cloud browsers.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/javascript-scenario" rel="noopener noreferrer"&gt;Full browser automation&lt;/a&gt; - control browsers to scroll, input and click on objects.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/scrape-api/getting-started#api_param_format" rel="noopener noreferrer"&gt;Format conversion&lt;/a&gt; - scrape as HTML, JSON, Text, or Markdown.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://scrapfly.io/docs/sdk/python" rel="noopener noreferrer"&gt;Python&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/sdk/typescript" rel="noopener noreferrer"&gt;Typescript&lt;/a&gt; SDKs, as well as &lt;a href="https://scrapfly.io/docs/sdk/scrapy" rel="noopener noreferrer"&gt;Scrapy&lt;/a&gt; and &lt;a href="https://scrapfly.io/docs/integration/getting-started" rel="noopener noreferrer"&gt;no-code tool integrations&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fscrapfly.io%2Fblog%2Fcontent%2Fimages%2Fcommon_scrapfly-api.svg" alt="Guide To Google Image Search API and Alternatives" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here's an example of how to scrape Google Images with the Scrapfly web scraping API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from scrapfly import ScrapflyClient, ScrapeConfig, ScrapeApiResponse

scrapfly = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")

result: ScrapeApiResponse = scrapfly.scrape(ScrapeConfig(
    tags=["player", "project:default"],
    format="json",
    extraction_model="search_engine_results",
    country="us",
    lang=["en"],
    asp=True,
    render_js=True,
    url="https://www.google.com/search?q=python&amp;amp;tbm=isch"
))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Example Output
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
{
    "query": "python - Google Search",
    "results": [
        {
            "displayUrl": null,
            "publishDate": null,
            "richSnippet": null,
            "snippet": null,
            "title": "Wikipedia Python (programming language) - Wikipedia",
            "url": "https://en.wikipedia.org/wiki/Python_(programming_language)"
        },
        {
            "displayUrl": null,
            "publishDate": null,
            "richSnippet": null,
            "snippet": null,
            "title": "Juni Learning What is Python Coding? | Juni Learning",
            "url": "https://junilearning.com/blog/guide/what-is-python-101-for-students/"
        },
        {
            "displayUrl": null,
            "publishDate": null,
            "richSnippet": null,
            "snippet": null,
            "title": "Wikiversity Python - Wikiversity",
            "url": "https://en.wikiversity.org/wiki/Python"
        },
        ...
    ]
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://scrapfly.io/register" rel="noopener noreferrer"&gt;Try for FREE!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://scrapfly.io/docs" rel="noopener noreferrer"&gt;More on Scrapfly&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrape Google Image Search using Python
&lt;/h2&gt;

&lt;p&gt;For a direct approach to scraping Google Images using Python, the following code demonstrates how to extract image data using Requests and &lt;a href="https://scrapfly.io/blog/web-scraping-with-python-beautifulsoup/" rel="noopener noreferrer"&gt;BeautifulSoup&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup
import random
import time
from lxml import etree # For XPath support

def scrape_google_images_bs4(query, num_results=20):
    # Encode the search query
    encoded_query = query.replace(" ", "+")
    # Set up headers to mimic a browser
    user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
    ]
    headers = {
        "User-Agent": random.choice(user_agents),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/"
    }

    # Make the request
    url = f"https://www.google.com/search?q={encoded_query}&amp;amp;tbm=isch"
    response = requests.get(url, headers=headers)

    if response.status_code != 200:
        print(f"Failed to retrieve the page: {response.status_code}")
        return []

    # Parse the HTML using both BeautifulSoup and lxml for XPath
    soup = BeautifulSoup(response.text, 'html.parser')
    dom = etree.HTML(str(soup)) # Convert to lxml object for XPath

    # Process the response
    image_data = []

    # Use XPath to select divs instead of class-based selection
    # This pattern selects all similar divs in the structure
    base_xpath = "/html/body/div[3]/div/div[14]/div/div[2]/div[2]/div/div/div/div/div[1]/div/div/div"

    # Get all div indices to match the pattern
    div_indices = range(1, num_results + 1) # Start with 1 through num_results

    for i in div_indices:
        try:
            # Create XPath for the current div
            current_xpath = f"{base_xpath}[{i}]"
            div_element = dom.xpath(current_xpath)

            if not div_element:
                continue

            item = {}

            # Get the data-lpage attribute (page URL) from the div
            page_url_xpath = f"{current_xpath}/@data-lpage"
            page_url = dom.xpath(page_url_xpath)
            if page_url:
                item["page_url"] = page_url[0]

            # Get the alt text of the image
            alt_xpath = f"{current_xpath}//img/@alt"
            alt_text = dom.xpath(alt_xpath)
            if alt_text:
                item["alt_text"] = alt_text[0]

            if item:
                image_data.append(item)

            # Stop if we've reached the requested number of results
            if len(image_data) &amp;gt;= num_results:
                break

        except Exception as e:
            print(f"Error processing element {i}: {e}")

    return image_data

# Example usage
image_data = scrape_google_images_bs4("python", num_results=5)
print(image_data)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Example Output
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
[{'page_url': 'https://en.wikipedia.org/wiki/Python_(programming_language)', 'alt_text': '\u202aPython (programming language) - Wikipedia\u202c\u200f'},
{'page_url': 'https://beecrowd.com/blog-posts/best-python-courses/', 'alt_text': '\u202aPython: find out the best courses - beecrowd\u202c\u200f'},
{'page_url': 'https://junilearning.com/blog/guide/what-is-python-101-for-students/', 'alt_text': '\u202aWhat is Python Coding? | Juni Learning\u202c\u200f'},
{'page_url': 'https://medium.com/towards-data-science/what-is-a-python-environment-for-beginners-7f06911cf01a', 'alt_text': "\u202aWhat Is a 'Python Environment'? (For Beginners) | by Mark Jamison | TDS Archive | Medium\u202c\u200f"},
{'page_url': 'https://quantumzeitgeist.com/why-is-the-python-programming-language-so-popular/', 'alt_text': '\u202aWhy Is The Python Programming Language So Popular?\u202c\u200f'}]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we created a Google Images scraper that uses &lt;code&gt;XPath&lt;/code&gt; targeting instead of class-based selectors for better reliability. The script mimics browser behavior with rotating user agents, fetches search results for a given query, and extracts both the source page URL (&lt;code&gt;data-lpage&lt;/code&gt; attribute) and &lt;code&gt;image alt text&lt;/code&gt; from the search results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrape Google Reverse Image Search using Python
&lt;/h2&gt;

&lt;p&gt;Reverse image search allows you to find similar images and their sources using an image as the query instead of text. Implementing this requires a slightly different approach, often involving browser automation with tools like &lt;a href="https://scrapfly.io/blog/web-scraping-with-selenium-and-python/" rel="noopener noreferrer"&gt;Selenium&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

def google_reverse_image_search(image_url, max_results=5):
    # Set up Chrome options
    chrome_options = Options()
    # chrome_options.add_argument("--headless") # Run in headless mode
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    chrome_options.add_argument("--disable-gpu")
    chrome_options.add_argument("--window-size=1920,1080")
    chrome_options.add_argument("--lang=en-US,en")
    chrome_options.add_experimental_option('prefs', {'intl.accept_languages': 'en-US,en'})
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option('useAutomationExtension', False)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")
    chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36")

    # Initialize the driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)

    try:
        # Navigate to Google Images
        driver.get("https://www.google.com/imghp?hl=en&amp;amp;gl=us")

        # Find and click the camera icon for reverse search
        camera_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//div[@aria-label='Search by image']"))
        )
        camera_button.click()

        # Wait for the URL input field and enter the image URL
        url_input = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//input[@placeholder='Paste image link']"))
        )
        url_input.send_keys(image_url)

        # Click search button
        search_button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//div[text()='Search']"))
        )
        search_button.click()

        # Wait for results page to load
        WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, "//div[contains(text(), 'All')]"))
        )

        # Extract similar image results
        similar_images = []

        # Extract image data from the results grid
        try:
            for i in range(max_results):
                try:
                    # Get image element by index via an absolute XPath
                    # (absolute XPaths are brittle and break whenever Google changes its markup)
                    img_xpath = f"/html/body/div[3]/div/div[12]/div/div/div[2]/div[2]/div/div/div[1]/div/div/div/div/div/div/div[{i+1}]/div/div/div[1]/div/div/div/div/img"
                    img = WebDriverWait(driver, 5).until(
                        EC.presence_of_element_located((By.XPATH, img_xpath))
                    )

                    # Get image URL by clicking and extracting from larger preview
                    img.click()
                    time.sleep(1) # Wait for larger preview

                    # Find the large preview image (Google-internal selector; may change)
                    img_container = WebDriverWait(driver, 5).until(
                        EC.presence_of_element_located((By.XPATH, "//*[@id='Sva75c']/div[2]/div[2]/div/div[2]/c-wiz/div/div[2]/div/a[1]"))
                    )

                    img_url = driver.find_element(By.XPATH, "//*[@id='Sva75c']/div[2]/div[2]/div/div[2]/c-wiz/div/div[2]/div/a[1]/img").get_attribute("src")

                    # Get source website
                    source_url = img_container.get_attribute("href")

                    similar_images.append({
                        "url": img_url,
                        "source_url": source_url,
                    })
                except Exception as e:
                    print(f"Error extracting image {i+1}: {e}")
        except Exception as e:
            print(f"Could not extract similar image results: {e}")

        return similar_images

    finally:
        # Clean up
        driver.quit()

# Example usage
sample_image_url = "https://avatars.githubusercontent.com/u/54183743?s=280&amp;amp;v=4"
similar_images = google_reverse_image_search(sample_image_url)

print("Similar Images:")
for idx, img in enumerate(similar_images, 1):
    print(f"Image {idx}:")
    print(f" URL: {img['url']}")
    print(f" Source: {img['source_url']}")
    print()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we use Selenium to automate a reverse image search. The script simulates a user visiting Google Images, clicking the camera icon, entering an image URL, and initiating the search, then parses the results page to extract similar images and the websites that host them.&lt;/p&gt;

&lt;p&gt;This method requires more resources than plain HTTP requests, but it provides access to functionality that isn't easily available through direct scraping. For production use, you would need more robust error handling, stable selectors, and potentially proxy rotation to avoid detection.&lt;/p&gt;
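&lt;p&gt;To sketch the proxy-rotation idea, the helper below assembles the same Chrome flag list used above and appends a randomly chosen proxy per session. The &lt;code&gt;PROXY_POOL&lt;/code&gt; addresses and the &lt;code&gt;build_chrome_args&lt;/code&gt; helper are illustrative placeholders, not part of Selenium's API:&lt;/p&gt;

```python
import random

# Hypothetical proxy pool -- replace with your provider's real endpoints
PROXY_POOL = [
    "http://203.0.113.10:8000",
    "http://203.0.113.11:8000",
    "http://203.0.113.12:8000",
]

def build_chrome_args(proxy=None):
    """Return the Chrome flag list used above, plus an optional proxy flag."""
    args = [
        "--no-sandbox",
        "--disable-dev-shm-usage",
        "--disable-gpu",
        "--window-size=1920,1080",
        "--disable-blink-features=AutomationControlled",
    ]
    if proxy:
        # Chromium routes all traffic through this proxy
        args.append(f"--proxy-server={proxy}")
    return args

# Pick a fresh proxy for each browser session
session_args = build_chrome_args(random.choice(PROXY_POOL))
print(session_args[-1])  # e.g. --proxy-server=http://203.0.113.10:8000
```

&lt;p&gt;Each returned flag can then be passed to &lt;code&gt;chrome_options.add_argument()&lt;/code&gt; before the driver is created.&lt;/p&gt;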

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Is there an official Google Image Search API?
&lt;/h3&gt;

&lt;p&gt;No, Google does not offer a dedicated Image Search API. The original Google Image Search API was deprecated and later shut down; the closest official option today is the Custom Search JSON API with image search enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are the alternatives to Google Image Search API?
&lt;/h3&gt;

&lt;p&gt;Alternatives to Google Image Search API include Bing Image Search API, DuckDuckGo Image Search, and image search APIs from other search engines like Yahoo and Yandex.&lt;/p&gt;
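&lt;p&gt;As a rough sketch of how one of these alternatives is called, the snippet below builds a request for the Bing Image Search API using only the standard library. The endpoint and &lt;code&gt;Ocp-Apim-Subscription-Key&lt;/code&gt; header follow Bing's v7 API, and the API key is a placeholder you would replace with your own:&lt;/p&gt;

```python
from urllib.parse import urlencode
import urllib.request

BING_ENDPOINT = "https://api.bing.microsoft.com/v7.0/images/search"

def build_bing_image_request(query, api_key, count=10):
    """Build a urllib Request for the Bing Image Search API v7."""
    url = f"{BING_ENDPOINT}?{urlencode({'q': query, 'count': count})}"
    # The subscription key is sent as a header, not a query parameter
    return urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})

request = build_bing_image_request("puppies", "YOUR_API_KEY", count=5)
print(request.full_url)
```

&lt;p&gt;Sending the request with &lt;code&gt;urllib.request.urlopen(request)&lt;/code&gt; returns a JSON body whose &lt;code&gt;value&lt;/code&gt; array holds the image results.&lt;/p&gt;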

&lt;h3&gt;
  
  
  Can I scrape Google Images?
&lt;/h3&gt;

&lt;p&gt;Scraping Google Images is possible, but it comes with technical challenges, such as JavaScript-rendered pages and anti-bot measures, as well as legal considerations. It's important to follow ethical scraping practices and to consider APIs provided by other search engines as alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we explored the Google Image Search API, its alternatives, and how to scrape Google Image Search results using Python. While Google does not offer an official Image Search API, developers can use the Google Custom Search JSON API or alternatives like Bing Image Search API and DuckDuckGo Image Search. Additionally, we discussed the challenges of scraping Google Images and provided example code snippets for scraping image search results.&lt;/p&gt;
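&lt;p&gt;For completeness, here is a minimal sketch of the request-building step for the Custom Search JSON API mentioned above. The API key and search engine ID are placeholders for your own credentials:&lt;/p&gt;

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_image_search_url(query, api_key, cse_id, num=10):
    """Build a Custom Search JSON API URL restricted to image results."""
    params = {
        "key": api_key,         # API key from Google Cloud (placeholder)
        "cx": cse_id,           # Programmable Search Engine ID (placeholder)
        "q": query,
        "searchType": "image",  # return image results instead of web pages
        "num": num,             # up to 10 results per request
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"

url = build_image_search_url("golden retriever", "YOUR_API_KEY", "YOUR_CSE_ID")
print(url)
```

&lt;p&gt;Fetching this URL returns JSON whose &lt;code&gt;items&lt;/code&gt; array contains the matching image links and their source pages.&lt;/p&gt;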

</description>
      <category>api</category>
    </item>
  </channel>
</rss>
