<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: datacollection</title>
    <description>The latest articles on DEV Community by datacollection (@datacollectionscraper).</description>
    <link>https://dev.to/datacollectionscraper</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2589871%2Fa57573ca-482f-4cce-a773-2ce85cd96b39.png</url>
      <title>DEV Community: datacollection</title>
      <link>https://dev.to/datacollectionscraper</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/datacollectionscraper"/>
    <language>en</language>
    <item>
      <title>Build Smart Business News Monitoring with Dify + Deep SerpApi</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Tue, 17 Jun 2025 10:36:19 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/build-smart-business-news-monitoring-with-dify-deep-serpapi-2307</link>
      <guid>https://dev.to/datacollectionscraper/build-smart-business-news-monitoring-with-dify-deep-serpapi-2307</guid>
      <description>&lt;p&gt;In today’s highly competitive landscape, staying informed of brand reputation, industry developments, and competitor intelligence in real time is crucial for effective decision-making. However, manually monitoring news and information is time-consuming, labor-intensive, and prone to missing critical insights.&lt;/p&gt;

&lt;p&gt;This solution integrates Dify, a leading no-code AI automation platform, with the &lt;a href="https://www.scrapeless.com/en/product/deep-serp-api" rel="noopener noreferrer"&gt;Scrapeless Deep SerpApi&lt;/a&gt;, an enterprise-grade Google Search data interface, to build a smart and scalable business news monitoring system that enables enterprises to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collect and filter real-time news automatically&lt;/li&gt;
&lt;li&gt;Leverage AI for intelligent analysis and actionable insights&lt;/li&gt;
&lt;li&gt;Push alerts and reports across multiple channels automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g1mt0i0xhhr6nmbqi6z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g1mt0i0xhhr6nmbqi6z.png" alt="Build Smart Business News Monitoring with Dify + Deep SerpAPI" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Solution Overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Dify Intelligent Workflow Platform&lt;/td&gt;
&lt;td&gt;No-code workflow design and execution with drag-and-drop support for AI and API integration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scrapeless Deep SerpApi&lt;/td&gt;
&lt;td&gt;High-speed, stable, anti-blocking Google Search API supporting multi-region and multilingual queries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI Models (e.g., GPT-4 / Claude)&lt;/td&gt;
&lt;td&gt;Performs automatic semantic analysis and generates intelligent news summaries and business insights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notification Plugins (e.g., Discord Webhook)&lt;/td&gt;
&lt;td&gt;Real-time push of monitoring reports to ensure rapid information delivery&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  2. Enterprise-Grade Tooling Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Dify Intelligent Workflow Platform
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;A no-code AI automation platform designed for flexible, enterprise-grade workflows&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visual interface for drag-and-drop workflow building—no coding required
&lt;/li&gt;
&lt;li&gt;Seamless integration with mainstream AI models (GPT-4, Claude 3, Gemini, etc.)
&lt;/li&gt;
&lt;li&gt;Plugin ecosystem for connecting with APIs and external data sources
&lt;/li&gt;
&lt;li&gt;Real-time monitoring with detailed logs and error tracing
&lt;/li&gt;
&lt;li&gt;Role-based access control and team collaboration support
&lt;/li&gt;
&lt;li&gt;Suitable for private deployment in secure enterprise environments
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Scrapeless Deep SerpApi
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;A real-time, high-fidelity Google SERP API engineered for AI workflows and business intelligence&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://www.scrapeless.com/en/product/deep-serp-api?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dify" rel="noopener noreferrer"&gt;Scrapeless Deep SerpApi&lt;/a&gt; is purpose-built for enterprise-grade use cases like brand monitoring, market intelligence, content generation, and AI-powered decision-making. It extracts real-time, structured data directly from Google search results (HTML parsing), ensuring &lt;strong&gt;accuracy, freshness, and reliability&lt;/strong&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Key Advantages&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Instant access to real-time Google SERP data&lt;/strong&gt; (under 3s response)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive result coverage&lt;/strong&gt;: organic results, Google Local, Google Images, Google News, and more
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero caching&lt;/strong&gt;: Direct HTML parsing ensures up-to-date, verifiable results
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-scraping technology&lt;/strong&gt;: 99.9% success rate, no manual proxy configuration needed
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supports 195+ countries and multiple languages&lt;/strong&gt; for global monitoring
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured output&lt;/strong&gt; in common data formats, making it easy for AI models and automated workflows to parse and analyze&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent, usage-based billing&lt;/strong&gt; with no hidden limits or field restrictions&lt;/li&gt;
&lt;/ul&gt;
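&lt;p&gt;To make the "structured output" advantage concrete, here is a minimal, illustrative sketch of consuming a SERP-style JSON response. The response shape shown (an &lt;code&gt;organic_results&lt;/code&gt; array with &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;link&lt;/code&gt; fields) matches the fields this tutorial's Template node uses later, but the exact schema is an assumption — consult the Deep SerpApi documentation for the authoritative field list.&lt;/p&gt;

```python
# Hypothetical Deep SerpApi-style response body; the exact schema is an
# assumption here -- consult the official docs for authoritative field names.
sample_response = {
    "organic_results": [
        {"title": "Acme Corp announces Q2 partnership", "link": "https://example.com/a"},
        {"title": "Acme Corp expands into new markets", "link": "https://example.com/b"},
    ]
}

# Structured output means no HTML parsing on the client side: just walk fields.
for result in sample_response["organic_results"]:
    print(f"{result['title']} -- {result['link']}")
```

&lt;p&gt;Because the API returns parsed JSON rather than raw HTML, downstream AI nodes can consume the fields directly without any scraping logic of their own.&lt;/p&gt;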

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffp6npq973qn94swwm5ke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffp6npq973qn94swwm5ke.png" alt="scrapeless deep serpapi" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;📌 &lt;em&gt;Ideal for:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building enterprise-grade media monitoring and alert systems
&lt;/li&gt;
&lt;li&gt;Tracking competitor activity and market trends globally
&lt;/li&gt;
&lt;li&gt;Creating search-tuned datasets for retrieval-augmented generation (RAG)
&lt;/li&gt;
&lt;li&gt;Powering SEO and content automation at scale
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Environment Setup &amp;amp; Account Registration
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Register a Scrapeless Account and Obtain API Token
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Visit the &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dify" rel="noopener noreferrer"&gt;Scrapeless Dashboard&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Register a business account&lt;/li&gt;
&lt;li&gt;After logging in, navigate to the &lt;strong&gt;API Management&lt;/strong&gt; page to obtain your API token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwlhdygh6wewrn3tdifp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwlhdygh6wewrn3tdifp.png" alt="Scrapeless API Token" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⚠️ &lt;strong&gt;Important&lt;/strong&gt;: Keep your API token secure and never share it publicly.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.2 Register a Dify account and install the Deep SerpApi plugin
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Sign up for Dify if you haven't already, then install the &lt;a href="https://marketplace.dify.ai/plugins/scrapelesshq/deep_serpapi" rel="noopener noreferrer"&gt;Deep SerpApi plugin&lt;/a&gt; from the Dify Marketplace&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a new application and select "Workflow"&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In the workflow studio, click the "+" button to add a new tool&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigate to the "Tools" tab in the panel&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Look for "Deep SerpApi" by scrapelesshq (as shown in the tools list)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click on "Deep SerpApi" to add it to your workflow&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuh0spmxbqe3zyw2melp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuh0spmxbqe3zyw2melp.png" alt="Deep SerpAPI in Dify Tools" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Detailed Configuration Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Add the Deep SerpApi Node
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the "+" button in the workflow editor&lt;/li&gt;
&lt;li&gt;Select the &lt;strong&gt;Tools&lt;/strong&gt; tab&lt;/li&gt;
&lt;li&gt;Choose &lt;strong&gt;Deep SerpApi (Scrapeless)&lt;/strong&gt; and add it to your workflow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a3x2a5dmk5wk18jfn6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5a3x2a5dmk5wk18jfn6.png" alt="Add Deep SerpAPI Node" width="800" height="314"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the configuration panel, paste the API Token copied earlier
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitzzzz0no1prbl67qisu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitzzzz0no1prbl67qisu.png" alt="ApiToken2 Configuration" width="800" height="266"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 2: Configure Search Parameters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the &lt;strong&gt;Query String&lt;/strong&gt; field of the Deep SerpApi node, enter your search query, for example:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"Your Company Name" news&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Supports advanced search syntax such as:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;"Your Company Name" OR "Industry Keyword"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;"Company Name" AND (announcement OR partnership)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;In this example, we use:
&lt;/li&gt;

&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ company }} latest business news June 2025 site:reuters.com OR site:bloomberg.com OR site:cnn.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
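&lt;p&gt;If you monitor many companies, you may prefer to assemble these queries programmatically rather than by hand. The helper below is a plain string-building sketch: the &lt;code&gt;OR&lt;/code&gt; and &lt;code&gt;site:&lt;/code&gt; operators are standard Google search syntax, and the company and domain values are placeholders.&lt;/p&gt;

```python
def build_news_query(company, sites=None, extra_terms="latest business news"):
    """Assemble a Google search query string using OR / site: operators."""
    query = f'"{company}" {extra_terms}'
    if sites:
        # site:domain restricts results to a domain; OR combines trusted sources
        site_filter = " OR ".join(f"site:{s}" for s in sites)
        query = f"{query} {site_filter}"
    return query

print(build_news_query("Acme Corp", ["reuters.com", "bloomberg.com", "cnn.com"]))
# prints: "Acme Corp" latest business news site:reuters.com OR site:bloomberg.com OR site:cnn.com
```

&lt;p&gt;The resulting string can be pasted into the Query String field, or fed in through a workflow variable such as &lt;code&gt;{{ company }}&lt;/code&gt;.&lt;/p&gt;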



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewmyb7e8r5wzfbpxp8r2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewmyb7e8r5wzfbpxp8r2.png" alt="Configure Search Parameters" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Add Template Node to Format Search Results
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click the &lt;strong&gt;“+”&lt;/strong&gt; button after the &lt;strong&gt;Deep SerpApi&lt;/strong&gt; node.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;“Template”&lt;/strong&gt; from the available blocks.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfsst8f241z8mzeswd99.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfsst8f241z8mzeswd99.png" alt="Template Block" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;Template&lt;/strong&gt; field, enter the following formatting template:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Search Results:
{% for item in arg1[0].organic_results %}
- Title: {{ item.title }}
- Link: {{ item.link }}
{% endfor %}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This template will display the search results in a structured manner to facilitate subsequent AI analysis.&lt;/li&gt;
&lt;/ul&gt;
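&lt;p&gt;To check what the Jinja-style template above produces, here is a plain-Python equivalent of the same loop. The input shape (&lt;code&gt;arg1&lt;/code&gt; as a list whose first element carries &lt;code&gt;organic_results&lt;/code&gt;) mirrors how the Template node receives the Deep SerpApi output in this workflow.&lt;/p&gt;

```python
def format_results(arg1):
    """Plain-Python equivalent of the Template node's Jinja loop."""
    lines = ["Search Results:"]
    for item in arg1[0]["organic_results"]:
        lines.append(f"- Title: {item['title']}")
        lines.append(f"- Link: {item['link']}")
    return "\n".join(lines)

serp_output = [{"organic_results": [
    {"title": "Acme Corp partners with Example Inc", "link": "https://example.com/news"},
]}]
print(format_results(serp_output))
```

&lt;p&gt;Running it on a sample payload produces the same "Title / Link" bullet list the LLM node consumes in the next step.&lt;/p&gt;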

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpyi07oy1n1j3tlt6zka.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpyi07oy1n1j3tlt6zka.png" alt="Template Configuration" width="800" height="291"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Configure the AI Analysis Node
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Click the &lt;strong&gt;“+”&lt;/strong&gt; button after the &lt;strong&gt;Template&lt;/strong&gt; node.&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;“LLM”&lt;/strong&gt; from the available blocks.&lt;/li&gt;
&lt;li&gt;Choose your preferred AI model (&lt;strong&gt;GPT-4 is recommended&lt;/strong&gt;). Note that you may need to open &lt;strong&gt;“Model Provider Settings”&lt;/strong&gt; to install or activate your model first.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bqg1zinp6bypbhy9vit.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7bqg1zinp6bypbhy9vit.png" alt="AI Analysis Node" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzieak9uy9r60jk8j5mqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzieak9uy9r60jk8j5mqd.png" alt="Deep SerpAPI Configuration" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You'll be taken to a model selection page where you can choose any supported LLM. For this example, we'll use Claude.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghnzfjxmrunagxaqe9ws.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fghnzfjxmrunagxaqe9ws.png" alt="Deep SerpAPI Configuration" width="800" height="616"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;System prompt&lt;/strong&gt;, reference the search results:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are a business intelligence analyst.

Based on the following search results, generate a concise B2B intelligence report for the company "{{ company }}". Your report should include:

1. Overall sentiment (Positive/Neutral/Negative)
2. Major news developments or updates
3. Business risks or opportunities
4. Strategic implications for the company
5. Any urgent or noteworthy items

If the search results are too generic or lack company-specific content, please point that out and suggest how to improve the query.

Use bullet points where appropriate. Keep the tone professional and actionable.


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;In the &lt;strong&gt;User prompt&lt;/strong&gt;, reference the formatted template results:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
Please analyze these search results for the company and generate insights based on the news titles, contents, and sources found.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;In the prompt text box, type &lt;strong&gt;/&lt;/strong&gt; to open the variable selector, which lists the available variables such as &lt;code&gt;output&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;, and &lt;code&gt;sys.*&lt;/code&gt; that you can insert into your prompts, as shown in the screenshots below.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wzt98ktabanbna536c5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1wzt98ktabanbna536c5.png" alt="Deep SerpAPI Configuration" width="800" height="654"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1xsykacjipu4bfnvk8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe1xsykacjipu4bfnvk8l.png" alt="Deep SerpAPI Configuration" width="776" height="1280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Run and Debug the Workflow
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;Run&lt;/strong&gt; button in the top-right corner of the interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkcdp9vuis7lc6pypac6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwkcdp9vuis7lc6pypac6.png" alt="Run and Debug the Workflow" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wait for the workflow to execute and check the output results
&lt;/li&gt;
&lt;li&gt;Based on the analysis results, adjust the search keywords and AI prompts to optimize performance&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Step 6: Integrate Enterprise Notification Channels (e.g., Discord Webhook) &lt;em&gt;(Optional)&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;To receive notifications directly in your Discord server when the workflow completes, you can add a webhook integration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a New Block:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Click the &lt;strong&gt;“+”&lt;/strong&gt; button after your &lt;strong&gt;LLM analysis&lt;/strong&gt; step
&lt;/li&gt;
&lt;li&gt;Select &lt;strong&gt;“Tools”&lt;/strong&gt; from the block menu
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8cqvg1o4j5u41m0mllz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz8cqvg1o4j5u41m0mllz.png" alt="Adding Tools Block" width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find Discord Webhook in the Marketplace:
&lt;ul&gt;
&lt;li&gt;In the Tools section, click on &lt;strong&gt;“Marketplace”&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Search for &lt;strong&gt;“Discord”&lt;/strong&gt; or &lt;strong&gt;“webhook”&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Install the &lt;strong&gt;Discord webhook tool&lt;/strong&gt; if it’s not already available: &lt;a href="https://marketplace.dify.ai/plugins/langgenius/discord" rel="noopener noreferrer"&gt;Discord Plugin on Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonb2qv89bt2b0l7czan2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fonb2qv89bt2b0l7czan2.png" alt="discord" width="800" height="529"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Configure Your Webhook:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Select the &lt;strong&gt;Discord Webhook&lt;/strong&gt; tool
&lt;/li&gt;
&lt;li&gt;Enter your &lt;strong&gt;Discord Webhook URL&lt;/strong&gt; (you can obtain this from your Discord server settings)
&lt;/li&gt;
&lt;li&gt;Customize the message format to include the &lt;strong&gt;analysis results&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Use variables from previous steps to include &lt;strong&gt;dynamic content&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71dhopkr72h5h0d9a5x6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71dhopkr72h5h0d9a5x6.png" alt="Discord Webhook Configuration" width="800" height="728"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Message Customization:
&lt;ul&gt;
&lt;li&gt;Include the &lt;strong&gt;search query&lt;/strong&gt; in the notification&lt;/li&gt;
&lt;li&gt;Add a &lt;strong&gt;summary of key findings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Format the message for &lt;strong&gt;easy reading in Discord&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🔍 **Daily Business Intelligence Report**

/ context

---
📊 *Generated by Dify + Scrapeless Deep SerpAPI*

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
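&lt;p&gt;If you later want to trigger the same notification outside Dify, Discord webhooks accept a simple JSON POST with a &lt;code&gt;content&lt;/code&gt; field, capped at 2,000 characters per message. The sketch below builds that payload; the report text and webhook URL are placeholders you supply from your own Discord server settings.&lt;/p&gt;

```python
import json
import urllib.request

def build_discord_payload(report_text):
    """Discord webhook messages take a JSON body with a 'content' field,
    limited to 2,000 characters per message."""
    content = (
        "🔍 **Daily Business Intelligence Report**\n\n"
        f"{report_text}\n\n"
        "---\n📊 *Generated by Dify + Scrapeless Deep SerpAPI*"
    )
    return {"content": content[:2000]}

def send_to_discord(webhook_url, report_text):
    # POST the payload; webhook_url comes from Server Settings -- Integrations
    data = json.dumps(build_discord_payload(report_text)).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=data, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```

&lt;p&gt;Truncating to 2,000 characters avoids a rejected request when a long LLM report exceeds Discord's per-message limit; for longer reports, split the text across several webhook calls.&lt;/p&gt;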



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; You can use any webhook service of your choice (Slack, Microsoft Teams, etc.) by following the same process and searching for the appropriate tool in the marketplace.&lt;/p&gt;




&lt;h3&gt;
  
  
  Step 7: Add an end node to complete the workflow configuration
&lt;/h3&gt;

&lt;p&gt;To properly complete your workflow, add an End block:&lt;br&gt;
&lt;strong&gt;1. Add Final Block:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click the "+" button after your webhook step (or LLM step if you skipped the webhook)&lt;/li&gt;
&lt;li&gt;Select "End" from the block menu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Configure End Block:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The End block marks the completion of your workflow&lt;/li&gt;
&lt;li&gt;You can optionally configure output variables that will be returned when the workflow completes&lt;/li&gt;
&lt;li&gt;This is useful if you want to use this workflow as part of a larger automation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxutgvq0j4qyx5bhm0y91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxutgvq0j4qyx5bhm0y91.png" alt="End Block Configuration" width="800" height="368"&gt;&lt;/a&gt;&lt;br&gt;
Your complete workflow should now look like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcddsgp7x3c7qhy5bfkxc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcddsgp7x3c7qhy5bfkxc.png" alt="AllConfiguration" width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 8: Output the results
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptoz1jw81lb91v7syrkw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptoz1jw81lb91v7syrkw.png" alt="Output the results" width="800" height="812"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Ready to Power Your Intelligence Workflows?&lt;/p&gt;

&lt;p&gt;Sign up for &lt;strong&gt;Scrapeless Google SERP API&lt;/strong&gt; today and instantly receive &lt;strong&gt;2,500 free API calls&lt;/strong&gt; — no credit card required.&lt;br&gt;&lt;br&gt;
Experience real-time, structured search data built for scale, precision, and AI-native workflows.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dify" rel="noopener noreferrer"&gt;Get Started for Free&lt;/a&gt; and supercharge your next project!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Workflow Demo
&lt;/h2&gt;

&lt;p&gt;To help you better understand how this smart business news monitoring workflow runs from start to finish, we’ve created a short GIF demo. It shows each step in action — from fetching real-time search results with Deep SerpApi, formatting them with a Template block, analyzing the data using an LLM, and finally sending the insights via Discord webhook.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.scrapeless.com%2Fprod%2Fposts%2Fbuild-smart-business-news-monitoring-with-dify%2Fb4dbedfdd265b58bd1cc5b8cc19cf1f7.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.scrapeless.com%2Fprod%2Fposts%2Fbuild-smart-business-news-monitoring-with-dify%2Fb4dbedfdd265b58bd1cc5b8cc19cf1f7.gif" alt="Workflow Demo" width="8" height="5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Success Stories &amp;amp; Performance Impact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Leading Financial Institution
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;“From Reactive to Proactive” — Real-Time News Monitoring with 95% Accuracy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A major financial institution faced challenges in monitoring fast-moving news cycles related to banking regulations, reputational risks, and macroeconomic events. Prior to deploying the system, their compliance and risk teams relied heavily on manual media tracking, which was time-consuming and often delayed critical responses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After integrating the Dify + Scrapeless monitoring system:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;News detection latency was reduced by &lt;strong&gt;80%&lt;/strong&gt;, enabling near real-time awareness of regulatory or reputational risks.&lt;/li&gt;
&lt;li&gt;The accuracy of sentiment-based alerting models improved to &lt;strong&gt;95%&lt;/strong&gt;, thanks to high-quality structured SERP data feeding AI classifiers.&lt;/li&gt;
&lt;li&gt; Cross-departmental collaboration improved, as alerts were pushed directly into internal Slack channels and BI dashboards.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: Risk mitigation windows were shortened from &lt;strong&gt;hours to minutes&lt;/strong&gt;, reducing potential damage from negative press or misinformation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Global Manufacturing Enterprise
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;“Global Eyes, Local Insights” — Multi-language Market Intelligence at Scale&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This multinational manufacturing firm needed to monitor global news across diverse markets to inform its supply chain strategy, trade risk exposure, and competitor activity—especially across Europe, Southeast Asia, and Latin America.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With the integrated solution in place:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated SERP-based monitoring covered &lt;strong&gt;20+ languages&lt;/strong&gt; and &lt;strong&gt;100+ country-specific domains&lt;/strong&gt;, reducing blind spots in non-English media.&lt;/li&gt;
&lt;li&gt;Alerts about policy shifts, environmental incidents, or labor disputes were surfaced up to &lt;strong&gt;72 hours earlier&lt;/strong&gt; than previous manual workflows.&lt;/li&gt;
&lt;li&gt;Internal dashboards consolidated insights across time zones and teams, allowing senior decision-makers to act faster on global disruptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result&lt;/strong&gt;: Strategic responsiveness improved significantly, particularly in &lt;strong&gt;procurement&lt;/strong&gt; and &lt;strong&gt;logistics planning&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;🔧 Want to Build More Intelligent Workflows?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you're looking to take your data monitoring system to the next level, don’t miss these in-depth guides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;📈 &lt;a href="https://www.scrapeless.com/en/blog/build-intelligent-trend-monitoring-systems-with-make?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dify" rel="noopener noreferrer"&gt;Build an Intelligent Trend Monitoring System with Make&lt;/a&gt;&lt;br&gt;
Learn how to combine Scrapeless with Make to create automated trend alerts and real-time dashboards.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🌍 &lt;a href="https://www.scrapeless.com/en/blog/build-google-trends-monitor-with-pipedream?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=dify" rel="noopener noreferrer"&gt;Build a Google Trends Monitor with Pipedream&lt;/a&gt;&lt;br&gt;
Discover how to set up a scalable trend tracking system using the Google Trends API and Pipedream workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Explore these tutorials and start building smarter, faster, and more automated intelligence pipelines today!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  6. FAQs &amp;amp; Best Practices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Recommended Solution&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No search results&lt;/td&gt;
&lt;td&gt;Check your API token validity and permissions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inaccurate search results&lt;/td&gt;
&lt;td&gt;Refine keywords and exclude irrelevant search terms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI analysis is not accurate&lt;/td&gt;
&lt;td&gt;Improve the prompt to clarify the main focus of the analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API quota exceeded or errors&lt;/td&gt;
&lt;td&gt;Monitor usage frequency and plan API calls accordingly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  7. Summary
&lt;/h2&gt;

&lt;p&gt;This solution leverages the deep integration between the Dify intelligent workflow platform and Scrapeless Deep SerpApi to enable automated monitoring and intelligent analysis of enterprise-level business news. With this system, companies can stay informed of brand developments in real time, gain insights into industry trends, respond quickly to market changes, and empower decision-makers to strategically plan for the future.&lt;/p&gt;

</description>
      <category>brightdatachallenge</category>
      <category>top7</category>
      <category>programming</category>
      <category>ai</category>
    </item>
    <item>
      <title>Building an AI-Powered Web Data Pipeline with n8n, Scrapeless, and Claude</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Mon, 19 May 2025 10:59:09 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/building-an-ai-powered-web-data-pipeline-with-n8n-scrapeless-and-claude-4eg6</link>
      <guid>https://dev.to/datacollectionscraper/building-an-ai-powered-web-data-pipeline-with-n8n-scrapeless-and-claude-4eg6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In today's data-driven landscape, organizations need efficient ways to extract, process, and analyze web content. Traditional web scraping faces numerous challenges: anti-bot protections, complex JavaScript rendering, and the need for constant maintenance. Furthermore, making sense of unstructured web data requires sophisticated processing.&lt;/p&gt;

&lt;p&gt;This guide demonstrates how to build a complete web data pipeline using n8n workflow automation, Scrapeless web scraping, Claude AI for intelligent extraction, and Qdrant vector database for semantic storage. Whether you're building a knowledge base, conducting market research, or developing an AI assistant, this workflow provides a powerful foundation.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You'll Build
&lt;/h2&gt;

&lt;p&gt;Our n8n workflow combines several cutting-edge technologies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scrapeless Web Unlocker: Advanced web scraping with JavaScript rendering&lt;/li&gt;
&lt;li&gt;Claude 3.7 Sonnet: AI-powered data extraction and structuring&lt;/li&gt;
&lt;li&gt;Ollama Embeddings: Local vector embedding generation&lt;/li&gt;
&lt;li&gt;Qdrant Vector Database: Semantic storage and retrieval&lt;/li&gt;
&lt;li&gt;Notification System: Real-time monitoring via webhooks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This end-to-end pipeline transforms messy web data into structured, vectorized information ready for semantic search and AI applications.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8gg5ezli4bqhb0eed87.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp8gg5ezli4bqhb0eed87.png" alt="Building an AI-Powered Web Data Pipeline with n8n, Scrapeless, and Claude" width="800" height="288"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation and Setup
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Installing n8n
&lt;/h3&gt;

&lt;p&gt;n8n requires Node.js v18, v20, or v22. If you encounter version compatibility issues:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Check your Node.js version
node -v

# If you have a newer unsupported version (e.g., v23+), install nvm
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.5/install.sh | bash
# Or for Windows, use NVM for Windows installer

# Install a compatible Node.js version
nvm install 20

# Use the installed version
nvm use 20

# Install n8n globally
npm install n8n -g

# Run n8n
n8n

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your n8n instance should now be available at &lt;a href="http://localhost:5678" rel="noopener noreferrer"&gt;http://localhost:5678&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up Claude API
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Visit Anthropic Console and create an account&lt;/li&gt;
&lt;li&gt;Navigate to API Keys section&lt;/li&gt;
&lt;li&gt;Click "Create Key" and set appropriate permissions&lt;/li&gt;
&lt;li&gt;Copy your API key for use in the n8n workflow (In AI Data Checker, Claude Data extractor and Claude AI Agent)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5rcu30edf1kp5ooty3v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5rcu30edf1kp5ooty3v.png" alt="Setting up Claude API" width="800" height="631"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up Scrapeless
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Visit &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=n8n" rel="noopener noreferrer"&gt;Scrapeless&lt;/a&gt; and create an account&lt;/li&gt;
&lt;li&gt;Navigate to the Universal Scraping API section in your dashboard &lt;a href="https://app.scrapeless.com/exemple/overview" rel="noopener noreferrer"&gt;https://app.scrapeless.com/exemple/overview&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8chblhb64rblkb9h12c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo8chblhb64rblkb9h12c.png" alt="Setting up Scrapeless" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Copy your token for use in the n8n workflow&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijhssk70435fulm5h40q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijhssk70435fulm5h40q.png" alt="Copy your token for use in the n8n workflow" width="800" height="769"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can customize your Scrapeless web scraping request using this curl command and import it directly into the HTTP Request node in n8n:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST "https://api.scrapeless.com/api/v1/unlocker/request" \
  -H "Content-Type: application/json" \
  -H "x-api-token: scrapeless_api_key" \
  -d '{
    "actor": "unlocker.webunlocker",
    "proxy": {
      "country": "ANY"
    },
    "input": {
      "url": "https://www.scrapeless.com",
      "method": "GET",
      "redirect": true,
      "js_render": true,
      "js_instructions": [{"wait":100}],
      "block": {
        "resources": ["image","font","script"],
        "urls": ["https://example.com"]
      }
    }
  }'


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqpf3obzysieow1dvf4s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqpf3obzysieow1dvf4s.png" alt="You can customize your Scrapeless web scraping request" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Installing Qdrant with Docker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
# Pull Qdrant image
docker pull qdrant/qdrant

# Run Qdrant container with data persistence
docker run -d \
  --name qdrant-server \
  -p 6333:6333 \
  -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify Qdrant is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl http://localhost:6333/healthz
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing Ollama
&lt;/h3&gt;

&lt;p&gt;macOS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -fsSL https://ollama.com/install.sh | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Windows: Download and install from Ollama's website.&lt;/p&gt;

&lt;p&gt;Start Ollama server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install the required embedding model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama pull all-minilm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify model installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ollama list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Up the n8n Workflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Workflow Overview
&lt;/h3&gt;

&lt;p&gt;Our workflow consists of these key components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manual/Scheduled Trigger: Starts the workflow&lt;/li&gt;
&lt;li&gt;Collection Check: Verifies if Qdrant collection exists&lt;/li&gt;
&lt;li&gt;URL Configuration: Sets the target URL and parameters&lt;/li&gt;
&lt;li&gt;Scrapeless Web Request: Extracts HTML content&lt;/li&gt;
&lt;li&gt;Claude Data Extraction: Processes and structures the data&lt;/li&gt;
&lt;li&gt;Ollama Embeddings: Generates vector embeddings&lt;/li&gt;
&lt;li&gt;Qdrant Storage: Saves vectors and metadata&lt;/li&gt;
&lt;li&gt;Notification: Sends status updates via webhook&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Step 1: Configure Workflow Trigger and Collection Check
&lt;/h3&gt;

&lt;p&gt;Start by adding a Manual Trigger node, then add an HTTP Request node to check whether your Qdrant collection exists. You can customize the collection name in this initial step; the workflow will automatically create the collection if it doesn't exist.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important Note:&lt;/strong&gt; If you want to use a different collection name than the default "hacker-news", make sure to change it consistently in ALL nodes that reference Qdrant.&lt;/p&gt;
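&lt;p&gt;For reference, the check this step performs maps onto two calls against Qdrant's REST API. The sketch below is illustrative: it assumes the local Qdrant instance from the setup section and the 384-dimensional vectors that all-minilm produces.&lt;/p&gt;

```javascript
// Illustrative collection check/create, assuming the local Qdrant instance
// from the setup section and 384-dim all-minilm vectors.
const QDRANT_URL = "http://localhost:6333";
const COLLECTION = "hacker-news"; // must match every Qdrant node in the workflow

function createCollectionBody(vectorSize) {
  // Qdrant expects the vector size and distance metric at creation time
  return JSON.stringify({ vectors: { size: vectorSize, distance: "Cosine" } });
}

async function ensureCollection() {
  const check = await fetch(`${QDRANT_URL}/collections/${COLLECTION}`);
  if (check.status === 404) {
    // Collection is missing: create it before any points are upserted
    await fetch(`${QDRANT_URL}/collections/${COLLECTION}`, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: createCollectionBody(384),
    });
  }
}
```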

&lt;h3&gt;
  
  
  Step 2: Configure Scrapeless Web Request
&lt;/h3&gt;

&lt;p&gt;Add an HTTP Request node for Scrapeless web scraping. Configure the node using the curl command provided earlier as a reference, replacing the &lt;code&gt;scrapeless_api_key&lt;/code&gt; placeholder with your actual Scrapeless API token.&lt;/p&gt;

&lt;p&gt;You can configure more advanced scraping parameters at Scrapeless Web Unlocker.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Claude Data Extraction
&lt;/h3&gt;

&lt;p&gt;Add a node to process the HTML content using Claude. You'll need to provide your Claude API key for authentication. The Claude extractor analyzes the HTML content and returns structured data in JSON format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Format Claude Output
&lt;/h3&gt;

&lt;p&gt;This node takes Claude's response and prepares it for vectorization by extracting the relevant information and formatting it appropriately.&lt;/p&gt;
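&lt;p&gt;The exact shape of this step depends on your extraction prompt, but a minimal sketch looks like the following. The &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;summary&lt;/code&gt;, and &lt;code&gt;url&lt;/code&gt; fields are placeholder names, not part of any fixed Claude schema.&lt;/p&gt;

```javascript
// Minimal formatting sketch. Claude's Messages API returns content as an
// array of blocks; the extractor's JSON payload sits in the first text block.
// The "title"/"summary"/"url" fields are placeholders for your own schema.
function formatClaudeOutput(claudeResponse) {
  const data = JSON.parse(claudeResponse.content[0].text);
  return {
    // the string handed to the embedding node
    text: `${data.title}\n${data.summary}`,
    // metadata stored alongside the vector in Qdrant
    metadata: { source: data.url, extractedAt: new Date().toISOString() },
  };
}
```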

&lt;h3&gt;
  
  
  Step 5: Ollama Embeddings Generation
&lt;/h3&gt;

&lt;p&gt;This node sends the structured text to Ollama for embedding generation. Make sure your Ollama server is running and the all-minilm model is installed.&lt;/p&gt;
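&lt;p&gt;Under the hood this is a single call to Ollama's embeddings endpoint. A minimal sketch, assuming the default local server and the all-minilm model installed earlier:&lt;/p&gt;

```javascript
// Sketch of the embedding call, assuming Ollama's default port and the
// all-minilm model from the setup section.
const OLLAMA_URL = "http://127.0.0.1:11434/api/embeddings"; // direct IP avoids IPv6 localhost issues

function buildEmbeddingRequest(text) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "all-minilm", prompt: text }),
  };
}

async function embed(text) {
  const res = await fetch(OLLAMA_URL, buildEmbeddingRequest(text));
  if (!res.ok) throw new Error(`Ollama returned ${res.status}`);
  const data = await res.json();
  return data.embedding; // all-minilm produces a 384-dimensional vector
}
```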

&lt;h3&gt;
  
  
  Step 6: Qdrant Vector Storage
&lt;/h3&gt;

&lt;p&gt;This node takes the generated embeddings and stores them in your Qdrant collection along with relevant metadata.&lt;/p&gt;
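&lt;p&gt;A minimal sketch of the upsert against Qdrant's REST API, assuming the default "hacker-news" collection; the payload carries whatever metadata the formatting step produced:&lt;/p&gt;

```javascript
// Sketch of the point upsert the storage node performs.
function buildUpsertBody(id, vector, payload) {
  return JSON.stringify({ points: [{ id: id, vector: vector, payload: payload }] });
}

async function storePoint(id, vector, payload) {
  // wait=true blocks until the point is persisted, simplifying error handling
  const res = await fetch("http://localhost:6333/collections/hacker-news/points?wait=true", {
    method: "PUT",
    headers: { "Content-Type": "application/json" },
    body: buildUpsertBody(id, vector, payload),
  });
  if (!res.ok) throw new Error(`Qdrant returned ${res.status}`);
}
```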

&lt;h3&gt;
  
  
  Step 7: Notification System
&lt;/h3&gt;

&lt;p&gt;The final node sends a notification with the status of the workflow execution via your configured webhook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;h3&gt;
  
  
  n8n Node.js Version Issues
&lt;/h3&gt;

&lt;p&gt;If you see an error like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your Node.js version X is currently not supported by n8n.
Please use Node.js v18.17.0 (recommended), v20, or v22 instead!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix by installing nvm and using a compatible Node.js version as described in the setup section.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scrapeless API Connection Issues
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Verify your API token is correct&lt;/li&gt;
&lt;li&gt;Check if you're hitting API rate limits&lt;/li&gt;
&lt;li&gt;Ensure proper URL formatting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Ollama Embedding Errors
&lt;/h3&gt;

&lt;p&gt;Common error: &lt;code&gt;connect ECONNREFUSED ::1:11434&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Fix:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure Ollama is running: &lt;code&gt;ollama serve&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Verify the model is installed: &lt;code&gt;ollama pull all-minilm&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Use the direct IP (127.0.0.1) instead of localhost&lt;/li&gt;
&lt;li&gt;Check if another process is using port 11434&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Advanced Usage Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Batch Processing Multiple URLs
&lt;/h3&gt;

&lt;p&gt;To process multiple URLs in one workflow execution:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use a Split In Batches node to process URLs in parallel&lt;/li&gt;
&lt;li&gt;Configure proper error handling for each batch&lt;/li&gt;
&lt;li&gt;Use the Merge node to combine results&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Scheduled Data Updates
&lt;/h3&gt;

&lt;p&gt;Keep your vector database current with scheduled updates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Replace the manual trigger with a Schedule node&lt;/li&gt;
&lt;li&gt;Configure update frequency (daily, weekly, etc.)&lt;/li&gt;
&lt;li&gt;Use the If node to process only new or changed content&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Custom Extraction Templates
&lt;/h3&gt;

&lt;p&gt;Adapt Claude's extraction for different content types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create specific prompts for news articles, product pages, documentation, etc.&lt;/li&gt;
&lt;li&gt;Use the Switch node to select the appropriate prompt&lt;/li&gt;
&lt;li&gt;Store extraction templates as environment variables&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This n8n workflow creates a powerful data pipeline combining the strengths of Scrapeless web scraping, Claude AI extraction, vector embeddings, and Qdrant storage. By automating these complex processes, you can focus on using the extracted data rather than the technical challenges of obtaining it.&lt;/p&gt;

&lt;p&gt;The modular nature of n8n allows you to extend this workflow with additional processing steps, integration with other systems, or custom logic to meet your specific needs. Whether you're building an AI knowledge base, conducting competitive analysis, or monitoring web content, this workflow provides a solid foundation.&lt;/p&gt;

</description>
      <category>n8n</category>
      <category>tooling</category>
      <category>scraping</category>
      <category>scrapingbrowser</category>
    </item>
    <item>
      <title>Best Practices for Automation and Web Scraping Using Scrapeless Scraping Browser</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Thu, 08 May 2025 13:47:37 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/best-practices-for-automation-and-web-scraping-using-scrapeless-scraping-browser-373l</link>
      <guid>https://dev.to/datacollectionscraper/best-practices-for-automation-and-web-scraping-using-scrapeless-scraping-browser-373l</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: A New Paradigm of Browser Automation and Data Collection in the AI Era
&lt;/h2&gt;

&lt;p&gt;With the rapid rise of generative AI, AI agents, and data-intensive applications, browsers are evolving from traditional "user interaction tools" into "data execution engines" for intelligent systems. In this new paradigm, many tasks no longer rely on single API endpoints but instead leverage automated browser control to handle complex page interactions, content scraping, task orchestration, and context retrieval.&lt;/p&gt;

&lt;p&gt;From price comparisons on e-commerce sites and map screenshots to search engine result parsing and social media content extraction, the browser is becoming a crucial interface for AI to access real-world data. However, the complexity of modern web structures, robust anti-bot measures, and high concurrency demands pose significant technical and operational challenges for traditional solutions like local Puppeteer/Playwright instances or proxy rotation strategies.&lt;/p&gt;

&lt;p&gt;Enter the Scrapeless Scraping Browser—an advanced, cloud-based browser platform purpose-built for large-scale automation. It overcomes key technical barriers such as anti-scraping mechanisms, fingerprint detection, and proxy maintenance. Furthermore, it offers cloud-native concurrency scheduling, human-like behavior simulation, and structured data extraction, positioning itself as a vital infrastructure component in the next generation of automation systems and data pipelines.&lt;/p&gt;

&lt;p&gt;This article explores the core capabilities of Scrapeless and its practical applications in browser automation and web scraping. By analyzing current industry trends and future directions, we aim to provide developers, product builders, and data teams with a comprehensive and systematic guide.&lt;/p&gt;

&lt;h2&gt;
  
  
  I. Background: Why Do We Need Scrapeless Scraping Browser?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 The Evolution of Browser Automation
&lt;/h3&gt;

&lt;p&gt;In the AI-driven automation era, browsers are no longer just tools for human interaction—they have become essential execution endpoints for acquiring both structured and unstructured data. In many real-world scenarios, APIs are either unavailable or limited, making it necessary to simulate human behavior via browsers for data collection, task execution, and information extraction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Common use cases include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Price comparison on e-commerce sites&lt;/strong&gt;: Price and stock data are often loaded asynchronously in the browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing search engine result pages&lt;/strong&gt;: Content must be fully loaded by scrolling and clicking on page elements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multilingual websites, legacy systems, and intranet platforms&lt;/strong&gt;: Data access is impossible via API.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditional scraping solutions (e.g., locally run Puppeteer/Playwright or proxy rotation setups) often suffer from poor stability under high concurrency, frequent anti-bot blocking, and high maintenance costs. Scrapeless Scraping Browser, with its cloud-native deployment and real browser behavior simulation, provides developers with a high-availability, reliable browser automation platform—serving as critical infrastructure for AI automation systems and data workflows.&lt;/p&gt;




&lt;h3&gt;
  
  
  1.2 The Challenge of Anti-Bot Mechanisms
&lt;/h3&gt;

&lt;p&gt;At the same time, as anti-bot technologies evolve, traditional crawler tools are increasingly flagged as bot traffic by target websites, resulting in IP bans and access restrictions. Common anti-scraping mechanisms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser fingerprinting&lt;/strong&gt;: Detects abnormal access patterns via User-Agent, canvas rendering, TLS handshake, and more.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CAPTCHA verification&lt;/strong&gt;: Requires users to prove they are human.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IP blacklisting&lt;/strong&gt;: Blocks IPs that access too frequently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral analysis algorithms&lt;/strong&gt;: Detect unusual mouse movement, scroll speeds, and interaction logic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapeless Scraping Browser effectively overcomes these challenges through precise browser fingerprint customization, built-in CAPTCHA solving, and flexible proxy support—becoming core infrastructure for the next generation of automation tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  II. Core Capabilities of Scrapeless
&lt;/h2&gt;

&lt;p&gt;The Scrapeless Scraping Browser delivers powerful core capabilities, offering users stable, efficient, and scalable data interaction features. Below are its main functional modules and technical details:&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Real Browser Environment
&lt;/h3&gt;

&lt;p&gt;Scrapeless is built on the &lt;strong&gt;Chromium engine&lt;/strong&gt;, providing a complete browser environment capable of simulating real user behavior. Key features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS fingerprint spoofing&lt;/strong&gt;: Fakes TLS handshake parameters to bypass traditional anti-bot mechanisms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic fingerprint obfuscation&lt;/strong&gt;: Adjusts User-Agent, screen resolution, timezone, etc., to make each session appear highly human-like.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Localization support&lt;/strong&gt;: Customize language, region, and timezone settings to make interactions with target websites more natural.&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  Deep Customization of Browser Fingerprints
&lt;/h5&gt;

&lt;p&gt;Scrapeless offers comprehensive customization of browser fingerprints, allowing users to create more "authentic" browsing environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User-Agent control&lt;/strong&gt;: Define the User-Agent string in browser HTTP requests, including browser engine, version, and OS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screen resolution mapping&lt;/strong&gt;: Set the return values of &lt;code&gt;screen.width&lt;/code&gt; and &lt;code&gt;screen.height&lt;/code&gt; to simulate common display sizes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform property locking&lt;/strong&gt;: Specify the return value of &lt;code&gt;navigator.platform&lt;/code&gt; in JavaScript to simulate the operating system type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Localized environment emulation&lt;/strong&gt;: Fully supports custom localization settings, affecting content rendering, time format, and language preference detection on websites.&lt;/li&gt;
&lt;/ul&gt;
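&lt;p&gt;To see which of these surfaces a session actually exposes, you can probe them from inside the page. The helper below is a generic sketch (not a Scrapeless API); passing it the page's &lt;code&gt;window&lt;/code&gt; via Puppeteer's &lt;code&gt;page.evaluate&lt;/code&gt; returns the values a target site would observe.&lt;/p&gt;

```javascript
// Generic probe for the fingerprint surfaces listed above. Run it inside a
// page by passing the page's window object.
function collectFingerprint(env) {
  return {
    userAgent: env.navigator.userAgent,   // User-Agent control
    platform: env.navigator.platform,     // platform property locking
    width: env.screen.width,              // screen resolution mapping
    height: env.screen.height,
    language: env.navigator.language,     // localized environment emulation
  };
}
// In a live session: await page.evaluate(`(${collectFingerprint})(window)`)
```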




&lt;h3&gt;
  
  
  2.2 Cloud-Based Deployment and Scalability
&lt;/h3&gt;

&lt;p&gt;Scrapeless is fully deployed in the cloud and offers the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No local resources required&lt;/strong&gt;: Reduces hardware costs and improves deployment flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Globally distributed nodes&lt;/strong&gt;: Supports large-scale concurrent tasks and overcomes geographic restrictions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High concurrency support&lt;/strong&gt;: From 50 to unlimited concurrent sessions—ideal for everything from small tasks to complex automation workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Performance Comparison
&lt;/h4&gt;

&lt;p&gt;Compared with traditional tools such as &lt;strong&gt;Selenium&lt;/strong&gt; and &lt;strong&gt;Playwright&lt;/strong&gt;, Scrapeless excels in high-concurrency scenarios. Below is a simple comparison table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Scrapeless&lt;/th&gt;
&lt;th&gt;Selenium&lt;/th&gt;
&lt;th&gt;Playwright&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency Support&lt;/td&gt;
&lt;td&gt;Unlimited (Enterprise-grade customization)&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fingerprint Customization&lt;/td&gt;
&lt;td&gt;Advanced&lt;/td&gt;
&lt;td&gt;Basic&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAPTCHA Solving&lt;/td&gt;
&lt;td&gt;Built-in (98% success rate)  &lt;br&gt; Supports reCAPTCHA, Cloudflare Turnstile/Challenge, AWS WAF, DataDome, etc.&lt;/td&gt;
&lt;td&gt;External dependency&lt;/td&gt;
&lt;td&gt;External dependency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the same time, Scrapeless outperforms competing products in high-concurrency scenarios. The following table summarizes its capabilities across several dimensions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature / Platform&lt;/th&gt;
&lt;th&gt;Scrapeless&lt;/th&gt;
&lt;th&gt;Browserless&lt;/th&gt;
&lt;th&gt;Browserbase&lt;/th&gt;
&lt;th&gt;HyperBrowser&lt;/th&gt;
&lt;th&gt;Bright Data&lt;/th&gt;
&lt;th&gt;ZenRows&lt;/th&gt;
&lt;th&gt;Steel.dev&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Deployment Method&lt;/td&gt;
&lt;td&gt;Cloud-based&lt;/td&gt;
&lt;td&gt;Cloud-based Puppeteer containers&lt;/td&gt;
&lt;td&gt;Multi-browser cloud cluster&lt;/td&gt;
&lt;td&gt;Cloud-based headless browser platform&lt;/td&gt;
&lt;td&gt;Cloud deployment&lt;/td&gt;
&lt;td&gt;Browser API interface&lt;/td&gt;
&lt;td&gt;Browser cloud cluster + Browser API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concurrency Support&lt;/td&gt;
&lt;td&gt;50 to Unlimited&lt;/td&gt;
&lt;td&gt;3–50&lt;/td&gt;
&lt;td&gt;3–50&lt;/td&gt;
&lt;td&gt;1–250&lt;/td&gt;
&lt;td&gt;Up to unlimited (depending on plan)&lt;/td&gt;
&lt;td&gt;Up to 100 (Business plan)&lt;/td&gt;
&lt;td&gt;No official data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anti-Detection Capability&lt;/td&gt;
&lt;td&gt;Free CAPTCHA recognition &amp;amp; bypass, supports reCAPTCHA, Cloudflare Turnstile/Challenge, AWS WAF, DataDome, etc.&lt;/td&gt;
&lt;td&gt;CAPTCHA bypass&lt;/td&gt;
&lt;td&gt;CAPTCHA bypass + Incognito Mode&lt;/td&gt;
&lt;td&gt;CAPTCHA bypass + Incognito + Session Mgmt&lt;/td&gt;
&lt;td&gt;CAPTCHA bypass + Fingerprint spoofing + Proxy&lt;/td&gt;
&lt;td&gt;Custom browser fingerprints&lt;/td&gt;
&lt;td&gt;Proxy + Fingerprint recognition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser Runtime Cost&lt;/td&gt;
&lt;td&gt;$0.063 – $0.090/hour (includes free CAPTCHA bypass)&lt;/td&gt;
&lt;td&gt;$0.084 – $0.15/hour (unit-based)&lt;/td&gt;
&lt;td&gt;$0.10 – $0.198/hour (includes 2–5GB free proxy)&lt;/td&gt;
&lt;td&gt;$30–$100/month&lt;/td&gt;
&lt;td&gt;~$0.10/hour&lt;/td&gt;
&lt;td&gt;~$0.09/hour&lt;/td&gt;
&lt;td&gt;$0.05 – $0.08/hour&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Proxy Cost&lt;/td&gt;
&lt;td&gt;$1.26 – $1.80/GB&lt;/td&gt;
&lt;td&gt;$4.3/GB&lt;/td&gt;
&lt;td&gt;$10/GB (beyond free quota)&lt;/td&gt;
&lt;td&gt;No official data&lt;/td&gt;
&lt;td&gt;$9.5/GB (standard); $12.5/GB (premium domains)&lt;/td&gt;
&lt;td&gt;$2.8 – $5.42/GB&lt;/td&gt;
&lt;td&gt;$3 – $8.25/GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  2.3 Automatic CAPTCHA Solving and Event Monitoring
&lt;/h3&gt;

&lt;p&gt;Scrapeless provides advanced CAPTCHA solutions and extends a series of custom functions through Chrome DevTools Protocol (CDP) to enhance the reliability of browser automation.&lt;/p&gt;

&lt;h4&gt;
  
  
  CAPTCHA Solving Capabilities
&lt;/h4&gt;

&lt;p&gt;Scrapeless can automatically handle mainstream CAPTCHA types, including reCAPTCHA, Cloudflare Turnstile/Challenge, AWS WAF, DataDome, and more.&lt;/p&gt;

&lt;h4&gt;
  
  
  Event Monitoring Mechanism
&lt;/h4&gt;

&lt;p&gt;Scrapeless provides three core events for monitoring the CAPTCHA solving process:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Event Name&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Captcha.detected&lt;/td&gt;
&lt;td&gt;CAPTCHA detected&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captcha.solveFinished&lt;/td&gt;
&lt;td&gt;CAPTCHA solved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Captcha.solveFailed&lt;/td&gt;
&lt;td&gt;CAPTCHA solving failed&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h5&gt;
  
  
  Event Response Data Structure
&lt;/h5&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;type&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;CAPTCHA type (e.g., recaptcha, turnstile)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;success&lt;/td&gt;
&lt;td&gt;boolean&lt;/td&gt;
&lt;td&gt;Result of solving&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;message&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Status message (e.g., "NOT_DETECTED", "SOLVE_FINISHED")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;token?&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Returned token upon success (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
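&lt;p&gt;To make this response structure concrete, here is a minimal, SDK-independent sketch in plain Python of how a handler might interpret an event payload with these fields. The function name and return strings are illustrative only, not part of the Scrapeless API:&lt;/p&gt;

```python
def handle_captcha_event(event: dict) -> str:
    """Interpret a CAPTCHA event payload with the fields described above."""
    captcha_type = event.get("type", "unknown")  # e.g. "recaptcha", "turnstile"
    success = event.get("success", False)        # result of solving
    message = event.get("message", "")           # e.g. "NOT_DETECTED", "SOLVE_FINISHED"
    token = event.get("token")                   # optional; present only on success

    if success and token:
        return f"{captcha_type} solved ({message}); token length={len(token)}"
    if success:
        return f"{captcha_type} solved ({message}); no token returned"
    return f"{captcha_type} not solved ({message})"

print(handle_captcha_event(
    {"type": "recaptcha", "success": True, "message": "SOLVE_FINISHED", "token": "abc123"}
))
```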

&lt;h3&gt;
  
  
  2.4 Powerful proxy support
&lt;/h3&gt;

&lt;p&gt;Scrapeless provides a flexible and controllable proxy integration system that supports multiple proxy modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built-in residential proxies: geo-targeted proxies in 195 countries/regions worldwide, available out of the box.&lt;/li&gt;
&lt;li&gt;Custom proxies (premium subscription): connect your own proxy service; this traffic is not counted toward Scrapeless's proxy billing.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2.5 Session replay
&lt;/h3&gt;

&lt;p&gt;Session replay is one of the most powerful features of Scrapeless Scraping Browser. It lets you replay a session step by step to inspect the operations performed and the network requests made.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Code Examples: Integrating and Using Scrapeless
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 Using Scrapeless Scraping Browser
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Puppeteer Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');
const connectionURL = 'wss://browser.scrapeless.com/browser?token=your-scrapeless-api-key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY';

(async () =&amp;gt; {
    const browser = await puppeteer.connect({browserWSEndpoint: connectionURL});
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    console.log(await page.title());
    await browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Playwright Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const {chromium} = require('playwright-core');
const connectionURL = 'wss://browser.scrapeless.com/browser?token=your-scrapeless-api-key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY';

(async () =&amp;gt; {
    const browser = await chromium.connectOverCDP(connectionURL);
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    console.log(await page.title());
    await browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.2 Scrapeless Scraping Browser Fingerprint Parameters Example Code
&lt;/h3&gt;

&lt;p&gt;The following examples show how to use Scrapeless's browser fingerprint customization with Puppeteer and Playwright:&lt;br&gt;
&lt;strong&gt;Puppeteer Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');

// custom browser fingerprint
const fingerprint = {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36',
    platform: 'Windows',
    screen: {
        width: 1280, height: 1024
    },
    localization: {
        languages: ['zh-HK', 'en-US', 'en'], timezone: 'Asia/Hong_Kong',
    }
}

const query = new URLSearchParams({
  token: 'APIKey', // required
  session_ttl: 180,
  proxy_country: 'ANY',
  fingerprint: encodeURIComponent(JSON.stringify(fingerprint)),
});

const connectionURL = `wss://browser.scrapeless.com/browser?${query.toString()}`;

(async () =&amp;gt; {
    const browser = await puppeteer.connect({browserWSEndpoint: connectionURL});
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    const info = await page.evaluate(() =&amp;gt; {
        return {
            screen: {
                width: screen.width,
                height: screen.height,
            },
            userAgent: navigator.userAgent,
            timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
            languages: navigator.languages
        };
    });
    console.log(info);
    await browser.close();
})();


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Playwright Example&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { chromium } = require('playwright-core');

// custom browser fingerprint
const fingerprint = {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36',
    platform: 'Windows',
    screen: {
        width: 1280, height: 1024
    },
    localization: {
        languages: ['zh-HK', 'en-US', 'en'], timezone: 'Asia/Hong_Kong',
    }
}

const query = new URLSearchParams({
  token: 'APIKey', // required
  session_ttl: 180,
  proxy_country: 'ANY',
  fingerprint: encodeURIComponent(JSON.stringify(fingerprint)),
});

const connectionURL = `wss://browser.scrapeless.com/browser?${query.toString()}`;

(async () =&amp;gt; {
    const browser = await chromium.connectOverCDP(connectionURL);
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    const info = await page.evaluate(() =&amp;gt; {
        return {
            screen: {
                width: screen.width,
                height: screen.height,
            },
            userAgent: navigator.userAgent,
            timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
            languages: navigator.languages
        };
    });
    console.log(info);
    await browser.close();
})();


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3.3 CAPTCHA event monitoring example
&lt;/h3&gt;

&lt;p&gt;The following example shows how to use Scrapeless to monitor CAPTCHA events and track their solving status in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Listen for CAPTCHA solving events
// (assumes `page` comes from a Puppeteer browser connected to Scrapeless)
const client = await page.createCDPSession();

client.on('Captcha.detected', (result) =&amp;gt; {
  console.log('Captcha detected:', result);
});

await new Promise((resolve, reject) =&amp;gt; {
  client.on('Captcha.solveFinished', (result) =&amp;gt; {
    if (result.success) resolve();
  });
  client.on('Captcha.solveFailed', () =&amp;gt;
    reject(new Error('Captcha solve failed'))
  );
  setTimeout(() =&amp;gt;
      reject(new Error('Captcha solve timeout')),
    5 * 60 * 1000
  );
});

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Having covered the core features and advantages of Scrapeless Scraping Browser, we can now look at how to apply it in common, concrete scenarios to automate and scrape websites more efficiently and securely.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Best Practices for Automation and Web Scraping Using Scrapeless Scraping Browser
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Legal Disclaimer and Precautions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This tutorial covers popular web scraping techniques for educational purposes. Interacting with public servers requires diligence and respect; here is a summary of what not to do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do not scrape at rates that could damage the website.&lt;/li&gt;
&lt;li&gt;Do not scrape data that's not available publicly.&lt;/li&gt;
&lt;li&gt;Do not store PII of EU citizens who are protected by GDPR.&lt;/li&gt;
&lt;li&gt;Do not repurpose entire public datasets, which can be illegal in some countries.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Understanding Cloudflare Protection
&lt;/h3&gt;




&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;What is Cloudflare?&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cloudflare is a cloud platform that integrates a content delivery network (CDN), DNS acceleration, and security protection. Websites use Cloudflare to mitigate Distributed Denial of Service (DDoS) attacks (attempts to knock a site offline by flooding it with requests) and to keep their sites consistently available.&lt;br&gt;&lt;br&gt;
Here’s a simple example to understand how Cloudflare works:&lt;br&gt;&lt;br&gt;
When you visit a website that has Cloudflare enabled (such as example.com), your request first reaches Cloudflare’s edge server, not the origin server. Cloudflare will then determine whether to allow your request to continue based on several rules, such as:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whether the cached page can be returned directly;
&lt;/li&gt;
&lt;li&gt;Whether you need to pass a CAPTCHA test;
&lt;/li&gt;
&lt;li&gt;Whether your request will be blocked;
&lt;/li&gt;
&lt;li&gt;Whether the request will be forwarded to the actual website server (origin).
&lt;/li&gt;
&lt;/ul&gt;
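&lt;p&gt;The decision flow above can be sketched as a simple function. The rule names and their ordering here are a conceptual illustration only, not Cloudflare's actual logic:&lt;/p&gt;

```python
def edge_decision(request: dict) -> str:
    """Conceptual sketch of an edge server's routing decision (illustrative only)."""
    if request.get("cached"):        # a cached copy of the page is available
        return "serve_from_cache"
    if request.get("suspicious"):    # fingerprint or behavior looks automated
        return "require_captcha"
    if request.get("blocked_ip"):    # IP is on a block list or rate-limited
        return "block"
    return "forward_to_origin"       # legitimate request reaches the origin server

print(edge_decision({"cached": False, "suspicious": True}))
```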

&lt;p&gt;If you are identified as a legitimate user, Cloudflare will forward the request to the origin server and return the content to you. This mechanism greatly enhances the website's security but also presents significant challenges for automated access.&lt;br&gt;&lt;br&gt;
Bypassing Cloudflare is one of the toughest technical challenges in many data collection tasks. Below, we will dive deeper into why bypassing Cloudflare is difficult.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Challenges in Bypassing Cloudflare Protection&lt;/strong&gt;
Bypassing Cloudflare is not easy, especially when advanced anti-bot features (such as Bot Management, Managed Challenge, Turnstile verification, and JS challenges) are enabled. Many traditional scraping tools (like Selenium and Puppeteer) are detected and blocked before they can even load a page, owing to telltale fingerprint features or unnatural behavior simulation.
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Although there are some open-source tools specifically designed to bypass Cloudflare (such as FlareSolverr, undetected-chromedriver), these tools typically have a short lifespan. Once they are widely used, Cloudflare quickly updates its detection rules to block them. This means that to bypass Cloudflare's protection mechanisms in a sustained and stable manner, teams often need in-house development capabilities and continuous resource investment for maintenance and updates.&lt;br&gt;&lt;br&gt;
Here are the main challenges in bypassing Cloudflare protection:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strict Browser Fingerprint Recognition&lt;/strong&gt;: Cloudflare detects fingerprint features in requests such as User-Agent, language settings, screen resolution, time zone, and Canvas/WebGL rendering. If it detects abnormal browsers or automation behaviors, it blocks the request.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex JS Challenge Mechanisms&lt;/strong&gt;: Cloudflare dynamically generates JavaScript challenges (such as CAPTCHA, delayed redirects, logical calculations, etc.), and automated scripts often struggle to correctly parse or execute these complex logics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral Analysis Systems&lt;/strong&gt;: In addition to static fingerprints, Cloudflare also analyzes user behavior trajectories, such as mouse movements, time spent on a page, scrolling actions, etc. This requires high precision in simulating human behavior.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate and Concurrency Control&lt;/strong&gt;: High-frequency access can easily trigger Cloudflare’s rate limiting and IP blocking strategies. Proxy pools and distributed scheduling must be highly optimized.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invisible Server-Side Validation&lt;/strong&gt;: Since Cloudflare is an edge interceptor, many real requests are blocked before reaching the origin server, making traditional packet capture analysis methods ineffective.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Therefore, successfully bypassing Cloudflare requires simulating real browser behavior, executing JavaScript dynamically, configuring fingerprints flexibly, and using high-quality proxies and dynamic scheduling mechanisms.&lt;/p&gt;
&lt;h2&gt;
  
  
  Bypassing Idealista Cloudflare with Scrapeless Scraping Browser to Collect Real Estate Data
&lt;/h2&gt;



&lt;p&gt;In this chapter, we will demonstrate how to use Scrapeless Scraping Browser to build an efficient, stable automation system that withstands anti-scraping defenses, and use it to collect real estate data from Idealista, a leading European real estate platform. Idealista employs multiple protection mechanisms, including Cloudflare, dynamic loading, IP rate limiting, and user behavior recognition, making it a highly challenging target.&lt;br&gt;&lt;br&gt;
We will focus on the following technical aspects:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bypassing Cloudflare verification pages
&lt;/li&gt;
&lt;li&gt;Custom fingerprinting and simulating real user behavior
&lt;/li&gt;
&lt;li&gt;Using Session Replay
&lt;/li&gt;
&lt;li&gt;High-concurrency scraping with multiple proxy pools
&lt;/li&gt;
&lt;li&gt;Cost optimization
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Understanding the Challenge: Idealista's Cloudflare Protection
&lt;/h3&gt;

&lt;p&gt;Idealista is a leading online real estate platform in Southern Europe, offering millions of listings for various types of properties, including residential homes, apartments, and shared rooms. Given the high commercial value of its property data, the platform has implemented strict anti-scraping measures.&lt;br&gt;&lt;br&gt;
To combat automated scraping, Idealista has deployed Cloudflare — a widely used anti-bot and security protection system designed to defend against malicious bots, DDoS attacks, and data abuse. Cloudflare's anti-scraping mechanisms primarily consist of the following elements:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Access Verification Mechanisms&lt;/strong&gt;: Including JS Challenge, browser integrity checks, and CAPTCHA verification, to determine whether the visitor is a real user.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral Analysis&lt;/strong&gt;: Detecting real users through actions such as mouse movements, clicking patterns, and scroll speeds.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Header Analysis&lt;/strong&gt;: Inspecting browser types, language settings, and referrer data to check for discrepancies. Suspicious headers may expose attempts to disguise automated bots.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fingerprint Detection and Blocking&lt;/strong&gt;: Identifying traffic generated by automation tools (like Selenium and Puppeteer) through browser fingerprints, TLS fingerprints, and header information.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edge Node Filtering&lt;/strong&gt;: Requests first enter Cloudflare's global edge network, which evaluates their risk. Only requests deemed low-risk are forwarded to Idealista's origin servers.
&lt;/li&gt;
&lt;/ul&gt;
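&lt;p&gt;As a toy illustration of the header-analysis idea above (these checks are ours, not Cloudflare's actual rules), a consistency check might flag requests whose headers do not match what a real browser would send:&lt;/p&gt;

```python
def headers_look_suspicious(headers: dict) -> bool:
    """Toy header-consistency check in the spirit of the analysis described above."""
    ua = headers.get("User-Agent", "")
    # No User-Agent at all, or an explicit automation marker, is an immediate red flag.
    if not ua or "HeadlessChrome" in ua:
        return True
    # A Chrome UA without an Accept-Language header is inconsistent with real browsers.
    if "Chrome" in ua and "Accept-Language" not in headers:
        return True
    return False

print(headers_look_suspicious({"User-Agent": "Mozilla/5.0 (X11) HeadlessChrome/120"}))
```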

&lt;p&gt;Next, we will explain in detail how to use Scrapeless Scraping Browser to bypass Idealista's Cloudflare protection and successfully collect real estate data.&lt;/p&gt;
&lt;h3&gt;
  
  
  Bypassing Idealista Cloudflare with Scrapeless Scraping Browser
&lt;/h3&gt;


&lt;h4&gt;
  
  
  Prerequisites
&lt;/h4&gt;

&lt;p&gt;Before we begin, let's make sure we have the necessary tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Python&lt;/strong&gt;: If you haven't installed Python yet, please download the latest version and install it on your system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Required Libraries&lt;/strong&gt;: You need to install several Python libraries. Open a terminal or command prompt and run the following command:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  pip install requests beautifulsoup4 lxml selenium selenium-wire undetected-chromedriver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ChromeDriver&lt;/strong&gt;: Download &lt;a href="https://developer.chrome.com/docs/chromedriver/downloads" rel="noopener noreferrer"&gt;ChromeDriver&lt;/a&gt;. Make sure to choose the version that matches your installed version of Chrome.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scrapeless Account&lt;/strong&gt;: To bypass Idealista's bot protection, you’ll need a Scrapeless Scraping Browser account. You can &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;sign up here&lt;/a&gt; and receive a $2 free trial.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Locating the Data
&lt;/h4&gt;

&lt;p&gt;Our goal is to extract detailed information about each property listing on Idealista. We can use the browser’s developer tools to understand the structure of the site and identify the HTML elements we need to target.&lt;/p&gt;

&lt;p&gt;Right-click anywhere on the page and select &lt;strong&gt;Inspect&lt;/strong&gt; to view the page source.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzu55pdt0mac2589uxb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frzu55pdt0mac2589uxb7.png" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we will focus on scraping property listings from Alcala de Henares, Madrid using the following URL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.idealista.com/venta-viviendas/alcala-de-henares-madrid/" rel="noopener noreferrer"&gt;https://www.idealista.com/venta-viviendas/alcala-de-henares-madrid/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We want to extract the following data points from each listing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Title&lt;/li&gt;
&lt;li&gt;Price&lt;/li&gt;
&lt;li&gt;Area information&lt;/li&gt;
&lt;li&gt;Property description&lt;/li&gt;
&lt;li&gt;Image URLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below you can see the annotated property listing page showing where all the information for each property is located.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72q8h2m4xmyy1bo86pcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72q8h2m4xmyy1bo86pcn.png" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By inspecting the HTML source code, we can identify the CSS selector for each data point. CSS selectors are patterns used to select elements in an HTML document.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flksqx5ltzjh55wiqibdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flksqx5ltzjh55wiqibdc.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By inspecting the HTML source code, we found that each property listing is contained within an &lt;code&gt;&amp;lt;article&amp;gt;&lt;/code&gt; tag with the class &lt;code&gt;item&lt;/code&gt;. Within each item:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The title is located in an &lt;code&gt;&amp;lt;a&amp;gt;&lt;/code&gt; tag with the class &lt;code&gt;item-link&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The price is found in a &lt;code&gt;&amp;lt;span&amp;gt;&lt;/code&gt; tag with the class &lt;code&gt;item-price&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;And so on for other data points.&lt;/li&gt;
&lt;/ul&gt;
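&lt;p&gt;To illustrate how these class names map to extraction code, here is a dependency-free sketch using Python's built-in &lt;code&gt;html.parser&lt;/code&gt; on a made-up listing fragment (the markup below mirrors the structure described above, not Idealista's real HTML):&lt;/p&gt;

```python
from html.parser import HTMLParser

# Made-up fragment mirroring the structure described above
SAMPLE = """
<article class="item">
  <a class="item-link" title="Flat in Alcala de Henares">Flat in Alcala de Henares</a>
  <span class="item-price">250,000</span>
</article>
"""

class ListingParser(HTMLParser):
    """Extract the title and price of one listing via the classes above."""
    def __init__(self):
        super().__init__()
        self.title = None
        self.price = None
        self._price_parts = None  # collects text while inside the price span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "item-link":
            self.title = attrs.get("title")
        elif tag == "span" and attrs.get("class") == "item-price":
            self._price_parts = []

    def handle_data(self, data):
        if self._price_parts is not None:
            self._price_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._price_parts is not None:
            self.price = "".join(self._price_parts).strip()
            self._price_parts = None

parser = ListingParser()
parser.feed(SAMPLE)
print(parser.title, "|", parser.price)
```

In the actual tutorial below, BeautifulSoup performs the same mapping far more concisely with `find` and `find_all`.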
&lt;h3&gt;
  
  
  Step 1: Set Up Selenium with ChromeDriver
&lt;/h3&gt;

&lt;p&gt;First, we need to configure Selenium to use ChromeDriver. Start by setting up &lt;code&gt;chrome_options&lt;/code&gt; and initializing the ChromeDriver.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
from datetime import datetime
import json
def listings(url):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    s = Service("Replace with your path to ChromeDriver")
    driver = webdriver.Chrome(service=s, options=chrome_options)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code imports the necessary modules, including &lt;code&gt;seleniumwire&lt;/code&gt; (which extends Selenium with request-inspection capabilities) and &lt;code&gt;BeautifulSoup&lt;/code&gt; for HTML parsing. &lt;/p&gt;

&lt;p&gt;We define a function &lt;code&gt;listings(url)&lt;/code&gt; and configure Chrome to run in headless mode by adding the &lt;code&gt;--headless&lt;/code&gt; argument to &lt;code&gt;chrome_options&lt;/code&gt;. Then, we initialize ChromeDriver using the specified service path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Load the Target URL
&lt;/h3&gt;

&lt;p&gt;Next, we load the target URL and wait for the page to fully load.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    driver.get(url)
    time.sleep(8)  # Adjust based on website's load time

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;code&gt;driver.get(url)&lt;/code&gt; command instructs the browser to navigate to the specified URL. &lt;/p&gt;

&lt;p&gt;We use &lt;code&gt;time.sleep(8)&lt;/code&gt; to pause the script for 8 seconds, allowing enough time for the web page to fully load. This wait time can be adjusted depending on the website's loading speed.&lt;/p&gt;
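&lt;p&gt;A fixed sleep either wastes time on fast loads or is too short on slow ones. As an optional, Selenium-independent sketch, a generic polling helper can wait for a condition instead; in real code the predicate could be a lambda checking &lt;code&gt;driver.find_elements(...)&lt;/code&gt;:&lt;/p&gt;

```python
import time

def wait_for(predicate, timeout=10.0, interval=0.25):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = predicate()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError("condition not met within %.1fs" % timeout)

# Example: the predicate becomes true on the third poll.
calls = {"n": 0}
def ready():
    calls["n"] += 1
    return calls["n"] >= 3

print(wait_for(ready, timeout=5.0, interval=0.01))  # prints True
```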

&lt;h3&gt;
  
  
  Step 3: Parse the Page Content
&lt;/h3&gt;

&lt;p&gt;Once the page is loaded, we use BeautifulSoup to parse its content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    soup = BeautifulSoup(driver.page_source, "lxml")
    driver.quit()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we use &lt;code&gt;driver.page_source&lt;/code&gt; to retrieve the HTML content of the loaded page, and parse it using BeautifulSoup with the &lt;code&gt;lxml&lt;/code&gt; parser. Finally, we call &lt;code&gt;driver.quit()&lt;/code&gt; to close the browser instance and clean up resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Extract Data from the Parsed HTML
&lt;/h3&gt;

&lt;p&gt;Next, we extract the relevant data from the parsed HTML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    house_listings = soup.find_all("article", class_="item")
    extracted_data = []
    for listing in house_listings:
        description_elem = listing.find("div", class_="item-description")
        description_text = description_elem.get_text(strip=True) if description_elem else "nil"
        item_details = listing.find_all("span", class_="item-detail")
        bedrooms = item_details[0].get_text(strip=True) if len(item_details) &amp;gt; 0 else "nil"
        area = item_details[1].get_text(strip=True) if len(item_details) &amp;gt; 1 else "nil"
        image_urls = [img["src"] for img in listing.find_all("img") if img.get("src")]
        first_image_url = image_urls[0] if image_urls else "nil"
        listing_info = {
            "Title": listing.find("a", class_="item-link").get("title", "nil"),
            "Price": listing.find("span", class_="item-price").get_text(strip=True),
            "Bedrooms": bedrooms,
            "Area": area,
            "Description": description_text,
            "Image URL": first_image_url,
        }
        extracted_data.append(listing_info)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, we look for all elements matching the &lt;code&gt;article&lt;/code&gt; tag with the class name &lt;code&gt;item&lt;/code&gt;, which represent individual property listings. For each listing, we extract its title, details (such as number of bedrooms and area), and the image URL. We store these details in a dictionary and append each dictionary to a list called &lt;code&gt;extracted_data&lt;/code&gt;.&lt;/p&gt;
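&lt;p&gt;As an optional refactoring (not part of the original script), the repeated &lt;code&gt;... if element else "nil"&lt;/code&gt; pattern can be factored into a small helper:&lt;/p&gt;

```python
def text_or_nil(element) -> str:
    """Return an element's stripped text, or "nil" when the element is missing."""
    return element.get_text(strip=True) if element is not None else "nil"

# Works with any object exposing get_text(), e.g. a BeautifulSoup tag.
# FakeTag below is a stand-in used only for this demonstration.
class FakeTag:
    def __init__(self, text):
        self._text = text
    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text

print(text_or_nil(FakeTag("  3 hab.  ")))  # prints 3 hab.
print(text_or_nil(None))                   # prints nil
```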

&lt;h3&gt;
  
  
  Step 5: Save the Extracted Data
&lt;/h3&gt;

&lt;p&gt;Finally, we save the extracted data into a JSON file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    current_datetime = datetime.now().strftime("%Y%m%d%H%M%S")
    json_filename = f"new_revised_data_{current_datetime}.json"
    with open(json_filename, "w", encoding="utf-8") as json_file:
        json.dump(extracted_data, json_file, ensure_ascii=False, indent=2)
    print(f"Extracted data saved to {json_filename}")
url = "https://www.idealista.com/venta-viviendas/alcala-de-henares-madrid/"
idealista_listings = listings(url)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here is the complete code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import time
from datetime import datetime
import json
def listings(url):
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    s = Service("Replace with your path to ChromeDriver")
    driver = webdriver.Chrome(service=s, options=chrome_options)
    driver.get(url)
    time.sleep(8)  # Adjust based on website's load time
    soup = BeautifulSoup(driver.page_source, "lxml")
    driver.quit()
    house_listings = soup.find_all("article", class_="item")
    extracted_data = []
    for listing in house_listings:
        description_elem = listing.find("div", class_="item-description")
        description_text = description_elem.get_text(strip=True) if description_elem else "nil"
        item_details = listing.find_all("span", class_="item-detail")
        bedrooms = item_details[0].get_text(strip=True) if len(item_details) &amp;gt; 0 else "nil"
        area = item_details[1].get_text(strip=True) if len(item_details) &amp;gt; 1 else "nil"
        image_urls = [img["src"] for img in listing.find_all("img") if img.get("src")]
        first_image_url = image_urls[0] if image_urls else "nil"
        listing_info = {
            "Title": listing.find("a", class_="item-link").get("title", "nil"),
            "Price": listing.find("span", class_="item-price").get_text(strip=True),
            "Bedrooms": bedrooms,
            "Area": area,
            "Description": description_text,
            "Image URL": first_image_url,
        }
        extracted_data.append(listing_info)
    current_datetime = datetime.now().strftime("%Y%m%d%H%M%S")
    json_filename = f"new_revised_data_{current_datetime}.json"
    with open(json_filename, "w", encoding="utf-8") as json_file:
        json.dump(extracted_data, json_file, ensure_ascii=False, indent=2)
    print(f"Extracted data saved to {json_filename}")
url = "https://www.idealista.com/venta-viviendas/alcala-de-henares-madrid/"
idealista_listings = listings(url)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Bypassing Bot Detection
&lt;/h3&gt;

&lt;p&gt;If you’ve run the script at least twice during this tutorial, you may have noticed that a CAPTCHA page appears.&lt;/p&gt;

&lt;p&gt;The Cloudflare Challenge page initially loads the &lt;code&gt;cf-chl-bypass&lt;/code&gt; script and performs JavaScript computations, which typically takes about 5 seconds.&lt;/p&gt;

&lt;p&gt;Scrapeless offers a simple and reliable way to access data from sites like Idealista without building and maintaining your own scraping infrastructure. Scrapeless Scraping Browser is a high-concurrency, cost-effective, anti-blocking browser platform built for AI and large-scale data scraping. It simulates highly human-like behavior and can solve reCAPTCHA, Cloudflare Turnstile/Challenge, AWS WAF, DataDome, and more in real time, making it an efficient web scraping solution.&lt;/p&gt;

&lt;p&gt;Below are the steps to bypass Cloudflare protection using Scrapeless:&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 1: Preparation
&lt;/h4&gt;

&lt;h5&gt;
  
  
  1.1 Create a Project Folder
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Create a new folder for your project, for example, &lt;code&gt;scrapeless-bypass&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Navigate to the folder in your terminal:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd path/to/scrapeless-bypass

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  1.2 Initialize the Node.js project
&lt;/h5&gt;

&lt;p&gt;Run the following command to create the package.json file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm init -y

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  1.3 Install required dependencies
&lt;/h5&gt;

&lt;p&gt;Install &lt;code&gt;puppeteer-core&lt;/code&gt;, which lets you connect to a remote browser instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer-core

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Puppeteer is not already installed on your system, install the full version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer puppeteer-core

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Get Your Scrapeless API Key
&lt;/h4&gt;

&lt;h5&gt;
  
  
  2.1 Sign Up on Scrapeless
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless&lt;/a&gt; and create an account.&lt;/li&gt;
&lt;li&gt;Navigate to the &lt;strong&gt;API Key Management&lt;/strong&gt; section.&lt;/li&gt;
&lt;li&gt;Generate a new API key and copy it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu20e0jra72r01ruf5swa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu20e0jra72r01ruf5swa.png" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Step 3: Connect to Scrapeless Browserless
&lt;/h4&gt;

&lt;h5&gt;
  
  
  3.1 Get the WebSocket connection URL
&lt;/h5&gt;

&lt;p&gt;Scrapeless exposes a WebSocket connection URL that Puppeteer uses to interact with the cloud-based browser.&lt;/p&gt;

&lt;p&gt;The format is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;wss://browser.scrapeless.com/browser?token=APIKey&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Replace APIKey with your actual Scrapeless API key.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h5&gt;
  
  
  3.2 Configure Connection Parameters
&lt;/h5&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;token&lt;/code&gt;: Your Scrapeless API key
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;session_ttl&lt;/code&gt;: Duration of the browser session (in seconds), e.g., &lt;code&gt;180&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;proxy_country&lt;/code&gt;: Country code of the proxy server (e.g., &lt;code&gt;GB&lt;/code&gt; for the United Kingdom, &lt;code&gt;US&lt;/code&gt; for the United States)&lt;/li&gt;
&lt;/ul&gt;
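&lt;p&gt;These parameters can be assembled into the connection URL with &lt;code&gt;URLSearchParams&lt;/code&gt;, which also handles escaping for you. A small helper sketch (not an official API):&lt;br&gt;
&lt;/p&gt;

```javascript
// Build the Scrapeless WebSocket endpoint from its query parameters.
function buildConnectionURL(apiKey, sessionTTL = 180, proxyCountry = 'ANY') {
  const query = new URLSearchParams({
    token: apiKey,
    session_ttl: String(sessionTTL),
    proxy_country: proxyCountry,
  });
  return `wss://browser.scrapeless.com/browser?${query.toString()}`;
}

console.log(buildConnectionURL('your_api_key', 180, 'GB'));
```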




&lt;h4&gt;
  
  
  Step 4: Write the Puppeteer Script
&lt;/h4&gt;

&lt;h5&gt;
  
  
  4.1 Create the Script File
&lt;/h5&gt;

&lt;p&gt;Inside your project folder, create a new JavaScript file named &lt;code&gt;bypass-cloudflare.js&lt;/code&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  4.2 Connect to Scrapeless and Launch Puppeteer
&lt;/h5&gt;

&lt;p&gt;Add the following code to &lt;code&gt;bypass-cloudflare.js&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from 'puppeteer-core';

const API_KEY = 'your_api_key'; // Replace with your actual API Key
const host = 'wss://browser.scrapeless.com';
const query = new URLSearchParams({
  token: API_KEY,
  session_ttl: '180', // Browser session duration in seconds
  proxy_country: 'GB', // Proxy country code
  proxy_session_id: 'test_session', // Proxy session ID (keeps the same IP)
  proxy_session_duration: '5' // Proxy session duration in minutes
}).toString();

const connectionURL = `${host}/browser?${query}`;

const browser = await puppeteer.connect({
  browserWSEndpoint: connectionURL,
  defaultViewport: null,
});
console.log('Connected to Scrapeless');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  4.3 Open a webpage and bypass Cloudflare
&lt;/h5&gt;

&lt;p&gt;Extend the script to open a new page and navigate to a website protected by Cloudflare:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const page = await browser.newPage();
await page.goto('https://www.scrapingcourse.com/cloudflare-challenge', { waitUntil: 'domcontentloaded' });

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  4.4 Waiting for page elements to load
&lt;/h5&gt;

&lt;p&gt;Make sure Cloudflare protection is bypassed before proceeding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.waitForSelector('main.page-content .challenge-info', { timeout: 30000 }); // Adjust selector as needed

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  4.5 Take a screenshot
&lt;/h5&gt;

&lt;p&gt;To verify whether Cloudflare protection has been successfully bypassed, take a screenshot of the page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;await page.screenshot({ path: 'challenge-bypass.png' });
console.log('Screenshot saved as challenge-bypass.png');

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  4.6 Complete script
&lt;/h5&gt;

&lt;p&gt;The following is the complete script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from 'puppeteer-core';

const API_KEY = 'your_api_key'; // Replace with your actual API Key
const host = 'wss://browser.scrapeless.com';
const query = new URLSearchParams({
  token: API_KEY,
  session_ttl: '180',
  proxy_country: 'GB',
  proxy_session_id: 'test_session',
  proxy_session_duration: '5'
}).toString();

const connectionURL = `${host}/browser?${query}`;

(async () =&amp;gt; {
  try {
    // Connect to Scrapeless
    const browser = await puppeteer.connect({
      browserWSEndpoint: connectionURL,
      defaultViewport: null,
    });
    console.log('Connected to Scrapeless');

    // Open a new page and navigate to the target website
    const page = await browser.newPage();
    await page.goto('https://www.scrapingcourse.com/cloudflare-challenge', { waitUntil: 'domcontentloaded' });

    // Wait for the page to load completely
    await new Promise((resolve) =&amp;gt; setTimeout(resolve, 5000)); // Fixed delay; adjust if necessary (page.waitForTimeout was removed in recent Puppeteer versions)
    await page.waitForSelector('main.page-content', { timeout: 30000 });

    // Capture a screenshot
    await page.screenshot({ path: 'challenge-bypass.png' });
    console.log('Screenshot saved as challenge-bypass.png');

    // Close the browser
    await browser.close();
    console.log('Browser closed');
  } catch (error) {
    console.error('Error:', error);
  }
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 5: Run the script
&lt;/h4&gt;

&lt;h5&gt;
  
  
  5.1 Save the script
&lt;/h5&gt;

&lt;p&gt;Make sure the script is saved as &lt;code&gt;bypass-cloudflare.js&lt;/code&gt;.&lt;/p&gt;

&lt;h5&gt;
  
  
  5.2 Execute the script
&lt;/h5&gt;

&lt;p&gt;Run the script using Node.js:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node bypass-cloudflare.js

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  5.3 Expected Output
&lt;/h5&gt;

&lt;p&gt;If everything is set up correctly, the terminal will display:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Connected to Scrapeless
Screenshot saved as challenge-bypass.png
Browser closed

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;challenge-bypass.png&lt;/code&gt; file will appear in your project folder, confirming that Cloudflare protection has been successfully bypassed.&lt;/p&gt;

&lt;p&gt;You can also integrate Scrapeless Scraping Browser directly into your scraping code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');
const connectionURL = 'wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY'; // Replace your_api_key with your actual API key

(async () =&amp;gt; {
    const browser = await puppeteer.connect({browserWSEndpoint: connectionURL});
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    console.log(await page.title());
    await browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Fingerprint Customization
&lt;/h3&gt;

&lt;p&gt;When scraping data from websites—especially large real estate platforms like &lt;strong&gt;Idealista&lt;/strong&gt;—even if you successfully bypass &lt;strong&gt;Cloudflare&lt;/strong&gt; challenges using &lt;strong&gt;Scrapeless&lt;/strong&gt;, you might still be flagged as a bot due to repetitive or high-volume access.&lt;/p&gt;

&lt;p&gt;Websites often use &lt;strong&gt;browser fingerprinting&lt;/strong&gt; to detect automated behavior and restrict access.&lt;/p&gt;




&lt;h4&gt;
  
  
  ⚠️ Common Issues You May Encounter
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slow response times after multiple scrapes&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The site may throttle requests based on IP or behavioral patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Page layout fails to render&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Dynamic content may rely on real browser environments, causing missing or broken data during scraping.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Missing listings in certain regions&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Websites may block or hide content based on suspicious traffic patterns.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;These problems are usually caused by identical browser configurations for each request. If your browser fingerprint remains unchanged, it’s easy for anti-bot systems to detect automation.&lt;/p&gt;




&lt;h4&gt;
  
  
  Solution: Custom Fingerprinting with Scrapeless
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Scrapeless Scraping Browser&lt;/strong&gt; provides built-in support for fingerprint customization to mimic real user behavior and avoid detection.&lt;/p&gt;

&lt;p&gt;You can &lt;strong&gt;randomize or customize&lt;/strong&gt; the following fingerprint elements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Fingerprint Element&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;User-Agent&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mimic various OS/browser combinations (e.g., Chrome on Windows/Mac).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Platform&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Simulate different operating systems (Windows, macOS, etc.).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Screen Size&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Emulate various device resolutions to avoid mobile/desktop mismatches.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Localization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Align language and timezone with geolocation for consistency.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;By rotating or customizing these values, each request appears more natural—reducing the risk of detection and improving data extraction reliability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');

const query = new URLSearchParams({
  token: 'your-scrapeless-api-key', // required
  session_ttl: 180,
  proxy_country: 'ANY',
  // Set fingerprint parameters
  userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.6998.45 Safari/537.36',
  platform: 'macOS', // Keep the platform consistent with the Macintosh User-Agent above
  screen: JSON.stringify({ width: 1280, height: 1024 }),
  localization: JSON.stringify({
    locale: 'zh-HK',
    languages: ['zh-HK', 'en-US', 'en'],
    timezone: 'Asia/Hong_Kong',
  })
});

const connectionURL = `wss://browser.scrapeless.com/browser?${query.toString()}`;

(async () =&amp;gt; {
    const browser = await puppeteer.connect({browserWSEndpoint: connectionURL});
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    console.log(await page.title());
    await browser.close();
})();



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
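&lt;p&gt;To rotate fingerprints across sessions, you could also pick a profile at random per connection. A sketch, assuming the parameter names above; the profile values are illustrative, and each profile keeps its user agent, platform, and screen size mutually consistent:&lt;br&gt;
&lt;/p&gt;

```javascript
// A few internally consistent fingerprint profiles (illustrative values only).
const PROFILES = [
  {
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36',
    platform: 'Windows',
    screen: { width: 1920, height: 1080 },
  },
  {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36',
    platform: 'macOS',
    screen: { width: 1440, height: 900 },
  },
];

// Pick one profile at random and serialize it for the connection query string.
function randomFingerprintParams() {
  const p = PROFILES[Math.floor(Math.random() * PROFILES.length)];
  return {
    userAgent: p.userAgent,
    platform: p.platform,
    screen: JSON.stringify(p.screen),
  };
}

console.log(randomFingerprintParams().platform);
```

&lt;p&gt;Spread the returned fields into the &lt;code&gt;URLSearchParams&lt;/code&gt; object before connecting, so each new session presents a different profile.&lt;/p&gt;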



&lt;h3&gt;
  
  
  Session Replay
&lt;/h3&gt;

&lt;p&gt;After customizing browser fingerprints, page stability significantly improves, and content extraction becomes more reliable.&lt;/p&gt;

&lt;p&gt;However, during large-scale scraping operations, unexpected issues may still cause extraction failures. To address this, &lt;strong&gt;Scrapeless&lt;/strong&gt; offers a powerful &lt;strong&gt;Session Replay&lt;/strong&gt; feature.&lt;/p&gt;




&lt;h4&gt;
  
  
  What is Session Replay?
&lt;/h4&gt;

&lt;p&gt;Session Replay records the entire browser session in detail, capturing all interactions, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page load process
&lt;/li&gt;
&lt;li&gt;Network request and response data
&lt;/li&gt;
&lt;li&gt;JavaScript execution behavior
&lt;/li&gt;
&lt;li&gt;Dynamically loaded but unparsed content
&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Why Use Session Replay?
&lt;/h4&gt;

&lt;p&gt;When scraping complex websites like &lt;strong&gt;Idealista&lt;/strong&gt;, Session Replay can greatly improve debugging efficiency.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benefit&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Precise Issue Tracking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Quickly identify failed requests without guesswork&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;No Need to Re-run Code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Analyze issues directly from the replay instead of rerunning the scraper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Improved Collaboration&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Share replay logs with team members for easier troubleshooting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic Content Analysis&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Understand how dynamically loaded data behaves during scraping&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h4&gt;
  
  
  Usage Tip
&lt;/h4&gt;

&lt;p&gt;Once &lt;strong&gt;Session Replay&lt;/strong&gt; is enabled, check the replay logs first whenever a scrape fails or data looks incomplete. This helps you diagnose the issue faster and reduce debugging time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Proxy Configuration
&lt;/h3&gt;

&lt;p&gt;When scraping Idealista, it's important to note that the platform is highly sensitive to non-local IP addresses—especially when accessing listings from specific cities. If your IP originates from outside the country, Idealista may:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Block the request entirely
&lt;/li&gt;
&lt;li&gt;Return a simplified or stripped-down version of the page
&lt;/li&gt;
&lt;li&gt;Serve empty or incomplete data, even without triggering a CAPTCHA
&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Scrapeless Built-in Proxy Support
&lt;/h4&gt;

&lt;p&gt;Scrapeless offers &lt;strong&gt;built-in proxy configuration&lt;/strong&gt;, allowing you to specify your geographic source directly.&lt;/p&gt;

&lt;p&gt;You can configure this using either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;proxy_country&lt;/code&gt;: A two-letter country code (e.g., &lt;code&gt;'ES'&lt;/code&gt; for Spain)
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;proxy_url&lt;/code&gt;: Your own proxy server URL
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example usage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;proxy_country: 'ES',

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
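&lt;p&gt;Both options go into the same connection query string. A small sketch, assuming the parameter names above; pass either a country code or your own proxy URL, not both:&lt;br&gt;
&lt;/p&gt;

```javascript
// Assemble proxy options into the Scrapeless connection query string.
function proxyQuery(token, options = {}) {
  const params = new URLSearchParams({ token: token, session_ttl: '180' });
  if (options.proxyUrl) {
    params.set('proxy_url', options.proxyUrl); // your own proxy server
  } else {
    params.set('proxy_country', options.proxyCountry ?? 'ANY');
  }
  return params.toString();
}

console.log(proxyQuery('your_api_key', { proxyCountry: 'ES' }));
```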



&lt;h3&gt;
  
  
  High Concurrency
&lt;/h3&gt;

&lt;p&gt;The page we just scraped from Idealista—&lt;a href="https://www.idealista.com/venta-viviendas/alcala-de-henares-madrid/" rel="noopener noreferrer"&gt;Alcalá de Henares Real Estate Listings&lt;/a&gt;—has as many as 6 pages of listings. &lt;/p&gt;

&lt;p&gt;When you're researching industry trends or gathering competitive marketing strategies, you might need to scrape real estate data from &lt;strong&gt;20+ cities daily&lt;/strong&gt;, covering &lt;strong&gt;thousands of pages&lt;/strong&gt;. In some cases, you may even need to refresh this data every hour.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxwxqd6l7acuomi5s85b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxwxqd6l7acuomi5s85b.png" width="800" height="425"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  High-Concurrency Requirements
&lt;/h4&gt;

&lt;p&gt;To handle this volume efficiently, consider the following requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple concurrent connections&lt;/strong&gt;: To scrape data from hundreds of pages without long wait times.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation tools&lt;/strong&gt;: Use Scrapeless Scraping Browser or similar tools that can handle concurrent requests at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session management&lt;/strong&gt;: Maintain persistent sessions to avoid excessive CAPTCHAs or IP blocks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  Scrapeless Scalability
&lt;/h4&gt;

&lt;p&gt;Scrapeless is specifically designed for high-concurrency scraping. It offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parallel browser sessions&lt;/strong&gt;: Handle multiple requests simultaneously, allowing you to scrape large amounts of data across many cities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-cost, high-efficiency scraping&lt;/strong&gt;: Scraping in parallel reduces the cost per page scraped while optimizing throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bypass high-volume anti-bot defenses&lt;/strong&gt;: Automatically handles CAPTCHA and other verification systems, even during high-load scraping.&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: Ensure your requests are spaced out enough to mimic human-like browsing behavior and prevent rate-limiting or bans from Idealista.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Scalability &amp;amp; Cost Efficiency
&lt;/h4&gt;

&lt;p&gt;Regular Puppeteer struggles to efficiently scale sessions and integrate with queuing systems. However, Scrapeless Scraping Browser supports seamless scaling from &lt;strong&gt;dozens&lt;/strong&gt; of concurrent sessions to &lt;strong&gt;unlimited&lt;/strong&gt; concurrent sessions, ensuring &lt;strong&gt;zero queue time and zero timeouts&lt;/strong&gt; even during peak task loads.&lt;/p&gt;

&lt;p&gt;Here’s a comparison of various tools for high-concurrency scraping. Even with Scrapeless' high-concurrency browser, you don’t need to worry about costs—in fact, it can help you save nearly &lt;strong&gt;50%&lt;/strong&gt; in fees.&lt;/p&gt;




&lt;h4&gt;
  
  
  Tool Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Tool Name&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Hourly Rate (USD/hour)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Proxy Fees (USD/GB)&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Concurrent Support&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scrapeless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.063 – $0.090/hour (depends on concurrency &amp;amp; usage)&lt;/td&gt;
&lt;td&gt;$1.26 – $1.80/GB&lt;/td&gt;
&lt;td&gt;50 / 100 / 200 / 400 / 600 / 1000 / Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browserbase&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.10 – $0.198/hour (includes 2-5GB free proxies)&lt;/td&gt;
&lt;td&gt;$10/GB (after the free allocation)&lt;/td&gt;
&lt;td&gt;3 (Basic) / 50 (Advanced)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Brightdata&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.10/hour&lt;/td&gt;
&lt;td&gt;$9.5/GB (Standard); $12.5/GB (Advanced domains)&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Zenrows&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.09/hour&lt;/td&gt;
&lt;td&gt;$2.8 – $5.42/GB&lt;/td&gt;
&lt;td&gt;Up to 100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Browserless&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.084 – $0.15/hour (unit-based billing)&lt;/td&gt;
&lt;td&gt;$4.3/GB&lt;/td&gt;
&lt;td&gt;3 / 10 / 50&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;: If you require &lt;strong&gt;massive-scale scraping&lt;/strong&gt; and &lt;strong&gt;high-concurrency support&lt;/strong&gt;, &lt;strong&gt;Scrapeless&lt;/strong&gt; offers the best cost-to-performance ratio.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cost Control Strategies for Web Scraping
&lt;/h3&gt;

&lt;p&gt;Observant readers may have noticed that the Idealista pages we scrape often contain large numbers of high-definition property images, interactive maps, video presentations, and ad scripts. While these elements are friendly to end users, they are unnecessary for data extraction and significantly increase bandwidth consumption and cost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoch05yctopi11hu1slj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoch05yctopi11hu1slj.png" width="800" height="558"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To optimize traffic usage, we recommend users employ the following strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Resource Interception&lt;/strong&gt;: Intercept unnecessary resource requests to reduce traffic consumption.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request URL Interception&lt;/strong&gt;: Intercept specific requests based on URL characteristics to further minimize traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulate Mobile Devices&lt;/strong&gt;: Use mobile device configurations to fetch lighter page versions.&lt;/li&gt;
&lt;/ol&gt;




&lt;h4&gt;
  
  
  Detailed Strategies
&lt;/h4&gt;

&lt;h5&gt;
  
  
  1. &lt;strong&gt;Resource Interception&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Enabling resource interception can significantly improve scraping efficiency. By configuring Puppeteer's &lt;code&gt;setRequestInterception&lt;/code&gt; function, we can block resources such as images, media, fonts, and stylesheets, avoiding large content downloads.&lt;/p&gt;

&lt;h5&gt;
  
  
  2. &lt;strong&gt;Request URL Filtering&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;By examining request URLs, we can filter out irrelevant requests like advertising services and third-party analytics scripts that are unrelated to the data extraction. This reduces unnecessary network traffic.&lt;/p&gt;

&lt;h5&gt;
  
  
  3. &lt;strong&gt;Simulating Mobile Devices&lt;/strong&gt;
&lt;/h5&gt;

&lt;p&gt;Simulating a mobile device (e.g., setting the user agent to an iPhone) allows you to fetch a lighter, mobile-optimized version of the page. This results in fewer resources being loaded and speeds up the scraping process.&lt;/p&gt;
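&lt;p&gt;With Puppeteer, one way to do this is to set a mobile user agent and viewport before navigating (a sketch; the iPhone values are illustrative):&lt;br&gt;
&lt;/p&gt;

```javascript
// Configure a page to present itself as a mobile device (illustrative values).
async function emulateMobile(page) {
  await page.setUserAgent(
    'Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Mobile/15E148 Safari/604.1'
  );
  await page.setViewport({ width: 390, height: 844, isMobile: true, hasTouch: true });
}

// Usage sketch: call emulateMobile(page) before page.goto(...).
```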

&lt;blockquote&gt;
&lt;p&gt;For more information, please refer to the &lt;a href="https://docs.scrapeless.com/en/scraping-browser/guides/optimizing-cost/" rel="noopener noreferrer"&gt;Scrapeless official documentation&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Example Code
&lt;/h4&gt;

&lt;p&gt;Here’s an example of combining these three strategies using Scrapeless Cloud Browser + Puppeteer for optimized resource scraping:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import puppeteer from 'puppeteer-core';

const scrapelessUrl = 'wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY';

async function scrapeWithResourceBlocking(url) {
    const browser = await puppeteer.connect({
        browserWSEndpoint: scrapelessUrl,
        defaultViewport: null
    });
    const page = await browser.newPage();

    // Enable request interception
    await page.setRequestInterception(true);

    // Define resource types to block
    const BLOCKED_TYPES = new Set([
        'image',
        'font',
        'media',
        'stylesheet',
    ]);

    // Intercept requests
    page.on('request', (request) =&amp;gt; {
        if (BLOCKED_TYPES.has(request.resourceType())) {
            request.abort();
            console.log(`Blocked: ${request.resourceType()} - ${request.url().substring(0, 50)}...`);
        } else {
            request.continue();
        }
    });

    await page.goto(url, {waitUntil: 'domcontentloaded'});

    // Extract data
    const data = await page.evaluate(() =&amp;gt; {
        return {
            title: document.title,
            content: document.body.innerText.substring(0, 1000)
        };
    });

    await browser.close();
    return data;
}

// Usage
scrapeWithResourceBlocking('https://www.scrapeless.com')
    .then(data =&amp;gt; console.log('Scraping result:', data))
    .catch(error =&amp;gt; console.error('Scraping failed:', error));

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this way, you not only cut bandwidth costs significantly but also speed up crawling while preserving data quality, improving the overall stability and efficiency of the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  5.  Security and Compliance Recommendations
&lt;/h2&gt;

&lt;p&gt;When using Scrapeless for data scraping, developers should pay attention to the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Comply with the target website's &lt;code&gt;robots.txt&lt;/code&gt; file and relevant laws and regulations&lt;/strong&gt;: Ensure that your scraping activities are legal and respect the site's guidelines.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoid excessive requests that could lead to website downtime&lt;/strong&gt;: Be mindful of scraping frequency to prevent server overload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Do not scrape sensitive information&lt;/strong&gt;: Do not collect user privacy data, payment information, or any other sensitive content.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  6.  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the age of big data, data collection has become a crucial foundation for digital transformation across industries. Especially in fields such as market intelligence, e-commerce price comparison, competitive analysis, financial risk management, and real estate analysis, the demand for data-driven decision-making has become increasingly urgent. However, with the continuous evolution of web technologies, particularly the widespread use of dynamically loaded content, traditional web scrapers are gradually revealing their limitations. These limitations not only make scraping more difficult but also lead to the escalation of anti-scraping mechanisms, raising the barrier for web scraping.&lt;/p&gt;

&lt;p&gt;With the advancement of web technologies, traditional scrapers can no longer meet complex scraping needs. Below are some key challenges and corresponding solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Content Loading&lt;/strong&gt;: Browser-based scrapers, by simulating real browser rendering of JavaScript content, ensure they can scrape dynamically loaded web data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-Scraping Mechanisms&lt;/strong&gt;: Using proxy pools, fingerprint recognition, behavior simulation, and other techniques, we can bypass the anti-scraping mechanisms commonly triggered by traditional scrapers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High-Concurrency Scraping&lt;/strong&gt;: Headless browsers support high-concurrency task deployment, paired with proxy scheduling, to meet the needs of large-scale data scraping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance Issues&lt;/strong&gt;: By using legal APIs and proxy services, scraping activities can be ensured to comply with the terms of the target websites.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a result, browser-based scrapers have become the new trend in the industry. This technology not only simulates user behavior through real browsers but also flexibly handles modern websites' complex structures and anti-scraping mechanisms, offering developers more stable and efficient scraping solutions.&lt;/p&gt;

&lt;p&gt;Scrapeless Scraping Browser embraces this technological trend by combining browser rendering, proxy management, anti-detection technologies, and high-concurrency task scheduling, helping developers efficiently and stably complete data scraping tasks in complex online environments. It improves scraping efficiency and stability through several core advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-Concurrency Browser Solutions&lt;/strong&gt;: Scrapeless supports large-scale, high-concurrency tasks, enabling rapid deployment of thousands of scraping tasks to meet long-term scraping demands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-Detection as a Service&lt;/strong&gt;: Built-in CAPTCHA Solvers and customizable fingerprints help developers bypass fingerprint and behavior recognition mechanisms, greatly reducing the risk of being blocked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual Debugging Tool - Session Replay&lt;/strong&gt;: By replaying each browser interaction during the scraping process, developers can easily debug and diagnose issues in the scraping process, especially for handling complex pages and dynamically loaded content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compliance and Transparency Assurance&lt;/strong&gt;: Scrapeless emphasizes compliant data scraping, supporting adherence to website &lt;code&gt;robots.txt&lt;/code&gt; rules and providing detailed scraping logs to ensure that users' data scraping activities comply with target websites' policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Scalability&lt;/strong&gt;: Scrapeless integrates seamlessly with Puppeteer, allowing users to customize their scraping strategies and connect with other tools or platforms for a one-stop data scraping and analysis workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Whether scraping e-commerce platforms for price comparisons, extracting real estate website data, or applying it in financial risk monitoring and market intelligence analysis, Scrapeless provides high-efficiency, intelligent, and reliable solutions for various industries.&lt;/p&gt;

&lt;p&gt;With the technical details and best practices covered in this article, you now understand how to leverage Scrapeless for large-scale data scraping. Whether handling dynamic pages, extracting complex interactive data, optimizing traffic usage, or overcoming anti-scraping mechanisms, Scrapeless helps you achieve your scraping goals swiftly and efficiently.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>ai</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Optimizing Headless Browser Traffic: Cost Reduction Strategies with Puppeteer for Efficient Data Scraping</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Sun, 27 Apr 2025 02:30:58 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/optimizing-headless-browser-traffic-cost-reduction-strategies-with-puppeteer-for-efficient-data-21d0</link>
      <guid>https://dev.to/datacollectionscraper/optimizing-headless-browser-traffic-cost-reduction-strategies-with-puppeteer-for-efficient-data-21d0</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;When using Puppeteer for data scraping, traffic consumption is an important consideration, especially when proxy services are involved, since traffic costs can rise sharply. To optimize traffic usage, we can adopt the following strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Resource interception&lt;/strong&gt;: Reduce traffic consumption by intercepting unnecessary resource requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request URL interception&lt;/strong&gt;: Further reduce traffic by intercepting specific requests based on URL characteristics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulate mobile devices&lt;/strong&gt;: Use mobile device configurations to obtain lighter page versions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive optimization&lt;/strong&gt;: Combine the above methods to achieve the best results.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Optimization Scheme 1: Resource Interception
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Resource Interception Introduction
&lt;/h3&gt;

&lt;p&gt;In Puppeteer, &lt;code&gt;page.setRequestInterception(true)&lt;/code&gt; can capture every network request initiated by the browser and decide to &lt;strong&gt;continue&lt;/strong&gt; (&lt;code&gt;request.continue()&lt;/code&gt;), &lt;strong&gt;terminate&lt;/strong&gt; (&lt;code&gt;request.abort()&lt;/code&gt;), or &lt;strong&gt;customize the response&lt;/strong&gt; (&lt;code&gt;request.respond()&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;This method can significantly reduce bandwidth consumption and is especially suitable for &lt;strong&gt;crawling&lt;/strong&gt;, &lt;strong&gt;screenshotting&lt;/strong&gt;, and &lt;strong&gt;performance optimization&lt;/strong&gt; scenarios.&lt;/p&gt;
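
&lt;p&gt;As a quick illustration of &lt;code&gt;request.respond()&lt;/code&gt;, the sketch below answers image requests with a tiny stub body instead of aborting them; some pages treat aborted images as load errors, and a stub response avoids that while still saving bandwidth. The handler function name is illustrative, not part of the Puppeteer API.&lt;/p&gt;

```javascript
// Sketch: answer image requests with a 1x1 transparent GIF via
// request.respond() instead of aborting them. The handler is pure
// logic, so it can be exercised with a mock request object.
const TRANSPARENT_GIF = Buffer.from(
    'R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7',
    'base64'
);

function handleRequest(request) {
    if (request.resourceType() === 'image') {
        // Serve a stub body instead of hitting the network
        request.respond({
            status: 200,
            contentType: 'image/gif',
            body: TRANSPARENT_GIF,
        });
        return 'responded';
    }
    request.continue();
    return 'continued';
}
```

&lt;p&gt;After calling &lt;code&gt;page.setRequestInterception(true)&lt;/code&gt;, register the handler with &lt;code&gt;page.on('request', handleRequest)&lt;/code&gt;.&lt;/p&gt;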

&lt;h3&gt;
  
  
  Interceptable Resource Types and Suggestions
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Impact After Interception&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;image&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Image resources&lt;/td&gt;
&lt;td&gt;JPG/PNG/GIF/WebP images&lt;/td&gt;
&lt;td&gt;Images will not be displayed&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;font&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Font files&lt;/td&gt;
&lt;td&gt;TTF/WOFF/WOFF2 fonts&lt;/td&gt;
&lt;td&gt;System default fonts will be used instead&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;media&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Media files&lt;/td&gt;
&lt;td&gt;Video/audio files&lt;/td&gt;
&lt;td&gt;Media content cannot be played&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;manifest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Web App Manifest&lt;/td&gt;
&lt;td&gt;PWA configuration file&lt;/td&gt;
&lt;td&gt;PWA functionality may be affected&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Prefetch resources&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&amp;lt;link rel="prefetch"&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Minimal impact on the page&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stylesheet&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CSS Stylesheet&lt;/td&gt;
&lt;td&gt;External CSS files&lt;/td&gt;
&lt;td&gt;Page styles are lost, may affect layout&lt;/td&gt;
&lt;td&gt;⚠️ Caution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;websocket&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;WebSocket&lt;/td&gt;
&lt;td&gt;Real-time communication connection&lt;/td&gt;
&lt;td&gt;Real-time functionality disabled&lt;/td&gt;
&lt;td&gt;⚠️ Caution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;eventsource&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Server-Sent Events&lt;/td&gt;
&lt;td&gt;Server push data&lt;/td&gt;
&lt;td&gt;Push functionality disabled&lt;/td&gt;
&lt;td&gt;⚠️ Caution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;preflight&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CORS preflight request&lt;/td&gt;
&lt;td&gt;OPTIONS request&lt;/td&gt;
&lt;td&gt;Cross-origin requests fail&lt;/td&gt;
&lt;td&gt;⚠️ Caution&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;script&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JavaScript scripts&lt;/td&gt;
&lt;td&gt;External JS files&lt;/td&gt;
&lt;td&gt;Dynamic functionality disabled, SPA may not render&lt;/td&gt;
&lt;td&gt;❌ Avoid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;xhr&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;XHR requests&lt;/td&gt;
&lt;td&gt;AJAX data requests&lt;/td&gt;
&lt;td&gt;Unable to obtain dynamic data&lt;/td&gt;
&lt;td&gt;❌ Avoid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fetch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fetch requests&lt;/td&gt;
&lt;td&gt;Modern AJAX requests&lt;/td&gt;
&lt;td&gt;Unable to obtain dynamic data&lt;/td&gt;
&lt;td&gt;❌ Avoid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;document&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Main document&lt;/td&gt;
&lt;td&gt;HTML page itself&lt;/td&gt;
&lt;td&gt;Page cannot load&lt;/td&gt;
&lt;td&gt;❌ Avoid&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Recommendation Level Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Safe&lt;/strong&gt;: Interception has almost no impact on data scraping or first-screen rendering; it is recommended to block by default.&lt;/li&gt;
&lt;li&gt;⚠️ &lt;strong&gt;Caution&lt;/strong&gt;: May break styles, real-time functions, or cross-origin requests; requires business judgment.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Avoid&lt;/strong&gt;: Very likely to prevent SPA/dynamic sites from rendering or fetching data; block these only if you are absolutely sure you don't need them.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Resource Interception Example Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeWithResourceBlocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;defaultViewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Enable request interception&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequestInterception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Define resource types to block&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;font&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;media&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stylesheet&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]);&lt;/span&gt;

    &lt;span class="c1"&gt;// Intercept requests&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED_TYPES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resourceType&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Blocked: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resourceType&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="nf"&gt;scrapeWithResourceBlocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.scrapeless.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping result:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optimization Scheme 2: Request URL Interception
&lt;/h2&gt;

&lt;p&gt;In addition to intercepting by resource type, more granular interception control can be performed based on URL characteristics. This is particularly effective for blocking ads, analytics scripts, and other unnecessary third-party requests.&lt;/p&gt;

&lt;h3&gt;
  
  
  URL Interception Strategies
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intercept by domain&lt;/strong&gt;: Block all requests from a specific domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercept by path&lt;/strong&gt;: Block requests to a specific path&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercept by file type&lt;/strong&gt;: Block files with specific extensions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercept by keyword&lt;/strong&gt;: Block requests whose URLs contain specific keywords&lt;/li&gt;
&lt;/ol&gt;
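
&lt;p&gt;The full example later in this section covers domain and path interception; the file-type and keyword strategies can be sketched with a small URL matcher like the one below. The function and constant names are illustrative, not from any library.&lt;/p&gt;

```javascript
// Sketch: extension- and keyword-based URL checks (strategies 3 and 4).
const BLOCKED_EXTENSIONS = ['.mp4', '.webm', '.mp3', '.woff2'];
const BLOCKED_KEYWORDS = ['pixel', 'beacon', 'tracker'];

function shouldBlockUrl(url) {
    // Match extensions against the path only, so a query string
    // that merely contains ".mp4" cannot cause a false positive.
    const pathname = new URL(url).pathname;
    return BLOCKED_EXTENSIONS.some(ext => pathname.endsWith(ext)) ||
        BLOCKED_KEYWORDS.some(kw => url.includes(kw));
}
```

&lt;p&gt;Inside the &lt;code&gt;page.on('request', ...)&lt;/code&gt; handler, call &lt;code&gt;shouldBlockUrl(request.url())&lt;/code&gt; and abort the request when it returns true.&lt;/p&gt;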

&lt;h3&gt;
  
  
  Common Interceptable URL Patterns
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;URL Pattern&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Advertising services&lt;/td&gt;
&lt;td&gt;Advertising network domains&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ad.doubleclick.net&lt;/code&gt;, &lt;code&gt;googleadservices.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analytics services&lt;/td&gt;
&lt;td&gt;Statistics and analytics scripts&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;google-analytics.com&lt;/code&gt;, &lt;code&gt;hotjar.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Social media plugins&lt;/td&gt;
&lt;td&gt;Social sharing buttons, etc.&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;platform.twitter.com&lt;/code&gt;, &lt;code&gt;connect.facebook.net&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tracking pixels&lt;/td&gt;
&lt;td&gt;Pixels that track user behavior&lt;/td&gt;
&lt;td&gt;URLs containing &lt;code&gt;pixel&lt;/code&gt;, &lt;code&gt;beacon&lt;/code&gt;, &lt;code&gt;tracker&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large media files&lt;/td&gt;
&lt;td&gt;Large video, audio files&lt;/td&gt;
&lt;td&gt;Extensions like &lt;code&gt;.mp4&lt;/code&gt;, &lt;code&gt;.webm&lt;/code&gt;, &lt;code&gt;.mp3&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Font services&lt;/td&gt;
&lt;td&gt;Online font services&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;fonts.googleapis.com&lt;/code&gt;, &lt;code&gt;use.typekit.net&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⭐ Safe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CDN resources&lt;/td&gt;
&lt;td&gt;Static resource CDN&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;cdn.jsdelivr.net&lt;/code&gt;, &lt;code&gt;unpkg.com&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;⚠️ Caution&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  URL Interception Example Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeWithUrlBlocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;defaultViewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Enable request interception&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequestInterception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Define domains and URL patterns to block&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED_DOMAINS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google-analytics.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googletagmanager.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;doubleclick.net&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;facebook.net&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;twitter.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;linkedin.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;adservice.google.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED_PATHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/ads/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/analytics/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/pixel/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tracking/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/stats/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Intercept requests&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Check domain&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED_DOMAINS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Blocked domain: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Check path&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED_PATHS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Blocked path: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Allow other requests&lt;/span&gt;
        &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="nf"&gt;scrapeWithUrlBlocking&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.scrapeless.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping result:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Optimization Scheme 3: Simulate Mobile Devices
&lt;/h2&gt;

&lt;p&gt;Simulating mobile devices is another effective traffic optimization strategy because mobile websites usually provide lighter page content.&lt;/p&gt;

&lt;h3&gt;
  
  
  Advantages of Mobile Device Simulation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lighter page versions&lt;/strong&gt;: Many websites provide more concise content for mobile devices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smaller image resources&lt;/strong&gt;: Mobile versions usually load smaller images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simplified CSS and JavaScript&lt;/strong&gt;: Mobile versions usually use simplified styles and scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced ads and non-core content&lt;/strong&gt;: Mobile versions often remove some non-core functionality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Responsive layouts&lt;/strong&gt;: Obtain content layouts optimized for small screens&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Mobile Device Simulation Configuration
&lt;/h3&gt;

&lt;p&gt;Here is a configuration for a commonly used mobile device (iPhone X); other devices follow the same shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iPhoneX&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;viewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;375&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;812&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;deviceScaleFactor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;isMobile&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;hasTouch&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;isLandscape&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, use Puppeteer's built-in &lt;code&gt;KnownDevices&lt;/code&gt; registry to emulate a device directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;KnownDevices&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iPhone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;KnownDevices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;iPhone 15 Pro&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;iPhone&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Mobile Device Simulation Example Code
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;KnownDevices&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;scrapeWithMobileEmulation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;defaultViewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Set mobile device simulation&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iPhone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;KnownDevices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;iPhone 15 Pro&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;iPhone&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="nf"&gt;scrapeWithMobileEmulation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.scrapeless.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping result:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Comprehensive Optimization Example
&lt;/h2&gt;

&lt;p&gt;Here is a comprehensive example combining all optimization schemes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;KnownDevices&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;optimizedScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Starting optimized scraping: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Record traffic usage&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;totalBytesUsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;defaultViewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

    &lt;span class="c1"&gt;// Set mobile device simulation&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;iPhone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;KnownDevices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;iPhone 15 Pro&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;iPhone&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Set request interception&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequestInterception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Define resource types to block&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;media&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;font&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Define domains to block&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED_DOMAINS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;google-analytics.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;googletagmanager.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;facebook.net&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;doubleclick.net&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;adservice.google.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Define URL paths to block&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;BLOCKED_PATHS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/ads/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/analytics/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/tracking/&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Intercept requests&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;url&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;resourceType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resourceType&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="c1"&gt;// Check resource type&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED_TYPES&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resourceType&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Blocked resource type: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;resourceType&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; - &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Check domain&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED_DOMAINS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;domain&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Blocked domain: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Check path&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BLOCKED_PATHS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;includes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Blocked path: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Allow other requests&lt;/span&gt;
        &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Monitor network traffic&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contentLength&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content-length&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content-length&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nx"&gt;totalBytesUsed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;contentLength&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Simulate scrolling to trigger lazy-loading content&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrollBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHeight&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;
            &lt;span class="p"&gt;}))&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Output traffic usage statistics&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\nTraffic Usage Statistics:`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Used: &lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nx"&gt;totalBytesUsed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; MB`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="nf"&gt;optimizedScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.scrapeless.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping complete:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Optimization Comparison
&lt;/h3&gt;

&lt;p&gt;To compare traffic before and after optimization, we strip the optimizations out of the comprehensive example above. Here is the unoptimized version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;puppeteer-core&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;wss://browser.scrapeless.com/browser?token=your_api_key&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;optimizedScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Starting optimized scraping: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Record traffic usage&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;totalBytesUsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;puppeteer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;browserWSEndpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;scrapelessUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;defaultViewport&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newPage&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// Set request interception&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setRequestInterception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// Intercept requests&lt;/span&gt;
  &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;request&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Monitor network traffic&lt;/span&gt;
  &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;response&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;contentLength&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content-length&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;content-length&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;totalBytesUsed&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;contentLength&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;waitUntil&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;domcontentloaded&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Simulate scrolling to trigger lazy-loading content&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scrollBy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;window&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerHeight&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

  &lt;span class="c1"&gt;// Extract data&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;substring&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="na"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;querySelectorAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;href&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;href&lt;/span&gt;
      &lt;span class="p"&gt;}))&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Output traffic usage statistics&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`\nTraffic Usage Statistics:`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Used: &lt;/span&gt;&lt;span class="p"&gt;${(&lt;/span&gt;&lt;span class="nx"&gt;totalBytesUsed&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt; MB`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="nf"&gt;optimizedScraping&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://www.scrapeless.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping complete:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Scraping failed:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the unoptimized code, the printed statistics make the traffic difference immediately clear:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Traffic Used (MB)&lt;/th&gt;
&lt;th&gt;Saving Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unoptimized&lt;/td&gt;
&lt;td&gt;6.03&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Optimized&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.81&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;≈ 86.6 %&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
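&lt;p&gt;The saving ratio in the table follows directly from the two measurements: (6.03 − 0.81) / 6.03 ≈ 86.6%. A quick sanity check:&lt;/p&gt;

```javascript
// Verify the saving ratio reported in the comparison table.
const unoptimizedMB = 6.03;
const optimizedMB = 0.81;
const savingRatio = ((unoptimizedMB - optimizedMB) / unoptimizedMB) * 100;
console.log(`${savingRatio.toFixed(1)} %`); // prints "86.6 %"
```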

&lt;p&gt;By combining the optimization techniques above, you can significantly reduce proxy traffic consumption and improve scraping efficiency while still capturing the core content you need.&lt;/p&gt;
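&lt;p&gt;A common source of such savings is aborting requests for heavy resource types instead of continuing every one (the unoptimized code above calls &lt;code&gt;request.continue()&lt;/code&gt; for all requests). Below is a minimal, self-contained sketch of that decision logic; the specific list of blocked types is an illustrative assumption and should be tuned per target site:&lt;/p&gt;

```javascript
// Resource types worth aborting to save proxy traffic.
// (Illustrative set; adjust per target site.)
const BLOCKED_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Returns true when a Puppeteer request of this resourceType should be aborted.
function shouldBlock(resourceType) {
  return BLOCKED_TYPES.has(resourceType);
}

// In a page.on('request') handler:
//   shouldBlock(request.resourceType()) ? request.abort() : request.continue();
console.log(shouldBlock('image'));    // true: images are skipped
console.log(shouldBlock('document')); // false: the HTML itself is kept
```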

</description>
      <category>scraping</category>
      <category>puppeteer</category>
      <category>browser</category>
    </item>
    <item>
      <title>Scrapeless Scraping Browser - Browser Fingerprint Customization</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Thu, 24 Apr 2025 09:09:20 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/scrapeless-scraping-browser-browser-fingerprint-customization-340a</link>
      <guid>https://dev.to/datacollectionscraper/scrapeless-scraping-browser-browser-fingerprint-customization-340a</guid>
      <description>&lt;p&gt;Over the past three decades, browsers have consistently served as the primary gateway to the Internet. From early pioneers like Mosaic and Internet Explorer that transformed how people accessed the web, to today’s mainstream products led by Chrome, browsers have remained the core environment for information retrieval, task execution, and contextual interaction.&lt;/p&gt;

&lt;p&gt;With the rapid rise of artificial intelligence, the role of the browser is undergoing an unprecedented transformation. Whether it’s Opera Aria, Perplexity, or products currently incubated by OpenAI, a shared understanding is emerging: AI needs a browser of its own—a platform purpose-built for task execution and contextual understanding, rather than merely functioning as a plugin embedded in traditional browsers.&lt;/p&gt;

&lt;p&gt;From the perspective of AI integration, AI browser products can be roughly categorized into three types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Traditional browsers enhanced with AI&lt;/strong&gt;, typically in the form of copilot-style assistants, such as browser extensions for Microsoft Edge and Chrome.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Browsers with built-in AI capabilities&lt;/strong&gt; at the core level, enabling enhanced permissions and interactions—for instance, Arc Max for organizing tabs or Opera Aria for executing tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Dedicated AI-native browsers&lt;/strong&gt;, which is the foundational vision behind Scrapeless. In this model, users interact with an AI that operates within a browser running in a virtual machine, providing a more complete and autonomous solution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;Scrapeless Scraping Browser&lt;/strong&gt; was born from this vision. Designed specifically for AI agents, it not only addresses the challenges of &lt;a href="https://www.scrapeless.com/en/blog/scrapeless-scraping-browser-for-ai?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=fingerprintcustomization" rel="noopener noreferrer"&gt;high-concurrency and task automation&lt;/a&gt; but also pushes the boundaries of AI execution capabilities. However, through real-world deployment, a critical limitation has become evident: despite having powerful control over commands and web pages, all advantages vanish if the system is flagged as bot traffic by the target website. This reveals a key technical bottleneck in the current generation of AI browsers—&lt;strong&gt;the authenticity and diversity of browser fingerprints&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In response, Scrapeless has significantly enhanced its fingerprint customization capabilities in its latest product update. By deeply customizing the Chromium engine, Scrapeless enables highly granular fingerprint strategies, ensuring that each virtual browser instance possesses uniquely &lt;strong&gt;“human-like”&lt;/strong&gt; characteristics. This drastically reduces the risk of being flagged by platform security systems. The upgrade not only improves the stability of AI operations in high-frequency tasks but also provides a safer and more reliable execution environment for future agent-based systems.&lt;/p&gt;

&lt;p&gt;In the following sections, we’ll take a deep dive into the technical details behind Scrapeless’ fingerprinting layer and explore how it is becoming a critical component in the infrastructure of the next generation of AI-native browsers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scrapeless Scraping Browser: Advantages and Core Features
&lt;/h2&gt;

&lt;p&gt;Scrapeless Scraping Browser is a future-oriented cloud-based browser solution specifically designed for AI agents and automated task execution. It integrates a high-performance concurrent processing architecture, advanced browser fingerprint customization, and intelligent anti-bot evasion logic to provide users with a stable, efficient, and scalable data interaction platform.&lt;/p&gt;

&lt;p&gt;Whether used in intelligent agent systems for executing large-scale web tasks, or in complex scenarios like multi-account marketing, dynamic content extraction, and public opinion monitoring, Scrapeless delivers a secure, stealthy, and intelligent environment simulation capability—effectively bypassing traditional anti-bot mechanisms and fingerprint detection limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Technical Advantages
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Authentic Browser Environment
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Chromium Engine Support: Provides a fully functional browser environment to simulate real user behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;TLS Fingerprint Spoofing: Masks the TLS fingerprint so traffic bypasses conventional bot-detection systems and appears to come from a regular browser.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dynamic Fingerprint Obfuscation: Randomly adjusts browser environment variables (e.g., User-Agent, Canvas, WebGL) to enhance human-like behavior and evade sophisticated anti-bot strategies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Cloud-Based Architecture and Scalability
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Cloud Deployment: Fully cloud-based, requiring no local resources, and supports global distributed deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High Concurrency Support: Scalable from dozens to unlimited concurrent sessions—ideal for large-scale scraping and complex automation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy Integration: Fully compatible with existing automation frameworks (e.g., Playwright and Puppeteer) with no code refactoring required.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Purpose-Built for AI Agents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Automation Proxy Support: Offers powerful proxy capabilities to help AI agents execute complex browser automation tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Flexible Invocation: Supports multi-task parallel execution, making it an ideal tool for building intelligent agent systems and AI-driven applications.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Deep Customization of Browser Fingerprints
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://www.scrapeless.com/en/glossary/browser-fingerprint%EF%BC%9Fsutm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=fingerprintcustomization" rel="noopener noreferrer"&gt;Browser fingerprints&lt;/a&gt; are unique digital identifiers generated from browser and device configurations, often used to track user activity even without cookies. Scrapeless Scraping Browser allows full customization of these fingerprints—supporting adjustments to User-Agent, timezone, language, screen resolution, and other key parameters—to enhance multi-account management, data collection, and privacy protection.&lt;/p&gt;

&lt;p&gt;By enabling controlled adjustments to standardized parameters exposed by the browser, Scrapeless helps users construct highly “authentic” browsing environments. Below are the main fingerprint customization features currently supported:&lt;/p&gt;

&lt;h5&gt;
  
  
  User-Agent Control
&lt;/h5&gt;

&lt;p&gt;Allows custom User-Agent strings in HTTP request headers to simulate specific browser versions, operating systems, and device environments—enhancing stealth and compatibility.&lt;/p&gt;

&lt;h5&gt;
  
  
  Screen Resolution Mapping
&lt;/h5&gt;

&lt;p&gt;Permits custom values for screen.width and screen.height to emulate common device display dimensions, supporting responsive rendering and resisting device fingerprinting strategies.&lt;/p&gt;

&lt;h5&gt;
  
  
  Platform Property Locking
&lt;/h5&gt;

&lt;p&gt;Enables customization of navigator.platform return values to simulate standard platform types (e.g., Windows, macOS, Linux), influencing how websites adapt to different OS environments.&lt;/p&gt;

&lt;h5&gt;
  
  
  Localization Environment Simulation
&lt;/h5&gt;

&lt;p&gt;Fully supports customization of browser localization settings, affecting website content localization, time format rendering, and language preference inference. Supported parameters include:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.timezone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sets the timezone identifier (compliant with IANA format, e.g., &lt;code&gt;Asia/Shanghai&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.locale&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Sets the language and region (compliant with BCP 47 format, e.g., &lt;code&gt;zh-CN&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.languages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Defines the language priority list, mapped to &lt;code&gt;navigator.languages&lt;/code&gt; and the &lt;code&gt;Accept-Language&lt;/code&gt; HTTP header&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
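&lt;p&gt;These localization parameters matter because page scripts can read them directly through standard JavaScript APIs. The snippet below is plain Node/browser JavaScript, independent of Scrapeless, showing how locale and timezone change what a fingerprinting script observes:&lt;/p&gt;

```javascript
// Locale and timezone surface through Intl, which fingerprinting scripts inspect.
const zh = new Intl.DateTimeFormat('zh-CN', { timeZone: 'Asia/Shanghai', dateStyle: 'full' });
const en = new Intl.DateTimeFormat('en-US', { timeZone: 'America/New_York', dateStyle: 'full' });

const epoch = new Date(0); // 1970-01-01T00:00:00Z
console.log(zh.format(epoch)); // Chinese wording, Shanghai local date (Jan 1, 1970)
console.log(en.format(epoch)); // English wording, New York local date (Dec 31, 1969)
```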

&lt;p&gt;More advanced fingerprint customization (such as Canvas, WebGL, and font detection) is under active development. In the future, Scrapeless will support even finer-grained environment simulation capabilities—stay tuned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Detailed Explanation of Scrapeless Scraping Browser Fingerprint Parameters&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter Name&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;userAgent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Defines the User-Agent string in the browser's HTTP request header, which includes browser engine, version, OS, and other key identifiers. Websites use this for client environment detection, affecting content adaptation and feature availability. &lt;strong&gt;Default:&lt;/strong&gt; Follow browser&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;platform&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;enum&lt;/td&gt;
&lt;td&gt;Specifies the return value of the JavaScript &lt;code&gt;navigator.platform&lt;/code&gt; property, indicating the OS type of the runtime environment. Optional values: &lt;code&gt;"Windows"&lt;/code&gt;, &lt;code&gt;"macOS"&lt;/code&gt;, &lt;code&gt;"Linux"&lt;/code&gt;. This is used for feature detection and enabling OS-specific behaviors. &lt;strong&gt;Default:&lt;/strong&gt; Windows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Defines the physical display characteristics reported by the browser, directly mapped to JavaScript's &lt;code&gt;window.screen&lt;/code&gt; object.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;screen.width&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;number&lt;/td&gt;
&lt;td&gt;Physical screen width (in pixels), mapped to &lt;code&gt;screen.width&lt;/code&gt;, affects media queries and responsive layouts. &lt;strong&gt;Default:&lt;/strong&gt; Randomized with fingerprint, minimum 640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;screen.height&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;number&lt;/td&gt;
&lt;td&gt;Physical screen height (in pixels), mapped to &lt;code&gt;screen.height&lt;/code&gt;, together with width defines resolution. &lt;strong&gt;Default:&lt;/strong&gt; Randomized with fingerprint, minimum 480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;object&lt;/td&gt;
&lt;td&gt;Controls the browser’s localization settings, including language, region, and timezone. These settings influence formatting and content localization.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.timezone&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;string&lt;/td&gt;
&lt;td&gt;Timezone identifier compliant with the IANA database (e.g., &lt;code&gt;"Asia/Shanghai"&lt;/code&gt;), controls JavaScript date object behavior and &lt;code&gt;Intl.DateTimeFormat&lt;/code&gt; output. A key part of timezone fingerprinting. &lt;strong&gt;Default:&lt;/strong&gt; America/New_York&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;localization.languages&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;[string]&lt;/td&gt;
&lt;td&gt;A prioritized list of supported languages, mapped to &lt;code&gt;navigator.languages&lt;/code&gt; and HTTP &lt;code&gt;Accept-Language&lt;/code&gt; header, influencing site language selection. &lt;strong&gt;Default:&lt;/strong&gt; &lt;code&gt;"en"&lt;/code&gt;, &lt;code&gt;"en-US"&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  2. CAPTCHA Solving Capabilities
&lt;/h4&gt;

&lt;p&gt;Scraping Browser features an advanced CAPTCHA solving solution that can automatically handle most mainstream CAPTCHA types, including reCAPTCHA and Cloudflare Turnstile.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Industry-Leading Success Rate:&lt;/strong&gt; Scrapeless delivers highly effective CAPTCHA solving with a success rate exceeding 98%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Extra Cost:&lt;/strong&gt; While most competitors charge additional fees for CAPTCHA-solving features, Scrapeless includes this functionality as part of its core service—no extra charges required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Real-Time Processing:&lt;/strong&gt; The CAPTCHA solving engine in Scrapeless operates with millisecond-level response times, ensuring smooth task execution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Flexible and Controllable Proxy Integration System
&lt;/h4&gt;

&lt;p&gt;Scraping Browser comes with a highly configurable proxy support system, allowing for fine-grained routing and traffic management in automated workflows.&lt;/p&gt;

&lt;h5&gt;
  
  
  3.1 Built-in Residential Proxies
&lt;/h5&gt;

&lt;p&gt;With Scrapeless’s built-in, managed residential proxy network, you can instantly route traffic across the globe—perfect for bypassing geo-restrictions and anti-bot measures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;No configuration required – ready to use out of the box&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Supports geolocation-based proxies in 195 countries and regions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stable, high-anonymity proxies suitable for large-scale automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy to test and deploy via the built-in Playground&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h5&gt;
  
  
  3.2 Bring Your Own Proxies
&lt;/h5&gt;

&lt;p&gt;If you have your own proxy service or prefer a specific provider, Scrapeless offers flexible proxy integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Assign proxies directly to tasks by specifying parameters during session creation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using your own proxies will not count towards Scrapeless’s proxy usage billing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
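&lt;p&gt;Passing your own proxy at session creation can be sketched in the same style as the connection-string examples elsewhere in this post. Note that &lt;code&gt;proxy_url&lt;/code&gt; below is a hypothetical parameter name used only for illustration; consult the Scrapeless documentation for the exact field:&lt;/p&gt;

```javascript
// Sketch: attaching a self-supplied proxy when creating a session.
// `proxy_url` is a hypothetical field name for illustration only.
const query = new URLSearchParams({
  token: 'your_api_key', // required
  session_ttl: '180',
  proxy_url: 'http://user:pass@proxy.example.com:8080', // your own proxy (not billed by Scrapeless)
});
const connectionURL = `wss://browser.scrapeless.com/browser?${query.toString()}`;
console.log(connectionURL);
```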

&lt;h4&gt;
  
  
  4. Toolkit Support
&lt;/h4&gt;

&lt;p&gt;Comprehensive Automation Tool Compatibility: Scrapeless supports popular browser automation tools like Puppeteer and Playwright, making it easy for developers to integrate.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Integration Capabilities:&lt;/strong&gt; Scrapeless is planning deep integrations with tools like Browser Use, Computer Use, and LangChain. Future updates will further unlock the potential of large language models in dynamic web interactions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ease of Use:&lt;/strong&gt; Comes with detailed documentation and example code to help users get started quickly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Concurrency Support
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Concurrency Options:&lt;/strong&gt; Scrapeless supports anywhere from 50 to unlimited concurrent sessions, scalable from small tasks to large-scale automation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;No Extra Concurrency Fees:&lt;/strong&gt; While competitors often charge for high-concurrency use cases, Scrapeless offers a transparent and flexible pricing model with no hidden costs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Scrapeless Scraping Browser Fingerprint Parameters Example Code
&lt;/h2&gt;

&lt;p&gt;The following is a simple example code showing how to integrate Scrapeless's browser fingerprint customization function through Puppeteer and Playwright:&lt;/p&gt;

&lt;h3&gt;
  
  
  Puppeteer Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');

// custom browser fingerprint
const fingerprint = {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36',
    platform: 'Windows',
    screen: {
        width: 1280, height: 1024
    },
    localization: {
        languages: ['zh-HK', 'en-US', 'en'], timezone: 'Asia/Hong_Kong',
    }
}

const query = new URLSearchParams({
  token: 'APIKey', // required
  session_ttl: 180,
  proxy_country: 'ANY',
  fingerprint: encodeURIComponent(JSON.stringify(fingerprint)),
});

const connectionURL = `wss://browser.scrapeless.com/browser?${query.toString()}`;

(async () =&amp;gt; {
    const browser = await puppeteer.connect({browserWSEndpoint: connectionURL});
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    const info = await page.evaluate(() =&amp;gt; {
        return {
            screen: {
                width: screen.width,
                height: screen.height,
            },
            userAgent: navigator.userAgent,
            timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
            languages: navigator.languages
        };
    });
    console.log(info);
    await browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Playwright Example
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { chromium } = require('playwright-core');

// custom browser fingerprint
const fingerprint = {
    userAgent: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.1.2.3 Safari/537.36',
    platform: 'Windows',
    screen: {
        width: 1280, height: 1024
    },
    localization: {
        languages: ['zh-HK', 'en-US', 'en'], timezone: 'Asia/Hong_Kong',
    }
}

const query = new URLSearchParams({
  token: 'APIKey', // required
  session_ttl: 180,
  proxy_country: 'ANY',
  fingerprint: encodeURIComponent(JSON.stringify(fingerprint)),
});

const connectionURL = `wss://browser.scrapeless.com/browser?${query.toString()}`;

(async () =&amp;gt; {
    const browser = await chromium.connectOverCDP(connectionURL);
    const page = await browser.newPage();
    await page.goto('https://www.scrapeless.com');
    const info = await page.evaluate(() =&amp;gt; {
        return {
            screen: {
                width: screen.width,
                height: screen.height,
            },
            userAgent: navigator.userAgent,
            timeZone: Intl.DateTimeFormat().resolvedOptions().timeZone,
            languages: navigator.languages
        };
    });
    console.log(info);
    await browser.close();
})();


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Applicable Scenarios for Scrapeless Scraping Browser Fingerprint Customization
&lt;/h2&gt;

&lt;p&gt;The fingerprint customization feature of Scrapeless Scraping Browser is suitable for a variety of use cases, including but not limited to the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Basic Multi-Account Isolation and Risk Control
&lt;/h3&gt;

&lt;p&gt;For users who manage multiple accounts—such as those in cross-border e-commerce or social media marketing—Scrapeless enables flexible configuration of browser fingerprint parameters like User-Agent, screen resolution, timezone, and language preferences. This helps avoid environmental overlap between accounts, significantly reducing the risk of platform detection and account linkage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Typical Applications:&lt;/strong&gt; Account environment isolation on platforms like Shopify, Facebook, and Google Ads.&lt;/p&gt;
&lt;/blockquote&gt;
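The isolation described above boils down to never reusing an environment across accounts. A minimal sketch, assuming the fingerprint fields shown in this article's examples (userAgent, platform, screen, localization), of keeping one profile per account:

```python
import json

# One fingerprint profile per managed account, so no two accounts share
# an identical environment. Field names follow the fingerprint examples
# in this article; the profile values here are illustrative.
ACCOUNT_PROFILES = {
    "shop-us": {
        "userAgent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
        "platform": "Windows",
        "screen": {"width": 1920, "height": 1080},
        "localization": {"languages": ["en-US", "en"], "timezone": "America/New_York"},
    },
    "shop-hk": {
        "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
        "platform": "macOS",
        "screen": {"width": 1440, "height": 900},
        "localization": {"languages": ["zh-HK", "en"], "timezone": "Asia/Hong_Kong"},
    },
}

def fingerprint_param(account_id):
    """Serialize the per-account fingerprint for the connection URL."""
    return json.dumps(ACCOUNT_PROFILES[account_id])

print(fingerprint_param("shop-hk"))
```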

&lt;h3&gt;
  
  
  2. Lightweight Data Collection and Anti-Bot Evasion
&lt;/h3&gt;

&lt;p&gt;When performing web scraping tasks, Scrapeless Scraping Browser helps users disguise their automation as "real user" traffic rather than bot activity. By simulating mainstream device configurations (e.g., Windows 10 + Chrome 114 + 1080p monitor) and fine-tuning fingerprint details, users can effectively bypass basic anti-bot mechanisms of target websites, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;User-Agent blacklists&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without the need for complex scripts or large-scale IP pool scheduling, users can achieve fast and stable data collection.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Typical Applications:&lt;/strong&gt; Price monitoring, public opinion tracking, product comparison, SEO data scraping.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  3. Compatibility Testing
&lt;/h3&gt;

&lt;p&gt;Frontend developers and QA engineers can use Scrapeless to quickly switch between different operating systems (e.g., Windows/macOS), screen sizes, and other parameters to simulate diverse access environments. This allows for testing rendering behavior and functional integrity across multiple configurations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Typical Applications:&lt;/strong&gt; A/B testing for ad campaigns, responsive UI validation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Ethical Statement&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We advocate responsible fingerprint customization:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use it only in legally authorized scenarios (such as corporate data compliance collection and internal risk-control testing).&lt;/li&gt;
&lt;li&gt;Do not use forged fingerprints to commit online fraud or infringe on user privacy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Roadmap of Scrapeless Scraping Browser
&lt;/h2&gt;

&lt;p&gt;Looking ahead, &lt;a href="https://www.scrapeless.com/en/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=fingerprintcustomization" rel="noopener noreferrer"&gt;Scrapeless Scraping Browser&lt;/a&gt; will continue to optimize its core functionalities to meet a wide range of needs—from basic data scraping to advanced AI-driven automation. Our goal is to provide users with even more powerful tools and seamless experiences. The following are our key development directions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Debugging and Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Live Preview: Real-time view within the Playground to facilitate debugging and task takeover.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Session Management: Support for session replay, inspector tools, and metadata querying to enhance task monitoring and control.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. File Handling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Upload: Easily upload files to target websites using Playwright, Puppeteer, or Selenium.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Download: Downloaded files are automatically stored in the cloud, with Unix timestamps appended to filenames (e.g., sample-1719265797164.pdf) to avoid conflicts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Retrieval: Quickly access downloaded files via API—ideal for data extraction and report generation scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
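The timestamped naming scheme described above is easy to mirror locally; a small sketch that appends a millisecond Unix timestamp before the extension:

```python
import os
import time

def timestamped_name(filename, ts_ms=None):
    """Append a millisecond Unix timestamp before the extension,
    e.g. sample.pdf -> sample-1719265797164.pdf."""
    if ts_ms is None:
        ts_ms = int(time.time() * 1000)
    stem, ext = os.path.splitext(filename)
    return f"{stem}-{ts_ms}{ext}"

print(timestamped_name("sample.pdf", ts_ms=1719265797164))
```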

&lt;h3&gt;
  
  
  3. Context API &amp;amp; Extension Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Context API: Enables session persistence to optimize login flows and multi-step automation scenarios.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extension Support: Enhance browser sessions with your own Chrome extensions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Metadata Query
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use custom tags and metadata queries to filter and locate specific sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. SDK and API Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Session API: Offers robust session management capabilities to simplify workflow operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CDP Event Enhancements: Broaden support for Chrome DevTools Protocol (CDP) features, including retrieving page HTML, clicking elements, scrolling, and capturing screenshots.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In the previous sections, we discussed the various challenges that current browser automation tools face when supporting AI-driven automation tasks. These issues significantly impact developers' productivity and the feasibility of tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Concurrency Bottleneck:&lt;/strong&gt; Traditional browsers often struggle under heavy parallel requests, leading to frequent task failures. In high concurrency scenarios, they cannot effectively support AI-driven automation tasks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Easily Detected by Anti-Scraping Mechanisms:&lt;/strong&gt; Traditional browsers exhibit predictable behaviors and lack human-like intelligent behavior simulation, making it easy for websites' anti-scraping systems to detect and block them, preventing them from bypassing these protections.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;High Costs&lt;/strong&gt;: In large-scale tasks, traditional browsers consume significant resources and incur high operational costs, limiting task scale and frequency, thereby reducing efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex Integration and Learning Curve:&lt;/strong&gt; Integrating traditional browsers for automation tasks typically requires complex configurations and coding, increasing the learning difficulty for developers and reducing development efficiency.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To address these issues, Scrapeless Scraping Browser has redefined the concept of the "browser for AI," aiming to provide a more efficient, intelligent, and cost-effective solution for AI-driven automation tasks. Below are the key innovations we have already implemented:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Breaking the High Concurrency Bottleneck:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud Elastic Scaling:&lt;/strong&gt; With an innovative cloud architecture, Scrapeless has achieved seamless scaling from fifty to unlimited concurrent sessions, greatly improving throughput and ensuring task stability and efficiency. Even in high concurrency scenarios, tasks can be executed smoothly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Human-like Behavior and Fingerprint Customization:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full-Stack Human Protection:&lt;/strong&gt; Scrapeless deeply customizes the browser engine to simulate real user browsing behaviors, bypassing anti-scraping detection mechanisms. This upgrade particularly enhances fingerprint customization features, allowing developers to fine-tune browser fingerprint attributes, including but not limited to User-Agent, screen resolution, etc., further enhancing the browser's stealth and flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Significantly Reducing Costs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Unmatched Cost Efficiency:&lt;/strong&gt; Compared to other solutions, Scrapeless offers a &lt;strong&gt;60%-80%&lt;/strong&gt; cost reduction while ensuring compatibility with tools like Playwright and Puppeteer, enabling developers to automate large-scale tasks at a lower cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Simplified Integration and Usability:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compatibility and Ease of Use:&lt;/strong&gt; Scrapeless lowers the development threshold, reducing integration complexity and allowing developers to quickly get started without facing a steep learning curve. With intuitive APIs and interfaces, Scrapeless makes browser automation simpler and more efficient.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While we have made significant progress, Scrapeless continues to evolve. Future versions will include more intelligent features, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;More precise fingerprint spoofing and behavior simulation;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Session Replay Debug and extended support;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SDK and API support;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deep integration with the Browser Use framework, offering powerful LLM crawling capabilities, full-site extraction, and deep research capabilities to further enhance the efficiency and accuracy of automated data scraping and deep research.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapeless Scraping Browser, as the "browser for AI," not only addresses key current issues but is also continuously improving to meet future challenges. We invite developers and teams to join us on this innovative journey, share your needs and suggestions, and work together to drive browser automation technology into a smarter and more efficient new era.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Scrapeless
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.scrapeless.com/en?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=fingerprintcustomization" rel="noopener noreferrer"&gt;Scrapeless official website&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/Np4CAHxB9a?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=fingerprintcustomization" rel="noopener noreferrer"&gt;Scrapeless Discord&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://apidocs.scrapeless.com/?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=fingerprintcustomization" rel="noopener noreferrer"&gt;Scrapeless API documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>scraping</category>
      <category>browser</category>
      <category>scrapingbrowser</category>
    </item>
    <item>
      <title>Scrape Naver Smart Store Data in 10 Lines of Code – From API Call to Structured Output</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Tue, 22 Apr 2025 07:51:41 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/yong-10-xing-dai-ma-pa-qu-naver-zhi-neng-shang-dian-shu-ju-cong-api-diao-yong-dao-jie-gou-hua-shu-chu-kn2</link>
      <guid>https://dev.to/datacollectionscraper/yong-10-xing-dai-ma-pa-qu-naver-zhi-neng-shang-dian-shu-ju-cong-api-diao-yong-dao-jie-gou-hua-shu-chu-kn2</guid>
      <description>&lt;p&gt;In today's data-driven era, extracting valuable insights from e-commerce platforms like Naver Smart Store can give businesses a competitive edge. Whether you are analyzing product trends, monitoring competitors, or optimizing pricing strategies, scraping data efficiently is key. This article shows you how to scrape Naver Smart Store data with Scrapeless, a powerful and developer-friendly tool, in just 10 lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Scrape Naver Smart Store?
&lt;/h2&gt;

&lt;p&gt;Naver Smart Store is one of South Korea's largest online shopping platforms, hosting millions of products across categories. Extracting data from it can help businesses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gain insight into market trends and consumer preferences.&lt;/li&gt;
&lt;li&gt;Monitor competitors' pricing and product performance.&lt;/li&gt;
&lt;li&gt;Identify emerging product categories and customer sentiment.&lt;/li&gt;
&lt;li&gt;Automate inventory tracking and sales analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, collecting this data manually is time-consuming and inefficient. That's where Scrapeless comes in: a cutting-edge scraping tool designed for simplicity, scalability, and reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Scrape Naver Smart Store: Traditional Methods vs. Modern Solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (1) Traditional Web Scraping
&lt;/h3&gt;

&lt;p&gt;The traditional approach involves writing custom scripts with tools like BeautifulSoup, Selenium, or Playwright. While these tools are powerful, they come with notable drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High maintenance cost: scripts need frequent updates to keep up with website changes.&lt;/li&gt;
&lt;li&gt;Anti-scraping hurdles: CAPTCHA solving, IP rotation, and TLS fingerprinting must be implemented manually.&lt;/li&gt;
&lt;li&gt;Limited scalability: scaling to thousands of requests demands substantial resources.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  (2) Modern API-Based Solutions
&lt;/h3&gt;

&lt;p&gt;Modern solutions such as the Scrapeless Naver Scraping API eliminate many of the challenges of traditional scraping. The Scrapeless API offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A robust built-in infrastructure with unlocking capabilities, so a simple API call gets you structured data at scale.&lt;/li&gt;
&lt;li&gt;Fast conversion of raw HTML into structured data formats such as JSON or CSV files.&lt;/li&gt;
&lt;li&gt;Ease of use, streamlining structured data extraction with minimal setup.&lt;/li&gt;
&lt;li&gt;Full compatibility with mainstream programming languages and tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How Scrapeless Simplifies the Process
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Scrapeless advocates the lawful and compliant scraping of publicly available data. Make sure the information you collect is used only for legitimate purposes and avoid any form of for-profit misuse. Strictly follow the applicable laws, regulations, and scraping rules to maintain a healthy data ecosystem.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scrapeless provides an intuitive API that handles complex scraping tasks behind the scenes. With features like smart IP rotation, CAPTCHA bypass, and real-time data extraction, it ensures a high success rate while minimizing the risk of being blocked. Let's see how to scrape Naver Smart Store with Scrapeless in just 10 lines of code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step-by-Step Guide: Scraping Naver Smart Store Data with Scrapeless
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Set Up Your Scrapeless Account
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=devto&amp;amp;utm_campaign=scrapenavercoupon" rel="noopener noreferrer"&gt;注册&lt;/a&gt;一个Scrapeless免费账户&lt;/li&gt;
&lt;li&gt;从仪表板获取您的 API 密钥。此密钥将用于验证您的请求&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oqofi1baifbm158ujf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1oqofi1baifbm158ujf0.png" alt="获取api密钥" width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Select Naver and Enter the Scrapeless Dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2uwey8h4otwpm6yi44n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2uwey8h4otwpm6yi44n.png" alt="进入Scrapeless仪表板界面" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Set the Scraping Parameters
&lt;/h3&gt;

&lt;p&gt;The product ID and store ID can be found directly in the product URL. Take &lt;a href="https://brand.naver.com/barudak/products/4469033180?NaPm=ct%3Dm9mo5x4g%7Cci%3D800b828f830f1d3d81df0575f6009efc9235fd9a%7Ctr%3Dnshsnx%7Csn%3D727239%7Cic%3D%7Chk%3De39ed35e26996b18c35ced568d18f83bc39fdf94" rel="noopener noreferrer"&gt;[바르닭] 닭가슴살 143종 크런치 소품닭 닭스테이크 소스큐브 골라담기 [원산지:국산(경기도 포천시) 등]&lt;/a&gt; as an example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Store ID: barudak&lt;/p&gt;

&lt;p&gt;Product ID: 4469033180&lt;/p&gt;
&lt;/blockquote&gt;
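Both identifiers sit directly in the URL path, so they can also be pulled out programmatically; a minimal sketch:

```python
from urllib.parse import urlparse

def parse_naver_product_url(url):
    """Extract (store_id, product_id) from a brand.naver.com product URL.

    Expected path shape: /<store_id>/products/<product_id>
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) >= 3 and parts[1] == "products":
        return parts[0], parts[2]
    raise ValueError(f"unrecognized Naver product URL: {url}")

store_id, product_id = parse_naver_product_url(
    "https://brand.naver.com/barudak/products/4469033180?NaPm=ct%3Dm9mo5x4g"
)
print(store_id, product_id)  # barudak 4469033180
```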

&lt;h3&gt;
  
  
  Step 4: Scrape Basic Product Information
&lt;/h3&gt;

&lt;p&gt;After setting the required scraping parameters, click "Start Scraping" and the results will appear on the right.&lt;/p&gt;

&lt;p&gt;Here are some sample scrape results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"additionalAttributes": {"A/S 안내": ["********","********"],"영수증발급": "신용카드전표, 현금영수증발급"},"adultAuthorizationType": "NOT_LOGIN","afterServiceInfo": {"afterServiceGuideContent": "********","afterServiceTelephoneNumber": "********"},"arrivalGuarantee": false,"authenticationType": "NORMAL","authorizationDisplay": "NORMAL","averageDeliveryLeadTime": {"productAverageDeliveryLeadTime": 1.6511627,"sellerAverageDeliveryLeadTime": 1.6331967},"benefitsPolicy": {"givePresent": true,"managerBankbookAccumulatePolicyNo": 12306300388384,"managerBankbookAccumulateValue": 0.5,"managerBankbookAccumulateValueUnit": "PERCENT","managerMaxBankbookAccumulateAmount": 10000,"managerMaxPaymoneyAccumulateAmount": 30000,"managerMaxPurchasePointAmount": 100000,"managerPaymoneyAccumulatePolicyNo": 439583905,"managerPaymoneyAccumulateValue": 1.5,"managerPaymoneyAccumulateValueUnit": "PERCENT","managerPurchasePointPolicyNo": 10511031105304,"managerPurchasePointValue": 1,"managerPurchasePointValueUnit": "PERCENT","sellerImmediateDiscountPolicyNo": "SE_4460099867","sellerImmediateDiscountValue": 1220,"sellerImmediateDiscountValueUnit": "WON"},"benefitsView": {"afterUsePhotoVideoReviewPoint": 0,"afterUseTextReviewPoint": 0,"discountedRatio": 55,"discountedSalePrice": 990,"generalPurchaseReviewPoint": 0,"givePresent": true,"managerAfterUsePhotoVideoReviewPoint": 0,"managerAfterUseTextReviewPoint": 0,"managerArrivalGuaranteePoint": 0,"managerBankbookAccumulatePoint": 4,"managerGeneralPurchaseReviewPoint": 50,"managerImmediateDiscountAmount": 0,"managerMembershipArrivalGuaranteePoint": 0,"managerPaymoneyAccumulatePoint": 14,"managerPhotoVideoReviewPoint": 150,"managerPremiumPurchaseReviewPoint": 150,"managerPurchaseExtraPoint": 0,"managerPurchasePoint": 9,"managerTextReviewPoint": 50,"mobileDiscountedRatio": 55,"mobileDiscountedSalePrice": 990,"mobileManagerArrivalGuaranteePoint": 0,"mobileManagerBankbookAccumulatePoint": 
4,"mobileManagerImmediateDiscountAmount": 0,"mobileManagerMembershipArrivalGuaranteePoint": 0,"mobileManagerPaymoneyAccumulatePoint": 14,"mobileManagerPurchaseExtraPoint": 0,"mobileManagerPurchasePoint": 9,"mobileSellerCustomerManagementPoint": 0,"mobileSellerImmediateDiscountAmount": 1220,"mobileSellerPurchasePoint": 0,"photoVideoReviewPoint": 0,"premiumPurchaseReviewPoint": 0,"sellerCustomerManagementPoint": 0,"sellerImmediateDiscountAmount": 1220,"sellerPurchasePoint": 0,"specialDiscountAmount": {},"storeMemberReviewPoint": 0,"textReviewPoint": 0},"best": false,"cardPromotions": [],"category": {"category1Id": "50000006","category1Name": "식품","category2Id": "50000145","category2Name": "축산물","category3Id": "50001172","category3Name": "닭고기","category4Id": "50013800","category4Name": "닭가슴살","categoryId": "50013800","categoryLevel": 4,"categoryName": "닭가슴살","exceptionalCategoryTypes": ["FREE_RETURN_INSURANCE","ORIGINAREA_PRODUCTS","REGULAR_SUBSCRIPTION","REVIEW_UNEXPOSE","GROUP_PRODUCT_MAX"],
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
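Once a result like the above comes back, the useful fields can be picked out directly. A sketch using a trimmed stand-in with the same field names as the sample (benefitsView, category):

```python
import json

# A trimmed stand-in for the scrape result shown above (same field names).
raw = """
{
  "benefitsView": {"discountedSalePrice": 990, "discountedRatio": 55},
  "category": {
    "wholeCategoryId": "50000006>50000145>50001172>50013800",
    "categoryName": "닭가슴살"
  }
}
"""

def summarize(result):
    """Pull price and category info out of a Naver product scrape result."""
    benefits = result.get("benefitsView", {})
    category = result.get("category", {})
    return {
        "sale_price": benefits.get("discountedSalePrice"),
        "discount_ratio": benefits.get("discountedRatio"),
        "category": category.get("categoryName"),
        "category_path": category.get("wholeCategoryId", "").split(">"),
    }

summary = summarize(json.loads(raw))
print(summary)
```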



&lt;h3&gt;
  
  
  Step 5: Scrape Naver Product Coupon Information
&lt;/h3&gt;

&lt;p&gt;From the scrape results above, we can see the following:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"productNo": "4460099867"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can also find other product-related unique identifiers, such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"productId": "10217226674"&lt;/p&gt;

&lt;p&gt;categoryId: 50013800 corresponds to the category 닭가슴살&lt;/p&gt;

&lt;p&gt;"wholeCategoryId": "50000006&amp;gt;50000145&amp;gt;50001172&amp;gt;50013800",&lt;/p&gt;

&lt;p&gt;"channelUid": "2sWDx0OygJl5sQcE9f6rD"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once the scraping parameters are set, you can scrape the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fbxeymnvi3be0329qv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fbxeymnvi3be0329qv8.png" width="800" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use the Scrapeless Naver Scraping API to fetch the coupon data. Here is a sample Python request:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Just replace the token field with your own API key.&lt;/p&gt;
&lt;/blockquote&gt;
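A hedged sketch of such a request using only the Python standard library. The endpoint URL, header name, and payload fields here are illustrative assumptions rather than the documented API; consult the Scrapeless API docs for the real schema, and substitute your API key for the token.

```python
import json
import urllib.request

API_TOKEN = "YOUR_API_KEY"  # replace with your Scrapeless API key

# NOTE: the endpoint URL, header name, and payload fields below are
# illustrative assumptions; check the Scrapeless API docs for the real schema.
API_URL = "https://api.scrapeless.com/api/v1/scraper/request"

payload = {
    "actor": "scraper.naver",  # hypothetical scraper identifier
    "input": {"storeId": "barudak", "productId": "4469033180"},
}

def build_request():
    """Prepare (but do not send) the coupon-scrape HTTP request."""
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=data,
        headers={"Content-Type": "application/json", "x-api-token": API_TOKEN},
        method="POST",
    )

req = build_request()
print(req.full_url, req.get_method())
```

Sending it is then a single `urllib.request.urlopen(req)` call once the real endpoint and token are in place.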

&lt;h2&gt;
  
  
  How to Bypass Naver Shop's Anti-Bot Measures
&lt;/h2&gt;

&lt;p&gt;Scrapeless provides a premium global clean-IP proxy service focused on dynamic residential IPv4 proxies. With more than 70 million IP addresses across 195 countries and regions, the Scrapeless residential proxy network delivers comprehensive global proxy coverage to support your business growth.&lt;/p&gt;

&lt;p&gt;Steps to obtain a proxy:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Log In
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=csdn&amp;amp;utm_campaign=scrapenavercoupon" rel="noopener noreferrer"&gt;Log in&lt;/a&gt; to Scrapeless.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Click "Proxies" and Create a Channel
&lt;/h3&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8uyjlp2zkd2azftx132r.png" alt="Click Proxies and create a channel" width="800" height="304"&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Get the Code
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click "Start", fill in the information you need in the panel, then click "Generate". After a moment you will see the rotating proxy generated for you on the right. Click "Copy" to use it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4syoi1fwx6gf8jiezkd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh4syoi1fwx6gf8jiezkd.png" alt="获取代码" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can integrate our proxy code into your own project:&lt;/p&gt;

&lt;p&gt;Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --proxy host:port --proxy-user username:password API_URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Browser:&lt;/p&gt;

&lt;p&gt;Selenium&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from seleniumbase import Driver

proxy = 'username:password@gw-us.scrapeless.com:8789'
driver = Driver(browser="chrome", headless=False, proxy=proxy)
driver.get("API_URL")
driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Puppeteer&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer');

(async () =&amp;gt; {
  const proxyUrl = 'http://gw-us.scrapeless.com:8789';
  const username = 'username';
  const password = 'password';
  const browser = await puppeteer.launch({
    args: [`--proxy-server=${proxyUrl}`],
    headless: false
  });
  const page = await browser.newPage();
  await page.authenticate({ username, password });
  await page.goto('API_URL');
  await browser.close();
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
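The same gateway also works from plain Python. A sketch using the standard library, with credentials in the username:password@gw-us.scrapeless.com:8789 format shown above:

```python
import urllib.request

# Credentials and gateway follow the proxy string format shown above:
# username:password@gw-us.scrapeless.com:8789
PROXY = "http://username:password@gw-us.scrapeless.com:8789"

def make_opener(proxy=PROXY):
    """Build a urllib opener that routes HTTP(S) traffic through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

opener = make_opener()
# opener.open("https://example.com")  # would go through the rotating proxy
print(type(opener).__name__)
```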



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Scraping Naver Smart Store data is no small task. With Scrapeless, you can extract valuable data in just 10 lines of code, saving time and effort. Whether you are a developer, an analyst, or a business owner, Scrapeless lets you focus on gaining insights instead of wrestling with technical hurdles.&lt;/p&gt;

&lt;p&gt;Ready to get started? &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=devto&amp;amp;utm_campaign=scrapenavercoupon" rel="noopener noreferrer"&gt;Sign in&lt;/a&gt; now to get the tools you need and unlock the full potential of e-commerce data!&lt;/p&gt;

&lt;h2&gt;
  
  
  More About Scrapeless
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://www.scrapeless.com/en?utm_source=official&amp;amp;utm_medium=devto&amp;amp;utm_campaign=scrapenavershopcoupon" rel="noopener noreferrer"&gt;官方网站&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://discord.gg/Np4CAHxB9a?utm_source=official&amp;amp;utm_medium=devto&amp;amp;utm_campaign=scrapenavershopcoupon" rel="noopener noreferrer"&gt;Discord社区&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=devto&amp;amp;utm_campaign=scrapenavershopcoupon" rel="noopener noreferrer"&gt;Scrapeless仪表盘&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>scraping</category>
      <category>webscraping</category>
      <category>scrapingtool</category>
      <category>python</category>
    </item>
    <item>
      <title>Scrapeless Scraping Browser: A High-Concurrency AI Automation Solution</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Fri, 18 Apr 2025 13:34:15 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/wu-gua-ca-zhua-qu-liu-lan-qi-chong-gao-bing-fa-de-aizi-dong-hua-jie-jue-fang-an-5313</link>
      <guid>https://dev.to/datacollectionscraper/wu-gua-ca-zhua-qu-liu-lan-qi-chong-gao-bing-fa-de-aizi-dong-hua-jie-jue-fang-an-5313</guid>
      <description>&lt;h2&gt;
  
  
  Introduction: Upgrading the Concurrency Capabilities of Scrapeless Scraping Browser
&lt;/h2&gt;

&lt;p&gt;As the developers and founding team of &lt;a href="https://www.scrapeless.com/zh/?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless&lt;/a&gt;, we are genuinely passionate about the future of AI automation. Our mission is to create an automated browser truly designed for AI. Over the past few years, from Browserless.io to the "Browser as a Service" (BaaS) offerings launched by numerous cloud vendors, the market has proven that AI agents urgently need a new interaction medium: a cloud-based browser built for AI. For example, Auto-GPT can autonomously search Booking.com for the best flights or automatically submit survey responses in Google Forms. Likewise, ChainGPT's intelligent customer-service system can log into an e-commerce backend in real time to retrieve order data and complete multi-step operations. What these capabilities ultimately demand is the extreme of high concurrency and "human-like" simulation.&lt;/p&gt;




&lt;p&gt;However, we have observed that existing solutions often fall short on two key points:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. High-concurrency scalability:&lt;/strong&gt; When hundreds or thousands of agent tasks target a website simultaneously, a single node quickly becomes the bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Realistic browsing behavior:&lt;/strong&gt; Multi-dimensional disguises such as fingerprint rotation, TLS characteristics, and mouse trajectories are quickly flagged by the risk-control systems of e-commerce platforms and social media if they are not precise enough.&lt;/p&gt;

&lt;p&gt;With these challenges in mind, we focused on two key areas during product design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cloud elastic scaling:&lt;/strong&gt; Scrapeless supports seamless scaling from ten to unlimited concurrent sessions, ensuring zero queuing and zero timeouts under peak task loads.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Full-stack human-like protection:&lt;/strong&gt; Through deep customization of the Chromium kernel, Scrapeless implements multi-dimensional fingerprint obfuscation, controllable TLS handshake strategies, and progressive mouse/keyboard simulation, making it nearly impossible for target websites to detect anomalies.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even more notably, while delivering top-tier performance, we have driven costs well below industry-standard solutions, helping developers save 60%-80% on large-scale testing and long-running tasks. Whether you need to monitor thousands of SKUs with daily scrapes or drive thousands of customer-service bots across multiple sites, Scrapeless provides the most reliable and cost-effective infrastructure.&lt;/p&gt;

&lt;p&gt;In the following sections, we take a deep dive into the pricing advantages, core features, and future roadmap of &lt;a href="https://www.scrapeless.com/zh/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Scraping Browser&lt;/a&gt;, giving you a complete picture of why it is the ultimate choice for the "AI browser" era.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scrapeless Scraping Browser Price Comparison
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yhnob6320su4b3dnsfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yhnob6320su4b3dnsfl.png" alt="Scrapeless 抓取浏览器价格" width="800" height="473"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Hourly Rate and Proxy Fee Comparison
&lt;/h3&gt;

&lt;p&gt;Below is a comparison of the hourly-rate and proxy-fee price ranges of competing products. We have distilled approximate pricing ranges to help users quickly grasp Scrapeless's price-performance advantage.&lt;/p&gt;

&lt;h4&gt;
  
  
  Table: Price Range Comparison
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;工具名称&lt;/th&gt;
&lt;th&gt;每小时费率范围（美元/小时）&lt;/th&gt;
&lt;th&gt;代理费范围（美元/GB）&lt;/th&gt;
&lt;th&gt;并发支持&lt;/th&gt;
&lt;th&gt;备注&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scrapeless&lt;/td&gt;
&lt;td&gt;$0.063 – $0.090 /小时（根据并发和使用情况有所不同）&lt;/td&gt;
&lt;td&gt;$1.26 - $1.80 / GB&lt;/td&gt;
&lt;td&gt;50 / 100 / 200 / 400 / 600 / 1000 / 无限&lt;/td&gt;
&lt;td&gt;- 支持自定义代理&lt;br&gt;- 免费解决 Cloudflare、reCAPTCHA、AWS WAF 的 CAPTCHA；未来支持 Imagetotext CAPTCHA&lt;br&gt;- 费率根据实际使用情况而异&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browserbase&lt;/td&gt;
&lt;td&gt;$0.10 – $0.198 /小时（包括 2-5GB 免费代理）&lt;/td&gt;
&lt;td&gt;$10 / GB（超出免费配额后）&lt;/td&gt;
&lt;td&gt;3（基础） / 50（高级）&lt;/td&gt;
&lt;td&gt;- 支持自定义代理&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Brightdata&lt;/td&gt;
&lt;td&gt;$0.10 /小时&lt;/td&gt;
&lt;td&gt;$9.5 / GB（标准）；$12.5 / GB（优质域名）&lt;/td&gt;
&lt;td&gt;无限&lt;/td&gt;
&lt;td&gt;- 不支持自定义代理&lt;br&gt;- 实际并发会话可能受以下因素影响：&lt;br&gt;  - 账户计划和使用限制&lt;br&gt;  - 可用带宽和系统资源&lt;br&gt;  - 计费设置和信用余额&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zenrows&lt;/td&gt;
&lt;td&gt;每小时 $0.09&lt;/td&gt;
&lt;td&gt;每GB $2.8 - $5.42&lt;/td&gt;
&lt;td&gt;多达 100&lt;/td&gt;
&lt;td&gt;- 可根据需求定制计划，价格为每GB $2.8&lt;br&gt;- 商业计划支持最高 100 个并发&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browserless&lt;/td&gt;
&lt;td&gt;每小时 $0.084 – $0.15（按“单位”计费）&lt;/td&gt;
&lt;td&gt;每GB $4.3&lt;/td&gt;
&lt;td&gt;3 / 10 / 50&lt;/td&gt;
&lt;td&gt;- 支持定制代理&lt;br&gt;- 每1000个hCaptcha和reCaptcha解决方案$7&lt;br&gt;- 每个“单位”等于0.00833小时的浏览器时间&lt;br&gt;- Cloudflare旁路功能免费提供&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  2. Price Comparison in Concurrency Scenarios
&lt;/h3&gt;

&lt;p&gt;To show Scrapeless's price advantage more intuitively, we compare typical usage scenarios.&lt;/p&gt;

&lt;h4&gt;
  
  
  Case 1: Single Request (1 Browser Instance)
&lt;/h4&gt;

&lt;p&gt;Suppose a user launches a single request (for example, logging in to ChatGPT) that runs for 1 hour and consumes 1 GB of traffic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scrapeless (at the Standard plan rate):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly rate: $0.072&lt;/li&gt;
&lt;li&gt;Proxy fee: $1.44&lt;/li&gt;
&lt;li&gt;Total cost = 0.072 + 1.44 = $1.512&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Competitor (Brightdata as an example):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly rate: $0.10&lt;/li&gt;
&lt;li&gt;Proxy fee: $9.5 (standard)&lt;/li&gt;
&lt;li&gt;Total cost = 0.10 + 9.5 = $9.6&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost advantage: Scrapeless saves roughly 84.25%.&lt;/strong&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  Case 2: Large-Scale Concurrency Scenario (100 Browser Instances)
&lt;/h4&gt;

&lt;p&gt;A Scrapeless user is building an LLM-based market-ranking monitoring system that scrapes data from multiple websites in real time and generates dynamic ranking reports. Their current workload requires running 100 browser instances simultaneously for 1 hour, consuming 40 GB of traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scrapeless (at the Standard plan rate):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly rate: 0.072 × 100 = $7.2&lt;/li&gt;
&lt;li&gt;Proxy fee: 1.44 × 40 = $57.6&lt;/li&gt;
&lt;li&gt;Total cost = 7.2 + 57.6 = $64.8&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Competitor (Zenrows as an example):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hourly rate: 0.09 × 100 = $9&lt;/li&gt;
&lt;li&gt;Proxy fee: 2.8 × 40 = $112&lt;/li&gt;
&lt;li&gt;Total cost = 9 + 112 = $121&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cost advantage: Scrapeless saves roughly 46.45%.&lt;/strong&gt;&lt;/p&gt;
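
&lt;p&gt;Both cases follow the same arithmetic: hourly rate × instances × hours, plus per-GB proxy fee × traffic. A minimal JavaScript sketch of that calculation, using the rates quoted above (which are illustrative and may change; verify against each vendor's pricing page):&lt;/p&gt;

```javascript
// Total session cost = hourly rate × instances × hours + proxy fee × GB.
// The rates used below are the figures quoted in this comparison; they may
// change over time, so verify against each vendor's pricing page.
function sessionCost({ hourlyRate, perGbFee, instances = 1, hours = 1, gb = 1 }) {
  return hourlyRate * instances * hours + perGbFee * gb;
}

function savingsPercent(ours, theirs) {
  return ((1 - ours / theirs) * 100).toFixed(2);
}

// Case 1: one instance, 1 hour, 1 GB
const case1Scrapeless = sessionCost({ hourlyRate: 0.072, perGbFee: 1.44 }); // 1.512
const case1Brightdata = sessionCost({ hourlyRate: 0.1, perGbFee: 9.5 });    // 9.6

// Case 2: 100 instances, 1 hour, 40 GB
const case2Scrapeless = sessionCost({ hourlyRate: 0.072, perGbFee: 1.44, instances: 100, gb: 40 }); // 64.8
const case2Zenrows = sessionCost({ hourlyRate: 0.09, perGbFee: 2.8, instances: 100, gb: 40 });      // 121

console.log(savingsPercent(case1Scrapeless, case1Brightdata)); // "84.25"
console.log(savingsPercent(case2Scrapeless, case2Zenrows));    // "46.45"
```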




&lt;p&gt;This user ran a detailed price and performance comparison of mainstream browser automation tools in the early stages of the project. They found that many competitors struggle with large-scale concurrent tasks in the following ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Insufficient high-concurrency support: most tools have low maximum concurrency limits that cannot meet the 100-instance requirement. The user's future concurrency needs will exceed 500, a level few products on the market can support.&lt;/li&gt;
&lt;li&gt;High surcharges: some products charge extra for high-concurrency tasks, causing overall costs to balloon.&lt;/li&gt;
&lt;li&gt;Limited technical support: some tools lack built-in solutions for CAPTCHAs or anti-scraping mechanisms, adding development complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a comprehensive evaluation, the user ultimately chose the &lt;a href="https://www.scrapeless.com/zh/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Scraping Browser&lt;/a&gt;. They reported that Scrapeless not only delivered a significant cost advantage (saving nearly 47%) but also ensured the efficiency and reliability of their data scraping system.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scrapeless Scraping Browser: A Cloud-Based Browser Automation Tool for AI Agents
&lt;/h2&gt;

&lt;p&gt;Scrapeless Scraping Browser is a cloud-based browser automation tool built for data scraping, AI agents, and agent systems. It provides a realistic browser environment through deep emulation, supporting dynamic fingerprint obfuscation and TLS fingerprint spoofing to ensure highly human-like behavior. It is also fully user-controlled and stores no data, ensuring compliance and privacy protection.&lt;/p&gt;




&lt;h3&gt;
  
  
  Technical Advantages
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Real Browser Environment
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Chrome kernel support: provides a complete browser environment that simulates real user behavior.&lt;/li&gt;
&lt;li&gt;TLS fingerprint spoofing: defeats traditional anti-scraping mechanisms by forging TLS fingerprints to pass as an ordinary browser.&lt;/li&gt;
&lt;li&gt;Dynamic fingerprint obfuscation: dynamically adjusts browser environment variables (e.g., User-Agent, Canvas, WebGL) to appear more human and bypass advanced anti-scraping strategies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  2. Cloud Deployment and Scalability
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Cloud architecture: fully cloud-based, eliminating the need for local resources and supporting seamless globally distributed deployment.&lt;/li&gt;
&lt;li&gt;High concurrency: supports unlimited parallel tasks, suitable for large-scale data scraping and complex automation scenarios.&lt;/li&gt;
&lt;li&gt;Easy integration: integrates seamlessly with existing automation frameworks (such as Playwright and Puppeteer) without code refactoring.&lt;/li&gt;
&lt;/ul&gt;
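
&lt;p&gt;"Easy integration" in practice usually means pointing Playwright or Puppeteer at a remote browser's WebSocket endpoint instead of launching a local Chrome. The sketch below is a hedged illustration only: the host and query parameter names (&lt;code&gt;token&lt;/code&gt;, &lt;code&gt;session_ttl&lt;/code&gt;) are placeholder assumptions, not the documented Scrapeless connection format, so consult the official docs for the real URL shape:&lt;/p&gt;

```javascript
// Build a WebSocket endpoint for a remote (cloud) browser session.
// NOTE: the host and parameter names below are hypothetical placeholders,
// not the documented Scrapeless API; check the official docs for the
// actual connection string.
function buildBrowserWsEndpoint({ host, token, sessionTtl = 180 }) {
  const params = new URLSearchParams({ token, session_ttl: String(sessionTtl) });
  return `wss://${host}/browser?${params.toString()}`;
}

const endpoint = buildBrowserWsEndpoint({
  host: "browser.example.com", // placeholder host
  token: "YOUR_API_KEY",
});

// With puppeteer-core you would then connect rather than launch:
//   const browser = await puppeteer.connect({ browserWSEndpoint: endpoint });
console.log(endpoint); // wss://browser.example.com/browser?token=YOUR_API_KEY&session_ttl=180
```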




&lt;h4&gt;
  
  
  3. Designed for AI Agents
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Automated agent support: provides powerful agent capabilities to help AI agents execute complex browser automation tasks.&lt;/li&gt;
&lt;li&gt;Flexible invocation: supports parallel multi-task processing, making it an ideal tool for building intelligent agent systems and AI-driven applications.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Core Features
&lt;/h3&gt;

&lt;p&gt;The core competitiveness of Scrapeless Scraping Browser lies in its powerful functionality and flexibility, standing out in the following three areas:&lt;/p&gt;

&lt;h4&gt;
  
  
  (1) CAPTCHA Solving
&lt;/h4&gt;

&lt;p&gt;Scrapeless Scraping Browser has advanced CAPTCHA-solving capabilities and can automatically handle mainstream CAPTCHA types such as reCAPTCHA and Cloudflare Turnstile.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Industry-leading success rate: Scrapeless provides an efficient CAPTCHA solution with a success rate above 98%.&lt;/li&gt;
&lt;li&gt;No extra fees: while most competitors charge separately for CAPTCHA solving, Scrapeless includes this capability in the base service at no additional cost.&lt;/li&gt;
&lt;li&gt;Real-time handling: the CAPTCHA-solving engine completes its work within milliseconds, keeping task execution smooth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  (2) Tool Integration Support
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Comprehensive automation tool support: Scrapeless supports popular browser automation tools such as Puppeteer and Playwright, enabling rapid developer integration.&lt;/li&gt;
&lt;li&gt;AI integration: Scrapeless plans deep integrations with Browser Use, Computer Use, and LangChain to further explore the capabilities of large language models (LLMs) and expand AI-driven dynamic web interaction use cases.&lt;/li&gt;
&lt;li&gt;Ease of use: detailed documentation and sample code help users get started quickly.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  (3) Concurrency Support
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Flexible concurrency: Scrapeless supports anywhere from 50 to unlimited concurrent sessions, covering both small tasks and large-scale automation.&lt;/li&gt;
&lt;li&gt;No extra fees: while competitors typically charge extra in high-concurrency scenarios, Scrapeless offers a transparent and flexible pricing model.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Plans for Scrapeless Scraping Browser
&lt;/h2&gt;

&lt;p&gt;Going forward, Scrapeless Scraping Browser will continue to refine its core capabilities to meet diverse needs, from basic scraping to complex AI-driven automation, giving users more powerful tools. Here are our updated focus areas:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Core Feature Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Fingerprint configuration: flexible configuration of environment variables such as time zone, language, user agent, and screen resolution to enhance human-like behavior.&lt;/li&gt;
&lt;li&gt;Proxy routing rules: custom proxy routing that directs traffic to different proxies by domain or location, plus a session API for session management.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Debugging and Monitoring
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Live view: a real-time view in the Playground for easy debugging and task takeover.&lt;/li&gt;
&lt;li&gt;Session management: session replay, an inspector, and metadata queries to strengthen task monitoring.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. File Handling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Upload: easily upload files to target websites using Playwright, Puppeteer, or Selenium.&lt;/li&gt;
&lt;li&gt;Download: downloaded files are automatically stored in the cloud with a Unix timestamp appended to the filename (e.g., sample-1719265797164.pdf) to avoid conflicts.&lt;/li&gt;
&lt;li&gt;Retrieval: quickly retrieve files via the API, useful for scenarios such as data scraping and report generation.&lt;/li&gt;
&lt;/ul&gt;
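
&lt;p&gt;The timestamp-suffix naming can also be reproduced or parsed client-side when you need to correlate downloaded files. A small sketch; the exact rule is inferred from the &lt;code&gt;sample-1719265797164.pdf&lt;/code&gt; example above and may not match the service's implementation byte-for-byte:&lt;/p&gt;

```javascript
// Append a Unix timestamp (milliseconds) before the extension, mirroring
// the "sample-1719265797164.pdf" pattern described above. This naming rule
// is inferred from the example, not taken from official documentation.
function timestampedName(filename, now = Date.now()) {
  const dot = filename.lastIndexOf(".");
  if (dot === -1) return `${filename}-${now}`;
  return `${filename.slice(0, dot)}-${now}${filename.slice(dot)}`;
}

// Recover the original name and timestamp from a stored filename.
function parseTimestampedName(stored) {
  const m = stored.match(/^(.*)-(\d{13})(\.[^.]*)?$/);
  if (!m) return null;
  return { original: m[1] + (m[3] || ""), timestamp: Number(m[2]) };
}

console.log(timestampedName("sample.pdf", 1719265797164)); // "sample-1719265797164.pdf"
```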




&lt;h3&gt;
  
  
  4. Context API and Extension Support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Context API: contextual session persistence to streamline login and multi-step automation scenarios.&lt;/li&gt;
&lt;li&gt;Extension support: enhance browser sessions by loading your own Chrome extensions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  5. Metadata Queries
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Query sessions using custom tags and metadata.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  6. SDK and API Upgrades
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Session API: session management features that simplify task operations.&lt;/li&gt;
&lt;li&gt;CDP event optimization: expanded CDP support, including fetching page HTML, clicking elements, scrolling, and taking screenshots.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Current browser automation tools face many challenges when powering AI-driven scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-concurrency bottlenecks cause task failures.&lt;/li&gt;
&lt;li&gt;Insufficiently human-like behavior makes automation easy for anti-scraping mechanisms to detect.&lt;/li&gt;
&lt;li&gt;High costs limit the feasibility of large-scale tasks.&lt;/li&gt;
&lt;li&gt;Complex integration creates a steep learning curve and hurts efficiency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scrapeless Scraping Browser redefines the "browser for AI" through three key innovations:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Elastic cloud scaling: seamless scaling from dozens to unlimited concurrent sessions, fully unlocking high-throughput potential.&lt;/li&gt;
&lt;li&gt;Full-stack human-like protection: deep customization of the Chromium kernel delivers fingerprint obfuscation, TLS handshake strategies, and progressive behavior simulation to bypass anti-scraping restrictions with ease.&lt;/li&gt;
&lt;li&gt;Unmatched cost efficiency and compatibility: 60%–80% lower cost than comparable solutions while remaining compatible with Playwright and Puppeteer, lowering the development barrier.
We are also actively exploring next-generation AI-centric technologies. We warmly welcome developers and teams to share optimization suggestions or feature requests. Your feedback is invaluable and will help us keep improving Scrapeless Scraping Browser for a better experience.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Learn More About Scrapeless
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.scrapeless.com/en?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;&lt;strong&gt;官方网站&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discord.gg/Np4CAHxB9a?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=scrapingbrowser" rel="noopener noreferrer"&gt;&lt;strong&gt;Discord社区&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://app.scrapeless.com/passport/login" rel="noopener noreferrer"&gt;&lt;strong&gt;Scrapeless仪表板&lt;/strong&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>scraping</category>
      <category>scrapingbrowser</category>
    </item>
    <item>
      <title>Scrapeless MCP co-creation plan is coming!</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Thu, 17 Apr 2025 12:15:06 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/scrapeless-mcp-co-creation-plan-is-coming-cok</link>
      <guid>https://dev.to/datacollectionscraper/scrapeless-mcp-co-creation-plan-is-coming-cok</guid>
      <description>&lt;p&gt;Scrapeless officially launches the MCP (Model Context Protocol) ecosystem partner program, targeting AI application developers, industry solution providers, and toolchain developers, opening up our next-generation AI real-time enhancement capabilities and sharing market dividends!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Scrapeless MCP Server?
&lt;/h2&gt;

&lt;p&gt;Scrapeless MCP Server is an AI-enhanced server built on the MCP protocol that helps LLMs (such as Claude and GPT) access external information. It directly integrates all Scrapeless tools: Scraping Browser, Scraping API, and SerpAPI.&lt;br&gt;
You can view the Scrapeless MCP Server listings here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP.SO&lt;/li&gt;
&lt;li&gt;Github&lt;/li&gt;
&lt;li&gt;NPM&lt;/li&gt;
&lt;li&gt;Glama.ai&lt;/li&gt;
&lt;li&gt;Smithery.ai&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;We need you:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Help Scrapeless Build MCP Server

&lt;ul&gt;
&lt;li&gt;You can also submit a PR to our existing Scrapeless Server &lt;a href="https://github.com/scrapeless-ai/scrapeless-mcp-server" rel="noopener noreferrer"&gt;Github repository &lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Use Scrapeless MCP Server to build tools for specific scenarios.
You can submit your content to us as a document, or publish it on any platform.
See our sample case: &lt;a href="https://www.scrapeless.com/en/blog/mcp-cursor-ecommerce-assistant" rel="noopener noreferrer"&gt;https://www.scrapeless.com/en/blog/mcp-cursor-ecommerce-assistant&lt;/a&gt;
&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎁 Rewards
&lt;/h2&gt;

&lt;p&gt;Best Project Award (3 winners)&lt;br&gt;
Scrapeless offers a free annual subscription to recognize the most creative and innovative proposals.&lt;br&gt;
Special Award (5 winners)&lt;br&gt;
A $99 Scrapeless monthly subscription to recognize outstanding MCP application scenarios, tutorials, or documentation.&lt;br&gt;
Sharing Award (several winners)&lt;/p&gt;

&lt;p&gt;Publicly share your proposal on social media with a hashtag (such as #Scrapeless MCP Server) and we will select lucky participants to receive a free trial of Scrapeless.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to apply?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Join the &lt;a href="https://discord.gg/Np4CAHxB9a" rel="noopener noreferrer"&gt;Scrapeless Discord community&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Contact &lt;a class="mentioned-user" href="https://dev.to/liam"&gt;@liam&lt;/a&gt; to submit a cooperation application&lt;/li&gt;
&lt;li&gt;We will complete the preliminary assessment and onboarding within three working days&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Related Documents&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scrapeless Documentation: &lt;a href="https://apidocs.scrapeless.com/" rel="noopener noreferrer"&gt;https://apidocs.scrapeless.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scrapeless Discord: &lt;a href="https://discord.gg/Np4CAHxB9a" rel="noopener noreferrer"&gt;https://discord.gg/Np4CAHxB9a&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you have any ideas or suggestions about our products, please feel free to join our Discord community and communicate with us directly. We look forward to hearing from you!&lt;/p&gt;

</description>
      <category>mcp</category>
    </item>
    <item>
      <title>Why Browserless (Scrapeless scraping browser) can be the infrastructure of your AI Agent</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Thu, 17 Apr 2025 02:07:35 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/why-browserless-scrapeless-scraping-browser-can-be-the-infrastructure-of-your-ai-agent-22l9</link>
      <guid>https://dev.to/datacollectionscraper/why-browserless-scrapeless-scraping-browser-can-be-the-infrastructure-of-your-ai-agent-22l9</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;




&lt;p&gt;In the context of the rapid development of artificial intelligence technology, AI agents are playing an increasingly important role in automating tasks, especially those that involve retrieving web information. For such tasks, efficiently and accurately scraping and parsing web content presents a significant challenge. In this article, we will explore the recently released Browser Use and Scrapeless Scraping Browser and their impact on AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  PART 1. Browser Use: Enabling AI Agents to Parse Web Pages Efficiently
&lt;/h2&gt;




&lt;p&gt;On March 23, 2025, the startup Browser Use announced the completion of a $17 million funding round, led by Felicis Ventures with support from several well-known investment firms. Browser Use is an AI-driven browser automation agent capable of efficiently parsing web content and helping AI agents automate a variety of online tasks. The company was founded by Gregor Žunič and Magnus Müller, who initially developed a prototype within four days and successfully launched it on Hacker News, gaining widespread attention.&lt;/p&gt;

&lt;p&gt;The core technology of Browser Use is transforming each website into structured text, helping AI agents better understand and interact with webpages without relying on costly and inefficient computer vision methods. This approach allows AI agents to parse webpages as if handling databases, improving task execution efficiency and addressing common issues like IP bans and captchas. With proxy rotation and persistent session support, Browser Use ensures the stability and efficiency of tasks, enhancing the web browsing speed and accuracy of AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pm6ksyt7aq43271j6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pm6ksyt7aq43271j6f.png" alt="Browser Use" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, simply enabling AI agents to "understand webpages" is not enough. In reality, websites are constantly changing and implementing various anti-scraping measures such as IP blocking, captcha triggers, and user behavior detection, creating significant obstacles for AI agents when performing tasks. &lt;/p&gt;

&lt;p&gt;While Browser Use addresses some of these issues through proxy rotation and persistent sessions, AI agents may still face challenges like fingerprint detection, dynamic rendering, and TLS anti-detection in more complex scenarios. &lt;/p&gt;

&lt;p&gt;This is where the Scrapeless Scraping Browser comes into play. It is also better suited for large-scale scraping and automation tasks, supporting parallel scraping and efficient management of large volumes of data requests to ensure task stability and efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  PART 2. Browserless (Scrapeless Scraping Browser): The Ideal Infrastructure for AI Agents
&lt;/h2&gt;




&lt;p&gt;In the previous section, we explored how Browser Use helps &lt;a href="https://www.scrapeless.com/en/ai-agent?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt; handle web tasks more effectively through efficient webpage parsing and information structuring. However, to truly enable AI agents to perform various online tasks in a stable and intelligent manner, the Scrapeless Scraping Browser offers a more advanced and comprehensive infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  PART 2.1 How to Achieve Browser Use's Capabilities and Enhance Data Scraping Performance with Scraping Browser
&lt;/h3&gt;

&lt;p&gt;Before diving into a detailed comparison between Scraping Browser and Browser Use, it's important to first understand their respective functionalities and technical implementations. While both involve browser automation and data scraping, they differ significantly in many aspects and are suitable for different use cases. In this section, we will analyze the differences between the two in terms of functionality, technical implementation, use cases, and ease of use, and explore how Scraping Browser can achieve the existing capabilities of Browser Use.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.1 Functionality Overview
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a Python library focused on automation, Browser Use is primarily aimed at developers and provides AI agents with browser control to facilitate automated tasks. It offers users a simple API that makes it easy to navigate, interact with, and scrape data from websites.&lt;/p&gt;

&lt;p&gt;Its core strength lies in its flexibility, making it ideal for developers who wish to perform customized browser operations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In comparison, &lt;a href="https://www.scrapeless.com/en/product/scraping-browser" rel="noopener noreferrer"&gt;Scraping Browser&lt;/a&gt; is more focused on offering efficient web scraping solutions, especially when it comes to bypassing anti-scraping technologies. With cloud fingerprinting technology, Scraping Browser simulates real user behavior to minimize the risk of being detected as a bot by target websites.&lt;/p&gt;

&lt;p&gt;Its functionality is better suited for large-scale data scraping, especially in scenarios involving complex anti-scraping measures.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.2 Technical Implementation
&lt;/h4&gt;

&lt;p&gt;Next, we’ll take a deeper look at the technical differences between the two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Browser Use relies on powerful browser automation frameworks (such as Playwright) to perform browser operations locally or on the cloud. Its technical implementation is highly flexible, making it suitable for developers with custom needs.&lt;/p&gt;

&lt;p&gt;Users can highly customize operations according to specific requirements, such as simulating different user behaviors or controlling the browser to perform specific tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike Browser Use, Scraping Browser uses cloud services and fingerprint technology, employing methods like dynamic IP rotation and user agent masking to ensure simulated user behavior appears more realistic. This allows it to bypass target websites' anti-scraping measures, resulting in more efficient data scraping.&lt;/p&gt;

&lt;p&gt;Scraping Browser’s technical advantage lies in its ability to support large-scale scraping tasks, handle complex anti-scraping mechanisms, and ensure successful data scraping even when frequently changing IPs and user agents.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don’t let complex anti-scraping measures slow you down! &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Log in now&lt;/a&gt; and use Scrapeless Scraping Browser to enhance your web scraping tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  1.3 Use Cases
&lt;/h4&gt;

&lt;p&gt;The differences in functionality naturally lead to different use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Browser Use is more suitable for developers performing small-scale, customized automation tasks, or in scenarios where AI agents are involved. For tasks that don't require large-scale, high-frequency data scraping, Browser Use offers sufficient flexibility and customization options.&lt;/p&gt;

&lt;p&gt;For example, developers might use Browser Use to automate data extraction tasks from specific websites or create AI tools that integrate browser control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scraping Browser shines in its adaptability, particularly in large-scale data scraping tasks that involve overcoming complex anti-scraping technologies. For tasks requiring frequent access and scraping of vast amounts of data, Scraping Browser is undoubtedly the better choice.&lt;/p&gt;

&lt;p&gt;It is particularly useful for high-frequency, large-scale scraping tasks, such as e-commerce websites or social media data scraping, where it can effectively bypass stringent anti-scraping measures.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.4 Ease of Use
&lt;/h4&gt;

&lt;p&gt;While both tools offer automation features, there are notable differences in terms of ease of use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:
As a Python library aimed at developers, Browser Use provides extensive documentation, examples, and tutorials to help developers get started quickly. However, it requires users to have a certain level of programming skills to customize operations as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers with programming experience, Browser Use's flexibility makes it an attractive choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scraping Browser typically offers a more comprehensive service where users don’t need to focus on technical details and can focus more on data scraping itself. It provides a more intuitive user interface and better usability, especially for those without programming skills.&lt;/p&gt;

&lt;p&gt;Since it uses cloud fingerprinting technology behind the scenes, users only need to configure scraping tasks without diving deep into the technical implementation.&lt;/p&gt;

&lt;p&gt;In summary, Browser Use is more flexible and suited for developers looking to perform customized automation tasks, while Scraping Browser focuses on efficient and secure data scraping, particularly when dealing with anti-scraping technologies. The choice of which tool to use depends on specific needs and use cases.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Start scraping smarter today! No more hassle with complex webpage parsing—use Scrapeless' scraping browser to make your AI agent tasks faster and more accurate. Log in now and begin your journey: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  PART 2.2 Scraping Browser vs. Browser Use
&lt;/h3&gt;

&lt;p&gt;In this section, we explore how &lt;a href="https://www.scrapeless.com/en/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scraping Browser&lt;/a&gt; achieves the existing capabilities of Browser Use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tlgnw8g7lixt1wvp38l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tlgnw8g7lixt1wvp38l.png" alt="Scraping Browser" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Optimize web scraping and boost your productivity! Let Scrapeless' scraping browser become the backbone of your AI agent, solving web scraping challenges. Log in now and experience its powerful features: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  1. Strong Anti-Blocking Capability
&lt;/h4&gt;

&lt;p&gt;For most network tasks, especially data scraping tasks, preventing blocking and &lt;a href="https://www.scrapeless.com/en/blog/get-around-anti-bot?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;bypassing anti-crawling mechanisms&lt;/a&gt; is crucial. Scrapeless Scraping Browser provides multiple layers of protection in this regard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy IP pool and auto-rotation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapeless Scraping Browser provides a rich proxy IP pool that automatically rotates IPs to avoid blocks caused by frequent requests from the same IP. This dynamic IP switching greatly reduces the probability that the target website detects the crawler.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Efficient CAPTCHA solving technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many websites employ CAPTCHA mechanisms such as reCAPTCHA or &lt;a href="https://www.scrapeless.com/en/blog/bypass-cloudflare-challenges?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Cloudflare Turnstile Challenge&lt;/a&gt; to block automated tools. Scrapeless Scraping Browser has strong CAPTCHA handling capabilities, using intelligent algorithms and automated unlocking techniques to quickly bypass these challenges, ensuring that AI agents can continue scraping data without interruptions due to CAPTCHAs. This makes &lt;a href="https://www.scrapeless.com/en/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Scraping Browser&lt;/a&gt; highly effective and stable when working with highly secure websites.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Highly Human-Like Interaction Simulation
&lt;/h4&gt;

&lt;p&gt;To ensure that AI agents can browse the web like real users, Scrapeless Scraping Browser integrates multiple human-like interaction simulation techniques.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Fingerprint Obfuscation Technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This technology allows Scraping Browser to simulate user behaviors such as mouse movements, scrolling, and clicking at the Chrome kernel level, thus avoiding recognition as an automation tool by the target website. In this way, Scrapeless Scraping Browser makes requests from AI agents appear almost identical to ordinary user behavior, effectively bypassing common anti-scraping strategies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support dynamic rendering of JavaScript-heavy websites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many modern websites rely on JavaScript to dynamically load content, which poses challenges for traditional crawlers. Scrapeless Scraping Browser can handle JavaScript-heavy websites, ensuring that AI agents can access all dynamically rendered content on a page, not just static HTML. This allows it to crawl more complex webpage data and meet the needs of the modern internet.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Advanced Anti-Detection Mechanisms
&lt;/h4&gt;

&lt;p&gt;Scrapeless Scraping Browser uses various techniques to mask the automation signatures of AI agents, avoiding detection and blocking by target websites.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS Fingerprint Forgery Technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through &lt;a href="https://www.scrapeless.com/en/blog/tls-fingerprinting?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;TLS fingerprint&lt;/a&gt; forgery, Scrapeless Scraping Browser can disguise itself as a normal browser, avoiding detection as a crawler by target websites. TLS (Transport Layer Security) fingerprint forgery simulates a browser's unique identity during the connection handshake, strengthening resistance to anti-crawling technology.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real browser environment for anti-detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid being recognized as a crawler, Scrapeless Scraping Browser keeps the browsing environment as close as possible to real user behavior by performing tasks in a genuine browser environment. Unlike crawlers that rely on computer vision and image recognition, this approach effectively reduces the risk of detection and interception, ensuring that requests from AI agents are not flagged as malicious by the target website.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Real-time data statistics and session management
&lt;/h4&gt;

&lt;p&gt;Scrapeless Scraping Browser introduces real-time data statistics to ensure an efficient and controllable data acquisition process. Users can track session status in real time, view the progress of each browser session (such as running, success, or failure), and intuitively grasp task execution status to ensure smooth data capture.&lt;/p&gt;

&lt;p&gt;In addition, &lt;a href="https://www.scrapeless.com/en/product?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless&lt;/a&gt; has enhanced session management capabilities, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session list and records: users can easily view historical and current sessions, making them easy to manage and monitor.&lt;/li&gt;
&lt;li&gt;Session stop function: through the dashboard, users can directly terminate running sessions, greatly improving operational efficiency and flexibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Start scraping smarter today! No more hassle with complex webpage parsing—use Scrapeless' scraping browser to make your AI agent tasks faster and more accurate. Log in now and begin your journey: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  PART 2.3 Use Cases of Scraping Browser
&lt;/h3&gt;

&lt;p&gt;To more clearly demonstrate the powerful capabilities of Scraping Browser, let's look at a few typical use cases and how AI agents enable more intelligent data scraping.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. E-Commerce Website Data Collection and Price Monitoring
&lt;/h4&gt;

&lt;p&gt;Use Case: Cross-border e-commerce companies need to regularly monitor product prices and stock information on competitor websites to optimize their own pricing strategies.&lt;/p&gt;

&lt;p&gt;Challenges: The target website employs strict anti-scraping mechanisms, including dynamic IP blocking, CAPTCHA detection, and JavaScript-rendered pages.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic IP Rotation: Use Scraping Browser’s proxy pool functionality to regularly change IPs and avoid being blocked.&lt;/li&gt;
&lt;li&gt;Advanced Fingerprint Simulation: Implement dynamic fingerprint obfuscation to make the browsing behavior resemble that of a real user.&lt;/li&gt;
&lt;li&gt;Automatic JavaScript Parsing: Ensure the scraped pages include all dynamically rendered content.&lt;/li&gt;
&lt;/ul&gt;
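
&lt;p&gt;Conceptually, dynamic IP rotation just cycles requests across a pool so that no single IP carries enough traffic to get blocked. A simplified round-robin model (the proxy addresses are placeholders; in a managed service like Scrapeless the rotation happens server-side, so this only models the behavior):&lt;/p&gt;

```javascript
// Round-robin proxy rotation: each call returns the next proxy in the pool,
// so consecutive requests leave from different IPs. The addresses below are
// placeholders, not real Scrapeless proxy endpoints.
function makeProxyRotator(proxies) {
  let i = 0;
  return () => proxies[i++ % proxies.length];
}

const nextProxy = makeProxyRotator([
  "http://10.0.0.1:8000", // placeholder addresses
  "http://10.0.0.2:8000",
  "http://10.0.0.3:8000",
]);

console.log(nextProxy()); // "http://10.0.0.1:8000"
console.log(nextProxy()); // "http://10.0.0.2:8000"
```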

&lt;blockquote&gt;
&lt;p&gt;Whether you're monitoring eCommerce prices or collecting real-time travel data, Scrapeless is the solution you need. &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Log in now&lt;/a&gt; and streamline your data scraping with advanced automation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  2. Travel Industry Information Scraping and Price Comparison Analysis
&lt;/h4&gt;

&lt;p&gt;Use Case: A travel booking platform wants to scrape real-time price information from multiple airline and hotel websites to provide the best booking recommendations.&lt;/p&gt;

&lt;p&gt;Challenges: Many travel websites use dynamic loading technologies and have strict anti-scraping measures, such as TLS fingerprint detection and CAPTCHA validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS Fingerprint Spoofing: Scraping Browser simulates TLS fingerprints from different devices and browsers, making requests appear to come from real users.&lt;/li&gt;
&lt;li&gt;Intelligent CAPTCHA Solving: Use Scraping Browser’s CAPTCHA solution to automatically handle CAPTCHAs during login and query processes.&lt;/li&gt;
&lt;li&gt;Parallel Scraping: Improve the speed of data collection through multithreading and distributed architecture.&lt;/li&gt;
&lt;li&gt;AI Agent Predictive Analysis: Combine AI Agent to predict price trends and provide users with more accurate booking recommendations.&lt;/li&gt;
&lt;/ul&gt;
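&lt;p&gt;The parallel scraping step above can be sketched as a small concurrency pool. The pattern below is generic JavaScript; &lt;code&gt;runPool&lt;/code&gt; and the simulated per-site tasks are illustrative stand-ins for real browser sessions, not a Scrapeless API:&lt;/p&gt;

```javascript
// Sketch: run scraping tasks in parallel with a concurrency cap, so many
// airline/hotel pages can be fetched at once without opening unbounded
// browser sessions. runPool is generic; the tasks below are simulated
// stand-ins for real per-site scraping functions.
async function runPool(tasks, limit) {
  const results = new Array(tasks.length);
  let nextIndex = 0;
  async function worker() {
    while (nextIndex < tasks.length) {
      const i = nextIndex++;
      results[i] = await tasks[i]();
    }
  }
  const workers = [];
  for (let w = 0; w < Math.min(limit, tasks.length); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Simulated per-site tasks; in practice each would drive a browser session.
const sites = ['airlineA', 'airlineB', 'hotelC', 'hotelD'];
const tasks = sites.map((site, i) => async () => ({ site, price: 100 + i }));

const pooled = runPool(tasks, 2).then((prices) => {
  console.log(prices);
  return prices;
});
```

&lt;p&gt;Capping concurrency keeps session counts (and costs) predictable while still collecting results in the original order.&lt;/p&gt;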

&lt;h2&gt;
  
  
  PART 3. Bonus Tip: Bypass Cloudflare using Scraping Browser and Puppeteer
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;We firmly protect the privacy of the website. All data in this blog is public and is only used as a demonstration of the crawling process. We do not save any information and data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scrapeless requires puppeteer-core, a Puppeteer version that doesn't download the Chrome binary. So, ensure you install it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer-core

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 1. Sign up for Scrapeless, click API Key Management &amp;gt; Create API Key to create your Scrapeless API Key.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowse" rel="noopener noreferrer"&gt;Sign up for Scrapeless&lt;/a&gt; and get a free trial. If you have any questions, you can also contact Liam via &lt;a href="https://discord.com/invite/xBcTfGPjCQ?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowse" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Step 2. Then, go to Scraping Browser and copy your Browser URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3xkbdqvcds3a6kmpoom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3xkbdqvcds3a6kmpoom.png" alt="Bypass Cloudflare using Scraping Browser and Puppeteer" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Integrate the copied browser URL into your Puppeteer script like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');
const connectionURL = 'wss://browser.scrapeless.com/browser?token=&amp;lt;YOUR_Scrapeless_API_KEY&amp;gt;&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY';
(async () =&amp;gt; {
    // set up the browser environment
    const browser = await puppeteer.connect({
        browserWSEndpoint: connectionURL,
    });

    // create a new page
    const page = await browser.newPage();

    // navigate to a URL
    await page.goto('https://www.scrapingcourse.com/cloudflare-challenge', {
        waitUntil: 'networkidle0',
    });

    // wait for the challenge to resolve
    await new Promise(function (resolve) {
        setTimeout(resolve, 10000);
    });

    // take a page screenshot
    await page.screenshot({ path: 'screenshot.png' });

    // close the browser instance
    await browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;You need to replace &lt;a href="https://www.scrapingcourse.com/cloudflare-challenge" rel="noopener noreferrer"&gt;https://www.scrapingcourse.com/cloudflare-challenge&lt;/a&gt; with any website protected by a Cloudflare challenge, and replace the token value with your Scrapeless API Key.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The above code accesses and screenshots the protected page. See the result below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw171wkgn1yxbh1uk9i7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw171wkgn1yxbh1uk9i7.png" alt="cf challenge bypass" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations 🎉! You've successfully bypassed Cloudflare using Puppeteer and Scrapeless.&lt;/p&gt;
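&lt;p&gt;One possible refinement: the script above sleeps a fixed 10 seconds while the challenge resolves. A generic polling helper can return as soon as a condition holds instead. The sketch below is plain JavaScript; the page-title check mentioned in the comment is an illustrative heuristic, not a documented Scrapeless feature:&lt;/p&gt;

```javascript
// Sketch: poll a condition until it holds or a timeout elapses, instead of
// sleeping a fixed 10 seconds. With Puppeteer you might pass a condition
// like: async () => !(await page.title()).includes('Just a moment');
// that title check is a heuristic, not a documented Scrapeless feature.
async function waitUntil(condition, { timeoutMs = 10000, intervalMs = 50 } = {}) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return true;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return false; // timed out; the caller decides whether to retry or fail
}

// Simulated condition that becomes true on the third check.
let checks = 0;
const resolvedOk = waitUntil(async () => ++checks >= 3, { timeoutMs: 1000, intervalMs: 5 });
```

&lt;p&gt;With this helper, the screenshot step runs as soon as the challenge page is gone rather than always waiting the full 10 seconds.&lt;/p&gt;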

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Scrapeless's technology provides the infrastructure that lets AI agents perform online tasks efficiently. For developers and enterprises alike, the Scrapeless scraping browser offers a flexible, low-cost solution for building and optimizing AI agents, making it an ideal choice for improving work efficiency, reducing development costs, and accelerating technological progress.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unlock the power of seamless web scraping! With Scrapeless' scraping browser, you can turn any website into structured data effortlessly, boosting your AI agent’s performance. Log in and start today: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. How does Scraping Browser bypass anti-scraping systems such as Cloudflare?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scraping Browser combines dynamic IP proxies, automatic JavaScript parsing, and browser fingerprint camouflage to bypass most anti-scraping mechanisms. Compared with traditional Puppeteer/Playwright solutions, it can simulate real user behavior and automatically adjust the request frequency through built-in strategies to increase the success rate. For specific methods, please refer to this article: &lt;a href="https://www.scrapeless.com/en/blog/cloudflare-challenge-bypass?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;How to Bypass Cloudflare Challenge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. How to bypass CAPTCHA for web scraping?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can use Web Unlocker to efficiently bypass CAPTCHA protection and improve scraping success rates. For a detailed guide, check out: &lt;a href="https://www.scrapeless.com/en/blog/use-web-unlocker-to-bypass-captcha?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;How to Use Web Unlocker to Bypass CAPTCHA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What is the best Scraping API?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scrapeless Scraping API supports scenarios such as Amazon, Shopee, Walmart, SHEIN, TikTok, and Instagram, covering e-commerce, social media, and other fields. It also offers SERP APIs for more than 20 Google search scenarios, including Google Flights, Google Maps, and Google Trends. See: &lt;a href="https://app.scrapeless.com/dashboard/products/scraper?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Scraping API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. How to scrape Google SERP data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google SERP data covers multiple scenarios, and the Scrapeless API supports over 20 of them. You can start scraping after a simple registration. For more details, refer to: &lt;a href="https://www.scrapeless.com/en/blog/scrapeless-deep-serp-api?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Deep SerpApi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scraper</category>
      <category>scrapingbrowser</category>
      <category>scrapeless</category>
    </item>
    <item>
      <title>Why Browserless (Scrapeless scraping browser) can be the infrastructure of your AI Agent</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Thu, 03 Apr 2025 10:19:28 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/why-browserless-scrapeless-scraping-browser-can-be-the-infrastructure-of-your-ai-agent-2747</link>
      <guid>https://dev.to/datacollectionscraper/why-browserless-scrapeless-scraping-browser-can-be-the-infrastructure-of-your-ai-agent-2747</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;




&lt;p&gt;In the context of the rapid development of artificial intelligence technology, AI agents are playing an increasingly important role in automating tasks, especially those that involve retrieving web information. For such tasks, efficiently and accurately scraping and parsing web content presents a significant challenge. In this article, we will explore the recently released Browser Use and Scrapeless Scraping Browser and their impact on AI agents.&lt;/p&gt;

&lt;h2&gt;
  
  
  PART 1. Browser Use: Enable AI agents to efficiently parse web pages
&lt;/h2&gt;




&lt;p&gt;On March 23, 2025, the startup Browser Use announced the completion of a $17 million funding round, led by Felicis Ventures with support from several well-known investment firms. Browser Use is an AI-driven browser automation agent capable of efficiently parsing web content and helping AI agents automate a variety of online tasks. The company was founded by Gregor Žunič and Magnus Müller, who initially developed a prototype within four days and successfully launched it on Hacker News, gaining widespread attention.&lt;/p&gt;

&lt;p&gt;The core technology of Browser Use is transforming each website into structured text, helping AI agents better understand and interact with webpages without relying on costly and inefficient computer vision methods. This approach allows AI agents to parse webpages as if handling databases, improving task execution efficiency and addressing common issues like IP bans and captchas. With proxy rotation and persistent session support, Browser Use ensures the stability and efficiency of tasks, enhancing the web browsing speed and accuracy of AI agents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pm6ksyt7aq43271j6f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy3pm6ksyt7aq43271j6f.png" alt="Browser Use" width="800" height="583"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, simply enabling AI agents to "understand webpages" is not enough. In reality, websites are constantly changing and implementing various anti-scraping measures such as IP blocking, captcha triggers, and user behavior detection, creating significant obstacles for AI agents when performing tasks. &lt;/p&gt;

&lt;p&gt;While Browser Use addresses some of these issues through proxy rotation and persistent sessions, AI agents may still face challenges like fingerprint detection, dynamic rendering, and TLS anti-detection in more complex scenarios. &lt;/p&gt;

&lt;p&gt;This is where the Scrapeless Scraping Browser comes into play. It is also better suited for large-scale scraping and automation tasks, supporting parallel scraping and efficient management of large volumes of data requests to ensure task stability and efficiency.&lt;/p&gt;

&lt;h2&gt;
  
  
  PART 2. Browserless (Scrapeless Scraping Browser): The Ideal Infrastructure for AI Agents
&lt;/h2&gt;




&lt;p&gt;In the previous section, we explored how Browser Use helps &lt;a href="https://www.scrapeless.com/en/ai-agent?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;AI agents&lt;/a&gt; handle web tasks more effectively through efficient webpage parsing and information structuring. However, to truly enable AI agents to perform various online tasks in a stable and intelligent manner, the Scrapeless Scraping Browser offers a more advanced and comprehensive infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  PART 2.1 How to Achieve Browser Use's Capabilities and Enhance Data Scraping Performance with Scraping Browser
&lt;/h3&gt;

&lt;p&gt;Before diving into a detailed comparison between Scraping Browser and Browser Use, it's important to first understand their respective functionalities and technical implementations. While both involve browser automation and data scraping, they differ significantly in many aspects and are suitable for different use cases. In this section, we will analyze the differences between the two in terms of functionality, technical implementation, use cases, and ease of use, and explore how Scraping Browser can achieve the existing capabilities of Browser Use.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.1 Functionality Overview
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a Python library focused on automation, Browser Use is primarily aimed at developers and provides AI agents with browser control to facilitate automated tasks. It offers users a simple API that makes it easy to navigate, interact with, and scrape data from websites.&lt;/p&gt;

&lt;p&gt;Its core strength lies in its flexibility, making it ideal for developers who wish to perform customized browser operations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In comparison, &lt;a href="https://www.scrapeless.com/en/product/scraping-browser" rel="noopener noreferrer"&gt;Scraping Browser&lt;/a&gt; is more focused on offering efficient web scraping solutions, especially when it comes to bypassing anti-scraping technologies. With cloud fingerprinting technology, Scraping Browser simulates real user behavior to minimize the risk of being detected as a bot by target websites.&lt;/p&gt;

&lt;p&gt;Its functionality is better suited for large-scale data scraping, especially in scenarios involving complex anti-scraping measures.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.2 Technical Implementation
&lt;/h4&gt;

&lt;p&gt;Next, we’ll take a deeper look at the technical differences between the two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Browser Use relies on powerful browser automation frameworks (such as Playwright) to perform browser operations locally or on the cloud. Its technical implementation is highly flexible, making it suitable for developers with custom needs.&lt;/p&gt;

&lt;p&gt;Users can highly customize operations according to specific requirements, such as simulating different user behaviors or controlling the browser to perform specific tasks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike Browser Use, Scraping Browser uses cloud services and fingerprint technology, employing methods like dynamic IP rotation and user agent masking to ensure simulated user behavior appears more realistic. This allows it to bypass target websites' anti-scraping measures, resulting in more efficient data scraping.&lt;/p&gt;

&lt;p&gt;Scraping Browser’s technical advantage lies in its ability to support large-scale scraping tasks, handle complex anti-scraping mechanisms, and ensure successful data scraping even when frequently changing IPs and user agents.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don’t let complex anti-scraping measures slow you down! &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Log in now&lt;/a&gt; and use Scrapeless Scraping Browser to enhance your web scraping tasks.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  1.3 Use Cases
&lt;/h4&gt;

&lt;p&gt;The differences in functionality naturally lead to different use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Browser Use is more suitable for developers performing small-scale, customized automation tasks, or in scenarios where AI agents are involved. For tasks that don't require large-scale, high-frequency data scraping, Browser Use offers sufficient flexibility and customization options.&lt;/p&gt;

&lt;p&gt;For example, developers might use Browser Use to automate data extraction tasks from specific websites or create AI tools that integrate browser control.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scraping Browser shines in its adaptability, particularly in large-scale data scraping tasks that involve overcoming complex anti-scraping technologies. For tasks requiring frequent access and scraping of vast amounts of data, Scraping Browser is undoubtedly the better choice.&lt;/p&gt;

&lt;p&gt;It is particularly useful for high-frequency, large-scale scraping tasks, such as e-commerce websites or social media data scraping, where it can effectively bypass stringent anti-scraping measures.&lt;/p&gt;

&lt;h4&gt;
  
  
  1.4 Ease of Use
&lt;/h4&gt;

&lt;p&gt;While both tools offer automation features, there are notable differences in terms of ease of use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser Use:
As a Python library aimed at developers, Browser Use provides extensive documentation, examples, and tutorials to help developers get started quickly. However, it requires users to have a certain level of programming skills to customize operations as needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For developers with programming experience, Browser Use's flexibility makes it an attractive choice.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scraping browser:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scraping Browser typically offers a more comprehensive service: users don’t need to worry about technical details and can concentrate on the data itself. It provides a more intuitive user interface and better usability, especially for those without programming skills.&lt;/p&gt;

&lt;p&gt;Since it uses cloud fingerprinting technology behind the scenes, users only need to configure scraping tasks without diving deep into the technical implementation.&lt;/p&gt;

&lt;p&gt;In summary, Browser Use is more flexible and suited for developers looking to perform customized automation tasks, while Scraping Browser focuses on efficient and secure data scraping, particularly when dealing with anti-scraping technologies. The choice of which tool to use depends on specific needs and use cases.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Start scraping smarter today! No more hassle with complex webpage parsing—use Scrapeless' scraping browser to make your AI agent tasks faster and more accurate. Log in now and begin your journey: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;h3&gt;
  
  
  PART 2.2 Scraping Browser vs. Browser Use
&lt;/h3&gt;

&lt;p&gt;In this section, we explore how &lt;a href="https://www.scrapeless.com/en/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scraping Browser&lt;/a&gt; achieves the existing capabilities of Browser Use.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tlgnw8g7lixt1wvp38l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tlgnw8g7lixt1wvp38l.png" alt="Scraping Browser" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Optimize web scraping and boost your productivity! Let Scrapeless' scraping browser become the backbone of your AI agent, solving web scraping challenges. Log in now and experience its powerful features: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  1. Strong anti-blockade capability
&lt;/h4&gt;

&lt;p&gt;For most network tasks, especially data scraping tasks, preventing blocking and &lt;a href="https://www.scrapeless.com/en/blog/get-around-anti-bot?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;bypassing anti-crawling mechanisms&lt;/a&gt; is crucial. Scrapeless Scraping Browser provides multiple layers of protection in this regard.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proxy IP pool and auto-rotation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Scrapeless Scraping Browser provides a rich proxy IP pool that automatically rotates IPs to avoid blocks caused by frequent requests from the same address. This dynamic IP switching greatly reduces the probability that the target website detects the crawler.&lt;/p&gt;
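&lt;p&gt;If a request is still blocked despite rotation, a common complementary pattern is to retry with exponential backoff, opening a fresh session (and hence a fresh exit IP) on each attempt. The sketch below is generic JavaScript; the &lt;code&gt;attemptFn&lt;/code&gt; callback and the simulated "blocked" error are illustrative, not part of any Scrapeless API:&lt;/p&gt;

```javascript
// Sketch: when a request is blocked despite IP rotation, retry with
// exponential backoff, creating a fresh session (and thus a fresh exit IP)
// on each attempt. attemptFn and the simulated "blocked" error below are
// illustrative assumptions, not a documented Scrapeless API.
async function withRetries(attemptFn, maxAttempts = 3, baseDelayMs = 10) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // e.g. connect with a freshly built session URL and scrape
      return await attemptFn(attempt);
    } catch (err) {
      lastError = err;
      // back off 10ms, 20ms, 40ms... (seconds would be typical in practice)
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  throw lastError;
}

// Simulated task: "blocked" twice, succeeds on the third attempt.
let attemptsMade = 0;
const outcome = withRetries(async () => {
  attemptsMade++;
  if (attemptsMade < 3) throw new Error('blocked');
  return 'ok';
});
```

&lt;p&gt;Pairing rotation with backoff keeps the overall success rate high without hammering a site that has just flagged you.&lt;/p&gt;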

&lt;ul&gt;
&lt;li&gt;Efficient Captcha unlocking technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many websites employ CAPTCHA mechanisms such as reCAPTCHA or &lt;a href="https://www.scrapeless.com/en/blog/bypass-cloudflare-challenges?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Cloudflare Turnstile Challenge&lt;/a&gt; to block automated tools. Scrapeless Scraping Browser has strong CAPTCHA handling capabilities, using intelligent algorithms and automated unlocking techniques to quickly bypass these challenges, ensuring that AI agents can continue scraping data without interruptions due to CAPTCHAs. This makes &lt;a href="https://www.scrapeless.com/en/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Scraping Browser&lt;/a&gt; highly effective and stable when working with highly secure websites.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Highly personified interactive simulation
&lt;/h4&gt;

&lt;p&gt;To ensure that AI agents can browse the web like a real user, the Scrapeless Scraping Browser integrates multiple human-like interaction simulation techniques.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic Fingerprint Obfuscation Technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This technology allows Scraping Browser to simulate user behaviors such as mouse movement, scrolling, and clicking at the Chrome kernel level, thus avoiding being recognized as an automation tool by the target website. In this way, Scrapeless Scraping Browser makes requests from AI agents appear almost identical to the behavior of ordinary users, effectively bypassing common anti-crawling strategies.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Support dynamic rendering of JavaScript-heavy websites&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many modern websites rely on JavaScript to dynamically load content, which poses challenges to traditional crawlers. Scrapeless Scraping Browser can handle JavaScript-heavy websites, ensuring that AI agents can access all dynamically rendered content on the webpage, not just the static HTML. This enables it to crawl more complex webpage data and meet the needs of the modern web.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Advanced anti-detection mechanism
&lt;/h4&gt;

&lt;p&gt;Scrapeless Scraping Browser uses various technologies to hide the crawling features of AI agents, avoiding recognition and blocking by target websites.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS Fingerprint Forgery Technology&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through &lt;a href="https://www.scrapeless.com/en/blog/tls-fingerprinting?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;TLS fingerprint&lt;/a&gt; forgery, Scrapeless Scraping Browser can disguise its traffic as that of a normal browser, avoiding detection as a crawler tool by target websites. TLS (Transport Layer Security) fingerprint forgery simulates a real browser's unique identity during connection setup, making automated requests much harder for anti-scraping systems to distinguish from genuine browser traffic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real browser environment for anti-detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To avoid being recognized as a crawler, Scrapeless Scraping Browser performs tasks in a real browser environment, keeping its behavior as close as possible to that of real users. Unlike crawlers that rely on computer vision and image recognition, this approach effectively reduces the risk of recognition and interception, ensuring that requests from AI agents are not flagged as malicious by the target website.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Real-time data statistics and session management
&lt;/h4&gt;

&lt;p&gt;Scrapeless Scraping Browser introduces real-time data statistics to keep the data acquisition process efficient and controllable. Users can track session status in real time, view the progress of each browser session (such as running, success, or failure), and see at a glance how task execution is going, ensuring smooth data capture.&lt;/p&gt;

&lt;p&gt;In addition, &lt;a href="https://www.scrapeless.com/en/product?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless&lt;/a&gt; has enhanced session management capabilities, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session list and records: users can easily view historical and current sessions, making it simple to manage and monitor all of them.&lt;/li&gt;
&lt;li&gt;Session stop function: Through the dashboard, users can directly terminate running sessions without manual intervention, greatly improving operational efficiency and flexibility.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  PART 2.3 Use Cases of Scraping Browser
&lt;/h3&gt;

&lt;p&gt;To more clearly demonstrate the powerful capabilities of Scraping Browser, let's look at a few typical use cases and how AI agents enable more intelligent data scraping.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. E-Commerce Website Data Collection and Price Monitoring
&lt;/h4&gt;

&lt;p&gt;Use Case: Cross-border e-commerce companies need to regularly monitor product prices and stock information on competitor websites to optimize their own pricing strategies.&lt;/p&gt;

&lt;p&gt;Challenges: The target website employs strict anti-scraping mechanisms, including dynamic IP blocking, CAPTCHA detection, and JavaScript-rendered pages.&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic IP Rotation: Use Scraping Browser’s proxy pool functionality to regularly change IPs and avoid being blocked.&lt;/li&gt;
&lt;li&gt;Advanced Fingerprint Simulation: Implement dynamic fingerprint obfuscation to make the browsing behavior resemble that of a real user.&lt;/li&gt;
&lt;li&gt;Automatic JavaScript Parsing: Ensure the scraped pages include all dynamically rendered content.&lt;/li&gt;
&lt;/ul&gt;
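&lt;p&gt;As a rough sketch of the dynamic IP rotation step, the Scraping Browser connection URL shown later in this article accepts a &lt;code&gt;proxy_country&lt;/code&gt; parameter. A helper that builds a fresh connection URL per session might look like this; the country list and round-robin rotation are illustrative assumptions, not a documented API:&lt;/p&gt;

```javascript
// Sketch: build a fresh Scraping Browser connection URL per session,
// cycling proxy_country so successive sessions exit from different regions.
// token/session_ttl/proxy_country mirror the connection URL used later in
// this article; the country list and rotation strategy are illustrative.
const COUNTRIES = ['US', 'DE', 'JP', 'GB']; // assumed example values
let nextCountry = 0;

function buildConnectionURL(apiKey) {
  const params = new URLSearchParams({
    token: apiKey,
    session_ttl: '180',
    proxy_country: COUNTRIES[nextCountry++ % COUNTRIES.length],
  });
  return 'wss://browser.scrapeless.com/browser?' + params.toString();
}

const url1 = buildConnectionURL('MY_KEY'); // proxy_country=US
const url2 = buildConnectionURL('MY_KEY'); // proxy_country=DE
console.log(url1);
console.log(url2);
```

&lt;p&gt;Each new session then exits from a different region, spreading requests across IPs without any per-request proxy bookkeeping.&lt;/p&gt;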

&lt;blockquote&gt;
&lt;p&gt;Whether you're monitoring eCommerce prices or collecting real-time travel data, Scrapeless is the solution you need. &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Log in now&lt;/a&gt; and streamline your data scraping with advanced automation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  2. Travel Industry Information Scraping and Price Comparison Analysis
&lt;/h4&gt;

&lt;p&gt;Use Case: A travel booking platform wants to scrape real-time price information from multiple airline and hotel websites to provide the best booking recommendations.&lt;/p&gt;

&lt;p&gt;Challenges: Many travel websites use dynamic loading technologies and have strict anti-scraping measures, such as TLS fingerprint detection and CAPTCHA validation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TLS Fingerprint Spoofing: Scraping Browser simulates TLS fingerprints from different devices and browsers, making requests appear to come from real users.&lt;/li&gt;
&lt;li&gt;Intelligent CAPTCHA Solving: Use Scraping Browser’s CAPTCHA solution to automatically handle CAPTCHAs during login and query processes.&lt;/li&gt;
&lt;li&gt;Parallel Scraping: Improve the speed of data collection through multithreading and distributed architecture.&lt;/li&gt;
&lt;li&gt;AI Agent Predictive Analysis: Combine AI Agent to predict price trends and provide users with more accurate booking recommendations.&lt;/li&gt;
&lt;/ul&gt;
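&lt;p&gt;The parallel scraping step above can be sketched as a small concurrency pool. The pattern below is generic JavaScript; &lt;code&gt;runPool&lt;/code&gt; and the simulated per-site tasks are illustrative stand-ins for real browser sessions, not a Scrapeless API:&lt;/p&gt;

```javascript
// Sketch: run scraping tasks in parallel with a concurrency cap, so many
// airline/hotel pages can be fetched at once without opening unbounded
// browser sessions. runPool is generic; the tasks below are simulated
// stand-ins for real per-site scraping functions.
async function runPool(tasks, limit) {
  const results = new Array(tasks.length);
  let nextIndex = 0;
  async function worker() {
    while (nextIndex < tasks.length) {
      const i = nextIndex++;
      results[i] = await tasks[i]();
    }
  }
  const workers = [];
  for (let w = 0; w < Math.min(limit, tasks.length); w++) {
    workers.push(worker());
  }
  await Promise.all(workers);
  return results;
}

// Simulated per-site tasks; in practice each would drive a browser session.
const sites = ['airlineA', 'airlineB', 'hotelC', 'hotelD'];
const tasks = sites.map((site, i) => async () => ({ site, price: 100 + i }));

const pooled = runPool(tasks, 2).then((prices) => {
  console.log(prices);
  return prices;
});
```

&lt;p&gt;Capping concurrency keeps session counts (and costs) predictable while still collecting results in the original order.&lt;/p&gt;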

&lt;h2&gt;
  
  
  PART 3. Bonus Tip: Bypass Cloudflare using Scraping Browser and Puppeteer
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;We firmly protect the privacy of the website. All data in this blog is public and is only used as a demonstration of the crawling process. We do not save any information and data.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scrapeless requires puppeteer-core, a Puppeteer version that doesn't download the Chrome binary. So, ensure you install it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install puppeteer-core

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 1. Sign up for Scrapeless, click API Key Management &amp;gt; Create API Key to create your Scrapeless API Key.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowse" rel="noopener noreferrer"&gt;Sign up for Scrapeless&lt;/a&gt; and get a free trial. If you have any questions, you can also contact Liam via &lt;a href="https://discord.com/invite/xBcTfGPjCQ?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowse" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Step 2. Then, go to Scraping Browser and copy your Browser URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3xkbdqvcds3a6kmpoom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb3xkbdqvcds3a6kmpoom.png" alt="Bypass Cloudflare using Scraping Browser and Puppeteer" width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Integrate the copied browser URL into your Puppeteer script like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const puppeteer = require('puppeteer-core');
const connectionURL = 'wss://browser.scrapeless.com/browser?token=&amp;lt;YOUR_Scrapeless_API_KEY&amp;gt;&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY';
(async () =&amp;gt; {// set up browser environmentconst browser = await puppeteer.connect({browserWSEndpoint: connectionURL,
    });
// create a new pageconst page = await browser.newPage();
// navigate to a URLawait page.goto('https://www.scrapingcourse.com/cloudflare-challenge', {waitUntil: 'networkidle0',
    });
// wait for the challenge to resolveawait new Promise(function (resolve) {setTimeout(resolve, 10000);
    });
//take page screenshotawait page.screenshot({ path: 'screenshot.png' });// close the browser instanceawait browser.close();
})();

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Replace &lt;a href="https://www.scrapingcourse.com/cloudflare-challenge" rel="noopener noreferrer"&gt;https://www.scrapingcourse.com/cloudflare-challenge&lt;/a&gt; with any Cloudflare-protected page you want to test, and substitute your Scrapeless API Key for the token placeholder in the connection URL.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The above code accesses and screenshots the protected page. See the result below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw171wkgn1yxbh1uk9i7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyw171wkgn1yxbh1uk9i7.png" alt="cf challenge bypass" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations 🎉! You've successfully bypassed Cloudflare using Puppeteer and Scrapeless.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Scrapeless provides the infrastructure that lets AI agents perform online tasks efficiently. For developers and enterprises alike, its scraping browser is a flexible, low-cost option for building and optimizing AI agents, improving work efficiency, reducing development costs, and accelerating technological progress.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Unlock the power of seamless web scraping! With Scrapeless' scraping browser, you can turn any website into structured data effortlessly, boosting your AI agent’s performance. Log in and start today: &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Login Here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. How does Scraping Browser bypass anti-scraping systems such as Cloudflare?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scraping Browser combines dynamic IP proxies, automatic JavaScript rendering, and browser-fingerprint camouflage to bypass most anti-scraping mechanisms. Compared with traditional Puppeteer/Playwright setups, it simulates real user behavior and automatically adjusts request frequency through built-in strategies to increase the success rate. For specific methods, please refer to this article: &lt;a href="https://www.scrapeless.com/en/blog/cloudflare-challenge-bypass?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;How to Bypass Cloudflare Challenge&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. How to bypass CAPTCHA for web scraping?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can use Web Unlocker to efficiently bypass CAPTCHA protection and improve scraping success rates. For a detailed guide, check out: &lt;a href="https://www.scrapeless.com/en/blog/use-web-unlocker-to-bypass-captcha?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;How to Use Web Unlocker to Bypass CAPTCHA&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. What is the best Scraping API?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scrapeless Scraping API supports sites such as Amazon, Shopee, Walmart, SHEIN, TikTok, and Instagram, covering e-commerce, social media, and other fields. It also provides SERP APIs for more than 20 Google search scenarios, including Google Flights, Google Maps, and Google Trends. See: &lt;a href="https://app.scrapeless.com/dashboard/products/scraper?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Scraping API&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. How to scrape Google SERP data?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google SERP data covers multiple scenarios, and the Scrapeless API supports more than 20 of them. You can start scraping after a simple registration. For more details, refer to: &lt;a href="https://www.scrapeless.com/en/blog/scrapeless-deep-serp-api?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=hotscrapingbrowser" rel="noopener noreferrer"&gt;Scrapeless Deep SerpApi&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>scraper</category>
      <category>scrapingbrowser</category>
      <category>scrapeless</category>
    </item>
    <item>
      <title>How to Use Undetected ChromeDriver for Web Scraping</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Mon, 17 Mar 2025 09:14:49 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/how-to-use-undetected-chromedriver-for-web-scraping-3f2n</link>
      <guid>https://dev.to/datacollectionscraper/how-to-use-undetected-chromedriver-for-web-scraping-3f2n</guid>
      <description>&lt;p&gt;Discover how Undetected ChromeDriver helps &lt;a href="https://www.scrapeless.com/en/blog/how-to-avoid-anti-bot?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;bypass anti-bot systems for web scraping&lt;/a&gt;, along with step-by-step guidance, advanced methods, and key limitations. Plus, learn about Scrapeless - a more robust alternative for professional scraping needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this guide, you will learn:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Undetected ChromeDriver is and how it can be useful&lt;/li&gt;
&lt;li&gt;How it minimizes bot detection&lt;/li&gt;
&lt;li&gt;How to use it with Python for web scraping&lt;/li&gt;
&lt;li&gt;Advanced usage and methods&lt;/li&gt;
&lt;li&gt;Its key limitations and drawbacks&lt;/li&gt;
&lt;li&gt;Recommended alternative: Scrapeless&lt;/li&gt;
&lt;li&gt;Technical analysis of anti-bot detection mechanisms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Undetected ChromeDriver?
&lt;/h2&gt;




&lt;p&gt;Undetected ChromeDriver is a Python library that provides an optimized version of Selenium's ChromeDriver, patched to limit detection by anti-bot services such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imperva&lt;/li&gt;
&lt;li&gt;DataDome&lt;/li&gt;
&lt;li&gt;Distil Networks&lt;/li&gt;
&lt;li&gt;and more ...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can also help &lt;a href="https://www.scrapeless.com/en/blog/cf-protected-website-bypass?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;bypass certain Cloudflare protections&lt;/a&gt;, although that can be more challenging.&lt;/p&gt;

&lt;p&gt;If you have ever used browser automation tools like Selenium, you know they let you control browsers programmatically. To make that possible, they configure browsers differently from regular user setups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.scrapeless.com/en/blog/get-around-anti-bot?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;Anti-bot&lt;/a&gt; systems look for those differences, or "leaks," to identify automated browser bots. Undetected ChromeDriver patches Chrome drivers to minimize these telltale signs, reducing bot detection. This makes it ideal for web scraping sites protected by anti-scraping measures!&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Undetected ChromeDriver work?
&lt;/h2&gt;




&lt;p&gt;Undetected ChromeDriver reduces detection from Cloudflare, Imperva, DataDome, and similar solutions by employing the following techniques:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Renaming Selenium variables to mimic those used by real browsers&lt;/li&gt;
&lt;li&gt;Using legitimate, real-world User-Agent strings to avoid detection&lt;/li&gt;
&lt;li&gt;Allowing the user to simulate natural human interaction&lt;/li&gt;
&lt;li&gt;Managing cookies and sessions properly while navigating websites&lt;/li&gt;
&lt;li&gt;Enabling the use of proxies to &lt;a href="https://www.scrapeless.com/en/blog/crawl-a-website-without-getting-blocked?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;bypass IP blocking&lt;/a&gt; and prevent rate limiting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These methods help the browser controlled by the library bypass various anti-scraping defenses effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Undetected ChromeDriver for Web Scraping: Step-By-Step Guide
&lt;/h2&gt;




&lt;h3&gt;
  
  
  Step #1: Prerequisites and Project Setup
&lt;/h3&gt;

&lt;p&gt;Undetected ChromeDriver has the following prerequisites:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latest version of Chrome&lt;/li&gt;
&lt;li&gt;Python 3.6+: If Python 3.6 or later is not installed on your machine, download it from the official site and follow the installation instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The library automatically downloads and patches the driver binary for you, so there is no need to manually download ChromeDriver.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Create a directory for your project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mkdir undetected-chromedriver-scraper
cd undetected-chromedriver-scraper
python -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate the virtual environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# On Linux or macOS
source env/bin/activate

# On Windows
env\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step #2: Install Undetected ChromeDriver
&lt;/h3&gt;

&lt;p&gt;Install Undetected ChromeDriver via the pip package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install undetected_chromedriver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This library will automatically install Selenium, as it is one of its dependencies.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step #3: Initial Setup
&lt;/h3&gt;

&lt;p&gt;Create a scraper.py file and import undetected_chromedriver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import json

# Initialize a Chrome instance
driver = uc.Chrome()

# Connect to the target page
driver.get("https://scrapeless.com")

# Scraping logic...

# Close the browser
driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step #4: Implement the Scraping Logic
&lt;/h3&gt;

&lt;p&gt;Now let's add the logic to extract data from the Apple page:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
import json
import time

# Create a Chrome web driver instance
driver = uc.Chrome()

# Connect to the Apple website
driver.get("https://www.apple.com/fr/")

# Give the page some time to fully load
time.sleep(3)

# Dictionary to store product info
apple_products = {}

try:
    # Find product sections (using the classes from the provided HTML)
    product_sections = driver.find_elements(By.CSS_SELECTOR, ".homepage-section.collection-module .unit-wrapper")

    for i, section in enumerate(product_sections):
        try:
            # Extract product name (headline)
            headline = section.find_element(By.CSS_SELECTOR, ".headline, .logo-image").get_attribute("textContent").strip()

            # Extract description (subhead)
            subhead_element = section.find_element(By.CSS_SELECTOR, ".subhead")
            subhead = subhead_element.text

            # Get the link if available
            link = ""
            try:
                link_element = section.find_element(By.CSS_SELECTOR, ".unit-link")
                link = link_element.get_attribute("href")
            except:
                pass

            apple_products[f"product_{i+1}"] = {
                "name": headline,
                "description": subhead,
                "link": link
            }
        except Exception as e:
            print(f"Error processing section {i+1}: {e}")

    # Export the scraped data to JSON
    with open("apple_products.json", "w", encoding="utf-8") as json_file:
        json.dump(apple_products, json_file, indent=4, ensure_ascii=False)

    print(f"Successfully scraped {len(apple_products)} Apple products")

except Exception as e:
    print(f"Error during scraping: {e}")

finally:
    # Close the browser and release its resources
    driver.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run it with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python scraper.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Undetected ChromeDriver: Advanced Usage
&lt;/h2&gt;

&lt;p&gt;Now that you know how the library works, you're ready to explore some more advanced scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose a Specific Chrome Version
&lt;/h3&gt;

&lt;p&gt;You can specify a particular version of Chrome for the library to use by setting the version_main argument:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import undetected_chromedriver as uc

# Specify the target version of Chrome
driver = uc.Chrome(version_main=105)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  With Syntax
&lt;/h3&gt;

&lt;p&gt;To avoid manually calling the quit() method when you no longer need the driver, you can use the with syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import undetected_chromedriver as uc

with uc.Chrome() as driver:
    driver.get("https://example.com")
    # Rest of your code...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Limitations of Undetected ChromeDriver
&lt;/h2&gt;

&lt;p&gt;While undetected_chromedriver is a powerful Python library, it does have some known limitations:&lt;/p&gt;

&lt;h3&gt;
  
  
  IP Blocks
&lt;/h3&gt;

&lt;p&gt;The library does not hide your IP address. If you're running a script from a datacenter, chances are high that detection will still occur. Similarly, if your home IP has a poor reputation, you may also be blocked.&lt;/p&gt;

&lt;p&gt;To hide your IP, you need to route the controlled browser through a proxy server.&lt;/p&gt;
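&lt;p&gt;As a sketch of what such an integration looks like (the proxy endpoint below is a placeholder, not a real server), Chrome accepts a proxy through its --proxy-server switch, which you can pass to undetected_chromedriver via ChromeOptions:&lt;/p&gt;

```python
def proxy_arg(host: str, port: int, scheme: str = "http") -> str:
    """Build the Chrome --proxy-server switch for a proxy endpoint."""
    return f"--proxy-server={scheme}://{host}:{port}"

# Placeholder endpoint: substitute your own proxy provider's details.
arg = proxy_arg("proxy.example.com", 8080)
print(arg)  # --proxy-server=http://proxy.example.com:8080

# With undetected_chromedriver installed, pass the switch like so:
#   import undetected_chromedriver as uc
#   options = uc.ChromeOptions()
#   options.add_argument(arg)
#   driver = uc.Chrome(options=options)
```

&lt;p&gt;Note that Chrome's --proxy-server switch does not accept inline credentials, so authenticated proxies need additional handling.&lt;/p&gt;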

&lt;h3&gt;
  
  
  No Support for GUI Navigation
&lt;/h3&gt;

&lt;p&gt;Due to the inner workings of the module, you must browse programmatically using the get() method. Avoid using the browser GUI for manual navigation—interacting with the page using your keyboard or mouse increases the risk of detection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limited Support for Headless Mode
&lt;/h3&gt;

&lt;p&gt;Officially, headless mode is not fully supported by the undetected_chromedriver library. However, you can experiment with it using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;driver = uc.Chrome(headless=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Stability Issues
&lt;/h3&gt;

&lt;p&gt;Results may vary due to numerous factors. No guarantees are provided, other than continuous efforts to understand and counter detection algorithms. A script that successfully bypasses anti-bot systems today might fail tomorrow if the protection methods receive updates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended Alternative: Scrapeless
&lt;/h2&gt;




&lt;p&gt;Given the limitations of Undetected ChromeDriver, Scrapeless offers a more robust and reliable alternative for web scraping without getting blocked.&lt;/p&gt;


&lt;h3&gt;
  
  
  Why Scrapeless is Superior
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.scrapeless.com/en/product/scraping-browser?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;Scrapeless&lt;/a&gt; is a remote browser service that addresses the inherent problems with the Undetected ChromeDriver approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Constant updates: Unlike Undetected ChromeDriver which may stop working after anti-bot system updates, Scrapeless is continuously updated by its team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built-in IP rotation: Scrapeless offers automatic IP rotation, eliminating the IP blocking issue of Undetected ChromeDriver.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Optimized configuration: Scrapeless browsers are already optimized to avoid detection, which greatly simplifies the process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic CAPTCHA solving: Scrapeless can automatically solve CAPTCHAs you might encounter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compatible with multiple frameworks: Works with Playwright, Puppeteer, and other automation tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;Sign in to Scrapeless&lt;/a&gt; for a free trial.&lt;/p&gt;

&lt;p&gt;Recommended reading: &lt;a href="https://www.scrapeless.com/en/blog/puppeteer-cloudflare-bypass?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;How to Bypass Cloudflare With Puppeteer&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to use Scrapeless to scrape the web (without getting blocked)
&lt;/h2&gt;

&lt;p&gt;Here's how to implement a similar solution with Scrapeless using Playwright:&lt;/p&gt;

&lt;p&gt;Step 1: Register and &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;log in to Scrapeless&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Get the Scrapeless API KEY&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52m0m1x7t6guw22cfw7p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F52m0m1x7t6guw22cfw7p.png" alt="Get the Scrapeless API KEY" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Step 3: You can integrate the following code into your project&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const {chromium} = require('playwright-core');

// Scrapeless connection URL with your token
const connectionURL = 'wss://browser.scrapeless.com/browser?token=YOUR_TOKEN_HERE&amp;amp;session_ttl=180&amp;amp;proxy_country=ANY';

(async () =&amp;gt; {
  // Connect to the remote Scrapeless browser
  const browser = await chromium.connectOverCDP(connectionURL);

  try {
    // Create a new page
    const page = await browser.newPage();

    // Navigate to Apple's website
    console.log('Navigating to Apple website...');
    await page.goto('https://www.apple.com/fr/', {
      waitUntil: 'domcontentloaded',
      timeout: 60000
    });

    console.log('Page loaded successfully');

    // Wait for the product sections to be available
    await page.waitForSelector('.homepage-section.collection-module', { timeout: 10000 });

    // Get featured products from the homepage
    const products = await page.evaluate(() =&amp;gt; {
      const results = [];

      // Get all product sections
      const productSections = document.querySelectorAll('.homepage-section.collection-module .unit-wrapper');

      productSections.forEach((section, index) =&amp;gt; {
        try {
          // Get product name - could be in .headline or .logo-image
          const headlineEl = section.querySelector('.headline') || section.querySelector('.logo-image');
          const headline = headlineEl ? headlineEl.textContent.trim() : 'Unknown Product';

          // Get product description
          const subheadEl = section.querySelector('.subhead');
          const subhead = subheadEl ? subheadEl.textContent.trim() : '';

          // Get product link
          const linkEl = section.querySelector('.unit-link');
          const link = linkEl ? linkEl.getAttribute('href') : '';

          results.push({
            name: headline,
            description: subhead,
            link: link
          });
        } catch (err) {
          console.error(`Error processing section ${index}: ${err.message}`);
        }
      });

      return results;
    });

    // Display the results
    console.log('Found Apple products:');
    console.log(JSON.stringify(products, null, 2));
    console.log(`Total products found: ${products.length}`);

  } catch (error) {
    console.error('An error occurred:', error);
  } finally {
    // Close the browser
    await browser.close();
    console.log('Browser closed');
  }
})();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;You can also join the Scrapeless &lt;a href="https://discord.com/invite/PCEFG8bV?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; to participate in the developer support program and receive up to 500k SERP API usage credits for free.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Enhanced Technical Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Bot Detection: How It Works
&lt;/h3&gt;

&lt;p&gt;Anti-bot systems use several techniques to detect automation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Browser fingerprinting: Collects dozens of browser properties (fonts, canvas, WebGL, etc.) to create a unique signature.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;WebDriver detection: Looks for the presence of the WebDriver API or its artifacts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Behavioral analysis: Analyzes mouse movements, clicks, typing speed that differ between humans and bots.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigation anomaly detection: Identifies suspicious patterns like too-fast requests or lack of image/CSS loading.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
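
&lt;p&gt;To make the fingerprinting idea concrete, here is a minimal, illustrative sketch (the property names and values are invented for demonstration; real systems collect dozens of signals): the detector hashes a canonical view of browser properties, so a single automation artifact such as the webdriver flag shifts the entire signature.&lt;/p&gt;

```python
import hashlib
import json

def fingerprint(properties: dict) -> str:
    """Hash a canonical view of browser properties into a short signature."""
    canonical = json.dumps(properties, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Invented example properties, for demonstration only.
human = {"webdriver": False, "plugins": 3, "languages": ["en-US", "en"]}
bot = dict(human, webdriver=True)

# Flipping the single webdriver flag changes the whole signature.
print(fingerprint(human) != fingerprint(bot))  # True
```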

&lt;blockquote&gt;
&lt;p&gt;Recommended reading: &lt;a href="https://www.scrapeless.com/en/blog/how-to-avoid-anti-bot?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;How to Bypass Anti Bot&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  How Undetected ChromeDriver Bypasses Detection
&lt;/h3&gt;

&lt;p&gt;Undetected ChromeDriver circumvents these detections by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Removing WebDriver indicators: Eliminates the &lt;code&gt;navigator.webdriver&lt;/code&gt; property and other WebDriver traces.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Patching Cdc_: Modifies Chrome Driver Controller variables that are known signatures of ChromeDriver.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Using realistic User-Agents: Replaces default User-Agents with up-to-date strings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minimizing configuration changes: Reduces changes to Chrome browser's default behavior.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Technical code showing how Undetected ChromeDriver patches the driver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Simplified extract from Undetected ChromeDriver source code

def _patch_driver_executable():
    """
    Patches the ChromeDriver binary to remove telltale signs of automation
    """
    linect = 0
    replacement = os.urandom(32).hex()
    with io.open(self.executable_path, "r+b") as fh:
        for line in iter(lambda: fh.readline(), b""):
            if b"cdc_" in line.lower():
                fh.seek(-len(line), 1)
                newline = re.sub(
                    b"cdc_.{22}", b"cdc_" + replacement.encode(), line
                )
                fh.write(newline)
                linect += 1
    return linect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
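
&lt;p&gt;You can try the same substitution on an in-memory byte string; the line below imitates how the injected cdc_ identifier appears inside the driver (the surrounding JavaScript is illustrative):&lt;/p&gt;

```python
import os
import re

# A line imitating the driver's injected identifier: "cdc_" plus 22 characters.
line = b"var cdc_adoQpoasnfa76pfcZLmcfl_Array = window.Array;"

replacement = os.urandom(11).hex().encode()  # 22 hex characters
patched = re.sub(b"cdc_.{22}", b"cdc_" + replacement, line)

# The well-known signature is gone, but the surrounding code is untouched.
print(b"cdc_adoQpoasnfa76pfcZLmcfl" in patched)  # False
print(patched.endswith(b"_Array = window.Array;"))  # True
```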



&lt;h3&gt;
  
  
  Why Scrapeless is More Effective
&lt;/h3&gt;

&lt;p&gt;Scrapeless takes a different approach by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Pre-configured environment: Using browsers already optimized to mimic human users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cloud-based infrastructure: Running browsers in the cloud with proper fingerprinting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Intelligent proxy rotation: Automatically rotating IPs based on the target site.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Advanced fingerprint management: Maintaining consistent browser fingerprints throughout the session.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;WebRTC, Canvas, and Plugin suppression: Blocking common fingerprinting techniques.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=chromedriver" rel="noopener noreferrer"&gt;Sign in to Scrapeless&lt;/a&gt; for a free trial.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, you've learned how to deal with bot detection in Selenium using Undetected ChromeDriver. This library provides a patched version of ChromeDriver for web scraping without getting blocked.&lt;/p&gt;

&lt;p&gt;The challenge is that advanced anti-bot technologies like Cloudflare will still be able to detect and block your scripts. Libraries like undetected_chromedriver are unstable—while they may work today, they might not work tomorrow.&lt;/p&gt;

&lt;p&gt;For professional scraping needs, cloud-based solutions like Scrapeless offer a more robust alternative. They provide pre-configured remote browsers specifically designed to bypass anti-bot measures, with additional features like IP rotation and CAPTCHA solving.&lt;/p&gt;

&lt;p&gt;The choice between Undetected ChromeDriver and Scrapeless depends on your specific needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Undetected ChromeDriver: Good for smaller projects, free and open-source, but requires more maintenance and can be less reliable.&lt;/li&gt;
&lt;li&gt;Scrapeless: Better for professional scraping needs, more reliable, constantly updated, but comes with a subscription cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By understanding how these anti-bot bypass technologies work, you can choose the right tool for your web scraping projects and avoid the common pitfalls of automated data collection.&lt;/p&gt;

</description>
      <category>chromedriver</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Scrape Google Finance Ticker Quote Data in Python</title>
      <dc:creator>datacollection</dc:creator>
      <pubDate>Fri, 14 Mar 2025 10:30:50 +0000</pubDate>
      <link>https://dev.to/datacollectionscraper/how-to-scrape-google-finance-ticker-quote-data-in-python-2mc7</link>
      <guid>https://dev.to/datacollectionscraper/how-to-scrape-google-finance-ticker-quote-data-in-python-2mc7</guid>
      <description>&lt;p&gt;In the fast-paced world of finance, access to up-to-date and accurate stock market data is essential for investors, traders, and analysts. Google Finance is an invaluable resource that provides real-time stock quotes, historical financial data, news, and currency rates. Learning how to scrape this data using Python can be of great benefit to those looking to aggregate data, perform sentiment analysis, make market forecasts, or effectively manage risk.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Scrape Google Finance?
&lt;/h2&gt;

&lt;p&gt;Scraping Google Finance can be beneficial for various reasons, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-Time Stock Data – Access up-to-date stock prices, market trends, and historical performance.&lt;/li&gt;
&lt;li&gt;Automated Market Analysis – Collect financial data at scale for trend analysis, portfolio management, or algorithmic trading.&lt;/li&gt;
&lt;li&gt;Company Insights – Gather financial summaries, earnings reports, and stock performance for investment research.&lt;/li&gt;
&lt;li&gt;Competitor &amp;amp; Industry Research – Monitor competitors’ financial health and industry trends to make data-driven decisions.&lt;/li&gt;
&lt;li&gt;News &amp;amp; Sentiment Analysis – Extract news articles and updates related to specific stocks or industries for sentiment tracking.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What will be scraped
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.scrapeless.com%2Fprod%2Fposts%2Fscrape-google-finance-python%2Ffc133805488c95fa6626fa416ae31826.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.scrapeless.com%2Fprod%2Fposts%2Fscrape-google-finance-python%2Ffc133805488c95fa6626fa416ae31826.png" alt="What will be scraped" width="800" height="1777"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Scrape Google Finance Ticker Quote Data in Python
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1. Configure the environment
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python: &lt;a href="https://www.python.org/downloads/" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;The interpreter&lt;/strong&gt;&lt;/a&gt; is what runs your Python code; download it from the official website as shown below. It is not necessary to install the very latest release: a version one or two behind the latest is often a safer choice.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python IDE: Any IDE that supports Python will work, but we recommend PyCharm. It is a development tool specifically designed for Python. For the PyCharm version, we recommend the  &lt;a href="https://www.jetbrains.com/pycharm/download/" rel="nofollow noopener noreferrer"&gt;&lt;strong&gt;free PyCharm Community Edition&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsm9ffe2itzi25g3av6x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsm9ffe2itzi25g3av6x.png" alt="Python IDE" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: If you are a Windows user, do not forget to check the "Add python.exe to PATH" option during the installation wizard, so that you can run the python and pip commands from the terminal. Python 3.4 and later bundle pip by default, so you do not need to install it separately.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now you can check if Python is installed by opening the terminal or command prompt and entering the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python --version

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2. Install Dependencies
&lt;/h3&gt;

&lt;p&gt;It is recommended to create a virtual environment to manage project dependencies and avoid conflicts with other Python projects. Navigate to the project directory in the terminal and execute the following command to create a virtual environment named google_finance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m venv google_finance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate the virtual environment based on your system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Windows:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;google_finance\Scripts\activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;MacOS/Linux:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source google_finance/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After activating the virtual environment, install the required Python libraries for web scraping: requests for sending HTTP requests, beautifulsoup4 for parsing the HTML, and optionally playwright if you later need to render JavaScript-heavy pages. Install them using the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install requests
pip install beautifulsoup4
pip install playwright
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3. Scrape Data
&lt;/h3&gt;

&lt;p&gt;To extract stock information from Google Finance, we first need to understand how the website's URL identifies the stock we want to scrape. Let's take the Nasdaq index as an example, which lists many stocks we can pull information from. To find the symbol of each stock, you can use the Nasdaq stock screener. Here we will target META as our example stock. With the index and symbol in hand, we can build the first snippet of the script.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We firmly respect the privacy of the website. All data in this blog is publicly available and is used only to demonstrate the scraping process. We do not store any of the collected information.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
import requests
from bs4 import BeautifulSoup
BASE_URL = "https://www.google.com/finance"
INDEX = "NASDAQ"
SYMBOL = "META"
LANGUAGE = "en"
TARGET_URL = f"{BASE_URL}/quote/{SYMBOL}:{INDEX}?hl={LANGUAGE}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can use the Requests library to make an HTTP request on TARGET_URL and create a Beautiful Soup instance to scrape the HTML content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# make an HTTP request
page = requests.get(TARGET_URL)

# use an HTML parser to grab the content from "page"
soup = BeautifulSoup(page.content, "html.parser")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before we start scraping, we first need to identify the relevant HTML elements on the page at TARGET_URL by inspecting it with the browser's developer tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wwlkagktns9pvah420j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7wwlkagktns9pvah420j.png" alt="Stock Description" width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Items describing the stock are represented by the class gyFHrc. Inside each such element there is a child element holding the item's title (e.g. "Previous Close") and another holding the corresponding value (e.g. $597.99). The title can be obtained from the mfs7Fc class, while the value comes from the P6K39c class.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk78apz2n9tw4dcto55l5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk78apz2n9tw4dcto55l5.png" alt="Stock title" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7luz5mf4w7d8yctwyhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw7luz5mf4w7d8yctwyhf.png" alt="Stock Value" width="800" height="513"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The complete list of items to be crawled is as follows:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Previous Close&lt;/li&gt;
&lt;li&gt;Day Range&lt;/li&gt;
&lt;li&gt;Year Range&lt;/li&gt;
&lt;li&gt;Market Cap&lt;/li&gt;
&lt;li&gt;AVG Volume&lt;/li&gt;
&lt;li&gt;P/E Ratio&lt;/li&gt;
&lt;li&gt;Dividend Yield&lt;/li&gt;
&lt;li&gt;Primary Exchange&lt;/li&gt;
&lt;li&gt;CEO&lt;/li&gt;
&lt;li&gt;Founded&lt;/li&gt;
&lt;li&gt;Website&lt;/li&gt;
&lt;li&gt;Employees&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's see how to fetch these items using Python code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# get the items that describe the stock
items = soup.find_all("div", {"class": "gyFHrc"})


# create a dictionary to store the stock description
stock_description = {}

# iterate over the items and append them to the dictionary
for item in items:
    item_description = item.find("div", {"class": "mfs7Fc"}).text
    item_value = item.find("div", {"class": "P6K39c"}).text
    stock_description[item_description] = item_value


print(stock_description)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is just an example of a simple script that can be integrated into a trading bot, application, or a simple dashboard to track your favorite stocks.&lt;/p&gt;
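&lt;p&gt;As a sketch of how that integration might look, the steps above can be wrapped into small reusable functions. The names parse_stock_description and get_stock_description below are introduced here for illustration, and the code assumes the gyFHrc/mfs7Fc/P6K39c class names are unchanged on the live page (Google rotates them from time to time):&lt;/p&gt;

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.google.com/finance"


def parse_stock_description(html):
    """Extract title/value pairs from a Google Finance quote page's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    description = {}
    for item in soup.find_all("div", {"class": "gyFHrc"}):
        title = item.find("div", {"class": "mfs7Fc"})
        value = item.find("div", {"class": "P6K39c"})
        if title and value:  # skip malformed items
            description[title.text] = value.text
    return description


def get_stock_description(symbol, index, language="en"):
    """Fetch the quote page for one symbol and parse it."""
    url = f"{BASE_URL}/quote/{symbol}:{index}?hl={language}"
    page = requests.get(url, timeout=10)
    page.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return parse_stock_description(page.text)
```

&lt;p&gt;Keeping the parsing separate from the HTTP request makes the parser easy to unit-test on saved HTML, and lets a dashboard loop over several symbols with one call per ticker.&lt;/p&gt;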

&lt;h2&gt;
  
  
  Full Code
&lt;/h2&gt;

&lt;p&gt;There are many more data attributes you can grab from the page, but for now, the full code looks a little like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.google.com/finance"
INDEX = "NASDAQ"
SYMBOL = "META"
LANGUAGE = "en"
TARGET_URL = f"{BASE_URL}/quote/{SYMBOL}:{INDEX}?hl={LANGUAGE}"

# make an HTTP request
page = requests.get(TARGET_URL)

# use an HTML parser to grab the content from "page"
soup = BeautifulSoup(page.content, "html.parser")

# get the items that describe the stock
items = soup.find_all("div", {"class": "gyFHrc"})

# create a dictionary to store the stock description
stock_description = {}

# iterate over the items and append them to the dictionary
for item in items:
    item_description = item.find("div", {"class": "mfs7Fc"}).text
    item_value = item.find("div", {"class": "P6K39c"}).text
    stock_description[item_description] = item_value

print(stock_description)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following are some examples of the results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr8tor7p6jkwpsamai5i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr8tor7p6jkwpsamai5i.png" alt="some examples of the results" width="800" height="85"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Limitations when scraping Google Finance
&lt;/h2&gt;

&lt;p&gt;Using the above method you can build a small scraper, but it will not keep delivering data at scale. Google is very sensitive to automated scraping and will eventually block your IP.&lt;/p&gt;

&lt;p&gt;Once your IP is blocked, you will not be able to scrape anything and your data pipeline will break. So how do you overcome this problem? A simple solution is to use a Google Finance Scraping API.&lt;/p&gt;
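&lt;p&gt;Before reaching for an API, a common stopgap is to slow down and retry politely. The sketch below is a generic retry helper with exponential backoff; fetch is any zero-argument callable (for example a wrapper around requests.get), and the delay values are illustrative, not tuned recommendations:&lt;/p&gt;

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=1.0):
    """Call fetch() until it succeeds, sleeping exponentially longer
    after each failure (1s, 2s, 4s, ...). Re-raises the last error."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(base_delay * (2 ** attempt))
```

&lt;p&gt;This smooths over transient failures, but it will not defeat a hard IP block, which is why a dedicated scraping API becomes attractive at scale.&lt;/p&gt;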

&lt;p&gt;Let's see how to scrape unlimited data from Google Finance using this API.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why use Scrapeless Google Finance Scraping API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data quality and accuracy
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High-precision data:&lt;/strong&gt; &lt;a href="https://www.scrapeless.com/en/product/deep-serp-api" rel="noopener noreferrer"&gt;Scrapeless SerpApi&lt;/a&gt; provides accurate, reliable, and up-to-date Google Finance data, ensuring that users obtain the most authentic and useful market information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time updates:&lt;/strong&gt; The API returns the latest Google Finance data in real time, including live stock quotes and market trends, which is essential for users who need to make timely investment decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi-language and location support
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support:&lt;/strong&gt; Financial data can be retrieved in multiple languages, serving users in different regions around the world.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Location customization:&lt;/strong&gt; You can obtain customized search results based on a specified geographic location, device type, and other parameters, which is very useful for analyzing market conditions in different regions or conducting localized market research.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Performance and cost advantages
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Super fast speed:&lt;/strong&gt; With an average response time of only &lt;u&gt;1-2 seconds&lt;/u&gt;, Scrapeless SerpApi is one of the fastest search scraping APIs on the market.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective:&lt;/strong&gt; Scrapeless SerpApi provides Google Search APIs at &lt;u&gt;only $0.1 per thousand queries&lt;/u&gt;, a pricing model that is very economical for large-scale data scraping projects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Integration
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Easy integration:&lt;/strong&gt; Scrapeless SerpApi supports a variety of popular programming languages (&lt;strong&gt;such as Python, Node.js, Golang, etc.&lt;/strong&gt;), so users can easily embed it into their own applications or analysis tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Stability and reliability
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High availability:&lt;/strong&gt; Scrapeless SerpApi offers high service availability and stability, ensuring uninterrupted service during long-term, high-frequency data scraping.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Professional support:&lt;/strong&gt; Scrapeless SerpApi provides professional technical support and customer service to help users resolve problems encountered during use.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How to Scrape Google Finance data with Scrapeless
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Sign up for Scrapeless and get an API key
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;If you don't have a Scrapeless account yet, visit the Scrapeless website and sign up. You can get &lt;u&gt;20,000 free search queries&lt;/u&gt;.&lt;/li&gt;
&lt;li&gt;Once &lt;a href="https://app.scrapeless.com/passport/login?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=googlefinance" rel="noopener noreferrer"&gt;signed up&lt;/a&gt;, log in to your dashboard.&lt;/li&gt;
&lt;li&gt;In the dashboard, navigate to &lt;strong&gt;API Key Management&lt;/strong&gt; and click &lt;strong&gt;Create API Key&lt;/strong&gt;. Copy the generated API key, which will be your authentication credential when calling the &lt;a href="https://www.scrapeless.com/en/product/scraping-api?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=googlefinance" rel="noopener noreferrer"&gt;Scrapeless API&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs2q30n017h6v0056ttg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxs2q30n017h6v0056ttg.png" alt="get an API key" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Access the Deep SerpApi Playground
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Then navigate to the "&lt;strong&gt;Deep SerpApi&lt;/strong&gt;" section.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pzlw4mh7wp2y9hmsf2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0pzlw4mh7wp2y9hmsf2u.png" alt="navigate to the " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Set search parameters
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;In the Playground, enter your search keyword, such as "GOOGL:NASDAQ".&lt;/li&gt;
&lt;li&gt;Set other parameters, such as the query term, language, time, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;You can also check the &lt;a href="https://apidocs.scrapeless.com/doc-873763" rel="noopener noreferrer"&gt;official API documentation&lt;/a&gt; of Scrapeless to learn about the Google Finance parameters.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2htt255r1pbezvdt1vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2htt255r1pbezvdt1vm.png" alt="Set search parameters" width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Perform a search
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Click the "&lt;strong&gt;Start Search&lt;/strong&gt;" button, and the Playground will send a request to the Deep SerpApi and return structured JSON data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5: View and export data
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Browse the returned JSON data to view detailed information.&lt;/li&gt;
&lt;li&gt;If necessary, you can click "&lt;strong&gt;Copy&lt;/strong&gt;" in the upper right corner to export the data in CSV or JSON format for further analysis.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Free developer support:&lt;/strong&gt;&lt;br&gt;
Integrate Scrapeless Deep SerpApi into your AI tool, application or project (we already support Dify, and will support Langchain, Langflow, FlowiseAI and other frameworks in the future).&lt;br&gt;
Share your integration results on social media and you will get 1 to 12 months of free developer support, up to 500K usage per month.&lt;br&gt;
Seize this opportunity to improve your project and enjoy more development support! You can also contact Liam via &lt;a href="https://discord.com/invite/xBcTfGPjCQ?utm_source=official&amp;amp;utm_medium=blog&amp;amp;utm_campaign=googlefinance" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; for more details.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  How to integrate the Scrapeless API
&lt;/h2&gt;

&lt;p&gt;Here is the sample code for scraping Google Finance results using the Scrapeless API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import requests


class Payload:
    def __init__(self, actor, input_data):
        self.actor = actor
        self.input = input_data


def send_request():
    host = "api.scrapeless.com"
    url = f"https://{host}/api/v1/scraper/request"
    token = "YOUR-API-KEY"  # replace with the API key from your dashboard

    headers = {
        "x-api-token": token
    }

    input_data = {
        "q": "GOOG:NASDAQ",
        "window": "MAX",
        # ... add further query parameters as needed
    }

    payload = Payload("scraper.google.finance", input_data)

    json_payload = json.dumps(payload.__dict__)

    response = requests.post(url, headers=headers, data=json_payload)

    if response.status_code != 200:
        print("Error:", response.status_code, response.text)
        return

    print("body", response.text)


if __name__ == "__main__":
    send_request()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Adjust the query parameters as needed to get more precise results. For more information on API parameters, check the official Scrapeless API documentation.&lt;br&gt;
Remember to replace YOUR-API-KEY with the API key you copied earlier.&lt;/p&gt;
&lt;/blockquote&gt;
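&lt;p&gt;The sample above prints the response body as raw text. To work with it programmatically, you would typically decode it as JSON and persist it for later analysis. The helper below is a minimal sketch; it makes no assumptions about the response schema beyond it being valid JSON (consult the Scrapeless documentation for the actual field structure):&lt;/p&gt;

```python
import json


def save_response(body, path):
    """Decode an API response body as JSON and write it, pretty-printed,
    to `path` for later analysis. Returns the decoded object."""
    data = json.loads(body)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return data
```

&lt;p&gt;Inside send_request, you could call save_response(response.text, "finance.json") instead of printing, so each run leaves a file ready for downstream processing.&lt;/p&gt;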




&lt;h2&gt;
  
  
  Additional Resources
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.scrapeless.com/en/blog/scrape-google-news" rel="noopener noreferrer"&gt;How to Scrape Google News with Python&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.scrapeless.com/en/blog/puppeteer-cloudflare-bypass" rel="noopener noreferrer"&gt;How to Bypass Cloudflare With Puppeteer&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.scrapeless.com/en/blog/scrape-google-lens" rel="noopener noreferrer"&gt;How to Scrape Google Lens Results with Scrapeless&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, scraping Google Finance ticker quote data in Python is a powerful technique for accessing real-time financial information. By utilizing libraries like requests and BeautifulSoup, or more advanced tools like Selenium, you can efficiently extract and analyze market data to inform your investment decisions. Remember to respect website terms of service and consider using official APIs when available for sustainable data access.&lt;/p&gt;

</description>
      <category>python</category>
      <category>webdev</category>
      <category>api</category>
    </item>
  </channel>
</rss>
