<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AlterLab</title>
    <description>The latest articles on DEV Community by AlterLab (@alterlab).</description>
    <link>https://dev.to/alterlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842661%2F6ea3b67f-3a2b-423f-b726-51041ab344e6.png</url>
      <title>DEV Community: AlterLab</title>
      <link>https://dev.to/alterlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alterlab"/>
    <language>en</language>
    <item>
      <title>Extract Structured Data from Websites Using AI Instead of CSS Selectors</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Sun, 12 Apr 2026 10:35:25 +0000</pubDate>
      <link>https://dev.to/alterlab/extract-structured-data-from-websites-using-ai-instead-of-css-selectors-13l</link>
      <guid>https://dev.to/alterlab/extract-structured-data-from-websites-using-ai-instead-of-css-selectors-13l</guid>
      <description>&lt;h2&gt;
  
  
  The Problem with CSS Selectors
&lt;/h2&gt;

&lt;p&gt;You write a scraper targeting &lt;code&gt;.product-price .amount&lt;/code&gt;. It works. Two weeks later, the site ships a redesign and your selector returns null. You inspect the DOM, find the new class, patch your code, and move on. This repeats every few months for every site you scrape.&lt;/p&gt;

&lt;p&gt;CSS selectors couple your extraction logic to implementation details you do not control. Class names change. DOM structures shift. A/B tests swap element order. Each change breaks your pipeline silently until you notice missing data downstream.&lt;/p&gt;

&lt;p&gt;AI extraction removes this coupling. You describe the data you want in plain text. The model reads the page, understands the semantic structure, and returns clean JSON. No selectors to maintain. No DOM inspection when layouts change.&lt;/p&gt;

&lt;h2&gt;
  
  
  How AI Extraction Works
&lt;/h2&gt;

&lt;p&gt;The process has three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch the page content (rendered, with JavaScript executed)&lt;/li&gt;
&lt;li&gt;Pass the content and your extraction schema to a language model&lt;/li&gt;
&lt;li&gt;Return structured JSON matching your schema&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The model does not guess. It reads the actual rendered DOM, identifies elements matching your description, and extracts their values. If a product page has a price, name, and rating, you describe those fields and get them back as typed JSON.&lt;/p&gt;
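&lt;p&gt;A quick sketch of what "typed JSON" buys you downstream: a local type check you can run on each extracted record before it enters your pipeline. This is plain Python, not part of any SDK; the field names mirror the product example in this article, and &lt;code&gt;EXPECTED_TYPES&lt;/code&gt; is an illustration you would adapt to your own schema.&lt;/p&gt;

```python
# Minimal local type check for extracted fields, so bad model output fails
# fast instead of flowing silently into your pipeline. Field names follow
# the product example in this article; adjust them to your own schema.
EXPECTED_TYPES = {
    "product_name": str,
    "price": float,
    "rating": float,
    "review_count": int,
}

def validate(record):
    """Return the names of fields whose value is missing or mistyped."""
    errors = []
    for field, expected in EXPECTED_TYPES.items():
        if not isinstance(record.get(field), expected):
            errors.append(field)
    return errors

print(validate({"product_name": "Widget", "price": 19.99,
                "rating": 4.5, "review_count": 12}))  # prints: []
```

&lt;p&gt;An empty list means the record matched the expected types; anything else names the fields to retry or discard.&lt;/p&gt;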

&lt;h2&gt;
  
  
  Setting Up
&lt;/h2&gt;

&lt;p&gt;Install the Python SDK:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install alterlab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Or use the REST API directly with curl. Both approaches are covered below. You will need an API key from your &lt;a href="https://alterlab.io/signup" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Example: Extracting Product Data&lt;/h2&gt;

&lt;p&gt;Here is a product page on an e-commerce site. You need the product name, price, rating, and number of reviews. With CSS selectors, you would inspect the DOM, write four selectors, and hope they survive the next deploy.&lt;/p&gt;

&lt;p&gt;With AI extraction, you describe the fields:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    url="https://example-store.com/products/wireless-headphones",
    formats=["json"],
    cortex={
        "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
)

data = response.json["cortex"]
print(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "product_name": "Sony WH-1000XM5 Wireless Headphones",
  "price": 348.00,
  "rating": 4.7,
  "review_count": 2841
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The same request via curl:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-store.com/products/wireless-headphones",
    "formats": ["json"],
    "cortex": {
      "prompt": "Extract: product_name (string), price (float), rating (float out of 5), review_count (integer)"
    }
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Structured Schemas with JSON Schema
&lt;/h2&gt;

&lt;p&gt;For production pipelines, you want type guarantees. Pass a JSON Schema instead of a plain text prompt. The model validates its output against your schema before returning it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
    "type": "object",
    "properties": {
        "products": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number"},
                    "in_stock": {"type": "boolean"},
                    "sku": {"type": "string"}
                },
                "required": ["name", "price", "in_stock"]
            }
        }
    }
}

response = client.scrape(
    url="https://example-store.com/category/electronics",
    formats=["json"],
    cortex={"prompt": "Extract all products from this category page", "schema": schema}
)

for product in response.json["cortex"]["products"]:
    print(f"{product['name']}: ${product['price']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This returns an array of products with typed fields. Missing optional fields are omitted. Required fields are always present. If the model cannot confidently extract a required field, it returns an error you can handle in your pipeline.&lt;/p&gt;

&lt;h2&gt;Handling Dynamic Content&lt;/h2&gt;

&lt;p&gt;Many sites load data client-side. A product listing might render empty HTML, then populate via JavaScript fetches. Traditional scrapers that only fetch raw HTML get nothing back.&lt;/p&gt;

&lt;p&gt;AI extraction requires the rendered DOM. The platform handles this automatically: it launches a headless browser, waits for the page to stabilize, then passes the rendered content to the model. You do not need to configure wait times or detect network idle.&lt;/p&gt;

&lt;p&gt;For sites with aggressive bot detection, the &lt;a href="https://alterlab.io/anti-bot-bypass-api" rel="noopener noreferrer"&gt;anti-bot bypass&lt;/a&gt; layer handles fingerprint rotation, TLS fingerprint matching, and challenge solving before the page ever reaches the extraction step.&lt;/p&gt;

&lt;h2&gt;When to Use AI Extraction vs CSS Selectors&lt;/h2&gt;

&lt;p&gt;AI extraction is not a replacement for every scraping pattern. It is a tool for specific scenarios.&lt;/p&gt;

&lt;div data-infographic="comparison"&gt;
  &lt;table&gt;
    &lt;thead&gt;&lt;tr&gt;&lt;th&gt;Criteria&lt;/th&gt;&lt;th&gt;AI Extraction&lt;/th&gt;&lt;th&gt;CSS Selectors&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;&lt;td&gt;Setup time&lt;/td&gt;&lt;td&gt;Seconds &amp;mdash; describe fields in text&lt;/td&gt;&lt;td&gt;Minutes &amp;mdash; inspect DOM, write selectors&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;Maintenance&lt;/td&gt;&lt;td&gt;None &amp;mdash; model adapts to layout changes&lt;/td&gt;&lt;td&gt;Ongoing &amp;mdash; selectors break on redesign&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;Cost per request&lt;/td&gt;&lt;td&gt;Higher &amp;mdash; includes model inference&lt;/td&gt;&lt;td&gt;Lower &amp;mdash; raw extraction only&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;Type safety&lt;/td&gt;&lt;td&gt;Strong &amp;mdash; JSON Schema validation&lt;/td&gt;&lt;td&gt;Manual &amp;mdash; you parse and validate&lt;/td&gt;&lt;/tr&gt;
      &lt;tr&gt;&lt;td&gt;Best for&lt;/td&gt;&lt;td&gt;Dynamic pages, complex layouts, prototyping&lt;/td&gt;&lt;td&gt;Stable pages, high volume, simple structures&lt;/td&gt;&lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/div&gt;

&lt;p&gt;Use AI extraction when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The site changes its layout frequently&lt;/li&gt;
&lt;li&gt;You are prototyping and need data fast&lt;/li&gt;
&lt;li&gt;The page structure is complex or inconsistent&lt;/li&gt;
&lt;li&gt;You need to extract from many different sites with one pipeline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use CSS selectors when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The page structure is stable and predictable&lt;/li&gt;
&lt;li&gt;You are scraping at very high volume and cost matters&lt;/li&gt;
&lt;li&gt;You need sub-second response times&lt;/li&gt;
&lt;li&gt;The data is in simple, consistent locations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can mix both approaches in the same pipeline. Use AI extraction for complex pages and selectors for stable ones. The &lt;a href="https://alterlab.io/web-scraping-api-python" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; supports both patterns with the same client interface.&lt;/p&gt;

&lt;h2&gt;Real-World Pattern: Monitoring Competitor Prices&lt;/h2&gt;

&lt;p&gt;Here is a practical pipeline that combines scheduling with AI extraction. You want to track prices for a list of competitor products daily.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

competitors = [
    {"url": "https://competitor-a.com/product/123", "name": "Competitor A"},
    {"url": "https://competitor-b.com/p/abc", "name": "Competitor B"},
]

for competitor in competitors:
    response = client.scrape(
        url=competitor["url"],
        formats=["json"],
        cortex={
            "prompt": "Extract: product_name (string), price (float), availability (string)"
        }
    )

    data = response.json["cortex"]
    print(f"{competitor['name']}: {data['product_name']} @ ${data['price']} - {data['availability']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Wrap this in a scheduled job and store results in your database. When prices change, your pipeline detects the delta automatically. The &lt;a href="https://alterlab.io/docs" rel="noopener noreferrer"&gt;monitoring feature&lt;/a&gt; can also handle this natively by watching pages for content changes and pushing diffs to your webhook endpoint.&lt;/p&gt;
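&lt;p&gt;The delta check itself is a few lines. This is a sketch in plain Python; &lt;code&gt;detect_price_change&lt;/code&gt; is an illustration, not an SDK feature, and where the previous price comes from (a database row, a cache entry) is up to your pipeline.&lt;/p&gt;

```python
# Sketch of a price-delta check: compare the freshly scraped price against
# the last stored one and emit a change record only when something moved.
def detect_price_change(name, new_price, last_price):
    """Return a change record when the price moved, else None."""
    if last_price is None:
        return {"name": name, "change": "first_seen", "price": new_price}
    if new_price != last_price:
        return {
            "name": name,
            "change": "updated",
            "old": last_price,
            "new": new_price,
            "delta": round(new_price - last_price, 2),
        }
    return None

print(detect_price_change("Competitor A", 348.00, 329.99))
```

&lt;p&gt;Route non-None results to your alerting or storage step; a None result means nothing to write for that run.&lt;/p&gt;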
&lt;h2&gt;
  
  
  Error Handling
&lt;/h2&gt;

&lt;p&gt;AI extraction can fail when the page does not contain the requested data, the model cannot parse the structure, or the schema validation fails. Handle these cases explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

try:
    response = client.scrape(
        url="https://example.com/page",
        formats=["json"],
        cortex={"prompt": "Extract: email (string), phone (string)"}
    )

    if "error" in response.json.get("cortex", {}):
        print(f"Extraction failed: {response.json['cortex']['error']}")
    else:
        print(response.json["cortex"])
except alterlab.APIError as e:
    print(f"API error: {e.status_code} - {e.message}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Common errors include pages that require authentication, content behind CAPTCHAs that exceed your tier, and schemas with impossible constraints. The API returns structured error messages so you can retry, adjust your prompt, or skip the page.&lt;/p&gt;

&lt;h2&gt;Performance Considerations&lt;/h2&gt;

&lt;p&gt;AI extraction adds latency compared to raw HTML fetching. A typical request takes 3-8 seconds depending on page complexity and model load. For most pipelines, this is acceptable. Price monitoring, lead generation, and market research do not require sub-second responses.&lt;/p&gt;

&lt;p&gt;If you need speed, use a two-tier approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Fetch raw HTML with a basic tier (fast, cheap)&lt;/li&gt;
&lt;li&gt;Only escalate to AI extraction when the raw response is insufficient&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Set &lt;code&gt;min_tier&lt;/code&gt; in your request to skip lower tiers for known-difficult sites. This avoids the retry loop and gets you to the rendering tier on the first attempt.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://alterlab.io/pricing" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt; for current tier costs and rate limits.&lt;/p&gt;

&lt;h2&gt;Takeaway&lt;/h2&gt;

&lt;p&gt;CSS selectors tie your scraping logic to markup you do not control. AI extraction breaks that dependency. Describe the data you need, get back typed JSON, and stop maintaining selectors every time a site redesigns.&lt;/p&gt;

&lt;p&gt;Use AI extraction for dynamic pages, prototyping, and multi-site pipelines. Use selectors for stable, high-volume targets. Mix both in the same pipeline based on each site's characteristics.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://alterlab.io/docs/quickstart/installation" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; covers installation and your first request in under five minutes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>scraping</category>
      <category>python</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>Automate Web Scraping in n8n with AlterLab API</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Sat, 11 Apr 2026 10:35:25 +0000</pubDate>
      <link>https://dev.to/alterlab/automate-web-scraping-in-n8n-with-alterlab-api-4lj3</link>
      <guid>https://dev.to/alterlab/automate-web-scraping-in-n8n-with-alterlab-api-4lj3</guid>
      <description>&lt;h2&gt;
  
  
  Automate Web Scraping in n8n with AlterLab's API
&lt;/h2&gt;

&lt;p&gt;n8n is a workflow automation tool that connects APIs, databases, and services. Pair it with a scraping API that handles anti-bot bypass, proxy rotation, and headless rendering, and you get a pipeline that pulls structured data from any website on a schedule.&lt;/p&gt;

&lt;p&gt;This tutorial shows how to build that pipeline. You will configure an n8n workflow that sends scrape requests, receives clean JSON, and routes the data to a database, spreadsheet, or webhook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;An n8n instance (self-hosted or cloud)&lt;/li&gt;
&lt;li&gt;An API key from &lt;a href="https://alterlab.io/signup" rel="noopener noreferrer"&gt;alterlab.io/signup&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Basic familiarity with n8n's node-based workflow editor&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Configure the HTTP Request Node
&lt;/h2&gt;

&lt;p&gt;Create a new workflow in n8n. Add an HTTP Request node and configure it as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Method&lt;/strong&gt;: POST&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;URL&lt;/strong&gt;: &lt;code&gt;https://api.alterlab.io/v1/scrape&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Header Auth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Header Name&lt;/strong&gt;: &lt;code&gt;X-API-Key&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Header Value&lt;/strong&gt;: Your API key&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send Body&lt;/strong&gt;: JSON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Set the JSON body to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "https://example.com/products",
  "formats": ["json"],
  "min_tier": 3
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;min_tier&lt;/code&gt; parameter controls the scraping tier. Tier 3 enables JavaScript rendering. Set it higher for sites with aggressive bot detection. The &lt;a href="https://alterlab.io/anti-bot-bypass-api" rel="noopener noreferrer"&gt;anti-bot bypass&lt;/a&gt; system auto-escalates if the initial tier fails.&lt;/p&gt;

&lt;h2&gt;Step 2: Test with cURL First&lt;/h2&gt;

&lt;p&gt;Before building the full workflow, verify the endpoint works from your terminal. This isolates API issues from n8n configuration problems.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products", "formats": ["json"]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;A successful response returns structured data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "success",
  "data": {
    "products": [
      {"name": "Widget A", "price": 29.99},
      {"name": "Widget B", "price": 49.99}
    ]
  },
  "metadata": {
    "url": "https://example.com/products",
    "timestamp": "2026-04-11T10:30:00Z"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div data-infographic="try-it" data-url="https://example.com" data-description="Try scraping this page with AlterLab"&gt;&lt;/div&gt;

&lt;h2&gt;Step 3: Build the Full n8n Workflow&lt;/h2&gt;

&lt;p&gt;A production workflow needs more than a single HTTP request. You need error handling, data transformation, and a destination for the scraped data.&lt;/p&gt;

&lt;h3&gt;Workflow Structure&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Schedule Trigger] -&amp;gt; [HTTP Request (Scrape)] -&amp;gt; [Code (Parse)] -&amp;gt; [Database/Sheet/Webhook]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Add these nodes in order:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Schedule Trigger&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Set a cron expression for your scrape frequency. Daily at 6 AM UTC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 6 * * *
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. HTTP Request Node&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the configuration from Step 1. Enable "Continue On Fail" so one failed scrape does not block the entire workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Code Node (Data Transformation)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Parse the JSON response and extract the fields you need:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Access the HTTP Request output
# (n8n's Python code node exposes the previous node's items as _input)
response = _input.first().json

# Extract product data
products = response.get("data", {}).get("products", [])

# Transform to your schema
items = []
for product in products:
    items.append({
        "json": {
            "name": product["name"],
            "price": product["price"],
            "scraped_at": response["metadata"]["timestamp"],
            "source": response["metadata"]["url"]
        }
    })

return items
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;4. Destination Node&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Connect your output node. Common choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Postgres/MySQL&lt;/strong&gt;: Use the database node to upsert records&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Sheets&lt;/strong&gt;: Append rows for lightweight tracking&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Webhook&lt;/strong&gt;: Push to your own API or a Slack channel&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Step 4: Handle Multiple URLs&lt;/h2&gt;

&lt;p&gt;Scraping a single page is straightforward. Real pipelines scrape dozens or hundreds of URLs. Use n8n's Split Out node to fan out requests.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Code node that outputs multiple URLs
urls = [
    "https://example.com/products/page/1",
    "https://example.com/products/page/2",
    "https://example.com/products/page/3"
]

return [{"json": {"url": u}} for u in urls]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Connect this to a Split Out node, then to your HTTP Request node. Each URL becomes a separate execution branch. n8n processes them in parallel up to your concurrency limit.&lt;/p&gt;

&lt;p&gt;Add rate limiting between requests if the target site requires it. Use the Wait node between the Split Out and HTTP Request nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wait: 2 seconds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 5: Add Error Handling and Retries
&lt;/h2&gt;

&lt;p&gt;Scraping fails. Pages change structure, sites go down, anti-bot systems update. Your workflow should handle failures gracefully.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry Configuration
&lt;/h3&gt;

&lt;p&gt;In the HTTP Request node settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retry On Fail&lt;/strong&gt;: Enable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Max Retries&lt;/strong&gt;: 3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry Backoff&lt;/strong&gt;: Exponential&lt;/li&gt;
&lt;/ul&gt;
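&lt;p&gt;For scripts running outside n8n, the same retry policy (3 retries, exponential backoff) can be sketched in a few lines of Python. The &lt;code&gt;do_request&lt;/code&gt; callable is a placeholder for your actual HTTP call; inside n8n, the node settings above handle this for you.&lt;/p&gt;

```python
import time

# Sketch of retry-with-exponential-backoff: delays of 1s, 2s, 4s between
# attempts, then re-raise the last error. `do_request` is a placeholder
# for the real HTTP call.
def request_with_backoff(do_request, max_retries=3, base_delay=1.0):
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return do_request()
        except Exception as exc:  # narrow this to your HTTP client's error type
            last_error = exc
            if attempt == max_retries:
                break
            time.sleep(base_delay * (2 ** attempt))
    raise last_error
```

&lt;p&gt;Doubling the delay between attempts gives a struggling target (or a rate limiter) time to recover instead of hammering it at a fixed interval.&lt;/p&gt;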

&lt;h3&gt;
  
  
  Error Routing
&lt;/h3&gt;

&lt;p&gt;Add an error output branch from the HTTP Request node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[HTTP Request] --(success)--&amp;gt; [Parse] --&amp;gt; [Database]
       |
       --(error)--&amp;gt; [Error Handler] --&amp;gt; [Alert/Log]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error handler can log failures to a separate sheet, send a Slack notification, or queue the URL for a retry with a higher tier.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Capture failed URLs for retry (n8n Python code node)
from datetime import datetime

error_data = _input.first().json

failed_urls = []
failed_urls.append({
    "url": error_data.get("url"),
    "error": error_data.get("error"),
    "timestamp": datetime.utcnow().isoformat(),
    "retry_tier": 4  # escalate tier on retry
})

return [{"json": {"failed": failed_urls}}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;Step 6: Use Cortex AI for Structured Extraction&lt;/h2&gt;

&lt;p&gt;Some pages do not have clean HTML structures. Product listings buried in JavaScript, unstructured text, or dynamic content require a different approach. Cortex AI extracts structured data using natural language instructions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/reviews",
    "formats": ["json"],
    "cortex": {
      "prompt": "Extract reviewer name, rating (1-5), and review text from each review block"
    }
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The response returns data matching your schema:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "success",
  "data": {
    "reviews": [
      {
        "reviewer_name": "Jane D.",
        "rating": 5,
        "review_text": "Excellent product, fast shipping."
      },
      {
        "reviewer_name": "Mark S.",
        "rating": 4,
        "review_text": "Good quality, slightly overpriced."
      }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In n8n, the Cortex output works identically to standard JSON output. Route it through the same Code and Database nodes.&lt;/p&gt;

&lt;h2&gt;Step 7: Monitor and Alert on Changes&lt;/h2&gt;

&lt;p&gt;Scraping is not always about collecting new data. Sometimes you need to detect changes on existing pages: price drops, stock availability, competitor updates, regulatory filings.&lt;/p&gt;

&lt;p&gt;Configure monitoring by storing previous scrape results and comparing them on each run:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Compare current scrape with previous state (n8n Python code node)
current = _input.first().json
previous = get_previous_state(current["url"])  # your own lookup, e.g. from a database

changes = []
for key in current["data"]:
    if key not in previous:
        changes.append({"field": key, "action": "added", "value": current["data"][key]})
    elif current["data"][key] != previous[key]:
        changes.append({
            "field": key,
            "action": "changed",
            "old": previous[key],
            "new": current["data"][key]
        })

# Only pass through if changes detected
if changes:
    return [{"json": {"url": current["url"], "changes": changes}}]
return []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When changes exist, route to an alert node. When nothing changed, the workflow exits silently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Considerations
&lt;/h2&gt;

&lt;p&gt;Scraping pipelines can run expensive if you are not careful. A few practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively&lt;/strong&gt;: Do not re-scrape pages that have not changed. Store hashes of previous responses and skip identical results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the lowest tier that works&lt;/strong&gt;: Start with &lt;code&gt;min_tier: 1&lt;/code&gt; for static pages. Only escalate to tier 3+ for JavaScript-heavy sites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch URLs&lt;/strong&gt;: Group related URLs into single workflow runs rather than triggering separate workflows per URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set spend limits&lt;/strong&gt;: API keys support spend caps. Set them per workflow to prevent runaway costs.&lt;/li&gt;
&lt;/ul&gt;
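&lt;p&gt;The "cache aggressively" rule can be sketched with a content hash: fingerprint each response body and skip processing when the fingerprint matches the previous run. This is plain Python; &lt;code&gt;seen_hashes&lt;/code&gt; stands in for a persistent store such as a database table or key-value cache.&lt;/p&gt;

```python
import hashlib

# Skip-if-unchanged check: hash each response body and compare against the
# hash recorded for that URL on the previous run.
seen_hashes = {}  # stand-in for a persistent store

def is_unchanged(url, body):
    """Record the body's hash; return True when it matches the last run."""
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    if seen_hashes.get(url) == digest:
        return True
    seen_hashes[url] = digest
    return False

print(is_unchanged("https://example.com/products", "page v1"))  # prints: False
print(is_unchanged("https://example.com/products", "page v1"))  # prints: True
```

&lt;p&gt;Storing a 64-character digest instead of the full response keeps the comparison cheap even for large pages.&lt;/p&gt;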

&lt;p&gt;Check &lt;a href="https://alterlab.io/pricing" rel="noopener noreferrer"&gt;pricing&lt;/a&gt; for current rates. You pay for what you use with no monthly minimums.&lt;/p&gt;
&lt;h2&gt;
  
  
  Complete Workflow Example
&lt;/h2&gt;

&lt;p&gt;Here is the full n8n workflow JSON for a daily product price scrape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "Daily Price Scraper",
  "nodes": [
    {
      "name": "Schedule",
      "type": "n8n-nodes-base.scheduleTrigger",
      "parameters": {
        "rule": { "interval": ["days"], "triggerAtHour": 6 }
      }
    },
    {
      "name": "Scrape Products",
      "type": "n8n-nodes-base.httpRequest",
      "parameters": {
        "method": "POST",
        "url": "https://api.alterlab.io/v1/scrape",
        "authentication": "headerAuth",
        "body": {
          "url": "={{ $json.url }}",
          "formats": ["json"],
          "min_tier": 3
        },
        "options": {
          "retryOnFail": true,
          "maxTries": 3
        }
      }
    },
    {
      "name": "Parse Response",
      "type": "n8n-nodes-base.code",
      "parameters": {
        "jsCode": "const data = $input.first().json.body;\nreturn data.data.products.map(p =&amp;gt; ({ json: p }));"
      }
    },
    {
      "name": "Save to Database",
      "type": "n8n-nodes-base.postgres",
      "parameters": {
        "operation": "upsert",
        "table": "product_prices",
        "columns": "name,price,scraped_at"
      }
    }
  ],
  "connections": {
    "Schedule": { "main": [[{ "node": "Scrape Products", "type": "main" }]] },
    "Scrape Products": { "main": [[{ "node": "Parse Response", "type": "main" }]] },
    "Parse Response": { "main": [[{ "node": "Save to Database", "type": "main" }]] }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Import this into n8n via the workflow editor, replace the authentication credentials with your API key, and adjust the URL and database schema to match your use case.&lt;/p&gt;

&lt;h2&gt;Troubleshooting&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Empty responses&lt;/strong&gt;: The page may require a higher tier. Increase &lt;code&gt;min_tier&lt;/code&gt; to 4 or 5. Check the &lt;a href="https://alterlab.io/docs" rel="noopener noreferrer"&gt;API docs&lt;/a&gt; for tier descriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limit errors&lt;/strong&gt;: Add a Wait node between requests. Start with 1-2 seconds and increase if needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CAPTCHA blocks&lt;/strong&gt;: Set &lt;code&gt;min_tier: 5&lt;/code&gt; to enable CAPTCHA solving. This costs more per request but eliminates manual intervention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema drift&lt;/strong&gt;: Websites change their HTML structure. Cortex AI handles this better than CSS selectors since it uses semantic understanding. Switch to Cortex if your selectors break frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;n8n timeout&lt;/strong&gt;: Long-running scrapes can exceed n8n's execution timeout. For large batches, use the webhook pattern: configure AlterLab to push results to an n8n webhook URL instead of polling.&lt;/p&gt;

&lt;h2&gt;Takeaway&lt;/h2&gt;

&lt;p&gt;n8n handles orchestration. AlterLab handles extraction. Together they give you a scraping pipeline that runs on a schedule, handles failures, and delivers clean data to your systems.&lt;/p&gt;

&lt;p&gt;Start with a single URL and a basic HTTP Request node. Add error handling, multi-URL support, and change detection as your needs grow. The &lt;a href="https://alterlab.io/docs/quickstart/installation" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; covers API setup in under five minutes.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>scraping</category>
      <category>python</category>
      <category>api</category>
    </item>
    <item>
      <title>Feed Clean Web Data to RAG Pipelines Without Wasting LLM Tokens</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Sat, 04 Apr 2026 10:35:11 +0000</pubDate>
      <link>https://dev.to/alterlab/feed-clean-web-data-to-rag-pipelines-without-wasting-llm-tokens-125p</link>
      <guid>https://dev.to/alterlab/feed-clean-web-data-to-rag-pipelines-without-wasting-llm-tokens-125p</guid>
      <description>&lt;h1&gt;
  
  
  How to Feed Clean Web Data to RAG Pipelines Without Wasting 90% of Your LLM Tokens
&lt;/h1&gt;

&lt;p&gt;Raw HTML is the worst possible input for a RAG pipeline. A single product page carries 15,000 to 25,000 tokens of navigation chrome, analytics scripts, CSS classes, and ad placeholders. Your embedding model processes all of it. Your vector store stores all of it. Your retrieval step searches through all of it.&lt;/p&gt;

&lt;p&gt;You pay for every token.&lt;/p&gt;

&lt;p&gt;The fix is straightforward: extract only the content that matters before it reaches your embedding model. Strip the noise. Keep the signal. Structure it so retrieval actually works.&lt;/p&gt;

&lt;p&gt;Here is how to build that pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Token Math Behind Dirty Web Data
&lt;/h2&gt;

&lt;p&gt;A typical e-commerce product page breaks down like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Product title, description, specs: ~800 tokens&lt;/li&gt;
&lt;li&gt;Navigation menus, footer, sidebar: ~3,000 tokens&lt;/li&gt;
&lt;li&gt;JavaScript bundles, tracking pixels, ad scripts: ~8,000 tokens&lt;/li&gt;
&lt;li&gt;CSS class names, inline styles, layout divs: ~4,000 tokens&lt;/li&gt;
&lt;li&gt;Schema markup, meta tags, Open Graph: ~1,200 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your RAG pipeline cares about the first line. The rest is infrastructure for a browser, not context for a language model.&lt;/p&gt;

&lt;p&gt;When you embed raw HTML, the noise drowns out the signal. Two product pages with identical descriptions but different ad networks produce wildly different embeddings. Retrieval quality drops. You compensate by increasing chunk overlap and top-k results, which drives costs higher.&lt;/p&gt;

&lt;p&gt;Extract clean content first. Embed only what matters.&lt;/p&gt;
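&lt;p&gt;A quick sanity check on that ratio, using the illustrative per-category estimates from the list above (these are planning numbers, not measurements):&lt;/p&gt;

```python
# Per-category token estimates from the breakdown above (illustrative, not measured)
page_tokens = {
    "content": 800,      # product title, description, specs
    "navigation": 3000,  # menus, footer, sidebar
    "scripts": 8000,     # JS bundles, tracking pixels, ad scripts
    "styling": 4000,     # CSS class names, inline styles, layout divs
    "metadata": 1200,    # schema markup, meta tags, Open Graph
}

total = sum(page_tokens.values())
signal = page_tokens["content"] / total
print(f"total: {total} tokens, signal: {signal:.1%}")  # total: 17000 tokens, signal: 4.7%
```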

&lt;h2&gt;
  
  
  Step 1: Get Clean Content at the Source
&lt;/h2&gt;

&lt;p&gt;The most efficient place to strip noise is during extraction, not after. Fetching raw HTML and cleaning it locally means you still transfer the full page, parse the full DOM, and run your own selector logic. Doing it server-side through a scraping API cuts the work in half.&lt;/p&gt;

&lt;p&gt;Here is the same operation using the Python SDK and a direct cURL call. Both request Markdown output instead of raw HTML.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from alterlab import AlterLab

client = AlterLab(api_key="YOUR_API_KEY")
response = client.scrape(
    url="https://example.com/product/12345",
    formats=["markdown"]
)
print(response.markdown)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -d '{
    "url": "https://example.com/product/12345",
    "formats": ["markdown"]
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The response arrives as clean Markdown. No HTML tags. No script blocks. Just headings, paragraphs, lists, and code blocks in a format embedding models already understand.&lt;/p&gt;

&lt;p&gt;For sites that require JavaScript rendering, set &lt;code&gt;min_tier=3&lt;/code&gt; to skip the basic HTTP fetcher and go straight to a headless browser. The API handles Cloudflare challenges, CAPTCHAs, and rotating proxies automatically. You get the rendered content without managing browser instances.&lt;/p&gt;


  
  
  
  

&lt;h2&gt;
  
  
  Step 2: Structure Data for Retrieval, Not Display
&lt;/h2&gt;

&lt;p&gt;Markdown output works well for articles, documentation, and blog posts. But product pages, job listings, and pricing tables need structure. A flat text blob loses the relationships between fields.&lt;/p&gt;

&lt;p&gt;Use Cortex AI extraction to pull structured data directly from the page. You describe what you want in plain English. The API returns JSON.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="structured_extraction.py" {5-12}&lt;br&gt;
from alterlab import AlterLab&lt;/p&gt;

&lt;p&gt;client = AlterLab(api_key="YOUR_API_KEY")&lt;/p&gt;

&lt;p&gt;response = client.scrape(&lt;br&gt;
    url="&lt;a href="https://example.com/jobs" rel="noopener noreferrer"&gt;https://example.com/jobs&lt;/a&gt;",&lt;br&gt;
    cortex={&lt;br&gt;
        "prompt": "Extract all job listings. For each listing, return: title, department, location, salary_range, posting_date, and apply_url."&lt;br&gt;
    },&lt;br&gt;
    formats=["json"]&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;for job in response.json["listings"]:&lt;br&gt;
    print(f"{job['title']} - {job['location']} ({job['salary_range']})")&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The JSON output maps directly to your embedding pipeline. Each job listing becomes a single document with typed fields. You can embed the full record, or embed specific fields separately for hybrid search.

Compare this to the alternative: scraping raw HTML, writing CSS selectors for each site, parsing dates from inconsistent formats, and handling layout changes that break your selectors every few weeks. Cortex handles the variation. You get consistent JSON regardless of how the page renders.

## Step 3: Chunk Strategically

Clean content solves the noise problem. Chunking strategy solves the retrieval problem.

Bad chunking cuts sentences in half. It splits tables across chunks. It separates a heading from the paragraphs it governs. Your embedding model sees fragments without context, and retrieval returns partial matches.

Good chunking respects document structure. Markdown makes this straightforward.



```python title="chunker.py" {6-15}

from typing import List

def chunk_markdown(text: str, max_tokens: int = 500) -&amp;gt; List[str]:
    chunks = []
    sections = re.split(r'\n## ', text)

    for section in sections:
        if not section.strip():
            continue

        heading = ""
        if "\n" in section:
            heading, body = section.split("\n", 1)
        else:
            heading, body = section, ""

        current_chunk = f"## {heading}\n" if heading else ""

        paragraphs = body.split("\n\n")
        for para in paragraphs:
            if len(current_chunk) + len(para) &amp;gt; max_tokens * 4:
                chunks.append(current_chunk.strip())
                current_chunk = f"## {heading}\n" if heading else ""
            current_chunk += para + "\n\n"

        if current_chunk.strip():
            chunks.append(current_chunk.strip())

    return chunks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This approach keeps headings attached to their content. It respects paragraph boundaries. It produces chunks that embedding models can reason about as complete units.&lt;/p&gt;

&lt;p&gt;The token estimate uses a 4:1 character-to-token ratio for planning. Your embedding provider's tokenizer gives exact counts. Use that for production.&lt;/p&gt;
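&lt;p&gt;The planning heuristic fits in one function; swap in your provider's tokenizer for exact counts in production:&lt;/p&gt;

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Planning-only token estimate using the ~4:1 character-to-token ratio."""
    return max(1, len(text) // chars_per_token)

chunk = "## Pricing\nThe Pro plan costs $49 per month and includes 10,000 requests."
print(estimate_tokens(chunk))  # 18 (73 characters // 4)
```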
&lt;h2&gt;
  
  
  Step 4: Build the Ingestion Pipeline
&lt;/h2&gt;

&lt;p&gt;Tie extraction, cleaning, chunking, and embedding together. The pipeline should handle three scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Initial index&lt;/strong&gt;: Scrape a list of URLs, extract clean content, chunk, embed, store.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental update&lt;/strong&gt;: Monitor pages for changes. Re-extract and re-embed only what changed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled refresh&lt;/strong&gt;: Run on a cron to catch pages that changed without triggering monitoring alerts.
&lt;/li&gt;
&lt;/ol&gt;
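&lt;p&gt;Scenarios 2 and 3 both need a record of what has already been indexed. A minimal sketch of that registry, using a content hash to detect change (in-memory here; back it with a database table in production):&lt;/p&gt;

```python
import hashlib

registry = {}  # url -> {"hash": ...}; add category/last_indexed fields as needed

def content_changed(url: str, markdown: str) -> bool:
    """Return True if this URL's content differs from the last indexed version."""
    digest = hashlib.sha256(markdown.encode("utf-8")).hexdigest()
    previous = registry.get(url)
    if previous and previous["hash"] == digest:
        return False  # unchanged: skip re-embedding
    registry[url] = {"hash": digest}
    return True

print(content_changed("https://example.com/a", "# Hello"))   # True  (first index)
print(content_changed("https://example.com/a", "# Hello"))   # False (unchanged)
print(content_changed("https://example.com/a", "# Hello!"))  # True  (changed)
```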

&lt;p&gt;```python title="pipeline.py" {8-10,18-22}&lt;br&gt;
from alterlab import AlterLab&lt;br&gt;
from datetime import datetime&lt;/p&gt;

&lt;p&gt;client = AlterLab(api_key="YOUR_API_KEY")&lt;/p&gt;

&lt;p&gt;def ingest_page(url: str, embedding_fn):&lt;br&gt;
    response = client.scrape(&lt;br&gt;
        url=url,&lt;br&gt;
        formats=["markdown"],&lt;br&gt;
        min_tier=3&lt;br&gt;
    )&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if not response.markdown:
    return

chunks = chunk_markdown(response.markdown)

for i, chunk in enumerate(chunks):
    vector = embedding_fn(chunk)
    store_vector(url, i, chunk, vector, datetime.utcnow())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;def ingest_batch(urls: list, embedding_fn):&lt;br&gt;
    for url in urls:&lt;br&gt;
        try:&lt;br&gt;
            ingest_page(url, embedding_fn)&lt;br&gt;
        except Exception as e:&lt;br&gt;
            print(f"Failed {url}: {e}")&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


For incremental updates, use the monitoring feature. Set up watchers on your indexed URLs. When content changes, the API notifies you via webhook. You re-run `ingest_page` for that URL only. No full re-index required.



```python title="monitoring_setup.py" {4-9}
client.monitor(
    url="https://example.com/pricing",
    schedule="0 9 * * 1",
    webhook="https://your-server.com/webhooks/alterlab",
    diff=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The webhook payload includes a diff showing what changed. You can decide whether the change warrants a re-embedding. A price update does. A typo fix in the footer does not.&lt;/p&gt;
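&lt;p&gt;A sketch of that decision step, assuming the webhook payload carries a &lt;code&gt;diff&lt;/code&gt; object with &lt;code&gt;added&lt;/code&gt; and &lt;code&gt;removed&lt;/code&gt; lines (the field names here are illustrative; check the API docs for the exact schema):&lt;/p&gt;

```python
# Hypothetical payload shape: {"url": ..., "diff": {"added": [...], "removed": [...]}}
IGNORE_MARKERS = ("footer", "copyright", "cookie")  # boilerplate changes we never re-embed for

def should_reembed(payload: dict) -> bool:
    diff = payload.get("diff", {})
    changed = diff.get("added", []) + diff.get("removed", [])
    meaningful = [
        line for line in changed
        if line.strip() and not any(m in line.lower() for m in IGNORE_MARKERS)
    ]
    return len(meaningful) > 0

payload = {"url": "https://example.com/pricing",
           "diff": {"added": ["Pro plan: $59/month"], "removed": ["Pro plan: $49/month"]}}
print(should_reembed(payload))  # True: a price change warrants re-embedding
```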


  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Tokens per Page&lt;/th&gt;
&lt;th&gt;Retrieval Quality&lt;/th&gt;
&lt;th&gt;Maintenance&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
&lt;td&gt;Raw HTML&lt;/td&gt;
&lt;td&gt;15,000-25,000&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High (selector breaks)&lt;/td&gt;
&lt;/tr&gt;
      &lt;tr&gt;
&lt;td&gt;Local HTML cleaning&lt;/td&gt;
&lt;td&gt;5,000-8,000&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;High (DOM changes)&lt;/td&gt;
&lt;/tr&gt;
      &lt;tr&gt;
&lt;td&gt;Server-side Markdown&lt;/td&gt;
&lt;td&gt;1,500-3,000&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low (handled by API)&lt;/td&gt;
&lt;/tr&gt;
      &lt;tr&gt;
&lt;td&gt;Cortex JSON extraction&lt;/td&gt;
&lt;td&gt;200-800&lt;/td&gt;
&lt;td&gt;Highest&lt;/td&gt;
&lt;td&gt;Lowest (AI adapts)&lt;/td&gt;
&lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 5: Handle Anti-Bot Pages Without Infrastructure
&lt;/h2&gt;

&lt;p&gt;Many sites you want to index block automated requests. Cloudflare challenges, CAPTCHAs, rate limits. Managing bypass logic yourself means running browser instances, solving CAPTCHAs through third-party services, rotating proxies, and handling fingerprinting.&lt;/p&gt;

&lt;p&gt;That infrastructure costs more than the scraping itself.&lt;/p&gt;

&lt;p&gt;Use tiered scraping to handle this automatically. Start with a lightweight HTTP request. If the site blocks it, the API escalates to a headless browser with anti-bot bypass. You set the floor with &lt;code&gt;min_tier&lt;/code&gt; to skip the probing phase for sites you know are protected.&lt;/p&gt;

&lt;p&gt;```python title="tiered_scraping.py" {4-7}&lt;br&gt;
response = client.scrape(&lt;br&gt;
    url="&lt;a href="https://protected-site.com/data" rel="noopener noreferrer"&gt;https://protected-site.com/data&lt;/a&gt;",&lt;br&gt;
    min_tier=3,&lt;br&gt;
    formats=["markdown"]&lt;br&gt;
)&lt;br&gt;
print(response.status)&lt;br&gt;
print(response.markdown[:500])&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


Tier 1 handles simple static pages. Tier 3 adds JavaScript rendering and anti-bot bypass. Tier 5 includes CAPTCHA solving. The API picks the right tier for each URL. You get clean content regardless of what stands between you and the data.

&amp;lt;div data-infographic="try-it" data-url="https://alterlab.io/docs" data-description="Try extracting clean Markdown from this documentation page"&amp;gt;&amp;lt;/div&amp;gt;

## Cost Breakdown

Token waste compounds across three stages of a RAG pipeline:

**Embedding**: You pay per token sent to the embedding model. Feeding 20,000 tokens of raw HTML instead of 2,000 tokens of clean Markdown costs 10x more per page. Index 10,000 pages and the difference is measurable.

**Storage**: Vector databases charge by dimension count and record volume. Storing embeddings for noise chunks wastes space. It also degrades query performance as the index grows with low-signal vectors.

**Retrieval**: Each query searches the entire index. A bloated index with noisy chunks returns worse results. You compensate by fetching more candidates (higher top-k), which increases the context window for your generation model. That costs more per query.

Clean extraction at the source addresses all three. Smaller chunks. Better embeddings. Faster retrieval. Lower generation costs because the context window contains relevant content, not navigation footers.

## When to Use Each Output Format

**Markdown**: Articles, documentation, blog posts, help centers. Any page where the content flows as prose with headings and lists. This is your default for knowledge base ingestion.

**JSON with Cortex**: Product catalogs, job boards, pricing tables, real estate listings. Any page with repeating structured elements. The AI extraction handles layout variation across sites without custom selectors.

**Plain text**: Simple pages with minimal formatting. API response pages. Status pages. Use it when you want the smallest possible output and document structure does not matter for retrieval.

**HTML**: Rarely. Only when you need to preserve specific formatting that Markdown cannot represent, like complex tables with merged cells or embedded SVG diagrams. Most RAG pipelines do not need this.

## Putting It Together

A production RAG ingestion pipeline looks like this:

1. Maintain a URL registry with metadata (category, last indexed, change hash).
2. On schedule or webhook trigger, scrape each URL with `formats=["markdown"]` or Cortex extraction.
3. Chunk the output using structure-aware splitting.
4. Embed chunks and upsert into your vector store with URL and timestamp metadata.
5. Monitor URLs for changes. Re-index only what changed.

The scraping layer handles rendering, anti-bot bypass, and format conversion. Your pipeline handles chunking, embedding, and storage. Clean separation. Each layer does one job well.

Check the [Python SDK documentation](https://alterlab.io/web-scraping-api-python) for the full API reference, including webhook configuration and scheduling options. The [quickstart guide](https://alterlab.io/docs/quickstart/installation) covers account setup and your first API call.

## Takeaway

Raw HTML wastes tokens on infrastructure code that embedding models cannot use. Extract clean Markdown or structured JSON before the content reaches your pipeline. Chunk with respect to document boundaries. Monitor for changes and re-index incrementally.

The result: 85 to 90 percent fewer tokens per page, better retrieval accuracy, and lower costs at every stage of the RAG pipeline. The scraping API handles rendering and anti-bot bypass. You handle the data.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>dataextraction</category>
      <category>api</category>
    </item>
    <item>
      <title>Build a Web Scraping Pipeline with n8n and AlterLab</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Tue, 31 Mar 2026 10:36:52 +0000</pubDate>
      <link>https://dev.to/alterlab/build-a-web-scraping-pipeline-with-n8n-and-alterlab-3g3j</link>
      <guid>https://dev.to/alterlab/build-a-web-scraping-pipeline-with-n8n-and-alterlab-3g3j</guid>
      <description>&lt;p&gt;n8n is a workflow automation platform built around HTTP nodes, visual routing, and an in-process JavaScript runtime. When you pair it with AlterLab — a scraping API that handles anti-bot detection, headless rendering, and proxy rotation — you get a complete data extraction pipeline without managing browser pools, proxy credentials, or retry logic from scratch.&lt;/p&gt;

&lt;p&gt;This tutorial builds a production-ready pipeline: URL inputs → scraping API → HTML parsing → structured storage, driven by a cron schedule with proper error handling.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;n8n instance (self-hosted via Docker or n8n Cloud)&lt;/li&gt;
&lt;li&gt;API key — &lt;a href="https://alterlab.io/docs/quickstart/installation" rel="noopener noreferrer"&gt;follow the quickstart guide&lt;/a&gt; to get one in under two minutes&lt;/li&gt;
&lt;li&gt;Familiarity with n8n's workflow editor and basic JavaScript&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Step 1: Store the API Key in n8n Credentials
&lt;/h2&gt;

&lt;p&gt;Never hardcode secrets into HTTP Request nodes. Go to &lt;strong&gt;Settings → Credentials → Add Credential → Header Auth&lt;/strong&gt; and fill in:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Scraping API Key&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Header Name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;X-API-Key&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Header Value&lt;/td&gt;
&lt;td&gt;&lt;code&gt;YOUR_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reference this credential in every HTTP Request node in the workflow. Rotating the key means updating one credential, not hunting through nodes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Configure the HTTP Request Node
&lt;/h2&gt;

&lt;p&gt;Drop an &lt;strong&gt;HTTP Request&lt;/strong&gt; node into the canvas. Set &lt;strong&gt;Method&lt;/strong&gt; to &lt;code&gt;POST&lt;/code&gt;, &lt;strong&gt;URL&lt;/strong&gt; to &lt;code&gt;https://api.alterlab.io/v1/scrape&lt;/code&gt;, authenticate with the credential created above, and set &lt;strong&gt;Body Content Type&lt;/strong&gt; to JSON.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "render_js": false,
  "premium_proxy": false,
  "country": "us",
  "timeout": 30000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For targets protected by Cloudflare, Akamai, or PerimeterX, set &lt;code&gt;render_js: true&lt;/code&gt; and &lt;code&gt;premium_proxy: true&lt;/code&gt;. The &lt;a href="https://alterlab.io/anti-bot-bypass-api" rel="noopener noreferrer"&gt;anti-bot bypass&lt;/a&gt; layer handles TLS fingerprinting, browser emulation, and CAPTCHA solving transparently, with no extra configuration on your end.&lt;/p&gt;

&lt;p&gt;The same request in cURL for testing before wiring into n8n:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://books.toscrape.com/catalogue/page-1.html",
    "render_js": false,
    "premium_proxy": false
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The equivalent single-URL call in Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import httpx

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.alterlab.io/v1/scrape"

def scrape(url: str, render_js: bool = False) -&amp;gt; dict:
    with httpx.Client() as client:                    # synchronous single fetch
        r = client.post(
            BASE_URL,
            headers={"X-API-Key": API_KEY},
            json={"url": url, "render_js": render_js},
            timeout=30.0,
        )
        r.raise_for_status()
        return r.json()

result = scrape("https://books.toscrape.com/catalogue/page-1.html")
print(result["status_code"], result["elapsed_ms"], "ms")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The API response shape:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "success": true,
  "status_code": 200,
  "url": "https://books.toscrape.com/catalogue/page-1.html",
  "html": "&amp;lt;!DOCTYPE html&amp;gt;...",
  "elapsed_ms": 712
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 3: Parse HTML in the Code Node
&lt;/h2&gt;

&lt;p&gt;Add a &lt;strong&gt;Code&lt;/strong&gt; node immediately after the HTTP Request. n8n bundles Cheerio in its runtime — use it to walk the DOM and emit structured records.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const { load } = require('cheerio');

const results = [];

for (const item of $input.all()) {
  const $ = load(item.json.html);

  $('article.product_pod').each((_, el) =&amp;gt; {        // iterate product cards
    const title   = $(el).find('h3 a').attr('title');
    const price   = $(el).find('.price_color').text().trim();
    const rating  = $(el).find('p.star-rating').attr('class')?.split(' ')[1];
    const relHref = $(el).find('h3 a').attr('href');

    results.push({                                    // emit flat record
      title,
      price,
      rating,
      url: `https://books.toscrape.com/catalogue/${relHref}`,
      scraped_at: new Date().toISOString(),
    });
  });
}

return results.map(r =&amp;gt; ({ json: r }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For targets that return JSON from an XHR endpoint (scraped through the proxy), skip Cheerio and parse directly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const raw = $input.first().json.html;
const data = JSON.parse(raw);            // html field contains the raw JSON string
return data.products.map(p =&amp;gt; ({ json: p }));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If Cheerio is missing in a self-hosted setup, run &lt;code&gt;npm install cheerio&lt;/code&gt; in the n8n working directory and restart the service.&lt;/p&gt;




  
  
  
  



&lt;h2&gt;
  
  
  Step 4: Scrape Multiple Pages
&lt;/h2&gt;

&lt;p&gt;Use a Code node to generate a URL list, then feed it through &lt;strong&gt;Split In Batches&lt;/strong&gt; → HTTP Request:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const BASE  = 'https://books.toscrape.com/catalogue/page-';
const PAGES = 50;

const urls = Array.from(                        // generate range of page URLs
  { length: PAGES },
  (_, i) =&amp;gt; ({ json: { url: `${BASE}${i + 1}.html` } })
);

return urls;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set &lt;strong&gt;Split In Batches&lt;/strong&gt; to a batch size of 5 to avoid hammering the target. The HTTP Request node processes each batch item as a separate request automatically.&lt;/p&gt;

&lt;p&gt;For high-volume pipelines where n8n acts as the orchestrator and Python handles the heavy lifting, use async fan-out:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
import httpx

API_KEY  = "YOUR_API_KEY"
ENDPOINT = "https://api.alterlab.io/v1/scrape"

async def fetch(client: httpx.AsyncClient, url: str) -&amp;gt; dict:
    r = await client.post(
        ENDPOINT,
        headers={"X-API-Key": API_KEY},
        json={"url": url, "render_js": False},
        timeout=30.0,
    )
    r.raise_for_status()
    return r.json()

async def scrape_batch(urls: list[str]) -&amp;gt; list[dict]:  # fan-out entry point
    async with httpx.AsyncClient() as client:           # single connection pool
        tasks   = [fetch(client, u) for u in urls]      # build coroutine list
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return results

if __name__ == "__main__":
    pages = [f"https://books.toscrape.com/catalogue/page-{i}.html" for i in range(1, 11)]
    data  = asyncio.run(scrape_batch(pages))

    for i, result in enumerate(data):
        if isinstance(result, Exception):
            print(f"Page {i+1} failed: {result}")
        else:
            print(f"Page {i+1}: {len(result['html']):,} bytes — {result['elapsed_ms']}ms")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The &lt;a href="https://alterlab.io/web-scraping-api-python" rel="noopener noreferrer"&gt;Python scraping API client&lt;/a&gt; wraps this pattern with built-in retry logic, concurrency throttling, and typed responses — worth switching to once you move beyond prototyping.&lt;/p&gt;
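&lt;p&gt;Until you make that switch, a small retry wrapper over the raw call covers the most common failure mode. This is an illustrative sketch, not the SDK's actual implementation:&lt;/p&gt;

```python
import time

def with_retries(fetch_fn, url, attempts=3, backoff_s=2.0):
    """Call fetch_fn(url), retrying on any exception with linear backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch_fn(url)
        except Exception as exc:  # narrow to httpx.HTTPError in real code
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))
    raise last_error

# Usage with the scrape() helper defined earlier:
# result = with_retries(scrape, "https://books.toscrape.com/catalogue/page-1.html")
```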


&lt;h2&gt;
  
  
  Step 5: Route Data to Storage
&lt;/h2&gt;

&lt;p&gt;Wire the Code node output to whichever storage node fits your stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Postgres&lt;/strong&gt; — recommended for structured pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node: &lt;strong&gt;Postgres&lt;/strong&gt;, Operation: &lt;strong&gt;Insert&lt;/strong&gt;, Table: &lt;code&gt;scraped_books&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Map &lt;code&gt;title&lt;/code&gt;, &lt;code&gt;price&lt;/code&gt;, &lt;code&gt;rating&lt;/code&gt;, &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;scraped_at&lt;/code&gt; directly from Code node output fields&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Google Sheets&lt;/strong&gt; — minimal setup for low-volume runs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node: &lt;strong&gt;Google Sheets&lt;/strong&gt;, Operation: &lt;strong&gt;Append or Update&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Same column mapping&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Webhook forward&lt;/strong&gt; — for downstream microservices or event buses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "source": "n8n-book-scraper",
  "run_id": "{{ $execution.id }}",
  "count": 20,
  "records": [
    { "title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "url": "..." }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 6: Schedule and Add Error Handling
&lt;/h2&gt;

&lt;p&gt;Swap the manual trigger for a &lt;strong&gt;Schedule Trigger&lt;/strong&gt; node before going to production.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Cadence&lt;/th&gt;
&lt;th&gt;Cron Expression&lt;/th&gt;
&lt;th&gt;Typical Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hourly&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 * * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Price monitoring&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Daily 06:00 UTC&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 6 * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;News/content aggregation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Every 15 minutes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;*/15 * * * *&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Inventory feeds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Weekdays 09:00 UTC&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0 9 * * 1-5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;B2B lead enrichment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For event-driven scraping — e.g., new URLs inserted into a database — replace the Schedule Trigger with a &lt;strong&gt;Postgres Trigger&lt;/strong&gt; node watching for new rows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error handling — configure before going live:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;HTTP Request node → enable &lt;strong&gt;Retry On Fail&lt;/strong&gt;: 3 retries, 2000ms backoff&lt;/li&gt;
&lt;li&gt;Code node → enable &lt;strong&gt;Continue On Fail&lt;/strong&gt; if partial runs are acceptable&lt;/li&gt;
&lt;li&gt;In &lt;strong&gt;Settings → Error Workflow&lt;/strong&gt;, assign a dedicated workflow that captures and routes failures:&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Runs inside the error workflow's Code node
const err = $input.first().json;

return [{
  json: {
    workflow:     err.workflow?.name,
    node:         err.execution?.lastNodeExecuted,   // which node threw
    message:      err.execution?.error?.message,
    failed_at:    new Date().toISOString(),
    execution_id: err.execution?.id,
  }
}];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Route the output to a Postgres &lt;code&gt;scrape_errors&lt;/code&gt; table or a Slack node. Silent failures are harder to diagnose than loud ones.&lt;/p&gt;





&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
    &lt;thead&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
        &lt;th&gt;Approach&lt;/th&gt;
&lt;br&gt;
        &lt;th&gt;Anti-Bot Handling&lt;/th&gt;
&lt;br&gt;
        &lt;th&gt;Setup Time&lt;/th&gt;
&lt;br&gt;
        &lt;th&gt;Maintenance&lt;/th&gt;
&lt;br&gt;
        &lt;th&gt;Scaling&lt;/th&gt;
&lt;br&gt;
        &lt;th&gt;Cost Model&lt;/th&gt;
&lt;br&gt;
      &lt;/tr&gt;
&lt;br&gt;
    &lt;/thead&gt;
&lt;br&gt;
    &lt;tbody&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
        &lt;td&gt;DIY Playwright + Proxies&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Manual (fingerprinting, stealth)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Days–weeks&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;High (browser updates, proxy churn)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Complex (concurrency, queueing)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Infrastructure + proxy fees&lt;/td&gt;
&lt;br&gt;
      &lt;/tr&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
        &lt;td&gt;n8n + Scraping API&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Automatic (TLS, CAPTCHA, headers)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;&amp;lt;1 hour&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Low (API versioned separately)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Batch nodes + API concurrency&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Per successful request&lt;/td&gt;
&lt;br&gt;
      &lt;/tr&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
        &lt;td&gt;Commercial ETL (Apify, etc.)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Varies by actor&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Minutes (pre-built actors)&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Low but opaque&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Platform-managed&lt;/td&gt;
&lt;br&gt;
        &lt;td&gt;Platform subscription + compute&lt;/td&gt;
&lt;br&gt;
      &lt;/tr&gt;
&lt;br&gt;
    &lt;/tbody&gt;
&lt;br&gt;
  &lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Monitoring Pipeline Health
&lt;/h2&gt;

&lt;p&gt;Don't rely solely on n8n's execution log. Instrument your pipeline explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log &lt;code&gt;success: false&lt;/code&gt; responses&lt;/strong&gt; from the scraping API to a monitoring table — the API returns this field even on 200 responses if the target blocked the request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store &lt;code&gt;elapsed_ms&lt;/code&gt; per run&lt;/strong&gt; in a &lt;code&gt;scrape_metrics&lt;/code&gt; table; an upward trend signals proxy pool degradation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Row count guard&lt;/strong&gt; — after the storage node, add a Code node that alerts if &lt;code&gt;results.length &amp;lt; EXPECTED_MINIMUM&lt;/code&gt;:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const MINIMUM = 15; // expect at least 15 records per page

const count = $input.all().length;

if (count &amp;lt; MINIMUM) {                        // trigger alert path
  throw new Error(`Low yield: got ${count}, expected &amp;gt;= ${MINIMUM}`);
}

return $input.all(); // pass through if OK
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Place this node between the Code parser and the storage node. When it throws, n8n's error workflow catches it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;n8n's HTTP Request node integrates with any REST scraping API in minutes — no custom nodes required&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;render_js: true&lt;/code&gt; selectively; static fetches are faster and cheaper than headless browser requests&lt;/li&gt;
&lt;li&gt;Keep parsing logic inside the Code node to maintain self-contained, debuggable workflows&lt;/li&gt;
&lt;li&gt;Cheerio handles the majority of HTML extraction cases; fall back to a dedicated parser service only for complex XPath requirements&lt;/li&gt;
&lt;li&gt;Configure retries on the HTTP node and a global error workflow before scheduling — silent data loss compounds across runs&lt;/li&gt;
&lt;li&gt;For event-driven ingestion triggered by new URLs in a queue or database, swap the Schedule Trigger for a Postgres Trigger or AMQP node without changing the rest of the workflow&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>automation</category>
      <category>dataextraction</category>
      <category>api</category>
      <category>scraping</category>
    </item>
    <item>
      <title>How to Scrape Glassdoor: Complete Guide for 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Sat, 28 Mar 2026 14:25:57 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-scrape-glassdoor-complete-guide-for-2026-5434</link>
      <guid>https://dev.to/alterlab/how-to-scrape-glassdoor-complete-guide-for-2026-5434</guid>
      <description>&lt;h1&gt;
  
  
  How to Scrape Glassdoor: Complete Guide for 2026
&lt;/h1&gt;

&lt;p&gt;Glassdoor exposes salary data, company reviews, and job listings that are genuinely useful for compensation benchmarking, recruiting analysis, and labor market research. The catch: Glassdoor runs Cloudflare, gates salary data behind authentication, and renders all meaningful content client-side with React. This guide cuts through those obstacles with working Python code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Scrape Glassdoor?
&lt;/h2&gt;

&lt;p&gt;Three use cases that justify the engineering effort:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compensation benchmarking&lt;/strong&gt; — HR teams and SaaS products aggregate salary ranges by role, level, location, and company size. Glassdoor's crowdsourced compensation data is one of the richest publicly accessible sources for this kind of analysis. Refreshing it weekly catches market shifts before they show up in annual survey reports.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitive talent intelligence&lt;/strong&gt; — Track hiring velocity at competitors. Which roles are they posting? How quickly are positions closing? Job listing volume is a reliable leading indicator of engineering and product priorities six to nine months out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Employer brand monitoring&lt;/strong&gt; — Tracking review sentiment over time — overall ratings, CEO approval, interview difficulty scores — gives recruiting teams early warning of culture problems before they surface as churn events. Companies also benchmark their own standing against direct competitors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-Bot Challenges on glassdoor.com
&lt;/h2&gt;

&lt;p&gt;Glassdoor deploys several overlapping protections that make DIY scraping expensive to maintain:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloudflare WAF and Bot Management&lt;/strong&gt; — Glassdoor sits behind Cloudflare's bot management layer. A standard Python &lt;code&gt;requests&lt;/code&gt; call receives a JS challenge page requiring a valid &lt;code&gt;cf_clearance&lt;/code&gt; cookie before any real HTML is served. This blocks virtually every naive scraper immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Login wall for salary data&lt;/strong&gt; — Salary ranges and detailed compensation breakdowns are gated behind authentication. Unauthenticated sessions see truncated results or get redirected to a signup modal. Full data access requires managing authenticated sessions with valid cookies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Client-side rendering&lt;/strong&gt; — Job listings, reviews, and salary cards are all React components. The initial HTML response from Glassdoor's server is a near-empty shell. You need a JavaScript runtime to execute the page and produce actual content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser fingerprinting and behavioral detection&lt;/strong&gt; — Glassdoor combines static browser fingerprinting with behavioral signals (scroll cadence, mouse movement, click timing) to identify headless browsers. Playwright and Puppeteer with default configurations are reliably flagged within a few page loads.&lt;/p&gt;

&lt;p&gt;Maintaining your own bypass stack — refreshing &lt;code&gt;cf_clearance&lt;/code&gt; cookies, managing residential proxy pools, spoofing browser fingerprints — is a real ongoing engineering commitment. AlterLab's &lt;a href="https://dev.to/anti-bot-bypass-api"&gt;Anti-bot bypass API&lt;/a&gt; handles all of this at the infrastructure level, so your scraping code stays focused on data extraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start with AlterLab API
&lt;/h2&gt;

&lt;p&gt;Install the SDK and you can make your first Glassdoor request in under a minute. See the &lt;a href="https://dev.to/docs/quickstart/installation"&gt;getting started guide&lt;/a&gt; for full environment setup, including API key management and optional async configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install alterlab beautifulsoup4 lxml
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# scrape_glassdoor.py
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

# Scrape a Glassdoor job search results page
response = client.scrape(
    "https://www.glassdoor.com/Job/python-developer-jobs-SRCH_KO0,16.htm",
    render_js=True,                          # Required: Glassdoor is a React SPA
    wait_for="[data-test='jobListing']",     # Wait for job cards before returning
)

soup = BeautifulSoup(response.html, "html.parser")
job_cards = soup.select("[data-test='jobListing']")
print(f"Found {len(job_cards)} job listings")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;The equivalent cURL call for testing from a shell or integrating with non-Python pipelines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.glassdoor.com/Job/python-developer-jobs-SRCH_KO0,16.htm",
    "render_js": true,
    "wait_for": "[data-test=\"jobListing\"]"
  }'
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div data-infographic="stats"&gt;
  &lt;div data-stat data-value="99.1%" data-label="Glassdoor Success Rate"&gt;&lt;/div&gt;
  &lt;div data-stat data-value="2.4s" data-label="Avg JS Render Time"&gt;&lt;/div&gt;
  &lt;div data-stat data-value="100%" data-label="Cloudflare Bypass Rate"&gt;&lt;/div&gt;
  &lt;div data-stat data-value="0ms" data-label="Proxy Setup Time"&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;h2&gt;
  
  
  Extracting Structured Data
&lt;/h2&gt;

&lt;p&gt;With fully rendered HTML in hand, here is how to pull the most useful data points from Glassdoor's DOM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Job Listings
&lt;/h3&gt;

&lt;p&gt;Glassdoor uses &lt;code&gt;data-test&lt;/code&gt; attributes on stable semantic elements — always prefer these over generated class names, which change with every React build deployment.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# parse_jobs.py
from bs4 import BeautifulSoup

def parse_job_listings(html: str) -&amp;gt; list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    jobs = []

    for card in soup.select("[data-test='jobListing']"):
        def text(selector):
            el = card.select_one(selector)
            return el.get_text(strip=True) if el else None

        jobs.append({
            "title":    text("[data-test='job-title']"),
            "company":  text("[data-test='employer-name']"),
            "location": text("[data-test='emp-location']"),
            "salary":   text("[data-test='detailSalary']"),
            "rating":   text("[data-test='rating']"),
            "age":      text("[data-test='job-age']"),
        })

    return jobs
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h3&gt;
  
  
  Company Reviews
&lt;/h3&gt;

&lt;p&gt;Review pages are paginated at 10 entries per page. The &lt;code&gt;_IP{n}&lt;/code&gt; path segment in the URL controls the page number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# parse_reviews.py
import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")

def scrape_company_reviews(company_slug: str, pages: int = 5) -&amp;gt; list[dict]:
    """
    company_slug: e.g. 'Google' (as it appears in the Glassdoor URL)
    """
    reviews = []
    slug_len = len(company_slug)

    for page in range(1, pages + 1):
        url = (
            f"https://www.glassdoor.com/Reviews/{company_slug}-reviews"
            f"-SRCH_KE0,{slug_len}_IP{page}.htm"
        )
        response = client.scrape(url, render_js=True, wait_for="[data-test='review']")
        soup = BeautifulSoup(response.html, "html.parser")

        for review in soup.select("[data-test='review']"):
            def text(selector):
                el = review.select_one(selector)
                return el.get_text(strip=True) if el else None

            reviews.append({
                "headline": text("[data-test='review-title']"),
                "rating":   text("[data-test='overall-rating']"),
                "pros":     text("[data-test='pros']"),
                "cons":     text("[data-test='cons']"),
                "date":     text("[data-test='review-date']"),
                "role":     text("[data-test='author-jobTitle']"),
            })

    return reviews
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### Salary Data

Salary pages require an authenticated session. Pass `glassdoor_session` and `tguid` cookies obtained from a logged-in browser profile. The API accepts a `headers` dict for this purpose:



```python title="parse_salaries.py" {5-12}

client = alterlab.Client("YOUR_API_KEY")

response = client.scrape(
    "https://www.glassdoor.com/Salaries/software-engineer-salary-SRCH_KO0,17.htm",
    render_js=True,
    headers={
        "Cookie": "JSESSIONID=YOUR_SESSION_ID; tguid=YOUR_TGUID"
    },
    wait_for="[data-test='salaryRow']",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Key selectors once authenticated: &lt;code&gt;[data-test='salaryRow']&lt;/code&gt; for each salary entry, &lt;code&gt;[data-test='salary-estimate']&lt;/code&gt; for the reported range, and &lt;code&gt;[data-test='total-compensation']&lt;/code&gt; for the total comp breakdown.&lt;/p&gt;
&lt;h2&gt;
  
  
  Common Pitfalls
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Per-IP rate limiting&lt;/strong&gt; — Glassdoor throttles at the IP level, not just by User-Agent. Exceeding roughly 25–30 requests per minute from a single IP triggers 429 responses or silent result degradation, where fewer listings are returned without any error signal. Distributed requests across rotating proxies are required for sustained collection.&lt;/p&gt;
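&lt;p&gt;A minimal client-side pacing sketch for a single-IP worker. The 25-per-minute budget mirrors the threshold above; tune it per target:&lt;/p&gt;

```python
import time

class RequestPacer:
    """Client-side pacing to stay under a per-IP request budget."""

    def __init__(self, max_per_minute=25):
        self.interval = 60.0 / max_per_minute   # seconds between requests
        self.next_slot = 0.0

    def wait(self):
        now = time.monotonic()
        # Sleep until the next free slot, then reserve the slot after it
        delay = max(0.0, self.next_slot - now)
        time.sleep(delay)
        self.next_slot = max(now, self.next_slot) + self.interval

pacer = RequestPacer(max_per_minute=25)
# Call pacer.wait() before each scrape on a single-IP worker
```

For multi-worker setups, each proxy IP needs its own pacer instance.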

&lt;p&gt;&lt;strong&gt;Session expiry on gated content&lt;/strong&gt; — Glassdoor sessions expire within a few hours. For pipelines that scrape salary or authenticated review data, implement cookie refresh logic. Detect redirects to &lt;code&gt;/profile/login&lt;/code&gt; as the signal that your session has expired and re-authenticate before continuing.&lt;/p&gt;
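&lt;p&gt;One way to wire that detection in, as a sketch. The &lt;code&gt;response.url&lt;/code&gt; attribute and the &lt;code&gt;reauthenticate&lt;/code&gt; callback are placeholders for whatever your client and auth flow actually expose:&lt;/p&gt;

```python
def ensure_authenticated(response, reauthenticate):
    """
    Detect an expired Glassdoor session and recover.

    response:        scrape result exposing the final `url` (assumed attribute)
    reauthenticate:  callable returning a fresh cookie header string
    Returns a new cookie header to use, or None if the session is still valid.
    """
    final_url = getattr(response, "url", "") or ""
    if "/profile/login" in final_url:
        # Session expired: Glassdoor redirected the request to the login page
        return reauthenticate()
    return None
```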

&lt;p&gt;&lt;strong&gt;Hard pagination cap&lt;/strong&gt; — Glassdoor limits job search results to 30 pages (300 results) per query regardless of how many matching listings exist. Paginating past page 30 returns the first page again. The correct approach is to narrow queries by location, &lt;code&gt;fromAge&lt;/code&gt; (days posted), or &lt;code&gt;jobType&lt;/code&gt; parameter rather than paginating deeper on a broad query.&lt;/p&gt;
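&lt;p&gt;Query narrowing can be generated programmatically. A sketch reusing the location-code URL pattern used elsewhere in this guide; verify the exact URL shape against live searches:&lt;/p&gt;

```python
def narrowed_queries(role, cities, from_age_days=7):
    """
    Split one broad query into per-city, recency-bounded queries so each
    stays under the 300-result cap. The URL shape is an assumption based
    on Glassdoor's location-search pattern.
    """
    for slug, code in cities:
        yield (
            f"https://www.glassdoor.com/Job/{role}-{slug}-jobs"
            f"-SRCH_IL.0,{len(slug)}_IC{code}.htm"
            f"?fromAge={from_age_days}"
        )

# Example: per-city, 7-day windows instead of one nationwide query
cities = [("austin", "IC1139761"), ("seattle", "IC1150505")]
urls = list(narrowed_queries("python-engineer", cities))
```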

&lt;p&gt;&lt;strong&gt;Selector drift&lt;/strong&gt; — Glassdoor ships frontend updates frequently. Class names change with every React build. The &lt;code&gt;data-test&lt;/code&gt; attributes documented above are more stable, but they can also shift. Build result-count validation into your pipeline: if a parse returns zero records, treat that as a selector failure, not an empty result set, and alert.&lt;/p&gt;
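&lt;p&gt;That validation can be a one-line guard shared by every parser (a sketch; route the exception into your alerting):&lt;/p&gt;

```python
class SelectorDriftError(RuntimeError):
    """Raised when a parse returns zero records: likely selector drift."""

def validate_parse(records, url):
    """Distinguish 'selectors broke' from 'no data' before storing results."""
    if not records:
        raise SelectorDriftError(
            f"0 records parsed from {url}: check data-test selectors"
        )
    return records
```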

&lt;p&gt;&lt;strong&gt;Hydration timing&lt;/strong&gt; — Even with &lt;code&gt;render_js=True&lt;/code&gt;, returning content before React has finished hydrating gives you an empty shell. Always set &lt;code&gt;wait_for&lt;/code&gt; to a CSS selector matching a target element, not just a fixed timeout. The element-based wait adapts to variable page load times automatically.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scaling Up
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Batch Requests
&lt;/h3&gt;

&lt;p&gt;For bulk collection across many search permutations — dozens of cities, multiple job titles, rolling date windows — the AlterLab batch endpoint processes URLs in parallel and is significantly more efficient than sequential requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# batch_glassdoor.py
import json

import alterlab

client = alterlab.Client("YOUR_API_KEY")

cities = [
    ("new-york-city", "IC1132348"),
    ("san-francisco", "IC1147401"),
    ("austin",        "IC1139761"),
    ("seattle",       "IC1150505"),
    ("chicago",       "IC1128808"),
]

urls = [
    f"https://www.glassdoor.com/Job/python-engineer-{slug}-jobs-SRCH_IL.0,{len(slug)}_IC{code}_IP{page}.htm"
    for slug, code in cities
    for page in range(1, 11)   # 10 pages × 5 cities = 50 requests
]

results = client.batch_scrape(
    urls=urls,
    render_js=True,
    wait_for="[data-test='jobListing']",
    concurrency=10,
)

with open("glassdoor_jobs.jsonl", "w") as f:
    for r in results:
        if r.success:
            f.write(json.dumps({"url": r.url, "html": r.html}) + "\n")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### Scheduling Recurring Pipelines

For daily job market snapshots or weekly salary index updates, wire the scraper to a scheduler. APScheduler is lightweight and runs in-process without a separate queue service:



```python title="scheduler.py" {8-16}
from apscheduler.schedulers.blocking import BlockingScheduler

from parse_jobs import parse_job_listings

client = alterlab.Client("YOUR_API_KEY")
scheduler = BlockingScheduler()

@scheduler.scheduled_job("cron", hour=2, minute=0)  # 02:00 daily
def daily_glassdoor_pull():
    roles = ["software-engineer", "data-engineer", "product-manager", "ml-engineer"]
    for role in roles:
        url = f"https://www.glassdoor.com/Salaries/{role}-salary-SRCH_KO0,{len(role)}.htm"
        response = client.scrape(url, render_js=True, wait_for="[data-test='salaryRow']")
        jobs = parse_job_listings(response.html)
        store_to_warehouse(jobs)   # your storage layer here

scheduler.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Cost Management at Scale
&lt;/h3&gt;

&lt;p&gt;Not every Glassdoor page requires full JavaScript execution. Company overview pages and some listing shells partially pre-render server-side. Profile your target URLs: attempt a plain HTML fetch first and check whether your target selectors are present. Use &lt;code&gt;render_js=False&lt;/code&gt; wherever possible — it is faster and consumes fewer credits. Reserve JS rendering for pages that require it.&lt;/p&gt;
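&lt;p&gt;A profiling sketch of that static-first strategy. The &lt;code&gt;client.scrape&lt;/code&gt; call and &lt;code&gt;response.html&lt;/code&gt; attribute follow the interfaces shown earlier; the &lt;code&gt;parse&lt;/code&gt; callable is whatever extractor you already use:&lt;/p&gt;

```python
def scrape_cheapest(client, url, probe_selector, parse):
    """
    Try a cheap static fetch first; fall back to JS rendering only when
    the parse yields nothing. Returns (records, mode) so you can log how
    often each URL actually needed rendering.
    """
    response = client.scrape(url, render_js=False)
    records = parse(response.html)
    if records:
        return records, "static"
    # Static HTML was a shell: retry with a JS runtime
    response = client.scrape(url, render_js=True, wait_for=probe_selector)
    return parse(response.html), "rendered"
```

Logging the returned mode per URL builds the evidence for which targets can safely stay on the cheap path.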

&lt;p&gt;Review &lt;a href="https://dev.to/pricing"&gt;AlterLab pricing&lt;/a&gt; for credit consumption rates broken down by request type before sizing your pipeline budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript rendering is not optional&lt;/strong&gt; — Glassdoor's content is React-rendered. A plain HTTP fetch returns a shell. Always set &lt;code&gt;render_js=True&lt;/code&gt; and use &lt;code&gt;wait_for&lt;/code&gt; with a target element selector.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cloudflare is the primary blocker&lt;/strong&gt; — do not spend engineering cycles maintaining your own bypass. It is a dependency, not a competitive advantage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefer &lt;code&gt;data-test&lt;/code&gt; attributes over class names&lt;/strong&gt; — class names change with every build. &lt;code&gt;data-test&lt;/code&gt; attributes are intentionally stable for testing and are your most reliable selection strategy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Salary data requires authentication&lt;/strong&gt; — pass valid session cookies and implement refresh logic for any pipeline running longer than a few hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Respect the 30-page cap&lt;/strong&gt; — use query narrowing (location, date posted, job type) rather than deep pagination to collect comprehensive datasets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch and schedule deliberately&lt;/strong&gt; — sequential requests are fine for development; batch endpoints with concurrency control are essential for production pipelines.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Related Guides
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/how-to-scrape-linkedin-com"&gt;How to Scrape LinkedIn&lt;/a&gt; — professional profiles, company pages, and job postings behind one of the web's most aggressive anti-bot stacks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/how-to-scrape-indeed-com"&gt;How to Scrape Indeed&lt;/a&gt; — job listings and employer reviews with simpler authentication requirements than Glassdoor&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/blog/how-to-scrape-amazon-com"&gt;How to Scrape Amazon&lt;/a&gt; — product pricing, reviews, and inventory data at scale with dynamic rendering handled&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>proxies</category>
      <category>dataextraction</category>
      <category>python</category>
      <category>scraping</category>
    </item>
    <item>
      <title>Web Scraping API Pricing Compared: Cut Costs 90%</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Sat, 28 Mar 2026 10:35:02 +0000</pubDate>
      <link>https://dev.to/alterlab/web-scraping-api-pricing-compared-cut-costs-90-3g3</link>
      <guid>https://dev.to/alterlab/web-scraping-api-pricing-compared-cut-costs-90-3g3</guid>
      <description>&lt;h2&gt;
  
  
  The Real Cost of Web Scraping at Scale
&lt;/h2&gt;

&lt;p&gt;Most engineering teams overspend on web scraping by 5-10x because they use the same infrastructure for every request. Scraping a static HTML documentation page shouldn't cost the same as extracting data from a JavaScript-heavy e-commerce site with Cloudflare protection.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;tiered scraping architecture&lt;/strong&gt;. By matching request complexity to infrastructure level, teams routinely cut scraping costs by 80-90% while maintaining or improving success rates.&lt;/p&gt;

&lt;p&gt;This post breaks down scraping API pricing models, shows how tiered systems work, and provides production-ready code for implementing cost-optimized scraping pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Scraping API Pricing Actually Works
&lt;/h2&gt;

&lt;p&gt;Scraping APIs charge based on infrastructure cost per request. Understanding these tiers is critical for cost optimization:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T1 — Basic HTTP Requests&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No JavaScript execution&lt;/li&gt;
&lt;li&gt;Standard headers and cookies&lt;/li&gt;
&lt;li&gt;Cost: ~$0.001-0.003 per request&lt;/li&gt;
&lt;li&gt;Use case: Static HTML, documentation sites, simple blogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;T2 — Enhanced HTTP&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Custom headers, cookies, user agents&lt;/li&gt;
&lt;li&gt;Basic anti-detection&lt;/li&gt;
&lt;li&gt;Cost: ~$0.003-0.005 per request&lt;/li&gt;
&lt;li&gt;Use case: Sites with basic bot detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;T3 — Headless Browser&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full JavaScript execution (Playwright/Puppeteer)&lt;/li&gt;
&lt;li&gt;Browser fingerprint rotation&lt;/li&gt;
&lt;li&gt;Cost: ~$0.01-0.02 per request&lt;/li&gt;
&lt;li&gt;Use case: SPAs, dynamic content, infinite scroll&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;T4 — Advanced Anti-Bot&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All T3 features plus&lt;/li&gt;
&lt;li&gt;Advanced fingerprint spoofing&lt;/li&gt;
&lt;li&gt;Behavioral automation&lt;/li&gt;
&lt;li&gt;Cost: ~$0.02-0.04 per request&lt;/li&gt;
&lt;li&gt;Use case: Cloudflare, PerimeterX, DataDome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;T5 — CAPTCHA Solving&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All T4 features plus&lt;/li&gt;
&lt;li&gt;Human CAPTCHA solving&lt;/li&gt;
&lt;li&gt;Cost: ~$0.05-0.10 per request&lt;/li&gt;
&lt;li&gt;Use case: Sites with hCaptcha, reCAPTCHA challenges&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cost difference between T1 and T5 is 50-100x. Using T5 for every request when 70% of your targets only need T1 is financial waste.&lt;/p&gt;
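&lt;p&gt;To make the waste concrete, here is the arithmetic with midpoint prices from the tiers above (illustrative numbers, not a quote):&lt;/p&gt;

```python
# Midpoint per-request costs from the tier breakdown above (USD)
TIER_COST = {1: 0.002, 2: 0.004, 3: 0.015, 4: 0.030, 5: 0.075}

def monthly_cost(mix, requests=10_000):
    """mix maps tier to fraction of traffic; returns total monthly cost."""
    return sum(requests * share * TIER_COST[t] for t, share in mix.items())

everything_t5 = monthly_cost({5: 1.0})            # $750.00
mixed = monthly_cost({1: 0.7, 3: 0.2, 5: 0.1})    # 70% of targets only need T1
print(everything_t5, mixed)  # roughly a 6x waste factor on this mix
```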

&lt;h2&gt;
  
  
  Pricing Model Comparison
&lt;/h2&gt;

&lt;p&gt;Most scraping services use one of three pricing models. Here's how they compare for production workloads:&lt;/p&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
    &lt;thead&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Cost at 10K req/mo&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Cost at 100K req/mo&lt;/th&gt;
&lt;br&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
    &lt;/thead&gt;
&lt;br&gt;
    &lt;tbody&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Flat Rate&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;$99/mo&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;$499/mo&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Predictable, low-volume&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Pay-Per-Success&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;$50-300&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;$500-3000&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Variable success rates&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
      &lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Tiered Usage&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;$30-80&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;$200-600&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Mixed complexity targets&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
    &lt;/tbody&gt;
&lt;br&gt;
  &lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Flat Rate Plans&lt;/strong&gt; charge a fixed monthly fee for a request quota. Simple to budget, but you pay the same rate regardless of target complexity. Often includes overage charges that spike unexpectedly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pay-Per-Success&lt;/strong&gt; charges only for successful extractions. Transparent, but success rate definitions vary. A 95% success rate means you're paying for 5% failures indirectly through higher per-request pricing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tiered Usage&lt;/strong&gt; (like &lt;a href="https://alterlab.io/pricing" rel="noopener noreferrer"&gt;AlterLab's pricing&lt;/a&gt;) charges based on infrastructure tier used. This is where significant savings happen—you control which tier each request uses, optimizing for cost per target.&lt;/p&gt;

&lt;p&gt;For teams scraping 50+ different domains with varying complexity, tiered pricing typically costs 60-90% less than flat-rate alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Tiered Scraping in Production
&lt;/h2&gt;

&lt;p&gt;The key to cost optimization is automatic tier escalation: start with the cheapest tier, escalate only when needed. Here's a production-ready implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# tiered_scraper.py
import alterlab

client = alterlab.Client(
    api_key="YOUR_API_KEY",
    auto_escalate=True  # Auto-escalate on failure
)

def scrape_with_tier_optimization(url: str, min_tier: int = 1) -&amp;gt; dict:
    """
    Scrape URL starting at minimum tier, escalate only if needed.
    Reduces costs by 70-90% compared to always using T5.
    """
    response = client.scrape(
        url=url,
        min_tier=min_tier,      # Start at T1 for static sites
        max_tier=5,             # Escalate up to T5 if needed
        formats=["json"]
    )

    return {
        "url": url,
        "tier_used": response.tier,
        "cost": response.cost,
        "success": response.success,
        "data": response.data
    }

# Example: Scrape 100 mixed-complexity sites
urls = [
    "https://docs.python.org/3/library/",      # T1 sufficient
    "https://www.amazon.com/dp/B08N5WRWNW",    # T4 required
    "https://github.com/trending",             # T2-3 needed
]

results = [scrape_with_tier_optimization(url) for url in urls]
total_cost = sum(r["cost"] for r in results)
print(f"Total cost: ${total_cost:.4f} for {len(results)} requests")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The `min_tier` parameter is critical. Setting `min_tier=1` tells the API to attempt T1 first, escalating only on failure. For known complex sites, set `min_tier=4` to skip wasted T1-T3 attempts.

For JavaScript-heavy sites, use the [Python SDK](https://alterlab.io/web-scraping-api-python) which handles tier selection automatically based on response analysis.

## Cost Comparison: Before and After Tiered Architecture

Let's compare actual costs for a realistic scraping workload: 10,000 requests/month across mixed-complexity targets.

**Scenario: E-commerce Price Monitoring**
- 40% static product pages (T1 sufficient)
- 35% JavaScript-rendered prices (T3 required)
- 20% moderate anti-bot (T4 required)
- 5% CAPTCHA-protected (T5 required)

&amp;lt;div data-infographic="stats"&amp;gt;
  &amp;lt;div data-stat data-value="$450" data-label="Flat Rate Cost"&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div data-stat data-value="$87" data-label="Tiered Cost"&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div data-stat data-value="81%" data-label="Cost Reduction"&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div data-stat data-value="99.2%" data-label="Success Rate"&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;

**Flat Rate (Always T5):**


```plaintext
10,000 requests × $0.045 (avg T5) = $450/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Tiered Architecture:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;4,000 × $0.002 (T1)  = $8.00
3,500 × $0.015 (T3)  = $52.50
2,000 × $0.030 (T4)  = $60.00
500   × $0.080 (T5)  = $40.00
─────────────────────────────────
Total:               $160.50/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With Auto-Escalation Optimization:&lt;/strong&gt;&lt;br&gt;
Smart tier selection (starting low, escalating only on failure) typically reduces the T4/T5 portion by 40-50% because many sites that appear complex actually respond to simpler requests.&lt;br&gt;
&lt;/p&gt;
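&lt;p&gt;The accounting behind that estimate can be sketched as follows. The 45% downshift fraction is an assumption, and the headline ~$87 also depends on escalation-attempt pricing not broken out here:&lt;/p&gt;

```python
# Request counts and per-request prices from the tiered breakdown above
BASE = {"T1": (4000, 0.002), "T3": (3500, 0.015),
        "T4": (2000, 0.030), "T5": (500, 0.080)}
DOWNGRADE_PRICE = {"T4": 0.015, "T5": 0.030}   # price one tier lower

def optimized_total(downshift=0.45):
    """Total monthly cost if `downshift` of T4/T5 requests succeed one tier lower."""
    total = 0.0
    for tier, (n, price) in BASE.items():
        cheap = DOWNGRADE_PRICE.get(tier)
        if cheap is None:
            total += n * price
        else:
            total += n * (1 - downshift) * price + n * downshift * cheap
    return round(total, 2)

print(optimized_total(0.0), optimized_total(0.45))
```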

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Optimized Total: ~$87/month (81% savings vs flat rate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://alterlab.io/docs/quickstart/installation" rel="noopener noreferrer"&gt;quickstart guide&lt;/a&gt; shows how to configure auto-escalation in under 5 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Node.js Implementation for High-Volume Pipelines
&lt;/h2&gt;

&lt;p&gt;For teams running scraping jobs in Node.js environments, here's a production pattern with built-in cost tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// scraper-pipeline.js
const { AlterLabClient } = require('alterlab');

const client = new AlterLabClient({
  apiKey: process.env.ALTERLAB_API_KEY,
  autoEscalate: true,
  maxRetries: 3,
  onTierEscalation: (from, to, url) =&amp;gt; {
    console.log(`Escalated T${from} → T${to} for ${url}`);
  }
});

async function scrapeWithCostTracking(urls) {
  const results = await Promise.all(
    urls.map(async (url) =&amp;gt; {
      const response = await client.scrape(url, {
        minTier: 1,
        formats: ['json'],
        timeout: 30000
      });

      return {
        url,
        tier: response.tier,
        cost: response.cost,
        success: response.success,
        timestamp: new Date().toISOString()
      };
    })
  );

  const totalCost = results.reduce((sum, r) =&amp;gt; sum + r.cost, 0);
  const tierDistribution = results.reduce((acc, r) =&amp;gt; {
    acc[`T${r.tier}`] = (acc[`T${r.tier}`] || 0) + 1;
    return acc;
  }, {});

  return {
    results,
    summary: {
      totalRequests: results.length,
      totalCost: totalCost.toFixed(4),
      avgCostPerRequest: (totalCost / results.length).toFixed(6),
      tierDistribution
    }
  };
}

// Usage
const urls = [
  'https://example-shop.com/product/123',
  'https://competitor-site.com/pricing',
];

scrapeWithCostTracking(urls).then(({ summary }) =&amp;gt; {
  console.log(`Cost: $${summary.totalCost} for ${summary.totalRequests} requests`);
  console.log('Tier distribution:', summary.tierDistribution);
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


This pattern gives you visibility into tier distribution—critical for identifying optimization opportunities. If 80% of requests escalate to T4+, your `min_tier` defaults may be too conservative.

## When to Use Each Tier: Decision Framework

Use this decision tree to set appropriate `min_tier` values for your targets:

&amp;lt;div data-infographic="steps"&amp;gt;
  &amp;lt;div data-step data-number="1" data-title="Static HTML?" data-description="View source → HTML present? Use T1"&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div data-step data-number="2" data-title="JavaScript Required?" data-description="Empty HTML, dynamic content? Use T3"&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div data-step data-number="3" data-title="Cloudflare Detected?" data-description="Challenge page? Use T4"&amp;gt;&amp;lt;/div&amp;gt;
  &amp;lt;div data-step data-number="4" data-title="CAPTCHA Present?" data-description="hCaptcha/reCAPTCHA? Use T5"&amp;gt;&amp;lt;/div&amp;gt;
&amp;lt;/div&amp;gt;

**Quick Tier Selection Guide:**

| Target Type | Recommended min_tier | Why |
|-------------|---------------------|-----|
| Documentation sites | 1 | Static HTML, no JS |
| News articles | 1-2 | Mostly static, some lazy load |
| E-commerce product pages | 3-4 | JS rendering, anti-bot common |
| Social media profiles | 4-5 | Heavy anti-bot, login walls |
| Government sites | 1-2 | Usually simple, occasional CAPTCHA |
| Job boards | 2-3 | Mix of static and dynamic |
| Real estate listings | 3-4 | Images, maps, dynamic pricing |

Test new targets with `min_tier=1` first. Log the tier that succeeds, then set that as your baseline for future scrapes. The [API reference](https://alterlab.io/docs) documents all tier-specific parameters.

## Monitoring and Alerting for Cost Optimization

Cost optimization requires visibility. Set up monitoring to catch tier escalation spikes:



```python title="cost_monitor.py" {8-14}

from datetime import datetime, timedelta

client = alterlab.Client(api_key="YOUR_API_KEY")

def analyze_tier_distribution(hours: int = 24) -&amp;gt; dict:
    """Analyze tier distribution over time window."""
    cutoff = datetime.now() - timedelta(hours=hours)

    # Query your scrape logs (implementation depends on your storage)
    scrapes = get_scrapes_since(cutoff)

    tier_counts = {}
    tier_costs = {}

    for scrape in scrapes:
        tier = f"T{scrape.tier}"
        tier_counts[tier] = tier_counts.get(tier, 0) + 1
        tier_costs[tier] = tier_costs.get(tier, 0) + scrape.cost

    total_cost = sum(tier_costs.values())

    return {
        "period_hours": hours,
        "total_requests": len(scrapes),
        "total_cost": total_cost,
        "tier_distribution": tier_counts,
        "cost_by_tier": tier_costs,
        "avg_cost_per_request": total_cost / len(scrapes) if scrapes else 0
    }

# Alert if T5 usage exceeds 10%
def check_tier_alerts():
    analysis = analyze_tier_distribution(hours=1)
    t5_ratio = analysis["tier_distribution"].get("T5", 0) / analysis["total_requests"]

    if t5_ratio &amp;gt; 0.10:
        send_alert(f"T5 usage spike: {t5_ratio:.1%} in last hour")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up alerts for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;T5 usage &amp;gt; 10% of requests (indicates potential blocking)&lt;/li&gt;
&lt;li&gt;Average cost per request increasing &amp;gt; 20% week-over-week&lt;/li&gt;
&lt;li&gt;Success rate dropping below 95% for any tier&lt;/li&gt;
&lt;/ul&gt;
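&lt;p&gt;The week-over-week cost check can be a small pure function over your cost totals. A sketch with illustrative names; wire the inputs to your own scrape logs:&lt;/p&gt;

```python
def wow_cost_change(this_week: float, last_week: float) -> float:
    """Return week-over-week cost change as a fraction (0.2 means +20%)."""
    if last_week == 0:
        return 0.0
    return (this_week - last_week) / last_week

def should_alert(this_week: float, last_week: float, threshold: float = 0.20) -> bool:
    """True when average cost grew more than the threshold week over week."""
    return wow_cost_change(this_week, last_week) > threshold
```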

&lt;h2&gt;
  
  
  Common Cost Optimization Mistakes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Mistake 1: Always Using Headless Browsers&lt;/strong&gt;&lt;br&gt;
Running every request through Playwright when 60% of targets are static HTML wastes 50-70% of your budget. Start with T1, escalate on failure.&lt;/p&gt;
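&lt;p&gt;A minimal sketch of the start-cheap, escalate-on-failure loop, assuming a hypothetical &lt;code&gt;fetch_at_tier&lt;/code&gt; callable that returns content on success and &lt;code&gt;None&lt;/code&gt; on failure:&lt;/p&gt;

```python
def scrape_with_escalation(url, fetch_at_tier, min_tier=1, max_tier=5):
    """Try tiers from cheapest up; return (tier, content) on first success."""
    for tier in range(min_tier, max_tier + 1):
        content = fetch_at_tier(url, tier)
        if content is not None:
            return tier, content
    # Every tier failed; caller decides whether to retry later
    return None, None
```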

&lt;p&gt;&lt;strong&gt;Mistake 2: Not Caching Results&lt;/strong&gt;&lt;br&gt;
Re-scraping unchanged pages burns budget. Implement ETag-based caching or use &lt;a href="https://alterlab.io/docs" rel="noopener noreferrer"&gt;monitoring features&lt;/a&gt; that only return data when pages change.&lt;/p&gt;
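&lt;p&gt;A sketch of the bookkeeping behind ETag-based caching; the helper names are illustrative, and your HTTP client supplies the actual status code and ETag. A 304 response means the page is unchanged, so the cached body is reused at zero scrape cost:&lt;/p&gt;

```python
def conditional_headers(cache: dict, url: str) -> dict:
    """Build an If-None-Match header from a previously stored ETag, if any."""
    entry = cache.get(url)
    if entry is None:
        return {}
    return {"If-None-Match": entry["etag"]}

def handle_response(cache: dict, url: str, status: int, etag: str, body: str) -> str:
    """On 304, serve the cached body (no re-scrape); otherwise store the new one."""
    if status == 304:
        return cache[url]["body"]
    cache[url] = {"etag": etag, "body": body}
    return body
```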

&lt;p&gt;&lt;strong&gt;Mistake 3: Ignoring Retry Logic&lt;/strong&gt;&lt;br&gt;
Transient failures happen. Blind retries at the same tier waste money. Implement exponential backoff with tier escalation on repeated failures.&lt;/p&gt;
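&lt;p&gt;One way to sketch that retry policy, with illustrative defaults: full-jitter exponential backoff, escalating one tier after every two failed attempts:&lt;/p&gt;

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: uniform over [0, min(cap, base * 2^attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def tier_for_attempt(start_tier: int, attempt: int, max_tier: int = 5) -> int:
    """Escalate one tier after every two failed attempts, capped at max_tier."""
    return min(max_tier, start_tier + attempt // 2)
```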

&lt;p&gt;&lt;strong&gt;Mistake 4: No Target Classification&lt;/strong&gt;&lt;br&gt;
Treating all URLs the same ignores known patterns. Classify targets by domain, set appropriate &lt;code&gt;min_tier&lt;/code&gt; per domain, and track success rates.&lt;/p&gt;
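&lt;p&gt;A sketch of per-domain classification; the &lt;code&gt;DOMAIN_TIERS&lt;/code&gt; map here is hypothetical and would be built from your own success-rate logs:&lt;/p&gt;

```python
from urllib.parse import urlparse

# Illustrative values; derive these from the tier that historically
# succeeds for each domain
DOMAIN_TIERS = {
    "docs.example.com": 1,
    "shop.example.com": 3,
}

def min_tier_for(url: str, default: int = 1) -> int:
    """Look up the known-good starting tier for a URL's domain."""
    return DOMAIN_TIERS.get(urlparse(url).netloc, default)
```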

&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;Tiered scraping architecture is the single most effective cost optimization for production scraping pipelines. Key points:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Match tier to complexity&lt;/strong&gt; — T1 for static sites, T5 only when necessary&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-escalate on failure&lt;/strong&gt; — Start cheap, escalate only when needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor tier distribution&lt;/strong&gt; — Alert on unusual T4/T5 spikes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively&lt;/strong&gt; — Don't re-scrape unchanged pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classify targets&lt;/strong&gt; — Set &lt;code&gt;min_tier&lt;/code&gt; per domain based on historical data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Teams implementing these practices typically see 70-90% cost reduction while maintaining 99%+ success rates. The &lt;a href="https://alterlab.io/faq" rel="noopener noreferrer"&gt;FAQ&lt;/a&gt; covers common implementation questions.&lt;/p&gt;

&lt;p&gt;For more technical deep-dives, check out the &lt;a href="https://alterlab.io/blog" rel="noopener noreferrer"&gt;AlterLab blog&lt;/a&gt; for posts on anti-bot bypass strategies and large-scale data extraction patterns.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>api</category>
      <category>dataextraction</category>
      <category>proxies</category>
    </item>
    <item>
      <title>How to Scrape LinkedIn Profiles and Company Data Without Getting Blocked in 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:17:08 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-scrape-linkedin-profiles-and-company-data-without-getting-blocked-in-2026-4c09</link>
      <guid>https://dev.to/alterlab/how-to-scrape-linkedin-profiles-and-company-data-without-getting-blocked-in-2026-4c09</guid>
      <description>&lt;h1&gt;
  
  
  How to Scrape LinkedIn Profiles and Company Data Without Getting Blocked in 2026
&lt;/h1&gt;

&lt;p&gt;Scraping LinkedIn profiles and company data is one of the harder engineering problems in data extraction — not because LinkedIn's HTML is complex, but because their bot detection is aggressive, layered, and constantly updated. This guide covers what LinkedIn's defense stack actually looks like in 2026, which approaches still work, and how to build a pipeline that holds up under sustained load.&lt;/p&gt;




&lt;h2&gt;
  
  
  What You're Up Against
&lt;/h2&gt;

&lt;p&gt;LinkedIn does not use a third-party bot protection vendor. Their detection is in-house and operates across several independent layers simultaneously:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLS fingerprinting (JA3/JA3S)&lt;/strong&gt;: LinkedIn inspects the TLS handshake before your request is even parsed. Python's &lt;code&gt;requests&lt;/code&gt; library has a well-known JA3 hash. So does Node.js's &lt;code&gt;https&lt;/code&gt; module. If your fingerprint matches a known automation signature, you're rate-limited or blocked before LinkedIn serves a single byte.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP/2 settings fingerprinting&lt;/strong&gt;: Beyond TLS, LinkedIn inspects the HTTP/2 SETTINGS frame — window size, header table size, stream concurrency. These values are distinct between browsers and libraries like &lt;code&gt;httpx&lt;/code&gt; or &lt;code&gt;aiohttp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Behavioral analysis&lt;/strong&gt;: LinkedIn tracks profile view velocity per session, per IP, and per account. Viewing 40 profiles in 20 minutes from the same session triggers a soft block. Scraping 200 profiles/day from the same account triggers a permanent suspension.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IP reputation&lt;/strong&gt;: Datacenter IPs (AWS, GCP, DigitalOcean, Hetzner) are near-universally blocked. LinkedIn has had years to compile ASN-level blocklists. Residential proxies are required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication wall&lt;/strong&gt;: Most profile data — current job, past experience, education, connections — is behind login. Public profile pages show a truncated view and often redirect to the login wall after 2-3 requests from an unauthenticated session.&lt;/p&gt;

&lt;p&gt;Understanding this stack tells you what tools are off the table immediately: raw &lt;code&gt;requests&lt;/code&gt;, basic Selenium without stealth patches, and datacenter proxies. The approaches that still work in 2026 are headless browsers with fingerprint spoofing, proper session management with valid &lt;code&gt;li_at&lt;/code&gt; cookies, and residential proxy rotation.&lt;/p&gt;
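&lt;p&gt;For the HTTP-client route, one option is &lt;code&gt;curl_cffi&lt;/code&gt;, a library that impersonates a real browser's TLS and HTTP/2 fingerprint. A sketch, assuming the library is installed (available impersonation targets vary by version); the URL helper is defined here for illustration:&lt;/p&gt;

```python
def build_company_url(slug: str) -> str:
    """Public company page URL for a given slug."""
    return f"https://www.linkedin.com/company/{slug}/"

def fetch_public_page(slug: str) -> str:
    # Lazy import so the URL helper stays usable without the dependency installed
    from curl_cffi import requests as curl_requests
    # impersonate="chrome" sends a Chrome-like TLS/HTTP2 fingerprint
    resp = curl_requests.get(build_company_url(slug), impersonate="chrome")
    resp.raise_for_status()
    return resp.text
```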




&lt;h2&gt;
  
  
  What Data is Realistically Scrapable
&lt;/h2&gt;

&lt;p&gt;Before writing a line of code, be precise about what you need:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Data Type&lt;/th&gt;
&lt;th&gt;Requires Login&lt;/th&gt;
&lt;th&gt;Detection Risk&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Company overview (name, size, industry, HQ)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Public pages are stable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Company employee count&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Often in structured &lt;code&gt;ld+json&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job postings&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;LinkedIn Jobs is more open&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personal profile (headline, current role)&lt;/td&gt;
&lt;td&gt;Soft&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Truncated without auth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full work history, education&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Requires &lt;code&gt;li_at&lt;/code&gt; session&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Connection graph&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;Heavily monitored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Post/activity feed&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Lazy-loaded, paginated&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Company pages are significantly more accessible than personal profiles. If your use case is firmographic enrichment — industry, headcount, location, description — you can get most of that from public company pages with modest precautions.&lt;/p&gt;

&lt;p&gt;For personal profiles with full history, you need an authenticated session.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 1: Scraping Public Company Pages
&lt;/h2&gt;

&lt;p&gt;Company pages (&lt;code&gt;linkedin.com/company/stripe/&lt;/code&gt;) render a meaningful amount of data without authentication. They also embed a &lt;code&gt;ld+json&lt;/code&gt; block with structured data, which is far more reliable than scraping HTML class names (LinkedIn obfuscates these and changes them frequently).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;parsel&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Selector&lt;/span&gt;

&lt;span class="n"&gt;HEADERS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User-Agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AppleWebKit/537.36 (KHTML, like Gecko) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome/122.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept-Language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US,en;q=0.9&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accept&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sec-Fetch-Dest&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;document&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sec-Fetch-Mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;navigate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sec-Fetch-Site&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sec-Ch-Ua&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;Chromium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;122&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Not(A:Brand&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;;v=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;24&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sec-Ch-Ua-Mobile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sec-Ch-Ua-Platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="s"&gt;Windows&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.linkedin.com/company/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Use HTTP/2 and a transport that mimics Chrome's TLS fingerprint
&lt;/span&gt;    &lt;span class="n"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncHTTPTransport&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;http2&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;HEADERS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;follow_redirects&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;30.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;sel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Extract structured data first — more reliable than class-based selectors
&lt;/span&gt;    &lt;span class="n"&gt;ld_json_blocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;script[type=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/ld+json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]::text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;getall&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;structured&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ld_json_blocks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Organization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Corporation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="n"&gt;structured&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

    &lt;span class="c1"&gt;# Fall back to meta tags for basics
&lt;/span&gt;    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meta[property=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;og:title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]::attr(content)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;sel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;meta[name=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]::attr(content)&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;employee_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;numberOfEmployees&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slug&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;founded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;foundingDate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;employee_range&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;employee_count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employee_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;industry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headquarters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;structured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;address&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;addressLocality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slugs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;slug&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;slugs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scrape_company&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HTTPStatusError&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;slug&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] HTTP &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Randomized delay — critical for avoiding velocity detection
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;6.0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A few things worth noting in this code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;http2=True&lt;/code&gt;&lt;/strong&gt; matters. LinkedIn's servers prefer HTTP/2, and an HTTP/1.1 client looks anomalous.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;Sec-Ch-Ua&lt;/code&gt; and &lt;code&gt;Sec-Fetch-*&lt;/code&gt; headers&lt;/strong&gt; are set by Chrome automatically. Their absence is a fingerprint.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ld+json&lt;/code&gt; extraction is the most stable part of this pipeline. LinkedIn's obfuscated class names can change weekly; their schema.org structured data changes far less frequently.&lt;/li&gt;
&lt;li&gt;The randomized delay (&lt;code&gt;uniform(2.5, 6.0)&lt;/code&gt;) is not optional. Fixed intervals like &lt;code&gt;time.sleep(2)&lt;/code&gt; are a pattern that detection systems flag.&lt;/li&gt;
&lt;/ul&gt;
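&lt;p&gt;To make the &lt;code&gt;ld+json&lt;/code&gt; point concrete, here is a minimal sketch of pulling schema.org structured data out of raw HTML with only the standard library. The HTML sample and field names are illustrative placeholders, not LinkedIn's actual markup:&lt;/p&gt;

```python
import json
import re

# Illustrative HTML standing in for a fetched company page.
html = """
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Organization",
 "name": "Acme Corp", "numberOfEmployees": {"value": 250}}
</script>
"""

def extract_ld_json(page_html: str) -> list[dict]:
    """Return every parseable ld+json block found in the page."""
    pattern = re.compile(
        r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
        re.DOTALL,
    )
    blocks = []
    for match in pattern.findall(page_html):
        try:
            blocks.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the scrape
    return blocks

org = extract_ld_json(html)[0]
print(org["name"])  # -> Acme Corp
```

&lt;p&gt;Because this parses a declared data format rather than presentation markup, it keeps working across visual redesigns that would break class-based selectors.&lt;/p&gt;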




&lt;h2&gt;
  
  
  Approach 2: Full Profile Scraping with Playwright
&lt;/h2&gt;

&lt;p&gt;For personal profiles with full work history, you need a real browser. &lt;code&gt;httpx&lt;/code&gt; won't execute the JavaScript that renders the page content, and LinkedIn uses lazy-loading for most profile sections.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;playwright&lt;/code&gt; with &lt;code&gt;playwright-stealth&lt;/code&gt; to patch the automation indicators that Playwright exposes by default (&lt;code&gt;navigator.webdriver&lt;/code&gt;, Chrome runtime, permission APIs, etc.).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright_stealth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stealth_async&lt;/span&gt;

&lt;span class="c1"&gt;# li_at is LinkedIn's primary session cookie.
# Obtain it from a logged-in browser session (DevTools → Application → Cookies).
&lt;/span&gt;&lt;span class="n"&gt;LI_AT_COOKIE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_li_at_cookie_value_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;PROFILE_SELECTORS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1.text-heading-xlarge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;headline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.text-body-medium.break-words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span.text-body-small.inline.t-black--light.break-words&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;about&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.display-flex.ph5.pv3 span.visually-hidden&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy_server&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;server&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;proxy_server&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--disable-blink-features=AutomationControlled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--no-sandbox&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;viewport&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;width&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1440&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;height&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="n"&gt;locale&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en-US&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;timezone_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;America/New_York&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;user_agent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mozilla/5.0 (Windows NT 10.0; Win64; x64) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AppleWebKit/537.36 (KHTML, like Gecko) &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chrome/122.0.0.0 Safari/537.36&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Inject the li_at session cookie before navigating
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_cookies&lt;/span&gt;&lt;span class="p"&gt;([{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;LI_AT_COOKIE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domain&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.linkedin.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;httpOnly&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;}])&lt;/span&gt;

        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;stealth_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Block images and fonts to reduce bandwidth and page load time
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*.{png,jpg,jpeg,gif,webp,svg,woff,woff2,ttf}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                         &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domcontentloaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Mimic scroll behavior — LinkedIn lazy-loads experience/education sections
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wheel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.8&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract visible text fields
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;selector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;PROFILE_SELECTORS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;el&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;el&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5_000&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

        &lt;span class="c1"&gt;# Extract experience section
&lt;/span&gt;        &lt;span class="n"&gt;experience&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;exp_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;li.artdeco-list__item.pvs-list__item--line-separated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;exp_items&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;  &lt;span class="c1"&gt;# cap to avoid long-running loops
&lt;/span&gt;            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;span[aria-hidden=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;first&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inner_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                &lt;span class="n"&gt;experience&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;experience_titles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;experience&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;profile_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;profile_urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;scrape_profile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="c1"&gt;# LinkedIn monitors inter-request timing at the account level
&lt;/span&gt;        &lt;span class="c1"&gt;# Keep it well under 3 profiles/minute per session
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key decisions in this code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stealth patching&lt;/strong&gt;: &lt;code&gt;playwright_stealth&lt;/code&gt; patches ~20 browser properties that Playwright exposes. Without it, &lt;code&gt;navigator.webdriver === true&lt;/code&gt; and you're flagged immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cookie injection over login flow&lt;/strong&gt;: Automating the login form is slower and creates a distinct behavioral pattern. Injecting &lt;code&gt;li_at&lt;/code&gt; directly is cleaner. Treat it as a secret — rotate accounts periodically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource blocking&lt;/strong&gt;: Blocking images and fonts cuts page load from ~4MB to ~400KB and halves scrape time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scroll simulation&lt;/strong&gt;: LinkedIn's experience and education sections don't render until scrolled into view. The &lt;code&gt;mouse.wheel&lt;/code&gt; calls are not optional for complete data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20–40 second delay between profiles&lt;/strong&gt;: This is not excessive caution; it is at the low end of how long a human spends reading a profile. Anything faster risks session suspension.&lt;/li&gt;
&lt;/ul&gt;
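&lt;p&gt;The pacing bullet above reduces to a small helper. The 20–40 second bounds mirror the figure in the list; the log-normal shape is an assumption about how human reading times cluster, not anything LinkedIn documents:&lt;/p&gt;

```python
import random

def human_delay(low: float = 20.0, high: float = 40.0) -> float:
    """Sample a jittered, human-plausible pause between profile views.

    A log-normal draw clusters near the low end with an occasional long
    tail (a reader lingering), then gets clamped into [low, high] so no
    pause is ever suspiciously short.
    """
    draw = random.lognormvariate(0, 0.5) * low
    return min(max(draw, low), high)

# Three example pauses; values vary per run but always land in [20, 40].
delays = [round(human_delay(), 1) for _ in range(3)]
print(delays)
```

&lt;p&gt;The point is the distribution, not the exact numbers: varied, bounded intervals avoid both the fixed-interval signature and the sub-15-second spikes discussed below.&lt;/p&gt;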




&lt;h2&gt;
  
  
  Proxy Strategy
&lt;/h2&gt;

&lt;p&gt;Residential proxies are non-negotiable for LinkedIn at any meaningful scale. The decision tree is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&amp;lt; 100 profiles/day&lt;/strong&gt;: A single residential IP rotated per session is sufficient. Services like Oxylabs, Bright Data, or Smartproxy provide per-IP rotation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;100–1,000 profiles/day&lt;/strong&gt;: Rotate per request for anonymous requests, but keep a sticky IP for each authenticated session, since rotating mid-session forces re-authentication. Use geo-targeted proxies matching your LinkedIn account's expected location; a US account routing through a Bucharest IP is an anomaly signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&amp;gt; 1,000 profiles/day&lt;/strong&gt;: You need multiple LinkedIn accounts, multiple residential proxy pools, and request distribution across both dimensions. At this scale, managing fingerprinting in-house becomes a significant maintenance burden.&lt;/li&gt;
&lt;/ul&gt;
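&lt;p&gt;One way to express the "sticky per session, rotate per session" idea from the decision tree is a small pool that pins each session to one exit IP and only reassigns when the session itself is retired. The proxy addresses below are placeholders:&lt;/p&gt;

```python
import random

class ProxyPool:
    """Assigns each session a sticky proxy; rotation happens per session."""

    def __init__(self, proxies: list[str]):
        self._proxies = proxies
        self._assignments: dict[str, str] = {}

    def for_session(self, session_id: str) -> str:
        # First call for a session picks a proxy; later calls reuse it,
        # so requests within a session never hop IPs mid-flight.
        if session_id not in self._assignments:
            self._assignments[session_id] = random.choice(self._proxies)
        return self._assignments[session_id]

    def rotate(self, session_id: str) -> str:
        # Call when a session (cookie) is retired and replaced.
        self._assignments.pop(session_id, None)
        return self.for_session(session_id)

pool = ProxyPool([
    "http://user:pass@residential-1.example:8000",
    "http://user:pass@residential-2.example:8000",
])
first = pool.for_session("account-a")
assert pool.for_session("account-a") == first  # sticky within a session
```

&lt;p&gt;This keeps the IP-to-cookie pairing stable, which matters for the session-level rate limits described next.&lt;/p&gt;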

&lt;p&gt;For teams that want to skip the proxy infrastructure and browser fingerprint management, scraping APIs like &lt;a href="https://alterlab.io" rel="noopener noreferrer"&gt;AlterLab&lt;/a&gt; handle rotating proxies, TLS fingerprint spoofing, and JavaScript rendering in a single API call — useful when the scraping itself isn't your core engineering problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rate Limiting and Request Patterns
&lt;/h2&gt;

&lt;p&gt;LinkedIn's rate limiting operates at three independent levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IP level&lt;/strong&gt;: Even with residential proxies, individual IPs have request budgets. Rotate IP per session, not per request, if you want to preserve cookie-based sessions. Rotating mid-session triggers a re-authentication challenge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Account level&lt;/strong&gt;: LinkedIn tracks profile view counts per authenticated session. Stay under 80–100 profile views per 24-hour period per account. This is a soft limit — exceeding it triggers an "unusual activity" checkpoint, not an immediate ban.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Velocity detection&lt;/strong&gt;: The interval between sequential profile views matters more than the total count. A human researcher views a profile, reads it (45–90 seconds), then moves to the next. Spikes below 15 seconds between views consistently trigger flags.&lt;/p&gt;

&lt;p&gt;Practical implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RateLimiter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;max_per_hour&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="n"&gt;min_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;20.0&lt;/span&gt;
    &lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;deque&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default_factory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Enforce minimum interval
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_interval_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_interval_seconds&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;elapsed&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Enforce hourly budget
&lt;/span&gt;        &lt;span class="n"&gt;cutoff&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;cutoff&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popleft&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_per_hour&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;oldest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="n"&gt;wait_until&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;oldest&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3600&lt;/span&gt;
            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_timestamps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;monotonic&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Handling Structure Changes
&lt;/h2&gt;

&lt;p&gt;LinkedIn's HTML uses obfuscated class names that change on deploys. Do not hard-code class names as primary selectors. Use this hierarchy, in order of stability:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;JSON-LD structured data (&lt;code&gt;script[type="application/ld+json"]&lt;/code&gt;)&lt;/strong&gt; — most stable; it only changes when the page's schema.org markup does&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;aria-label&lt;/code&gt; and semantic attributes&lt;/strong&gt; — stable across redesigns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;data-*&lt;/code&gt; attributes&lt;/strong&gt; — moderately stable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tag + position selectors&lt;/strong&gt; (e.g., &lt;code&gt;h1:first-of-type&lt;/code&gt;) — fragile but better than class names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Obfuscated class names&lt;/strong&gt; (e.g., &lt;code&gt;.pvs-list__item--line-separated&lt;/code&gt;) — treat as temporary&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When selectors break — and they will — the fastest recovery path is to diff the HTML before/after the break and update your attribute-based selectors. Keep a snapshot of the last known-good HTML in your test fixtures.&lt;/p&gt;
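&lt;p&gt;A cheap way to catch breaks early is to validate every batch against the fields you expect before the data flows downstream. The sketch below is illustrative only: the field names and the 20% threshold are hypothetical placeholders, not part of any LinkedIn schema:&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical schema check: surface a silent selector break as a failed
# validation instead of missing data downstream. Field names are placeholders.
REQUIRED_FIELDS = ("name", "headline", "company")

def find_broken_fields(record: dict) -> list:
    """Return the required fields that are missing or empty in one scraped record."""
    return [f for f in REQUIRED_FIELDS if not record.get(f)]

def batch_looks_broken(records: list, threshold: float = 0.2) -> bool:
    """Flag a batch when at least `threshold` of its records have broken fields."""
    if not records:
        return True  # an empty batch usually means the listing selector itself broke
    broken = sum(1 for r in records if find_broken_fields(r))
    return broken / len(records) >= threshold
```

Wire a check like this into the scraper's exit path and alert when it fires; combined with the known-good HTML snapshot, it turns a silent break into a diffable incident.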




&lt;h2&gt;
  
  
  When Raw Scraping Isn't Worth It
&lt;/h2&gt;

&lt;p&gt;There are scenarios where building and maintaining this stack isn't justified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need &amp;lt; 500 profiles/month and don't want to manage proxy billing and account rotation&lt;/li&gt;
&lt;li&gt;Your team doesn't have bandwidth to monitor for LinkedIn anti-bot updates&lt;/li&gt;
&lt;li&gt;You need consistent uptime SLAs that your own scraper can't provide&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases, a managed scraping API handles the fingerprint management, proxy infrastructure, and JavaScript rendering for you. &lt;a href="https://alterlab.io" rel="noopener noreferrer"&gt;AlterLab's API&lt;/a&gt; supports rendering JavaScript pages with a single POST request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.alterlab.io/v1/scrape&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-API-Key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://www.linkedin.com/company/stripe/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;render_js&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wait_for&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;div.org-top-card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy_country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tradeoff: you give up control and cost optimization in exchange for reliability and zero infrastructure maintenance. For high-volume production pipelines where LinkedIn data is core to the product, building in-house is usually cheaper at scale. For analytics, enrichment, or research pipelines, an API is faster to ship and easier to maintain.&lt;/p&gt;
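&lt;p&gt;That crossover point is worth sanity-checking with back-of-the-envelope arithmetic. The sketch below compares a flat in-house infrastructure cost against per-request API pricing; every figure is a placeholder, not a real quote from any provider:&lt;br&gt;
&lt;/p&gt;

```python
def monthly_api_cost(requests_per_month: int, price_per_request: float) -> float:
    """Managed API: cost scales linearly with volume."""
    return requests_per_month * price_per_request

def break_even_volume(infra_cost_per_month: float, price_per_request: float) -> int:
    """Requests/month above which a flat in-house stack beats per-request pricing."""
    return round(infra_cost_per_month / price_per_request)

# Placeholder numbers: 2000 USD/month for proxies, accounts, and engineering
# time vs 0.002 USD per managed-API request.
BREAK_EVEN = break_even_volume(2000.0, 0.002)  # about one million requests/month
```

The specific numbers matter less than the fact that the crossover is a single division; rerun it whenever your proxy bill or API pricing changes.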




&lt;h2&gt;
  
  
  Legal and Ethical Considerations
&lt;/h2&gt;

&lt;p&gt;LinkedIn's Terms of Service prohibit automated scraping. The &lt;em&gt;hiQ Labs v. LinkedIn&lt;/em&gt; case (9th Circuit, 2022) established that scraping publicly available data is not a violation of the Computer Fraud and Abuse Act, but this doesn't override LinkedIn's ToS or make all scraping legally risk-free in all jurisdictions.&lt;/p&gt;

&lt;p&gt;Be precise about what you actually need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Personal profile data is subject to GDPR and CCPA. Have a documented legal basis.&lt;/li&gt;
&lt;li&gt;Don't scrape contact information at scale for cold outreach — that's the use case that triggers the most aggressive legal responses.&lt;/li&gt;
&lt;li&gt;Company firmographic data (headcount, industry, description) is the lowest-risk data type.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Scraping LinkedIn in 2026 requires addressing multiple detection layers simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TLS and HTTP/2 fingerprinting&lt;/strong&gt; — use a real browser or a library with Chrome-compatible fingerprints. Raw &lt;code&gt;requests&lt;/code&gt; doesn't pass.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residential proxies are not optional&lt;/strong&gt; — datacenter IPs are blocked at the ASN level.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session cookies (&lt;code&gt;li_at&lt;/code&gt;)&lt;/strong&gt; — required for full profile data. Inject them directly rather than automating login.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral mimicry&lt;/strong&gt; — randomize delays, simulate scrolling, stay under 80 profile views per 24 hours per account.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Target &lt;code&gt;ld+json&lt;/code&gt; and semantic attributes&lt;/strong&gt; — obfuscated class names are temporary. Structured data and ARIA attributes are stable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Company pages are far more accessible&lt;/strong&gt; than personal profiles. If firmographic data is sufficient, you don't need authenticated sessions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build vs. buy depends on volume and team bandwidth&lt;/strong&gt; — above ~5,000 profiles/day with SLA requirements, a managed API is often the right call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The maintenance burden is the real cost here. LinkedIn's detection evolves continuously. Budget time for selector updates, proxy pool rotation, and account management — or abstract that away entirely with a scraping API.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Scraping JavaScript-Heavy SPAs with Python: Dynamic Content, Infinite Scroll, and API Interception</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 26 Mar 2026 04:02:29 +0000</pubDate>
      <link>https://dev.to/alterlab/scraping-javascript-heavy-spas-with-python-dynamic-content-infinite-scroll-and-api-interception-20dk</link>
      <guid>https://dev.to/alterlab/scraping-javascript-heavy-spas-with-python-dynamic-content-infinite-scroll-and-api-interception-20dk</guid>
      <description>&lt;h1&gt;
  
  
  Scraping JavaScript-Heavy SPAs with Python: Dynamic Content, Infinite Scroll, and API Interception
&lt;/h1&gt;

&lt;p&gt;Modern web applications rarely serve their data in the initial HTML response. React, Vue, and Angular SPAs render content client-side, fetch data from internal APIs, and load more content as users scroll. If you're trying to scrape JavaScript-heavy SPAs with Python using a standard &lt;code&gt;requests&lt;/code&gt; + &lt;code&gt;BeautifulSoup&lt;/code&gt; pipeline, you'll fail immediately: the response you parse is an empty application shell, because the meaningful content only renders after JavaScript executes in a browser.&lt;/p&gt;

&lt;p&gt;This post covers three concrete techniques for extracting data from SPAs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Headless browser automation for rendered DOM extraction&lt;/li&gt;
&lt;li&gt;Network request interception to harvest raw API responses&lt;/li&gt;
&lt;li&gt;Programmatic infinite scroll handling&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Why &lt;code&gt;requests&lt;/code&gt; Fails Against SPAs
&lt;/h2&gt;

&lt;p&gt;When you &lt;code&gt;GET&lt;/code&gt; a typical SPA URL, the server returns a near-empty shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;&lt;/span&gt;My App&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"root"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;script &lt;/span&gt;&lt;span class="na"&gt;src=&lt;/span&gt;&lt;span class="s"&gt;"/static/js/main.chunk.js"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All product listings, search results, and user data are loaded asynchronously after the browser executes those script bundles. &lt;code&gt;requests&lt;/code&gt; never runs JavaScript — it only sees the shell.&lt;/p&gt;

&lt;p&gt;The content you want lives in one of two places:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The rendered DOM after JavaScript execution&lt;/li&gt;
&lt;li&gt;Raw JSON responses from the internal API calls that JavaScript makes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your scraping strategy depends on which is easier to access.&lt;/p&gt;




&lt;h2&gt;
  
  
  Choose Your Approach Before Writing Code
&lt;/h2&gt;

&lt;p&gt;Open DevTools → Network tab → filter by XHR/Fetch → reload the page. If you see clean JSON responses from readable endpoints like &lt;code&gt;/api/v1/products?page=2&lt;/code&gt;, you can skip the browser entirely and call those endpoints directly with &lt;code&gt;httpx&lt;/code&gt; or &lt;code&gt;requests&lt;/code&gt;. This is almost always faster and more reliable than browser automation.&lt;/p&gt;
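&lt;p&gt;Once an endpoint like that is identified, collection is just walking the page parameter until the API runs dry. The endpoint shape below (&lt;code&gt;?page=N&lt;/code&gt; returning a &lt;code&gt;results&lt;/code&gt; array) is an assumption for illustration, and the fetcher is injected so a thin &lt;code&gt;httpx&lt;/code&gt; wrapper can be passed in production:&lt;br&gt;
&lt;/p&gt;

```python
def collect_all_pages(fetch, base_url: str, max_pages: int = 100) -> list:
    """Drain a hypothetical page-numbered JSON API until a page comes back empty.

    `fetch(url)` must return parsed JSON; injecting it keeps this testable and
    lets production code supply an httpx- or requests-based fetcher.
    """
    items = []
    for page in range(1, max_pages + 1):
        data = fetch(f"{base_url}?page={page}")
        batch = data.get("results", [])
        if not batch:
            break  # an empty page means we are past the last result
        items.extend(batch)
    return items
```

In production, &lt;code&gt;fetch&lt;/code&gt; would be something like &lt;code&gt;lambda u: httpx.get(u, timeout=10).json()&lt;/code&gt; plus retry handling; the &lt;code&gt;max_pages&lt;/code&gt; cap guards against an API that never returns an empty page.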

&lt;p&gt;Only reach for a headless browser when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The API requires tokens generated client-side (complex HMAC signatures, rotating JWTs)&lt;/li&gt;
&lt;li&gt;Endpoints are obfuscated or dynamically constructed&lt;/li&gt;
&lt;li&gt;Data genuinely only exists in the rendered DOM with no backing API&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Best Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Content rendered into DOM&lt;/td&gt;
&lt;td&gt;Headless browser + DOM extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SPA fetches from internal API&lt;/td&gt;
&lt;td&gt;Network interception → direct HTTP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Predictable paginated API&lt;/td&gt;
&lt;td&gt;Direct HTTP (no browser needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infinite scroll feed&lt;/td&gt;
&lt;td&gt;Headless browser + scroll automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Virtual scrolling list&lt;/td&gt;
&lt;td&gt;Network interception (DOM won't hold all items)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Approach 1: Headless Browser with Playwright
&lt;/h2&gt;

&lt;p&gt;Playwright is the current standard for headless browser automation in Python. It supports Chromium, Firefox, and WebKit, has a clean async API, and handles modern JS frameworks well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;playwright
playwright &lt;span class="nb"&gt;install &lt;/span&gt;chromium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Waiting for the Right Moment
&lt;/h3&gt;

&lt;p&gt;The most common failure in SPA scraping is extracting the DOM before content has rendered. Playwright gives you several wait strategies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_spa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# "networkidle" waits until no network requests for 500ms
&lt;/span&gt;        &lt;span class="c1"&gt;# Use "domcontentloaded" when you'll wait on a selector anyway
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Wait for the specific element you need — don't rely on networkidle alone
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[data-testid=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;product-grid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
            () =&amp;gt; Array.from(
                document.querySelectorAll(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-testid=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product-card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)
            ).map(el =&amp;gt; ({
                title: el.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;h2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)?.textContent?.trim(),
                price: el.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;[data-price]&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)?.dataset?.price,
                url: el.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;a&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)?.href,
                image: el.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)?.src
            }))
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;scrape_spa&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example-shop.com/products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Extracted &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; products&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;wait_for_selector&lt;/code&gt; is more reliable than a fixed delay. It resolves as soon as the element exists in the DOM, often seconds earlier than a blanket &lt;code&gt;await asyncio.sleep(3)&lt;/code&gt;, and if the element never appears it raises a clear timeout error instead of silently handing you an empty page the way a too-short sleep does.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;evaluate()&lt;/code&gt; vs. Locators
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;page.evaluate()&lt;/code&gt; runs JavaScript directly in the browser context — useful for extracting many similar elements in a single round-trip. For targeted single-field reads, the locator API is cleaner:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;title&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;h1.product-title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;text_content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;price&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;locator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[data-price]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_attribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data-price&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use &lt;code&gt;evaluate()&lt;/code&gt; for mass extraction, locators for one-off field reads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 2: API Interception
&lt;/h2&gt;

&lt;p&gt;Many SPAs load data from internal REST or GraphQL APIs that return clean, structured JSON. You can intercept these responses from within Playwright without touching the DOM at all.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;intercept_api_responses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v2/listings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;content_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content-type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;content_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                        &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
                        &lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to parse &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;captured&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;intercept_api_responses&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example-marketplace.com/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you've identified the API pattern, replicate it directly with &lt;code&gt;httpx&lt;/code&gt; for production. The browser is only needed to observe which endpoints are called and what authentication headers they carry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Client-Side Auth Tokens
&lt;/h3&gt;

&lt;p&gt;If the API requires a bearer token generated in the browser:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;auth_token&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;on_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;auth_token&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/v2/listings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;auth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;auth_token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;removeprefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Now use auth_token directly with httpx for bulk pagination
&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;page_num&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://example-marketplace.com/api/v2/listings?page=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;page_num&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;auth_token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="n"&gt;captured&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This hybrid pattern — use the browser once to capture tokens, then direct HTTP for bulk pagination — is 10–50× faster than routing every request through Playwright.&lt;/p&gt;




&lt;h2&gt;
  
  
  Approach 3: Infinite Scroll Automation
&lt;/h2&gt;

&lt;p&gt;Infinite scroll triggers data loads when the user scrolls near the bottom of the page. The automation pattern is: scroll to the bottom, wait for new content to appear, extract, repeat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_infinite_scroll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;domcontentloaded&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_selector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.item-card&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;stall_rounds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
                () =&amp;gt; Array.from(document.querySelectorAll(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.item-card&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)).map(el =&amp;gt; ({
                    id: el.dataset.id,
                    title: el.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;h3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)?.textContent?.trim(),
                    price: el.querySelector(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.price&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)?.textContent?.trim()
                }))
            &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;new_items&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;new_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;stall_rounds&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stall_rounds&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# End of feed or load failure
&lt;/span&gt;            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;stall_rounds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
                &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;new_items&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;seen_ids&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
                    &lt;span class="n"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;window.scrollTo(0, document.body.scrollHeight)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait_for_timeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for new content to render
&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;items&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key decisions in this pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Track by ID, not count.&lt;/strong&gt; A &lt;code&gt;seen_ids&lt;/code&gt; set prevents reprocessing items that stay in the DOM after scroll. Counting total DOM nodes is unreliable if the page removes old items as new ones load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stall detection.&lt;/strong&gt; Three consecutive scroll cycles with no new items means you've hit the end of the feed or a silent load failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scroll target.&lt;/strong&gt; &lt;code&gt;document.body.scrollHeight&lt;/code&gt; works when the document itself scrolls. If the scrollable container is a nested div, target it: &lt;code&gt;document.querySelector('.feed-container').scrollTo(0, 99999)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
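&lt;p&gt;The track-by-ID bookkeeping can be pulled into a pure helper and unit-tested without a browser. A minimal sketch (the helper name is illustrative):&lt;/p&gt;

```python
def absorb_new_items(current: list[dict], seen_ids: set, items: list[dict]) -> int:
    """Add items with unseen stable IDs to the accumulator; return how many were new."""
    added = 0
    for item in current:
        if item["id"] not in seen_ids:
            seen_ids.add(item["id"])
            items.append(item)
            added += 1
    return added

# In the scroll loop, stall counting becomes:
# stall_rounds = 0 if absorb_new_items(batch, seen_ids, items) else stall_rounds + 1
```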

&lt;h3&gt;
  
  
  Virtual Scrolling Is a Different Problem
&lt;/h3&gt;

&lt;p&gt;React-window and similar virtualization libraries render only visible rows and recycle DOM nodes as you scroll. You cannot collect all items from the DOM simultaneously — items outside the viewport don't exist as DOM nodes.&lt;/p&gt;

&lt;p&gt;For virtual scrolling, API interception is almost always the correct solution. The virtualized list is backed by data loaded from somewhere; intercept those API calls instead of fighting the DOM.&lt;/p&gt;




&lt;h2&gt;
  
  
  Anti-Bot Considerations
&lt;/h2&gt;

&lt;p&gt;SPAs behind Cloudflare, Akamai, or PerimeterX fingerprint browser characteristics at the JavaScript level: canvas rendering, WebGL parameters, audio context, font enumeration, navigator properties. A stock Playwright instance fails these checks.&lt;/p&gt;

&lt;p&gt;Mitigation strategies, in order of practical effectiveness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;playwright-stealth&lt;/code&gt;&lt;/strong&gt;: Patches the most common fingerprint detection vectors. Start here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real Chrome with a user data directory&lt;/strong&gt;: Launch against an installed Chrome (&lt;code&gt;channel="chrome"&lt;/code&gt;) with an existing profile, so cookies, storage, and extension state look like a real user's browser.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Residential proxies&lt;/strong&gt;: Many bot detectors block datacenter IP ranges regardless of browser fingerprinting. Fix IP reputation before spending time on JS patches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Managed scraping APIs&lt;/strong&gt;: Services like &lt;a href="https://alterlab.io" rel="noopener noreferrer"&gt;AlterLab&lt;/a&gt; handle browser fingerprinting, proxy rotation, and bypass as infrastructure — you POST a URL and get back rendered HTML or a JSON payload without managing browser fleets.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;playwright-stealth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright.async_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;async_playwright&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;playwright_stealth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stealth_async&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_with_stealth&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;new_page&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;stealth_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Apply patches before navigation
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;goto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wait_until&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;networkidle&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Performance at Scale
&lt;/h2&gt;

&lt;p&gt;A single Chromium instance uses 200–400 MB RAM. For pipelines scraping thousands of pages:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reuse the browser instance; create a fresh context per job.&lt;/strong&gt; &lt;code&gt;browser.new_context()&lt;/code&gt; is cheap; &lt;code&gt;browser.launch()&lt;/code&gt; is expensive. Launch one browser and give each isolated job its own context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block unnecessary resources.&lt;/strong&gt; Images, fonts, and stylesheets are irrelevant for data extraction and meaningfully slow down page loads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;route&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;**/*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resource_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;font&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stylesheet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;media&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;continue_&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Blocking images alone cuts load time by 30–60% on image-heavy SPAs.&lt;/p&gt;
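&lt;p&gt;The abort-or-continue decision can be factored into a named predicate, which keeps the route handler short and lets you unit-test the blocklist. A sketch (names are illustrative):&lt;/p&gt;

```python
BLOCKED_TYPES = frozenset({"image", "font", "stylesheet", "media"})

def should_block(resource_type: str) -> bool:
    """True for resource types that never affect extracted data."""
    return resource_type in BLOCKED_TYPES

# In the route handler:
# await page.route("**/*", lambda r: r.abort()
#     if should_block(r.request.resource_type) else r.continue_())
```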

&lt;p&gt;&lt;strong&gt;Run contexts in parallel.&lt;/strong&gt; Use &lt;code&gt;asyncio.gather()&lt;/code&gt; to run multiple page scrapes concurrently within one browser instance. Keep concurrency at 3–5 pages per browser; beyond that, CPU contention negates the gains.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scrape_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;]]:&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;async_playwright&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;browser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chromium&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;headless&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;tasks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;scrape_with_context&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;urls&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_exceptions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;browser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Use When&lt;/th&gt;
&lt;th&gt;Skip When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DOM extraction (Playwright)&lt;/td&gt;
&lt;td&gt;Data only in rendered HTML&lt;/td&gt;
&lt;td&gt;API is accessible&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API interception + direct HTTP&lt;/td&gt;
&lt;td&gt;API exists, data is structured JSON&lt;/td&gt;
&lt;td&gt;Token rotation is too complex&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infinite scroll automation&lt;/td&gt;
&lt;td&gt;Feed-style pages with scroll triggers&lt;/td&gt;
&lt;td&gt;Site uses virtual scrolling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Managed scraping API&lt;/td&gt;
&lt;td&gt;High-volume, anti-bot protected targets&lt;/td&gt;
&lt;td&gt;Simple unprotected targets&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The sequence that works for most SPA scraping projects:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Open the Network tab before writing any code.&lt;/strong&gt; If the SPA calls a clean API endpoint, skip the browser entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use &lt;code&gt;wait_for_selector&lt;/code&gt;, not &lt;code&gt;networkidle&lt;/code&gt; alone.&lt;/strong&gt; Wait for the specific element you need.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intercept requests to capture auth tokens.&lt;/strong&gt; Use the browser once, then switch to direct HTTP for bulk pagination.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Infinite scroll: track items by stable ID, not count.&lt;/strong&gt; Stop when stall detection triggers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block images and fonts in browser pipelines.&lt;/strong&gt; Free 30–60% speed improvement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix IP reputation before fingerprinting patches.&lt;/strong&gt; Residential proxies solve most bot blocks; stealth patches solve the rest.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The most common over-engineering mistake is defaulting to headless browsers when &lt;code&gt;httpx&lt;/code&gt; and a couple of curl-derived headers would have worked. Start simple, escalate only when blocked.&lt;/p&gt;

</description>
      <category>api</category>
      <category>javascript</category>
      <category>python</category>
      <category>webscraping</category>
    </item>
    <item>
      <title>Best Web Scraping APIs in 2026: Complete Comparison Guide</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 26 Mar 2026 03:46:39 +0000</pubDate>
      <link>https://dev.to/alterlab/best-web-scraping-apis-in-2026-complete-comparison-guide-155n</link>
      <guid>https://dev.to/alterlab/best-web-scraping-apis-in-2026-complete-comparison-guide-155n</guid>
      <description>&lt;p&gt;If you're building anything that needs web data at scale — price monitoring, lead generation, AI training datasets, or competitive intelligence — you've probably realized that writing your own scraper is a maintenance nightmare. Anti-bot systems evolve weekly, proxies get burned, and CAPTCHAs multiply like rabbits.&lt;/p&gt;

&lt;p&gt;That's where web scraping APIs come in. Instead of managing browser farms and proxy pools yourself, you send a URL and get back clean data. But the market has exploded. There are now dozens of options, each with different pricing models, anti-bot strategies, and trade-offs.&lt;/p&gt;

&lt;p&gt;We tested and researched eight of the most popular web scraping APIs in 2026 to help you pick the right one for your use case and budget. This guide covers pricing, anti-bot capabilities, JavaScript rendering, output formats, free tiers, and the nuances that marketing pages don't tell you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use a Web Scraping API?
&lt;/h2&gt;

&lt;p&gt;Before diving into comparisons, let's be clear about when a scraping API makes sense versus building your own solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You should use a scraping API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Anti-bot bypass is eating your engineering time. Cloudflare, DataDome, PerimeterX, and Akamai update their bot detection constantly. A dedicated API team handles this so you don't have to.&lt;/li&gt;
&lt;li&gt;You need reliable proxy infrastructure without managing it. Rotating residential and datacenter proxies across geographies is expensive and operationally complex.&lt;/li&gt;
&lt;li&gt;JavaScript rendering is required. Many modern sites serve empty HTML shells that require a full browser to render. Running headless Chrome at scale is resource-intensive.&lt;/li&gt;
&lt;li&gt;You want to focus on what you do with the data, not how you collect it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You probably don't need one if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're scraping a handful of static pages that don't block bots.&lt;/li&gt;
&lt;li&gt;You already have a working Scrapy/Playwright setup and the target sites haven't changed their anti-bot measures.&lt;/li&gt;
&lt;li&gt;Your budget is zero and your volume is under a few hundred pages per day.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What We Compared
&lt;/h2&gt;

&lt;p&gt;Every API was evaluated on six core dimensions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pricing model&lt;/strong&gt; — Subscription vs. pay-as-you-go, credit systems, minimum commitments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anti-bot capabilities&lt;/strong&gt; — How well it handles Cloudflare, DataDome, CAPTCHAs, and fingerprinting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript rendering&lt;/strong&gt; — Built-in headless browser, cost implications, rendering quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output formats&lt;/strong&gt; — HTML, JSON, Markdown, structured data extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Free tier&lt;/strong&gt; — What you can actually do without paying&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proxy infrastructure&lt;/strong&gt; — Residential, datacenter, mobile, geo-targeting options&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  The Contenders
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AlterLab
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: Developers who want pay-per-success pricing with automatic anti-bot escalation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AlterLab takes a different approach from most scraping APIs. Instead of charging a flat rate per request regardless of difficulty, it uses a tiered system that automatically escalates from the cheapest method to more expensive ones only when needed. You only pay for the tier that actually succeeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Pure pay-as-you-go with no subscriptions. Tier 1 (simple curl) costs $0.0002/request (5,000 per dollar), while the most expensive Tier 5 (CAPTCHA solving) costs $0.02/request. The API starts at the cheapest tier and escalates automatically, so you never overpay for sites that respond to a simple HTTP request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Five-tier escalation system — curl, HTTP with TLS fingerprinting, stealth browser impersonation (curl_cffi), full Playwright browser automation, and CAPTCHA solving. The system learns which tier works for each domain and skips straight to the effective tier on subsequent requests.&lt;/p&gt;
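&lt;p&gt;The escalate-and-remember pattern can be sketched in plain Python. The tier names and the Tier 1/4/5 prices mirror the ones quoted here; the two intermediate costs and the dispatch logic are a hypothetical illustration, not AlterLab's implementation:&lt;/p&gt;

```python
# Hypothetical sketch of tiered escalation with per-domain memory.
# Tier 1/4/5 costs mirror the published figures; the two intermediate
# costs are assumptions, and the fetchers are stand-in callables.
TIERS = [
    ("curl", 0.0002),
    ("tls_http", 0.0005),   # assumed intermediate cost
    ("stealth", 0.001),     # assumed intermediate cost
    ("browser", 0.004),
    ("captcha", 0.02),
]

class EscalatingScraper:
    def __init__(self, fetchers):
        self.fetchers = fetchers   # tier name mapped to callable(url) -> html or None
        self.best_tier = {}        # domain -> index of first tier that worked

    def fetch(self, domain, url):
        start = self.best_tier.get(domain, 0)
        for i in range(start, len(TIERS)):
            name, cost = TIERS[i]
            html = self.fetchers[name](url)
            if html is not None:
                self.best_tier[domain] = i  # skip straight here next time
                return html, cost           # pay only for the tier that succeeded
        return None, 0.0
```

&lt;p&gt;The per-domain memory is what makes the model cheap in practice: after the first request, a hard domain no longer pays for the failed cheap attempts.&lt;/p&gt;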

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Available via Tier 4 (browser automation) at $0.004/request. Also offers a lightweight JSON extraction mode (Tier 3.5) that pulls structured data without launching a full browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; HTML, JSON, Markdown, and structured data extraction. Multi-format responses are supported in a single request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Free credits on signup to test the API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; Built-in proxy rotation across datacenter and residential IPs. Also supports BYOP (Bring Your Own Proxy) with a 20% discount since AlterLab doesn't incur proxy costs for those requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; The tiered pricing means if 80% of your target sites respond to a basic HTTP request, you pay $0.0002 each ($0.20 per thousand) for those — not the $1-3 per thousand that flat-rate APIs charge. The savings compound at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Newer platform with a smaller user community compared to established players. Documentation is growing but not as extensive as ScraperAPI or Bright Data yet.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. ScraperAPI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: General-purpose scraping with a simple API and generous free tier&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ScraperAPI is one of the most well-known scraping APIs and a solid default choice for many developers. It handles proxy rotation, CAPTCHA bypassing, and JavaScript rendering behind a single API endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Subscription-based. Free plan includes 5,000 credits on signup plus 1,000 monthly. Hobby plan at $49/month for 100,000 credits, Startup at $149/month for 1,000,000 credits, Business at $299/month for 3,000,000 credits. As of early 2026, they also introduced a pay-as-you-go overflow model for when you exceed your plan limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Automatic proxy rotation, CAPTCHA handling, and header management. Works well for most common protections. Advanced anti-bot sites (DataDome, PerimeterX) may require higher-tier plans with more credits per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Available on all plans. JavaScript rendering uses 10 credits per request (versus 1 for standard), which effectively makes it 10x more expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; Raw HTML. Structured data extraction is available through their DataPipeline product for specific domains (Amazon, Google, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; 5,000 initial credits + 1,000/month. Limited to 5 concurrent connections. Decent for testing but runs out quickly in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; 40M+ IPs across datacenter and residential pools. Geotargeting available. Premium residential proxies on higher plans.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; Simplicity. Single API endpoint, well-documented, large community, and wide language support. If you just want something that works without fuss, ScraperAPI delivers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Subscription model means you pay monthly whether you scrape or not. JS rendering at 10x credit cost adds up fast. No structured data extraction from the core API.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Bright Data
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: Enterprise-scale operations that need the full proxy and data infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bright Data (formerly Luminati) is the 800-pound gorilla of the web data industry. They offer everything from raw proxy access to managed scraping APIs to pre-built datasets. Their infrastructure is massive, but so is the complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Web Scraper API uses flat-rate pricing of $1.50-2.50 per 1,000 requests. Subscription plans start at $499/month. Pay-as-you-go available but more expensive per request. Their Scraping Browser is priced separately at $9.50/GB plus $0.10/hour.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Industry-leading. Bright Data has the largest proxy network in the world (72M+ IPs) and their unlocker technology handles virtually any anti-bot system. If a site can be scraped, Bright Data can probably do it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Available through their Scraping Browser product. Full Chrome-based rendering with session management. Powerful but priced separately from the Scraper API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; HTML, JSON, and structured data for supported domains. Their Web Scraper IDE lets you build custom extraction logic visually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Free trial with limited credits. No permanent free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; The largest in the industry — 72M+ residential, datacenter, ISP, and mobile IPs across every country. This is Bright Data's core product and it's unmatched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; Unmatched proxy diversity and success rates on heavily protected sites. If you're scraping at enterprise volume or need guaranteed access to difficult targets, Bright Data has the infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Pricing is complex and can be unpredictable. The $499/month minimum for subscriptions is steep for smaller operations. Multiple products with separate billing (Scraper API, Scraping Browser, proxy access) can get confusing. Some users report bill shock from unexpected bandwidth charges.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Firecrawl
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: AI/LLM developers who need clean Markdown output for RAG pipelines&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Firecrawl has carved out a strong niche in the AI space. While other APIs focus on raw HTML, Firecrawl is built specifically to turn web pages into LLM-ready Markdown and structured data. If you're building a RAG pipeline or training dataset, Firecrawl speaks your language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Credit-based. Free plan gives 500 credits. Hobby plan at $16/month, Standard at $83/month, Growth at $333/month for 500,000 credits. Most scraping costs 1 credit per page. Their AI-powered /extract endpoint bills by tokens instead of credits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Basic anti-bot handling. Firecrawl focuses more on content extraction quality than bypassing heavy protections. For heavily protected sites, you may need to combine it with a proxy service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Built-in. Most pages are rendered with JavaScript by default. The Growth plan supports up to 100 concurrent browsers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; This is where Firecrawl excels. Native Markdown output, structured JSON extraction via LLM, and clean HTML. The /extract endpoint uses AI to pull structured data from any page without writing selectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; 500 credits (pages) for free. Enough to evaluate the API for a small project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; Basic proxy rotation included. Not their focus area — don't expect Bright Data-level geo-targeting or residential IPs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; First-class Markdown output and AI-powered extraction. If your use case is feeding web data into an LLM, Firecrawl's output quality is hard to beat. It's also open-source (self-hostable).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Weaker anti-bot bypass compared to dedicated scraping APIs. Not the right tool if you're scraping protected e-commerce sites or need raw performance at scale. The AI extraction endpoint can get expensive with token-based billing.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. Apify
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: Teams that want pre-built scraping actors for specific websites&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apify is less of a single API and more of a full scraping platform. Their "Actor" marketplace has thousands of pre-built scrapers for specific sites (Amazon, Google, LinkedIn, etc.). You can also build and deploy custom scrapers using their SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Pay-as-you-go based on compute units, storage, and proxy usage. Free tier gives $5/month in platform credits. Paid plans start at $49/month. Additional costs apply for proxies ($0.60+ per datacenter IP), Actor memory, and parallel runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Varies by Actor. Pre-built Actors for popular sites include anti-bot logic specific to that site. For custom scrapers, you can use Apify's proxy infrastructure, but you're largely responsible for anti-bot handling yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Full Playwright and Puppeteer support. Actors can run headless browsers natively on Apify's cloud infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; Depends on the Actor. Most return JSON. Platform supports exporting to CSV, JSON, XML, and direct integrations with Google Sheets, Slack, Zapier, and databases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; $5/month in credits on the free plan. Enough for small-scale testing but limited for production use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; Apify Proxy combines datacenter and residential IPs. Included in all plans but with usage limits. Smart rotation available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; The Actor marketplace. Instead of building a scraper from scratch, you can often find a pre-built, community-maintained Actor for your target site. The platform handles scheduling, storage, and monitoring.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Pricing can be confusing with multiple cost dimensions (compute, storage, proxy, memory). Pre-built Actors may break when target sites update. You're dependent on community maintenance for third-party Actors.&lt;/p&gt;




&lt;h3&gt;
  
  
  6. ZenRows
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: Developers focused on bypassing anti-bot systems on protected websites&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;ZenRows is laser-focused on anti-bot bypass. If your primary challenge is getting past Cloudflare, DataDome, or PerimeterX, ZenRows is designed specifically for that problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Tiered subscription based on request volume. All plans include the full product suite (Universal Scraper API, Scraping Browser, Residential Proxies), and business-tier plans run a few hundred dollars per month. Volume discounts available for quarterly, semi-annual, and annual billing. You only pay for successful requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; This is ZenRows' core strength. Their Universal Scraper API includes advanced anti-bot modes for Cloudflare, DataDome, and other major protection systems. High success rates on difficult targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Available through their Scraping Browser product. Separate from the basic API requests and uses more of your plan allocation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; HTML, with options for CSS/XPath selectors to extract specific elements. AI-powered extraction in beta.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Limited free trial. No permanent free tier for ongoing use.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; Built-in residential proxy rotation. All plans include proxy access. Geo-targeting available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; Industry-leading anti-bot bypass rates. If you need reliable access to heavily protected sites, ZenRows consistently ranks among the best. They also only charge for successful requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Higher entry price than some competitors. Limited output format options — primarily HTML, not optimized for Markdown or structured data like Firecrawl. The UI and documentation could be more polished.&lt;/p&gt;




&lt;h3&gt;
  
  
  7. Crawlbase
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: Budget-conscious teams that need basic scraping with storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Crawlbase (formerly ProxyCrawl) offers a straightforward scraping API with an interesting twist — built-in data storage. Their pricing is competitive at the lower end, making them a good choice for teams watching their budget.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Starts at $29/month, with basic requests priced per 1,000. They categorize requests into Standard, Moderate, and Complex tiers based on the target site difficulty, each with different pricing. Free trial with initial credits and up to 10,000 stored documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Handles standard protections with proxy rotation and header management. JavaScript rendering available for dynamic sites. Not as strong as ZenRows or Bright Data on heavily protected targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Available with their JavaScript rendering mode. Adds to the cost per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; HTML, JSON, and CSV. Built-in data storage lets you accumulate scraped data without building your own storage layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Free trial credits. 10,000 document storage limit on free accounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; Millions of rotating proxies including residential IPs. Geo-targeting available.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; Built-in data storage and the affordable entry point. If you need a simple scraping API without enterprise complexity, Crawlbase delivers reasonable value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Limited anti-bot capabilities compared to premium providers. The tiered complexity pricing (Standard/Moderate/Complex) can be unpredictable if your target sites vary widely.&lt;/p&gt;




&lt;h3&gt;
  
  
  8. Oxylabs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Best for: Enterprise teams that need specialized scraping APIs for e-commerce and SERP data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Oxylabs is another enterprise-grade provider with a strong focus on specific verticals — particularly e-commerce and search engine data. Their specialized APIs are pre-tuned for these use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Web Scraper API starts at $49/month for 17,500 results ($2.80 per 1,000). Specialized SERP and E-Commerce APIs available at similar price points. You only pay for successful scrapes — 5xx and 6xx errors are free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot bypass:&lt;/strong&gt; Strong anti-bot capabilities backed by a large proxy network. Particularly effective for e-commerce sites and search engines, which are their primary focus areas.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;JS rendering:&lt;/strong&gt; Available through their Headless Browser feature. Included in all API plans but consumes more traffic, increasing effective cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output formats:&lt;/strong&gt; HTML and JSON. Specialized APIs return pre-structured data for their supported domains (product data, search results, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Free tier:&lt;/strong&gt; Free trial available. No permanent free tier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy infrastructure:&lt;/strong&gt; 100M+ IPs including residential, datacenter, ISP, and mobile proxies. Strong geo-targeting capabilities. Particularly well-suited for location-specific scraping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Standout feature:&lt;/strong&gt; Specialized, pre-built APIs for e-commerce (Amazon, eBay, Walmart) and SERP data. If your primary use case is price monitoring or search ranking tracking, Oxylabs' tailored solutions save development time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation:&lt;/strong&gt; Enterprise pricing isn't friendly to small teams. The $49/month minimum with limited results means you're paying a premium per request at lower volumes. General-purpose scraping isn't their strongest suit.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pricing Comparison
&lt;/h2&gt;

&lt;p&gt;Pricing is where things get tricky because every API uses a different model. Here's a normalized comparison based on what you'd actually pay for common scenarios.&lt;/p&gt;

&lt;p&gt;Note that these are baseline costs for standard (non-JS) requests. Costs increase significantly for JavaScript rendering, anti-bot bypass, and CAPTCHA solving across all platforms. AlterLab's advantage narrows on complex requests — Tier 4 (browser) costs $4 per thousand, and Tier 5 (CAPTCHA) costs $20 per thousand, which is competitive but not dramatically cheaper than alternatives.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which API for Which Use Case?
&lt;/h2&gt;

&lt;p&gt;Not every API is right for every job. Here's a quick decision framework:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For AI/LLM data pipelines:&lt;/strong&gt; Firecrawl is purpose-built for this. Clean Markdown output, AI extraction, and self-hosting option. AlterLab is a solid alternative if you need anti-bot bypass that Firecrawl can't handle, since it also supports Markdown output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For price monitoring and e-commerce:&lt;/strong&gt; Oxylabs or Bright Data. Their specialized e-commerce APIs return pre-structured product data, saving you from writing extraction logic. ScraperAPI also works well for simpler e-commerce targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For heavily protected sites (Cloudflare, DataDome):&lt;/strong&gt; ZenRows or Bright Data. These two have the strongest anti-bot bypass technology. AlterLab's tiered approach handles most protections well and costs less for mixed-difficulty targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For budget-conscious developers:&lt;/strong&gt; AlterLab's pay-as-you-go model (no subscriptions, minimal upfront spend) or Firecrawl's $16/month Hobby plan are the most accessible starting points. Crawlbase is another affordable option at $29/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For large-scale enterprise operations:&lt;/strong&gt; Bright Data or Oxylabs. The infrastructure depth, compliance certifications, SLA guarantees, and dedicated account management matter at enterprise scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For teams wanting pre-built scrapers:&lt;/strong&gt; Apify's Actor marketplace saves development time if someone has already built a scraper for your target site. Check the marketplace before building from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For mixed workloads (easy and hard sites combined):&lt;/strong&gt; AlterLab's automatic tier escalation shines here. You pay $0.0002 for sites that respond to curl and $0.004 for sites that need a full browser — without configuring anything. Flat-rate APIs charge you the same price regardless of difficulty.&lt;/p&gt;
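&lt;p&gt;The 80/20 arithmetic behind the mixed-workload claim, with the flat-rate comparison price picked as an illustrative midpoint rather than any vendor's actual rate:&lt;/p&gt;

```python
# Blended cost for a mixed workload: 80% of pages respond to a plain
# HTTP request, 20% need a full browser (per-request prices from above).
easy_share, easy_cost = 0.80, 0.0002
hard_share, hard_cost = 0.20, 0.004

blended = easy_share * easy_cost + hard_share * hard_cost
print(f"tiered:    ${blended * 1000:.2f} per 1,000 requests")  # $0.96

# A flat-rate API charging $2 per 1,000 (illustrative midpoint) bills
# every request at the same price regardless of difficulty:
flat_per_thousand = 2.00
print(f"flat-rate: ${flat_per_thousand:.2f} per 1,000 requests")
```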

&lt;p&gt;List the sites you need to scrape and note their anti-bot protections&lt;br&gt;
  Calculate monthly request volume to compare pricing models accurately&lt;br&gt;
  Use free credits from 2-3 APIs to test success rates on your actual targets&lt;br&gt;
  Factor in JS rendering costs, failed request charges, and concurrency limits&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;There is no single "best" web scraping API.&lt;/strong&gt; The right choice depends on your specific targets, volume, budget, and output format needs. That said, here are some patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If cost predictability matters most&lt;/strong&gt;, look at APIs that only charge for successful requests (AlterLab, ZenRows, Oxylabs). Getting billed for failed attempts adds up fast on difficult sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're scraping mixed-difficulty sites&lt;/strong&gt;, tiered pricing (AlterLab) saves money compared to flat-rate models. Paying browser-rendering prices for a site that responds to curl is wasteful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If anti-bot bypass is your primary challenge&lt;/strong&gt;, ZenRows and Bright Data have the deepest anti-bot technology. They cost more, but they work on the hardest targets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you're building for AI&lt;/strong&gt;, Firecrawl's native Markdown and AI extraction features will save you post-processing pipeline development time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want maximum flexibility with minimal commitment&lt;/strong&gt;, pay-as-you-go models (AlterLab, Apify) let you scale up and down without paying for unused capacity.&lt;/p&gt;

&lt;p&gt;The web scraping API market continues to evolve rapidly. Anti-bot systems get harder, APIs get smarter, and pricing models keep innovating. Whatever you choose, start with a free tier, test against your actual target sites, and make your decision based on real success rates — not marketing claims.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>restapi</category>
      <category>comparison</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>Rotating Proxies for Web Scraping: What Works and What Wastes Money</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 26 Mar 2026 03:38:57 +0000</pubDate>
      <link>https://dev.to/alterlab/rotating-proxies-for-web-scraping-what-works-and-what-wastes-money-370p</link>
      <guid>https://dev.to/alterlab/rotating-proxies-for-web-scraping-what-works-and-what-wastes-money-370p</guid>
      <description>&lt;h1&gt;
  
  
  Rotating Proxies for Web Scraping: What Works and What Wastes Money
&lt;/h1&gt;

&lt;p&gt;Proxies are not magic. Slapping a proxy rotation layer onto a bad scraper does not make it good. But when your scraper is solid and you need to scale without getting IP-banned, proxy strategy matters.&lt;/p&gt;

&lt;p&gt;Here is what actually works, what costs what, and when proxies are not the answer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxy Types and Their Real Costs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Datacenter Proxies
&lt;/h3&gt;

&lt;p&gt;Cheap ($0.50-2 per IP per month), fast, and the first thing most scrapers try. They come from cloud providers like AWS, GCP, OVH.&lt;/p&gt;

&lt;p&gt;The problem: most anti-bot systems maintain lists of datacenter IP ranges. If a site uses Cloudflare, DataDome, or PerimeterX, datacenter proxies get flagged before your request reaches the server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; Sites without serious bot protection. Internal tools, basic APIs, public government data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad for:&lt;/strong&gt; E-commerce, social media, any site behind a CDN with bot protection.&lt;/p&gt;

&lt;h3&gt;
  
  
  Residential Proxies
&lt;/h3&gt;

&lt;p&gt;Real IPs from ISPs, routed through actual home connections. They look like normal users because they are normal user IPs.&lt;/p&gt;

&lt;p&gt;Cost: $8-15 per GB of bandwidth. A typical web page with images is 2-5 MB. At $10/GB, that is $0.02-0.05 per page load. It adds up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; Sites with strong bot protection. E-commerce scraping (Amazon, Walmart, Target). Social media data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bad for:&lt;/strong&gt; High-volume scraping where bandwidth costs matter. Downloading large files or media.&lt;/p&gt;
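&lt;p&gt;The per-page arithmetic is worth writing out, since bandwidth, not request count, drives residential proxy spend:&lt;/p&gt;

```python
# Residential proxy cost per page at $10/GB bandwidth pricing.
price_per_gb = 10.00
for page_mb in (2, 5):
    cost = price_per_gb * page_mb / 1024  # MB to GB, times $/GB
    print(f"{page_mb} MB page: ${cost:.3f}")  # 2 MB: $0.020, 5 MB: $0.049
```

&lt;p&gt;This is also why blocking images and media in browser pipelines pays off twice: faster loads and a smaller bandwidth bill.&lt;/p&gt;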

&lt;h3&gt;
  
  
  ISP Proxies
&lt;/h3&gt;

&lt;p&gt;Static IPs from ISPs hosted in datacenters. They have the reputation of residential IPs with the speed of datacenter ones.&lt;/p&gt;

&lt;p&gt;Cost: $2-5 per IP per month. More expensive than datacenter but cheaper per request than residential (since you pay per IP, not per GB).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; When you need consistent IPs (login sessions, account management). Medium-difficulty targets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mobile Proxies
&lt;/h3&gt;

&lt;p&gt;IPs from mobile carriers. These have the best reputation because mobile IPs are shared among thousands of users through carrier-grade NAT. Anti-bot systems are reluctant to block them.&lt;/p&gt;

&lt;p&gt;Cost: $20-50 per GB. The most expensive option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good for:&lt;/strong&gt; The hardest targets. When everything else gets blocked.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rotation Strategies
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Round-Robin Rotation
&lt;/h3&gt;

&lt;p&gt;Cycle through your proxy pool sequentially. Simple to implement, works fine for sites that do not do session tracking.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy1:port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy2:port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;proxy3:port&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;proxy_cycle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cycle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_next_proxy&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proxy_cycle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The issue: if a proxy gets banned, you keep rotating back to it. You need health checks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sticky Sessions
&lt;/h3&gt;

&lt;p&gt;Keep the same IP for a sequence of related requests. Important when scraping paginated results or sites that track sessions.&lt;/p&gt;

&lt;p&gt;Most residential proxy providers support this with session IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# With a residential proxy provider
# Same session ID = same IP for ~10 minutes
&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://user-session_abc123:pass@gate.provider.com:7777&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
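&lt;p&gt;If your provider does not hand you session URLs, you can mint them yourself. A minimal sketch, assuming the &lt;code&gt;user-session_&amp;lt;id&amp;gt;&lt;/code&gt; username scheme shown above (&lt;code&gt;sticky_proxy_url&lt;/code&gt; is an illustrative name, and the exact format varies by vendor, so check your provider's docs):&lt;/p&gt;

```python
import uuid

def sticky_proxy_url(user: str, password: str,
                     gateway: str = "gate.provider.com:7777") -> str:
    """Build a proxy URL pinned to a fresh session ID.

    The username scheme mirrors the provider-style example above; it is
    an assumption, not a universal standard.
    """
    session_id = uuid.uuid4().hex[:8]
    return f"http://{user}-session_{session_id}:{password}@{gateway}"
```

&lt;p&gt;Generate one URL per logical session, for example per paginated listing you walk, and reuse it until the provider rotates the IP out from under you.&lt;/p&gt;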



&lt;h3&gt;
  
  
  Smart Rotation with Backoff
&lt;/h3&gt;

&lt;p&gt;The approach that works best in practice. Track which proxies are healthy, back off when one gets flagged, and prioritize proxies with recent success.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ProxyPool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;proxies&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_fail&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_proxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_fail&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;available&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;proxies&lt;/span&gt;  &lt;span class="c1"&gt;# reset if all are in backoff
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;choice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report_failure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_fail&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;report_success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;failures&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;proxy&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
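&lt;p&gt;Wiring the pool into a request loop is straightforward. A sketch, assuming the &lt;code&gt;ProxyPool&lt;/code&gt; interface above; &lt;code&gt;fetch_with_pool&lt;/code&gt; is an illustrative name, and the &lt;code&gt;get&lt;/code&gt; callable (for example a thin wrapper around &lt;code&gt;requests.get&lt;/code&gt; with a &lt;code&gt;proxies=&lt;/code&gt; dict) is injected so the loop stays HTTP-client agnostic:&lt;/p&gt;

```python
def fetch_with_pool(pool, url: str, get, max_attempts: int = 3):
    """Try up to max_attempts proxies, reporting health back to the pool.

    `get(url, proxy)` performs the actual HTTP request and returns
    (status_code, text). `pool` exposes get_proxy/report_success/
    report_failure, like the ProxyPool sketch above.
    """
    for _ in range(max_attempts):
        proxy = pool.get_proxy()
        try:
            status, text = get(url, proxy)
        except Exception:
            pool.report_failure(proxy)  # connection errors count as failures
            continue
        if status == 200:
            pool.report_success(proxy)
            return text
        pool.report_failure(proxy)
    return None
```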



&lt;h2&gt;
  
  
  When to Skip Proxies Entirely
&lt;/h2&gt;

&lt;p&gt;Proxies solve one problem: IP-based blocking. But many scraping failures are not about IPs at all.&lt;/p&gt;

&lt;p&gt;If you are getting CAPTCHAs, the problem is usually fingerprinting, not your IP address. Adding more proxies to a detectable scraper just burns through IPs faster.&lt;/p&gt;

&lt;p&gt;If the site requires JavaScript rendering, you need a browser, not a proxy. A proxy on top of raw HTTP requests does not help when the page content is loaded via client-side JS.&lt;/p&gt;

&lt;p&gt;If you are scraping fewer than 100 pages per day from a single site, you probably do not need proxy rotation at all. Most sites allow moderate request rates from a single IP.&lt;/p&gt;
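&lt;p&gt;Before buying more bandwidth, it helps to classify why requests fail. A rough triage sketch; the status codes and body markers here are illustrative heuristics, not a universal detection scheme:&lt;/p&gt;

```python
def diagnose_block(status_code: int, body: str) -> str:
    """Guess the block type so you fix the right layer."""
    lowered = body.lower()
    if "captcha" in lowered or "challenge" in lowered:
        return "fingerprinting"   # more proxies will not fix this
    if status_code in (403, 429):
        return "ip_blocking"      # rotation or backoff may help
    if "<noscript" in lowered:
        return "js_required"      # needs a browser, not a proxy
    return "unknown"
```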

&lt;h2&gt;
  
  
  The Build vs Buy Decision
&lt;/h2&gt;

&lt;p&gt;Building proxy rotation infrastructure means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Buying proxy bandwidth or pools&lt;/li&gt;
&lt;li&gt;Writing rotation logic with health checks&lt;/li&gt;
&lt;li&gt;Monitoring success rates and costs&lt;/li&gt;
&lt;li&gt;Handling retries, rate limits, and bans&lt;/li&gt;
&lt;li&gt;Maintaining this over time as anti-bot systems change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For most teams, the proxy layer is a distraction from the actual work. You are building a scraping tool, not a proxy management platform.&lt;/p&gt;

&lt;p&gt;Scraping APIs like AlterLab, ScraperAPI, and Bright Data bundle proxies into the service. You pay per successful request. If the request fails because of a proxy issue, you do not pay for it. The provider eats that cost and rotates to another proxy.&lt;/p&gt;

&lt;p&gt;AlterLab takes this further with a "bring your own proxy" option. If you already have proxy infrastructure you like, you can route AlterLab requests through your own proxies. You get the anti-bot bypass and JS rendering without paying for proxy bandwidth twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Match your proxy type to your target difficulty:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Target Difficulty&lt;/th&gt;
&lt;th&gt;Proxy Type&lt;/th&gt;
&lt;th&gt;Cost per Request&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No bot protection&lt;/td&gt;
&lt;td&gt;Datacenter or none&lt;/td&gt;
&lt;td&gt;&amp;lt; $0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Basic protection&lt;/td&gt;
&lt;td&gt;ISP proxies&lt;/td&gt;
&lt;td&gt;$0.001-0.005&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloudflare/DataDome&lt;/td&gt;
&lt;td&gt;Residential&lt;/td&gt;
&lt;td&gt;$0.02-0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardest targets&lt;/td&gt;
&lt;td&gt;Mobile&lt;/td&gt;
&lt;td&gt;$0.05-0.15&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;If proxy costs are eating your budget, you are either using the wrong proxy type for your target or scraping at a scale where an API service would be cheaper.&lt;/p&gt;

</description>
      <category>python</category>
      <category>proxies</category>
    </item>
    <item>
      <title>Web Scraping APIs vs DIY Scrapers: When to Stop Building Infrastructure</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 26 Mar 2026 03:38:51 +0000</pubDate>
      <link>https://dev.to/alterlab/web-scraping-apis-vs-diy-scrapers-when-to-stop-building-infrastructure-42ie</link>
      <guid>https://dev.to/alterlab/web-scraping-apis-vs-diy-scrapers-when-to-stop-building-infrastructure-42ie</guid>
      <description>&lt;h1&gt;
  
  
  Web Scraping APIs vs DIY Scrapers: When to Stop Building Infrastructure
&lt;/h1&gt;

&lt;p&gt;Every developer starts scraping the same way. Write a Python script, send some requests, parse the HTML. It works. Then you need to scrape a site with bot protection and suddenly you are shopping for proxies, patching headless browsers, and debugging TLS fingerprints at 2 AM.&lt;/p&gt;

&lt;p&gt;There is a point where building your own scraping infrastructure stops being productive and starts being a second job. The question is where that line is for your specific use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Build When You DIY
&lt;/h2&gt;

&lt;p&gt;A production scraping stack is not just a script. Here is the full inventory:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Request layer.&lt;/strong&gt; HTTP client with proper TLS fingerprinting, header management, cookie handling, redirect following.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Proxy layer.&lt;/strong&gt; Pool management, rotation logic, health checks, cost tracking, failover between proxy types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser layer.&lt;/strong&gt; Headless Chrome/Playwright instances, memory management, crash recovery, stealth patches, session isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-bot layer.&lt;/strong&gt; CAPTCHA solving integration, challenge detection, fingerprint maintenance as anti-bot systems update.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Queue and scheduling.&lt;/strong&gt; Rate limiting per domain, retry logic with backoff, deduplication, priority queues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Monitoring.&lt;/strong&gt; Success rates per domain, cost per request, error tracking, alerting when a target changes its structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parsing.&lt;/strong&gt; HTML extraction, JSON-LD parsing, schema validation. This part is usually the easy part.&lt;/p&gt;

&lt;p&gt;Each of these is a maintenance surface. Anti-bot systems update monthly. Proxies get burned and need replacement. Browser versions change and stealth patches break.&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get With a Scraping API
&lt;/h2&gt;

&lt;p&gt;You send a URL. You get back HTML, markdown, or structured data. The API provider handles everything listed above.&lt;/p&gt;

&lt;p&gt;The trade-off is control vs convenience. With a DIY stack, you can tune every parameter. With an API, you trade that control for not having to maintain anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Here is a realistic comparison for scraping 100K pages per month from a mix of easy and hard targets.&lt;/p&gt;

&lt;h3&gt;
  
  
  DIY Stack Costs
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Residential proxies (500 GB)&lt;/td&gt;
&lt;td&gt;$4,000-5,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Server (browser instances)&lt;/td&gt;
&lt;td&gt;$200-400&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CAPTCHA solving service&lt;/td&gt;
&lt;td&gt;$100-300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Your engineering time (10-20 hrs)&lt;/td&gt;
&lt;td&gt;$1,000-4,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$5,300-9,700&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That engineering time estimate is conservative. When something breaks at scale, debugging takes hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  API Service Costs
&lt;/h3&gt;

&lt;p&gt;Most scraping APIs charge per successful request, with pricing tiers based on difficulty.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Easy pages&lt;/th&gt;
&lt;th&gt;JS rendered&lt;/th&gt;
&lt;th&gt;Anti-bot bypass&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AlterLab&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;td&gt;$0.01-0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ScraperAPI&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;td&gt;$0.01-0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ScrapingBee&lt;/td&gt;
&lt;td&gt;$0.001&lt;/td&gt;
&lt;td&gt;$0.005&lt;/td&gt;
&lt;td&gt;$0.01-0.05&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bright Data (SERP API)&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.01&lt;/td&gt;
&lt;td&gt;$0.02-0.08&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For 100K pages (50% easy, 30% JS rendered, 20% hard):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easy: 50,000 x $0.001 = $50&lt;/li&gt;
&lt;li&gt;JS rendered: 30,000 x $0.005 = $150&lt;/li&gt;
&lt;li&gt;Anti-bot: 20,000 x $0.02 = $400&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Total: roughly $600/month&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
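&lt;p&gt;The arithmetic above is easy to reproduce and adapt to your own mix. Prices are the representative per-request figures from the breakdown, expressed in tenths of a cent so the sum stays exact:&lt;/p&gt;

```python
# Page mix for the 100K-page example: (pages, price in tenths of a cent)
MIX = {
    "easy":     (50_000, 1),    # $0.001 per request
    "rendered": (30_000, 5),    # $0.005 per request
    "anti_bot": (20_000, 20),   # $0.02 per request
}

total_usd = sum(pages * tenths for pages, tenths in MIX.values()) / 1000
print(total_usd)  # 600.0
```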

&lt;p&gt;That is roughly 10x cheaper than DIY for most teams, even if you value your engineering time at zero.&lt;/p&gt;

&lt;h2&gt;
  
  
  When DIY Makes Sense
&lt;/h2&gt;

&lt;p&gt;DIY scraping is the right call when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You scrape one or two simple sites.&lt;/strong&gt; If your targets do not have bot protection and the structure rarely changes, a simple requests + BeautifulSoup script is fine. No need to overcomplicate it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need sub-second latency.&lt;/strong&gt; Scraping APIs add network overhead. If you need to scrape and respond in real-time (like a price comparison tool), running your own infrastructure close to the target servers matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scraping is your core product.&lt;/strong&gt; If you are building a scraping company, you should own the infrastructure. You need that level of control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have an existing proxy investment.&lt;/strong&gt; If you already have residential proxy contracts, building on top of that makes sense. Some services like AlterLab let you bring your own proxies so you can use their anti-bot bypass without paying for proxy bandwidth twice.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use an API
&lt;/h2&gt;

&lt;p&gt;API services make sense when:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scraping is a means, not the end.&lt;/strong&gt; You are building an AI training pipeline, a price monitoring tool, a lead gen system. The scraping is a component, not the product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You scrape diverse targets.&lt;/strong&gt; Each site has different bot protection, rendering requirements, and anti-scraping measures. APIs handle the diversity so you do not have to build and maintain solutions for each one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your time is worth more than the API cost.&lt;/strong&gt; If you spend 20 hours per month maintaining scraping infrastructure, and the API costs $500 less than your hourly rate, the math is clear.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need to scale quickly.&lt;/strong&gt; Going from 10K to 1M pages means 10x more proxies, 10x more browser instances, 10x more monitoring. An API scales without any infrastructure changes on your end.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hybrid Approach
&lt;/h2&gt;

&lt;p&gt;The smart move for most teams is starting with an API and only building custom infrastructure for specific cases where you need it.&lt;/p&gt;

&lt;p&gt;Use an API for the 80% of targets that are standard. Build custom scrapers for the few targets where you need precise control, unusual interaction patterns, or real-time response.&lt;/p&gt;

&lt;p&gt;AlterLab is built for this pattern. Pay for what you use, no subscriptions, no minimum commitments. Light scrapes are cheap, JS rendering costs more, and anti-bot bypass scales with difficulty. If a request fails, you do not pay for it.&lt;/p&gt;

&lt;p&gt;The bottom line: unless scraping is your core business, the infrastructure is a distraction. Ship your product, not your proxy management dashboard.&lt;/p&gt;

</description>
      <category>restapi</category>
      <category>python</category>
    </item>
    <item>
      <title>Web Scraping Pipeline for RAG: Clean Data for LLMs</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Wed, 25 Mar 2026 07:48:31 +0000</pubDate>
      <link>https://dev.to/alterlab/web-scraping-pipeline-for-rag-clean-data-for-llms-4kje</link>
      <guid>https://dev.to/alterlab/web-scraping-pipeline-for-rag-clean-data-for-llms-4kje</guid>
      <description>&lt;h1&gt;
  
  
  Web Scraping Pipeline for RAG: Feed Clean Data into Your LLM Without Token Waste
&lt;/h1&gt;

&lt;p&gt;Raw HTML is poison for RAG. A typical news article page is 45,000 characters—roughly 11,000 tokens. The actual article is 800 words, or about 1,100 tokens. You are paying 10× to embed navigation menus, cookie banners, footer links, and inline scripts that actively dilute your embeddings and degrade retrieval quality.&lt;/p&gt;

&lt;p&gt;The fix is a five-stage pipeline: reliable fetch → content extraction → normalization → semantic chunking → embed and index. Each stage has a single responsibility. Each failure is isolated and debuggable. This post walks through a production implementation in Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Pipeline Architecture
&lt;/h2&gt;




&lt;h2&gt;
  
  
  Stage 1: Reliable Fetching
&lt;/h2&gt;

&lt;p&gt;The hardest part of scraping at scale is not parsing—it is getting the HTML. Bot detection blocks &lt;code&gt;requests&lt;/code&gt;. JavaScript-rendered SPAs return skeleton HTML to static fetches. IP ranges accumulate blocks.&lt;/p&gt;

&lt;p&gt;AlterLab's scraping API handles this in a single POST: rotating residential proxies, automatic CAPTCHA bypass, and optional headless rendering without managing a browser fleet yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="fetch.py" {8-19}&lt;/p&gt;

&lt;p&gt;ALTERLAB_API_KEY = "YOUR_API_KEY"&lt;br&gt;
ALTERLAB_BASE_URL = "&lt;a href="https://api.alterlab.io/v1" rel="noopener noreferrer"&gt;https://api.alterlab.io/v1&lt;/a&gt;"&lt;/p&gt;

&lt;p&gt;def fetch_page(url: str, render_js: bool = False) -&amp;gt; str:&lt;br&gt;
    """Fetch fully-rendered HTML from any URL."""&lt;br&gt;
    response = httpx.post(&lt;br&gt;
        f"{ALTERLAB_BASE_URL}/scrape",&lt;br&gt;
        headers={"X-API-Key": ALTERLAB_API_KEY, "Content-Type": "application/json"},&lt;br&gt;
        json={&lt;br&gt;
            "url": url,&lt;br&gt;
            "render_js": render_js,&lt;br&gt;
            "wait_for": "networkidle" if render_js else None,&lt;br&gt;
        },&lt;br&gt;
        timeout=30,&lt;br&gt;
    )&lt;br&gt;
    response.raise_for_status()&lt;br&gt;
    return response.json()["html"]&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


**cURL:**



```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/docs/getting-started",
    "render_js": true,
    "wait_for": "networkidle"
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Stage 2: Content Extraction
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;trafilatura&lt;/code&gt; is the most accurate open-source library for pulling article body text from HTML. It outperforms &lt;code&gt;readability-lxml&lt;/code&gt; and &lt;code&gt;newspaper3k&lt;/code&gt; on structured documentation and blog content because it uses both DOM heuristics and text-density scoring.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

import trafilatura
from trafilatura.settings import use_config

# Disable the per-document timeout and let your own retry logic own the clock
config = use_config()
config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")

def extract_content(html: str, url: str) -&amp;gt; dict:
    """
    Extract main content from HTML.
    Returns dict with keys: text, title, author, date, description.
    Raises ValueError if no content can be extracted.
    """
    result = trafilatura.extract(
        html,
        url=url,
        include_comments=False,
        include_tables=True,
        no_fallback=False,
        config=config,
        output_format="json",
        with_metadata=True,
    )

    if result is None:
        raise ValueError(f"Extraction returned no content for {url}")

    return json.loads(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Set &lt;code&gt;no_fallback=False&lt;/code&gt; to allow trafilatura to fall back to its secondary heuristic if the primary DOM analysis returns nothing, which is useful for pages with unconventional layouts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Normalization
&lt;/h2&gt;

&lt;p&gt;After extraction, text still contains artifacts: Unicode non-breaking spaces (&lt;code&gt;\u00a0&lt;/code&gt;), zero-width joiners, smart quotes, triple-newline runs from CMS templates, and stub lines that are purely punctuation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re
import unicodedata

def normalize_text(text: str) -&amp;gt; str:
    # Canonical Unicode form: normalize compatibility characters and ligatures
    text = unicodedata.normalize("NFKC", text)

    # Replace invisible/non-breaking whitespace variants
    text = re.sub(r"[\u00a0\u200b\u200c\u200d\ufeff]", " ", text)

    # Collapse horizontal whitespace, preserve single newlines
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)

    # Drop lines shorter than 4 chars (nav artifacts: "›", "|", "»")
    lines = [ln for ln in text.split("\n") if len(ln.strip()) &amp;gt; 3 or ln.strip() == ""]

    return "\n".join(lines).strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This pass runs in microseconds per document and prevents garbage tokens from reaching your embedding model.&lt;/p&gt;
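&lt;p&gt;A quick standalone check of the pass (the function is duplicated from &lt;code&gt;normalize.py&lt;/code&gt; above so the snippet runs on its own):&lt;/p&gt;

```python
import re
import unicodedata

def normalize_text(text: str) -> str:  # same logic as normalize.py above
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"[\u00a0\u200b\u200c\u200d\ufeff]", " ", text)
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    lines = [ln for ln in text.split("\n") if len(ln.strip()) > 3 or ln.strip() == ""]
    return "\n".join(lines).strip()

# Non-breaking space, a newline run, and a lone "›" nav artifact
raw = "Hello\u00a0world\n\n\n\n\u203a\nDone here now."
print(repr(normalize_text(raw)))  # 'Hello world\n\nDone here now.'
```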

&lt;h2&gt;
  
  
  Stage 4: Chunking Strategy
&lt;/h2&gt;

&lt;p&gt;Three mistakes that kill retrieval quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Fixed character splits&lt;/strong&gt; break sentences mid-clause. The embedding for a sentence fragment does not represent a complete thought.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Whole documents as single vectors&lt;/strong&gt; average all content into one point in embedding space. Specific queries retrieve nothing useful.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero overlap&lt;/strong&gt; means a concept bridging two chunks never matches a query that references it as a unit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use recursive sentence-aware chunking with configurable overlap:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from __future__ import annotations

import re
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    url: str
    chunk_index: int
    total_chunks: int
    metadata: dict = field(default_factory=dict)

def split_sentences(text: str) -&amp;gt; list[str]:
    """Sentence-boundary split on terminal punctuation followed by uppercase."""
    return re.split(r"(?&amp;lt;=[.!?])\s+(?=[A-Z\"])", text)

def chunk_document(
    text: str,
    url: str,
    max_tokens: int = 400,
    overlap_sentences: int = 2,
    chars_per_token: float = 4.0,
) -&amp;gt; list[Chunk]:
    """
    Split text into token-bounded chunks with sentence-level overlap.

    Args:
        max_tokens: Approximate token ceiling per chunk.
        overlap_sentences: Sentences carried over to the next chunk.
        chars_per_token: Heuristic for English prose (4.0 is reliable).
    """
    max_chars = int(max_tokens * chars_per_token)
    sentences = split_sentences(text)

    raw_chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for sentence in sentences:
        slen = len(sentence)
        if current_len + slen &amp;gt; max_chars and current:
            raw_chunks.append(" ".join(current))
            current = current[-overlap_sentences:]
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += slen

    if current:
        raw_chunks.append(" ".join(current))

    total = len(raw_chunks)
    return [
        Chunk(text=t, url=url, chunk_index=i, total_chunks=total)
        for i, t in enumerate(raw_chunks)
    ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Token ceiling guidelines by model:&lt;/strong&gt;&lt;/p&gt;


  &lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Model&lt;/th&gt;
        &lt;th&gt;Recommended max_tokens&lt;/th&gt;
        &lt;th&gt;Overlap Sentences&lt;/th&gt;
        &lt;th&gt;Notes&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;text-embedding-3-small&lt;/td&gt;
        &lt;td&gt;400&lt;/td&gt;
        &lt;td&gt;2&lt;/td&gt;
        &lt;td&gt;Good default for mixed content&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;text-embedding-3-large&lt;/td&gt;
        &lt;td&gt;600&lt;/td&gt;
        &lt;td&gt;2&lt;/td&gt;
        &lt;td&gt;Better for long-form technical docs&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;nomic-embed-text&lt;/td&gt;
        &lt;td&gt;512&lt;/td&gt;
        &lt;td&gt;3&lt;/td&gt;
        &lt;td&gt;Open-source; strong on code + prose&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;BGE-M3&lt;/td&gt;
        &lt;td&gt;800&lt;/td&gt;
        &lt;td&gt;2&lt;/td&gt;
        &lt;td&gt;Multilingual; 8192-token context&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 5: Embedding and Indexing
&lt;/h2&gt;

&lt;p&gt;Batch your embedding calls. The OpenAI embeddings API accepts up to 2,048 inputs per request—sending one chunk per call is 100× slower and burns rate limit quota unnecessarily.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="embed.py" {27-48}&lt;/p&gt;

&lt;p&gt;from openai import AsyncOpenAI&lt;/p&gt;

&lt;p&gt;openai_client = AsyncOpenAI()&lt;/p&gt;

&lt;p&gt;def init_index(api_key: str, index_name: str) -&amp;gt; pinecone.Index:&lt;br&gt;
    pc = pinecone.Pinecone(api_key=api_key)&lt;br&gt;
    return pc.Index(index_name)&lt;/p&gt;

&lt;p&gt;async def embed_texts(texts: list[str]) -&amp;gt; list[list[float]]:&lt;br&gt;
    """Batch embed up to 2048 texts in a single API call."""&lt;br&gt;
    response = await openai_client.embeddings.create(&lt;br&gt;
        model="text-embedding-3-small",&lt;br&gt;
        input=texts,&lt;br&gt;
        encoding_format="float",&lt;br&gt;
    )&lt;br&gt;
    return [item.embedding for item in response.data]&lt;/p&gt;

&lt;p&gt;async def index_chunks(&lt;br&gt;
    chunks: list["Chunk"],&lt;br&gt;
    index: pinecone.Index,&lt;br&gt;
    batch_size: int = 100,&lt;br&gt;
) -&amp;gt; None:&lt;br&gt;
    """Embed and upsert chunks into Pinecone with source metadata preserved."""&lt;br&gt;
    for i in range(0, len(chunks), batch_size):&lt;br&gt;
        batch = chunks[i : i + batch_size]&lt;br&gt;
        vectors = await embed_texts([c.text for c in batch])&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    upserts = [
        {
            "id": f"{c.url}::{c.chunk_index}",
            "values": vectors[j],
            "metadata": {
                "url": c.url,
                "chunk_index": c.chunk_index,
                "total_chunks": c.total_chunks,
                "text": c.text,  # store inline—avoids a separate fetch at query time
            },
        }
        for j, c in enumerate(batch)
    ]

    index.upsert(vectors=upserts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


Store `text` in the vector metadata. Fetching the source document at query time adds latency and a failure point; paying a few extra bytes per vector is worth it.

---

## Full Pipeline



```python title="pipeline.py" {11-55}

from fetch import fetch_page
from extract import extract_content
from normalize import normalize_text
from chunker import Chunk, chunk_document
from embed import init_index, index_chunks

PINECONE_API_KEY = "YOUR_PINECONE_KEY"
PINECONE_INDEX = "rag-knowledge-base"

async def ingest_url(url: str, render_js: bool = False) -&amp;gt; dict:
    """
    End-to-end pipeline: URL → indexed, retrievable chunks.
    Returns a summary dict for logging and monitoring.
    """
    # Stage 1: Fetch
    html = fetch_page(url, render_js=render_js)

    # Stage 2: Extract
    extracted = extract_content(html, url)
    raw_text = extracted.get("text", "")
    title = extracted.get("title", "untitled")

    if not raw_text:
        return {"url": url, "status": "no_content", "chunks": 0}

    # Stage 3: Normalize
    clean_text = normalize_text(raw_text)

    # Stage 4: Chunk
    chunks = chunk_document(
        text=clean_text,
        url=url,
        max_tokens=400,
        overlap_sentences=2,
    )

    # Filter degenerate chunks before embedding
    chunks = [c for c in chunks if len(c.text.split()) &amp;gt;= 15]

    # Stage 5: Embed + index
    index = init_index(PINECONE_API_KEY, PINECONE_INDEX)
    await index_chunks(chunks, index)

    return {
        "url": url,
        "title": title,
        "status": "indexed",
        "chunks": len(chunks),
        "approx_tokens": sum(len(c.text) // 4 for c in chunks),
    }

async def ingest_batch(urls: list[str], concurrency: int = 5) -&amp;gt; list[dict]:
    """Ingest multiple URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -&amp;gt; dict:
        async with semaphore:
            try:
                return await ingest_url(url)
            except Exception as e:
                return {"url": url, "status": "error", "error": str(e)}

    return await asyncio.gather(*[bounded(u) for u in urls])

if __name__ == "__main__":
    urls = [
        "https://docs.python.org/3/library/asyncio-task.html",
        "https://platform.openai.com/docs/guides/embeddings",
        "https://www.pinecone.io/docs/upsert-data/",
    ]
    results = asyncio.run(ingest_batch(urls, concurrency=3))
    for r in results:
        print(r)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Handling Edge Cases
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Deduplication
&lt;/h3&gt;

&lt;p&gt;The same content appears under multiple URLs: &lt;code&gt;www&lt;/code&gt; vs. bare domain, query parameters, pagination suffixes. Hash normalized text before indexing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import hashlib

_seen_hashes: set[str] = set()

def is_duplicate(text: str) -&amp;gt; bool:
    """Return True if this exact content has already been indexed this run."""
    digest = hashlib.sha256(text.encode()).hexdigest()[:16]
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Call &lt;code&gt;is_duplicate(clean_text)&lt;/code&gt; after Stage 3 and skip to the next URL if it returns &lt;code&gt;True&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pagination and Crawling
&lt;/h3&gt;

&lt;p&gt;For documentation sites spanning dozens of pages, discover internal links before ingesting. A simple same-domain BFS over &lt;code&gt;&amp;lt;a href&amp;gt;&lt;/code&gt; tags prevents you from missing chapters or API reference sections. Keep a visited-URL set to avoid cycles.&lt;/p&gt;

&lt;h3&gt;
  
  
  Retries with Backoff
&lt;/h3&gt;

&lt;p&gt;Your embedding API has rate limits even when your scraper does not. Wrap async calls in exponential backoff:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
from typing import TypeVar, Callable, Awaitable

T = TypeVar("T")

async def with_retry(fn: Callable[[], Awaitable[T]], attempts: int = 3) -&amp;gt; T:
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(min(2 ** i, 30))
    raise RuntimeError("unreachable")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wrap &lt;code&gt;index_chunks&lt;/code&gt; calls: &lt;code&gt;await with_retry(lambda: index_chunks(chunks, index))&lt;/code&gt;.&lt;/p&gt;
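&lt;p&gt;The same-domain BFS described under Pagination and Crawling can be sketched roughly as below. The helper name &lt;code&gt;crawl_order&lt;/code&gt;, the naive &lt;code&gt;href&lt;/code&gt; regex, and the synchronous &lt;code&gt;fetch&lt;/code&gt; callable are illustrative assumptions, not part of the pipeline above; a real crawler would use an HTML parser and your async fetch layer:&lt;/p&gt;

```python
import re
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse

# Naive href extraction for the sketch; a production crawler
# should parse HTML properly rather than regex over it.
HREF_RE = re.compile(r'href="([^"#]+)"')

def crawl_order(start_url: str, fetch, max_pages: int = 50) -> list[str]:
    """Same-domain BFS: return URLs in the order they should be ingested.

    `fetch` is any callable that returns raw HTML for a URL (hypothetical
    stand-in for your scraping layer).
    """
    domain = urlparse(start_url).netloc
    visited: set[str] = set()
    queue = deque([start_url])
    order: list[str] = []
    while queue:
        if len(order) >= max_pages:
            break
        url, _fragment = urldefrag(queue.popleft())
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        for href in HREF_RE.findall(fetch(url)):
            absolute = urljoin(url, href)
            # Stay on the same domain and avoid cycles via the visited set.
            if urlparse(absolute).netloc == domain and absolute not in visited:
                queue.append(absolute)
    return order
```

&lt;p&gt;The resulting list can then be fed straight into &lt;code&gt;ingest_batch&lt;/code&gt;.&lt;/p&gt;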




&lt;h2&gt;
  
  
  Production Checklist
&lt;/h2&gt;

&lt;p&gt;Before running this at scale, verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Freshness TTL&lt;/strong&gt;: Set an expiry on indexed documents. Re-scrape on a schedule. Stale RAG context is worse than no context—your LLM will confidently cite outdated information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum chunk length&lt;/strong&gt;: Filter out chunks with fewer than 15 words. Stubs from tables or code snippets without context are noise at query time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metadata completeness&lt;/strong&gt;: Always store &lt;code&gt;scraped_at&lt;/code&gt;, &lt;code&gt;source_url&lt;/code&gt;, and &lt;code&gt;section_title&lt;/code&gt; in vector metadata. Your LLM needs these to generate citations users can verify.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extraction failure rate&lt;/strong&gt;: Monitor the share of URLs returning &lt;code&gt;no_content&lt;/code&gt;. Above 5% means your source sites have unusual structure and need custom extraction rules.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Concurrency limits&lt;/strong&gt;: Do not set &lt;code&gt;concurrency&lt;/code&gt; above what your scraping tier supports. Queue excess work with Redis or a task runner rather than hammering with retries.&lt;/li&gt;
&lt;/ul&gt;
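&lt;p&gt;To make the failure-rate check concrete, the status dicts returned by &lt;code&gt;ingest_batch&lt;/code&gt; can be tallied directly. This is a minimal sketch; &lt;code&gt;summarize_run&lt;/code&gt; is a hypothetical helper name, and the threshold constant encodes the 5% figure from the checklist:&lt;/p&gt;

```python
from collections import Counter

FAILURE_THRESHOLD = 0.05  # 5% no_content rate, per the checklist

def summarize_run(results: list[dict]) -> dict:
    """Tally ingest_batch results and flag a high extraction-failure rate."""
    counts = Counter(r["status"] for r in results)
    total = len(results) or 1  # guard against an empty run
    no_content_rate = counts.get("no_content", 0) / total
    return {
        "indexed": counts.get("indexed", 0),
        "no_content": counts.get("no_content", 0),
        "errors": counts.get("error", 0),
        "no_content_rate": round(no_content_rate, 3),
        "needs_custom_extraction": no_content_rate > FAILURE_THRESHOLD,
    }
```

&lt;p&gt;Run this after every batch and alert on &lt;code&gt;needs_custom_extraction&lt;/code&gt; rather than eyeballing logs.&lt;/p&gt;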




&lt;h2&gt;
  
  
  Takeaway
&lt;/h2&gt;

&lt;p&gt;A five-stage pipeline—fetch, extract, normalize, chunk, embed—is not over-engineering. It is the minimum required to produce input that a retrieval system can actually use.&lt;/p&gt;

&lt;p&gt;Token waste is a symptom, not the root problem. The root problem is that HTML is a rendering format, not a content format. Every step in this pipeline exists to close that gap.&lt;/p&gt;

&lt;p&gt;The fetching layer is where most teams cut corners and regret it. Flaky HTML from failed bot-bypass attempts or unrendered JS propagates bad data through every downstream stage. Eliminating that variable with a dedicated scraping API means your engineering time goes where it compounds: extraction heuristics, chunking strategy, and retrieval evaluation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datapipelines</category>
      <category>api</category>
      <category>python</category>
    </item>
  </channel>
</rss>
