<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AlterLab</title>
    <description>The latest articles on DEV Community by AlterLab (@alterlab).</description>
    <link>https://dev.to/alterlab</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3842661%2F6ea3b67f-3a2b-423f-b726-51041ab344e6.png</url>
      <title>DEV Community: AlterLab</title>
      <link>https://dev.to/alterlab</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alterlab"/>
    <language>en</language>
    <item>
      <title>Etsy Data API: Extract Structured JSON in 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:19:25 +0000</pubDate>
      <link>https://dev.to/alterlab/etsy-data-api-extract-structured-json-in-2026-2c21</link>
      <guid>https://dev.to/alterlab/etsy-data-api-extract-structured-json-in-2026-2c21</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To get structured Etsy data via API, pass a public listing URL and a strictly defined JSON schema to the AlterLab Extract API. The platform handles browser rendering and proxy routing automatically, returning validated, typed JSON fields like price, title, and availability without requiring fragile CSS selectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Resilient E-Commerce Data API
&lt;/h2&gt;

&lt;p&gt;Extracting structured data from modern e-commerce platforms requires navigating frequent DOM changes and complex front-end frameworks. Writing custom HTML parsers with tools like BeautifulSoup or Cheerio works for static sites but breaks immediately when class names change or content is rendered client-side. &lt;/p&gt;

&lt;p&gt;If you need a reliable Etsy data API for your applications, the architecture must decouple the target page structure from your data requirements. We will build a pipeline that treats public Etsy listings as a programmatic data source. By defining the exact shape of the data we want via a JSON schema, we can offload the visual and structural parsing to AI models. Before proceeding, set up your API keys by following the &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Extract Etsy Data?
&lt;/h2&gt;

&lt;p&gt;Engineers and data scientists extract public Etsy data for several distinct use cases.&lt;/p&gt;

&lt;p&gt;First, market intelligence platforms track pricing trends and product availability across specific vintage or handmade categories. Analyzing price fluctuations helps sellers price their own inventory competitively. &lt;/p&gt;

&lt;p&gt;Second, AI researchers compile specialized training datasets. Product descriptions on handmade items often contain unique, highly descriptive text suitable for fine-tuning domain-specific language models.&lt;/p&gt;

&lt;p&gt;Third, supply chain analysts monitor inventory levels and availability statuses across high-volume shops to forecast demand in niche markets. All of these pipelines require reliable, structured Etsy data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Data Can You Extract?
&lt;/h2&gt;

&lt;p&gt;Focusing purely on publicly accessible information, a typical e-commerce listing contains highly structured data points masquerading as unstructured visual elements. &lt;/p&gt;

&lt;p&gt;You can extract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Title&lt;/strong&gt;: The exact product name as listed by the seller.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Price&lt;/strong&gt;: The numeric value of the item.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Currency&lt;/strong&gt;: The currency code (USD, EUR, GBP) to normalize pricing data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;SKU or Listing ID&lt;/strong&gt;: Unique identifiers for tracking items across time.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Availability&lt;/strong&gt;: Stock status, often represented as "In Stock" or a specific remaining quantity.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Rating&lt;/strong&gt;: The aggregated review score for the product or seller.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Extraction Approach: Schema over Selectors
&lt;/h2&gt;

&lt;p&gt;The traditional web scraping approach is inherently fragile. You send an HTTP GET request, download the raw HTML, and run XPath or CSS selectors against the DOM. If the target site ships an update that changes &lt;code&gt;&amp;lt;div class="price-text-123"&amp;gt;&lt;/code&gt; to &lt;code&gt;&amp;lt;span class="product-cost-abc"&amp;gt;&lt;/code&gt;, your pipeline fails silently or throws errors. &lt;/p&gt;

&lt;p&gt;An AI-driven data extraction API flips this paradigm. You do not tell the API &lt;em&gt;how&lt;/em&gt; to find the data. You tell the API &lt;em&gt;what&lt;/em&gt; data you need. &lt;/p&gt;

&lt;p&gt;By using an LLM to interpret the rendered page visually and contextually, the extraction can remain stable even when the underlying HTML changes. This makes the approach significantly more resilient for maintaining an e-commerce data API.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Start with AlterLab Extract API
&lt;/h2&gt;

&lt;p&gt;To begin extracting Etsy data as JSON, we will use the AlterLab Extract endpoint. You can interact with this API using raw HTTP requests or the official Python SDK. For complete endpoint parameters, reference the &lt;a href="https://dev.to/docs/api/extract"&gt;Extract API docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here is how you execute a request using standard command line tools.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```bash title="Terminal"&lt;br&gt;
curl -X POST &lt;a href="https://api.alterlab.io/v1/extract" rel="noopener noreferrer"&gt;https://api.alterlab.io/v1/extract&lt;/a&gt; \&lt;br&gt;
  -H "X-API-Key: YOUR_API_KEY" \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d '{&lt;br&gt;
    "url": "&lt;a href="https://etsy.com/listing/123456789/example-vintage-item" rel="noopener noreferrer"&gt;https://etsy.com/listing/123456789/example-vintage-item&lt;/a&gt;",&lt;br&gt;
    "schema": {&lt;br&gt;
      "type": "object",&lt;br&gt;
      "properties": {&lt;br&gt;
        "title": {"type": "string"}, &lt;br&gt;
        "price": {"type": "string"}, &lt;br&gt;
        "currency": {"type": "string"}&lt;br&gt;
      }&lt;br&gt;
    }&lt;br&gt;
  }'&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


For production applications, the Python SDK provides better error handling and type checking. Install it via pip, initialize the client, and define your schema.



```python title="extract_etsy-com.py" {5-22}

import json

import alterlab

client = alterlab.Client("YOUR_API_KEY")

schema = {
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "The exact product title"
    },
    "price": {
      "type": "number",
      "description": "The numeric price value without currency symbols"
    },
    "currency": {
      "type": "string",
      "description": "The 3-letter currency code, e.g. USD"
    },
    "sku": {
      "type": "string",
      "description": "The unique listing identifier"
    },
    "availability": {
      "type": "boolean",
      "description": "True if in stock, false if sold out"
    },
    "rating": {
      "type": "number",
      "description": "The 5-star rating value, e.g. 4.8"
    }
  },
  "required": ["title", "price", "currency"]
}

try:
    result = client.extract(
        url="https://etsy.com/listing/123456789/example-vintage-item",
        schema=schema,
    )
    print(json.dumps(result.data, indent=2))
except Exception as e:
    print(f"Extraction failed: {e}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice the use of the &lt;code&gt;description&lt;/code&gt; field in the schema. Because AlterLab relies on LLMs for extraction, providing clear semantic descriptions improves the accuracy of the output. If a price is embedded in a complex string, defining the type as &lt;code&gt;number&lt;/code&gt; and instructing it to exclude currency symbols ensures clean, database-ready data.&lt;/p&gt;
&lt;h2&gt;
  
  
  Extracting Nested Variations
&lt;/h2&gt;

&lt;p&gt;Many e-commerce listings contain variations, such as different sizes, colors, or materials, each with potentially different prices or stock levels. Your Python extraction scripts can handle this by defining nested arrays within your JSON schema.&lt;/p&gt;

&lt;p&gt;Expand your schema to include an &lt;code&gt;options&lt;/code&gt; array whose &lt;code&gt;items&lt;/code&gt; describe each variation.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="schema_variations.py" {3-17}&lt;br&gt;
variations_schema = {&lt;br&gt;
  "type": "object",&lt;br&gt;
  "properties": {&lt;br&gt;
    "title": {"type": "string"},&lt;br&gt;
    "options": {&lt;br&gt;
      "type": "array",&lt;br&gt;
      "description": "A list of all available product variations",&lt;br&gt;
      "items": {&lt;br&gt;
        "type": "object",&lt;br&gt;
        "properties": {&lt;br&gt;
          "name": {"type": "string", "description": "Name of the option, e.g. Large, Red"},&lt;br&gt;
          "price_modifier": {"type": "number", "description": "Additional cost for this option"}&lt;br&gt;
        }&lt;br&gt;
      }&lt;br&gt;
    }&lt;br&gt;
  }&lt;br&gt;
}&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The Extract API will iterate through the available options on the page and return an array of objects matching this exact structure.

## Handle Pagination and Scale

Extracting a single listing is trivial. Building a full pipeline requires handling scale. When pulling data from hundreds of listings, you must manage concurrency limits and rate limiting. 

For high-volume operations, you need an asynchronous approach. Standard sequential requests will block your application and waste resources. Python's `asyncio` combined with an async client allows you to process multiple URLs concurrently. 

Before scaling your infrastructure, calculate your expected volume and review [AlterLab pricing](/pricing) to optimize your extraction batch sizes.



```python title="batch_extract.py" {16-24}

import asyncio
from typing import List

import alterlab

async def fetch_listing_data(client: alterlab.AsyncClient, url: str, schema: dict) -&amp;gt; dict:
    try:
        result = await client.extract(
            url=url,
            schema=schema
        )
        return {"url": url, "data": result.data, "status": "success"}
    except Exception as e:
        return {"url": url, "error": str(e), "status": "failed"}

async def process_batch(urls: List[str], schema: dict):
    client = alterlab.AsyncClient("YOUR_API_KEY")

    # Create a list of tasks for concurrent execution
    tasks = [fetch_listing_data(client, url, schema) for url in urls]

    # Gather results; in production, cap concurrency (for example with a semaphore)
    results = await asyncio.gather(*tasks)

    for res in results:
        if res["status"] == "success":
            print(f"Extracted {res['data'].get('title')} from {res['url']}")
        else:
            print(f"Failed {res['url']}: {res['error']}")

if __name__ == "__main__":
    target_urls = [
        "https://etsy.com/listing/111/example-a",
        "https://etsy.com/listing/222/example-b",
        "https://etsy.com/listing/333/example-c"
    ]

    # Define your schema here
    schema = {"type": "object", "properties": {"title": {"type": "string"}}}

    asyncio.run(process_batch(target_urls, schema))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;When building asynchronous scrapers, implement bounded semaphores to avoid overwhelming your own memory or hitting API rate limits too aggressively. A solid pattern involves chunking URLs into batches of 50 or 100, executing the batch, and writing the structured JSON directly to cloud storage or a message queue like Kafka or RabbitMQ.&lt;/p&gt;
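&lt;p&gt;The bounded-semaphore and chunking pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the SDK's own batching mechanism: &lt;code&gt;fake_extract&lt;/code&gt;, &lt;code&gt;bounded_extract&lt;/code&gt;, &lt;code&gt;chunked&lt;/code&gt;, and &lt;code&gt;run_in_batches&lt;/code&gt; are hypothetical names, and &lt;code&gt;fake_extract&lt;/code&gt; stands in for whatever async extract call you actually use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
import asyncio


async def fake_extract(url):
    """Stand-in for an async extract call; swap in your real client call."""
    await asyncio.sleep(0.01)
    return {"url": url, "status": "success"}


async def bounded_extract(semaphore, url):
    # The semaphore caps how many extractions are in flight at once.
    async with semaphore:
        return await fake_extract(url)


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def run_in_batches(urls, batch_size=50, max_concurrency=10):
    semaphore = asyncio.Semaphore(max_concurrency)
    results = []
    for batch in chunked(urls, batch_size):
        tasks = [bounded_extract(semaphore, url) for url in batch]
        # Persist each batch here (queue, object storage) before continuing.
        results.extend(await asyncio.gather(*tasks))
    return results


if __name__ == "__main__":
    demo_urls = [f"https://etsy.com/listing/{n}/example" for n in range(1, 8)]
    output = asyncio.run(run_in_batches(demo_urls, batch_size=3, max_concurrency=2))
    print(len(output), "results")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Processing one batch at a time keeps memory bounded by the batch size, while the semaphore limits simultaneous requests within each batch. The values 50 and 10 are placeholders to tune against your own rate limits.&lt;/p&gt;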
&lt;h2&gt;
  
  
  Data Validation and Error Handling
&lt;/h2&gt;

&lt;p&gt;One major advantage of schema-driven extraction is inherent validation. If the target page is taken down (e.g., yielding a 404 error) or if the seller completely removes the price, the API will fail to fulfill required fields in your schema.&lt;/p&gt;

&lt;p&gt;Always utilize the &lt;code&gt;required&lt;/code&gt; array in your JSON schema.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"price"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the API cannot find a valid number to satisfy the &lt;code&gt;price&lt;/code&gt; field, it will throw a validation error rather than returning dirty data. Your pipeline can catch this exception, log the URL as problematic, and proceed to the next item. This prevents null values from corrupting your downstream analytics databases.&lt;/p&gt;
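&lt;p&gt;The catch, log, and continue flow can also be mirrored locally as a defensive second check before records reach your database. The sketch below uses only the standard library and covers just the &lt;code&gt;required&lt;/code&gt; and &lt;code&gt;type&lt;/code&gt; keywords. &lt;code&gt;validate_record&lt;/code&gt; and &lt;code&gt;keep_valid&lt;/code&gt; are hypothetical helpers, not part of the AlterLab SDK, and the sample records are invented.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
import logging

logger = logging.getLogger("pipeline")

# Maps JSON Schema type names to Python types (subset used in this guide).
TYPE_MAP = {"string": str, "number": (int, float), "boolean": bool}


def validate_record(record, schema):
    """Raise ValueError if a required field is missing or has the wrong type."""
    for field in schema.get("required", []):
        if record.get(field) is None:
            raise ValueError(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        value = record.get(field)
        expected = TYPE_MAP.get(spec.get("type"))
        if value is None or expected is None:
            continue
        # bool is a subclass of int, so reject it explicitly for numbers.
        if spec["type"] == "number" and isinstance(value, bool):
            raise ValueError(f"wrong type for field: {field}")
        if not isinstance(value, expected):
            raise ValueError(f"wrong type for field: {field}")


def keep_valid(records, schema):
    """Return valid records and log the URL of each rejected one."""
    clean = []
    for url, record in records:
        try:
            validate_record(record, schema)
        except ValueError as exc:
            logger.warning("Skipping %s: %s", url, exc)
            continue
        clean.append(record)
    return clean


schema = {
    "type": "object",
    "properties": {"price": {"type": "number"}, "title": {"type": "string"}},
    "required": ["price"],
}
records = [
    ("https://etsy.com/listing/111/example-a", {"price": 12.5, "title": "A"}),
    ("https://etsy.com/listing/222/example-b", {"title": "B"}),
]
print(keep_valid(records, schema))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For full JSON Schema coverage, a dedicated validation library is a better fit than this hand-rolled subset.&lt;/p&gt;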

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Building a reliable pipeline for public e-commerce data does not require maintaining complex parsing libraries or constantly updating selectors. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Use JSON schemas to define the exact shape and data types required for your application.&lt;/li&gt;
&lt;li&gt;  Leverage AI-driven extraction to bypass the fragility of DOM-based scraping.&lt;/li&gt;
&lt;li&gt;  Implement asynchronous batch processing to efficiently scale your data gathering operations.&lt;/li&gt;
&lt;li&gt;  Enforce strict type checking and require crucial fields to ensure clean data enters your database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By treating the web as a structured data API, you can focus on building intelligence and analytics tools rather than constantly repairing broken scrapers.&lt;/p&gt;

</description>
      <category>dataextraction</category>
      <category>python</category>
      <category>datapipelines</category>
      <category>ecommerce</category>
    </item>
    <item>
      <title>TikTok Data API: Extract Structured JSON in 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 19:19:24 +0000</pubDate>
      <link>https://dev.to/alterlab/tiktok-data-api-extract-structured-json-in-2026-3l62</link>
      <guid>https://dev.to/alterlab/tiktok-data-api-extract-structured-json-in-2026-3l62</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To get structured TikTok data via API, define a JSON schema matching the public fields you need and send it to an extraction endpoint alongside the target URL. The API handles network routing and page rendering, returning validated JSON rather than raw HTML. This approach provides a reliable TikTok data pipeline without manual DOM parsing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Building a TikTok data extraction script in Python usually starts with reverse-engineering network requests and ends with brittle regex parsing. You can skip manual DOM parsing entirely by treating the platform as a structured data source.&lt;/p&gt;

&lt;p&gt;This guide details how to build a resilient data pipeline that extracts public information from TikTok profiles and posts. We focus on retrieving typed, structured JSON directly from URLs. If you are setting up your local environment first, see our &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use TikTok data?
&lt;/h2&gt;

&lt;p&gt;Engineers typically pull social media metrics for three core applications. The requirement is consistent across all three: the data must be structured, accurate, and delivered reliably.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Training Pipelines&lt;/strong&gt;&lt;br&gt;
Large language models require natural language datasets. Extracting public video captions, structured hashtags, and public comments provides high-signal training data for sentiment analysis and trend prediction models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analytics Dashboards&lt;/strong&gt;&lt;br&gt;
Data engineers build automated pipelines to track account growth, engagement rates, and content velocity across specific public profiles. This requires precise, scheduled extraction of numerical metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trend Identification&lt;/strong&gt;&lt;br&gt;
Mapping hashtag volume and audio usage helps identify emerging viral patterns. This involves scanning public search results and mapping video metadata to track how specific concepts spread across the platform.&lt;/p&gt;


  
  
  

&lt;h2&gt;
  
  
  What data can you extract?
&lt;/h2&gt;

&lt;p&gt;When building an extraction pipeline, focus exclusively on publicly accessible information visible to unauthenticated users. The goal is to map visual page elements to strict data types. Core fields include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profile details – &lt;code&gt;username&lt;/code&gt;, &lt;code&gt;bio&lt;/code&gt;, &lt;code&gt;verified&lt;/code&gt; status.&lt;/li&gt;
&lt;li&gt;Metrics – &lt;code&gt;followers&lt;/code&gt;, &lt;code&gt;following&lt;/code&gt;, &lt;code&gt;likes&lt;/code&gt;, &lt;code&gt;post_count&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Content metadata – Video descriptions, hashtags, upload timestamps, public view counts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A major challenge with raw social data is formatting. A follower count might display visually as "1.2M". Your pipeline needs the integer &lt;code&gt;1200000&lt;/code&gt;. By defining strict JSON schemas, you force the extraction layer to coerce these visual strings into usable database types.&lt;/p&gt;
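&lt;p&gt;If a count still reaches your pipeline as a display string, a small local normalizer can act as a fallback. The sketch below is illustrative only: &lt;code&gt;parse_count&lt;/code&gt; is a hypothetical helper, not part of any SDK, and it assumes English-style K, M, and B suffixes.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
```python
from decimal import Decimal

# Display suffixes commonly used for abbreviated counts.
MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert a display count such as '1.2M' or '12,345' to an int."""
    cleaned = text.strip().replace(",", "").upper()
    if not cleaned:
        raise ValueError("empty count")
    multiplier = MULTIPLIERS.get(cleaned[-1], 1)
    digits = cleaned[:-1] if cleaned[-1] in MULTIPLIERS else cleaned
    # Decimal avoids float rounding surprises (e.g. 1.2 * 1_000_000).
    return int(Decimal(digits) * multiplier)


for raw in ["1.2M", "845", "12,345", "3.4k"]:
    print(raw, parse_count(raw))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Abbreviated counts are rounded for display, so the recovered integer is an approximation of the true value.&lt;/p&gt;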
&lt;h2&gt;
  
  
  The extraction approach
&lt;/h2&gt;

&lt;p&gt;Raw HTTP requests to TikTok return heavily obfuscated HTML and complex JavaScript payloads. Writing CSS selectors for this DOM structure is a maintenance trap. The platform rotates class names constantly. &lt;/p&gt;

&lt;p&gt;Traditional scraping requires managing headless browser infrastructure. You have to handle TLS fingerprinting, bypass initial captchas, wait for React hydration, and parse internal state variables. This consumes significant engineering resources.&lt;/p&gt;

&lt;p&gt;Using a dedicated structured-data service for TikTok shifts the complexity. Instead of managing Chromium instances and parsing script tags, you declare the desired output structure. The extraction layer handles the execution environment. It loads the page, resolves the JavaScript, and maps the visual page data directly to your schema. This decoupling makes your pipeline far more resilient to UI layout changes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick start with AlterLab Extract API
&lt;/h2&gt;

&lt;p&gt;To implement this pattern, we use the Extract endpoint, documented in the &lt;a href="https://dev.to/docs/api/extract"&gt;Extract API docs&lt;/a&gt;. It abstracts the network routing, browser rendering, and AI extraction phases into a single POST request.&lt;/p&gt;

&lt;p&gt;Below is the implementation for a basic profile extraction. We define a schema for the exact fields we need.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="extract_tiktok-com.py" {5-12}&lt;/p&gt;

&lt;p&gt;import alterlab&lt;/p&gt;

&lt;p&gt;client = alterlab.Client("YOUR_API_KEY")&lt;/p&gt;

&lt;p&gt;schema = {&lt;br&gt;
  "type": "object",&lt;br&gt;
  "properties": {&lt;br&gt;
    "username": {&lt;br&gt;
      "type": "string",&lt;br&gt;
      "description": "The username field"&lt;br&gt;
    },&lt;br&gt;
    "followers": {&lt;br&gt;
      "type": "string",&lt;br&gt;
      "description": "The followers field"&lt;br&gt;
    },&lt;br&gt;
    "bio": {&lt;br&gt;
      "type": "string",&lt;br&gt;
      "description": "The bio field"&lt;br&gt;
    },&lt;br&gt;
    "post_count": {&lt;br&gt;
      "type": "string",&lt;br&gt;
      "description": "The post count field"&lt;br&gt;
    },&lt;br&gt;
    "verified": {&lt;br&gt;
      "type": "string",&lt;br&gt;
      "description": "The verified field"&lt;br&gt;
    }&lt;br&gt;
  }&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;result = client.extract(&lt;br&gt;
    url="&lt;a href="https://tiktok.com/@tiktok" rel="noopener noreferrer"&gt;https://tiktok.com/@tiktok&lt;/a&gt;",&lt;br&gt;
    schema=schema,&lt;br&gt;
)&lt;br&gt;
print(result.data)&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


You can run a trimmed-down version of the same extraction using cURL. This is useful for testing schemas before integrating them into your application code.



```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://tiktok.com/@tiktok",
    "schema": {"type": "object", "properties": {"username": {"type": "string"}, "followers": {"type": "string"}, "bio": {"type": "string"}}}
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Define your schema
&lt;/h2&gt;

&lt;p&gt;The JSON schema acts as both the validation layer and the extraction instruction. The model reads the visual page and maps the data to your requested structure.&lt;/p&gt;

&lt;p&gt;You are not limited to flat objects. You can extract arrays of items. If you need a list of recent videos from a profile, you define an array schema.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="extract_videos.py" {7-11}&lt;br&gt;
video_schema = {&lt;br&gt;
  "type": "object",&lt;br&gt;
  "properties": {&lt;br&gt;
    "recent_videos": {&lt;br&gt;
      "type": "array",&lt;br&gt;
      "items": {&lt;br&gt;
        "type": "object",&lt;br&gt;
        "properties": {&lt;br&gt;
          "description": {"type": "string"},&lt;br&gt;
          "views": {"type": "string"},&lt;br&gt;
          "url": {"type": "string"}&lt;br&gt;
        }&lt;br&gt;
      }&lt;br&gt;
    }&lt;br&gt;
  }&lt;br&gt;
}&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


The `description` field within your schema properties is critical. It guides the extraction engine. If you want the integer value of a follower count instead of the string representation, you specify this in the description. Setting `"type": "integer"` and `"description": "The follower count converted to a full number, e.g. 1.2M becomes 1200000"` ensures your pipeline receives database-ready values.


## Handle pagination and scale

Single synchronous requests work well for testing. Production data pipelines require processing thousands of URLs. Holding open HTTP connections for thousands of synchronous browser rendering jobs will exhaust your local connection pools.

To scale, transition to asynchronous batch processing via webhooks. You submit a list of URLs and a schema. The platform processes the jobs concurrently and POSTs the extracted JSON back to your server.



```python title="batch_extract.py" {7-11}

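import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Compact profile schema for illustration; reuse the fuller quick-start schema in practice
profile_schema = {
    "type": "object",
    "properties": {"username": {"type": "string"}, "followers": {"type": "string"}},
}

urls = ["https://tiktok.com/@user1", "https://tiktok.com/@user2", "https://tiktok.com/@user3"]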

job = client.batch_extract(
    urls=urls,
    schema=profile_schema,
    webhook_url="https://api.yourdomain.com/webhooks/alterlab"
)

print(f"Batch job {job.id} queued.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Your server needs an endpoint to receive the data. Below is a minimal FastAPI implementation to catch the incoming JSON payloads.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="webhook_receiver.py" {6-9}&lt;br&gt;
from fastapi import FastAPI, Request&lt;/p&gt;

&lt;p&gt;app = FastAPI()&lt;/p&gt;

&lt;p&gt;@app.post("/webhooks/alterlab")&lt;br&gt;
async def receive_data(request: Request):&lt;br&gt;
    payload = await request.json()&lt;br&gt;
    # payload["data"] holds the extracted JSON that matches your schema&lt;br&gt;
    print(f"Received data for {payload['url']}: {payload['data']}")&lt;br&gt;
    return {"status": "received"}&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


Managing infrastructure costs is straightforward when using a data API. Instead of paying for idle proxy servers and constant maintenance engineering, you incur costs only for successful extractions. Review the [AlterLab pricing](/pricing) page to model your specific pipeline volume. The platform tracks your balance based on compute consumed per URL.

When running high-volume extractions, implement local rate limiting before pushing jobs to the API. While the extraction layer handles proxy rotation and network throttling against the target site, managing your own job queue prevents overwhelming your webhook receiving servers.


## Key takeaways

Extract TikTok data efficiently by moving away from DOM parsing. Pipelines that depend on HTML structure break whenever the target site updates its UI.

With schema-driven JSON extraction, you define the exact data contract your database requires. You submit a URL and a JSON schema. The API handles network routing, browser execution, and mapping the visual data to your schema. This produces clean, typed data ready for analytics and AI pipelines immediately upon receipt.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>dataextraction</category>
      <category>python</category>
      <category>datapipelines</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Scrape Facebook Data: Complete Guide for 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 18:19:24 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-scrape-facebook-data-complete-guide-for-2026-2noi</link>
      <guid>https://dev.to/alterlab/how-to-scrape-facebook-data-complete-guide-for-2026-2noi</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping. Do not attempt to bypass authentication walls or scrape private user data.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To scrape Facebook efficiently in 2026, use a managed extraction API to handle JavaScript rendering and automated proxy rotation. Target public Pages or Groups, load the page via a headless browser, and extract the embedded GraphQL JSON hydration objects from the page source rather than relying on brittle, auto-generated CSS selectors. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why collect social data from Facebook?
&lt;/h2&gt;

&lt;p&gt;Extracting data from public Facebook entities provides critical intelligence for several automated pipelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Brand Monitoring and Sentiment Analysis:&lt;/strong&gt; Tracking engagement metrics, public post frequency, and user comments on official corporate pages to measure brand health.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Market Research:&lt;/strong&gt; Aggregating event details, business hours, public contact information, and location data from localized business pages.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;E-commerce and Retail:&lt;/strong&gt; Monitoring official brand pages for product drops, limited-time discount codes, and promotional announcements.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In all these cases, the data is publicly visible to unauthenticated users. Automating the retrieval of this data allows engineering teams to build real-time monitoring systems without manual data entry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical challenges
&lt;/h2&gt;

&lt;p&gt;Scraping facebook.com requires navigating one of the most complex frontend architectures on the web. A standard HTTP GET request using &lt;code&gt;requests&lt;/code&gt; or &lt;code&gt;urllib&lt;/code&gt; will return a bare HTML shell that contains almost no usable data. &lt;/p&gt;

&lt;p&gt;Here is what you are up against:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic JavaScript Rendering&lt;/strong&gt;&lt;br&gt;
Facebook is built on React. The initial payload contains a minimal DOM tree and several megabytes of JavaScript. The actual content (posts, likes, text) is fetched asynchronously via GraphQL and rendered on the client side.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CSS Class Obfuscation&lt;/strong&gt;&lt;br&gt;
Stable CSS selectors like &lt;code&gt;.post-content&lt;/code&gt; or &lt;code&gt;.follower-count&lt;/code&gt; do not exist. Facebook compiles its styles into utility classes that look like &lt;code&gt;&amp;lt;div class="x1rg5ohu x1n2onr6 x3ajldb"&amp;gt;&lt;/code&gt;. These classes can change between deployments, quickly breaking selector-based scraping scripts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate Limiting and Anti-Bot Systems&lt;/strong&gt;&lt;br&gt;
Facebook aggressively monitors request velocity, IP reputation, and browser fingerprinting. Data center IP ranges are routinely blocked or presented with CAPTCHAs. &lt;/p&gt;

&lt;p&gt;To solve this, developers must execute full browser sessions while distributing requests across residential or high-quality proxy networks. This is where specialized infrastructure like our &lt;a href="https://dev.to/smart-rendering-api"&gt;Smart Rendering API&lt;/a&gt; comes in, automatically handling headless Chrome instances, fingerprint management, and request routing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Quick start with AlterLab API
&lt;/h2&gt;

&lt;p&gt;Instead of managing your own Playwright clusters and proxy pools, you can route your extraction jobs through AlterLab. Before starting, review the &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt; to secure your API keys and configure your environment.&lt;/p&gt;


  
  
  
  


&lt;p&gt;Install the Python client:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install alterlab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here is a basic request to fetch the fully rendered HTML of a public Facebook Page. Note that we enforce JavaScript rendering by setting &lt;code&gt;render_js=True&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# scrape_facebook-com.py
import os

import alterlab

client = alterlab.Client(api_key=os.getenv("ALTERLAB_API_KEY"))

response = client.scrape(
    url="https://facebook.com/SpaceX",
    render_js=True,
    # Wait for a container to mount. Obfuscated class names like this one
    # change between deployments, so expect to update this selector.
    wait_for=".x1rg5ohu",
)

print(f"Status Code: {response.status_code}")
print(f"Content Length: {len(response.text)} characters")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you prefer to work directly with the REST API using cURL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://facebook.com/SpaceX",
    "render_js": true
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Extracting structured data
&lt;/h2&gt;

&lt;p&gt;Because Facebook's CSS classes are auto-generated, parsing the DOM with BeautifulSoup or Cheerio is fragile. The most robust method for extracting data from Facebook in 2026 is &lt;strong&gt;Hydration State Extraction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Facebook uses Relay to manage its GraphQL data layer. When the server sends the page to the client, it embeds the initial GraphQL query results inside &lt;code&gt;&amp;lt;script type="application/json"&amp;gt;&lt;/code&gt; tags so the React application can "hydrate" without making immediate API calls.&lt;/p&gt;

&lt;p&gt;This JSON data contains clean, structured information about the page, its posts, and its metrics, completely bypassing the obfuscated HTML.&lt;/p&gt;

&lt;p&gt;Here is how to extract that structured data using Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# extract_hydration_state.py
import json
import re

import alterlab


def extract_facebook_page_data(url: str) -&amp;gt; dict:
    client = alterlab.Client("YOUR_API_KEY")

    # Fetch the rendered page
    response = client.scrape(url, render_js=True)
    html = response.text

    # Find the script tags containing the Relay hydration state.
    # Facebook typically uses script tags with specific data attributes.
    # re.DOTALL lets a payload span multiple lines.
    pattern = re.compile(
        r'&amp;lt;script type="application/json" data-content-len="[^"]*"&amp;gt;(.*?)&amp;lt;/script&amp;gt;',
        re.DOTALL,
    )
    matches = pattern.findall(html)

    page_data = {}

    for match in matches:
        try:
            data = json.loads(match)
            # Search the JSON tree for Page nodes
            # Note: The exact JSON path varies based on Facebook's current schema
            if 'require' in data:
                for req in data['require']:
                    if isinstance(req, list) and req[0] == 'RelayPrefetchedStreamCache':
                        # This typically contains the actual GraphQL payload
                        payload = req[3][1]['__bbox']['result']['data']
                        if 'page' in payload:
                            page_data['name'] = payload['page']['name']
                            page_data['followers'] = payload['page']['follower_count']
                            page_data['verification_status'] = payload['page']['is_verified']
        except (json.JSONDecodeError, KeyError, IndexError, TypeError):
            # Skip script tags that do not match the expected shape
            continue

    return page_data

# Execute
target_url = "https://facebook.com/SpaceX"
data = extract_facebook_page_data(target_url)
print(json.dumps(data, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This approach yields clean, structured data. If Facebook changes its UI layout, your scraper continues to function because the underlying GraphQL data model rarely changes abruptly.&lt;/p&gt;
&lt;h2&gt;
  
  
  Best practices
&lt;/h2&gt;

&lt;p&gt;When engineering data pipelines targeting massive platforms, resilience and compliance are your highest priorities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Respect robots.txt and Rate Limits&lt;/strong&gt;&lt;br&gt;
Always check Facebook's &lt;code&gt;robots.txt&lt;/code&gt; file. While you might technically be able to bypass certain restrictions, you must strictly limit your request concurrency. Flooding Facebook's servers can lead to IP bans and violates acceptable use policies. Introduce random jitter between requests (e.g., 2 to 7 seconds).&lt;/p&gt;
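
&lt;p&gt;As a rough sketch of the two habits above, the snippet below checks URLs against robots.txt rules and assigns each allowed URL a random 2 to 7 second pause. It uses only the Python standard library. The sample rules, the &lt;code&gt;example-crawler&lt;/code&gt; user agent, and the helper names are illustrative assumptions, not Facebook's real robots.txt or part of any SDK.&lt;/p&gt;

```python
import random
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def jitter_delay(low: float = 2.0, high: float = 7.0) -> float:
    """Pick a random pause, in seconds, to wait before a request."""
    return random.uniform(low, high)


def plan_requests(urls, parser, user_agent="example-crawler"):
    """Return (url, delay) pairs for the URLs that robots.txt allows."""
    return [
        (url, jitter_delay())
        for url in urls
        if parser.can_fetch(user_agent, url)
    ]


# Illustrative rules only, not Facebook's real robots.txt
SAMPLE_RULES = "User-agent: *\nDisallow: /private/"
sample_parser = build_robots_parser(SAMPLE_RULES)
sample_urls = ["https://example.com/page", "https://example.com/private/data"]

for url, delay in plan_requests(sample_urls, sample_parser):
    print(f"fetch {url} after sleeping {delay:.1f}s")
```

&lt;p&gt;In a real loop you would call &lt;code&gt;time.sleep(delay)&lt;/code&gt; before each request and keep concurrency low.&lt;/p&gt;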

&lt;p&gt;&lt;strong&gt;Target Public Interfaces Only&lt;/strong&gt;&lt;br&gt;
Your scrapers should never attempt to log in. Authenticated scraping violates Terms of Service and handles private user data, exposing you to severe liability. Stick strictly to public-facing Business Pages, public Groups, and public Event listings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle Geolocation Consistently&lt;/strong&gt;&lt;br&gt;
Facebook alters the language, layout, and sometimes the visibility of content based on the IP address location. Ensure your proxy network is set to a consistent region (e.g., US-East) so the JSON schema and page structure remain predictable.&lt;/p&gt;
&lt;h2&gt;
  
  
  Scaling up
&lt;/h2&gt;

&lt;p&gt;Running a single script on your laptop is fine for testing, but monitoring thousands of public Pages requires a distributed approach. &lt;/p&gt;

&lt;p&gt;To scale, you need to decouple your extraction logic from your execution environment. Push target URLs into a message broker (like RabbitMQ or AWS SQS), and use worker nodes to process the scrape jobs asynchronously.&lt;/p&gt;
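
&lt;p&gt;The pattern can be prototyped locally before you introduce a real broker. The sketch below uses an in-process &lt;code&gt;queue.Queue&lt;/code&gt; as a stand-in for RabbitMQ or SQS and threads as stand-ins for worker nodes. The &lt;code&gt;run_workers&lt;/code&gt; and &lt;code&gt;fake_fetch&lt;/code&gt; names are hypothetical; in production each worker would call your scraping client instead.&lt;/p&gt;

```python
import queue
import threading


def run_workers(urls, fetch, num_workers=3):
    """Drain a job queue with a small pool of worker threads."""
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # no work left
            try:
                outcome = fetch(url)
            except Exception as exc:  # one bad URL should not kill the worker
                outcome = exc
            with lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_fetch(url):
    # Hypothetical stand-in for a real scrape call
    return f"rendered:{url}"


print(run_workers(["https://example.com/a", "https://example.com/b"], fake_fetch))
```

&lt;p&gt;Because the extraction function is passed in as an argument, the same worker loop works whether the job source is a local queue or a managed broker.&lt;/p&gt;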


  
  
  


&lt;p&gt;When scaling up, managing browser contexts locally becomes a memory bottleneck. Each Chromium instance can consume hundreds of megabytes of RAM. Offloading this to an API ensures your workers only handle lightweight network I/O and JSON parsing.&lt;/p&gt;

&lt;p&gt;Review the &lt;a href="https://dev.to/pricing"&gt;AlterLab pricing&lt;/a&gt; page to model the costs of running high-concurrency headless browser workloads. You can significantly reduce costs by identifying which pages strictly require JavaScript rendering and which can be parsed from raw HTML responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# async_batch_scrape.py
import asyncio

import alterlab


async def scrape_batch(urls: list[str]):
    # Initialize async client
    client = alterlab.AsyncClient("YOUR_API_KEY")

    tasks = []
    for url in urls:
        # Queue up rendering requests
        tasks.append(client.scrape(url, render_js=True))

    # Execute concurrently
    results = await asyncio.gather(*tasks)

    for result in results:
        print(f"Scraped {len(result.text)} characters from target")

# Run async batch
urls_to_monitor = [
    "https://facebook.com/SpaceX",
    "https://facebook.com/NASA",
    "https://facebook.com/esa"
]
asyncio.run(scrape_batch(urls_to_monitor))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Scraping Facebook data in 2026 requires moving beyond legacy HTML parsing techniques.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid CSS Selectors:&lt;/strong&gt; Facebook's React utility classes will break your scrapers continuously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract Hydration State:&lt;/strong&gt; Target the embedded JSON payloads injected by Relay and GraphQL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Headless Browsers:&lt;/strong&gt; Raw HTTP requests will not trigger the JavaScript execution necessary to render the page payload.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stay Compliant:&lt;/strong&gt; Limit your scope to unauthenticated, publicly visible data and throttle your request volume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offload Infrastructure:&lt;/strong&gt; Use managed scraping APIs to handle proxy rotation and browser lifecycle management, allowing your team to focus on data parsing rather than cat-and-mouse infrastructure games.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>dataextraction</category>
      <category>api</category>
      <category>scraping</category>
    </item>
    <item>
      <title>How to Migrate from Firecrawl to AlterLab: Step-by-Step Guide (2026)</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 17:49:24 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-migrate-from-firecrawl-to-alterlab-step-by-step-guide-2026-4mbb</link>
      <guid>https://dev.to/alterlab/how-to-migrate-from-firecrawl-to-alterlab-step-by-step-guide-2026-4mbb</guid>
      <description>&lt;p&gt;Note: &lt;em&gt;Both APIs are capable – this guide is for developers prioritizing pay-as-you-go pricing and no subscription requirements.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To migrate from Firecrawl to AlterLab, install the &lt;code&gt;alterlab&lt;/code&gt; package, replace your &lt;code&gt;FirecrawlApp&lt;/code&gt; instantiation with &lt;code&gt;alterlab.Client&lt;/code&gt;, and update your API key. The scraping parameters translate directly, allowing you to switch without rewriting your core extraction logic. You can complete the migration in under an hour.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why migrate?
&lt;/h2&gt;

&lt;p&gt;If you are evaluating alternatives to Firecrawl, the most common reason to switch is pricing structure. Firecrawl relies on credit-based billing with monthly plan minimums. AlterLab uses a pure pay-as-you-go model where you only pay for successful requests, and your account balance never expires. Read our &lt;a href="https://dev.to/vs/firecrawl"&gt;detailed Firecrawl comparison&lt;/a&gt; for more context.&lt;/p&gt;

&lt;p&gt;Beyond pricing, AlterLab provides fine-grained control over browser rendering, intelligent tier routing to bypass captchas, and a unified API for scraping, monitoring, and AI extraction. Our architecture handles scale without requiring you to manage concurrent job queues or complex asynchronous workflows manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you start the migration, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AlterLab account (&lt;a href="https://dev.to/signup"&gt;free sign-up&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;An active API key&lt;/li&gt;
&lt;li&gt;A Python 3.8+ or Node.js environment&lt;/li&gt;
&lt;li&gt;5 minutes to update your code&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Step 1: Install the AlterLab SDK
&lt;/h2&gt;

&lt;p&gt;The fastest way to migrate is using the AlterLab Python SDK. You can also use the REST API directly if you prefer writing your own HTTP wrappers. See our &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt; for full installation details.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install alterlab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For Node.js users, the process is identical using npm:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @alterlab/client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



  
  
  
  

&lt;h2&gt;
  
  
  Step 2: Replace your API calls
&lt;/h2&gt;

&lt;p&gt;AlterLab is designed to be highly compatible with existing scraping pipelines. You only need to swap the initialization and the core scraping method. &lt;/p&gt;

&lt;p&gt;Here is what your Firecrawl implementation likely looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# before_firecrawl.py
# Firecrawl (before migration)
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")
response = app.scrape_url('https://example.com')

print(response.get('markdown'))
print(response.get('metadata').get('title'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here is the equivalent implementation after migrating to AlterLab:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# after_alterlab.py
# AlterLab (after migration)
import alterlab

client = alterlab.Client(api_key="al-YOUR_API_KEY")
response = client.scrape(
    url="https://example.com",
    formats=["markdown", "html"]
)

print(response.markdown)
print(response.metadata.title)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The primary difference is specifying the output formats in the request rather than extracting them from a monolithic response dictionary. This reduces bandwidth overhead by only fetching the formats you actually need. &lt;/p&gt;
&lt;h2&gt;
  
  
  Step 3: Handle response format differences
&lt;/h2&gt;

&lt;p&gt;Firecrawl returns a dictionary containing metadata, markdown, HTML, and other fields depending on the request. AlterLab returns a strongly-typed response object, which provides better IDE autocompletion and type safety.&lt;/p&gt;

&lt;p&gt;Instead of dictionary lookups, access properties directly on the response object:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;response.get('html')&lt;/code&gt; becomes &lt;code&gt;response.html&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response.get('markdown')&lt;/code&gt; becomes &lt;code&gt;response.markdown&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;response.get('metadata')&lt;/code&gt; becomes &lt;code&gt;response.metadata&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are passing the response directly into an LLM or a downstream pipeline, update the accessor syntax to match the AlterLab object structure. The markdown output generated by AlterLab is optimized for LLM context windows, stripping navigation elements, footers, and boilerplate automatically.&lt;/p&gt;
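
&lt;p&gt;If a large codebase still calls &lt;code&gt;response.get(...)&lt;/code&gt; in many places, a thin adapter can bridge the two styles while you migrate incrementally. This is a minimal sketch, not part of the AlterLab SDK: the &lt;code&gt;DictStyleResponse&lt;/code&gt; class is hypothetical, and the &lt;code&gt;SimpleNamespace&lt;/code&gt; object only stands in for a real typed response.&lt;/p&gt;

```python
from types import SimpleNamespace


class DictStyleResponse:
    """Expose attribute-style response fields through a dict-like get()."""

    def __init__(self, response):
        self._response = response

    def get(self, key, default=None):
        # Falls back to the default when the attribute is missing
        return getattr(self._response, key, default)


# Stand-in for a typed response object (illustrative values only)
typed_response = SimpleNamespace(markdown="# Title", html="raw html here")

legacy_view = DictStyleResponse(typed_response)
print(legacy_view.get("markdown"))
print(legacy_view.get("metadata", {}))
```

&lt;p&gt;Once every call site uses attribute access, the adapter can be deleted.&lt;/p&gt;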
&lt;h3&gt;
  
  
  Format Conversion Pipelines
&lt;/h3&gt;

&lt;p&gt;Firecrawl often requires you to write custom parsing logic after retrieving the markdown. AlterLab natively supports multiple formats in a single request. &lt;/p&gt;

&lt;p&gt;You can request &lt;code&gt;json&lt;/code&gt;, &lt;code&gt;markdown&lt;/code&gt;, and &lt;code&gt;text&lt;/code&gt; simultaneously.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# multi_format.py
response = client.scrape(
    url="https://example.com",
    formats=["markdown", "json", "text"]
)

# Use text for a rough, whitespace-based token count
token_count = len(response.text.split())

# Use markdown for LLM context
llm_prompt = f"Summarize this: {response.markdown}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;text&lt;/code&gt; format is specifically designed for RAG (Retrieval-Augmented Generation) pipelines, stripping all HTML and markdown formatting to provide clean, readable prose.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrating Batch Processing
&lt;/h3&gt;

&lt;p&gt;If you use Firecrawl to scrape multiple URLs simultaneously, you likely map over the &lt;code&gt;scrape_url&lt;/code&gt; method or use their async batch endpoints. AlterLab handles batching seamlessly.&lt;/p&gt;

&lt;p&gt;Pass a list of URLs directly to the &lt;code&gt;scrape&lt;/code&gt; method. AlterLab automatically parallelizes the requests and returns a list of response objects.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# batch_processing.py
import alterlab

client = alterlab.Client()
responses = client.scrape(
    urls=["https://example.com/1", "https://example.com/2"],
    formats=["markdown"]
)

for response in responses:
    print(response.markdown)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Migrating LLM Extraction
&lt;/h3&gt;

&lt;p&gt;If you use Firecrawl's LLM extraction capabilities to convert raw text into structured JSON, you will migrate to AlterLab's Cortex AI. The schema definition remains exactly the same, using standard JSON Schema.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# cortex_extraction.py
# AlterLab Cortex AI Extraction
import alterlab

client = alterlab.Client()
response = client.extract(
    url="https://example.com/products",
    schema={
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"}
                    }
                }
            }
        }
    }
)

print(response.data.products)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Cortex AI natively understands complex DOM structures and requires no manual CSS selectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Update your error handling
&lt;/h2&gt;

&lt;p&gt;AlterLab simplifies error handling through intelligent tier routing. If a scrape fails due to an anti-bot block on a standard residential proxy, AlterLab automatically escalates the request using a higher tier. Tier 1 handles standard sites using cURL. Tier 3 introduces headless browsers for JavaScript rendering. Tier 5 resolves complex CAPTCHAs and Cloudflare challenges.&lt;/p&gt;

&lt;p&gt;You do not need to implement manual retry logic for blocks. The API returns the data or throws a terminal exception.&lt;/p&gt;

&lt;p&gt;Catch &lt;code&gt;alterlab.errors.ScrapeError&lt;/code&gt; for terminal failures:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# error_handling.py
import alterlab
from alterlab.errors import ScrapeError

client = alterlab.Client(api_key="al-YOUR_API_KEY")

try:
    response = client.scrape("https://example.com")
except ScrapeError as e:
    print(f"Scrape failed: {e.message}")
    print(f"Status code: {e.status_code}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Remove any exponential backoff or retry decorators you previously used to handle 429 Too Many Requests errors from the target site. AlterLab manages proxy rotation, rate limits, and concurrent session tracking internally.&lt;/p&gt;
&lt;h2&gt;
  
  
  Cost comparison
&lt;/h2&gt;

&lt;p&gt;When you migrate, your billing shifts from monthly subscriptions to strict usage-based billing. See the full &lt;a href="https://dev.to/pricing"&gt;AlterLab pricing&lt;/a&gt; page for details.&lt;/p&gt;

&lt;p&gt;Here is a practical comparison:&lt;/p&gt;


  
  
  


&lt;p&gt;With AlterLab, 10,000 basic HTML requests cost exactly $2.00. You add funds to your balance, and those funds remain available until you use them. There are no monthly quotas to manage, no overage penalties, and no credits that expire at the end of the billing cycle.&lt;/p&gt;

&lt;p&gt;Firecrawl pricing requires you to select a monthly tier, which means you pay the fixed price even if your scraping volume drops. AlterLab aligns costs directly with your infrastructure usage.&lt;/p&gt;
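
&lt;p&gt;To turn the quoted figure into a budget estimate, the arithmetic is simply requests divided by 10,000, multiplied by $2.00. The sketch below assumes that rate applies linearly to basic HTML requests only; rendered or higher-tier requests are priced differently, so check the pricing page before relying on it. The function name is illustrative.&lt;/p&gt;

```python
from decimal import Decimal

# From the figure quoted above: 10,000 basic HTML requests cost $2.00
PRICE_PER_10K_BASIC = Decimal("2.00")


def estimate_basic_cost(request_count: int) -> Decimal:
    """Estimate spend in dollars for basic HTML requests at a linear rate."""
    cost = PRICE_PER_10K_BASIC * request_count / 10_000
    return cost.quantize(Decimal("0.01"))


for volume in (10_000, 25_000, 1_000_000):
    print(f"{volume:>9,} requests: ${estimate_basic_cost(volume)}")
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; instead of floats keeps currency values exact to the cent.&lt;/p&gt;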
&lt;h2&gt;
  
  
  Migrating Advanced Workflows
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Webhooks
&lt;/h3&gt;

&lt;p&gt;If you rely on Firecrawl webhooks to push scraped data to your servers, you must update your endpoint to receive the AlterLab JSON payload structure. AlterLab webhooks trigger immediately upon task completion, removing the need to poll for status updates.&lt;/p&gt;

&lt;p&gt;To configure a webhook in AlterLab, pass the &lt;code&gt;webhook_url&lt;/code&gt; parameter during your scrape request. The core data payload remains the same, but the wrapper keys differ.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# webhooks.py
response = client.scrape(
    url="https://example.com",
    formats=["json"],
    webhook_url="https://your-server.com/webhooks/alterlab"
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Scheduling and Diff Monitoring
&lt;/h3&gt;

&lt;p&gt;Many developers run Firecrawl within cron jobs on their own servers to track page changes over time. AlterLab handles this natively. You can migrate your local cron schedules directly to AlterLab's infrastructure.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# scheduling.py
client.schedules.create(
    url="https://example.com/pricing",
    formats=["markdown"],
    cron="0 0 * * *",
    detect_diff=True
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This configuration runs the scrape daily at midnight and only triggers a webhook if the markdown content has changed since the previous run.&lt;/p&gt;
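
&lt;p&gt;If you previously implemented change detection yourself, the underlying idea is usually a content fingerprint comparison. The sketch below illustrates that idea with a SHA-256 hash of normalized markdown. It is an assumption about how such a check can work in general, not a description of how AlterLab implements &lt;code&gt;detect_diff&lt;/code&gt;, and the function names are hypothetical.&lt;/p&gt;

```python
import hashlib


def content_fingerprint(markdown: str) -> str:
    """Hash markdown after trimming whitespace noise that is not a real change."""
    lines = markdown.strip().splitlines()
    normalized = "\n".join(line.rstrip() for line in lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, markdown: str) -> bool:
    """Report whether the content differs from the previous run."""
    return previous_fingerprint != content_fingerprint(markdown)


yesterday = content_fingerprint("# Pricing\nStarter: $10")
print(has_changed(yesterday, "# Pricing\nStarter: $10  \n"))
print(has_changed(yesterday, "# Pricing\nStarter: $12"))
```

&lt;p&gt;The first call reports no change because only trailing whitespace differs; the second reports a change because the price text is different.&lt;/p&gt;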

&lt;h2&gt;
  
  
  Team Collaboration and API Keys
&lt;/h2&gt;

&lt;p&gt;If you are migrating a team from Firecrawl, AlterLab simplifies access control. In Firecrawl, teams often share a single API key or struggle with segmented billing. AlterLab provides native multi-user organizations with shared billing.&lt;/p&gt;

&lt;p&gt;You can issue scoped API keys for different environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;al-prod-...&lt;/code&gt; with a $100/day spend limit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;al-dev-...&lt;/code&gt; with a $5/day spend limit&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensures a rogue script in development cannot drain your account balance. You configure these limits in the AlterLab dashboard without modifying your application code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Common issues and fixes
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Missing format error&lt;/strong&gt;: AlterLab requires you to specify the formats you need (e.g., &lt;code&gt;formats=["json", "markdown"]&lt;/code&gt;). If you omit this parameter, the API defaults to raw HTML. Explicitly declare your required formats to ensure consistent pipeline behavior and minimize bandwidth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timeout limits&lt;/strong&gt;: AlterLab defaults to a 60-second timeout. If you are scraping extremely slow, JavaScript-heavy sites, increase this limit via &lt;code&gt;client.scrape(url, timeout=120)&lt;/code&gt;. The maximum allowed timeout is 300 seconds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication&lt;/strong&gt;: Ensure your &lt;code&gt;ALTERLAB_API_KEY&lt;/code&gt; is loaded in your environment variables. The Python SDK will automatically pick it up if you omit the &lt;code&gt;api_key&lt;/code&gt; parameter during client initialization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JavaScript Rendering&lt;/strong&gt;: Firecrawl attempts to automatically detect when JavaScript rendering is needed. In AlterLab, you can explicitly control this by setting the minimum proxy tier. Set &lt;code&gt;min_tier=3&lt;/code&gt; to guarantee headless browser execution for Single Page Applications (SPAs) built with React or Vue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pagination handling&lt;/strong&gt;: If your Firecrawl script handled pagination by manually finding links, you can migrate this logic directly. AlterLab's Cortex AI can also extract pagination URLs automatically by adding a &lt;code&gt;next_page_url&lt;/code&gt; field to your JSON schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geolocation restrictions&lt;/strong&gt;: If you need to scrape sites that restrict access by country, specify the proxy location in your AlterLab request: &lt;code&gt;client.scrape(url, geo="us")&lt;/code&gt;. We support over 40 country codes natively without requiring third-party proxy integrations.&lt;/li&gt;
&lt;/ul&gt;
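
&lt;p&gt;Two of the issues above, a missing API key and an out-of-range timeout, can be caught before any request is sent. The helper below is a hypothetical pre-flight check built only on the values stated in this guide (60-second default, 300-second maximum, &lt;code&gt;ALTERLAB_API_KEY&lt;/code&gt; environment variable). It is not part of the SDK.&lt;/p&gt;

```python
import os

DEFAULT_TIMEOUT = 60  # seconds, the default stated above
MAX_TIMEOUT = 300     # seconds, the maximum stated above


def resolve_timeout(requested=None) -> int:
    """Return a timeout within the allowed range."""
    if requested is None:
        return DEFAULT_TIMEOUT
    return max(1, min(int(requested), MAX_TIMEOUT))


def require_api_key(env=os.environ) -> str:
    """Fail fast with a clear message when the key is not configured."""
    key = env.get("ALTERLAB_API_KEY", "").strip()
    if not key:
        raise RuntimeError("ALTERLAB_API_KEY is not set")
    return key


print(resolve_timeout())     # falls back to the default
print(resolve_timeout(900))  # clamped to the maximum
```

&lt;p&gt;Running checks like these at startup surfaces configuration mistakes before they show up as failed scrape jobs.&lt;/p&gt;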

&lt;h2&gt;
  
  
  You're done
&lt;/h2&gt;

&lt;p&gt;That is the entire migration process. Swap the client library, update the API key, and adjust your response object accessors. Your existing data extraction logic, LLM prompts, and downstream data pipelines will continue working as normal. &lt;/p&gt;

&lt;p&gt;Contact our support team if you encounter any unexpected behavior during your transition. We monitor API logs 24/7 and can assist with custom extraction schemas if you hit edge cases.&lt;/p&gt;

</description>
      <category>antibot</category>
      <category>automation</category>
      <category>python</category>
      <category>dataextraction</category>
    </item>
    <item>
      <title>Rotating vs Residential Proxies: Choose the Right IP</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 16:31:33 +0000</pubDate>
      <link>https://dev.to/alterlab/rotating-vs-residential-proxies-choose-the-right-ip-jda</link>
      <guid>https://dev.to/alterlab/rotating-vs-residential-proxies-choose-the-right-ip-jda</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Use residential proxies for targets with strict bot protection where IP trust scores matter. Use rotating datacenter proxies for general data extraction where speed and cost-efficiency take priority. Your choice directly dictates the success rate, infrastructure cost, and architectural complexity of your scraping pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Proxy Trust Hierarchy
&lt;/h2&gt;

&lt;p&gt;Target servers evaluate incoming requests based on the IP address origin. This origin dictates a foundational trust score. &lt;/p&gt;

&lt;p&gt;Every IP address maps to an Autonomous System Number (ASN). Firewalls and WAFs classify ASNs into broad categories. Datacenter ASNs belong to cloud hosting providers. Traffic originating from these IPs is instantly categorized as machine-generated. Consumer ISP ASNs belong to residential telecommunications companies. Traffic originating from these IPs is categorized as human.&lt;/p&gt;

&lt;p&gt;When building a web scraper for publicly accessible data, the ASN classification determines whether your request gets served an HTML document, a CAPTCHA, or a hard TCP reset.&lt;/p&gt;
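
&lt;p&gt;To make the classification step concrete, here is a toy version of the lookup a WAF might perform. Real systems rely on full ASN databases; this sketch substitutes two hard-coded documentation-only networks (from RFC 5737), so the ranges and labels are purely illustrative assumptions.&lt;/p&gt;

```python
import ipaddress

# Hypothetical sample ranges (RFC 5737 documentation networks), not real ASN data
DATACENTER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
RESIDENTIAL_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]


def classify_ip(ip: str) -> str:
    """Label an IP by which sample network list contains it."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in DATACENTER_NETWORKS):
        return "datacenter"
    if any(address in network for network in RESIDENTIAL_NETWORKS):
        return "residential"
    return "unknown"


for sample in ("192.0.2.10", "198.51.100.7", "203.0.113.5"):
    print(sample, classify_ip(sample))
```

&lt;p&gt;A production firewall would combine this label with other signals, but the IP origin check happens first and sets the baseline trust level.&lt;/p&gt;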

&lt;h2&gt;
  
  
  Datacenter Rotating Proxies: Fast and Cost-Effective
&lt;/h2&gt;

&lt;p&gt;Datacenter proxies are IP addresses assigned to servers in commercial data centers. When you use a rotating datacenter proxy, a gateway server intercepts your request and routes it through one of thousands of available datacenter IPs. The gateway automatically swaps the exit IP address based on a time interval or on every new request.&lt;/p&gt;
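
&lt;p&gt;The two rotation modes described above, per request and per time interval, can be modeled in a few lines. This is a simplified sketch of gateway behavior, not any provider's actual implementation. The &lt;code&gt;RotatingGateway&lt;/code&gt; class and its parameters are hypothetical, and the IPs are documentation addresses.&lt;/p&gt;

```python
import itertools
import time


class RotatingGateway:
    """Toy model of a gateway that rotates exit IPs."""

    def __init__(self, exit_ips, rotate_every=None, clock=time.monotonic):
        self._pool = itertools.cycle(exit_ips)
        self._rotate_every = rotate_every  # None means rotate on every request
        self._clock = clock
        self._current = next(self._pool)
        self._last_rotation = clock()
        self._first_request = True

    def exit_ip_for_request(self) -> str:
        if self._first_request:
            self._first_request = False
            return self._current
        if self._rotate_every is None:
            # Per-request rotation
            self._current = next(self._pool)
        elif self._clock() - self._last_rotation >= self._rotate_every:
            # Interval-based rotation
            self._current = next(self._pool)
            self._last_rotation = self._clock()
        return self._current


gateway = RotatingGateway(["192.0.2.1", "192.0.2.2", "192.0.2.3"])
print([gateway.exit_ip_for_request() for _ in range(4)])
```

&lt;p&gt;Passing &lt;code&gt;rotate_every=600&lt;/code&gt; would instead keep the same exit IP for ten minutes, which suits workflows that need a stable session.&lt;/p&gt;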

&lt;p&gt;These proxies operate on gigabit fiber connections. They offer low latency (typically under 50 ms) to major cloud providers. They process high-concurrency requests without bottlenecking.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cost Structure
&lt;/h3&gt;

&lt;p&gt;Datacenter IPs are cheap to provision in bulk. Providers typically charge a flat monthly rate per IP or provide unmetered bandwidth on a shared pool. This makes them highly cost-effective for large-scale data extraction tasks. &lt;/p&gt;

&lt;h3&gt;
  
  
  Ideal Use Cases
&lt;/h3&gt;

&lt;p&gt;Deploy rotating datacenter proxies when your targets lack sophisticated bot protection. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standard public record databases&lt;/li&gt;
&lt;li&gt;Weather telemetry endpoints&lt;/li&gt;
&lt;li&gt;Academic and scientific publication repositories&lt;/li&gt;
&lt;li&gt;Basic news and media aggregation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the target server does not penalize cloud ASNs, datacenter proxies are the correct engineering choice. They provide the necessary concurrency without inflating infrastructure spend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Residential Proxies: High Trust, Higher Complexity
&lt;/h2&gt;

&lt;p&gt;Residential proxies route your HTTP requests through real devices sitting in homes around the world. These devices connect to standard consumer ISPs. &lt;/p&gt;

&lt;p&gt;When a WAF inspects a request from a residential proxy, it sees an IP address belonging to a local telecommunications provider. The trust score is inherently high. The request looks like a standard consumer browsing the web.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture of a Residential Network
&lt;/h3&gt;

&lt;p&gt;Unlike datacenter servers mounted in static racks, residential nodes are dynamic. The IP pool consists of devices that come online and offline unpredictably. A user might turn off their Wi-Fi router. A mobile phone might switch cellular towers. &lt;/p&gt;

&lt;p&gt;This introduces instability. Connections drop. Latency spikes depending on the node's geographic location and local network congestion. You must architect your scraping pipeline to handle frequent connection resets and high timeout thresholds.&lt;/p&gt;
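
&lt;p&gt;One common way to absorb dropped connections is a retry wrapper with exponential backoff. The sketch below catches Python's built-in &lt;code&gt;ConnectionError&lt;/code&gt; and &lt;code&gt;TimeoutError&lt;/code&gt;; with a library such as &lt;code&gt;requests&lt;/code&gt; you would catch that library's own exception types instead. The function name, defaults, and the flaky test function are illustrative assumptions.&lt;/p&gt;

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying when the connection drops or times out."""
    attempts = max(1, attempts)
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except (ConnectionError, TimeoutError) as exc:
            last_error = exc
            if attempt != attempts - 1:
                # Exponential backoff: base_delay, then double each time
                sleep(base_delay * 2 ** attempt)
    raise last_error


calls = {"count": 0}


def flaky_fetch(url):
    # Hypothetical node that drops the first two connections
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionError("node went offline")
    return f"content of {url}"


print(fetch_with_retries(flaky_fetch, "https://example.com", sleep=lambda s: None))
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; as a parameter keeps the function easy to test without real delays.&lt;/p&gt;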

&lt;h3&gt;
  
  
  The Cost Structure
&lt;/h3&gt;

&lt;p&gt;Because sourcing residential IP addresses is difficult, the pricing model shifts. Providers bill residential proxies by bandwidth consumption (per gigabyte) rather than per IP. Fetching large HTML payloads, images, or executing heavy JavaScript bundles over residential networks becomes expensive quickly.&lt;/p&gt;
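
&lt;p&gt;Because billing is per gigabyte, a quick estimate before launching a job helps avoid surprises. The sketch below multiplies page count by average payload size and a per-GB price. The $8.00 per GB rate and 500 KB page size are made-up inputs for illustration, not quotes from any provider.&lt;/p&gt;

```python
def estimate_bandwidth_cost(pages: int, avg_page_kb: float, price_per_gb: float) -> float:
    """Estimate residential proxy spend for a scraping job."""
    total_gb = pages * avg_page_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    return round(total_gb * price_per_gb, 2)


# Hypothetical job: 100,000 pages at 500 KB each, billed at $8.00 per GB
print(estimate_bandwidth_cost(100_000, 500, 8.00))
```

&lt;p&gt;Blocking images and other non-essential assets lowers the average payload size, which lowers the bill proportionally.&lt;/p&gt;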

&lt;h3&gt;
  
  
  Ideal Use Cases
&lt;/h3&gt;

&lt;p&gt;Deploy residential proxies when extracting data from high-value targets that actively block cloud traffic.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Localized e-commerce pricing and availability&lt;/li&gt;
&lt;li&gt;Travel and flight fare aggregation&lt;/li&gt;
&lt;li&gt;Real estate listing aggregation&lt;/li&gt;
&lt;li&gt;Ad verification and localized search engine results&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Residential IPs excel at geo-targeting. Because the nodes are real devices, you can specify traffic to exit from specific countries, states, or even individual cities. This is required when scraping localized inventory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Breakdown
&lt;/h2&gt;

&lt;p&gt;Understanding the tradeoffs requires a direct comparison of infrastructure capabilities.&lt;/p&gt;


&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;

    &lt;thead&gt;

      &lt;tr&gt;

        &lt;th&gt;Specification&lt;/th&gt;

        &lt;th&gt;Rotating Datacenter&lt;/th&gt;

        &lt;th&gt;Residential&lt;/th&gt;

      &lt;/tr&gt;

    &lt;/thead&gt;

    &lt;tbody&gt;

      &lt;tr&gt;

        &lt;td&gt;IP Origin&lt;/td&gt;

        &lt;td&gt;Commercial Server (Cloud ASN)&lt;/td&gt;

        &lt;td&gt;Consumer Device (ISP ASN)&lt;/td&gt;

      &lt;/tr&gt;

      &lt;tr&gt;

        &lt;td&gt;Trust Score&lt;/td&gt;

        &lt;td&gt;Low to Medium&lt;/td&gt;

        &lt;td&gt;High&lt;/td&gt;

      &lt;/tr&gt;

      &lt;tr&gt;

        &lt;td&gt;Connection Speed&lt;/td&gt;

        &lt;td&gt;1000+ Mbps&lt;/td&gt;

        &lt;td&gt;1-50 Mbps&lt;/td&gt;

      &lt;/tr&gt;

      &lt;tr&gt;

        &lt;td&gt;Latency&lt;/td&gt;

        &lt;td&gt;&amp;lt; 50ms&lt;/td&gt;

        &lt;td&gt;200ms - 2000ms+&lt;/td&gt;

      &lt;/tr&gt;

      &lt;tr&gt;

        &lt;td&gt;Billing Model&lt;/td&gt;

        &lt;td&gt;Per IP / Flat rate pool&lt;/td&gt;

        &lt;td&gt;Per Gigabyte (GB)&lt;/td&gt;

      &lt;/tr&gt;

      &lt;tr&gt;

        &lt;td&gt;Connection Stability&lt;/td&gt;

        &lt;td&gt;99.9% Uptime&lt;/td&gt;

        &lt;td&gt;Variable (Nodes drop offline)&lt;/td&gt;

      &lt;/tr&gt;

    &lt;/tbody&gt;

  &lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Implementation Mechanics
&lt;/h2&gt;

&lt;p&gt;Integrating rotating proxies into a data pipeline requires handling the authentication and routing at the HTTP client level. Most proxy providers use a backconnect gateway. You send requests to a single hostname, and the provider's load balancer handles the IP rotation on the backend.&lt;/p&gt;

&lt;p&gt;Here is a standard implementation using Python.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import requests

# Proxy gateway credentials provided by your network
PROXY_HOST = "gateway.proxyprovider.com"
PROXY_PORT = "8000"
PROXY_USER = "user123"
PROXY_PASS = "pass456"

proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
proxies = {
    "http": proxy_url,
    "https": proxy_url,
}

def fetch_data(url):
    try:
        # High timeout required if routing through residential nodes
        response = requests.get(url, proxies=proxies, timeout=15)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

target = "https://example-retail-site.com/product/123"
html_content = fetch_data(target)
# fetch_data returns None on failure, so guard before calling len()
if html_content is not None:
    print(f"Fetched {len(html_content)} characters")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The code above solves the IP routing. However, an IP address is only one layer of the HTTP request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond the IP Address: The Fingerprint Problem
&lt;/h2&gt;

&lt;p&gt;Modern web applications do not rely solely on IP reputation. They inspect the entire request fingerprint.&lt;/p&gt;

&lt;p&gt;Even if you route a Python &lt;code&gt;requests&lt;/code&gt; call through a highly trusted residential IP, a competent WAF can still block it. The WAF inspects the TLS handshake and sees the JA3/JA4 fingerprint associated with the Python &lt;code&gt;ssl&lt;/code&gt; module. It also checks the HTTP layer: header order, and for HTTP/2 clients the pseudo-header order, neither of which matches a standard Chrome or Firefox browser.&lt;/p&gt;

&lt;p&gt;The target server concludes that while the IP address belongs to a consumer ISP, the software making the request is a script, and it drops or challenges the connection.&lt;/p&gt;

&lt;p&gt;To succeed at scale, your infrastructure must pair high-trust IPs with accurate browser fingerprints. This requires managing headless browsers, patching TLS libraries, and handling dynamic rendering.&lt;/p&gt;

&lt;p&gt;Instead of building and maintaining this infrastructure internally, engineers can use AlterLab. The platform handles the IP rotation, network retries, and browser fingerprinting automatically.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from alterlab import Client

# Initialize the client. IP rotation and TLS patching are automatic.
client = Client("YOUR_API_KEY")

# AlterLab routes the request through the optimal proxy pool
response = client.scrape(
    "https://example-retail-site.com/product/123",
    render_js=True,
    country="US"
)

print(response.json())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This removes proxy management from your application code: you request the data, and the API handles the network layer. You can explore the &lt;a href="https://alterlab.io/web-scraping-api-python" rel="noopener noreferrer"&gt;Python SDK&lt;/a&gt; to see how connection handling and automated retries are taken care of for you.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Waterfall Strategy: Optimizing Cost and Success
&lt;/h2&gt;

&lt;p&gt;Because residential proxies bill by bandwidth, running all scraper traffic through them is financially inefficient. Data engineering teams solve this using a waterfall proxy strategy.&lt;/p&gt;

&lt;p&gt;The waterfall method implements a fallback mechanism in the scraping queue.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Attempt 1 (Datacenter)&lt;/strong&gt;: The scraper requests the target URL using a fast, cheap datacenter proxy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validation&lt;/strong&gt;: The system inspects the response. Does it contain the expected data payload? Did the server return a 403 Forbidden? Did it return a CAPTCHA challenge page?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Attempt 2 (Residential)&lt;/strong&gt;: If the datacenter request fails validation, the scraper requeues the URL and routes the second attempt through a residential proxy.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This architecture ensures you only pay residential proxy bandwidth rates when absolutely necessary. Routine API endpoints and static assets load via cheap datacenter nodes. Highly protected HTML payloads load via residential nodes.&lt;/p&gt;
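
&lt;p&gt;The fallback steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the tier names, the &lt;code&gt;looks_blocked&lt;/code&gt; heuristic, and the injected &lt;code&gt;fetch&lt;/code&gt; callable are all assumptions made for the example, and a real pipeline would plug in its own HTTP client and validation rules.&lt;/p&gt;

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical tier names, ordered from cheapest to most expensive.
TIERS = ("datacenter", "residential")

# A fetch callable takes (url, tier) and returns (status_code, body).
Fetch = Callable[[str, str], Tuple[int, str]]


def looks_blocked(status_code: int, body: str) -> bool:
    """Rough validation heuristic (an assumption, tune it per target)."""
    if status_code in (403, 429):
        return True
    # A challenge page often returns 200, so inspect the body as well.
    return "captcha" in body.lower()


def fetch_with_waterfall(
    url: str, fetch: Fetch, tiers: Sequence[str] = TIERS
) -> Optional[Tuple[str, str]]:
    """Try each tier in order; return (tier, body) for the first valid response."""
    for tier in tiers:
        status_code, body = fetch(url, tier)
        if status_code == 200 and not looks_blocked(status_code, body):
            return tier, body
    return None  # every tier failed validation


def fake_fetch(url: str, tier: str) -> Tuple[int, str]:
    """Stand-in for a real HTTP call: the datacenter tier is always blocked."""
    if tier == "datacenter":
        return 403, "Forbidden"
    return 200, "product page"


print(fetch_with_waterfall("https://example-retail-site.com/product/123", fake_fetch))
```

&lt;p&gt;Passing the fetch function in as a parameter keeps the fallback logic independent of any particular proxy provider and makes it easy to test without network access.&lt;/p&gt;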
&lt;h2&gt;
  
  
  Performance and Cost Analytics
&lt;/h2&gt;

&lt;p&gt;When designing your system, expect distinct performance profiles between the two networks.&lt;/p&gt;


  
  
  


&lt;p&gt;Residential networks introduce significant latency. A standard HTTP GET request might take 800 milliseconds just to establish the TCP connection and TLS handshake, before any data transfers. If your pipeline relies on scraping tens of thousands of pages per minute, this latency dictates how many concurrent workers you must provision.&lt;/p&gt;

&lt;p&gt;Datacenter networks are highly predictable. Throughput is limited only by your server's network interface and the target's rate limits. &lt;/p&gt;
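
&lt;p&gt;As a rough sizing aid, Little's law relates throughput and latency to the number of requests in flight. The sketch below applies it under simplifying assumptions (steady traffic, one request per worker at a time, no retries). The function name and the sample figures are illustrative, not measurements.&lt;/p&gt;

```python
def workers_needed(pages_per_minute: int, avg_latency_ms: int) -> int:
    """Little's law: requests in flight = arrival rate x time per request.

    Assumes each worker handles one request at a time and ignores retries.
    """
    ms_per_minute = 60_000
    # Ceiling division using integer math only.
    return -(-(pages_per_minute * avg_latency_ms) // ms_per_minute)


# Illustrative figures only: the same target rate on two latency profiles.
print(workers_needed(30_000, 2_000))  # residential-like latency
print(workers_needed(30_000, 50))     # datacenter-like latency
```

&lt;p&gt;Because the relationship is linear, doubling the average latency doubles the number of concurrent workers needed to hold the same page rate.&lt;/p&gt;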
&lt;h2&gt;
  
  
  Connection Handling and Retries
&lt;/h2&gt;

&lt;p&gt;When using residential proxy pools, your code must anticipate connection failures. Residential nodes are mobile phones losing cellular signal, or home routers rebooting. A node might die midway through transmitting an HTML payload.&lt;/p&gt;

&lt;p&gt;Implement aggressive retry logic with exponential backoff.&lt;br&gt;
&lt;/p&gt;
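
&lt;p&gt;A minimal sketch of retry with exponential backoff follows. It is deliberately client-agnostic: &lt;code&gt;operation&lt;/code&gt; is any zero-argument callable that performs the request, and the helper name, defaults, and jitter range are assumptions for illustration. With &lt;code&gt;requests&lt;/code&gt;, the retryable exceptions would typically be &lt;code&gt;requests.exceptions.ConnectionError&lt;/code&gt; and &lt;code&gt;requests.exceptions.Timeout&lt;/code&gt;.&lt;/p&gt;

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    retryable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    max_retries: int = 3,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on retryable errors with exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except retryable:
            if attempt == max_retries:
                raise  # retries exhausted, surface the last error
            # The delay doubles on each attempt (0.5s, 1s, 2s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little jitter keeps many workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))
            attempt += 1


calls = {"count": 0}


def flaky_request() -> str:
    """Simulated node that drops the first two connections."""
    calls["count"] += 1
    if calls["count"] != 3:
        raise ConnectionResetError("node went offline")
    return "payload"


# The sleep function is replaced so the demo finishes instantly.
print(retry_with_backoff(flaky_request, sleep=lambda seconds: None))
```

&lt;p&gt;The built-in &lt;code&gt;ConnectionResetError&lt;/code&gt; is a subclass of &lt;code&gt;ConnectionError&lt;/code&gt;, so the default tuple already covers a node that dies mid-transfer.&lt;/p&gt;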

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# A robust pipeline will automatically retry on 502 Bad Gateway
# or 504 Gateway Timeout, which are common on residential networks.
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example-retail-site.com",
    "proxy_type": "residential",
    "retry_on_failure": true,
    "max_retries": 3
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If you are managing the proxies manually, wrap your HTTP calls in a retry block that catches connection resets and read timeouts (with &lt;code&gt;requests&lt;/code&gt;, these surface as &lt;code&gt;requests.exceptions.ConnectionError&lt;/code&gt; and &lt;code&gt;requests.exceptions.ReadTimeout&lt;/code&gt;). Opening a fresh connection to the backconnect gateway typically assigns a different residential node for the retry attempt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Advanced Routing: Mobile Proxies
&lt;/h2&gt;

&lt;p&gt;A subset of the residential proxy market includes mobile proxies. These route traffic specifically through 4G and 5G cellular devices.&lt;/p&gt;

&lt;p&gt;Mobile ISPs use Carrier-Grade NAT (CGNAT), which means thousands of legitimate consumer cell phones can share a single public IP address at the same time. Target servers cannot ban a mobile IP address without also blocking the real mobile users behind it. Mobile proxies therefore command the highest trust score available, but also the highest cost per gigabyte and the lowest bandwidth capacity.&lt;/p&gt;

&lt;p&gt;Reserve mobile proxies strictly for targets that use the most aggressive anti-bot countermeasures, such as native social networking applications or highly gated ticket queues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Offloading the Complexity
&lt;/h2&gt;

&lt;p&gt;Managing proxy pools, tracking IP bans, implementing waterfall fallback logic, and handling browser fingerprinting requires dedicated engineering resources. Target servers continually update their defense mechanisms. A proxy pool that yields a 99% success rate today might drop to 40% tomorrow if the target upgrades its WAF rules.&lt;/p&gt;

&lt;p&gt;If your core business is analyzing data rather than maintaining extraction infrastructure, use a managed API. Features like &lt;a href="https://alterlab.io/smart-rendering-api" rel="noopener noreferrer"&gt;anti-bot handling&lt;/a&gt; monitor target defense changes and automatically route requests through the appropriate network tier without manual intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Takeaways
&lt;/h2&gt;

&lt;p&gt;Select your proxy infrastructure based on the specific constraints of your target data source.&lt;/p&gt;

&lt;p&gt;If the data is highly protected, localized, or resides on platforms known for strict security, residential proxies are the practical choice. You must design your system to tolerate higher latency, handle dropped connections, and optimize bandwidth usage to control costs.&lt;/p&gt;

&lt;p&gt;If the data is generally accessible and scale is the primary objective, rotating datacenter proxies provide the speed and cost-efficiency required for high-throughput pipelines.&lt;/p&gt;

&lt;p&gt;Combine both using a waterfall approach, or use an API with dynamic routing to abstract the network layer entirely. Review the &lt;a href="https://alterlab.io/pricing" rel="noopener noreferrer"&gt;pricing plans&lt;/a&gt; to understand how different network types impact your data acquisition budget at scale.&lt;/p&gt;

</description>
      <category>proxies</category>
      <category>webscraping</category>
      <category>dataextraction</category>
      <category>antibot</category>
    </item>
    <item>
      <title>Airbnb Data API: Extract Structured JSON in 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 15:31:33 +0000</pubDate>
      <link>https://dev.to/alterlab/airbnb-data-api-extract-structured-json-in-2026-12ai</link>
      <guid>https://dev.to/alterlab/airbnb-data-api-extract-structured-json-in-2026-12ai</guid>
      <description>&lt;p&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To get structured Airbnb data via API, pass a target listing URL and a JSON schema to the AlterLab Extract API. The system handles the underlying access, parses the page using AI, and returns a typed JSON payload containing exactly the fields you requested. This eliminates the need for manual HTML parsing and CSS selector maintenance.&lt;/p&gt;

&lt;p&gt;For a full setup walk-through, see our &lt;a href="https://alterlab.io/docs/quickstart/installation" rel="noopener noreferrer"&gt;Getting started guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Airbnb data?
&lt;/h2&gt;

&lt;p&gt;Publicly available travel data powers various downstream applications and analytical models. Building a reliable Airbnb data API pipeline enables engineering teams to solve several high-value problems without manually gathering data.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Competitive Intelligence:&lt;/strong&gt; Travel agencies and property managers monitor local inventory, analyze pricing strategies, and identify market gaps. Tracking dynamic pricing algorithms requires consistent data feeds.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Market Analytics:&lt;/strong&gt; Real estate investors use historical pricing and occupancy indicators to evaluate potential investment properties. Aggregate data highlights seasonal trends and neighborhood profitability.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;AI Training and RAG Systems:&lt;/strong&gt; Large language models require structured, real-world data for travel planning applications. A reliable stream of JSON extraction from property listings feeds directly into vector databases for Retrieval-Augmented Generation workflows.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What data can you extract?
&lt;/h2&gt;

&lt;p&gt;With a schema-driven approach to Airbnb data, you can extract any information that is publicly visible on a listing page or search results page. Focus on fields that map cleanly to standard data types.&lt;/p&gt;

&lt;p&gt;Commonly requested travel data fields include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;property_name&lt;/code&gt; (String): The full title of the listing.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;price_per_night&lt;/code&gt; (Number): The base cost before fees.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;rating&lt;/code&gt; (Number): The aggregate user review score.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;location&lt;/code&gt; (String): The neighborhood or city descriptor.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;availability&lt;/code&gt; (Boolean/String): Indicators of booking status for specific dates.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;amenities&lt;/code&gt; (Array of Strings): Provided facilities like Wi-Fi, pool, or kitchen.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By treating the source page as a document and passing a schema, the extraction engine handles the mapping of visual elements to these specific data structures.&lt;/p&gt;
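
&lt;p&gt;To make the field list concrete, the sketch below checks an extracted payload against the expected Python types before it is stored. The &lt;code&gt;EXPECTED_TYPES&lt;/code&gt; mapping and the &lt;code&gt;type_errors&lt;/code&gt; helper are hypothetical names for this illustration and are not part of any SDK; the sample payload is made up.&lt;/p&gt;

```python
from typing import Any, Dict, List

# Expected Python types per field, mirroring the list above.
EXPECTED_TYPES: Dict[str, tuple] = {
    "property_name": (str,),
    "price_per_night": (int, float),
    "rating": (int, float),
    "location": (str,),
    "availability": (bool, str),
    "amenities": (list,),
}


def type_errors(payload: Dict[str, Any]) -> List[str]:
    """Return a readable problem for each missing or mistyped field."""
    problems = []
    for field, allowed in EXPECTED_TYPES.items():
        if field not in payload:
            problems.append(f"{field}: missing")
            continue
        value = payload[field]
        # bool is a subclass of int, so reject it for numeric fields explicitly.
        if isinstance(value, bool) and bool not in allowed:
            problems.append(f"{field}: unexpected boolean")
        elif not isinstance(value, allowed):
            problems.append(f"{field}: got {type(value).__name__}")
    return problems


sample = {
    "property_name": "Cozy Loft in Downtown",
    "price_per_night": "150",  # a string where a number is expected
    "rating": 4.95,
    "location": "Downtown, Seattle",
    "availability": "Available",
    "amenities": ["Wi-Fi", "Kitchen"],
}
print(type_errors(sample))
```

&lt;p&gt;A check like this is cheap to run on every response and catches a schema that asks for the wrong type before bad rows reach the database.&lt;/p&gt;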

&lt;h2&gt;
  
  
  The extraction approach
&lt;/h2&gt;

&lt;p&gt;Extracting Airbnb data manually using raw HTTP requests (like &lt;code&gt;curl&lt;/code&gt; or &lt;code&gt;requests&lt;/code&gt;) combined with HTML parsing (&lt;code&gt;BeautifulSoup&lt;/code&gt; or &lt;code&gt;Cheerio&lt;/code&gt;) is fragile. Complex frontend frameworks dynamically generate class names, meaning CSS selectors break frequently.&lt;/p&gt;

&lt;p&gt;When an interface updates, your extraction pipeline fails, requiring immediate engineering intervention. Furthermore, modern web applications implement significant bot mitigation strategies. Managing IP rotation, headless browser sessions, and CAPTCHA solving introduces massive operational overhead.&lt;/p&gt;

&lt;p&gt;A data API abstracts this complexity. Instead of writing parsing logic, you define the desired output structure. The extraction system handles the request execution, page rendering, and data mapping. This shifts the engineering focus from maintaining fragile scrapers to consuming typed JSON.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick start with AlterLab Extract API
&lt;/h2&gt;

&lt;p&gt;The quickest path to reliable Airbnb JSON extraction is the Extract API. You pass the target URL and the desired JSON schema, and the system returns validated data.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://alterlab.io/docs/api/extract" rel="noopener noreferrer"&gt;Extract API docs&lt;/a&gt; for full parameter references.&lt;/p&gt;

&lt;p&gt;Here is the primary implementation using Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

# JSON schema describing the fields to extract from the listing page
schema = {
  "type": "object",
  "properties": {
    "property_name": {
      "type": "string",
      "description": "The property name field"
    },
    "price_per_night": {
      "type": "string",
      "description": "The price per night field"
    },
    "rating": {
      "type": "string",
      "description": "The rating field"
    },
    "location": {
      "type": "string",
      "description": "The location field"
    },
    "availability": {
      "type": "string",
      "description": "The availability field"
    }
  }
}

result = client.extract(
    url="https://airbnb.com/example-page",
    schema=schema,
)
print(result.data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can also use cURL to test the endpoint directly from your terminal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://airbnb.com/example-page",
    "schema": {"properties": {"property_name": {"type": "string"}, "price_per_night": {"type": "string"}, "rating": {"type": "string"}}}
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Output example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "property_name": "Cozy Loft in Downtown",
  "price_per_night": "150",
  "rating": "4.95",
  "location": "Downtown, Seattle",
  "availability": "Available"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Define Schema&lt;/strong&gt;: Specify the fields you want as a JSON schema.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call Extract API&lt;/strong&gt;: POST the URL and schema to AlterLab.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Receive Typed JSON&lt;/strong&gt;: Get back validated, structured data with no parsing needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Define your schema
&lt;/h2&gt;

&lt;p&gt;The core advantage of this approach is schema-driven extraction. When you define a schema, you are instructing the underlying AI model exactly what data points matter and what format they must follow.&lt;/p&gt;

&lt;p&gt;If you request a number for &lt;code&gt;price_per_night&lt;/code&gt;, the system strips currency symbols and surrounding text, returning a clean float or integer. This eliminates the need for post-processing regex or string manipulation. You receive data that is immediately ready for insertion into a database. The quick-start example requests these fields as strings; declare &lt;code&gt;"type": "number"&lt;/code&gt; instead if you want numeric values.&lt;/p&gt;

&lt;p&gt;The schema acts as a contract. The system strictly adheres to the properties defined, ensuring that the resulting JSON payload is predictable, structured, and easy to validate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handle pagination and scale
&lt;/h2&gt;

&lt;p&gt;When building an Airbnb data extraction pipeline in Python, you rarely extract a single page. Processing search results and traversing paginated lists requires a robust approach to concurrency and scale.&lt;/p&gt;

&lt;p&gt;For high-volume workloads, synchronous requests become a bottleneck. Asynchronous batch processing keeps many requests in flight at once; cap the concurrency so you stay within downstream rate limits.&lt;/p&gt;

&lt;p&gt;Here is how you handle batch extraction for multiple URLs concurrently:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

import alterlab

client = alterlab.AsyncClient("YOUR_API_KEY")

async def extract_listings(urls, schema):
    tasks = []
    for url in urls:
        tasks.append(client.extract(url=url, schema=schema))

    # Execute all extraction tasks concurrently
    results = await asyncio.gather(*tasks, return_exceptions=True)

    # Keep successful results and skip any that raised an exception
    valid_data = []
    for res in results:
        if not isinstance(res, Exception):
            valid_data.append(res.data)

    return valid_data

urls = [
    "https://airbnb.com/example-page-1",
    "https://airbnb.com/example-page-2",
    "https://airbnb.com/example-page-3"
]

# Assuming 'schema' is defined as in the previous example
# data = asyncio.run(extract_listings(urls, schema))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
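
&lt;p&gt;Launching every task at once can exceed rate limits on large URL lists. One common way to bound concurrency is an &lt;code&gt;asyncio.Semaphore&lt;/code&gt;. The sketch below is a generic illustration in which &lt;code&gt;run_bounded&lt;/code&gt; and the simulated &lt;code&gt;fake_extract&lt;/code&gt; coroutine are hypothetical stand-ins, not SDK functions.&lt;/p&gt;

```python
import asyncio
from typing import Awaitable, Callable, List


async def run_bounded(
    urls: List[str],
    worker: Callable[[str], Awaitable[dict]],
    limit: int = 5,
) -> list:
    """Run worker(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url: str) -> dict:
        async with semaphore:  # waits here while `limit` calls are running
            return await worker(url)

    # return_exceptions=True keeps one failure from aborting the whole batch.
    return await asyncio.gather(*(guarded(u) for u in urls), return_exceptions=True)


async def fake_extract(url: str) -> dict:
    """Stand-in for a real extraction call."""
    await asyncio.sleep(0.01)
    return {"url": url}


urls = [f"https://airbnb.com/example-page-{n}" for n in range(1, 8)]
results = asyncio.run(run_bounded(urls, fake_extract, limit=3))
print(len(results))
```

&lt;p&gt;Results come back in the same order as the input URLs, so they can be zipped with the original list when storing them.&lt;/p&gt;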



&lt;p&gt;To estimate the cost of scaling your pipeline, refer to the &lt;a href="https://alterlab.io/pricing" rel="noopener noreferrer"&gt;AlterLab pricing&lt;/a&gt; page. Structuring your architecture around async batching provides the most cost-effective path to high-throughput data retrieval.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Retrieving structured data from complex web interfaces does not require maintaining brittle parsing scripts. By utilizing a schema-driven extraction approach, engineering teams can build reliable, scalable pipelines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Avoid HTML Parsing:&lt;/strong&gt; Focus on schemas, not CSS selectors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Embrace Typed JSON:&lt;/strong&gt; Ensure data is ready for immediate database insertion.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Scale Asynchronously:&lt;/strong&gt; Use concurrent processing for large-scale travel data API requirements.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deploying an Airbnb data API pipeline using an extraction system dramatically reduces maintenance overhead and accelerates the delivery of accurate, structured data to downstream applications.&lt;/p&gt;

</description>
      <category>dataextraction</category>
      <category>api</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>How to Scrape Reddit Data with Python in 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 14:31:33 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-scrape-reddit-data-with-python-in-2026-2ea7</link>
      <guid>https://dev.to/alterlab/how-to-scrape-reddit-data-with-python-in-2026-2ea7</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To scrape Reddit data, skip raw HTTP requests and use a specialized scraping API or headless browser that handles dynamic rendering and rate limits. For the most resilient setup, send the target Reddit URL to AlterLab's API, which automatically manages proxies and returns the public JSON or HTML, then parse the response using Python's &lt;code&gt;json&lt;/code&gt; or &lt;code&gt;BeautifulSoup&lt;/code&gt; libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why collect social data from Reddit?
&lt;/h2&gt;

&lt;p&gt;Reddit is an aggregation of specialized communities. Extracting public posts and comments provides direct access to unfiltered consumer sentiment, technical discussions, and emerging trends. Engineering and data teams typically scrape Reddit for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Market Research and Sentiment Analysis&lt;/strong&gt;: Tracking brand mentions, product feedback, and public opinion across niche subreddits (e.g., tracking &lt;code&gt;r/MachineLearning&lt;/code&gt; for new paper discussions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Competitor Monitoring&lt;/strong&gt;: Observing public complaints or feature requests directed at competitor products to identify market gaps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training LLMs and AI Models&lt;/strong&gt;: Collecting structured conversational data, Q&amp;amp;A pairs, and human reasoning chains to fine-tune specialized language models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical challenges
&lt;/h2&gt;

&lt;p&gt;Extracting data from Reddit presents specific infrastructure challenges. While Reddit offers an official API, it imposes strict rate limits and data access restrictions that may not suit all analytical workloads. When falling back to web scraping public pages, you will encounter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Rendering&lt;/strong&gt;: Modern Reddit relies heavily on client-side rendering (React). A standard &lt;code&gt;requests.get()&lt;/code&gt; call will often return an empty application shell. Extracting the actual post content requires executing JavaScript.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate Limiting&lt;/strong&gt;: Reddit aggressively throttles rapid requests from the same IP address. Attempting concurrent scraping without a distributed proxy network will quickly result in HTTP 429 (Too Many Requests) errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UI Fragmentation&lt;/strong&gt;: Reddit maintains multiple frontend versions (&lt;code&gt;old.reddit.com&lt;/code&gt;, &lt;code&gt;new.reddit.com&lt;/code&gt;, &lt;code&gt;sh.reddit.com&lt;/code&gt;). Selectors constantly shift, meaning static HTML parsing often breaks.&lt;/p&gt;

&lt;p&gt;To handle dynamic React apps without managing infrastructure, developers use tools like AlterLab's &lt;a href="https://alterlab.io/smart-rendering-api" rel="noopener noreferrer"&gt;Smart Rendering API&lt;/a&gt;, which automatically executes JavaScript and waits for network idle states before returning the fully rendered DOM.&lt;/p&gt;
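
&lt;p&gt;On the client side, the simplest defence against HTTP 429 responses is to space requests out. The sketch below is a minimal fixed-interval throttle. The &lt;code&gt;Throttle&lt;/code&gt; class is a hypothetical helper for illustration, and the right interval depends on the target and on your own testing.&lt;/p&gt;

```python
import time
from typing import Callable, Optional


class Throttle:
    """Enforce a minimum gap between consecutive calls to wait()."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None  # no request sent yet

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        pause = 0.0
        if self._next_allowed is not None:
            pause = max(0.0, self._next_allowed - now)
        if pause:
            self._sleep(pause)
        # Schedule the next slot relative to when this request actually starts.
        self._next_allowed = now + pause + self.min_interval
        return pause


throttle = Throttle(min_interval=0.2)
for subreddit in ("Python", "webscraping", "dataengineering"):
    throttle.wait()  # call before each request
    print(f"requesting r/{subreddit}")
```

&lt;p&gt;The clock and sleep functions are injectable so the throttle can be unit-tested without real delays.&lt;/p&gt;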

&lt;h2&gt;
  
  
  Quick start with AlterLab API
&lt;/h2&gt;

&lt;p&gt;The most reliable way to scrape Reddit is by offloading the browser management and IP rotation. AlterLab provides a unified API to handle this.&lt;/p&gt;

&lt;p&gt;First, check out the &lt;a href="https://alterlab.io/docs/quickstart/installation" rel="noopener noreferrer"&gt;Getting started guide&lt;/a&gt; to set up your environment, then install the Python SDK.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install alterlab
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You can target any public page, such as a subreddit listing. Here is how to execute a basic scrape.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Target a public subreddit page
response = client.scrape(
    url="https://www.reddit.com/r/webscraping/new/",
    render_js=True,
    wait_for=".Post" # Wait for post elements to load
)

print(f"Status: {response.status_code}")
print(f"HTML Length: {len(response.text)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you prefer operating from the terminal or using different languages, the REST API works directly via cURL:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://www.reddit.com/r/webscraping/new/", "render_js": true}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Extracting structured data
&lt;/h2&gt;

&lt;p&gt;Reddit's HTML structure is complex and changes frequently. Two alternatives avoid selector-based parsing: Reddit often embeds the initial state of the page in a &lt;code&gt;&amp;lt;script&amp;gt;&lt;/code&gt; tag, and appending &lt;code&gt;.json&lt;/code&gt; to a public Reddit URL returns the data in a structured format without any HTML to parse.&lt;/p&gt;

&lt;p&gt;If you are scraping the &lt;code&gt;.json&lt;/code&gt; endpoint, the parsing logic is straightforward.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Appending .json to the URL returns structured data
response = client.scrape(
    url="https://www.reddit.com/r/webscraping/new.json",
    render_js=False # No JS rendering needed for raw JSON
)

data = response.json()
# A listing wraps its posts in data -&amp;gt; children, each with its own data dict
posts = data['data']['children']

for post in posts[:5]:
    post_data = post['data']
    print(f"Title: {post_data.get('title')}")
    print(f"Author: {post_data.get('author')}")
    print(f"Score: {post_data.get('score')}")
    print("---")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you need to parse the actual rendered HTML (for example, if the JSON endpoint is heavily rate-limited for your specific IP range), use BeautifulSoup with resilient selectors.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import alterlab
from bs4 import BeautifulSoup

client = alterlab.Client("YOUR_API_KEY")
response = client.scrape(
    url="https://old.reddit.com/r/webscraping/",
    render_js=True
)

soup = BeautifulSoup(response.text, 'html.parser')

# Targeting old.reddit.com is often easier for static parsing
posts = soup.select('div.thing')

for post in posts[:5]:
    title_elem = post.select_one('p.title a.title')
    if title_elem:
        print(title_elem.text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Target URL&lt;/strong&gt;: Identify the public subreddit or post URL. Append &lt;code&gt;.json&lt;/code&gt; if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Route via API&lt;/strong&gt;: Send the request through AlterLab to handle IP rotation and rendering.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extract Content&lt;/strong&gt;: Parse the returned JSON payload or target HTML elements.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Best practices
&lt;/h2&gt;

&lt;p&gt;When you scrape Reddit, build your pipelines for resilience and compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Respect robots.txt&lt;/strong&gt;: Always check &lt;code&gt;https://www.reddit.com/robots.txt&lt;/code&gt; before deploying a crawler. Do not target endpoints or directories that are explicitly disallowed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implement Rate Limiting&lt;/strong&gt;: Even when using a distributed network, avoid sending massive bursts of traffic. Limit the number of concurrent requests and add delays between them to respect the platform's infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Target &lt;code&gt;old.reddit.com&lt;/code&gt; or &lt;code&gt;.json&lt;/code&gt;&lt;/strong&gt;: The modern React frontend is heavy and changes constantly. &lt;code&gt;old.reddit.com&lt;/code&gt; uses server-side rendered HTML with stable CSS classes. The &lt;code&gt;.json&lt;/code&gt; extension method skips HTML entirely, reducing bandwidth and parsing complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handle Pagination&lt;/strong&gt;: Reddit uses cursor-based pagination (&lt;code&gt;after&lt;/code&gt; and &lt;code&gt;before&lt;/code&gt; tokens). Extract the &lt;code&gt;after&lt;/code&gt; token from your JSON response and append it to your next request URL (&lt;code&gt;?after=TOKEN&lt;/code&gt;) to traverse public historical data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling up
&lt;/h2&gt;

&lt;p&gt;When moving from a single script to a production data pipeline, infrastructure management becomes the primary bottleneck. Scraping thousands of subreddits requires managing proxy pools, handling retries, and storing large volumes of data.&lt;/p&gt;

&lt;p&gt;To scale effectively, use batch processing.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import alterlab

client = alterlab.Client("YOUR_API_KEY")

urls = [
    "https://www.reddit.com/r/Python/new.json",
    "https://www.reddit.com/r/webscraping/new.json",
    "https://www.reddit.com/r/dataengineering/new.json"
]

# AlterLab handles concurrent execution and proxy rotation natively
results = client.scrape_batch(urls, render_js=False, max_concurrency=10)

for result in results:
    if result.success:
        print(f"Successfully scraped {result.url}")
    else:
        print(f"Failed: {result.error}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
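
&lt;p&gt;To go beyond the first page of each listing, the cursor-based pagination described under best practices can be sketched as follows. &lt;code&gt;iter_listing&lt;/code&gt; is a hypothetical helper, and &lt;code&gt;fetch_json&lt;/code&gt; is any callable that takes a URL and returns the parsed listing JSON (for example, a thin wrapper around your scraping client). The in-memory &lt;code&gt;fake_fetch&lt;/code&gt; exists only so the sketch runs offline.&lt;/p&gt;

```python
from typing import Callable, Dict, Iterator, Optional


def iter_listing(
    base_url: str,
    fetch_json: Callable[[str], dict],
    max_pages: int = 3,
) -> Iterator[dict]:
    """Yield post dicts page by page, following the listing's `after` cursor."""
    after: Optional[str] = None
    for _ in range(max_pages):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch_json(url)["data"]
        for child in listing["children"]:
            yield child["data"]
        after = listing.get("after")
        if not after:  # a null cursor means there are no more pages
            break


# Two canned pages standing in for real responses.
BASE = "https://www.reddit.com/r/webscraping/new.json"
PAGES: Dict[str, dict] = {
    BASE: {"data": {"children": [{"data": {"title": "first"}}], "after": "t3_abc"}},
    BASE + "?after=t3_abc": {
        "data": {"children": [{"data": {"title": "second"}}], "after": None}
    },
}


def fake_fetch(url: str) -> dict:
    """Offline stand-in that serves the canned pages."""
    return PAGES[url]


print([post["title"] for post in iter_listing(BASE, fake_fetch)])
```

&lt;p&gt;The &lt;code&gt;max_pages&lt;/code&gt; cap is a deliberate safety limit so that a long listing cannot turn one call into an unbounded crawl.&lt;/p&gt;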



&lt;p&gt;Managing your own proxy infrastructure for this volume quickly becomes a full-time job. Review &lt;a href="https://alterlab.io/pricing" rel="noopener noreferrer"&gt;AlterLab pricing&lt;/a&gt; to understand how offloading this infrastructure provides a predictable cost model for enterprise scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Scraping public Reddit data provides valuable insights for market research and AI training. Handling dynamic rendering and rate limiting requires specific strategies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Target &lt;code&gt;.json&lt;/code&gt; endpoints or &lt;code&gt;old.reddit.com&lt;/code&gt; for more stable, easier-to-parse data structures.&lt;/li&gt;
&lt;li&gt;Comply with &lt;code&gt;robots.txt&lt;/code&gt; and implement sensible rate limits to ensure sustainable data access.&lt;/li&gt;
&lt;li&gt;Use specialized infrastructure like AlterLab to handle JavaScript execution, proxy rotation, and concurrency, allowing your engineering team to focus on data processing rather than browser management.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>proxies</category>
      <category>python</category>
      <category>dataextraction</category>
      <category>api</category>
    </item>
    <item>
      <title>How to Scrape Booking.com Data: Complete Guide for 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 14:31:33 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-scrape-bookingcom-data-complete-guide-for-2026-3dnh</link>
      <guid>https://dev.to/alterlab/how-to-scrape-bookingcom-data-complete-guide-for-2026-3dnh</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To scrape Booking.com, you need a system capable of executing JavaScript and routing requests through diverse IP pools to load dynamic content. You can send requests with browser rendering enabled to fetch fully populated HTML layouts, then parse the response using Python tools like BeautifulSoup. Always respect rate limits, target strictly public inventory data, and adhere to site guidelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why collect travel data from Booking.com?
&lt;/h2&gt;

&lt;p&gt;Booking.com hosts one of the largest publicly visible inventories of global accommodations. Data engineers and analysts build pipelines targeting this data for specific operational reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Market Research&lt;/strong&gt;&lt;br&gt;
Travel aggregators and hospitality groups track regional availability trends. Monitoring public hotel listings allows analysts to model seasonal demand curves. You can correlate hotel density in specific zip codes with upcoming local events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Price Monitoring&lt;/strong&gt;&lt;br&gt;
Hotels dynamically adjust rates based on occupancy and local demand. Revenue managers extract public pricing from local competitors to benchmark their own pricing strategies. Tracking these adjustments over time reveals the underlying logic of local market fluctuations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Analysis&lt;/strong&gt;&lt;br&gt;
Researchers compile datasets on review scores, amenity offerings, and property types. This structured data feeds into machine learning models that predict neighborhood gentrification, tourism recovery after major incidents, or shifts in consumer preference toward specific property types like short-term rentals.&lt;/p&gt;


  
  

&lt;h2&gt;
  
  
  Technical challenges
&lt;/h2&gt;

&lt;p&gt;Extracting data from major travel platforms requires solving infrastructure problems. Booking.com does not serve a static HTML document containing all visible data. The initial HTTP response contains skeleton structures. The actual property prices, availability, and review snippets load asynchronously via JavaScript. &lt;/p&gt;

&lt;p&gt;Standard HTTP clients like the Python &lt;code&gt;requests&lt;/code&gt; library or basic &lt;code&gt;curl&lt;/code&gt; commands will only retrieve this unpopulated skeleton. To see the data a user sees, your scraper must execute the JavaScript payload.&lt;/p&gt;

&lt;p&gt;Beyond rendering, travel sites deploy layered bot defenses. They profile incoming requests based on TLS fingerprints (like JA3/JA4 hashes). If the TLS handshake matches a known Python library rather than a standard Chrome browser, the server may drop or challenge the connection. They also monitor IP reputation, request velocity, and HTTP header order.&lt;/p&gt;

&lt;p&gt;To handle these layers reliably, developers deploy clusters of headless browsers routed through proxy networks. Managing Chrome instances at scale introduces massive memory overhead and maintenance burdens. Using managed infrastructure like AlterLab's &lt;a href="https://dev.to/smart-rendering-api"&gt;Smart Rendering API&lt;/a&gt; shifts this execution layer off your servers.&lt;/p&gt;


  
  
  

&lt;h2&gt;
  
  
  Quick start with AlterLab API
&lt;/h2&gt;

&lt;p&gt;You can bypass the infrastructure setup by relying on an established extraction API. Ensure you have reviewed the &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt; to set up your environment variables.&lt;/p&gt;

&lt;p&gt;Below are examples of fetching a public property page. We enable JavaScript rendering to ensure the pricing data populates before the API returns the HTML.&lt;/p&gt;
&lt;h3&gt;
  
  
  Python Example
&lt;/h3&gt;

&lt;p&gt;Use the official Python SDK. This approach abstracts the HTTP requests and handles automatic retries.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="scrape_booking.py" {4-6}&lt;/p&gt;

&lt;p&gt;import os&lt;br&gt;
import alterlab&lt;/p&gt;

&lt;p&gt;client = alterlab.Client(os.environ.get("ALTERLAB_API_KEY"))&lt;/p&gt;

&lt;p&gt;response = client.scrape(&lt;br&gt;
    "https://www.booking.com/hotel/us/example-public-listing.html",&lt;br&gt;
    render_js=True,&lt;br&gt;
    wait_for=".prco-valign-middle-helper"&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;print(f"Status: {response.status_code}")&lt;br&gt;
print(f"HTML Length: {len(response.text)}")&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


### Node.js Example

If your pipeline runs in a TypeScript or Node environment, the integration follows a similar pattern.



```javascript title="scrapeBooking.js" {6-9}
const AlterLab = require('alterlab');

const client = new AlterLab.Client(process.env.ALTERLAB_API_KEY);

async function fetchPublicData() {
  const response = await client.scrape('https://www.booking.com/hotel/us/example-public-listing.html', {
    renderJs: true,
    waitFor: '.prco-valign-middle-helper'
  });

  console.log(`Retrieved ${response.text.length} characters of HTML`);
}

fetchPublicData();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  cURL Example
&lt;/h3&gt;

&lt;p&gt;For shell scripts or isolated testing, call the REST endpoint directly.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```bash title="Terminal"&lt;br&gt;
curl -X POST https://api.alterlab.io/v1/scrape \&lt;br&gt;
  -H "X-API-Key: YOUR_KEY" \&lt;br&gt;
  -H "Content-Type: application/json" \&lt;br&gt;
  -d '{&lt;br&gt;
    "url": "https://www.booking.com/hotel/us/example-public-listing.html",&lt;br&gt;
    "render_js": true,&lt;br&gt;
    "wait_for": ".prco-valign-middle-helper"&lt;br&gt;
  }'&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;



## Extracting structured data

Once you retrieve the fully rendered HTML, you must parse it. Booking.com frequently updates its CSS classes. Relying on utility classes (like `.bui-price-display__value`) results in fragile scrapers that break during minor site updates.

Instead, target structural data attributes. Site developers add `data-testid` attributes to support their own automated tests, so these attributes change less frequently than styling classes.

Here is how to extract core public data points using Python and BeautifulSoup.



```python title="parser.py" {11-13,18-20}
from bs4 import BeautifulSoup

def parse_property_data(html_content):
    soup = BeautifulSoup(html_content, "html.parser")

    # Extract property name
    name_element = soup.find("h2", {"class": "pp-header__title"})
    hotel_name = name_element.text.strip() if name_element else "Unknown"

    # Extract review score
    score_element = soup.find("div", {"data-testid": "review-score-component"})
    score_text = score_element.text.strip() if score_element else "No score"

    # Extract price
    # The wait_for parameter in our scrape call ensured this element exists
    price_element = soup.find("span", {"class": "prco-valign-middle-helper"})
    price = price_element.text.strip() if price_element else "Price unavailable"

    return {
        "hotel_name": hotel_name,
        "score": score_text,
        "price": price
    }

# Assuming `response.text` from the previous script
data = parse_property_data(response.text)
print(data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Travel sites inject structured JSON-LD data into the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; of the document for search engine indexing. This JSON object often contains the cleanest, most reliable property information. You can parse this directly instead of writing CSS selectors.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="parse_jsonld.py" {5-8}&lt;/p&gt;

&lt;p&gt;import json&lt;br&gt;
from bs4 import BeautifulSoup&lt;/p&gt;

&lt;p&gt;def extract_schema_data(html_content):&lt;br&gt;
    soup = BeautifulSoup(html_content, "html.parser")&lt;br&gt;
    schema_script = soup.find("script", type="application/ld+json")&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    if schema_script:
        try:
            data = json.loads(schema_script.string)
            return data
        except json.JSONDecodeError:
            return None
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


## Best practices

Building a durable pipeline requires defensive programming and respect for target infrastructure. 

### Respect robots.txt
Always check `https://www.booking.com/robots.txt` before deploying a crawler. Do not target paths disallowed by the site operators. Limit your scraping strictly to publicly accessible search result pages and property listings.
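
A minimal sketch of that check, using Python's standard `urllib.robotparser`. The rules in `SAMPLE_ROBOTS` are hypothetical and only exist so the example runs offline; they are not Booking.com's actual file, and the user agent name `my-crawler` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def is_allowed(parser, url, user_agent="my-crawler"):
    """Return True when the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS)
print(is_allowed(parser, "https://example.com/hotel/listing.html"))
print(is_allowed(parser, "https://example.com/private/report.html"))
```

In a real crawler you would fetch the live file once per run, parse it the same way, and skip any URL for which `is_allowed` returns `False`.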

### Implement rate limiting
Do not flood the target server. Introduce randomized delays between requests. If you are scraping a list of 500 URLs, distribute those requests over several hours rather than executing them concurrently. Aggressive concurrency triggers security thresholds and results in IP bans.
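
One way to sketch the randomized delays described above is to compute a jittered schedule up front. The function name `jittered_schedule`, the three-hour window, and the 50% jitter are illustrative assumptions, not recommended values.

```python
import random


def jittered_schedule(url_count, window_seconds, jitter=0.5, seed=None):
    """Return per-request delays that spread url_count requests over a window.

    Each delay is the average spacing scaled by a random factor in
    [1 - jitter, 1 + jitter], so requests do not arrive on a fixed beat.
    """
    if url_count == 0:
        return []
    rng = random.Random(seed)
    base = window_seconds / url_count
    return [base * rng.uniform(1 - jitter, 1 + jitter) for _ in range(url_count)]


# 500 URLs spread over roughly three hours (an illustrative window).
delays = jittered_schedule(500, 3 * 60 * 60, seed=42)
print(f"average delay: {sum(delays) / len(delays):.1f}s")
```

Sleep for the next delay in the list (for example with `time.sleep`) before sending each request.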

### Handle dynamic parameters
Booking.com URLs contain numerous tracking parameters. Clean your URLs before scraping to normalize your dataset. Query parameters such as `checkin=2026-10-01&amp;amp;checkout=2026-10-05` are essential because they determine the prices shown, but `label=...` is a tracking label and `sid=...` is a session identifier. Strip both to avoid cache misses and tracking anomalies.
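
A small sketch of that cleanup with the standard `urllib.parse` module. The set `STRIP_PARAMS` only lists the two parameters named above, and the `label` and `sid` values are made up for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameters named in the text above; extend this set for your own dataset.
STRIP_PARAMS = {"label", "sid"}


def clean_url(url):
    """Drop tracking and session parameters while keeping the rest in order."""
    parts = urlsplit(url)
    kept = [(key, value) for key, value in parse_qsl(parts.query) if key not in STRIP_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Build the example URL with urlencode so the query string is explicit.
raw_query = urlencode({
    "checkin": "2026-10-01",
    "checkout": "2026-10-05",
    "label": "example-label",
    "sid": "example-session",
})
raw_url = "https://www.booking.com/hotel/us/example-public-listing.html?" + raw_query
print(clean_url(raw_url))
```

Running every URL through `clean_url` before queuing it also makes deduplication straightforward, since equivalent listings collapse to the same string.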

### Validate extracted data
DOM structures change. Implement validation logic. If your parser returns no price (`None` or a placeholder value) on 10 consecutive requests, pause the pipeline and trigger an alert. Do not insert null values into your database silently.
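
The consecutive-failure rule can be sketched as a small counter. The class name `PriceMonitor` is hypothetical, and the example uses a threshold of 3 rather than 10 only to keep the demonstration short.

```python
class PriceMonitor:
    """Track consecutive missing prices and signal when to pause."""

    def __init__(self, max_consecutive_failures=10):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record(self, price):
        """Record one parse result; return True if the pipeline should pause."""
        if price is None:
            self.consecutive_failures += 1
        else:
            self.consecutive_failures = 0  # Any valid price resets the streak.
        return self.consecutive_failures >= self.max_consecutive_failures


monitor = PriceMonitor(max_consecutive_failures=3)
for parsed_price in ["$120", None, None, None]:
    if monitor.record(parsed_price):
        print("Selector likely broken: pause the pipeline and alert")
```

Map placeholder strings to `None` before calling `record`, so that a fallback value is never counted as a valid price.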

## Scaling up

When moving from a local script to a production pipeline, architecture matters. A single machine running a Python loop will bottleneck quickly.

### Batch requests and queues
Deploy a message broker like RabbitMQ or Redis. Push your target URLs into a queue. Deploy worker nodes that pull URLs from the queue, execute the scrape, and write the payload to an object store (like AWS S3). Decoupling the extraction from the processing prevents pipeline crashes if the database goes down.
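
The queue-and-worker pattern can be sketched with the standard library alone. Here `queue.Queue` stands in for RabbitMQ or Redis and a dictionary stands in for the object store; `fake_scrape` and `fake_store` are placeholders for the real scrape call and storage write.

```python
import queue
import threading


def run_workers(urls, scrape, store, worker_count=4):
    """Push URLs onto a queue and let worker threads drain it."""
    jobs = queue.Queue()
    for url in urls:
        jobs.put(url)

    def worker():
        while True:
            try:
                url = jobs.get_nowait()
            except queue.Empty:
                return  # Queue drained: this worker is done.
            try:
                store(url, scrape(url))
            except Exception as exc:  # Keep one bad URL from killing the worker.
                store(url, {"error": str(exc)})
            finally:
                jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


# Stand-ins for the real scrape call and the object store write.
results = {}
lock = threading.Lock()


def fake_scrape(url):
    return {"url": url, "html_length": len(url)}


def fake_store(url, payload):
    with lock:
        results[url] = payload


run_workers([f"https://example.com/page/{n}" for n in range(10)], fake_scrape, fake_store)
print(f"stored {len(results)} payloads")
```

Because workers only write raw payloads to storage, a separate process can parse and load them later, which is the decoupling described above.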

### Webhook delivery
Polling an API for results wastes compute cycles. Configure webhooks. Submit a batch of 100 URLs to your scraping API and provide a callback URL. The API processes the URLs asynchronously and POSTs the extracted JSON back to your server as each job completes.
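
On the receiving side, a webhook handler can be sketched as a plain function that a web framework calls with the request body. The payload shape used here (a JSON object with `url` and `data` keys) is an assumption for illustration; check your provider's documentation for the actual delivery format.

```python
import json


def handle_webhook(body, sink):
    """Parse one webhook delivery and hand the result to a storage callback.

    Returns an HTTP-style status code so a web framework can relay it.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400  # Malformed body: ask the sender to fix or retry.

    # Assumption: each delivery carries the source URL and extracted data.
    if not isinstance(payload, dict) or "url" not in payload or "data" not in payload:
        return 422
    sink(payload["url"], payload["data"])
    return 200


stored = {}
status = handle_webhook('{"url": "https://example.com/a", "data": {"price": "120"}}', stored.__setitem__)
print(status, stored)
```

Keeping the handler free of framework code makes it easy to test and lets you swap the `sink` callback for a queue or database writer.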

### Cost optimization
Running headless Chrome for every request is expensive. Use standard HTTP requests for simple sites, and escalate to JavaScript rendering only for dynamic travel pages. [AlterLab pricing](/pricing) scales with your throughput, so routing requests by target domain lets you control costs.
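
Routing by target domain can be as simple as a lookup before each request. This is a sketch under the assumption that you maintain your own list of domains that need rendering; `RENDERED_DOMAINS` and `needs_js_rendering` are illustrative names.

```python
from urllib.parse import urlsplit

# Illustrative list: domains known to need a headless browser.
RENDERED_DOMAINS = {"booking.com"}


def needs_js_rendering(url):
    """Return True when the URL's host is, or is a subdomain of, a rendered domain."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in RENDERED_DOMAINS)


targets = [
    "https://www.booking.com/hotel/us/example-public-listing.html",
    "https://example.com/static-page.html",
]
for target in targets:
    mode = "render_js=True" if needs_js_rendering(target) else "render_js=False"
    print(f"{target}: {mode}")
```

The boolean can be passed straight into the `render_js` argument used in the scrape calls earlier in this guide.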

## Key takeaways

1.  Standard HTTP clients cannot retrieve dynamic travel pricing. You must render JavaScript.
2.  Use structural attributes like `data-testid` or embedded JSON-LD scripts for reliable parsing.
3.  Strip session parameters from URLs before execution.
4.  Implement strict rate limiting and stagger your requests to avoid flooding servers.
5.  Offload browser infrastructure to an API to focus on data engineering rather than server maintenance.
6.  Extract only publicly visible information and respect the operational guidelines of the target platform.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>python</category>
      <category>dataextraction</category>
      <category>automation</category>
      <category>headlessbrowsers</category>
    </item>
    <item>
      <title>How to Give Your AI Agent Access to Trustpilot Data</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 12:00:24 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-give-your-ai-agent-access-to-trustpilot-data-3o91</link>
      <guid>https://dev.to/alterlab/how-to-give-your-ai-agent-access-to-trustpilot-data-3o91</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To give your AI agent access to Trustpilot data, connect it to an extraction API that handles headless browsing and anti-bot systems automatically. By defining a strict JSON schema, you convert unstructured review pages into clean data arrays ready for immediate insertion into your LLM context window. This eliminates token waste and prevents pipeline failures caused by rate limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Agents Need Trustpilot Data
&lt;/h2&gt;

&lt;p&gt;Agents require live context to make accurate decisions. Connecting them to public review platforms unlocks several core autonomous use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reputation Monitoring&lt;/strong&gt;&lt;br&gt;
Autonomous agents track brand sentiment continuously. They pull the latest reviews, classify the core complaints, and alert human engineering teams when technical issues arise in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Competitor Tracking&lt;/strong&gt;&lt;br&gt;
Retrieval-Augmented Generation (RAG) pipelines ingest competitor feedback. Product managers can query their internal knowledge base to discover exactly what features users dislike about competing tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated Support Triage&lt;/strong&gt;&lt;br&gt;
Agents read incoming reviews instantly. They cross-reference the stated problems with internal documentation and draft personalized, context-aware responses for your support team to approve.&lt;/p&gt;


  
  
  

&lt;h2&gt;
  
  
  Why Raw HTTP Requests Fail for Agents
&lt;/h2&gt;

&lt;p&gt;Giving an LLM access to the internet via standard HTTP libraries causes immediate pipeline degradation. Websites deploy heavy countermeasures against automated access.&lt;/p&gt;

&lt;p&gt;Standard &lt;code&gt;requests.get()&lt;/code&gt; calls often fail. Many sites block unrecognized user agents. Even if you spoof headers, datacenter IP addresses can trigger CAPTCHA challenges. Your agent then receives an HTML page containing a security challenge instead of the requested data.&lt;/p&gt;

&lt;p&gt;Token waste presents a larger architectural problem. A standard Trustpilot page contains megabytes of DOM elements, inline CSS, and tracking scripts. Feeding raw HTML into an LLM context window burns token budget rapidly. It also severely limits the number of reviews the model can analyze simultaneously. Dense, unparsed HTML increases hallucination rates because the model struggles to isolate the actual review text from the surrounding noise.&lt;/p&gt;
&lt;h2&gt;
  
  
  Connecting Your Agent to Trustpilot via AlterLab
&lt;/h2&gt;

&lt;p&gt;You need a middleware layer that translates unstructured web pages into strict JSON. AlterLab provides this layer. Read our &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt; for initial environment setup.&lt;/p&gt;

&lt;p&gt;For LLM workflows, the &lt;a href="https://dev.to/docs/extract"&gt;Extract API docs&lt;/a&gt; detail the optimal approach. Instead of returning HTML, the API uses a headless browser to render the page, solves any bot challenges, and extracts exactly the data defined in your JSON schema. &lt;/p&gt;

&lt;p&gt;Here is how to implement the extraction tool in Python.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="agent_trustpilot_tool.py" {8-17}&lt;/p&gt;

&lt;p&gt;import json&lt;br&gt;
import alterlab&lt;/p&gt;

&lt;p&gt;client = alterlab.Client("YOUR_API_KEY")&lt;/p&gt;

&lt;p&gt;def get_trustpilot_reviews(url: str) -&amp;gt; str:&lt;br&gt;
    """Tool for the agent to fetch structured review data."""&lt;br&gt;
    schema = {&lt;br&gt;
        "company_name": "string",&lt;br&gt;
        "overall_rating": "number",&lt;br&gt;
        "reviews": [{&lt;br&gt;
            "author": "string",&lt;br&gt;
            "rating": "number",&lt;br&gt;
            "date": "string",&lt;br&gt;
            "text": "string",&lt;br&gt;
            "helpful_votes": "number"&lt;br&gt;
        }]&lt;br&gt;
    }&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    result = client.extract(
        url=url,
        schema=schema,
        min_tier=3  # Force JS rendering for dynamic review loading
    )

    # Return compact JSON string to save agent token budget
    return json.dumps(result.data, separators=(',', ':'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;# Example usage by the agent&lt;br&gt;
extracted_data = get_trustpilot_reviews("https://www.trustpilot.com/review/example.com")&lt;br&gt;
print(extracted_data)&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


You can test this pipeline directly from your terminal to verify the structured output format before integrating it into your agent's tool registry.



```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.trustpilot.com/review/example.com",
    "min_tier": 3,
    "schema": {
      "company_name": "string",
      "reviews": [{"rating": "number", "text": "string"}]
    }
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using the Search API for Trustpilot Queries
&lt;/h2&gt;

&lt;p&gt;Agents rarely know the exact Trustpilot URL for a given company. A robust agentic workflow requires a two-step process. First, the agent searches for the company profile. Second, the agent extracts the reviews from the located profile.&lt;/p&gt;

&lt;p&gt;The Search API handles the discovery phase. It executes a query on the target site and returns a structured list of results. Your agent can evaluate the results, select the correct URL, and proceed with extraction.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="search_tool.py" {7-10}&lt;br&gt;
import json&lt;br&gt;
import alterlab&lt;/p&gt;

&lt;p&gt;def find_trustpilot_profile(company_name: str) -&amp;gt; str:&lt;br&gt;
    """Tool for the agent to locate a company's Trustpilot URL."""&lt;br&gt;
    client = alterlab.Client("YOUR_API_KEY")&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    query = f"site:trustpilot.com {company_name}"

    result = client.search(
        query=query,
        num_results=3
    )

    return json.dumps([
        {"title": r.title, "url": r.url}
        for r in result.results
    ])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


## MCP Integration

Building custom tools requires writing boilerplate code for every new LLM framework. The Model Context Protocol (MCP) standardizes how agents interact with external tools. 

Instead of writing wrapper functions, you can connect your agent directly to the web using our official MCP server. This allows AI assistants like Claude, Cursor, or custom LangChain agents to natively call extraction commands. Read the complete setup instructions in the [AlterLab for AI Agents](https://alterlab.io/docs/tutorials/ai-agent) documentation.

1.  **Agent executes tool call**: LLM decides it needs review data and calls the Extract MCP tool with the target URL.
2.  **System fetches data**: Platform handles headless rendering, bypasses anti-bot, and extracts JSON.
3.  **Context window updated**: Clean structured data feeds directly back into the agent context for analysis.

## Building a Reputation Monitoring Pipeline

Let us assemble the skeleton of a reputation monitoring pipeline. This example shows how an OpenAI-powered agent uses defined tools to monitor reputation autonomously. The full pipeline handles discovery, extraction, and synthesis; the code below makes the first model call and marks where the tool-call loop goes.

We define two tools for the LLM. The first locates the target URL. The second performs the heavy extraction. The system prompt instructs the agent on how to sequence these tools.



```python title="reputation_pipeline.py" {30-36, 45-48}

import openai

from tools import find_trustpilot_profile, get_trustpilot_reviews

client = openai.Client()

tools = [
    {
        "type": "function",
        "function": {
            "name": "find_trustpilot_profile",
            "description": "Finds the Trustpilot URL for a given company name.",
            "parameters": {
                "type": "object",
                "properties": {
                    "company_name": {"type": "string"}
                },
                "required": ["company_name"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "get_trustpilot_reviews",
            "description": "Extracts recent reviews from a specific Trustpilot URL.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string"}
                },
                "required": ["url"]
            }
        }
    }
]

def analyze_competitor(company_name: str):
    messages = [
        {"role": "system", "content": "You are a competitive intelligence agent. First, find the target company's Trustpilot URL. Then, extract their reviews. Finally, write a brief technical summary of their users' most common complaints."},
        {"role": "user", "content": f"Analyze recent feedback for {company_name}."}
    ]

    # Initial LLM call to determine next action
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=messages,
        tools=tools
    )

    # In a production system, you would iterate through tool calls here.
    # The agent will output a tool call to find_trustpilot_profile.
    # You execute it, append the result to messages, and call the LLM again.
    # It then calls get_trustpilot_reviews.
    # You execute that, append the JSON data, and the LLM generates the final report.

    return response.choices[0].message

# Execute the pipeline
print(analyze_competitor("Acme Corp"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
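
&lt;p&gt;The tool-call loop that the comments above describe can be sketched as two small helpers. This is a minimal, framework-free version: &lt;code&gt;execute_tool_call&lt;/code&gt; and &lt;code&gt;tool_message&lt;/code&gt; are hypothetical names, and the registry uses a stub in place of the real &lt;code&gt;find_trustpilot_profile&lt;/code&gt; tool so the example runs without network access.&lt;/p&gt;

```python
import json


def execute_tool_call(name, arguments_json, registry):
    """Run one tool call and return the tool message content as a string."""
    if name not in registry:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        return str(registry[name](**arguments))
    except Exception as exc:  # Report failures to the model instead of crashing.
        return json.dumps({"error": str(exc)})


def tool_message(tool_call_id, content):
    """Build the message that reports a tool result back to the model."""
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}


# Stub tool standing in for find_trustpilot_profile.
registry = {
    "find_trustpilot_profile": lambda company_name: f"https://www.trustpilot.com/review/{company_name}",
}
content = execute_tool_call("find_trustpilot_profile", '{"company_name": "example.com"}', registry)
print(tool_message("call_1", content))
```

&lt;p&gt;In the real loop, each entry in &lt;code&gt;response.choices[0].message.tool_calls&lt;/code&gt; supplies the &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;function.name&lt;/code&gt;, and &lt;code&gt;function.arguments&lt;/code&gt; values. Append the assistant message and each tool message to &lt;code&gt;messages&lt;/code&gt;, then call the model again until it returns no further tool calls.&lt;/p&gt;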



&lt;p&gt;This architecture ensures the language model only operates on highly condensed, relevant information. By the time the LLM performs its final synthesis step, all HTML boilerplate and navigation logic has been stripped away. The model focuses purely on semantic analysis of the actual review text.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling and Cost
&lt;/h2&gt;

&lt;p&gt;Agentic workflows execute frequently. If you run a scheduled job that checks twenty competitors every hour, your infrastructure needs to handle that volume without unpredictable cost spikes. Review &lt;a href="https://dev.to/pricing"&gt;AlterLab pricing&lt;/a&gt; to calculate exact usage limits for your specific pipeline. You pay strictly for successful extractions, ensuring your agentic architecture remains highly scalable and your budgeting remains predictable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Giving your AI agent access to Trustpilot data requires robust infrastructure. Raw HTTP calls fail against modern bot protection. Sending raw HTML wastes context-window tokens.&lt;/p&gt;

&lt;p&gt;By using an extraction API built for AI workloads, you bypass these limitations. You define strict JSON schemas. The infrastructure handles the browser rendering and challenge solving. Your agent receives dense, structured data blocks. This creates reliable, automated pipelines for reputation monitoring, competitor analysis, and automated support operations.&lt;/p&gt;

</description>
      <category>api</category>
      <category>aiagents</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>How to Give Your AI Agent Access to Glassdoor Data</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:30:24 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-give-your-ai-agent-access-to-glassdoor-data-22n1</link>
      <guid>https://dev.to/alterlab/how-to-give-your-ai-agent-access-to-glassdoor-data-22n1</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To give your AI agent access to Glassdoor data, route target URLs through a managed extraction API that handles JavaScript rendering and returns structured JSON. This prevents raw HTML from bloating the context window and ensures reliable data retrieval for RAG pipelines without building custom scraping infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI agents need Glassdoor data
&lt;/h2&gt;

&lt;p&gt;Agents require external knowledge to reason effectively about real-world entities. Publicly available workplace data provides critical context for several agentic workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Company research pipelines:&lt;/strong&gt; Agents compiling technical briefs on target organizations need public review metrics and benefit listings to assess company health.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Salary intelligence:&lt;/strong&gt; RAG systems answering compensation queries require current public salary ranges across specific roles to provide accurate, grounded answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Culture signal monitoring:&lt;/strong&gt; LLMs analyzing sentiment can process public interview experiences and management ratings to score organizational transparency and interview difficulty over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why raw HTTP requests fail for agents
&lt;/h2&gt;

&lt;p&gt;Agents using standard HTTP libraries like Python's &lt;code&gt;requests&lt;/code&gt; encounter immediate roadblocks when targeting modern web applications. Glassdoor relies heavily on client-side JavaScript to render job listings, salary tables, and review content. A standard HTTP GET request returns a near-empty HTML shell made up mostly of script tags, not the actual data.&lt;/p&gt;

&lt;p&gt;Even if an agent successfully retrieves the rendered HTML, feeding that raw markup into an LLM context window is extremely inefficient. A standard Glassdoor page contains hundreds of kilobytes of nested &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; tags, CSS classes, and navigation menus.&lt;/p&gt;

&lt;p&gt;This raw markup wastes the token budget. At roughly four characters per token, a 300KB HTML file consumes about 75,000 tokens. Sending that to a modern LLM incurs high inference costs for pure noise. Agents need the underlying signal. Failed requests also break agent autonomy loops and force costly retries, degrading pipeline reliability.&lt;/p&gt;
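
&lt;p&gt;The arithmetic behind that estimate can be sketched as follows. The four-characters-per-token ratio is a common rule of thumb rather than an exact tokenizer count, and the 2 KB size for the extracted JSON is an illustrative assumption.&lt;/p&gt;

```python
def estimate_tokens(byte_count, chars_per_token=4):
    """Estimate token count from size, assuming about one byte per character."""
    return byte_count // chars_per_token


raw_html_tokens = estimate_tokens(300_000)
# A compact JSON payload of about 2 KB (an illustrative size).
json_tokens = estimate_tokens(2_000)
print(raw_html_tokens, json_tokens)
```

&lt;p&gt;For exact numbers, run the text through the tokenizer of the model you actually use.&lt;/p&gt;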

&lt;h2&gt;
  
  
  Connecting your agent to Glassdoor via AlterLab
&lt;/h2&gt;

&lt;p&gt;You need a translation layer between the raw web and your LLM. The &lt;a href="https://dev.to/docs/extract"&gt;Extract API docs&lt;/a&gt; detail how to convert unstructured web pages into strict JSON schemas. This data maps directly to Pydantic models or tool call arguments.&lt;/p&gt;

&lt;p&gt;By defining a schema, you instruct the extraction layer to find the specific data points on the page, regardless of the underlying DOM structure.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="agent_glassdoor_extract.py" {6-10}&lt;/p&gt;

&lt;p&gt;import json&lt;br&gt;
import alterlab&lt;/p&gt;

&lt;p&gt;client = alterlab.Client("YOUR_API_KEY")&lt;/p&gt;

&lt;p&gt;schema = {&lt;br&gt;
    "company_name": "string",&lt;br&gt;
    "overall_rating": "number",&lt;br&gt;
    "recent_public_reviews": ["string"]&lt;br&gt;
}&lt;/p&gt;

&lt;p&gt;result = client.extract(&lt;br&gt;
    url="https://glassdoor.com/Reviews/Example-Company-Reviews.htm",&lt;br&gt;
    schema=schema&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;print(json.dumps(result.data, indent=2))&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


If you prefer to handle the request via the command line or integrate it into a shell-based pipeline, the same extraction can be triggered using cURL.



```bash title="Terminal"
curl -X POST https://api.alterlab.io/api/v1/extract \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://glassdoor.com/Reviews/Example-Company-Reviews.htm",
    "schema": {
      "company_name": "string",
      "overall_rating": "number"
    }
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Using the Search API for Glassdoor queries
&lt;/h2&gt;

&lt;p&gt;Autonomous agents rarely start with exact URLs. They usually start with a query, such as a company name or a specific job role. You can combine a standard web search API with domain filtering to locate the exact public profile URL before extracting its contents.&lt;/p&gt;

&lt;p&gt;Using the Search API allows your agent to find the correct entry point automatically.&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;```python title="agent_search_pipeline.py" {4-6}&lt;/p&gt;

&lt;p&gt;import alterlab&lt;/p&gt;

&lt;p&gt;client = alterlab.Client("YOUR_API_KEY")&lt;/p&gt;

&lt;p&gt;search_results = client.search(&lt;br&gt;
    query="site:glassdoor.com/Overview public software engineer salary Acme Corp",&lt;br&gt;
    limit=1&lt;br&gt;
)&lt;/p&gt;

&lt;p&gt;if search_results.data:&lt;br&gt;
    target_url = search_results.data[0].url&lt;br&gt;
    print(f"Agent found target URL: {target_url}")&lt;br&gt;
    # Proceed to extraction step&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;


1.  **Agent requests data**: LLM agent calls the search tool with target domain and company name.
2.  **Platform fetches + extracts**: Handles browser rendering, returns structured JSON.
3.  **Agent uses clean data**: No parsing, no retries. Data goes straight to LLM context.

## MCP integration

The Model Context Protocol (MCP) standardizes how agents interact with external tools and data sources. Instead of writing custom API wrappers for every LLM, you can expose web data directly to local models or desktop applications using standardized servers.

Integrating this protocol allows coding assistants and autonomous desktop agents to query web data natively. Read the [AlterLab for AI Agents](https://alterlab.io/docs/tutorials/ai-agent) guide to configure the MCP server for your specific agent environment.

## Building a company research pipeline

Let us build a complete Python script that combines these concepts. This pipeline takes a company name, searches for its public profile, extracts the data into a structured schema, and prepares it for an LLM prompt.



```python title="company_research_agent.py" {16-22}

def research_company(company_name: str, api_key: str) -&amp;gt; dict:
    client = alterlab.Client(api_key)

    # Step 1: Find the public URL
    search_query = f"site:glassdoor.com/Overview {company_name} working at"
    search_results = client.search(query=search_query, limit=1)

    if not search_results.data:
        return {"error": "Could not locate public profile."}

    target_url = search_results.data[0].url

    # Step 2: Extract structured data
    schema = {
        "company_name": "string",
        "industry": "string",
        "employee_count": "string",
        "public_rating": "number"
    }

    extraction = client.extract(url=target_url, schema=schema)

    # Step 3: Format for LLM context
    return {
        "source_url": target_url,
        "structured_data": extraction.data
    }

# Example agent tool execution
if __name__ == "__main__":
    result = research_company("Example Corp", "YOUR_API_KEY")
    print("Data ready for LLM context window:")
    print(json.dumps(result, indent=2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pipeline isolates the complexity of web traversal. The LLM only receives the clean JSON dictionary, keeping the context window focused entirely on the extracted facts rather than raw HTML parsing.&lt;/p&gt;

&lt;p&gt;When operating autonomous agents at scale, error rates compound. A failed extraction step means a failed LLM inference step, driving up your total cost per task. Review the &lt;a href="https://dev.to/pricing"&gt;AlterLab pricing&lt;/a&gt; documentation to understand how costs scale with reliable request volume.&lt;/p&gt;
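
&lt;p&gt;To make the compounding concrete, here is a minimal arithmetic sketch. It assumes the steps fail independently and that a failed task is retried from the start; the per-step rates and the cost figure are made-up illustrative numbers, not measurements of any service.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import math

def task_success_rate(step_success_rates):
    """Probability that every step of a sequential pipeline succeeds,
    assuming the steps fail independently of each other."""
    return math.prod(step_success_rates)

def expected_cost_per_success(cost_per_attempt, success_rate):
    """Expected spend for one successful task when a failed attempt is
    retried from the start (mean of a geometric distribution)."""
    return cost_per_attempt / success_rate

# Hypothetical success rates for the search, extraction, and LLM inference steps
rates = [0.95, 0.90, 0.98]
overall = task_success_rate(rates)

print(f"Overall task success rate: {overall:.3f}")
print(f"Expected cost per successful task: {expected_cost_per_success(0.02, overall):.4f}")
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Even when each step looks reliable on its own, the product of the rates is lower than any single one, which is why improving the weakest step tends to pay off most.&lt;/p&gt;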

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Agents require structured data, not raw markup. Feeding raw HTML into a context window wastes tokens and degrades model reasoning.&lt;/p&gt;

&lt;p&gt;Use schema-based extraction APIs to enforce strict JSON output. This guarantees your LLM receives predictable data formats for tool calls and RAG pipelines.&lt;/p&gt;

&lt;p&gt;Combine domain-specific search queries with targeted extraction to build robust, autonomous research tools.&lt;/p&gt;

&lt;p&gt;Read the &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt; to install the client library and integrate web extraction into your agent architecture.&lt;/p&gt;

</description>
      <category>api</category>
      <category>aiagents</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>How to Give Your AI Agent Access to G2 Data</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Thu, 18 Jun 2026 11:30:23 +0000</pubDate>
      <link>https://dev.to/alterlab/how-to-give-your-ai-agent-access-to-g2-data-50j7</link>
      <guid>https://dev.to/alterlab/how-to-give-your-ai-agent-access-to-g2-data-50j7</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers accessing publicly available data. Always review a site's robots.txt and Terms of Service before automated access.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To give an AI agent access to G2 data, route its tool calls through AlterLab's Extract API. This provides structured JSON directly to the LLM context window, bypassing the need for manual HTML parsing while handling browser rendering and rate limits automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why AI Agents Need G2 Data
&lt;/h2&gt;

&lt;p&gt;AI agents building software comparison RAG pipelines require real-world user feedback. G2 hosts millions of public reviews, feature ratings, and market categorizations. Accessing this data enables agents to perform specific tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Software Comparison Research&lt;/strong&gt;: Agents can pull feature matrices and user sentiment to compare tools dynamically, generating unbiased recommendations based on empirical data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Competitor Intelligence&lt;/strong&gt;: Pipelines can monitor a competitor's page for new negative reviews, alerting product teams to specific missing features.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Category Monitoring&lt;/strong&gt;: Agents can track entire software categories to identify emerging tools and inform market-positioning strategy.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why Raw HTTP Requests Fail for Agents
&lt;/h2&gt;

&lt;p&gt;Giving an LLM a standard HTTP client tool usually leads to pipeline failure. Target sites like G2 employ sophisticated rate limiting and browser fingerprinting. A plain GET request does not execute client-side JavaScript, so it misses rendered content and is easy for bot detection to flag.&lt;/p&gt;

&lt;p&gt;When this happens, the agent receives an HTML challenge page instead of data. This pollutes the context window. It wastes token budgets on retries. Often, the LLM hallucinates answers based on incomplete security page text. Agents need structured data, not raw DOM elements and CAPTCHA challenges.&lt;/p&gt;
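
&lt;p&gt;If an agent does end up handling raw page text, one defensive pattern is to screen it before it reaches the context window. The sketch below is only an illustration: the marker phrases are guesses rather than a list of what any particular site returns, and &lt;code&gt;safe_context&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
# Illustrative marker phrases only; tune these against pages you actually observe
CHALLENGE_MARKERS = (
    "verify you are human",
    "captcha",
    "access denied",
    "unusual traffic",
)

def looks_like_challenge(page_text):
    """Return True when the text contains any phrase typical of a block page."""
    lowered = page_text.lower()
    return any(marker in lowered for marker in CHALLENGE_MARKERS)

def safe_context(page_text, max_chars=2000):
    """Return text that is safe to hand to the LLM, or None when the page
    looks like a challenge and should be dropped instead of retried blindly."""
    if looks_like_challenge(page_text):
        return None
    return page_text[:max_chars]

print(safe_context("Please verify you are human to continue"))
print(safe_context("Acme CRM is rated highly by reviewers"))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Returning &lt;code&gt;None&lt;/code&gt; gives the agent an explicit signal to stop, instead of letting it reason over security page text.&lt;/p&gt;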

&lt;h2&gt;
  
  
  Connecting Your Agent to G2 via AlterLab
&lt;/h2&gt;

&lt;p&gt;The solution is an intermediary tool that handles the transport layer and returns clean JSON. AlterLab provides this infrastructure. Before implementing the tool, follow our &lt;a href="https://dev.to/docs/quickstart/installation"&gt;getting started guide&lt;/a&gt; to configure your environment and API keys.&lt;/p&gt;

&lt;p&gt;You have two primary approaches: the Extract API for structured data and the Scrape API for raw HTML. &lt;/p&gt;

&lt;h3&gt;
  
  
  The Extract API Approach
&lt;/h3&gt;

&lt;p&gt;The Extract API is designed specifically for AI agents. You define a schema, and the API returns a JSON object matching that schema. This minimizes context window usage. Review the full &lt;a href="https://dev.to/docs/extract"&gt;Extract API docs&lt;/a&gt; for advanced schema configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agent_extract.py
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Structured extraction gets clean data without parsing HTML
result = client.extract(
    url="https://g2.com/categories/marketing-automation",
    schema={
        "products": ["string"],
        "top_features": ["string"],
        "average_rating": "number"
    }
)

print(result.data)  # Clean structured dict, ready for your LLM
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/api/v1/extract \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://g2.com/categories/marketing-automation",
    "schema": {"products": ["string"]}
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  The Scrape API Approach
&lt;/h3&gt;

&lt;p&gt;If your agent operates in a Python environment and prefers to use tools like BeautifulSoup locally, you can use the Scrape API. This returns the raw HTML after full JavaScript rendering.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agent_scrape.py
import alterlab

client = alterlab.Client("YOUR_API_KEY")
html_content = client.scrape(url="https://g2.com/categories/crm")

# Agent can now parse the full DOM locally
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Using the Search API for G2 Queries
&lt;/h2&gt;

&lt;p&gt;Agents rarely know exact URLs in advance. A user might prompt the agent with "Compare the top CRM tools on G2." The agent must first search to find the correct pages.&lt;/p&gt;

&lt;p&gt;The AlterLab Search API allows agents to execute queries and retrieve organic results, which they can then feed into the Extract API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# agent_search.py
import alterlab

client = alterlab.Client("YOUR_API_KEY")

search_results = client.search(
    query="site:g2.com best crm software 2026",
    num_results=3
)

for result in search_results.data:
    print(result.url)
    # Agent iterates over URLs to extract reviews
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  MCP Integration
&lt;/h2&gt;

&lt;p&gt;If you use Claude, Cursor, or an MCP-compatible framework, you do not need to write custom Python tools. You can use the AlterLab MCP server. It exposes the Extract, Scrape, and Search endpoints directly to the model as native tool calls.&lt;/p&gt;

&lt;p&gt;To configure this environment, read the &lt;a href="https://alterlab.io/docs/tutorials/ai-agent" rel="noopener noreferrer"&gt;AlterLab for AI Agents&lt;/a&gt; tutorial. Once connected, Claude can autonomously search G2, extract schemas, and synthesize answers without additional wrapper code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building a Software Comparison Research Pipeline
&lt;/h2&gt;

&lt;p&gt;Let us build a complete function-calling pipeline. This example shows the logical flow of an agent receiving a user query, fetching G2 data, and generating a final report.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# comparison_pipeline.py
import json

import alterlab
import openai

alterlab_client = alterlab.Client("YOUR_ALTERLAB_KEY")
llm_client = openai.Client(api_key="YOUR_OPENAI_KEY")

def get_g2_product_data(url: str) -&amp;gt; str:
    """Tool provided to the LLM to fetch G2 data."""
    result = alterlab_client.extract(
        url=url,
        schema={
            "product_name": "string",
            "overall_rating": "number",
            "recent_reviews": [{"pros": "string", "cons": "string"}]
        }
    )
    return json.dumps(result.data)

# Tool definition advertised to the model
tools = [{
    "type": "function",
    "function": {
        "name": "get_g2_product_data",
        "description": "Extracts structured product data and reviews from a G2 URL.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "The G2 product URL"}
            },
            "required": ["url"]
        }
    }
}]

# Agent execution loop
messages = [{"role": "user", "content": "Compare the recent pros and cons of Product A vs Product B based on their G2 pages. Product A: https://g2.com/products/a/reviews. Product B: https://g2.com/products/b/reviews."}]

response = llm_client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=tools
)

# In a complete application, you handle the tool_calls,
# append the JSON results to messages, and call the LLM again.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When scaling this pipeline across thousands of products, check &lt;a href="/pricing"&gt;AlterLab pricing&lt;/a&gt; to model your API usage costs. The Extract API significantly reduces LLM token costs by dropping heavy HTML markup before the data reaches your context window.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Skip the DOM&lt;/strong&gt;: Giving your agent raw HTML wastes tokens and increases latency. Always use structured extraction endpoints.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Automate Transport&lt;/strong&gt;: Offload browser rendering and rate limiting to AlterLab so your agent focuses entirely on reasoning and synthesis.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Use MCP for Zero-Code Tools&lt;/strong&gt;: Connect Claude or Cursor directly to AlterLab via MCP to grant instant web data access without writing custom Python wrappers.&lt;/li&gt;
&lt;/ol&gt;
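
&lt;p&gt;The tool-call handling that the pipeline above leaves as a comment can be sketched without any SDK. This is a simplified illustration, not the OpenAI client's actual object model: it assumes each tool call is a plain dict with &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;name&lt;/code&gt;, and &lt;code&gt;arguments&lt;/code&gt; keys, and it swaps the real extraction tool for a stub that returns canned JSON. &lt;code&gt;TOOL_REGISTRY&lt;/code&gt; and &lt;code&gt;run_tool_calls&lt;/code&gt; are hypothetical names.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import json

def get_g2_product_data(url):
    """Stub standing in for the real extraction tool; returns canned JSON."""
    return json.dumps({"product_name": "Product A", "overall_rating": 4.5, "source": url})

# Registry mapping tool names the model may request to Python callables
TOOL_REGISTRY = {"get_g2_product_data": get_g2_product_data}

def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and build the tool-role messages to append."""
    tool_messages = []
    for call in tool_calls:
        handler = registry.get(call["name"])
        if handler is None:
            # Report unknown tools back to the model instead of raising
            content = json.dumps({"error": f"unknown tool {call['name']}"})
        else:
            arguments = json.loads(call["arguments"])  # arguments arrive as a JSON string
            content = handler(**arguments)
        tool_messages.append(
            {"role": "tool", "tool_call_id": call["id"], "content": content}
        )
    return tool_messages

# Simplified stand-in for the tool calls a model response might contain
calls = [{
    "id": "call_1",
    "name": "get_g2_product_data",
    "arguments": json.dumps({"url": "https://g2.com/products/a/reviews"}),
}]

messages = run_tool_calls(calls)
print(messages[0]["content"])
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In a real loop, these tool messages would be appended to the conversation after the assistant message that requested them, and the model would be called again to produce the final comparison.&lt;/p&gt;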

</description>
      <category>aiagents</category>
      <category>datapipelines</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Realtor.com Data API: Extract Structured JSON in 2026</title>
      <dc:creator>AlterLab</dc:creator>
      <pubDate>Wed, 17 Jun 2026 20:08:59 +0000</pubDate>
      <link>https://dev.to/alterlab/realtorcom-data-api-extract-structured-json-in-2026-11nj</link>
      <guid>https://dev.to/alterlab/realtorcom-data-api-extract-structured-json-in-2026-11nj</guid>
      <description>&lt;p&gt;&lt;em&gt;Disclaimer: This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping. Ensure your extraction rates respect target server limits, and remember that you are responsible for maintaining compliance with relevant Terms of Service.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;To get structured realtor.com data via API, use the AlterLab Extract endpoint. You provide the target listing URL and a JSON schema defining your required fields, such as price, bedrooms, and address. AlterLab handles the underlying infrastructure, JavaScript execution, and AI-driven data mapping, returning clean, strictly typed JSON ready for immediate integration into your data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Realtor.com data?
&lt;/h2&gt;

&lt;p&gt;Engineers and data teams require reliable access to real-estate data for various programmatic use cases. When building systems dependent on housing market information, raw data velocity and accuracy define the success of the application. &lt;/p&gt;

&lt;p&gt;Common applications for a real-estate data API include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Machine Learning and AI Training:&lt;/strong&gt; Feeding localized housing market data into models to predict pricing trends, neighborhood appreciation, or rental yield forecasting.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;RAG Pipelines:&lt;/strong&gt; Supplying real-time, ground-truth property data to Large Language Models so they can answer user queries about specific market conditions without hallucinating.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Market Analytics:&lt;/strong&gt; Building internal dashboards that track inventory velocity, average days on market, and price-per-square-foot variations across target zip codes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What data can you extract?
&lt;/h2&gt;

&lt;p&gt;When accessing public listing pages, you can extract a comprehensive set of property attributes. The key to building a resilient pipeline is targeting the exact fields you need rather than downloading the entire document.&lt;/p&gt;

&lt;p&gt;Publicly available data points typically include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Core Property Attributes:&lt;/strong&gt; Bedrooms, bathrooms, total square footage, lot size, and year built.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Pricing Information:&lt;/strong&gt; Current asking price, price per square foot, and historical price changes if listed on the public page.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Location Data:&lt;/strong&gt; Full address, neighborhood, city, state, and zip code.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Listing Metadata:&lt;/strong&gt; Days on market, listing agent or brokerage name, and property status (active, pending, sold).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Features and Amenities:&lt;/strong&gt; Garage capacity, heating/cooling systems, HOA fees, and architectural style.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The extraction approach
&lt;/h2&gt;

&lt;p&gt;Extracting realtor.com data as JSON with traditional methods means downloading HTML and writing CSS or XPath selectors to locate specific DOM nodes. This approach is fundamentally fragile.&lt;/p&gt;

&lt;p&gt;Modern web applications use dynamic rendering, heavily minified JavaScript, and utility-first CSS frameworks. Class names like &lt;code&gt;css-1xj2b&lt;/code&gt; change with every deployment. A simple A/B test changing the layout of the property gallery will silently break your parsing logic, leading to null values or, worse, misaligned data entering your database.&lt;/p&gt;

&lt;p&gt;The modern standard is using a data API. Instead of asking "how do I parse the HTML," you define the data structure you want and let an AI extraction layer map the visual page content to your schema. AlterLab handles the proxy rotation, JavaScript rendering, and LLM-powered extraction in a single API call. &lt;/p&gt;

&lt;p&gt;Before proceeding to the code, ensure you have set up your environment by reviewing our &lt;a href="https://dev.to/docs/quickstart/installation"&gt;Getting started guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick start with AlterLab Extract API
&lt;/h2&gt;

&lt;p&gt;The AlterLab Extract API requires two primary inputs: the target URL and a JSON schema. The schema dictates the shape, types, and descriptions of the data you expect.&lt;/p&gt;

&lt;p&gt;Here is a basic implementation of a realtor.com data extraction pipeline in Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# extract_realtor-com.py
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Every field is a string here for simplicity
schema = {
  "type": "object",
  "properties": {
    "address": {
      "type": "string",
      "description": "The address field"
    },
    "price": {
      "type": "string",
      "description": "The price field"
    },
    "bedrooms": {
      "type": "string",
      "description": "The bedrooms field"
    },
    "bathrooms": {
      "type": "string",
      "description": "The bathrooms field"
    },
    "sqft": {
      "type": "string",
      "description": "The sqft field"
    },
    "listing_date": {
      "type": "string",
      "description": "The listing date field"
    }
  }
}

result = client.extract(
    url="https://realtor.com/example-page",
    schema=schema,
)
print(result.data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For environments where Python is not the primary language, or for testing directly from your terminal, you can interact with the API using standard HTTP tools.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -X POST https://api.alterlab.io/v1/extract \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://realtor.com/example-page",
    "schema": {"properties": {"address": {"type": "string"}, "price": {"type": "string"}, "bedrooms": {"type": "string"}}}
  }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Review the complete parameter list and advanced configuration options in the &lt;a href="https://dev.to/docs/api/extract"&gt;Extract API docs&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Define your schema
&lt;/h2&gt;

&lt;p&gt;The schema is the most critical component of your request. AlterLab uses standard JSON Schema validation to ensure the LLM output exactly matches your pipeline requirements. &lt;/p&gt;

&lt;p&gt;While the quick start example uses string types for simplicity, production pipelines should enforce strict typing. When you specify an integer or a boolean, the Extract API guarantees the output will match that type, stripping out extraneous text like currency symbols or commas.&lt;/p&gt;
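
&lt;p&gt;For contrast, with the string-typed quick start schema that cleanup falls to your own code. A minimal sketch of what that looks like, assuming values arrive as display strings such as "$450,000" (the sample record is made up, and &lt;code&gt;parse_int&lt;/code&gt; and &lt;code&gt;normalize_listing&lt;/code&gt; are hypothetical helpers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import re

def parse_int(text):
    """Extract a whole number from display text such as '$450,000' or '3 beds'.
    Assumes whole-number values; anything after a decimal point is dropped."""
    whole_part = text.split(".")[0]
    digits = re.sub(r"[^0-9]", "", whole_part)
    if digits == "":
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)

def normalize_listing(raw):
    """Convert the string fields of a quick-start style result into typed values."""
    return {
        "address": raw["address"].strip(),
        "price_usd": parse_int(raw["price"]),
        "bedrooms": parse_int(raw["bedrooms"]),
    }

# Made-up sample shaped like the quick start schema
sample = {"address": " 123 Example St ", "price": "$450,000", "bedrooms": "3 beds"}
print(normalize_listing(sample))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Declaring integer types in the schema moves this normalization out of your codebase, which is one less place for formatting edge cases to hide.&lt;/p&gt;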

&lt;p&gt;Consider this refined schema for structured realtor.com data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "type": "object",
  "properties": {
    "price_usd": {
      "type": "integer",
      "description": "The current asking price in US Dollars. Return only the number, no symbols."
    },
    "bedrooms": {
      "type": "integer",
      "description": "Total number of bedrooms."
    },
    "has_garage": {
      "type": "boolean",
      "description": "True if the property has a garage, false otherwise."
    },
    "status": {
      "type": "string",
      "enum": ["active", "pending", "sold", "unknown"],
      "description": "The current market status of the listing."
    }
  },
  "required": ["price_usd", "bedrooms", "status"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;By providing detailed descriptions and utilizing JSON Schema constraints like &lt;code&gt;enum&lt;/code&gt; and &lt;code&gt;required&lt;/code&gt;, you explicitly instruct the extraction engine on how to handle edge cases and normalize the data before it reaches your application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handle pagination and scale
&lt;/h2&gt;

&lt;p&gt;Single property extraction is useful for specific lookups, but building comprehensive datasets requires processing thousands of URLs. When scaling your structured realtor.com data pipeline, synchronous HTTP requests become a bottleneck.&lt;/p&gt;

&lt;p&gt;For high-volume workloads, you must transition to asynchronous batch processing. This allows you to queue thousands of URLs and let AlterLab manage concurrency, rate limits, and retries.&lt;/p&gt;

&lt;p&gt;The following example demonstrates how to dispatch a batch of extraction jobs and process them asynchronously.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# batch_extract.py
import asyncio

import alterlab

client = alterlab.AsyncClient("YOUR_API_KEY")

async def process_listings(urls, schema):
    tasks = []

    # Create one extraction coroutine per URL
    for url in urls:
        task = client.extract.create(
            url=url,
            schema=schema
        )
        tasks.append(task)

    # Run all extraction jobs concurrently and wait for them to complete
    results = await asyncio.gather(*tasks, return_exceptions=True)

    for result in results:
        if isinstance(result, Exception):
            print(f"Extraction failed: {result}")
        else:
            print(f"Success: {result.data['price']} for {result.url}")

# Minimal schema so the example stands on its own; substitute your full schema
target_schema = {
    "type": "object",
    "properties": {"price": {"type": "string"}}
}

# List of target public URLs
urls = [
    "https://realtor.com/property-1",
    "https://realtor.com/property-2",
    "https://realtor.com/property-3"
]

# Run the async loop
asyncio.run(process_listings(urls, target_schema))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This asynchronous pattern maximizes throughput while respecting network boundaries. It is highly recommended to combine this with webhook delivery for massive batches, allowing your server to receive pushed data asynchronously rather than holding open connections.&lt;/p&gt;
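
&lt;p&gt;The receiving side of webhook delivery can be kept small. The webhook payload format is not described in this guide, so the sketch below assumes a hypothetical JSON body with &lt;code&gt;url&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, and &lt;code&gt;data&lt;/code&gt; keys purely for illustration; check the real payload shape before relying on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import json

def handle_webhook_body(body_bytes, store):
    """Parse one pushed result and file it by URL.
    Returns True when the result was stored, False when it was rejected."""
    try:
        payload = json.loads(body_bytes)
    except ValueError:
        # Covers malformed JSON and undecodable bytes
        return False

    # Hypothetical payload keys: url, status, data
    if not isinstance(payload, dict) or payload.get("status") != "success":
        return False

    url = payload.get("url")
    if url is None:
        return False

    store[url] = payload.get("data", {})
    return True

# Simulate one delivery with a made-up body
results_by_url = {}
body = json.dumps({
    "url": "https://realtor.com/property-1",
    "status": "success",
    "data": {"price": "$450,000"},
}).encode("utf-8")

print(handle_webhook_body(body, results_by_url))
print(results_by_url)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the parsing in a plain function like this makes it easy to call from whichever web framework hosts the endpoint, and easy to test without a running server.&lt;/p&gt;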

&lt;p&gt;Cost management is critical when scaling up. AlterLab ensures you are only billed for successful payload deliveries. Evaluate your expected volume and review the &lt;a href="https://dev.to/pricing"&gt;AlterLab pricing&lt;/a&gt; to model your pipeline costs accurately.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;p&gt;Building a robust realtor.com data API integration relies on moving away from brittle DOM parsing and adopting schema-driven extraction.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Eliminate parsing logic:&lt;/strong&gt; Stop writing and maintaining regex and XPath for dynamic React applications.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Enforce data contracts:&lt;/strong&gt; Use strict JSON Schemas to guarantee the shape and types of the data entering your database.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Scale asynchronously:&lt;/strong&gt; Use batching and async clients to process thousands of public listings efficiently.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Focus on the pipeline:&lt;/strong&gt; Offload infrastructure, proxy management, and site changes to AlterLab so your team can focus on data utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deploying an AI-powered extraction pipeline ensures your data operations remain resilient against front-end changes, delivering clean, actionable real-estate data continuously.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>realestate</category>
      <category>dataextraction</category>
      <category>api</category>
    </item>
  </channel>
</rss>
