DEV Community: Tinyfishie

How to Extract Structured Data from A Website

Tinyfishie — Tue, 19 May 2026 16:14:36 +0000

TinyFish agents are cloud-based browser sessions that navigate any website — static, JavaScript-rendered, or requiring multi-step workflows — and return machine-readable structured data without requiring you to manage browser infrastructure, proxy rotation, or session handling.

The right tool for structured data extraction depends entirely on what kind of website you're dealing with. This guide presents a four-tier framework: most sites fall into Tier 1 or 2, where you don't need TinyFish at all. Tier 3 and 4 are where managed infrastructure earns its cost.

Prerequisites for the code examples: Python 3.8+. Install: pip install requests feedparser playwright tinyfish

Responsible use note: Always review a site's terms of service before automated data extraction. Public-facing data for research, competitive intelligence, and non-commercial analysis is generally low-risk. Extraction at scale for data resale or targeting platform users requires careful legal review.

What Counts as "Structured Data Extraction"?

The goal is machine-readable output — JSON, CSV, a typed object — rather than raw HTML or a PDF dump. A scraper that returns <div class="price">$29.99</div> isn't done; a scraper that returns { "price": 29.99, "currency": "USD" } is.

The right tool depends on what kind of site you're dealing with. Tier 1 sites are trivially easy. Tier 4 requires the most capability. Most real-world extraction projects span multiple tiers — a single pipeline might hit Tier 1 sources for some data and Tier 3 sources for others.

Four Tiers of Website Complexity

Tier	Site type	What makes it hard	Right tool	Code complexity
1	Has an API or RSS feed	Nothing — use the API	`requests` + JSON	Trivial
2	JS-rendered, no API	Content loads after JS executes	Playwright (local) or TinyFish Fetch (managed)	Low
3	Strict automation requirements at scale	Infrastructure gaps cause failures in production	TinyFish Fetch with browser	Medium
4	Authenticated or multi-step workflow	Session state, conditional navigation, decisions	TinyFish Web Agent	Handled for you

Tiers 3 and 4 aren't edge cases — they're where production data pipelines typically land once you move beyond toy examples.

Tier 1 — Sites with APIs or RSS Feeds

Always check for an official API before writing a scraper. Many sites that look like scraping targets have well-documented APIs that return exactly the structured data you need.

Where to look: /robots.txt, the site footer ("Developers" / "API" links), documentation subdomains, RapidAPI, or just searching "[site name] API documentation".

import requests

# Example: a site that returns JSON at a predictable endpoint
response = requests.get(
    "https://data-source.example.com/api/v1/products",
    headers={"Authorization": "Bearer your_api_key"},
    params={"category": "electronics", "format": "json"}
)
products = response.json()
print(products[0])  # {'id': '123', 'name': 'Widget A', 'price': 29.99}

RSS feeds are Tier 1 for content monitoring:

import feedparser

feed = feedparser.parse("https://news-source.example.com/feed.xml")
articles = [{"title": e.title, "url": e.link, "published": e.published} for e in feed.entries]

This is the best case. It's fast, free, respects the site's intended access pattern, and won't break when page layouts change.

Tier 2 — JavaScript-Rendered Pages

When a site loads content after JavaScript executes, requests alone returns an empty or incomplete page. You need a browser that runs JavaScript.

Two equally valid approaches, each with a clear use case:

Playwright — the right choice when you're running locally, need a controlled environment, or are integrating into an existing test suite. Free to use, open-source, excellent documentation.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://js-rendered-site.example.com/catalog")
    page.wait_for_selector(".product-card")  # wait for JS to render

    products = page.evaluate("""
        () => Array.from(document.querySelectorAll('.product-card')).map(el => ({
            name: el.querySelector('.name').innerText,
            price: el.querySelector('.price').innerText,
        }))
    """)
    browser.close()

TinyFish Fetch API — the right choice when you want zero infrastructure overhead: no browser process to manage, no selector maintenance when the site updates, no proxy setup. Returns markdown or structured text from the live rendered page.

import requests, os

response = requests.post(
    "https://api.fetch.tinyfish.ai",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={"urls": ["https://js-rendered-site.example.com/catalog"], "format": "markdown"}
)
results = response.json().get("results", [])
content = results[0]["text"] if results else ""  # full page content after JS renders

For simple JS-rendered pages where content is fully visible after load, either approach works. Playwright gives you more control; TinyFish Fetch eliminates infrastructure management.

Tier 3 — Sites with Strict Automation Requirements

Playwright requires significant infrastructure work to run reliably on sites with strict automation requirements at scale. The gaps compound:

Proxy management — single-IP requests fail rate limits; rotating proxies require sourcing, billing, and maintenance
Session handling — requests with stale or suspicious session fingerprints get challenged more frequently
Concurrency — running 50 Playwright instances locally saturates memory; running them in cloud containers requires container orchestration

TinyFish Fetch API handles the infrastructure layer — proxy routing, session management, and request handling at the infrastructure level. Your extraction code stays the same.

import requests, os

response = requests.post(
    "https://api.fetch.tinyfish.ai",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={
        "urls": ["https://strict-requirements-site.example.com/listings"],
        "format": "markdown"
    }
)

# Check for errors — failed URLs don't consume credits
result_data = response.json()
errors = result_data.get("errors", [])
results = result_data.get("results", [])
if results and not errors:
    content = results[0]["text"]
else:
    print("Fetch incomplete. Errors:", errors)

The Fetch API is free on all TinyFish plans — zero credits consumed per request. You pay only for Web Agent steps (Tier 4). For Tier 3 use cases, this means running extraction at scale without per-request credit costs.

For running extraction across many URLs in parallel, see how to run 1,000 parallel web requests with the TinyFish Fetch API.

Tier 4 — Authenticated or Multi-Step Workflows

When extraction requires navigating through multiple pages, making decisions based on page content, or maintaining session state across requests — this is where a goal-directed web agent is the right abstraction.

Framing note: Tier 4 workflows are appropriate when you're accessing accounts and portals you are authorized to use — your own organization's tools, accounts you operate, or systems where you have explicit access rights.

from tinyfish import TinyFish
import os

client = TinyFish()  # reads TINYFISH_API_KEY from env

# Example: extract a structured report from a multi-step portal
result = client.agent.run(
    url="https://portal.example.com",
    goal="""Navigate to the Analytics section and extract the monthly summary.
    Return JSON with: { month, revenue, units_sold, top_products: string[] }
    Use the authorized credentials already stored in Vault."""
)

# COMPLETED means the session ran — not that the goal succeeded.
# Check both infrastructure-level and goal-level results:
if hasattr(result, 'status') and result.status == "FAILED":
    print("Infrastructure error:", getattr(result, 'error', 'unknown'))
else:
    data = result.result or {}
    if data.get("status") == "failure":
        print("Agent could not complete goal:", data.get("reason"))
    else:
        print("Extracted:", data)

The key distinction from Tier 3: the Web Agent doesn't just fetch a URL — it understands a goal, navigates through steps to reach it, and returns structured output. Session state, conditional navigation ("if the report isn't available yet, check back in 10 minutes"), and multi-page workflows are handled by the agent, not by your code.

Choosing the Right Approach

Does the site have an API or RSS feed?
  → Yes: Use the API (Tier 1). No further tooling needed.
  → No: Is the content JavaScript-rendered?
       → No (static HTML): requests + BeautifulSoup for simple cases.
       → Yes: Is this a local/controlled environment?
              → Yes: Playwright (Tier 2a).
              → No, or you need managed scale: TinyFish Fetch API (Tier 2b or 3).
                   → Does the workflow require multi-step navigation or session state?
                          → Yes: TinyFish Web Agent (Tier 4).
                          → No: TinyFish Fetch API (Tier 3).

Use case	Best tool
Site has a public API	Use the API
Static HTML, simple extraction	requests + BeautifulSoup
JS-rendered, local or test environment	Playwright
JS-rendered, production at scale	TinyFish Fetch
Strict automation requirements at scale	TinyFish Fetch (browser: true)
Multi-step workflow or session state	TinyFish Web Agent

Most production extraction projects combine multiple tiers. An e-commerce monitoring pipeline might use a site's API for product catalog (Tier 1), TinyFish Fetch for pricing pages (Tier 3), and TinyFish Web Agent for authenticated inventory portals (Tier 4).

The right extraction tool is the one that matches the complexity of what you're extracting. For Tier 1 and 2, the answer is almost always free open-source tools. For Tier 3 and 4, managed infrastructure pays for itself in engineering time avoided. Start with the simplest tool that works, and upgrade when the complexity demands it.

FAQ

What is the simplest way to extract structured data from a website?

The simplest approach is checking whether the site has a public API or RSS feed first — if it does, use that directly and skip browser automation entirely. For sites without APIs, Python's requests library works for static HTML. JavaScript-rendered sites require a browser; Playwright for local use, TinyFish Fetch API for managed-scale extraction.

When does Playwright break at scale?

Playwright is reliable for local and small-scale use. In production environments running hundreds or thousands of daily extractions, the infrastructure requirements compound: proxy sourcing and rotation, browser process memory limits, session fingerprint freshness, and container orchestration for concurrency. TinyFish Fetch handles this infrastructure layer, keeping your extraction code simple.

What is the TinyFish Fetch API and how is it different from requests?

TinyFish Fetch API runs a full browser session and returns the page content after JavaScript execution — similar to what Playwright returns, but without the infrastructure overhead. Unlike Python's requests, which only performs an HTTP GET and returns the raw server response, TinyFish Fetch renders the complete DOM including dynamically loaded content. The Fetch API is free on all TinyFish plans — zero credits per successful request.

When do I need the Web Agent instead of the Fetch API?

Use the Web Agent when your extraction involves decisions, navigation across multiple pages, or session state that must persist across steps. If you're fetching a single URL and parsing the response, Fetch is sufficient. If you need to "navigate to the quarterly report, download the CSV, and extract the revenue line" — that's a multi-step goal requiring the Web Agent.

Is extracting publicly visible data legal?

Public-facing data extraction for research, competitive intelligence, and non-commercial purposes is generally low-risk in most jurisdictions. Terms of service are contractual (not legal) restrictions and their enforceability varies by context. Data that requires authentication — content behind login pages — is in a different category: only access portals and accounts you are authorized to use. Consult legal counsel for commercial data products or any extraction at significant scale.

How do I extract structured data from a site that requires login?

Use TinyFish Web Agent with Vault, which stores credentials securely and injects them into the browser session. The agent navigates the authentication flow using credentials for accounts you are authorized to access, then continues to the target content. This is appropriate for your own organizational tools, SaaS dashboards, and portals where you hold the account.

Related Reading:

Want to scrape the web without getting blocked? Try TinyFish — a browser API built for AI agents and developers.

Fetching Data from a Large URL List: The Complete Decision Guide

Tinyfishie — Tue, 19 May 2026 11:13:31 +0000

You have a list of 500 URLs — competitor product pages, supplier portals, job listings, or real estate listings. You need the data from each one.

The answer to "which tool fetches this data reliably" depends on what's in that list — not on how many URLs there are.

What's in your list → which tool:

All static HTML, no strict automation requirements → requests + httpx (fastest, cheapest)
JavaScript-rendered content, no strict automation requirements → Playwright or Crawlee
Mixed list with some protected sites → Playwright + proxy rotation
Protected or authenticated URLs at scale → TinyFish Web Agent
Massive volume (100K+) of public pages → Scrapy

(Not sure what's in your list? The URL classification probe in the Mixed List section below runs in seconds before you commit to a full crawl.)

The Tool That Fits the List

Static HTML at Volume: requests + asyncio

If your URLs are documentation pages, blog posts, static product catalogs, or any content that loads fully in the initial HTML response, Python's requests library with async execution is the fastest and cheapest option—often by a large margin.

import asyncio
import httpx
async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    try:
        r = await client.get(url, timeout=15)
        return {"url": url, "status": r.status_code, "html": r.text}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_list(urls: list[str], concurrency: int = 20) -> list:
    results = []
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[fetch(client, url) for url in batch])
            results.extend(batch_results)
            print(f"Processed {min(i + concurrency, len(urls))}/{len(urls)}")
    return results

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

results = asyncio.run(crawl_list(urls))

In our testing, this handles 1,000 static URLs in under a minute on a standard laptop. For 100K+ URLs, Scrapy's built-in scheduler, downloader middleware, and item pipeline make more sense—it handles deduplication, retry logic, and output formatting at Scrapy's architecture level.

Where this breaks down: Any URL that requires JavaScript execution. If the page shows a loading spinner and populates content after load, requests returns the spinner HTML, not the content.

JavaScript Content: Playwright with Batching

For lists where content loads via JavaScript—React SPAs, infinite scroll, dynamic filtering, price tables that render after an API call—you need a real browser.

import asyncio
from playwright.async_api import async_playwright

async def fetch_js(page, url: str) -> dict:
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        content = await page.content()
        return {"url": url, "html": content}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_js_list(urls: list[str], concurrency: int = 5) -> list:
    results = []
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            for i in range(0, len(urls), concurrency):
                batch = urls[i:i + concurrency]
                pages = [await browser.new_page() for _ in batch]
                batch_results = await asyncio.gather(*[
                    fetch_js(page, url) for page, url in zip(pages, batch)
                ])
                for page in pages:
                    await page.close()
                results.extend(batch_results)
        finally:
            if browser:
                await browser.close()
    return results

Keep concurrency low (3–8 pages) when running locally—each headless Chromium instance consumes 100–300MB. For larger lists, cloud browser infrastructure (Browserless, Browserbase) handles the browser pool so you're not resource-limited on your machine.

Where this breaks down: Sites with strict automation requirements at the network and behavioral level. JavaScript-level automation handling helps at low volume; at scale, sites with enterprise-grade access requirements become harder to handle reliably.

Sites with Strict Requirements or Authenticated Access: TinyFish

This is where simple HTTP requests stop being sufficient. Your list includes:

Product pages that return different content to automation than to browsers
Pricing pages that require login using your own authorized account
Sites with strict automation requirements that affect reliability at scale
Authenticated portals where each URL requires an authorized session

For these, maintaining a Playwright-based crawler means:

Managing automation configuration that needs ongoing updates as site requirements evolve
Building session management for authenticated URLs
Handling multi-step login flows and session state
Debugging failures that change based on site configurations you don't control

AI web agents handle this at the infrastructure level. You pass a URL and a goal; the agent handles rendering, infrastructure-level request handling, and authentication for sites where you have authorized access.

import asyncio
import aiohttp
import os

async def crawl_url(session, url: str, goal: str) -> dict:
    async with session.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=aiohttp.ClientTimeout(total=120)
    ) as resp:
        if resp.status != 200:
            return {"url": url, "result": None, "status": "HTTP_ERROR",
                    "error": await resp.text()}
        data = await resp.json()
        # "COMPLETED" means the run finished — not that the goal succeeded.
        # Check for TASK_FAILED / SITE_BLOCKED / TIMEOUT before using result.
        status = data.get("status")
        result = data.get("result")
        if status != "COMPLETED" or result is None:
            return {"url": url, "result": None, "status": status,
                    "error": data.get("error")}
        return {"url": url, "result": result, "status": status}

async def crawl_protected_list(urls: list[str], goal: str, concurrency: int = 10) -> list:
    results = []
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[
                crawl_url(session, url, goal) for url in batch
            ])
            results.extend(batch_results)
            print(f"Processed {min(i+concurrency, len(urls))}/{len(urls)}")
    return results

urls = ["https://protected-site.com/product/1", "https://protected-site.com/product/2"]
goal = "Extract the product name, current price, and availability status. Return as JSON."
results = asyncio.run(crawl_protected_list(urls, goal))

The concurrency limit is determined by your plan—10 concurrent agents on Starter, 50 on Pro. For a 1,000-URL list on Pro, that's 20 sequential batches of 50.

When the math shifts: requests and Playwright are cheaper per-URL on cooperative, stable sites. The cost calculation changes when you include the full stack: cloud browser pools, proxy subscriptions, and ongoing maintenance. In our experience, production scrapers for sites that actively update their access requirements need patches every few weeks — that maintenance overhead accumulates faster than per-URL cost alone suggests. For URL lists with meaningful protected or authenticated content, the total operational cost typically exceeds TinyFish's per-step pricing before you reach production scale.

Handling the Mixed List

Real URL lists are rarely uniform. A supplier monitoring list might include:

60% static pricing pages (requests would work)
30% JavaScript-rendered product tables (Playwright needed)
10% authenticated portals with strict automation requirements (agents needed)

The practical approach: categorize your list before you crawl it. A quick HEAD request or a sample run reveals which URLs respond to simple HTTP requests vs. which require rendering vs. which block automation. Route each category to the appropriate tool. The 10% that requires agents is where reliability actually matters — authentication failures and automation blocks are what stall production workflows, not the cooperative pages.

To classify URLs before routing them, a quick probe is faster than a full crawl:

import httpx
import random

def classify_url(url: str, timeout: int = 10) -> str:
    """Returns 'static', 'js', or 'blocked' based on a quick probe."""
    try:
        r = httpx.get(url, timeout=timeout, follow_redirects=True,
                      headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code in (401, 403, 429):
            return "blocked"
        html = r.text
        js_signals = [
            len(html) < 500,                          # near-empty response
            '<div id="root">' in html,              # React
            '<div id="app">' in html,               # Vue
            "ng-version=" in html,                    # Angular
            "window.__NUXT__" in html,                # Nuxt
            html.count("<p") < 2 and len(html) < 2000, # minimal real content
        ]
        return "js" if any(js_signals) else "static"
    except Exception:
        return "blocked"

# Sample 10% before committing to the full crawl
sample = random.sample(urls, min(50, len(urls)))
categories: dict[str, list] = {"static": [], "js": [], "blocked": []}
for url in sample:
    categories[classify_url(url)].append(url)

print(f"Static: {len(categories['static'])}, JS: {len(categories['js'])}, Blocked: {len(categories['blocked'])}")
# Route the full list to the matching tool based on these proportions

A 429 response means rate-limited — retry with backoff before escalating. A 403 indicates access is blocked or restricted; retrying with the same tool won't help. A near-empty response or JS framework marker means JS rendering is needed. Clean HTML with visible <p> tags is static.

Scale Considerations

List size	Tool	Rough time (10 concurrent)
100–1,000 static	requests/httpx	1–5 min
100–1,000 JS	Playwright	5–20 min
100–1,000 protected	TinyFish agents	10–30 min
10,000+ static	Scrapy	Hours, distributed
10,000+ JS or protected	Infrastructure + agents	Plan accordingly

For very large lists (100K+), distributed architecture matters regardless of tool—whether that's Scrapy's built-in scheduler, a task queue like Celery, or submitting batches to an async agent API and polling for results.

FAQ

What's the fastest way to fetch data from a large URL list in Python?

For static HTML content, httpx with asyncio is the fastest approach—you can process 20–50 URLs simultaneously with a single machine and finish 1,000 URLs in under a minute. The key is async execution: sequential requests would take 10–15x longer for the same list. For JavaScript-rendered content, Playwright in async mode with 5–10 concurrent browser pages is the practical ceiling before memory constraints become a factor on standard hardware.

How do I improve reliability when fetching data from many URLs?

Rate limiting is the first line: 1–2 requests per second per domain for most sites, slower for aggressive protection. Rotate user agents across requests. For moderate protection, requests with a realistic user agent and reasonable delays works. For sites with enterprise-grade automation detection, JavaScript-level automation plugins help at low volume but degrade at scale — TinyFish provides infrastructure-level browser sessions that are more reliable for protected sites at production scale.

Should I use Scrapy or Playwright for a large URL list?

Scrapy if your URLs return static HTML and you need high volume (10K+) with built-in scheduling, retry logic, and output pipelines. Playwright if URLs require JavaScript execution. The two aren't mutually exclusive—Scrapy has a Playwright middleware (scrapy-playwright) that handles JS rendering within Scrapy's architecture. For lists with mixed content types, start with Scrapy for the static subset and use a separate Playwright job for the JS-heavy URLs.

How do I deduplicate URLs before crawling?

Normalize URLs first: lowercase the scheme and domain, sort query parameters alphabetically, strip tracking parameters (utm_*, ref=, fbclid=), and resolve relative URLs to absolute. Python's urllib.parse.urlparse plus a set for deduplication handles most cases. For large lists with near-duplicate URLs (same page, different session IDs), a URL fingerprinting library like w3lib.url.canonicalize_url gives more aggressive deduplication.

When does crawling a URL list require authentication?

When the target pages are behind login walls that your team has authorized access to—supplier pricing portals, internal tools, subscription content, or any page that redirects to a login page for unauthenticated requests. Signs your list needs auth: all results return the same HTML (the login page), response sizes are suspiciously uniform, or you see redirect chains ending at /login. For authenticated list crawling at scale, session management becomes the primary complexity—handling login flows, session expiry, and re-authentication across many concurrent workers. TinyFish handles session management and multi-step login flows for sites where you have authorized account access — you provide credentials, the agent handles the rest.

Try TinyFish Free

500 free steps, no credit card. The fastest way to test whether TinyFish fits your workflow.

Start free →

Web Agents for Sales Intelligence: Conference Lead Extraction at Scale

Tinyfishie — Tue, 19 May 2026 10:06:52 +0000

Conference websites are a signal-dense environment for sales teams. Exhibitors have budget. Sponsors have intent. Speakers have influence. Attendees self-select into a category of people who care enough about a topic to show up in person. For a well-targeted go-to-market motion, a single conference list can be worth more than months of cold outreach to a generic ICP database.

The problem is extraction. Most conference websites are built differently. Exhibitor lists appear in different formats — some as HTML tables, some as filterable grids, some as PDFs, some as paginated directories with inconsistent naming conventions. Speaker pages vary. Sponsor tiers are labeled differently. And most importantly: the information you need is rarely in a format your CRM can ingest directly.

Traditional automation — extraction with CSS selectors — breaks the moment a conference organizer redesigns their website, switches platforms, or adds a new filter layer. Which they do, regularly, because conference websites are rebuilt for each event. An extraction that worked for a 2025 conference often fails completely for the same conference in 2026, because the organizer switched from a custom site to Hopin or Swapcard or their own new platform.

Web agents handle this because they read the page rather than pattern-match against it. This article covers how that works in practice, with working code for each pattern.

Web agents for sales intelligence are automated programs that navigate live web pages — conference exhibitor directories, sponsor listings, speaker profiles — on behalf of a sales team, extracting structured prospect data without requiring custom code per site. Unlike traditional scrapers that depend on fixed CSS selectors, a web agent interprets page content by goal: given the instruction "find all exhibitors and return their company name, website, and category," the agent reads the page as a browser would, handles pagination and dynamic rendering, and returns clean JSON regardless of how the underlying HTML is structured. The result is a scalable data pipeline that monitors dozens of events monthly and feeds enriched leads directly into a CRM.

When does a web agent apply to sales intelligence?

Your lead sources are conference and event websites — exhibitor lists, speaker directories, sponsor pages
Most event sites have different structures — making traditional extraction brittle
You're covering dozens of events monthly — the scale makes manual extraction untenable
You need structured output into your CRM — not PDFs or copy-paste, but clean data your tools can process
Time-to-data matters — conference lead data has a shelf life; the sooner it's in your pipeline, the more valuable it is

The operational problem at scale

For sales teams tracking dozens of conferences per month, each conference website is different. Every time a site updates its layout or changes platforms, platform-specific scrapers break. The engineering overhead of maintaining those scrapers across that many events is disproportionate to the value of any individual conference.

Web agents navigate conference websites by reading page content and structure directly — adapting to each site's layout without requiring custom code per event. When a site changes layouts, the agent adapts. When a new conference is added, it requires a goal prompt, not a new scraper.

The operational outcome: dozens of events monitored monthly, structured lead data delivered automatically, zero custom code required per event.

The technical pattern: structured lead extraction from event sites

📸 IMAGE — Diagram showing web agents extracting structured lead data from diverse conference websites and delivering to CRM

Conference sites present a range of extraction challenges — paginated exhibitor lists, filterable sponsor grids, speaker bio pages. The agent handles all of these with the same underlying approach: describe the goal, let the agent navigate.

Installation

pip install tinyfish
export TINYFISH_API_KEY=sk-tinyfish-*****

Exhibitor and sponsor extraction

import asyncio
import json
from datetime import datetime, timezone
from tinyfish import AsyncTinyFish, BrowserProfile

client = AsyncTinyFish()

CONFERENCE_EVENTS = [
    # URLs below are illustrative — replace with verified event page URLs before running
    {
        "event_name": "SaaStr Annual 2026",
        "exhibitors_url": "https://tech-conference-example.com/exhibitors",
        "target_type": "exhibitor",
    },
    {
        "event_name": "Dreamforce 2026",
        "exhibitors_url": "https://enterprise-summit-example.com/sponsors/",
        "target_type": "sponsor",
    },
    {
        "event_name": "RevOps Summit 2026",
        "exhibitors_url": "https://revenue-summit-example.com/speakers",
        "target_type": "speaker",
    },
]

async def extract_event_leads(event: dict) -> dict:
    target = event["target_type"]

    goal = f"""
    Extract all {target}s listed on this page.

    For each {target}, extract:
    {{
        "company_name": "string",
        "website": "URL or null",
        "category": "industry or booth category if shown, else null",
        "tier": "sponsor tier (Gold/Silver/Bronze/etc.) or null if not applicable",
        "contact_name": "primary contact name if shown, else null",
        "contact_title": "job title if shown, else null",
        "description": "one-sentence company description if shown, else null"
    }}

    If the list is paginated, click through all pages before returning.
    If there is a "Load More" button, click it until all items are loaded.
    Scroll down to ensure all visible items are captured.

    Return JSON:
    {{
        "event_name": "{event['event_name']}",
        "target_type": "{target}",
        "total_count": number,
        "leads": [ ...array of lead objects above... ]
    }}

    If the list requires login to your authorized account to view, return total_count as 0 and leads as [].


![Diagram showing web agents extracting structured lead data from diverse conference websites and delivering to CRM](https://cdn.sanity.io/images/nhc04xln/production/33e59ca62cc7afbfb67a6125992ee957b1abf1b6-1024x1024.png)

# (continued)
    Do not click any registration or payment links.
    """

    response = await client.agent.run(
        url=event["exhibitors_url"],
        goal=goal,
        browser_profile=BrowserProfile.MANAGED,
    )

    # For debugging: response.streaming_url has a live browser replay (valid 24h)
    # response.result is shaped by the goal — leads array is the key output
    result = response.result or {}

    if result.get("status") == "failure":
        return {
            "event_name": event["event_name"],
            "target_type": target,
            "leads": [],
            "error": result.get("reason", "goal_failed"),
            "extracted_at": datetime.now(timezone.utc).isoformat(),
        }

    return {
        "event_name": event["event_name"],
        "target_type": target,
        "total_count": result.get("total_count", 0),
        "leads": result.get("leads", []),
        "extracted_at": datetime.now(timezone.utc).isoformat(),
    }

async def main():
    tasks = [extract_event_leads(event) for event in CONFERENCE_EVENTS]
    results = await asyncio.gather(*tasks)

    total_leads = sum(r["total_count"] for r in results)
    print(f"Total leads extracted: {total_leads} across {len(results)} events")
    print(json.dumps(results, indent=2))

asyncio.run(main())

Output schema

{
  "event_name": "SaaStr Annual 2026",
  "target_type": "exhibitor",
  "total_count": 312,
  "leads": [
    {
      "company_name": "Acme Software",
      "website": "https://vendor-a.example.com",
      "category": "CRM & Sales Tools",
      "tier": null,
      "contact_name": null,
      "contact_title": null,
      "description": "AI-powered CRM for sales teams"
    },
    {
      "company_name": "BuildIt Inc.",
      "website": "https://vendor-b.example.com",
      "category": "DevTools",
      "tier": "Gold Sponsor",
      "contact_name": "Sarah Chen",
      "contact_title": "VP of Sales",
      "description": null
    }
  ],
  "extracted_at": "2026-03-27T09:00:01Z"
}

Enriching leads with company-level data

Raw exhibitor lists give you company names. What your CRM needs is enriched records — employee count, funding stage, tech stack, recent news. A second agent pass can pull this from each company's website or LinkedIn company page.

async def enrich_company(company_name: str, website: str) -> dict:
    goal = f"""
    Extract company information for: {company_name}
    Website: {website or f'search for {company_name} company website'}

    Extract:
    {{
        "company_name": "{company_name}",
        "employee_count_range": "1-10 / 11-50 / 51-200 / 201-500 / 500+ — estimate from About or LinkedIn",
        "funding_stage": "Bootstrapped / Seed / Series A / Series B / Series C+ / Public / Unknown",
        "hq_location": "City, Country or null",
        "industry": "primary industry category",
        "founded_year": number or null,
        "tagline": "one-line description from homepage or null"
    }}

    Navigate to the website's About page or homepage.
    Do not fill in any contact forms.
    """

    response = await client.agent.run(
        url=website,  # Provide the company website URL directly (LinkedIn or company site)
        goal=goal,
        browser_profile=BrowserProfile.MANAGED,
    )

    result = response.result or {}
    if result.get("status") == "failure":
        return {"company_name": company_name, "error": result.get("reason")}

    return result

# Enrich all leads from an extraction run
async def enrich_all(leads: list) -> list:
    tasks = [
        enrich_company(lead["company_name"], lead.get("website"))
        for lead in leads
        if lead.get("company_name")
    ]
    return await asyncio.gather(*tasks)

The two-pass pattern — extract from event site, then enrich from company sites — separates concerns cleanly. The first pass is optimized for navigating conference-specific layouts. The second pass is optimized for company data extraction, which follows more consistent patterns across company websites.

Handling the common edge cases

Paginated exhibitor lists — The goal prompt includes "click through all pages before returning." For very long lists (500+ exhibitors), break into page-by-page runs and aggregate results, rather than running one very long session that risks timeout.

Filterable grids — Some conference sites show exhibitors in a filterable grid by category or tier. Instruct the agent to clear all filters first, or specify which filter to apply: "Select the 'Gold Sponsor' filter before extracting."

PDF exhibitor lists — Some conferences publish exhibitor lists as PDFs linked from the main page. The agent can navigate to the link, open the PDF, and extract text content. Include in the goal: "If the exhibitor list is a linked PDF, open it and extract the company names."

Login-gated lists — Some premium conference platforms require registration with your own account to view the full attendee list. The goal above returns an empty array in this case. For events where you have authorized credentials, add login steps to the goal for your own accounts.

Sites that load content after scroll — The goal includes "Scroll down to ensure all visible items are captured." For infinite-scroll lists, add: "Continue scrolling until no new items appear."

Scaling across your conference calendar

For a sales team tracking a conference calendar, the right architecture runs this as a scheduled job: extract leads from each event site as it goes live, enrich them, and push to your CRM.

CONFERENCE_CALENDAR = [
    {"event_name": "SaaStr Annual", "exhibitors_url": "...", "date": "2026-09-15"},
    {"event_name": "Dreamforce",    "exhibitors_url": "...", "date": "2026-09-17"},
    # add your full event calendar
    # URLs below are illustrative — replace with verified URLs before running
]

# Cost estimate (PAYG at $0.015/credit):
# Extraction: 1 agent run per conference, ~10-20 steps (navigation + pagination)
#   → ~$0.15-$0.30 per conference
# Enrichment: 1 agent run per company, ~4 steps each (navigate + find About + extract)
#   → 300 companies × 4 credits × $0.015 = ~$18 per conference
# Total per conference: ~$18-20 for a fully enriched lead list
#
# Note: run steps vary with site complexity. Starter plan ($15/mo) covers ~1,650 credits —
# enough for roughly 4 full extraction + enrichment runs per month.

Cost breakdown (PAYG at $0.015/credit): extraction runs as one agent session per conference — ~10-20 steps for a paginated list, roughly $0.15-$0.30. The enrichment pass runs one agent session per company: at ~4 steps each, 300 companies cost approximately $18. Total: roughly $18-20 for a fully structured, CRM-ready lead list. Manual research at equivalent depth costs hours of analyst time per event.

Build vs. buy: when extraction is still the right answer

If your conference intelligence needs are narrow — one or two annual events, stable sites, static HTML lists — a simple extraction is cheaper and faster to build. Use it.

The case for web agents compounds as the number of events scales, as event site diversity increases, and as the maintenance cost of broken scrapers accumulates. At dozens of events monthly, each with a different site structure, maintaining platform-specific scrapers is a full-time job that produces inconsistent results. A goal-based agent is a one-time investment that generalizes across events.

The other factor is time-to-data. A extraction that breaks on Monday morning when the conference site updates means your team gets stale data for however long it takes engineering to fix it. An agent that reads the page as a human does degrades more gracefully — it might miss some edge cases, but it doesn't break silently.

Get started

The free tier gives you 500 credits — enough to extract a full exhibitor list from one or two conference sites and see structured output before committing to production volume.

Start free — 500 credits, no credit card required

For sales teams or platforms tracking dozens of events monthly, contact our enterprise team for volume pricing.

FAQ

How does the agent handle conference sites that change every year?

Because the agent reads the page based on intent ("find the exhibitor list") rather than fixed selectors, it adapts to annual redesigns without requiring code changes. The goal prompt may need minor tuning for major structural changes, but it doesn't break the way a selector-based extraction does.

Can this extract attendee lists, not just exhibitors?

Attendee lists are usually login-gated on your own account or not published publicly for privacy reasons. Exhibitor, sponsor, and speaker lists are typically public. For gated attendee data in accounts you are authorized to access, include login steps in the goal with your authorized credentials.

What does COMPLETED status mean?

Infrastructure success — the browser ran. Not data extraction success. Always check result.leads — an empty array with no error may mean the list genuinely has no entries, or it may mean the page structure wasn't handled. Use the streaming_url to debug ambiguous cases.

How do we push the output to our CRM?

The JSON output maps directly to standard CRM fields. Most CRMs (Salesforce, HubSpot, Pipedrive) have APIs or CSV import that accepts this structure. The enrichment pass adds the firmographic fields most CRMs expect.

Is there a limit on how many events we can run concurrently?

Per-plan concurrency limits apply. When exceeded, requests queue automatically — no errors, but later extractions take longer. For a large conference calendar running simultaneously, size your batches to your plan's concurrency limit.

Can this handle conference sites in languages other than English?

Yes. The agent reads the page as it appears — if the site is in German or Japanese, the agent navigates the German or Japanese interface. Include proxy_config with the appropriate country code for geo-restricted sites.

See It in Action

The free tier includes 500 steps — enough to run a complete sales intelligence workflow against real data before committing to a plan.

Start free, no credit card →

Web Agents for Insurance Operations: Prior Auth, Portal Monitoring, and Regulatory Tracking

Tinyfishie — Tue, 19 May 2026 10:02:49 +0000

Web Agents for Insurance Operations: Prior Auth, Portal Monitoring, and Regulatory Tracking

If you work in insurance operations, you already know the portal problem. Every payer has one. Each one is different. Some require two-factor login to your own account. Some time out after 15 minutes of inactivity. Some show prior auth status on page three of a workflow that starts on a dashboard that loads differently depending on which browser you're using.

Multiply that across 50 payers, and you have a problem that doesn't scale with headcount. Hiring more people to click through more portals isn't a solution — it's a deferral. The portals don't get simpler. The volume doesn't go down.

Web agents don't replace the portals. They handle the repetitive navigation tasks that humans currently perform manually — logging in, extracting status, and returning structured data that your systems can actually use. This article covers three use cases where that matters most: prior authorization tracking, multi-pharmacy price monitoring, and real-time regulatory document monitoring.

Scope note: This article covers insurance operations automation patterns using web agents. The use cases below represent technically valid workflows validated through internal testing, not published case studies from specific TinyFish insurance customers.

When does a web agent apply to insurance operations?

Before the use cases: if you're evaluating whether this fits your team's situation, here's the honest framework.

You're checking the same portals repeatedly — prior auth status, claim status, eligibility — on a schedule your staff maintains manually
Portal interfaces vary across payers — no two are identical, making traditional RPA brittle
You need structured output, not screenshots — downstream systems need data, not PDFs
Volume exceeds what staff can cover — 50+ portals, daily or more frequent checks
Login walls to your authorized payer portals are blocking automation — most traditional scrapers stop at the login screen. Web agents using your authorized credentials navigate login flows as a normal part of the workflow

If your team checks fewer than 10 portals and the data isn't time-sensitive, manual workflows may be cheaper. The case for automation compounds as portal count and check frequency increase.

The portal problem: why insurance operations are stuck in 2005

📸 IMAGE — Architecture diagram showing TinyFish web agents checking 50 payer portals in parallel returning structured JSON

Fifty payer portals sounds like a technology problem. It's actually a coordination problem that technology made worse.

Each payer built their portal independently, on their own timeline, for their own internal reasons. The result is a landscape where every portal has a different login flow, a different session timeout, a different way of presenting the same data. Prior auth status might be on the dashboard for one payer, buried in a sub-menu for another, and only visible after clicking through a three-step workflow for a third.

Traditional RPA — the previous generation's answer to this — was built around CSS selectors and fixed screen coordinates. When a payer updates their portal layout, the RPA script breaks. Someone has to find it, diagnose it, and fix it before the monitoring gap compounds. In a 50-portal environment, that's a near-continuous maintenance cycle.

Web agents approach portals through standard browser sessions: reading the page, understanding what's on it, and navigating based on intent rather than coordinates. When a portal layout changes, the goal — "find the prior auth status for claim ID X" — doesn't.

Benchmark: In TinyFish's internal testing, 50 payer portals were checked in 2 minutes and 14 seconds. A staff member doing the same task manually takes 45 minutes or more. At daily frequency, that's roughly 180 staff-hours per month on portal checks alone.

Use case 1: Prior authorization status tracking across 50 portals

Prior auth is time-sensitive in both directions. A pending auth that gets approved needs to trigger the next step in the care workflow. A denial needs to trigger the appeals process. Both require someone — or something — checking portal status on a regular cadence.

The code

import asyncio
import json
from datetime import datetime, timezone
from tinyfish import AsyncTinyFish, BrowserProfile

client = AsyncTinyFish()

PRIOR_AUTH_CHECKS = [
    {
        "payer": "Aetna",
        "portal_url": "https://payer-portal-a.example.com/...",
        "auth_id": "PA-2026-00441",
    },
    {
        "payer": "UnitedHealthcare",
        "portal_url": "https://payer-portal-b.example.com/...",
        "auth_id": "PA-2026-00892",
    },
    # scale to 50+ payers
]

async def check_auth_status(payer: str, portal_url: str, auth_id: str) -> dict:
    goal = f"""
    Find the prior authorization status for auth ID: {auth_id}.

    Log in if prompted using your authorized portal credentials. Navigate to the prior authorization or claims section.
    Locate the record matching auth ID {auth_id}.

    Return JSON with this exact structure:
    {{
        "auth_id": "{auth_id}",
        "status": "approved" or "pending" or "denied" or "not_found",
        "effective_date": "YYYY-MM-DD or null",
        "expiry_date": "YYYY-MM-DD or null",
        "denial_reason": "string or null"
    }}

    If you cannot find the auth ID, return status as "not_found".
    Do not click any Submit or transaction buttons.  # prevents accidental form submissions or transactions during portal navigation
    """

    response = await client.agent.run(
        url=portal_url,
        goal=goal,
        browser_profile=BrowserProfile.MANAGED,
    )

    # For debugging: response.streaming_url contains a live browser replay (valid 24h)
    # Layer 1: infrastructure failure — result is None
    if response.result is None:
        return {
            "payer": payer, "auth_id": auth_id,
            "status": "error", "error": "infrastructure_failure",
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }
    # Layer 2: goal failure — agent ran but task didn't succeed

# (continued)
    result = response.result

    if result.get("status") == "failure":
        return {
            "payer": payer,
            "auth_id": auth_id,
            "status": "error",
            "error": result.get("reason", "goal_failed"),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }

    return {
        "payer": payer,
        "auth_id": auth_id,
        "status": result.get("status"),
        "effective_date": result.get("effective_date"),
        "expiry_date": result.get("expiry_date"),
        "denial_reason": result.get("denial_reason"),
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

async def main():
    tasks = [
        check_auth_status(c["payer"], c["portal_url"], c["auth_id"])
        for c in PRIOR_AUTH_CHECKS
    ]
    results = await asyncio.gather(*tasks)
    print(json.dumps(results, indent=2))

asyncio.run(main())

Note on result handling: status: "COMPLETED" means the browser ran — not that the auth ID was found. A portal that returned a "not found" page will also return COMPLETED, but result.status will be "not_found" as instructed in the goal. Always check result, not just the run status.

Note on login walls: This example assumes portal URLs for portals you are authorized to access that accept authenticated sessions or have login steps navigable by the agent. For portals with 2FA or SSO flows, additional goal steps are needed.

Output schema

[
  {
    "payer": "Aetna",
    "auth_id": "PA-2026-00441",
    "status": "approved",
    "effective_date": "2026-03-01",
    "expiry_date": "2026-08-31",
    "denial_reason": null,
    "checked_at": "2026-03-27T09:00:01Z"
  },
  {
    "payer": "UnitedHealthcare",
    "auth_id": "PA-2026-00892",
    "status": "pending",
    "effective_date": null,
    "expiry_date": null,
    "denial_reason": null,
    "checked_at": "2026-03-27T09:00:04Z"
  }
]

This output feeds directly into your care coordination workflow. Approved auths trigger the next step. Denials trigger the appeals queue. Pending items re-enter the next day's check cycle. The agent produces the data; your existing systems handle the routing.

Use case 2: Drug price monitoring across state transparency portals

Most states now publish drug pricing data through public transparency portals — mandated by law, updated regularly, and freely accessible without a contract relationship. For pharmacy directors, formulary analysts, and benefit consultants, these portals are a legitimate and underused data source. The problem is coverage: 40+ states with different portal designs, different update schedules, and no unified feed.

Monitoring them manually is the same problem as payer portal monitoring — the data exists, accessing it at scale requires either staff time or automation.

import asyncio
import json
from datetime import datetime, timezone
from tinyfish import AsyncTinyFish, BrowserProfile

client = AsyncTinyFish()

# State drug price transparency portals — publicly accessible, no contract required
STATE_DRUG_PORTALS = [
    {
        "state": "CA",
        "url": "https://www.dhcs.ca.gov/provgovpart/pharmacy/Pages/drug-pricing.aspx",
        "drug_name": "metformin 500mg",
    },
    {
        "state": "NY",
        "url": "https://www.health.ny.gov/health_care/medicaid/program/pharmacy/",
        "drug_name": "metformin 500mg",
    },
    {
        "state": "TX",
        "url": "https://hhs.texas.gov/providers/medicaid-prescription-drug-program",
        "drug_name": "metformin 500mg",
    },
    # add remaining states
]

async def check_drug_price(state: str, url: str, drug_name: str) -> dict:
    goal = f"""
    Find the current published price or reimbursement rate for: {drug_name}.

    Navigate to the drug pricing or formulary section.
    Look for a price table, search tool, or downloadable listing.

    Return JSON:
    {{
        "drug_name": "{drug_name}",
        "published_price": number or null,
        "currency": "USD",
        "price_type": "reimbursement rate / retail price / AWP / MAC — whichever applies",
        "effective_date": "YYYY-MM-DD or null",
        "source_label": "label or table header as shown on page"
    }}

    If no price is found, set published_price to null.
    Do not submit any forms or create any accounts.
    """

    response = await client.agent.run(
        url=url,
        goal=goal,
        browser_profile=BrowserProfile.LITE,  # state gov sites typically don't need stealth
    )

    # For debugging: response.streaming_url has a live browser replay (valid 24h)
    result = response.result or {}

    if result.get("status") == "failure":
        return {
            "state": state,

# (continued)
            "drug_name": drug_name,
            "error": result.get("reason", "goal_failed"),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }

    return {
        "state": state,
        "drug_name": drug_name,
        "published_price": result.get("published_price"),
        "currency": result.get("currency", "USD"),
        "price_type": result.get("price_type"),
        "effective_date": result.get("effective_date"),
        "source_label": result.get("source_label"),
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

async def main():
    tasks = [check_drug_price(p["state"], p["url"], p["drug_name"]) for p in STATE_DRUG_PORTALS]
    results = await asyncio.gather(*tasks)
    print(json.dumps(results, indent=2))

asyncio.run(main())

State transparency portals publish reimbursement rates, MAC pricing, and AWP data — the reference prices that feed formulary decisions and benefit design. Aggregating this across 40+ states concurrently takes minutes instead of days, and the output is structured data rather than a stack of PDFs to parse manually.

Use case 3: Real-time regulatory document monitoring

State insurance filings change. Rate approvals, form filings, regulatory bulletins — the information is public, posted on state insurance department websites, and directly relevant to compliance and underwriting decisions. The problem is that 50 states publish on different schedules, in different formats, with no unified feed.

Manual monitoring at this scale means either missing changes or dedicating staff to daily checks across 50 state websites. Neither is sustainable.

The agent approach: run a daily scan across state insurance department filing pages, extract new documents since the last check, and return structured metadata.

import asyncio
import json
from datetime import datetime, timezone
from tinyfish import AsyncTinyFish, BrowserProfile

client = AsyncTinyFish()

STATE_FILING_PAGES = [
    {"state": "CA", "url": "https://www.insurance.ca.gov/0400-news/0200-bulletins/"},
    {"state": "NY", "url": "https://www.dfs.ny.gov/industry_guidance/circular_letters"},
    {"state": "TX", "url": "https://www.tdi.texas.gov/rule/"},
    # add all 50 states
]

LAST_CHECK_DATE = "2026-03-26"  # replace with your stored value

async def check_state_filings(state: str, url: str) -> dict:
    goal = f"""
    Extract all regulatory documents or bulletins published after {LAST_CHECK_DATE}.

    For each document found, return:
    {{
        "title": "document title",
        "published_date": "YYYY-MM-DD",
        "document_url": "full URL",
        "category": "bulletin / circular / rate filing / form filing / other"
    }}

    Return as JSON array. If no new documents found, return an empty array [].
    Do not click any login or subscription prompts.
    """

    response = await client.agent.run(
        url=url,
        goal=goal,
        browser_profile=BrowserProfile.LITE,  # state gov sites typically don't need stealth
    )

    result = response.result or {}

    # Check for goal-level failure before checking result type
    # Without this, a failure dict would pass the isinstance check as non-list and silently return []
    if result.get("status") == "failure":
        return {
            "state": state,
            "new_documents": [],
            "error": result.get("reason", "goal_failed"),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }

    return {
        "state": state,
        "new_documents": result if isinstance(result, list) else [],
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }

async def main():

# (continued)
    tasks = [check_state_filings(s["state"], s["url"]) for s in STATE_FILING_PAGES]
    results = await asyncio.gather(*tasks)

    # Filter to states with new documents
    updates = [r for r in results if r["new_documents"]]
    print(json.dumps(updates, indent=2))

asyncio.run(main())

The output feeds a compliance dashboard, triggers alerts for specific document categories, or routes to the relevant underwriting or legal team. The agent doesn't interpret the documents — it surfaces them. Interpretation stays with your team.

The compliance question: what web agents can and can't do in regulated environments

This is the question that comes up in every insurance technology conversation, so let's address it directly.

What web agents do: navigate browser sessions, extract data from pages, return structured output. They interact with portals through standard browser sessions — navigating to a page, reading what's on it, and returning the relevant data.

What web agents don't do: store protected health information, transmit data across systems, or make clinical or coverage decisions. The agent is a data extraction layer. Where that data goes next — and what compliance obligations apply to that storage and transmission — is determined by your system architecture, not the agent.

On HIPAA: A web agent accessing a payer portal doesn't inherently create new HIPAA obligations beyond those that already exist when a staff member accesses the same portal through the same browser. The agent performs authenticated browser access — reading page content through an authenticated session and returning structured data. The compliance obligations arise from what you do with the extracted data: how it's stored, who can access it, and how it's transmitted. That's your data pipeline's responsibility, not the extraction layer's.

This doesn't mean HIPAA review isn't necessary — it means the review should focus on the full data flow, not just the agent step. Any healthcare or insurance deployment should involve a legal review of the complete architecture. The agent layer is usually not where the compliance complexity lives.

Build vs. buy: when to use TinyFish vs. building your own portal integration

The alternative to a web agent for payer portal access is a direct API integration — but most payer portals don't have APIs. The ones that do (typically large national payers via HL7 FHIR) are worth pursuing first. Where APIs exist, use them.

Web agents become the right tool when:

The portal has no API and no roadmap to add one
The portal interface is stable enough to automate but too variable for traditional RPA
The volume of checks justifies automation over manual workflows
Your team can't maintain custom automation scripts as portals evolve

The build-vs-buy question for the agent infrastructure itself: building a raw HTTP automation layer that handles login walls, session management, 2FA, and concurrent execution is months of engineering work. The free tier — 500 credits, no credit card — is enough to run a proof of concept against your real target portals before that decision needs to be made.

Get started

The free tier gives you 500 credits — enough to run a prior auth status check across 50 portals and see real output before committing to anything.

For healthcare and insurance teams with compliance requirements or enterprise-scale portal monitoring needs, contact our enterprise team.

FAQ

Can web agents handle your own payer portals that require 2FA or SSO login?

Agents can navigate multi-step authentication workflows, including TOTP-based MFA and standard SSO redirects, for portals you're authorized to access. Include the authentication steps for your authorized portal in the goal prompt, or use TinyFish's credential vault for secure credential management. SSO with federated identity is more complex and depends on the specific implementation. Include explicit authentication steps for your authorized portal in your goal prompt and use the credential vault for secure credential storage.

Does this work for all 50 payer portals or just the major ones?

Any publicly accessible portal — including smaller regional payers — is reachable. Success rate depends on the portal's complexity and technical protection level, not its size. The agent navigates based on what it sees on screen, so portal size is not a limiting factor.

What happens when a payer portal updates its layout?

The goal-based approach is more resilient to layout changes than selector-based RPA. When a portal redesigns, the goal — "find prior auth status for ID X" — remains valid. In practice, goal prompts may need minor tuning after major redesigns, but the maintenance burden is substantially lower than maintaining CSS selectors.

Is the extracted data HIPAA-compliant?

HIPAA obligations arise from how extracted data is stored, accessed, and transmitted — not from the extraction mechanism itself. A web agent accessing a payer portal creates the same data handling obligations as a staff member accessing the same portal. The compliance review should cover your full data pipeline, not just the agent step. Any healthcare deployment should involve legal review of the complete architecture before going to production.

How do I handle portals that time out mid-session?

Include session management instructions in the goal: "if a timeout or re-login prompt appears on your authorized portal, log in again with the provided credentials and resume." The agent follows standard re-authentication flows natively.

See It in Action

The free tier includes 500 steps — enough to run a complete insurance data extraction workflow against real data before committing to a plan.

Start free, no credit card →

When Web Agents Fail: Debugging Goal-Based Automation

Tinyfishie — Tue, 19 May 2026 10:01:42 +0000

Web agent debugging separates infrastructure failures (the run didn't complete) from goal failures (the run completed but returned wrong data) — two categories that require completely different fixes.

The agent ran. It returned COMPLETED. But the price field is empty — or it pulled the wrong product, or it returned results from only the first page of a multi-page portal.

That's the problem with agent debugging: the failure isn't explicit. A Playwright script throws a timeout or a selector error. An agent returns a status that says it finished — and the problem only surfaces when you check the data downstream.

Understanding why agents fail differently from scripts is the foundation of effective web agent debugging. This guide covers both failure layers and a systematic approach to each.

Why Agent Failures Feel Different from Script Failures

A scripted automation fails predictably. The selector didn't match, the element wasn't found, the HTTP request timed out — there's a specific line where execution stopped and an error you can read.

A goal-directed agent has no single execution path. It plans, navigates, adapts. That flexibility is what makes it useful on dynamic pages, but it also means there's no breakpoint. The agent chose a sequence of steps you didn't write, and somewhere in that sequence it went wrong — but the run still completed.

This is a structural difference, not a reliability problem. Agents trade determinism for adaptability. The tradeoff is that debugging requires a different mental model.

The good news: most agent failures cluster into two categories — a pattern TinyFish has observed consistently across production runs at different scales and page types. Each category has a clear diagnostic path. Once you know which type you have, the fix is usually straightforward.

Related: What Is an Agentic Browser?

Two Distinct Failure Types (And Why They Need Different Fixes)

The most important diagnostic question isn't "what went wrong" — it's "did the run complete at all?"

Layer 1 — Infrastructure failure: The run didn't complete. The result is None, the status indicates an error, or you got no response at all.

Layer 2 — Goal failure: The run completed (status: COMPLETED), but the result is wrong, incomplete, or inconsistent with what you asked for.

These two failures have completely different root causes and require different debugging approaches. Treating a goal failure as an infrastructure problem sends you in the wrong direction immediately.

The check pattern:

import requests

TINYFISH_API_KEY = "your_api_key"

response = requests.post(
    "https://agent.tinyfish.ai/v1/automation/run",
    headers={"X-API-Key": TINYFISH_API_KEY},
    json={
        "goal": "Extract the product name, price, and availability from this listing",
        "url": "https://portal.example.com/products/item-123"
    }
)


![Decision tree showing two web agent failure paths: infrastructure failure on the left and goal failure on the right, with diagnostic steps for each](https://cdn.sanity.io/images/nhc04xln/production/b17d16ff1e2e2de063ea8f961517ea4a9a51df86-1024x1024.png)

# Layer 1: Did the run complete?
if response.status_code != 200:
    print(f"Infrastructure failure: HTTP {response.status_code}")
    # Debug infrastructure (see next section)
else:
    result = response.json()
    if result.get("status") != "COMPLETED":
        print(f"Infrastructure failure: {result.get('status')}")
    else:
        # Layer 2: Did the goal succeed?
        data = result.get("result", {})
        if not data.get("price"):
            print("Goal failure: run completed but price missing")
            # Debug goal specification (see goal failure section)

COMPLETED is not the same as "goal succeeded." It means the infrastructure layer finished. The goal evaluation is always separate.

Infrastructure Failures: The Run Didn't Complete

Infrastructure failures mean the run itself didn't complete. The result status tells you what happened:

Status	What it means	First thing to check
`TASK_FAILED`	The agent encountered a condition it couldn't continue from	Check session state: does the page require authentication or specific cookies?
`SITE_BLOCKED`	The target site's configuration prevented this run from completing	Review whether the site requires specific access patterns, session state, or authentication on an account you control
`MAX_STEPS_EXCEEDED`	The agent reached the step limit without completing	The goal is too broad. Split it: instead of "Extract all product data and submit the order form," use two separate calls — one to extract, one to submit. Each sub-task should complete in fewer than 20 steps.
`TIMEOUT`	The run exceeded the time limit	Check if the page loads slowly; try increasing timeout in your request parameters
`None` response	Network or API issue	Check your API key and network connectivity

On SITE_BLOCKED: This status means the target site's configuration — session requirements, access patterns, or content policies — prevented the run from completing. Check what the site actually requires: does it need authentication on an account you control? Does it require specific session cookies? Retry after addressing the underlying access requirement, not the symptom.

Infrastructure failures generally aren't retryable without changing something. A TIMEOUT might resolve on retry; a SITE_BLOCKED on a page that requires authentication won't.

Goal Failures: COMPLETED Doesn't Mean Done

Goal failures are subtler: the run completed, status is COMPLETED, but the data is wrong or missing. Three common patterns:

Pattern 1: Missing fields
The agent returned a result but specific fields are empty or null.

Symptom: result.data.get("price") is None, but the page clearly shows a price.
Cause: The goal didn't specify the field explicitly enough. The agent extracted what it interpreted as relevant — just not the right thing.
Fix: Name the fields explicitly in the goal.

Pattern 2: Wrong element
The agent extracted data — but from the wrong element or section of the page.

Symptom: The data looks plausible but doesn't match the specific item you targeted.
Cause: The page has multiple similar-looking elements and the goal didn't specify which one.
Fix: Add context about the element's location or distinguishing characteristic.

Pattern 3: Inconsistent results across runs
The same goal produces different results when run multiple times on the same URL.

Symptom: Run 1 extracts $29.99; Run 2 extracts $24.99; the page shows both a regular and sale price.
Cause: The goal doesn't specify which value to use when multiple options exist.
Fix: Add specificity: "Extract the current sale price, not the original price."

To see which pattern you have, use streaming to watch the agent's steps:

import requests

TINYFISH_API_KEY = "your_api_key"

with requests.post(
    "https://agent.tinyfish.ai/v1/automation/run-sse",
    headers={
        "X-API-Key": TINYFISH_API_KEY,
        "Accept": "text/event-stream"
    },
    json={
        "goal": "Extract the product name, price, and availability",
        "url": "https://portal.example.com/products/item-123"
    },
    stream=True
) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode())  # Watch what the agent navigates and extracts

The step stream shows what the agent actually saw and acted on — which element it identified as the price, which page state it was in when it extracted. This is the closest equivalent to adding console.log to a script.

Using the Browser API for Step-Level Debugging

When the step stream shows the agent going wrong but you're not sure why, the next step is to inspect the actual page state the agent encountered.

The Browser API gives you a raw CDP connection to inspect the exact DOM state the web agent encountered — useful when step-stream output is ambiguous. It's a separate TinyFish product from the Web Agent and requires its own session.

Identify the failing step from the stream, then use the Browser API to reproduce that step and inspect the DOM directly:

import requests
from playwright.sync_api import sync_playwright

TINYFISH_API_KEY = "your_api_key"

# Create a Browser API session
response = requests.post(
    "https://api.browser.tinyfish.ai",
    headers={"X-API-Key": TINYFISH_API_KEY},
    json={}
)
session = response.json()
cdp_url = session["cdp_url"]  # WebSocket endpoint for CDP connection

# Connect Playwright to the TinyFish cloud browser
with sync_playwright() as p:
    browser = p.chromium.connect_over_cdp(cdp_url)
    page = browser.contexts[0].pages[0]

    # Navigate to where the agent went wrong
    page.goto("https://portal.example.com/products/item-123")

    # Inspect the actual DOM state
    price_elements = page.query_selector_all(".price, [data-price], .product-price")
    for el in price_elements:
        print(el.inner_text(), el.get_attribute("class"))

This shows you exactly what the agent was looking at: how many price elements were on the page, what loaded dynamically after the initial render. Once you understand the page structure the agent encountered, the goal fix usually becomes clear.

The Real Fix Is Usually the Goal Text

Most goal failures are fixable by improving the goal text. The agent is doing what it was asked — the ask just wasn't specific enough.

Fix 1: Specify the output schema explicitly — start here first

❌ Vague:

Get the product information from this page

✅ Specific:

Extract exactly these fields:
- product_name: the main product title (not the category or brand)
- price: the current price in USD (use the sale price if shown, otherwise the regular price)
- availability: one of 'in_stock', 'out_of_stock', or 'limited'
Return null for any field not found on the page.

The second version tells the agent what to call each field, where to look for it, how to handle ambiguity, and what to do when something is missing. The reason this works: explicit schemas reduce the agent's interpretation space. When the goal says "return null for any field not found," the agent no longer needs to decide between returning an empty string, a placeholder, or a label text — there's one valid answer.

Fix 2: Scope the task tightly

❌ Broad:

Get all the prices from this site

✅ Scoped:

Extract the price of the item currently shown on this product detail page.
Do not navigate to other pages. Stop after extracting the price.

Explicit boundaries prevent the agent from interpreting a broad goal broadly. Without scope constraints, an agent that can navigate will navigate — not because something went wrong, but because the goal didn't say not to.

Fix 3: Add handling for missing data

If the price is not visible on the page, return null for the price field.
Do not return 'price unavailable' or similar strings — return null.

Agents improvise when data is missing unless you tell them not to. Without a "return null" instruction, an agent might reasonably return the label text, a related value, or an empty string — all of which look different in downstream code.

Fix 4: Add page context when the layout is unusual

The product price appears below the product image and above the Add to Cart button.
Extract only that price — ignore any prices shown in the 'You might also like' section.

This works because the agent uses your spatial description as a disambiguation rule. When it encounters two elements that both look like prices, "below the product image" narrows it to one.

When the Problem Is the Site, Not the Goal

Sometimes the agent is following the goal correctly, but the page itself behaves inconsistently across runs.

Signs it's a site problem:

The goal text is specific and well-scoped
Results vary across runs on the same URL with the same goal
The step stream shows the agent reaching the right element but extracting different values

Common site-side causes:

Content loads asynchronously: The agent may extract a value before the final price loads, or after a different variant's price has already populated the element.

Conditional elements: Promotional banners, A/B test variants, or personalized content may appear in some sessions but not others, changing the page structure the agent encounters.

Session state: Prices, availability, or content may differ based on login state or session cookies. If the agent runs without the expected session state, it sees a different version of the page.

Authentication expiry: For portals that require authentication on accounts you control, the session may expire mid-run on longer tasks. The agent completes but sees the logged-out version of the page for later steps.

Diagnostic approach: Run the same goal 3 times on the same URL and compare results. If results differ: site problem. If results are consistently wrong in the same way: goal problem.

For content timing issues, add an explicit instruction to the goal: "Wait for the price element to fully load before extracting."

Start debugging your web agent

500 free steps, no credit card required. Run the two-layer check pattern against your actual targets.

FAQ

What does COMPLETED mean in web agent results?
COMPLETED means the infrastructure layer finished without an error — not that the goal succeeded. Always validate returned data: check that required fields are present, non-null, and match expected values. This is the single most common source of silent bugs in agent pipelines.

Why does my web agent return different results each time?
Run the same goal 3 times on the same URL. If results vary: site problem (dynamic content, A/B tests, session state). If results are consistently wrong in the same way: goal problem. The distinction matters because the fix is completely different.

Can I retry automatically when an agent fails?
For TIMEOUT and transient infrastructure errors: yes, a single retry is reasonable. For SITE_BLOCKED or TASK_FAILED: retrying without changing the goal or session state rarely helps. For MAX_STEPS_EXCEEDED: the goal needs to be split before retrying — adding retries without splitting will hit the limit again.

What does SITE_BLOCKED mean?
The target site's configuration prevented the run from completing — typically a session, authentication, or access pattern requirement. Identify what the site actually requires for the content you're trying to access, address that requirement, then retry.

How do I know when to stop tuning the goal and accept the result?
If you've applied Fix 1 (explicit schema) and the agent still returns inconsistent results after 3 runs, the problem is likely the page — not the goal. At that point, use the Browser API to inspect what the page actually serves to an automated session, rather than continuing to refine goal text.

Related reading:

Want to scrape the web without getting blocked? Try TinyFish — a browser API built for AI agents and developers.

How to Write Goals That Get Consistent Results from AI Agents

Tinyfishie — Tue, 19 May 2026 09:57:25 +0000

When you write a prompt for a chatbot, the worst outcome is a bad answer. You try again.

When you write a goal for a web agent, the worst outcome is wrong data that passes downstream validation, an extraction that returns half the results on every run, or — in a transactional workflow — an action you didn't intend. A bad goal doesn't fail loudly. It fails quietly, returning a result that looks plausible until someone checks it against the source.

Goal writing for web agents has its own patterns. This guide covers the four elements every reliable goal needs, and a reference table of goal patterns for common use cases.

Why Goal Writing Is Different from Prompt Engineering

Prompt engineering is about getting a model to respond well. Goal writing is about instructing an agent to act reliably.

The distinction matters in practice:

A prompt gets evaluated once. The model reads it, generates a response, done.
A goal gets interpreted at each decision point as the agent navigates. Every time the agent decides whether to click something, where to look next, or how to structure its output, it refers back to the goal.

This means a vague goal doesn't produce one bad response — it produces inconsistent decisions across every step of every run. Two agents running the same vague goal against the same page can take different paths and return different results.

Prompt engineering advice — be specific, give examples, define the format — applies here too. But goal writing adds two things that chatbot prompting doesn't need: scope (where on the page to operate) and fallback behavior (what to return when expected elements aren't there).

The Four Elements of a Reliable Goal

Every goal that produces consistent results has these four properties. Missing any one of them is the most common source of extraction failures.

Element	What it defines	What happens without it
Scope	Which section/page to focus on	Agent wanders, extracts from wrong sections
Task	What specific action or extraction to perform	Agent guesses what counts as completion
Output format	The structure of the returned data	Each run returns data shaped differently
Fallback behavior	What to do when expected elements are missing	Silent failures, partial results, null fields

Weak goal:

Get the product information.

Strong goal:

In the main product listing grid (not featured or sponsored products),
extract the name, current price, and availability status for each product.
Return as JSON array: [{"name": string, "price": string, "availability": "in_stock" | "out_of_stock" | "unknown"}].
If a field is not visible for a product, use null — do not navigate to individual product pages.
Limit to 20 products.

The weak goal has none of the four elements. The strong goal has all of them.

Specifying Output Format

Unspecified output format is the single most common cause of inconsistent agent results. When you don't define the structure, the agent makes its own decisions — and those decisions can vary between runs.

From free-form to schema

from tinyfish import TinyFish

client = TinyFish()

# ❌ No output format — agent decides the structure
response = client.agent.run(
    url="https://marketplace.example.com/listings",
    goal="Extract the product listings."
)
# Result varies: sometimes a list, sometimes a dict, sometimes nested

# ✅ Explicit schema — consistent every time
response = client.agent.run(
    url="https://marketplace.example.com/listings",
    goal=(
        "Extract each product listing on this page. "
        'Return as JSON array: [{"id": string, "title": string, '
        '"price": number, "currency": "USD", "seller": string}]. '
        "If a field is not present, use null."
    )
)

Choose the right schema style for your use case

For structured extraction (most common): Specify field names, types, and what null means.

Return as JSON: {"title": string, "price": number, "in_stock": boolean}

For list extraction with many items: Define the array shape and element schema.

Return as JSON array, one object per result:
[{"rank": number, "url": string, "title": string, "snippet": string}]

For multi-value fields: Be explicit about how to handle them.

If multiple prices are shown (sale price + original price), return both:
{"sale_price": number, "original_price": number, "currency": string}
If only one price is shown, use {"price": number, "currency": string}

For boolean fields: Define what each state means.

"availability": true if "In Stock" or "Available" is shown, false if "Out of Stock"
or "Unavailable", null if no availability information is visible

Scoping Goals to Specific Page Sections

Broad goals produce broad results. A page that has a main listing grid, a featured products sidebar, a recently viewed section, and a sponsored row will produce overlapping, duplicate data if your goal doesn't specify which section to use.

Add location context

# ❌ Broad — agent extracts from all sections
"Extract all product prices on this page."

# ✅ Scoped — agent focuses on the right section
"In the main product listing grid (not the featured sidebar, sponsored section,
or recently viewed row), extract the price and name for each product."

Scope patterns by page type

E-commerce listing page:

In the main search results grid, extract... (not sponsored listings at the top)

Pricing page with comparison table:

In the pricing comparison table, extract each plan's name, monthly price,
and the features listed under 'Included'. Do not extract from the FAQ section.

Search results:

Extract the organic search results (not ads, not 'People also ask', not featured snippets).
For each result: title, URL, and snippet text.

Portal with sidebar navigation:

In the main content area (not the sidebar navigation), extract...

When in doubt, describe what NOT to include. Exclusions are often easier to specify than inclusions.

Writing Fallback Instructions

Silent failures happen when the agent completes a run, returns a result, but the expected element wasn't present — and the goal gave no instruction for that case. The agent fills the gap with its best guess.

Explicit fallback instructions tell the agent exactly what to return when something is missing:

Pattern: missing field

Extract the product's SKU from the product details section.
If no SKU is visible on this page, return null for that field.
Do not navigate to other pages to find it.

Pattern: multiple values found

Extract the current price. If both a sale price and a regular price are shown,
return both: {"sale_price": number, "original_price": number}.
If only one price is shown, return {"price": number}.

Pattern: conditional content

Extract the inventory count if displayed. If inventory count is not shown
but the item is listed as available, return {"in_stock": true, "quantity": null}.
If the item shows "Out of Stock", return {"in_stock": false, "quantity": 0}.

Pattern: pagination

Extract listings from this page only. If the page shows fewer than 10 listings,
return what's available — do not navigate to the next page.

The rule: every state the page could be in should have an explicit instruction. If your goal covers the happy path but not the edge cases, you'll get inconsistent results whenever a page varies from the expected state.

Safety Instructions for Transactional Workflows

For goals that involve forms, account actions, or any workflow with write access, explicitly define what the agent should and should not do. Safety instructions don't limit what your agent can do — they scope the task to precisely the intended action.

response = client.agent.run(
    url="https://portal.example.com/orders",
    goal=(
        "Log in using the provided credentials. "
        "Navigate to the Orders section and extract the 10 most recent orders. "
        "For each order: order ID, date, status, and total amount. "
        "Return as JSON array. "
        "IMPORTANT: Do not click any buttons that modify order status. "
        "Do not submit any forms. Do not proceed to any checkout or payment screens. "
        "If you reach a page that requires payment information, stop and return "
        '{"error": "reached_payment_page"}.'
    )
)

The safety instruction block at the end of a transactional goal is the standard pattern:

IMPORTANT: Do not [specific prohibited action]. Do not [second prohibited action].
If you encounter [edge case], stop and return {"error": "[description]"}.

Authenticated workflows operate on accounts you're authorized to access — safety instructions define the scope of what the agent does within that access, giving you precise control over each run.

Goal Patterns by Use Case

Copy-pasteable patterns for common scenarios. Replace the bracketed placeholders with your specifics.

Use case	Goal pattern	Key elements to customize
Product price extraction	`"In the [section] on this page, extract the [name/price/availability] for each product. Return as JSON array: [{\"name\": string, \"price\": number, \"currency\": string}]. If price is not shown, use null."`	Section name, field list, currency
Search result extraction	`"Extract the organic search results (not ads). For each: title, URL, snippet. Return as JSON array, max [N] results."`	N, any additional fields
Multi-step form submission	`"Fill the [form name] with: [field]: [value], [field]: [value]. Click [button]. IMPORTANT: Do not proceed past the confirmation screen. Return the confirmation message or order ID shown."`	Form fields, stop condition
Authenticated portal data	`"Navigate to [section] using the provided credentials. Extract [data]. Return as JSON. Do not submit any forms or modify any data."`	Section, data fields
Price comparison across pages	`"Extract the current [plan name] price from the pricing page. Return: {\"plan\": string, \"monthly_price\": number, \"annual_price\": number, \"currency\": string}. Use null for any price not shown."`	Plan name, price fields
Inventory monitoring	`"Check the availability status of [product identifier] on this page. Return: {\"available\": boolean, \"quantity\": number or null, \"status_text\": string}. Capture the exact text shown for availability."`	Product ID, status fields
Document/content extraction	`"Extract the [content type] from the [section] of this page. Do not extract navigation, footers, or sidebar content. Return as plain text preserving paragraph breaks."`	Content type, exclusions

Test these patterns on your actual target pages.

TinyFish gives you 500 free credits to run goal-based agents against any URL — no credit card required. Start with one of the patterns above and see what consistent, structured output looks like.

Frequently Asked Questions

How long should an agent goal be?

As long as it needs to be to cover all four elements — scope, task, output format, and fallback behavior. A one-line goal that doesn't specify output format will produce inconsistent results. A five-sentence goal that covers all four elements will produce consistent results. Length isn't a signal of quality; completeness is. In practice, most reliable goals run 50–150 words.

Should I put the output format at the beginning or end of the goal?

End. Start with scope and task so the agent understands what it's doing before it encounters the format instruction. Ending with the format means the agent is already oriented when it reads the schema — it's not trying to simultaneously understand what it's extracting and how to structure it.

Can I use the same goal for multiple URLs?

Yes — as long as the page structure is consistent across those URLs. If the target field (price, availability, etc.) appears in a different location or under a different element type on some pages, your results will vary. For large URL lists with structural variance, write a classification step first and route to different goals by page type.

What's the difference between a goal and a script (Playwright, Selenium)?

A script defines every step explicitly: click this, wait for that, read this selector. A goal defines the desired outcome and lets the agent figure out the steps. Scripts are brittle when page structure changes; goals adapt. Goals require more careful output specification because there's no explicit selector telling the agent exactly which element to read — the agent infers from the goal text.

How do I debug a goal that's producing inconsistent results?

Use streaming to watch the agent's steps during a run — this shows you where it diverges from the expected path. Then check: (1) Is the scope ambiguous? The agent might be looking in the wrong section. (2) Is the output format specified? Unspecified format produces structural variation. (3) Is there a fallback for the failing case? A missing fallback produces silent failures. See the full debugging guide for the complete diagnostic sequence.

Build a Real-Time Medicine Price Comparison Tool with Parallel AI Agents

Tinyfishie — Tue, 19 May 2026 09:57:24 +0000

TinyFish parallel agents are cloud-based browser sessions that run simultaneously across multiple pharmacy websites and return structured pricing data — product name, price, dosage form, and stock status — normalized into a consistent schema for side-by-side comparison.

GoodRx works well for US consumers looking up drug prices at major chains. For developers building price transparency tools in other markets — or for any market where pharmacy pricing APIs don't exist — the only real-time data source is the pharmacy's own website. And pharmacy websites don't offer APIs.

This tutorial builds a real-time medicine price comparison tool using parallel TinyFish agents, using Vietnamese pharmacy chains as the concrete example. The architecture generalizes to any market: replace the pharmacy URLs and update the currency normalization logic.

Build a medicine price comparison tool in 4 steps:

Define your pharmacy list with search URLs
Write a goal prompt that extracts consistent fields across different pharmacy sites
Run agents in parallel with Promise.allSettled
Normalize price strings and rank by cost

Why Pharmacy Price Comparison Is Hard to Build

Pharmacy pricing is inherently fragmented. In most markets, chains set their own prices independently — and they don't publish those prices through APIs.

The engineering problems stack up quickly:

No unified API — unlike retail (Amazon Product Advertising API) or travel (OTA affiliate APIs), pharmacy chains don't have a developer ecosystem. Each chain is its own data island.
Price variation is real and fast-moving — prices change with promotions, supplier contracts, and regional stock. Data from last week is often wrong. Real-time extraction is the only reliable source.
Sequential scraping multiplies wait time — checking 5 pharmacy chains one after another means 5× the latency. A user waiting for a price comparison shouldn't wait 5× longer because the architecture is sequential.

Parallel browser agents solve all three: no API required, results are real-time, and checking 5 chains takes the same wall-clock time as checking one.

Approach	Best when	Breaks when
Parallel browser agents	No API exists, multi-market, real-time data needed	High-frequency automated runs (100+/hour)
Sequential scraping	Single pharmacy, low request volume	Latency is 5× longer per chain at scale
Third-party API (GoodRx, etc.)	US market, high-volume production use	API unavailable in your market or too expensive

Access scope: This tool reads only public product search pages — the same pricing data any consumer sees when visiting a pharmacy website. It does not access prescription systems, patient portals, or any authenticated pharmacy resources.

For healthcare developers, this pattern applies to any pharmacy market. The example below uses five Vietnamese chains — Long Chau, Pharmacity, An Khang, Guardian, and Medicare — because Vietnam has no pharmacy price API and a fragmented pricing landscape. The same code runs for pharmacy chains in any country.

Choosing the right TinyFish API for this project: TinyFish offers two tools with different cost profiles. The Fetch API (api.fetch.tinyfish.ai) retrieves a page at a known URL — 1 credit = 15 pages. The Agent API (agent.tinyfish.ai) handles pages requiring a search query or navigation — 1 credit per step. Most pharmacy price lookups require a search step before prices appear, so this tutorial uses Agent API. If a product has a stable direct URL on a given pharmacy's site, Fetch API is significantly cheaper for that call.

The Parallel Agent Architecture

One agent per pharmacy, all running simultaneously. Each agent navigates to the pharmacy's search page, finds the medicine, and returns structured data. Promise.allSettled ensures that a slow or temporarily unavailable pharmacy doesn't delay results from the others.

Prerequisites: Node.js 18+, TypeScript, and a TinyFish API key.

npm install @tiny-fish/sdk
npm install -D ts-node typescript
export TINYFISH_API_KEY=your_key_here

Place index.ts and normalize.ts in the same directory, then run:

npx ts-node index.ts

import { TinyFish } from "@tiny-fish/sdk";
import { normalizePharmacyResult } from "./normalize";

const client = new TinyFish(); // reads TINYFISH_API_KEY from env

// URLs verified May 2026 — verify before deploying, as pharmacy search paths change
const PHARMACIES = [
  { name: "Long Chau",   url: "https://pharmacy-a.example.com/tim-kiem?key=" },
  { name: "Pharmacity", url: "https://pharmacy-b.example.com/search?q=" },
  { name: "An Khang",   url: "https://pharmacy-c.example.com/catalogsearch/result?q=" },
  { name: "Guardian",   url: "https://pharmacy-d.example.com/search?query=" },
  { name: "Medicare",   url: "https://pharmacy-e.example.com/search?s=" },
];

const buildGoal = (medicineName: string): string => `
  Search for "${medicineName}".
  Find the medicine in the search results and return a JSON object with:
  - productName: string (exact name as displayed)
  - price: string (price exactly as shown, including currency symbols and separators)
  - currency: "VND"
  - dosageForm: string (tablet, capsule, liquid — in Vietnamese or English as shown)
  - stockStatus: string (in stock / out of stock — in Vietnamese or English as shown)
  - url: string (direct product page link)

  Return the single most relevant result for "${medicineName}".
  If no match is found, return null.
`;

async function compareMedicinePrices(medicineName: string) {
  const query = encodeURIComponent(medicineName);

  const requests = PHARMACIES.map((pharmacy) =>
    client.agent
      .run({
        url: pharmacy.url + query,
        goal: buildGoal(medicineName),
      })
      .then((response) => {
        // response.result is the parsed JavaScript value returned by the agent.
        // null is a valid result — the medicine was not found at this pharmacy.
        const raw = response.result as { productName?: string; price?: string; dosageForm?: string; stockStatus?: string; url?: string } | null;
        return {
          pharmacy: pharmacy.name,
          result: raw ? normalizePharmacyResult(raw) : null,
        };
      })
  );

  const settled = await Promise.allSettled(requests);

  const results = settled.map((r, i) => ({
    pharmacy: PHARMACIES[i].name,
    // error: true = infrastructure failure (timeout/network), distinct from null result (not found)
    ...(r.status === "fulfilled" ? r.value : { result: null, error: true }),
  }));

  // Sort found results by price (ascending), unfound results last
  return results.sort((a, b) => {
    if (!a.result?.priceVnd) return 1;
    if (!b.result?.priceVnd) return -1;
    return a.result.priceVnd - b.result.priceVnd;
  });
}

// Usage
const results = await compareMedicinePrices("Paracetamol 500mg");
results.forEach((r) => {
  if (r.result) {
    console.log(`${r.pharmacy}: ${r.result.priceVnd?.toLocaleString()}₫ — ${r.result.stockStatus}`);
  } else {
    console.log(`${r.pharmacy}: not found${r.error ? " (error)" : ""}`);
  }
});

Concurrency note: The Free plan supports 2 concurrent agent runs — 5 pharmacies run in approximately 3 batches. The Starter plan (10 concurrent) runs all five simultaneously. Each agent step consumes 1 credit — a typical product search runs 3–6 steps.

Price Normalization — The Real Engineering Challenge

The parallel agents are the easy part. Price strings are the hard part.

Five pharmacy websites display the same price in five different formats:

Long Chau:   "125,000₫"
Pharmacity:  "125.000 VNĐ"
An Khang:    "125.000đ"
Guardian:    "125,000"
Medicare:    "125000 đồng"

All five mean the same thing: 125,000 Vietnamese dong. But if you put these strings side by side without normalization, they're incomparable. The sort function breaks. The low-price highlight breaks. The entire comparison breaks.

Create normalize.ts:

// normalize.ts

function parseVndPrice(raw: string): number | null {
  // Remove all currency labels and symbols
  const stripped = raw
    .replace(/₫|VNĐ|VND|đồng|dong|đ/gi, "")
    .trim();

  // VND uses either comma or dot as the thousands separator.
  // Key insight: if there's no decimal component after the separator,
  // it's a thousands separator — not a decimal point.
  // "125,000" → 125000. "125.000" → 125000. "125000" → 125000.
  const digits = stripped.replace(/[^0-9]/g, "");
  const num = parseInt(digits, 10);
  return isNaN(num) ? null : num;
}

const DOSAGE_MAP: Record<string, string> = {
  "viên nén": "tablet",
  "viên nang": "capsule",
  viên: "tablet",
  ống: "vial",
  gói: "sachet",
  chai: "bottle",
  hộp: "box",
};

function normalizeDosageForm(raw: string): string {
  const lower = raw.toLowerCase();
  for (const [vi, en] of Object.entries(DOSAGE_MAP)) {
    if (lower.includes(vi)) return en;
  }
  return raw;
}

function normalizeStockStatus(
  raw: string
): "in_stock" | "out_of_stock" | "limited" | "unknown" {
  const lower = raw.toLowerCase();
  if (lower.includes("còn hàng") || lower.includes("in stock")) return "in_stock";
  if (lower.includes("hết hàng") || lower.includes("out of stock")) return "out_of_stock";
  if (lower.includes("sắp hết") || lower.includes("limited")) return "limited";
  return "unknown";
}

export function normalizePharmacyResult(raw: {
  productName?: string;
  price?: string;
  dosageForm?: string;
  stockStatus?: string;
  url?: string;
}) {
  return {
    productName: raw.productName ?? null,
    priceVnd: raw.price ? parseVndPrice(raw.price) : null,
    currency: "VND",
    dosageForm: raw.dosageForm ? normalizeDosageForm(raw.dosageForm) : null,
    stockStatus: raw.stockStatus ? normalizeStockStatus(raw.stockStatus) : "unknown",
    url: raw.url ?? null,
  };
}

Why capture price as a string first. Asking the agent to return the price as a number risks losing the formatting context needed to parse it correctly. "125.000" is ambiguous: is that 125,000 VND or 125.00 USD? Capturing the raw display string — "125.000 VNĐ" — preserves the currency label and makes the parsing logic unambiguous. Parse on your side, where you know the market context.

Generalizing to other currencies. Update parseVndPrice for your local format. European currencies often use period as thousands separator and comma as decimal (€1.250,50). US/UK use comma as thousands separator. The normalization pattern is the same — strip labels, identify the separator convention for your market, parse to integer or float.

Adapting the Pattern to Your Pharmacy Market

To adapt for a different country:

Replace PHARMACIES URLs with local chain search pages — any URL that accepts a medicine name query parameter works
Update the goal prompt if your market uses different field terminology
Update parseVndPrice for local currency format and separators
Update DOSAGE_MAP if the language differs (the agent extracts in whatever language the site uses)

Extension: price alert emails. Run the comparison on a schedule and compare with the previous day's output. When a price drops below a threshold, send an email via nodemailer or a transactional email service.

Extension: brand vs. generic comparison. Add a brandType field to the goal prompt: "brandType: 'brand' | 'generic' | 'unknown' (based on whether the product name contains the generic INN or is a brand name)".

Extension: trend tracking. Save each run's output to a database table keyed by (pharmacy, productName, date). Querying across dates gives you a price trend line per pharmacy — useful for transparency reports or consumer price alerts.

For the same parallel extraction pattern applied to other multi-site scenarios, see how this pattern works for hotel price comparison across booking platforms.

Pharmacy pricing is one of the most fragmented data problems in healthcare — hundreds of chains, no unified API, prices that change daily. Parallel agents solve the fragmentation: five chains, one function call, real-time results. The normalization layer solves the format problem: five different price strings, one comparable number. Together, they're the foundation of any serious price transparency tool.

FAQ

Can this build a medicine price comparison tool without pharmacy APIs?
Yes. This tool reads public product search pages — the same pricing data any consumer sees when visiting the pharmacy's website without logging in. No pharmacy API partnership, data licensing agreement, or institutional approval is required. This is the practical solution for markets where pharmacy price APIs don't exist.

How does the price normalization handle different currency formats?
Pharmacy websites display prices in different formats even within the same country. The parseVndPrice function strips currency labels and symbols, then removes all non-digit characters to extract the numeric value. For VND, this handles "125,000₫", "125.000 VNĐ", and "125000 đồng" identically. For other currencies, update the label strip list and handle the local decimal/thousands separator convention in your parsing function.

Why capture the price as a string from the agent rather than asking for a number?
Asking the agent to return a number risks losing the formatting context needed to parse correctly. "125.000" is ambiguous — it's 125,000 VND in Vietnam but 125.0 in USD notation. Capturing the raw display string ("125.000 VNĐ") preserves the currency label and makes your parsing logic unambiguous. Parse on your side where you know the market context.

What happens when a pharmacy doesn't carry the medicine?
null is a valid agent result — the medicine wasn't found. This is distinct from error: true, which indicates an infrastructure failure (network timeout, session error). Null results appear at the bottom of sorted output, allowing found results to float to the top regardless of which pharmacies carry the medicine.

How do I add more pharmacy chains?
Add entries to the PHARMACIES array with the chain name and its search URL. The goal prompt's natural language description adapts to different site structures — you may need to adjust field names in the goal if a chain uses unusual terminology for stock status or dosage form.

Can I schedule this for daily price tracking?
Yes. Run the comparison with a cron job or cloud scheduler and save results to a database table keyed by pharmacy, product name, and date. Comparing today's output with yesterday's automatically surfaces price changes. Set a threshold (e.g., any price drop > 10%) to trigger an email alert via a transactional email service.

Related Reading:

Deploy This with a Free Account

The complete workflow above runs on TinyFish's free tier. 500 free steps, no credit card — enough to deploy this project and validate it against real data before choosing a plan.

Get your free API key →

Want to scrape the web without getting blocked? Try TinyFish — a browser API built for AI agents and developers.

Build a Video Game Price Comparison Tool Across 10 Platforms

Tinyfishie — Tue, 19 May 2026 09:31:43 +0000

TinyFish parallel agents are cloud-based browser sessions that query multiple game storefronts simultaneously and return structured pricing data — game title, current price, discount percentage, and direct purchase URL — normalized across currencies and formats into a single ranked list.

Game prices vary wildly across storefronts and change constantly during sales. IsThereAnyDeal and CheapShark aggregate some of this, but neither covers every platform — itch.io is out, regional stores are out, newer storefronts take months to appear. When a game isn't in the database, you're back to checking each store manually.

This tutorial builds a multi-platform game price comparison tool using TinyFish agents. Your configured storefronts are checked simultaneously — major PC storefronts, indie platforms, regional stores, or any public game store with a search page. You get a ranked price list in the time it takes to manually open two tabs.

Build a video game price comparison tool in 4 steps:

Define your platform list with store search URLs
Write a goal prompt that extracts price, discount, and link consistently
Run all 10 agents in parallel with Promise.allSettled
Normalize currencies and rank by discounted price

Why Building Beats Using CheapShark or IsThereAnyDeal

CheapShark and IsThereAnyDeal are good starting points. They cover major PC storefronts reliably — and their APIs are free. For straightforward PC game price lookup, they're often sufficient.

Browser agents make sense when:

Your target platform isn't indexed — regional stores, newer indie platforms, and console storefronts have partial or no CheapShark coverage
You need real-time data — API-backed aggregators cache prices, sometimes aggressively. During major sale events, cached prices can lag the actual storefront by hours
You're building beyond price lookup — availability status, regional pricing differences, bundle detection, and DLC pricing require reading the actual store page rather than an API summary

The engineering tradeoff is straightforward:

Approach	Best for	Limitation
CheapShark / IsThereAnyDeal API	Quick PC storefront lookup, free, no setup	Doesn't cover all storefronts; cached data
Browser agents (this tutorial)	All 10 platforms, real-time, fully customizable	Credit cost per query; rate-limited at Free tier

Choosing the right TinyFish API for this project: TinyFish offers two tools with different access patterns. The Fetch API (api.fetch.tinyfish.ai) retrieves a page at a known URL — free (0 credits). The Agent API (agent.tinyfish.ai) handles storefronts requiring a search step — 1 credit per step. For storefronts with stable product page URLs, Fetch API can retrieve the price directly (free, 0 credits). For storefronts that require a search query, Agent API is required (1 credit per step). Where you have direct product URLs, Fetch API is the right choice — free on all plans.

Video Game Price Comparison: Running Parallel Agents

Prerequisites: Node.js 18+, TypeScript, and a TinyFish API key.

npm install @tiny-fish/sdk
npm install -D ts-node typescript
export TINYFISH_API_KEY=your_key_here

import { TinyFish } from "@tiny-fish/sdk";

const client = new TinyFish(); // reads TINYFISH_API_KEY from env

// Replace these with the storefronts you want to compare.
// Before adding any storefront, check its Terms of Service regarding automated access.
// Many storefronts have public APIs (e.g. CheapShark, IsThereAnyDeal) that are
// the preferred integration method — use browser agents only for stores without APIs.
const PLATFORMS = [
  { name: "Storefront A", url: "https://example-store-a.com/search?q=" },
  { name: "Storefront B", url: "https://example-store-b.com/games/search?query=" },
  { name: "Storefront C", url: "https://example-store-c.com/search?term=" },
  // Add more storefronts here — any public search URL works
];

const buildGoal = (gameTitle: string, platformName: string): string => `
  Search for the game "${gameTitle}" on this storefront.
  Return a JSON object with:
  - gameTitle: string (exact title as shown on this store)
  - price: string (current price exactly as displayed, including currency symbol)
  - originalPrice: string or null (if there is an active discount, the original/crossed-out price)
  - discountPercent: number or null (e.g. 40 for 40% off, null if not discounted)
  - currency: string (3-letter code: "USD", "EUR", "GBP", etc.)
  - platform: "${platformName}" (e.g. "Storefront A" or "Storefront B")
  - url: string (direct link to this game's store page)
  - available: boolean (true if the game is purchasable; false if unavailable, region-locked, or subscription-only)

  If the game is not found on this storefront, return null.
  Return only the most relevant match for "${gameTitle}".
`;

async function compareGamePrices(gameTitle: string) {
  const query = encodeURIComponent(gameTitle);

  const requests = PLATFORMS.map((platform) =>
    client.agent
      .run({ url: platform.url + query, goal: buildGoal(gameTitle, platform.name) })
      .then((response) => {
        // response.result is the parsed JavaScript object returned by the agent.
        // null = game not found. Check for goal failure vs legitimate null:
        const result = response.result as unknown;
        if (result && typeof result === "object" && !Array.isArray(result) &&
            (result as Record<string, unknown>).status === "failure") {
          return { platform: platform.name, result: null };
        }
        return { platform: platform.name, result: result ?? null };
      })
  );

  const settled = await Promise.allSettled(requests);
  return settled.map((r, i) => ({
    platform: PLATFORMS[i].name,
    // error: true = network/timeout failure, distinct from null (game not found)
    ...(r.status === "fulfilled" ? r.value : { result: null, error: true }),
  }));
}

Why Promise.allSettled not Promise.all: A single storefront timing out or returning an error shouldn't cancel the nine successful results. allSettled ensures every agent completes and reports independently.

Concurrency note: The Free plan supports 2 concurrent agent runs — 10 platforms run in approximately 5 batches. The Starter plan (10 concurrent) runs all 10 simultaneously. Each agent step consumes 1 credit. Querying all 10 platforms for one game title typically costs 30–60 credits depending on site complexity.

Normalizing Prices Across 10 Currencies and Formats

Ten storefronts return ten different price formats: "$29.99", "€24,99", "£19.99", "AU$39.95", "Free", "N/A (subscription only)". Without normalization, sorting by price is impossible.

// normalize.ts

const CURRENCY_TO_USD: Record<string, number> = {
  USD: 1.0,
  EUR: 1.09,  // update these rates from an exchange rate API in production
  GBP: 1.27,
  AUD: 0.65,
  CAD: 0.73,
};

function parsePriceToUsd(raw: string, currency: string): number | null {
  if (!raw || raw.toLowerCase().includes("free")) return 0;
  if (raw.toLowerCase().includes("n/a") || raw.toLowerCase().includes("subscription")) return null;

  // Strip currency symbols and labels, normalize decimal separator
  // Remove all non-numeric chars except dot; handle comma as thousands separator
  const cleaned = raw.replace(/[^0-9.,]/g, "").replace(/,/g, "");
  const amount = parseFloat(cleaned);
  if (isNaN(amount)) return null;

  const rate = CURRENCY_TO_USD[currency] ?? 1.0;
  return Math.round(amount * rate * 100) / 100; // round to cents
}

function calcSavings(
  currentUsd: number | null,
  originalUsd: number | null
): { savingsUsd: number; discountPct: number } | null {
  if (currentUsd === null || originalUsd === null || originalUsd <= 0) return null;
  const savingsUsd = originalUsd - currentUsd;
  const discountPct = Math.round((savingsUsd / originalUsd) * 100);
  return savingsUsd > 0 ? { savingsUsd, discountPct } : null;
}

export function normalizeGameResult(raw: {
  price?: string;
  originalPrice?: string;
  currency?: string;
  discountPercent?: number | null;
  available?: boolean;
  [key: string]: unknown;
}) {
  const currency = raw.currency ?? "USD";
  const priceUsd = parsePriceToUsd(raw.price ?? "", currency);
  const originalUsd = raw.originalPrice
    ? parsePriceToUsd(raw.originalPrice, currency)
    : null;
  const savings = calcSavings(priceUsd, originalUsd);

  return {
    ...raw,
    priceUsd,
    originalUsd,
    effectiveDiscountPct: raw.discountPercent ?? savings?.discountPct ?? 0,
    savingsUsd: savings?.savingsUsd ?? 0,
  };
}

Three normalization decisions worth explaining:

Currency conversion at query time. Build the lookup table once, apply it at normalization — don't call an exchange rate API per result. In production, fetch rates daily and cache them.
"Free" returns 0. Sorting by price puts free games at the top, which is almost always the right behavior for a deals tool.
null for subscription/unavailable. A subscription-only game shouldn't appear as $0 — it's not a direct purchase price. null price pushes it to the bottom of sorted output.

Putting it together:

import { normalizeGameResult } from "./normalize";

const rawResults = await compareGamePrices("Elden Ring");

const normalized = rawResults
  .filter((r) => r.result !== null)
  .map((r) => ({ ...r, result: normalizeGameResult(r.result as Record<string, unknown>) }))
  .sort((a, b) => {
    const aPrice = a.result.priceUsd ?? Infinity;
    const bPrice = b.result.priceUsd ?? Infinity;
    return aPrice - bPrice;
  });

normalized.forEach((r) => {
  const disc = r.result.effectiveDiscountPct > 0 ? ` (${r.result.effectiveDiscountPct}% off)` : "";
  console.log(`${r.platform}: $${r.result.priceUsd?.toFixed(2)}${disc} → ${r.result.url}`);
});

Run with npx ts-node index.ts.

Building a Deal Alert Bot

The comparison tool becomes a deal monitor with a scheduled run and a threshold check:

import { WebhookClient } from "discord.js";

const DEAL_THRESHOLD_USD = 15; // alert when any platform drops below this
const webhook = new WebhookClient({ url: process.env.DISCORD_WEBHOOK_URL! });

async function checkForDeals(gameTitle: string) {
  const results = await compareGamePrices(gameTitle);
  const normalized = results
    .filter((r) => r.result !== null)
    .map((r) => ({ ...r, result: normalizeGameResult(r.result as Record<string, unknown>) }));

  const deals = normalized.filter(
    (r) => (r.result.priceUsd ?? Infinity) < DEAL_THRESHOLD_USD
  );

  if (deals.length > 0) {
    const message = deals
      .map((d) => `**${d.platform}**: $${d.result.priceUsd?.toFixed(2)} ${d.result.effectiveDiscountPct > 0 ? `(-${d.result.effectiveDiscountPct}%)` : ""} — ${d.result.url}`)
      .join("\n");
    await webhook.send({ content: `🎮 Deal alert for **${gameTitle}**:\n${message}` });
  }
}

// Run daily via cron or GitHub Actions
await checkForDeals("Elden Ring");
await checkForDeals("Baldur's Gate 3");

Extension: Sale season monitor. Run the comparison daily during major sale periods (typically mid-June and late November for PC storefronts). When any game in your watchlist drops more than 50%, trigger the webhook immediately. Track prices over time to see which stores run simultaneous sales and which don't.

Extension: multi-game watchlist. Pass an array of game titles to compareGamePrices in parallel via Promise.all — 5 games × 10 platforms = 50 concurrent agent runs on a Pro plan.

For the same 10-platform parallel pattern applied to retail products, see how parallel agents work for medicine price comparison across pharmacy chains.

Game stores update prices hourly during sales. Checking 10 platforms manually for a single title takes long enough to miss a flash deal. Parallel agents collapse that to a single function call — all 10 checked simultaneously, prices normalized across currencies, ranked by actual cost. The same architecture that powers commercial deal aggregators, without the platform partnership requirements.

FAQ

Can this build a video game price comparison tool that checks all major storefronts?

Yes — browser agents navigate public storefront pages and require no per-platform API registration. Add any storefront with a public search page by appending an entry to the PLATFORMS array — including regional stores, indie platforms, and storefronts not covered by existing aggregator APIs.

When should I use browser agents instead of CheapShark API? Use this decision framework:

If you need	→ Use
Storefronts not covered by aggregator APIs	→ Browser agents (configure your own platform list)
Real-time pricing during an active sale	→ Browser agents (aggregators cache)
PC storefronts only, fast prototype	→ CheapShark API (free, no credits needed)
Availability, bundle status, or regional pricing	→ Browser agents (read the actual store page)

In practice: start with CheapShark for the platforms it covers, add browser agents for the gaps.

Why do existing APIs like CheapShark miss some platforms?

CheapShark and IsThereAnyDeal aggregate prices from storefronts that maintain API partnerships or accept data feeds. Platforms without those partnerships — itch.io, console storefronts, newer indie platforms — don't appear. Browser agents have no such dependency: they read the public storefront pages that any customer visits, regardless of whether the platform has an API program.

How does the price normalization handle "Free" or subscription-only titles?

The parsePriceToUsd function returns 0 for "Free" (putting free games first in sorted output) and null for titles that are only available through a subscription service (not a direct purchase price). null prices sort to the bottom — a subscription title isn't a direct purchase price and shouldn't displace paid options.

What happens when a storefront doesn't carry the game?

null is the correct result — the game isn't listed on that platform. This is distinct from error: true (network failure or timeout). The comparison output shows null-result platforms as "not found" in the display layer, so users can see at a glance which stores carry the title.

How many platforms can run simultaneously?

The Free plan (PAYG, 500 credits to start) supports 2 concurrent agent runs — 10 platforms run in 5 batches. The Starter plan ($15/mo, 10 concurrent) runs all 10 at once. Each agent step consumes 1 credit; a typical 10-platform query costs 30–60 credits total depending on site complexity and JavaScript rendering requirements.

Can I extend this to track prices over time?

Yes. Save each run's output to a database keyed by game title, platform, and timestamp. Plotting priceUsd over time reveals sale patterns — which storefronts discount simultaneously, how long post-launch discounts take to appear across different store types, and which platforms hold full price longest. Add a cron job to run daily checks for your watchlist.

Deploy This with a Free Account

The complete workflow above runs on TinyFish's free tier. 500 free steps, no credit card — enough to deploy this project and validate it against real data before choosing a plan.

Get your free API key →

Related Reading:

Want to scrape the web without getting blocked? Try TinyFish — a browser API built for AI agents and developers.

TinyFish vs Playwright: When to Use Each for Web Automation

Tinyfishie — Tue, 19 May 2026 08:59:35 +0000

Playwright is running fine. Then you scale to 500 concurrent sessions. Or you add sites with strict automation requirements and reliability drops. Or you realize your team spends more time managing browser infrastructure than building the product.

That's usually when the search starts.

This article is for developers who already use Playwright and want to understand exactly where TinyFish fits — and where it doesn't. Not a takedown. A map.

What Playwright Does (And Why It's Excellent)

Playwright is one of the best browser automation tools available. It controls real Chromium, Firefox, and WebKit browsers with a clean, well-designed API. The ecosystem is mature: thorough documentation, active maintenance from the Microsoft team, and broad community support.

For developers, the appeal is direct. You write the code. You control everything. Intercept requests, manage cookies, record HAR files, take screenshots, assert DOM elements. The degree of control Playwright gives you over browser behavior is genuinely hard to replicate elsewhere.

The critical thing to understand about Playwright: it's a library. It doesn't provide infrastructure. You write the code, you run the infrastructure. That's not a criticism — it's the design. Playwright does exactly what it was built to do.

Where Playwright Hits Limits at Scale

The infrastructure model that makes Playwright flexible at small scale becomes operational overhead as you grow.

Browser server management. Running Playwright at scale means managing fleets of headless Chromium instances — 100–300MB memory per instance. At 50 concurrent sessions, you're making decisions about server scaling, spot instance reliability, and container orchestration.

Proxy configuration. Sites with strict automation requirements often need residential IP rotation. Building proxy rotation into Playwright means integrating a proxy provider, handling session affinity, managing per-request IP logic, and debugging reliability when rotation fails or IP reputations degrade.

Maintenance overhead. When a target site changes its structure or tightens its automation requirements, your scraper breaks. Diagnosing and fixing these failures is real engineering work — work that doesn't ship product, but blocks pipelines.

Reliability degrades with volume. A Playwright script that works at 10 requests/hour can fail at 500/hour — not because the code changed, but because infrastructure factors create failure modes that weren't visible at low volume. Setting up reliable automation on sites with strict requirements takes significant infrastructure work with Playwright alone.

If your targets are static pages on cooperative domains at modest volume, this overhead is manageable. If you're running production pipelines at scale across varied sites, the infrastructure becomes a full-time engineering problem.

For more on where Playwright's reliability degrades by site type, see Scraping Dynamic Websites: When Playwright Is the Right Tool.

Test TinyFish against your hardest URLs — 500 free credits, no credit card

What TinyFish Adds

📸 IMAGE — Architecture comparison showing self-hosted Playwright stack versus TinyFish managed browser infrastructure

Your existing Playwright scripts work. The change is minimal: create a TinyFish browser session, then connect your Playwright code to it.

TinyFish exposes a Browser API that returns a CDP WebSocket endpoint. Since Playwright's connect_over_cdp() method already speaks CDP, connecting to TinyFish instead of launching a local browser requires two lines — create a session, then connect. Everything after that is unchanged:

import os
from playwright.async_api import async_playwright

async with async_playwright() as p:
    # Before: local browser
    # browser = await p.chromium.launch()

    # After: TinyFish managed browser — create a session, then connect
    from tinyfish import TinyFish
    session = TinyFish().browser.sessions.create()
    browser = await p.chromium.connect_over_cdp(session.cdp_url)
    page = await browser.new_page()
    await page.goto("https://example.com")
    content = await page.content()
    await browser.close()

Everything after the connection line — your selectors, page interactions, extraction logic — stays identical.

What changes is the infrastructure layer. Browser servers managed. Residential proxy routing included. Cold starts under 250ms. Browser sessions run with full infrastructure support — proxy routing, session state, and browser configuration are handled server-side. TinyFish achieves up to 85% success rate on sites with strict automation requirements across standard commercial site categories (source: tinyfish.ai).

The biggest difference between TinyFish and Playwright isn't what the browser can do — it's what you no longer have to build. TinyFish removes the infrastructure layer entirely: browser servers, proxy management, reliability handling. Your Playwright code connects unchanged; the operational overhead disappears.

One distinction worth noting: TinyFish also offers a Web Agent layer for goal-based tasks, separate from the Browser API. If your use case is "navigate to X, extract Y, handle the flow autonomously," the Agent endpoint may serve you better than CDP. But for teams with existing Playwright code who want to remove browser infrastructure, the Browser API is the direct path.

Playwright vs TinyFish: Side-by-Side

	Playwright	TinyFish Browser API
Hosting model	Self-hosted (you manage servers)	Cloud-managed
Setup	Library install; infra setup varies	API key, SDK, CLI, or MCP
Maintenance	You own: server ops, proxy config, failure debugging	Managed
Browser cold start	Depends on your hardware	<250ms
Infrastructure reliability	Depends on your implementation	Managed, included
Max concurrent (managed)	Limited by your servers	10 (Starter) / 50 (Pro)
Cost at low volume	Low — library is free	$0.015/credit PAYG
Cost at high volume	Server + proxy + engineering time	$0.012/credit (Pro)
CDP compatible	Native	✅ — existing Playwright scripts work
Full browser control	✅ Complete	✅ via CDP

If Playwright wins a row, it wins. The table is accurate.

One context the table doesn't capture: Playwright's "low" cost at low volume is the library cost. The real cost at scale includes server infrastructure, proxy subscriptions, and the engineering time spent on maintenance. That's what TinyFish replaces — and it's why the economics shift faster than the per-step price suggests.

When to Stick With Playwright

Use Playwright here. It's the right tool.

UI and functional testing. Playwright is purpose-built for testing — test runners, assertions, traces, screenshot comparison. TinyFish is a data extraction and automation platform, not a testing framework.
Simple, stable targets. Working with a small number of cooperative sites whose structure doesn't change? If you're comfortable managing the browser process and the maintenance load is genuinely zero, Playwright is the leaner choice. The threshold isn't about volume — it's about whether infrastructure has become a problem yet.
Local development and prototyping. Building and debugging automation flows is faster with local Playwright — full visibility into browser state, easy debugging, no API latency.
When you need complete control. Custom browser configurations, extension loading, low-level network manipulation — Playwright's architecture gives direct access to things the Browser API abstracts.
Early-stage validation. If you're running quick experiments to test whether a data source is useful before committing to a pipeline, Playwright locally is the fastest path to an answer.

When TinyFish Makes Sense

TinyFish's value is clearest when the infrastructure problem becomes visible. That moment comes sooner than most teams expect.

Production pipelines at scale. When thousands of concurrent sessions make browser infrastructure an ops problem, managed infrastructure is worth the per-step cost.

Teams that don't want to own browser ops. If browser infrastructure isn't a core competency you want to build, outsourcing it is rational — your engineers ship product, not browser fleet maintenance.

Sites with strict automation requirements. Sites that require more reliable infrastructure-level request handling — where JavaScript-layer plugins fail at scale — benefit from TinyFish's infrastructure-level approach. Infrastructure-level request handling is built in at the execution layer, not patched in afterward.

Goal-based multi-step tasks. When your automation requires reasoning about what to do next rather than a fixed selector sequence, the Web Agent layer handles tasks that would require complex conditional Playwright logic.

Mixed complexity URL lists. If your list includes simple pages alongside sites that require more sophisticated handling, routing to TinyFish selectively while keeping Playwright for simple cases is a valid tiered approach.

Can They Work Together?

Yes. This is the most important point in this article: Playwright and TinyFish are not competing tools. They're different layers.

Playwright is the automation library. TinyFish is the infrastructure layer. You can run them in parallel — Playwright for tests and simple sites, TinyFish for production-scale or complex extraction — without any conflict.

The migration is genuinely minimal. Here's a complete working example:

import asyncio
import os
from playwright.async_api import async_playwright

async def scrape(url: str) -> str:
    async with async_playwright() as p:
        # Replace launch() with a TinyFish session — the rest of your code is unchanged
        from tinyfish import TinyFish
        session = TinyFish().browser.sessions.create()
        browser = await p.chromium.connect_over_cdp(session.cdp_url)
        page = await browser.new_page()
        await page.goto(url, wait_until="networkidle")
        content = await page.content()
        await browser.close()
        return content

result = asyncio.run(scrape("https://example.com"))

Your selectors, event handlers, extraction logic — unchanged. The browser runs in TinyFish's managed infrastructure instead of your server.

For session configuration, concurrent session management, and authentication handling, the full documentation is at docs.tinyfish.ai/browser-api.

Read the Browser API docs — connection options, session config, concurrent setup

If you want to try it against your current Playwright scripts, the Getting Started guide walks through first connection in under 10 minutes.

Try TinyFish free — 500 credits, no credit card required

FAQ

Is TinyFish a Playwright alternative or a complement?

Both, depending on what you need. For teams with existing Playwright code, TinyFish's Browser API is a complement — your scripts run unchanged, just on managed infrastructure instead of your server. As a fuller alternative, TinyFish makes sense when you need infrastructure-level reliability at scale or goal-based agent automation. Most teams end up using both: Playwright for testing and simple sites, TinyFish for production-scale or complex extraction.

Can I use my existing Playwright scripts with TinyFish?

Yes. TinyFish's Browser API returns a CDP WebSocket endpoint. Playwright's connect_over_cdp() method connects to any CDP endpoint. The migration is minimal: create a TinyFish browser session, then pass session.cdp_url to playwright.chromium.connect_over_cdp(). Everything after that line — selectors, interactions, extraction logic — runs unchanged.

When does Playwright become impractical at scale?

Common signals: you're spending more engineering time on browser infrastructure than on automation logic; reliability issues appear that aren't in your code; proxy management has become its own project; or target sites with strict requirements need ongoing maintenance as they evolve. If these apply at your current volume, the infrastructure overhead has exceeded the cost of a managed alternative.

How does TinyFish handle sites with strict automation requirements?

TinyFish achieves up to 85% success rate on sites with strict automation requirements across standard commercial site categories (source: tinyfish.ai). Sites that JavaScript-layer plugins can't reliably handle are within TinyFish's standard coverage. For sites outside this coverage, failures are transparent — you won't get silent incorrect results.

What's the cost difference between Playwright and TinyFish at scale?

Playwright's library is free, but at scale the real costs are server infrastructure, proxy subscriptions, and engineering time for maintenance. TinyFish charges $0.012–$0.015 per credit depending on plan, with browser infrastructure, residential proxy, and LLM inference included. Playwright wins at low volume and simple sites; the economics shift toward TinyFish when infrastructure overhead becomes a material cost. The Pro plan ($150/mo) includes 16,500 credits with 50 concurrent agents.

Try TinyFish Free

If the selector maintenance and infrastructure overhead are the bottleneck — not the control — the free tier is the fastest way to compare against your current Playwright setup.

500 free steps, no credit card required.

Start free →

No browser to install, no selectors to write. Describe what you need, get back structured JSON.

How to Choose a Web Automation Tool by Page Volume (With Real Cost Estimates)

Tinyfishie — Tue, 19 May 2026 08:59:07 +0000

Most web automation tool comparisons treat page volume as a footnote. It isn't.

The tool that handles 500 pages a day beautifully will silently degrade at 50,000. The infrastructure that's cost-effective at 10,000 pages becomes the most expensive option in the room at 500,000. And the free tier that feels like a reasonable starting point has a ceiling that catches most teams by surprise somewhere in the middle of a project.

Page volume and access requirement level are the two primary variables that determine your tool decision — more than AI capability, ease of use, or no-code vs. code. Get either wrong and you're either paying for infrastructure you don't need or running a pipeline that breaks under load exactly when it matters.

This guide maps each volume tier to the tools that actually work at that scale, with real cost estimates at each level so you can make the comparison with numbers rather than intuition.

How to Estimate Your Page Volume

Quick decision rules before the detail:

Under 1K pages/day — free tiers work; pick for convenience, not capability.
1K–10K pages/day — managed tools beat self-hosted once you count setup time.
10K–100K pages/day — engineering maintenance cost exceeds tool subscription cost; factor both.
100K+ pages/day — you're buying infrastructure, not a tool; build vs. buy is the real decision.

Before matching tools to volume tiers, you need an accurate number. Teams consistently underestimate this, and the underestimate is what causes mid-project tool switches.

The formula:

Daily pages = (number of target URLs) × (crawl frequency per day) × (pages per URL path)

A few scenarios to calibrate against:

Competitor price monitoring across 50 e-commerce sites, updated daily: If each site has ~200 product pages, that's 10,000 pages/day. If you need hourly updates, that's 240,000 pages/day.
Lead enrichment from 2,000 company profile pages, run once a week: ~285 pages/day on average. Looks small — until you need it done in a 2-hour window, which effectively makes it ~1,000 pages/hour.
News monitoring across 30 publications, 4x daily: If each publication has ~50 new articles per cycle, that's 6,000 pages/day.

The number that matters for tool selection isn't the total — it's the peak load your pipeline needs to sustain, and whether you need it done in a tight time window or can spread it across the day.

Volume Tier 1: Under 1,000 Pages Per Day

What this looks like

One-off research pulls. Small recurring monitors. Proof-of-concept scrapes before committing to a larger pipeline. A freelancer pulling a client's competitor catalog. A researcher collecting data from academic directories.

What works

At this volume, almost any tool works. The decision is about convenience and your technical comfort level, not about infrastructure.

Free options that are genuinely capable here:

Web Scraper Chrome Extension (free): Works for public, unprotected pages. No scheduling, no parallelism, but for a one-time pull of a few hundred rows it's the fastest path to a CSV.
ParseHub free tier: 5 projects, up to 200 pages per run. If your target has fewer than 200 pages, this is a complete solution at zero cost.
Octoparse free tier: 2 simultaneous scrapers, 10 tasks limit, up to 50K rows/month export. Better for recurring small-volume scrapes than ParseHub, but verify the task and row limits against your actual target before committing.
TinyFish free tier: 500 credits, no credit card. The value here isn't the volume — it's that you get to test an AI agent against your actual target site, including any access requirements it enforces.

What to watch for: Free tiers hide their ceilings. ParseHub's 200-page-per-run limit is the one most teams hit mid-project. If your target has 250 product pages, you're already over the limit. Verify the ceiling against your actual target page count before building a workflow around any free tier.

Real cost at this volume

Tool	Monthly cost at ~500 pages/day	Notes
Web Scraper Extension	$0	No scheduling, uses your IP
ParseHub	$0 (free tier)	200 pages/run limit
Octoparse	$0 (free tier)	Local runs only
TinyFish	$0 (free tier)	500 credits total, not per day
Scrapy (self-hosted)	$0 + server cost (~$5–10/mo VPS)	Requires Python setup

Volume Tier 2: 1,000 to 10,000 Pages Per Day

What this looks like

A small team's recurring data feed. Daily price monitoring across dozens of sites. A startup's competitive intelligence pipeline. Most "we scrape data to inform our product decisions" use cases live here.

What works

This is where free tiers run out and you start paying for infrastructure. The key trade-off at this volume is between simplicity (managed cloud tools) and cost efficiency (self-hosted frameworks).

Managed cloud tools (simpler, higher per-page cost):

Apify: Solid at this volume. A well-configured Actor running 5,000 pages/day typically costs $30–60/month in compute. The marketplace of pre-built Actors covers most common targets (Amazon, LinkedIn, Google Maps) and gets you to first data in under ten minutes without writing selectors. For targets outside the catalog, you're writing and maintaining custom Actors — factor that time in.
TinyFish Browser API: $15/month (Starter) for developers already using Playwright, Puppeteer, or Selenium. Connects via CDP over WebSocket — no SDK swap, you point your existing browser automation at TinyFish's endpoint instead. Sub-250ms cold start means parallelism scales cleanly without queuing delays. Best fit at this tier: developers who want managed browser infrastructure without rebuilding their scraping stack.
TinyFish Web Agent (Starter, $15/month, 1,650 credits): Better fit when your target requires multi-step navigation or authentication rather than straightforward page extraction. A simple extraction runs 2–3 steps/page; an authenticated flow runs 8–10.

The Browser API and Web Agent share the same credit pool, so you can mix both within one plan depending on what each target requires.

Self-hosted frameworks (more work, lower marginal cost):

Scrapy: Free to run, but you're paying for a server and your own time. A $20/month cloud instance handles this volume easily. The real cost is the 4–8 hours of setup time and ongoing maintenance when target sites change. If your targets are static HTML with no strict access requirements, this is the most cost-efficient option at this volume.

What to watch for: At 1,000–10,000 pages/day, you're large enough that sites with strict access requirements become a more significant factor. A managed tool that includes proxy rotation (like TinyFish) absorbs that cost into the subscription. A self-hosted Scrapy setup needs a separate proxy budget — residential proxies (e.g., Bright Data) run ~$8/GB PAYG at this tier, which adds $20–80/month depending on page weight.

Real cost at this volume

Estimated monthly cost at 5,000 pages/day:

Tool	Base cost	Proxy cost	Estimated total/mo
Scrapy (self-hosted)	~$20 (server)	$30–80 (if needed)	$20–100
Apify (pay-as-you-go)	~$40–60 (compute)	Separate	$40–140
TinyFish Starter	$15	Included	$15
TinyFish Pro	$150	Included	$150

Note: TinyFish pricing includes browsers, proxies, and AI inference. Apify and Scrapy costs are compute only — add proxy costs separately for protected sites.

Volume Tier 3: 10,000 to 100,000 Pages Per Day

What this looks like

A mid-size company's market intelligence operation. An e-commerce brand monitoring pricing across hundreds of competitor sites. A SaaS product that needs fresh web data as a core feature. This is where scraping stops being a side project and becomes infrastructure.

What works

At this volume, the hidden cost of scraping is no longer the tool subscription — it's engineering time. Selector-based scrapers break when target sites update. Proxy pools need management. Failure monitoring becomes a dedicated function. The teams that underestimate this end up with a part-time engineer whose primary job is keeping the scraping pipeline alive.

Managed infrastructure wins on total cost here:

TinyFish *(Browser API, for teams migrating from Playwright/Puppeteer): $150/month for 16,500 credits, 50 concurrent sessions; pay-as-you-go at $0.015/credit. TinyFish billing depends on which API you use — the calculations below assume 10 seconds per page; actual costs vary with page load time and workflow complexity.

Browser API (billed per time — 1 credit = 4 minutes, minimum 1 minute):

10 sec/page → rounds up to 1 min → 0.25 credits/page

50,000 pages/day × 0.25 credits × 30 days = 375,000 credits/month

→ PAYG: ~$5,625/month | Pro plan (overage at $0.012/credit): ~$4,452/month

Web Agent (billed per step — 1 credit = 1 step; for complex multi-step workflows):

~3 steps/page × 50,000 pages/day × 30 days = 4,500,000 steps/month

→ PAYG: ~$67,500/month | Not practical for bulk simple extraction at this volume.

Most bulk extraction pipelines at this tier use the Browser API. Web Agent pricing is designed for complex authenticated workflows where the automation value per workflow justifies the cost.*

Apify (Starter plan): Starts at $29/month; the Scale plan ($199/month) is typically required at 50,000 pages/day. At this volume, expect $200–500/month in compute, plus significant proxy costs for protected sites. Custom Actors require ongoing maintenance.
Bright Data: At this volume, Bright Data's Scraping Browser becomes relevant — a fully managed Chrome instance with built-in proxy rotation. Cost is primarily proxy bandwidth: residential proxies at ~$8/GB (PAYG; source: brightdata.com, April 2026). A 50,000-page/day operation scraping typical retail pages (~500KB each) uses roughly 25GB/day — approximately $6,000/month in proxy costs alone. Bright Data makes sense when geographic targeting or anti-detection reliability is the primary requirement, not as a general-purpose option.

Self-hosted at this volume:

Scrapy + infrastructure: Technically possible, but at 50,000 pages/day you need distributed infrastructure — multiple servers, a job queue (Redis or Celery), a monitoring stack, and proxy management. A realistic infrastructure budget is $200–500/month, plus 20+ hours/month of engineering maintenance. Justified if you have a dedicated data engineering team and highly customized requirements.

What to watch for: This is the volume tier where silent failure becomes a serious business problem. A pipeline that silently returns empty results for three days at 50,000 pages/day is a data quality incident, not a minor inconvenience. Factor monitoring and alerting into your tool evaluation — not just happy-path performance.

Real cost at this volume

Estimated monthly cost at 50,000 pages/day, assuming a mixed target set of simple and JS-heavy sites requiring managed browser infrastructure:

Tool	Estimated total/mo	Selector maintenance	Failure visibility
Scrapy + proxies	$2,000–2,300 ⁽¹⁾	High (you own it)	Manual
Apify (custom Actors)	$500–900	Medium (Actor updates)	Dashboard
Bright Data (proxy infra)	$4,500–6,000+ ⁽²⁾	High (your scrapers)	Manual
TinyFish Browser API (PAYG)	~$5,625 ⁽³⁾	None	Built-in
TinyFish Browser API (Pro)	~$4,452 ⁽³⁾	None	Built-in

⁽¹⁾ Scrapy estimate: ~$200–500/month compute (industry estimate, no official source; based on 3–5 VPS instances + job queue) + ~$1,800/month residential proxy for ~30% protected pages (15,000 pages/day × 500KB × 30 days = 225GB × $8/GB). Compute only would be $200–500/month — proxy is the larger cost at this volume.

⁽²⁾ Bright Data: residential proxy at $8/GB PAYG (source: brightdata.com, April 2026). 750GB/month for a mixed site set × $8 = $6,000/month.

⁽³⁾ TinyFish: based on tinyfish.ai/pricing (April 2026) + assumed 10 sec/page (minimum 1 min billing = 0.25 credits/page). Actual costs vary with page load time. See calculation detail in the section above.

The TinyFish number looks higher than Scrapy until you add engineering time. At $150/hour for a developer, 20 hours/month of maintenance is $3,000 — not in the tool budget, but real cost.

Volume Tier 4: 100,000+ Pages Per Day

What this looks like

Enterprise-scale data operations. Google-scale inventory aggregation. A rideshare company collecting millions of pricing variables monthly. Financial services firms monitoring hundreds of regulatory portals in real time. This is not a side project.

What works

At this volume, you're buying infrastructure, not tools. The question is whether you build it or buy it.

Build: A custom distributed scraping stack — Scrapy or custom crawlers running on Kubernetes, Bright Data or a private proxy pool for IP management, a data pipeline for cleaning and delivery. Engineering cost to build: 3–6 months of a senior engineer's time. Ongoing maintenance: a dedicated team. Justified for organizations with highly specific data requirements, existing data engineering capacity, and volume that makes the economics work.

Buy: TinyFish's enterprise tier is designed for this. At this tier, the economics shift from per-page cost to total infrastructure cost — the platform is running production workflows at this scale across multiple enterprise customers. The value proposition at this tier isn't the per-page cost — it's that you're buying a system that's already been hardened at that scale, with the reliability and compliance requirements enterprise operations need. Custom pricing at this tier; contact sales for specifics.

What to watch for: At 100,000+ pages/day, the decision isn't really between tools — it's between building and buying. Both have merit depending on your engineering resources and how central web data collection is to your product. The right question isn't "which tool is cheapest per page?" It's "how much of our engineering capacity do we want this to consume?"

The Full Picture: Volume × Site Complexity

Volume alone doesn't determine your tool. Site complexity — how much infrastructure the target requires — is the other axis. This matrix combines both:

	Static / simple pages	JS-heavy, requires managed browser	Authenticated access (your own accounts)
< 1K pages/day	Free tools (ParseHub, Octoparse)	TinyFish free tier	TinyFish free tier
1K–10K pages/day	Scrapy (self-hosted) or Apify	Apify or TinyFish Starter	TinyFish Starter/Pro
10K–100K pages/day	Scrapy + infra, Apify, or TinyFish Pro	Apify or TinyFish Pro	TinyFish Pro
100K+ pages/day	Custom stack or TinyFish Enterprise	TinyFish Enterprise	TinyFish Enterprise

The pattern: at low volume on simple sites, almost anything works and the cheapest option wins. As volume or site complexity increases, the tools that don't require ongoing maintenance become progressively more cost-effective when you count engineering time.

The Cost Calculation Most Teams Get Wrong

Every tool comparison in this category focuses on subscription price. The number that actually determines total cost is:

Total cost = tool subscription + proxy costs + (engineering hours × hourly rate)

Scrapy is free. But if a developer spends 15 hours/month keeping selectors current, that's $2,250/month at $150/hour — more expensive than any managed tool at comparable volume. The teams that make this mistake are the ones who calculate tool cost from the pricing page and engineering time from zero.

The inversion point — where managed infrastructure becomes cheaper than self-hosted — happens somewhere between 5,000 and 20,000 pages/day for most teams, depending on target site complexity and how often sites update their frontend.

How to Estimate Your Starting Point

If you're not sure where your project falls, start with the TinyFish free tier (500 credits, no credit card). Run it against your actual target site. The results tell you three things at once: whether AI-based extraction handles your target's structure, what your step-per-page ratio looks like for cost projection, and whether there's strict access requirements you didn't know about.

That's a better calibration than any estimate you can make from a pricing page.

Start free — 500 credits, no credit card required

Frequently Asked Questions

How much does web scraping cost?

It depends on volume and tool choice, but the honest answer is that the subscription price is rarely the whole number. At under 1,000 pages/day, free tiers from ParseHub, Octoparse, and TinyFish cover most use cases at zero cost. At 5,000 pages/day, expect $15–100/month depending on whether you need strict access handling. At 50,000 pages/day, total cost including infrastructure and proxy fees typically runs $2,000–5,600/month depending on tool and proxy requirements — and if you're on a self-hosted setup, add engineering maintenance time on top of that. The full formula is: tool subscription + proxy costs + (engineering hours × hourly rate). Teams that only look at the subscription line consistently underestimate real cost by 2–3x.

What counts as a "page" for automation tool pricing?

It depends on the tool. For Scrapy and most traditional scrapers, a page is one HTTP request. For AI web agents like TinyFish, the unit is a "step" — a discrete action (navigate, click, extract). A single page extraction might require 2–5 steps; a multi-step authenticated workflow might require 10–15. Always ask vendors for step-to-page ratios for your specific use case before committing to a plan.

Is Scrapy actually free at high volume?

The software is open source, but the infrastructure isn't free. At 50,000 pages/day you need distributed computing, job queues, monitoring, and proxy pools. A realistic total infrastructure cost is $400–800/month, plus ongoing engineering time. Scrapy is the most cost-efficient option when you have the engineering capacity to run it — it's not free, it's a trade of money for engineering time.

What happens if I exceed my plan's page or step limit?

Most managed tools handle this differently. Apify charges compute unit overages at the pay-as-you-go rate. TinyFish offers pay-as-you-go at $0.015/credit as an alternative to the monthly plan (Pro plan overages bill at $0.012/credit). Note that 1 credit covers 1 agent step, 4 minutes of browser session, or 15 page fetches — the effective per-page cost depends on which API you use. Scrapy has no limit — your infrastructure is the ceiling. Plan for overages before you hit them; discovering them during a critical run is a bad time to learn the policy.

How do I know if my volume estimate is accurate?

It usually isn't, in the direction of underestimation. The most common mistake: counting target URLs but not accounting for crawl frequency, or not including the pages you need to navigate through to reach the data (pagination, category pages, authentication flows). Add 30–50% to your estimate before selecting a plan tier.

Related reading:

The Best Web Scraping Tools in 2026 — Ranked and Reviewed →

📸 IMAGE — Matrix showing web automation tool recommendations by page volume and access requirement level

Try TinyFish Free

500 free steps, no credit card. The fastest way to test whether TinyFish fits your workflow.

Start free →

Want to scrape the web without getting blocked? Try TinyFish — a browser API built for AI agents and developers.

Headless Browser vs AI Agents for Web Automation: How to Choose

Tinyfishie — Tue, 19 May 2026 08:58:40 +0000

**CHANGES IN THIS REVISION — **Docs link anchor text corrected ("docs.tinyfish.ai/web-agent" → "docs.tinyfish.ai").

Your Playwright script runs. The selector finds the element. The data comes back clean. Then the site ships a new layout, a cookie banner appears in a language your script doesn't expect, and a modal blocks the next step. You fix the script. The site ships another update. You fix the script again.

At some point the maintenance math stops working. That's when developers start asking about AI agents.

This article is for developers who already use Playwright, Puppeteer, or Selenium and want an honest answer to whether AI agents change their automation stack — and when.

What Is a Headless Browser?

You've used it. Playwright, Puppeteer, Selenium.

A headless browser is a browser engine without a graphical interface, controlled by code. It loads pages, executes JavaScript, renders content, and follows the instructions your script provides. It's a powerful, precise tool — and that precision is both its strength and its constraint.

A headless browser does exactly what your script tells it to do. No more, no less. If your script says "click the button with ID submit-btn," it clicks that button. If the button is now named confirm-btn, the script fails.

That determinism is valuable in the right context. It's also the root of the maintenance problem.

What Is an AI Web Agent?

The key distinction isn't capability — it's direction.

A headless browser is instruction-following: you write the exact steps, the browser executes them. An AI web agent is goal-directed: you describe the outcome you want, the agent works out the steps itself.

With Playwright: page.click('#submit-btn') — you specify the action.

With a web agent: "Submit the form and return the confirmation number" — you specify the goal.

TinyFish's Web Agent takes a URL and a plain-language goal, navigates the page, makes decisions about what to do next, handles unexpected states, and returns a structured result. The agent handles the selector logic. You handle the goal definition.

This comes with a real trade-off: agents are less predictable than scripts. If you need pixel-perfect determinism — test assertions against specific DOM states, for example — an agent introduces variability a script doesn't. For workflows where the goal matters more than the exact path, that trade-off favors agents.

For a deeper introduction to what web agents can do in practice, see What Is a Web Agent? The Complete Guide to AI Browser Agents in 2026.

The Honest Comparison

	Playwright / Puppeteer	TinyFish Web Agent
How you control it	Write exact steps (click, fill, navigate)	Describe the goal in plain language
Setup complexity	Library install, you manage scripts	API call with goal string
Maintenance over time	High — selectors break when sites change	Lower — agent adapts to layout changes
Handles unexpected states	No — script fails on unexpected conditions	Yes — agent reasons about what to do next
Infrastructure required	You manage browser servers, proxies	Managed by TinyFish
Determinism	High — predictable, exact	Lower — path may vary between runs
Cost at low volume	Library free; infra + proxy + maintenance extra	$0.015/credit PAYG — no infra overhead
Cost at high volume	Server fleet + proxy subscriptions + ongoing maintenance = significant ops cost	$0.012/credit (Pro), all infra included
Best for	UI testing, known stable sites, local dev	Dynamic sites, goal-based tasks, production scale

Where Playwright wins: control, predictability, local development, low-volume stable sites.

Where TinyFish wins: managed infrastructure, goal flexibility, sites with frequent layout changes.

If it's a draw on a row, it's a draw.

When a Headless Browser Is the Right Choice

Playwright isn't going anywhere. For the majority of browser automation tasks, it's still the right answer.

UI and functional testing. Playwright's native test runner, assertion API, and trace viewer are purpose-built for this. You need deterministic behavior, specific element state assertions, and reproducible test runs. Agents aren't designed for this job.

Scraping a known site with a stable layout. Documentation sites, static product catalogs, public APIs that require browser rendering — if the structure doesn't change, a Playwright script is cheaper and more predictable than an agent.

Local development and prototyping. Working locally with Playwright gives you full visibility into browser state, easy step-through debugging, and zero API latency. Starting with Playwright and migrating specific workflows to agents is a valid progression.

When you need complete control. Custom browser configurations, low-level network interception, cookie manipulation, specific extension behavior — Playwright's architecture gives you direct access to things an agent abstracts away.

Cost-sensitive early stage. At very low volume on cooperative, stable sites, Playwright's library cost is lower on paper. Once you add browser server infrastructure, proxy costs, and maintenance time, TinyFish's per-step pricing is typically comparable — and often cheaper than the full stack.

If your use case is on this list, you don't need a managed platform. Use Playwright.

When an AI Agent Adds Value

Here's the pattern that pushes teams toward agents: the script breaks. Not because of a bug — because the site changed.

A cookie banner appeared in Dutch. A login modal on the test site loaded differently on mobile. A dynamic popup blocked the next click. A form added a new required field. A confirmation step now requires checking a box that wasn't there before.

Each of these is a 20-minute fix. Multiplied across dozens of target sites, updated continuously, it becomes a full-time maintenance load that grows with the number of sites you're monitoring.

AI agents absorb this maintenance overhead. The goal stays the same; the agent adapts to the changed path.

Scenarios where agents add clear value:

Workflows with conditional decisions mid-flow. "Navigate to the product, check if it's in stock, and if so, extract the price and the lead time" — a fixed script needs to handle all possible states explicitly. An agent handles them implicitly.
Sites that update layouts frequently. E-commerce, news, job boards, real estate listings — any site where the structure changes often.
Running at production scale without owning browser infrastructure. Managing Playwright at scale means managing servers, proxies, and session handling. TinyFish removes that layer.
Tasks where defining the exact selector sequence is the bottleneck. If writing the script takes longer than the task itself, agents change the economics.

For specifics on where Playwright scripts degrade in production environments, see Scraping Dynamic Websites: When Playwright Breaks.

Test TinyFish against your hardest automation tasks — 500 free credits, no credit card

The Spectrum: Library → Cloud Browser → AI Agent

📸 IMAGE — Spectrum diagram from Playwright library to TinyFish Browser API to TinyFish Web Agent showing control vs maintenance trade-off

This isn't a binary choice. There's a spectrum — and TinyFish spans the right half of it.

Library (Playwright/Puppeteer): You write the script, you manage the infrastructure. Full control. Full maintenance responsibility.

Cloud browser (TinyFish Browser API): Same CDP interface, managed infrastructure. Your existing Playwright scripts connect with one line change. Browser servers, proxy routing, and reliability handling are managed. You keep the code.

AI Agent (TinyFish Web Agent): Submit a goal, get a structured result. The agent handles navigation, decision-making, extraction. You define what you want, not how to get it.

TinyFish spans the cloud browser and agent layers with one API key and one billing model. Use it as a CDP-compatible cloud browser — your existing Playwright code runs unchanged, you swap the launch endpoint. Or use it as a Web Agent — submit a goal, get a result. The choice is per-task, not per-account.

You're not choosing between Playwright and TinyFish. You're choosing where on this spectrum your specific task sits — and whether the infrastructure overhead of running it yourself is worth it.

The Web Agent API documentation is at docs.tinyfish.ai

For a view of how teams move from Selenium to agents over time, see From Selenium to AI Agents: A Migration Guide for Web Automation Teams.

Read the Web Agent docs — goal syntax, response handling, concurrent runs

Try TinyFish free — 500 credits, no credit card required

FAQ

Can I use my existing Playwright code with TinyFish?

Yes. TinyFish's Browser API exposes a CDP (Chrome DevTools Protocol) endpoint. Playwright's connect_over_cdp() method connects to any CDP endpoint. Create a TinyFish browser session, then pass the returned cdp_url to playwright.chromium.connect_over_cdp(session.cdp_url). Everything after that line — selectors, interactions, extraction logic — runs unchanged. You get managed infrastructure without rewriting your automation code.

What's the difference between the TinyFish Browser API and Web Agent?

The Browser API gives you a CDP-accessible managed browser — you write the Playwright or Puppeteer code that controls it. The Web Agent takes a goal in plain language and works out the steps itself, returning a structured result. Both use the same API key and billing pool. Browser API is for teams with existing Playwright code who want managed infrastructure. Web Agent is for goal-based tasks where writing the exact step sequence is the bottleneck.

Are AI agents more reliable than Playwright for web scraping?

AI agents are more reliable than Playwright scripts when sites change layouts frequently, because they navigate by page semantics rather than fixed selectors. If scripts fail because sites change layouts frequently, agents are more resilient — they adapt to changed paths. If scripts fail because of infrastructure issues (IP reputation, session handling, scale), moving to managed infrastructure helps regardless of whether you use scripts or agents. If your scripts work reliably, agents don't add reliability — they add flexibility for goal-based tasks at the cost of some determinism.

How does TinyFish's Web Agent handle sites that require authentication?

The agent handles multi-step workflows including authenticated sessions. For your own authenticated accounts, you describe the goal ("log in with these credentials and extract the invoice list from my account") and the agent manages the navigation. For authenticated workflows on your own accounts, use use_vault: true with stored credential items rather than passing credentials in the goal string.

What tasks should never use an AI agent?

UI and functional testing — you need deterministic assertions against specific DOM states, and agent variability breaks this requirement. Low-volume automation against stable, cooperative sites — the economics don't favor agents. Any workflow where you need byte-for-byte reproducibility across runs. And tasks where you need to debug exact failure points in a selector sequence — agents abstract that information away.

Try TinyFish Free

500 free steps, no credit card. The fastest way to test whether TinyFish fits your workflow.

Start free →

Web Data Extraction: From Static Pages to AI Agents

Tinyfishie — Tue, 19 May 2026 08:58:10 +0000

You need data from a website in a usable format: JSON, CSV, a clean table. The site isn't an API. It's not static HTML. It might render content via JavaScript, require authentication, or return different results to automation than to a browser.

Which tool you reach for depends entirely on which of those is true.

This guide covers all four tiers of website complexity with working code and honest tool recommendations. Tiers 1 and 2 don't need TinyFish — the right answer for many sites is still requests and BeautifulSoup. The guide says so.

Note: data extraction should respect each site's terms of service and robots.txt. The tools and techniques here are for legitimate use cases: price monitoring, market research, internal data pipelines, and publicly available information.

What Counts as "Structured Data Extraction"?

Dumping raw HTML isn't extraction. Structured data extraction means getting machine-readable output — JSON, CSV, a clean table — where the fields map to the information you actually need: price, title, availability, review count.

The right approach depends on one question: what kind of site is this?

The answer falls into four tiers, each requiring a different tool.

Four Tiers of Website Complexity

Tier	Site type	What makes it hard	Right tool	Code complexity
1	Has an API or RSS feed	Nothing — use the API	`requests` + JSON	Minimal
2	JS-rendered, no strict requirements	Content loads after JS runs	Playwright or TinyFish Fetch	Medium
3	Strict automation requirements at scale	Infrastructure complexity at volume	TinyFish Fetch API	Low (API call)
4	Authenticated or multi-step workflow	Session state + conditional decisions	TinyFish Web Agent	Low (goal string)

Tiers 1 and 2 don't require a managed platform. If you're on Tier 1 or 2 at low volume, use the simpler tool.

Tier 1 — Sites with APIs or RSS Feeds

Before writing a scraper, check for an API.

Many sites that appear to require scraping have JSON endpoints. Browser devtools → Network tab → filter for application/json requests. Product listing pages often load data from a /api/products or /v1/listings endpoint your scraper can call directly — faster, cleaner, and more stable than HTML parsing.

For sites with RSS or Atom feeds (blogs, news, job boards), the feed is already structured data.

For sites where the HTML itself is static and well-structured, requests + BeautifulSoup is the right tool. It's fast, it's free, and there's nothing to manage.

import requests
from bs4 import BeautifulSoup

def extract_products(url: str) -> list[dict]:
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    products = []
    for item in soup.select(".product-card"):
        products.append({
            "name": item.select_one(".product-name").text.strip(),
            "price": item.select_one(".product-price").text.strip(),
            "url": item.select_one("a")["href"],
        })
    return products

results = extract_products("https://example-shop.com/products")

This handles most documentation sites, static product catalogs, blog feeds, and public data sources. If it works, stop here.

Where it breaks down: Any site that loads content via JavaScript after the initial HTML response. You'll get the page shell with empty containers, not the data.

Tier 2 — JavaScript-Rendered Pages

Single-page apps, infinite scroll, dynamic tables, price widgets that load after the initial render — these require a real browser. The HTML you get from requests is what the server sends before JavaScript runs; the content you need loads after.

Two options, both valid:

Option A: Playwright (local, controlled environments)

import asyncio
from playwright.async_api import async_playwright
from typing import Optional

async def extract_js_page(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

result = asyncio.run(extract_js_page("https://example-spa.com/listings"))

Playwright is the right choice for local development, CI pipelines, and controlled environments. You manage the browser process.

Option B: TinyFish Browser API (managed infrastructure, same Playwright code)

For the same JS-rendered page, connect to TinyFish's managed browser instead of launching one locally. Your selectors and extraction logic stay identical:

import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional

async def extract_js_page_managed(url: str) -> Optional[str]:
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle")
            await page.wait_for_selector(".product-list", timeout=10000)
            content = await page.inner_text(".product-list")
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

The automation logic is unchanged. The infrastructure — browser servers, proxy routing, reliability handling — is managed by TinyFish. Browser cold starts under 250ms (source: tinyfish.ai).

For local development and direct browser control, Playwright. For production extraction without running browser infrastructure, TinyFish Browser API.

Tier 3 — Sites with Strict Automation Requirements

The challenge at this tier isn't writing the code — it's owning the infrastructure that makes the code reliable at scale.

Browser servers, proxy routing, session handling, failure recovery: these aren't one-time setup tasks. They're an ongoing operational surface that grows with your target list. TinyFish handles this infrastructure layer so your Python code stays focused on extraction logic — not ops. The same API call works whether you're running 10 requests or 10,000.

TinyFish's Browser API handles the infrastructure layer. You use the same Playwright selectors; TinyFish manages browser servers, proxy routing, and reliability underneath.

import asyncio
import os
from tinyfish import TinyFish
from playwright.async_api import async_playwright
from typing import Optional

async def extract_with_managed_browser(url: str) -> Optional[str]:
    """Extract page content using TinyFish managed browser infrastructure."""
    async with async_playwright() as p:
        browser = None
        try:
            session = TinyFish().browser.sessions.create()
            browser = await p.chromium.connect_over_cdp(session.cdp_url)
            page = await browser.new_page()
            await page.goto(url, wait_until="networkidle", timeout=30000)
            content = await page.content()
            return content
        except Exception as e:
            print(f"Error loading {url}: {e}")
            return None
        finally:
            if browser:
                await browser.close()

TinyFish achieves up to 85% success rate on sites with strict automation requirements (standard commercial sites; internal testing; source: tinyfish.ai) and a browser cold start under 250ms. For sites where these infrastructure concerns are the reliability bottleneck, the Fetch API removes that layer from your codebase.

For running this type of extraction at scale across many URLs simultaneously, see How to Monitor 1,000 Websites in Parallel with the TinyFish API.

Try TinyFish Fetch — 500 free credits, no credit card required

Tier 4 — Authenticated or Multi-Step Workflows

Some extraction targets involve authenticated workflows on your own accounts — they require:

Logging in before data is accessible
Navigation through multiple pages with conditional logic
Filling a search form and extracting results from the response
Handling state that changes between steps

Writing and maintaining the exact navigation sequence for these workflows in Playwright is high-maintenance code. Every change to the login flow or form structure breaks the script.

For goal-directed workflows, TinyFish's Web Agent takes a plain-language description of what you want and handles the navigation:

import requests
import os
from typing import Optional

def run_extraction_agent(url: str, goal: str) -> Optional[dict]:
    """Run a goal-directed extraction agent."""
    response = requests.post(
        "https://agent.tinyfish.ai/v1/automation/run",
        headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
        json={"url": url, "goal": goal},
        timeout=120
    )
    if not response.ok:
        return None
    data = response.json()
    # "COMPLETED" means the run finished — not necessarily that the goal succeeded
    if data.get("status") != "COMPLETED" or data.get("result") is None:
        return None
    return data["result"]

# Describe what you want — the agent handles the navigation
result = run_extraction_agent(
    url="https://supplier-portal.com/pricing",
    goal="Navigate to the pricing page, extract all product SKUs and their current prices. Return as JSON with fields: sku, name, price, currency."
)

The agent handles authentication, conditional navigation, and session state. You handle the goal definition.

For complex workflows and authenticated access patterns, use TinyFish's credential vault (use_vault: true) rather than passing credentials in the goal string.

For more on goal-directed automation for complex extraction workflows, see From Selenium to AI Agents: A Migration Guide.

Choosing the Right Approach

Decision flowchart:

Does the site have a documented API or RSS feed?
  → Yes: Use the API directly. Don't scrape what you can query.
  → No: Continue

Does the page content load via JavaScript?
  → No: requests + BeautifulSoup
  → Yes: Continue

Do you need managed infrastructure (scale, reliability, no server management)?
  → No: Playwright (local browser)
  → Yes: TinyFish Fetch API

Does the workflow require authentication, multi-step navigation, or conditional decisions?
  → No: TinyFish Fetch API
  → Yes: TinyFish Web Agent

Start with the simplest tool that works. When that tool's limits appear — JavaScript rendering, infrastructure overhead at scale, multi-step workflows — the next tier is ready. Tiers 3 and 4 aren't edge cases; they're where production data pipelines typically land.

Try TinyFish free — 500 credits, no credit card required

FAQ

What's the best Python library for extracting structured data from websites?

The right Python library depends on whether your target uses static HTML or JavaScript rendering. For static HTML, requests + BeautifulSoup or lxml is the fastest and simplest. For JavaScript-rendered pages, Playwright gives you full browser control. For managed infrastructure at scale, TinyFish's Fetch API removes the server maintenance. There's no single best library — the right choice is determined by the site type, volume, and whether you want to manage browser infrastructure yourself.

How do I extract JSON data that a website loads dynamically?

Check the Network tab in browser devtools first. Filter for application/json or fetch requests. Many sites that appear to require scraping actually load their data from an API endpoint you can call directly with requests. If the data isn't available via a direct API call, you'll need a real browser (Playwright or TinyFish Fetch with browser: true) to wait for the JavaScript to execute and the data to load.

What's the difference between web scraping and structured data extraction?

Web scraping typically refers to the process of downloading web pages. Structured data extraction is what you do with those pages: parsing them to get machine-readable output in a defined schema. You can scrape without extracting (just downloading HTML) and extract without scraping (calling an API). For practical purposes, the distinction matters because it determines your tool choice: if you need raw page content, a fetcher is enough; if you need specific fields in a specific format, you also need a parsing step.

How do I handle pagination when extracting data?

For static pagination (page 1, page 2, etc.), loop over page URLs and aggregate results. For infinite scroll or dynamic pagination, you need a real browser — Playwright can wait for the "Load More" button and click it programmatically, or TinyFish's Web Agent can handle pagination as part of a goal-directed workflow. For deep pagination at scale (thousands of pages), consider whether the site has an API endpoint for its paginated data — it's almost always faster than page-by-page extraction.

When does extracting structured data become a legal concern?

The legal landscape varies by jurisdiction, site terms of service, and the type of data. In most cases, extracting publicly available data that doesn't contain personal information is legally defensible (see hiQ v. LinkedIn in the US context). Key boundaries: always respect robots.txt, don't extract and republish data in ways that compete with the source site's business, and be careful with personal data (GDPR applies even to publicly visible information about EU individuals). When in doubt, check the site's terms of service and consult legal counsel for commercial use cases.

Try TinyFish Free

500 free steps, no credit card. The fastest way to test whether TinyFish fits your workflow.

Start free →