Tinyfishie

Posted on May 19 • Originally published at tinyfish.ai

Fetching Data from a Large URL List: The Complete Decision Guide

#datafetching #urllist #webautomation #python

You have a list of 500 URLs — competitor product pages, supplier portals, job listings, or real estate listings. You need the data from each one.

The answer to "which tool fetches this data reliably" depends on what's in that list — not on how many URLs there are.

What's in your list → which tool:

All static HTML, no strict automation requirements → httpx + asyncio (fastest, cheapest)
JavaScript-rendered content, no strict automation requirements → Playwright or Crawlee
Mixed list with some protected sites → Playwright + proxy rotation
Protected or authenticated URLs at scale → TinyFish Web Agent
Massive volume (100K+) of public pages → Scrapy

(Not sure what's in your list? The URL classification probe in the Mixed List section below runs in seconds before you commit to a full crawl.)

The Tool That Fits the List

Static HTML at Volume: httpx + asyncio

If your URLs are documentation pages, blog posts, static product catalogs, or any content that loads fully in the initial HTML response, an async HTTP client is the fastest and cheapest option, often by a large margin. The requests library is synchronous, so the example below uses httpx, which offers a similar API with asyncio support.

import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> dict:
    try:
        r = await client.get(url, timeout=15)
        return {"url": url, "status": r.status_code, "html": r.text}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_list(urls: list[str], concurrency: int = 20) -> list:
    results = []
    async with httpx.AsyncClient(follow_redirects=True) as client:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[fetch(client, url) for url in batch])
            results.extend(batch_results)
            print(f"Processed {min(i + concurrency, len(urls))}/{len(urls)}")
    return results

with open("urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

results = asyncio.run(crawl_list(urls))

In our testing, this handles 1,000 static URLs in under a minute on a standard laptop. For 100K+ URLs, Scrapy's built-in scheduler, downloader middleware, and item pipeline make more sense: deduplication, retry logic, and output formatting are handled by the framework rather than by your own code.
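One refinement worth considering: the batch loop above waits for the slowest URL in each batch before starting the next one. A semaphore keeps every concurrency slot busy instead. The sketch below is a minimal illustration, not part of the original example; `bounded_gather` and its `worker` parameter are hypothetical names, and `worker` stands in for any coroutine function that takes a URL.

```python
import asyncio
from typing import Any, Awaitable, Callable

async def bounded_gather(
    urls: list[str],
    worker: Callable[[str], Awaitable[Any]],
    limit: int = 20,
) -> list[Any]:
    """Run worker(url) for every URL with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def run(url: str) -> Any:
        # A new task starts as soon as any slot frees up,
        # rather than when the whole batch has finished.
        async with semaphore:
            return await worker(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(run(url) for url in urls))
```

With the `fetch` function above, this would be called inside the `httpx.AsyncClient` context as `await bounded_gather(urls, lambda u: fetch(client, u), limit=20)`.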

Where this breaks down: any URL that requires JavaScript execution. If the page shows a loading spinner and populates content after load, a plain HTTP client such as httpx or requests returns the spinner HTML, not the content.

JavaScript Content: Playwright with Batching

For lists where content loads via JavaScript—React SPAs, infinite scroll, dynamic filtering, price tables that render after an API call—you need a real browser.

import asyncio
from playwright.async_api import async_playwright

async def fetch_js(page, url: str) -> dict:
    try:
        await page.goto(url, wait_until="networkidle", timeout=30000)
        content = await page.content()
        return {"url": url, "html": content}
    except Exception as e:
        return {"url": url, "error": str(e)}

async def crawl_js_list(urls: list[str], concurrency: int = 5) -> list:
    results = []
    async with async_playwright() as p:
        browser = None
        try:
            browser = await p.chromium.launch(headless=True)
            for i in range(0, len(urls), concurrency):
                batch = urls[i:i + concurrency]
                pages = [await browser.new_page() for _ in batch]
                batch_results = await asyncio.gather(*[
                    fetch_js(page, url) for page, url in zip(pages, batch)
                ])
                for page in pages:
                    await page.close()
                results.extend(batch_results)
        finally:
            if browser:
                await browser.close()
    return results

Keep concurrency low (3–8 pages) when running locally: each page open in headless Chromium can consume roughly 100–300 MB. For larger lists, cloud browser infrastructure (Browserless, Browserbase) handles the browser pool so you're not resource-limited on your machine.

Where this breaks down: sites that enforce strict automation requirements at the network and behavioral level. JavaScript-level automation handling helps at low volume; at scale, sites with enterprise-grade access requirements become harder to handle reliably.

Sites with Strict Requirements or Authenticated Access: TinyFish

This is where simple HTTP requests stop being sufficient. Your list includes:

Product pages that return different content to automation than to browsers
Pricing pages that require login using your own authorized account
Sites with strict automation requirements that affect reliability at scale
Authenticated portals where each URL requires an authorized session

For these, maintaining a Playwright-based crawler means:

Managing automation configuration that needs ongoing updates as site requirements evolve
Building session management for authenticated URLs
Handling multi-step login flows and session state
Debugging failures that change based on site configurations you don't control

AI web agents move this work to the infrastructure level. You pass a URL and a goal; the agent handles rendering, request handling, and authentication for sites where you have authorized access.

import asyncio
import aiohttp
import os

async def crawl_url(session, url: str, goal: str) -> dict:
    try:
        async with session.post(
            "https://agent.tinyfish.ai/v1/automation/run",
            headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
            json={"url": url, "goal": goal},
            timeout=aiohttp.ClientTimeout(total=120)
        ) as resp:
            if resp.status != 200:
                return {"url": url, "result": None, "status": "HTTP_ERROR",
                        "error": await resp.text()}
            data = await resp.json()
    except (aiohttp.ClientError, asyncio.TimeoutError) as e:
        # A network error or timeout on one URL must not abort the whole batch.
        # "REQUEST_ERROR" is a local label, like "HTTP_ERROR" above.
        return {"url": url, "result": None, "status": "REQUEST_ERROR",
                "error": str(e)}
    # "COMPLETED" means the run finished, not that the goal succeeded.
    # Check for TASK_FAILED / SITE_BLOCKED / TIMEOUT before using result.
    status = data.get("status")
    result = data.get("result")
    if status != "COMPLETED" or result is None:
        return {"url": url, "result": None, "status": status,
                "error": data.get("error")}
    return {"url": url, "result": result, "status": status}

async def crawl_protected_list(urls: list[str], goal: str, concurrency: int = 10) -> list:
    results = []
    async with aiohttp.ClientSession() as session:
        for i in range(0, len(urls), concurrency):
            batch = urls[i:i + concurrency]
            batch_results = await asyncio.gather(*[
                crawl_url(session, url, goal) for url in batch
            ])
            results.extend(batch_results)
            print(f"Processed {min(i+concurrency, len(urls))}/{len(urls)}")
    return results

urls = ["https://protected-site.com/product/1", "https://protected-site.com/product/2"]
goal = "Extract the product name, current price, and availability status. Return as JSON."
results = asyncio.run(crawl_protected_list(urls, goal))

The concurrency limit is determined by your plan—10 concurrent agents on Starter, 50 on Pro. For a 1,000-URL list on Pro, that's 20 sequential batches of 50.
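The batch arithmetic is a ceiling division, which matters when the list does not divide evenly. A small sketch, using only the plan limits quoted above; the `PLAN_LIMITS` dictionary and `batch_count` function are illustrative names, not part of any TinyFish client:

```python
import math

# Concurrency limits as stated in this article; check your plan for current values.
PLAN_LIMITS = {"Starter": 10, "Pro": 50}

def batch_count(n_urls: int, concurrency: int) -> int:
    """Number of sequential batches needed to cover n_urls."""
    return math.ceil(n_urls / concurrency)

for plan, limit in PLAN_LIMITS.items():
    print(f"{plan}: {batch_count(1000, limit)} batches for 1,000 URLs")
```

To use the full Pro allowance with the crawler above, pass `concurrency=50` to `crawl_protected_list`, since its default is 10.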

When the math shifts: requests and Playwright are cheaper per-URL on cooperative, stable sites. The cost calculation changes when you include the full stack: cloud browser pools, proxy subscriptions, and ongoing maintenance. In our experience, production scrapers for sites that actively update their access requirements need patches every few weeks — that maintenance overhead accumulates faster than per-URL cost alone suggests. For URL lists with meaningful protected or authenticated content, the total operational cost typically exceeds TinyFish's per-step pricing before you reach production scale.

Handling the Mixed List

Real URL lists are rarely uniform. A supplier monitoring list might include:

60% static pricing pages (requests would work)
30% JavaScript-rendered product tables (Playwright needed)
10% authenticated portals with strict automation requirements (agents needed)

The practical approach: categorize your list before you crawl it. A quick GET probe on a sample of the list reveals which URLs respond to simple HTTP requests, which require rendering, and which block automation (a HEAD request shows the status code but not whether the body is an empty JavaScript shell). Route each category to the appropriate tool. The 10% that requires agents is where reliability actually matters: authentication failures and automation blocks are what stall production workflows, not the cooperative pages.
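The routing step itself can be sketched in a few lines. This is a minimal illustration under stated assumptions: the category names ('static', 'js', 'blocked') match the probe shown later in this section, the tool labels in `ROUTES` are placeholders for your own crawler functions, and sending unknown categories to the agent queue is a conservative default chosen here, not a rule from the article.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical mapping from probe category to the tool that should handle it.
ROUTES = {"static": "httpx", "js": "playwright", "blocked": "agent"}

def route_urls(urls: list[str], classify: Callable[[str], str]) -> dict[str, list[str]]:
    """Group URLs into one queue per tool, using any classifier function."""
    queues: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        category = classify(url)
        # Unrecognized categories fall back to the most capable (and most expensive) tool.
        queues[ROUTES.get(category, "agent")].append(url)
    return dict(queues)
```

Taking the classifier as a parameter keeps the routing logic testable without network access; in practice you would pass in a probe function such as the one below.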

To classify URLs before routing them, a quick probe is faster than a full crawl:

import httpx
import random

def classify_url(url: str, timeout: int = 10) -> str:
    """Returns 'static', 'js', or 'blocked' based on a quick probe."""
    try:
        r = httpx.get(url, timeout=timeout, follow_redirects=True,
                      headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code in (401, 403, 429):
            return "blocked"
        html = r.text
        js_signals = [
            len(html) < 500,                          # near-empty response
            '<div id="root">' in html,              # React
            '<div id="app">' in html,               # Vue
            "ng-version=" in html,                    # Angular
            "window.__NUXT__" in html,                # Nuxt
            html.count("<p") < 2 and len(html) < 2000, # minimal real content
        ]
        return "js" if any(js_signals) else "static"
    except Exception:
        return "blocked"

# Sample up to 50 URLs (10% of a 500-URL list) before committing to the full crawl.
# `urls` is the list loaded from urls.txt in the first example.
sample = random.sample(urls, min(50, len(urls)))
categories: dict[str, list] = {"static": [], "js": [], "blocked": []}
for url in sample:
    categories[classify_url(url)].append(url)

print(f"Static: {len(categories['static'])}, JS: {len(categories['js'])}, Blocked: {len(categories['blocked'])}")
# Use these proportions to size each tool's share of the work;
# to route individual URLs, run classify_url over the full list.

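A 429 response means the request was rate-limited: retry with backoff before escalating. A 401 or 403 indicates that access requires authentication or is blocked; retrying with the same tool won't help. A near-empty response or a JS framework marker suggests that JS rendering is needed. Clean HTML with visible <p> tags is most likely static. Note that the probe also labels network errors and timeouts as 'blocked', so re-check those URLs before routing them to an agent.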
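Retrying a 429 with backoff could look like the sketch below. It assumes a hypothetical `fetch` callable that returns a `(status_code, body)` tuple, so it works with any HTTP client; the attempt count, base delay, and jitter range are illustrative defaults rather than recommended values.

```python
import random
import time
from typing import Callable

def fetch_with_backoff(
    fetch: Callable[[str], tuple[int, str]],
    url: str,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, str]:
    """Retry only on 429, doubling the wait each time and adding random jitter."""
    status, body = 0, ""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status != 429:
            return status, body
        if attempt < max_attempts - 1:
            # Waits of roughly 1s, 2s, 4s ... plus up to base_delay of jitter,
            # so concurrent workers do not all retry at the same moment.
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    # Still rate-limited after every attempt: return the last 429 to the caller.
    return status, body
```

A URL that still returns 429 after the final attempt is a candidate for a slower schedule or a different tool; a 401 or 403 is returned immediately without retrying.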

Scale Considerations

List size	Tool	Rough time (10 concurrent)
100–1,000 static	requests/httpx	1–5 min
100–1,000 JS	Playwright	5–20 min
100–1,000 protected	TinyFish agents	10–30 min
10,000+ static	Scrapy	Hours, distributed
10,000+ JS or protected	Infrastructure + agents	Plan accordingly

For very large lists (100K+), distributed architecture matters regardless of tool—whether that's Scrapy's built-in scheduler, a task queue like Celery, or submitting batches to an async agent API and polling for results.

FAQ

What's the fastest way to fetch data from a large URL list in Python?

For static HTML content, httpx with asyncio is the fastest approach. You can process 20–50 URLs simultaneously on a single machine and finish 1,000 URLs in under a minute. The key is async execution: sequential requests would take 10–15x longer for the same list. For JavaScript-rendered content, Playwright in async mode with 5–10 concurrent browser pages is the practical ceiling before memory becomes the constraint on standard hardware; on a laptop, the 3–8 pages suggested earlier is the safer range.

How do I improve reliability when fetching data from many URLs?

Rate limiting comes first: keep to 1–2 requests per second per domain for most sites, and go slower where protection is aggressive. Rotate user agents across requests. For moderately protected sites, the requests library with a realistic user agent and reasonable delays works. For sites with enterprise-grade automation detection, JavaScript-level automation plugins help at low volume but degrade at scale; TinyFish provides infrastructure-level browser sessions that are more reliable for protected sites at production scale.
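A per-domain limit can be enforced with a small slot-booking helper. The sketch below is one possible design, not a library API: `DomainThrottle` is a hypothetical class, and the 0.5-second interval in the usage note corresponds to the 2 requests per second mentioned above.

```python
import asyncio
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Space out requests to the same host by at least `min_interval` seconds."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._next_slot: dict[str, float] = {}

    def reserve(self, url: str, now: float) -> float:
        """Book the next free slot for this URL's host and return the wait in seconds."""
        host = urlsplit(url).netloc.lower()
        slot = max(now, self._next_slot.get(host, now))
        self._next_slot[host] = slot + self.min_interval
        return slot - now

    async def wait(self, url: str) -> None:
        # reserve() contains no await, so concurrent tasks cannot book the same slot.
        await asyncio.sleep(self.reserve(url, time.monotonic()))
```

Create one `DomainThrottle(min_interval=0.5)` per crawl and call `await throttle.wait(url)` immediately before each request. URLs on different hosts proceed without delay, while URLs on the same host queue up behind each other.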

Should I use Scrapy or Playwright for a large URL list?

Scrapy if your URLs return static HTML and you need high volume (10K+) with built-in scheduling, retry logic, and output pipelines. Playwright if URLs require JavaScript execution. The two aren't mutually exclusive—Scrapy has a Playwright middleware (scrapy-playwright) that handles JS rendering within Scrapy's architecture. For lists with mixed content types, start with Scrapy for the static subset and use a separate Playwright job for the JS-heavy URLs.
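For reference, enabling scrapy-playwright is mostly a matter of project settings. The fragment below is a sketch of a `settings.py` excerpt, assuming the scrapy-playwright package is installed; the `CONCURRENT_REQUESTS` value is illustrative and should be tuned to your hardware.

```python
# settings.py fragment for a Scrapy project using scrapy-playwright.

# Route HTTP and HTTPS downloads through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Illustrative value: browser rendering is memory-heavy, so keep this modest.
CONCURRENT_REQUESTS = 8
```

Individual requests then opt in to browser rendering with `meta={"playwright": True}`; requests without that flag go through Scrapy's regular downloader, which is what lets one project serve both the static and the JS-heavy subsets.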

How do I deduplicate URLs before crawling?

Normalize URLs first: lowercase the scheme and domain, sort query parameters alphabetically, strip tracking parameters (utm_*, ref=, fbclid=), and resolve relative URLs to absolute. Python's urllib.parse.urlparse plus a set for deduplication handles most cases. For large lists with near-duplicate URLs (same page, different session IDs), w3lib.url.canonicalize_url applies a more thorough canonicalization (sorted query arguments, normalized percent-encoding); session-ID parameters still have to be stripped explicitly before comparing.
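The normalization steps above can be sketched with the standard library alone. Two choices here are assumptions beyond the list above: fragments are dropped (they are never sent to the server), and an empty path is treated as "/". The function names and the tracking-parameter constants are illustrative.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"ref", "fbclid"}

def normalize_url(url: str) -> str:
    """Return a canonical form of an absolute URL for duplicate detection."""
    parts = urlsplit(url.strip())
    # Drop tracking parameters, then sort the rest so parameter order does not matter.
    query = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_KEYS
        and not key.lower().startswith(TRACKING_PREFIXES)
    )
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        urlencode(query),
        "",  # assumption: fragments are ignored when comparing
    ))

def dedupe(urls: list[str]) -> list[str]:
    """Keep the first occurrence of each normalized URL, preserving input order."""
    seen: set[str] = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

Session-ID parameters vary by site, so add their names to `TRACKING_KEYS` once you know what a given target uses.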

When does crawling a URL list require authentication?

When the target pages are behind login walls that your team has authorized access to—supplier pricing portals, internal tools, subscription content, or any page that redirects to a login page for unauthenticated requests. Signs your list needs auth: all results return the same HTML (the login page), response sizes are suspiciously uniform, or you see redirect chains ending at /login. For authenticated list crawling at scale, session management becomes the primary complexity—handling login flows, session expiry, and re-authentication across many concurrent workers. TinyFish handles session management and multi-step login flows for sites where you have authorized account access — you provide credentials, the agent handles the rest.
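Two of the signs listed above (redirects ending at a login path and identical response bodies) can be measured directly from crawl results. The sketch below assumes each result dictionary carries a `final_url` field (the URL after redirects, which the earlier examples do not record) alongside `html`; the `LOGIN_HINTS` path fragments are guesses that will need adjusting per site.

```python
import hashlib
from collections import Counter
from urllib.parse import urlsplit

# Assumed path fragments that suggest a login page; adjust for your targets.
LOGIN_HINTS = ("login", "signin", "sign-in")

def login_wall_signals(results: list[dict]) -> dict[str, float]:
    """Return the share of results that look like a login wall, by two measures."""
    total = len(results)
    if total == 0:
        return {"redirected_to_login": 0.0, "identical_body": 0.0}

    # Share of results whose final URL path contains a login hint.
    to_login = sum(
        any(hint in urlsplit(r["final_url"]).path.lower() for hint in LOGIN_HINTS)
        for r in results
    )

    # Share of results that have exactly the same body as the most common one.
    body_hashes = Counter(
        hashlib.sha256(r["html"].encode("utf-8")).hexdigest() for r in results
    )
    most_common_count = body_hashes.most_common(1)[0][1]

    return {
        "redirected_to_login": to_login / total,
        "identical_body": most_common_count / total,
    }
```

If either share is high across a batch of URLs that should differ from one another, the crawl is most likely collecting the login page rather than the content. With httpx, the final URL is available as `str(r.url)` on the response.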

Try TinyFish Free

500 free steps, no credit card. The fastest way to test whether TinyFish fits your workflow.

