Tinyfishie

Posted on May 19 • Originally published at tinyfish.ai

How to Extract Structured Data from a Website

#webscraping #dataextraction #tutorial #playwright

TinyFish agents are cloud-based browser sessions that navigate any website — static, JavaScript-rendered, or requiring multi-step workflows — and return machine-readable structured data without requiring you to manage browser infrastructure, proxy rotation, or session handling.

The right tool for structured data extraction depends entirely on what kind of website you're dealing with. This guide presents a four-tier framework: most sites fall into Tier 1 or 2, where you don't need TinyFish at all. Tier 3 and 4 are where managed infrastructure earns its cost.

Prerequisites for the code examples: Python 3.8+. Install: pip install requests feedparser playwright tinyfish, then run playwright install chromium to download the browser binary the Playwright example needs.

Responsible use note: Always review a site's terms of service before automated data extraction. Public-facing data for research, competitive intelligence, and non-commercial analysis is generally low-risk. Extraction at scale for data resale or targeting platform users requires careful legal review.

What Counts as "Structured Data Extraction"?

The goal is machine-readable output — JSON, CSV, a typed object — rather than raw HTML or a PDF dump. A scraper that returns <div class="price">$29.99</div> isn't done; a scraper that returns { "price": 29.99, "currency": "USD" } is.
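To make the contrast concrete, here is a minimal sketch of turning that HTML fragment into the typed record. The helper name parse_price and the symbol-to-currency mapping are illustrative assumptions (in particular, "$" is assumed to mean USD), and the regex only handles simple prices like the one shown.

```python
import re

# Illustrative mapping only; "$" is assumed to mean USD here.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "£": "GBP"}


def parse_price(fragment: str) -> dict:
    """Convert a fragment such as '<div class="price">$29.99</div>' to a typed record."""
    # Strip tags, leaving only the visible text.
    text = re.sub(r"<[^>]+>", "", fragment).strip()
    # Currency symbol, then digits with optional thousands separators and decimals.
    match = re.fullmatch(r"([$€£])\s*(\d+(?:,\d{3})*(?:\.\d+)?)", text)
    if match is None:
        raise ValueError(f"unrecognized price: {text!r}")
    symbol, amount = match.groups()
    return {
        "price": float(amount.replace(",", "")),
        "currency": CURRENCY_SYMBOLS[symbol],
    }


record = parse_price('<div class="price">$29.99</div>')
print(record)
```

Raising on unrecognized input rather than returning a partial record keeps bad rows out of downstream data; a real pipeline might log and skip instead.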

Tier 1 sites are trivially easy; Tier 4 sites require the most capability. Most real-world extraction projects span multiple tiers: a single pipeline might hit Tier 1 sources for some data and Tier 3 sources for others.

Four Tiers of Website Complexity

Tier	Site type	What makes it hard	Right tool	Code complexity
1	Has an API or RSS feed	Nothing — use the API	`requests` + JSON	Trivial
2	JS-rendered, no API	Content loads after JS executes	Playwright (local) or TinyFish Fetch (managed)	Low
3	Strict automation requirements at scale	Infrastructure gaps cause failures in production	TinyFish Fetch with browser	Medium
4	Authenticated or multi-step workflow	Session state, conditional navigation, decisions	TinyFish Web Agent	Handled for you

Tiers 3 and 4 aren't edge cases — they're where production data pipelines typically land once you move beyond toy examples.

Tier 1 — Sites with APIs or RSS Feeds

Always check for an official API before writing a scraper. Many sites that look like scraping targets have well-documented APIs that return exactly the structured data you need.

Where to look: /robots.txt, the site footer ("Developers" / "API" links), documentation subdomains, RapidAPI, or just searching "[site name] API documentation".
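As a rough sketch of the /robots.txt check, Python's standard urllib.robotparser can report which paths a site allows and which sitemaps it advertises. The sample robots.txt content, the user agent name my-extractor, and the helper inspect_robots are made up for illustration; against a real site you would load the file with set_url() and read() instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Sitemap: https://data-source.example.com/sitemap.xml
"""


def inspect_robots(robots_txt: str, user_agent: str, url: str) -> dict:
    """Report whether `url` may be fetched and which sitemaps are listed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        "allowed": parser.can_fetch(user_agent, url),
        "sitemaps": parser.site_maps() or [],  # site_maps() returns None when absent
    }


info = inspect_robots(
    SAMPLE_ROBOTS,
    "my-extractor",
    "https://data-source.example.com/api/v1/products",
)
print(info)
```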

import requests

# Example: a site that returns JSON at a predictable endpoint
response = requests.get(
    "https://data-source.example.com/api/v1/products",
    headers={"Authorization": "Bearer your_api_key"},
    params={"category": "electronics", "format": "json"},
    timeout=30,  # fail fast instead of hanging on a stalled connection
)
response.raise_for_status()  # surface 4xx/5xx errors before parsing the body
products = response.json()
print(products[0])  # e.g. {'id': '123', 'name': 'Widget A', 'price': 29.99}

RSS feeds are Tier 1 for content monitoring:

import feedparser

feed = feedparser.parse("https://news-source.example.com/feed.xml")
# Feeds do not always populate every field, so read entries with .get()
articles = [
    {"title": e.get("title"), "url": e.get("link"), "published": e.get("published")}
    for e in feed.entries
]
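Feed dates arrive as strings, so a typed pipeline usually normalizes them. The sketch below assumes RSS 2.0 style dates (RFC 822 format, for example "Mon, 19 May 2025 10:00:00 +0000") and uses only the standard library; the helper name normalize_entry and the sample entry are illustrative. Feeds that use other date formats would come back with published set to None under this approach.

```python
from email.utils import parsedate_to_datetime


def normalize_entry(entry: dict) -> dict:
    """Return a record whose 'published' field is an ISO 8601 string or None."""
    published = entry.get("published")
    iso = None
    if published:
        try:
            iso = parsedate_to_datetime(published).isoformat()
        except (TypeError, ValueError):
            # Keep the record but drop a date that cannot be parsed.
            iso = None
    return {"title": entry.get("title"), "url": entry.get("link"), "published": iso}


sample = {
    "title": "Release notes",
    "link": "https://news-source.example.com/posts/1",
    "published": "Mon, 19 May 2025 10:00:00 +0000",
}
print(normalize_entry(sample))
```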

This is the best case. It's fast, free, respects the site's intended access pattern, and won't break when page layouts change.

Tier 2 — JavaScript-Rendered Pages

When a site loads content after JavaScript executes, requests alone returns an empty or incomplete page. You need a browser that runs JavaScript.

Two equally valid approaches, each with a clear use case:

Playwright — the right choice when you're running locally, need a controlled environment, or are integrating into an existing test suite. Free to use, open-source, excellent documentation.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://js-rendered-site.example.com/catalog")
    page.wait_for_selector(".product-card")  # wait for JS to render

    # Optional chaining yields null instead of throwing when a card lacks a field
    products = page.evaluate("""
        () => Array.from(document.querySelectorAll('.product-card')).map(el => ({
            name: el.querySelector('.name')?.innerText ?? null,
            price: el.querySelector('.price')?.innerText ?? null,
        }))
    """)
    browser.close()

TinyFish Fetch API — the right choice when you want zero infrastructure overhead: no browser process to manage, no selector maintenance when the site updates, no proxy setup. Returns markdown or structured text from the live rendered page.

import os

import requests

response = requests.post(
    "https://api.fetch.tinyfish.ai",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={"urls": ["https://js-rendered-site.example.com/catalog"], "format": "markdown"},
    timeout=60,  # rendering a page takes longer than a plain HTTP request
)
response.raise_for_status()  # surface HTTP-level errors before parsing
results = response.json().get("results", [])
content = results[0]["text"] if results else ""  # full page content after JS renders

For simple JS-rendered pages where content is fully visible after load, either approach works. Playwright gives you more control; TinyFish Fetch eliminates infrastructure management.

Tier 3 — Sites with Strict Automation Requirements

Playwright requires significant infrastructure work to run reliably on sites with strict automation requirements at scale. The gaps compound:

Proxy management — single-IP requests fail rate limits; rotating proxies require sourcing, billing, and maintenance
Session handling — requests with stale or suspicious session fingerprints get challenged more frequently
Concurrency — running 50 Playwright instances locally saturates memory; running them in cloud containers requires container orchestration

TinyFish Fetch API handles the infrastructure layer: proxy routing, session management, and request handling. Your extraction code stays the same.

import os

import requests

response = requests.post(
    "https://api.fetch.tinyfish.ai",
    headers={"X-API-Key": os.environ["TINYFISH_API_KEY"]},
    json={
        "urls": ["https://strict-requirements-site.example.com/listings"],
        "format": "markdown"
    },
    timeout=60,  # avoid hanging indefinitely on a stalled request
)

# Check for errors — failed URLs don't consume credits
result_data = response.json()
errors = result_data.get("errors", [])
results = result_data.get("results", [])
if results and not errors:
    content = results[0]["text"]
else:
    print("Fetch incomplete. Errors:", errors)

The Fetch API is free on all TinyFish plans — zero credits consumed per request. You pay only for Web Agent steps (Tier 4). For Tier 3 use cases, this means running extraction at scale without per-request credit costs.

For running extraction across many URLs in parallel, see how to run 1,000 parallel web requests with the TinyFish Fetch API.
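As a generic illustration of bounded parallelism (not the method from the linked guide), the sketch below fans URLs out over a thread pool and separates successes from failures. The function fetch_many and the stand-in fake_fetch are hypothetical; in practice fetch_one would wrap the requests.post call shown above. The request body shown earlier takes a list of URLs, so batching several per call may also be an option, though batch limits are not covered here.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_many(urls, fetch_one, max_workers=8):
    """Run fetch_one(url) concurrently; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed URL should not abort the batch
                errors[url] = str(exc)
    return results, errors


def fake_fetch(url):
    """Stand-in for a real fetch call so the sketch runs offline."""
    if "bad" in url:
        raise RuntimeError("simulated failure")
    return f"content of {url}"


results, errors = fetch_many(
    ["https://a.example.com", "https://bad.example.com", "https://c.example.com"],
    fake_fetch,
    max_workers=2,
)
print(len(results), "succeeded;", len(errors), "failed")
```

Capping max_workers keeps memory and connection counts predictable, and collecting errors per URL makes it straightforward to retry only the failures.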

Tier 4 — Authenticated or Multi-Step Workflows

When extraction requires navigating through multiple pages, making decisions based on page content, or maintaining session state across requests — this is where a goal-directed web agent is the right abstraction.

Framing note: Tier 4 workflows are appropriate when you're accessing accounts and portals you are authorized to use — your own organization's tools, accounts you operate, or systems where you have explicit access rights.

from tinyfish import TinyFish

client = TinyFish()  # reads TINYFISH_API_KEY from env

# Example: extract a structured report from a multi-step portal
result = client.agent.run(
    url="https://portal.example.com",
    goal="""Navigate to the Analytics section and extract the monthly summary.
    Return JSON with: { month, revenue, units_sold, top_products: string[] }
    Use the authorized credentials already stored in Vault."""
)

# COMPLETED means the session ran — not that the goal succeeded.
# Check both infrastructure-level and goal-level results:
if hasattr(result, 'status') and result.status == "FAILED":
    print("Infrastructure error:", getattr(result, 'error', 'unknown'))
else:
    data = result.result or {}
    if data.get("status") == "failure":
        print("Agent could not complete goal:", data.get("reason"))
    else:
        print("Extracted:", data)

The key distinction from Tier 3: the Web Agent doesn't just fetch a URL — it understands a goal, navigates through steps to reach it, and returns structured output. Session state, conditional navigation ("if the report isn't available yet, check back in 10 minutes"), and multi-page workflows are handled by the agent, not by your code.
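Because the agent returns whatever it managed to extract, it can help to check the output against the shape requested in the goal before using it. The sketch below is one way to do that. The field types are assumptions (the goal only specifies that top_products is a list of strings), and validate_summary is a hypothetical helper, not part of the TinyFish SDK.

```python
# Assumed types for the fields named in the goal; adjust to your own schema.
EXPECTED_FIELDS = {
    "month": str,
    "revenue": (int, float),
    "units_sold": int,
    "top_products": list,
}


def validate_summary(data: dict) -> list:
    """Return a list of problems; an empty list means the record looks usable."""
    problems = []
    for field, expected in EXPECTED_FIELDS.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif isinstance(data[field], bool) or not isinstance(data[field], expected):
            # bool is rejected explicitly because it is a subclass of int
            problems.append(f"wrong type for {field}: {type(data[field]).__name__}")
    products = data.get("top_products")
    if isinstance(products, list) and not all(isinstance(p, str) for p in products):
        problems.append("top_products must contain only strings")
    return problems


good = {"month": "2024-04", "revenue": 1200.5, "units_sold": 40, "top_products": ["Widget A"]}
bad = {"month": "2024-04", "revenue": "1200", "top_products": ["Widget A", 3]}
print(validate_summary(good))
print(validate_summary(bad))
```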

Choosing the Right Approach

Does the site have an API or RSS feed?
  → Yes: Use the API (Tier 1). No further tooling needed.
  → No: Is the content JavaScript-rendered?
       → No (static HTML): requests + BeautifulSoup for simple cases.
       → Yes: Is this a local/controlled environment?
              → Yes: Playwright (Tier 2a).
              → No, or you need managed scale: TinyFish Fetch API (Tier 2b or 3).
                   → Does the workflow require multi-step navigation or session state?
                          → Yes: TinyFish Web Agent (Tier 4).
                          → No: TinyFish Fetch API (Tier 3).
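The same decision tree can be written as a small function, which is handy when a pipeline has to route many sources automatically. This sketch mirrors the branches exactly as listed above; the function name and boolean parameters are illustrative.

```python
def choose_tool(has_api: bool, js_rendered: bool, local_env: bool, multi_step: bool) -> str:
    """Walk the decision tree top to bottom and return the suggested tool."""
    if has_api:
        return "API (Tier 1)"
    if not js_rendered:
        return "requests + BeautifulSoup"
    if local_env:
        return "Playwright (Tier 2a)"
    # Managed scale: the remaining question is whether the workflow is multi-step.
    if multi_step:
        return "TinyFish Web Agent (Tier 4)"
    return "TinyFish Fetch API (Tier 3)"


print(choose_tool(has_api=False, js_rendered=True, local_env=False, multi_step=True))
```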

Use case	Best tool
Site has a public API	Use the API
Static HTML, simple extraction	requests + BeautifulSoup
JS-rendered, local or test environment	Playwright
JS-rendered, production at scale	TinyFish Fetch
Strict automation requirements at scale	TinyFish Fetch (browser: true)
Multi-step workflow or session state	TinyFish Web Agent

Most production extraction projects combine multiple tiers. An e-commerce monitoring pipeline might use a site's API for product catalog (Tier 1), TinyFish Fetch for pricing pages (Tier 3), and TinyFish Web Agent for authenticated inventory portals (Tier 4).

The right extraction tool is the one that matches the complexity of what you're extracting. For Tier 1 and 2, the answer is almost always free open-source tools. For Tier 3 and 4, managed infrastructure pays for itself in engineering time avoided. Start with the simplest tool that works, and upgrade when the complexity demands it.

FAQ

What is the simplest way to extract structured data from a website?

The simplest approach is checking whether the site has a public API or RSS feed first — if it does, use that directly and skip browser automation entirely. For sites without APIs, Python's requests library works for static HTML. JavaScript-rendered sites require a browser; Playwright for local use, TinyFish Fetch API for managed-scale extraction.

When does Playwright break at scale?

Playwright is reliable for local and small-scale use. In production environments running hundreds or thousands of daily extractions, the infrastructure requirements compound: proxy sourcing and rotation, browser process memory limits, session fingerprint freshness, and container orchestration for concurrency. TinyFish Fetch handles this infrastructure layer, keeping your extraction code simple.

What is the TinyFish Fetch API and how is it different from requests?

TinyFish Fetch API runs a full browser session and returns the page content after JavaScript execution, similar to what Playwright returns but without the infrastructure overhead. Python's requests, by contrast, sends a plain HTTP request and returns the raw server response without executing any JavaScript, so dynamically loaded content never appears in it. The Fetch API is free on all TinyFish plans: zero credits per successful request.

When do I need the Web Agent instead of the Fetch API?

Use the Web Agent when your extraction involves decisions, navigation across multiple pages, or session state that must persist across steps. If you're fetching a single URL and parsing the response, Fetch is sufficient. If you need to "navigate to the quarterly report, download the CSV, and extract the revenue line" — that's a multi-step goal requiring the Web Agent.

Is extracting publicly visible data legal?

Public-facing data extraction for research, competitive intelligence, and non-commercial purposes is generally low-risk in most jurisdictions. Terms of service are contractual restrictions rather than statutes, and their enforceability varies by context. Data that requires authentication (content behind login pages) is in a different category: only access portals and accounts you are authorized to use. Consult legal counsel for commercial data products or any extraction at significant scale.

How do I extract structured data from a site that requires login?

Use TinyFish Web Agent with Vault, which stores credentials securely and injects them into the browser session. The agent navigates the authentication flow using credentials for accounts you are authorized to access, then continues to the target content. This is appropriate for your own organizational tools, SaaS dashboards, and portals where you hold the account.
