You need data from a website. Do you write a Python scraper? Spin up Playwright? Use a browser extension? After extracting data from hundreds of different sites, I've developed a framework for choosing the right tool.
The Options
| Approach | Runs | Handles JS | Login Support | Setup Time |
|---|---|---|---|---|
| Python + requests | Server | ❌ | Manual cookies | 5 min |
| Python + BeautifulSoup | Server | ❌ | Manual cookies | 5 min |
| Playwright/Puppeteer | Server | ✅ | Scriptable | 15 min |
| Browser Extension | User's browser | ✅ | Automatic | 0 min |
| Copy-paste | User's browser | ✅ | Automatic | 0 min |
Each has tradeoffs. Let's break them down.
Option 1: Python + Requests/BeautifulSoup
Best for: Static HTML pages, APIs, automated pipelines
```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/data")
response.raise_for_status()  # fail fast on 4xx/5xx instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")
rows = []
for tr in table.find_all("tr"):
    row = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    rows.append(row)
```
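Once you have `rows`, getting them into a usable format is a few more lines of standard library. A quick sketch writing them out as CSV (sample data stands in for the scraped rows):

```python
import csv
import io

# `rows` as produced by the scraping loop above (sample data here).
rows = [["Country", "Population"], ["China", "1,411,750,000"]]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()  # quoting of embedded commas is handled for you
```

From there, `csv_text` can go to a file, a response body, or straight into pandas.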
Advantages:
- Fast execution
- Easy to schedule (cron, Airflow)
- No browser overhead
Limitations:
- Doesn't execute JavaScript
- Many modern sites render tables client-side
- Authentication requires manual cookie handling
When it fails: Try scraping a React-based dashboard. The HTML you get is an empty `<div id="root"></div>`.
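You can detect this failure mode cheaply before investing in a parser: fetch the page and check whether there is any meaningful text outside the script tags. A rough heuristic (the regexes and the 200-character threshold are arbitrary choices, not a standard):

```python
import re

def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: if the HTML has almost no text once <script>/<style>
    blocks and tags are stripped, the content is probably rendered
    client-side and plain requests won't see it."""
    # Drop script/style blocks, then strip all remaining tags.
    without_scripts = re.sub(r"(?s)<(script|style)\b.*?</\1>", "", html)
    text = re.sub(r"(?s)<[^>]+>", "", without_scripts)
    return len(text.strip()) < min_text_chars

spa_html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
```

If this returns `True`, skip straight to a headless browser or an extension.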
Option 2: Headless Browsers (Playwright/Puppeteer)
Best for: JavaScript-rendered pages, automated pipelines, sites requiring interaction
```javascript
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com/dashboard');

  // Wait for the table to render
  await page.waitForSelector('table');

  const data = await page.evaluate(() => {
    const table = document.querySelector('table');
    return Array.from(table.rows).map(row =>
      Array.from(row.cells).map(cell => cell.textContent.trim())
    );
  });

  console.log(data);
  await browser.close();
})();
```
Advantages:
- Executes JavaScript
- Can interact with pages (clicks, scrolls, form fills)
- Scriptable authentication flows
Limitations:
- Slower than direct HTTP
- Resource intensive (memory, CPU)
- More complex error handling
- Sites may detect headless browsers
When it fails: Many sites detect Puppeteer/Playwright via navigator properties, WebGL fingerprinting, or behavioral analysis. Stealth plugins help but aren't foolproof.
Option 3: Browser Extensions
Best for: Ad-hoc extraction, authenticated sites, user-initiated workflows
A browser extension runs in the user's actual browser session, with full access to:
- Authenticated sessions
- JavaScript-rendered content
- The exact DOM the user sees
For a comparison of the best options, see 5 Best Chrome Extensions to Export Tables from Websites.
```javascript
// Content script in a Chrome extension
chrome.runtime.onMessage.addListener((msg, sender, sendResponse) => {
  if (msg.type === "EXTRACT_TABLE") {
    const table = document.querySelector("table");
    const data = Array.from(table.rows).map(row =>
      Array.from(row.cells).map(cell => cell.textContent.trim())
    );
    sendResponse({ data });
  }
});
```
Advantages:
- Zero authentication handling—you're already logged in
- Sees exactly what you see (no JS rendering issues)
- Works on any site (no bot detection)
- No server infrastructure needed
Limitations:
- Requires manual trigger (user clicks)
- Can't run on a schedule
- Limited to browser context
When it fails: You need to extract data from 10,000 pages automatically. Extensions are for user-initiated workflows, not batch processing.
Decision Framework
Use Python + requests when:
- ✅ Page is server-rendered HTML
- ✅ You need automated/scheduled extraction
- ✅ You're building a data pipeline
- ✅ Authentication is via API keys or simple cookies
Use Playwright/Puppeteer when:
- ✅ Page requires JavaScript rendering
- ✅ You need automated/scheduled extraction
- ✅ You need to interact (scroll, click, paginate)
- ✅ You can handle bot detection countermeasures
Use a Browser Extension when:
- ✅ You're already logged in to the site
- ✅ You need data occasionally (not automated)
- ✅ The site has strong bot detection
- ✅ You want data NOW without writing code
Just copy-paste when:
- ✅ One-time extraction
- ✅ Simple table structure
- ✅ You don't need it in a specific format
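The checklists above can be collapsed into a small helper. This is just the framework restated as code, with deliberately simplified inputs (real decisions have more axes than three booleans):

```python
def pick_tool(js_rendered: bool, needs_login: bool, automated: bool) -> str:
    """Rough encoding of the decision framework above (simplified)."""
    if automated:
        # Scheduled/batch work rules out extensions entirely.
        return "playwright" if js_rendered else "requests + BeautifulSoup"
    if needs_login or js_rendered:
        # One-off extraction from an authenticated or client-rendered page.
        return "browser extension"
    return "copy-paste or extension"
```

The key fork is automation: the moment the job has to run without a human, the extension column drops out of the table.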
Real-World Examples
Example 1: Wikipedia Tables
Best approach: Browser extension or Python
Wikipedia is server-rendered HTML with no authentication. Python works perfectly:
```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_countries_by_population"
tables = pd.read_html(url)
df = tables[0]
```
But Wikipedia tables often have complex rowspans and navigation rows. A dedicated extension handles these automatically.
For a step-by-step guide, see Export Wikipedia Tables to Excel in 30 Seconds.
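If you do want to handle rowspans in Python yourself, the core idea is to carry spanned cells down into later rows. A minimal sketch, assuming you've already collected each cell as a `(text, rowspan)` tuple while walking the `<td>`/`<th>` elements (that tuple format is illustrative, not any library's API; colspan is omitted for brevity):

```python
def expand_rowspans(rows):
    """Expand rowspan'd cells so every row has a value in every column.

    `rows`: list of rows, each cell a (text, rowspan) tuple.
    Returns plain lists of strings with spanned values repeated.
    """
    result = []
    carry = {}  # column index -> (text, rows still owed)
    for row in rows:
        out = []
        cells = iter(row)
        col = 0
        while True:
            if col in carry:
                text, remaining = carry[col]
                out.append(text)
                if remaining > 1:
                    carry[col] = (text, remaining - 1)
                else:
                    del carry[col]
                col += 1
                continue
            try:
                text, span = next(cells)
            except StopIteration:
                if not carry or col > max(carry):
                    break
                out.append("")  # short row; pad until the carried cell
                col += 1
                continue
            out.append(text)
            if span > 1:
                carry[col] = (text, span - 1)
            col += 1
        result.append(out)
    return result
```

This is the kind of bookkeeping a dedicated exporter does for you, which is why the extension route is attractive for messy Wikipedia tables.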
Example 2: Financial Dashboard (Behind Login)
Best approach: Browser extension
Your brokerage shows portfolio data after login. Options:
Python: Reverse-engineer the authentication flow, handle 2FA, maintain session cookies. Possible but fragile.
Playwright: Script the login, handle 2FA prompts, navigate to the data. Works but complex.
Extension: Log in normally, click "Export Table." Done.
For occasional exports, the extension wins on time-to-data.
Example 3: Daily Price Monitoring (1000 Pages)
Best approach: Playwright + queue
You need to check prices across 1000 product pages daily. This requires:
```javascript
// Pseudocode for batch extraction
const urls = loadUrlsFromDatabase();

for (const url of urls) {
  const page = await browser.newPage();
  await page.goto(url);
  await page.waitForSelector('.price');

  const price = await page.evaluate(() =>
    document.querySelector('.price').textContent
  );

  await saveToDatabase(url, price);
  await page.close();

  // Rate limiting
  await sleep(randomBetween(1000, 3000));
}
```
Extensions can't do this—they require user interaction. Playwright is the right tool.
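In Python terms, the same loop comes down to two things: a jittered delay and bounded concurrency. A sketch with a stubbed fetch function standing in for the real Playwright navigation (the delays are shrunk here; use seconds in production):

```python
import asyncio
import random

async def fetch_price(url: str) -> str:
    # Stub standing in for page.goto / waitForSelector / evaluate.
    await asyncio.sleep(0)
    return f"price-for-{url}"

async def crawl(urls, max_concurrency: int = 5):
    sem = asyncio.Semaphore(max_concurrency)  # never more than N pages open
    results = {}

    async def worker(url):
        async with sem:
            results[url] = await fetch_price(url)
            # Jittered delay so requests don't hit the site in bursts.
            await asyncio.sleep(random.uniform(0.001, 0.003))

    await asyncio.gather(*(worker(u) for u in urls))
    return results

prices = asyncio.run(crawl([f"https://example.com/p/{i}" for i in range(10)]))
```

The semaphore caps memory (each open page is a real browser tab), and the random sleep is the cheapest politeness measure you can add.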
Example 4: One-Time Sports Stats Export
Best approach: Browser extension
FBREF has complex two-level headers. You need this season's stats once.
Python approach: 30 minutes writing a custom parser for their table structure.
Extension approach: Click export. 10 seconds.
For one-time extraction, don't over-engineer it.
Hybrid Approaches
Sometimes you need both:
- **Use extension to export authentication cookies**
  - Export cookies from a logged-in session
  - Import into Python/Playwright for automation
- **Use extension to inspect structure, Python to scale**
  - Manually examine the DOM with extension tools
  - Write a targeted scraper once you understand the structure
- **Use Playwright for navigation, extension for extraction**
  - Script navigates to the page
  - Calls extension API for complex table parsing
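The cookie-export hybrid is worth a concrete sketch. Many cookie-export tools emit a JSON list of `{"name": ..., "value": ...}` objects (verify the exact shape your exporter produces); turning that into a header Python can replay is one line:

```python
def cookie_header(exported):
    """Build a Cookie header value from a list of {"name": ..., "value": ...}
    dicts, the shape many cookie-export tools produce (verify yours)."""
    return "; ".join(f"{c['name']}={c['value']}" for c in exported)

exported = [
    {"name": "session", "value": "abc123"},
    {"name": "csrf", "value": "xyz"},
]
headers = {"Cookie": cookie_header(exported)}
# Pass `headers` to your HTTP client of choice to reuse the logged-in session.
```

Cookies expire, so this bridges a session into automation; it doesn't replace a real login flow for long-running jobs.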
My Stack
For the 80% of cases where I need web table data:
- One-time, authenticated: HTML Table Exporter (browser extension I built)
- One-time, public data: `pd.read_html()` or extension
- Automated pipeline: Playwright with custom parsers
- API available: Direct API calls (always preferred)
The best tool depends on the specific extraction. Match the complexity of your solution to the complexity of your problem.
What's your go-to extraction approach? Learn more at gauchogrid.com/html-table-exporter or try the extension free on the Chrome Web Store.