circobit

From Web Table to Pandas DataFrame in 30 Seconds

You found the perfect dataset. It's sitting right there on a webpage, neatly formatted in an HTML table. You just need to get it into Pandas.

How hard could it be?

The One-Liner (When It Works)

Pandas has a built-in function for this (it needs an HTML parser library such as lxml installed):

import pandas as pd

tables = pd.read_html('https://example.com/page-with-table')
df = tables[0]  # First table on the page

This is beautiful when it works. Three lines, done.

But here's what the tutorials don't tell you: pd.read_html() fails on a surprising number of real-world websites.

JavaScript-rendered tables? Pandas can't see them. It only parses the HTML the server sends back and never runs the page's scripts.

Tables that require authentication? You'll need to handle sessions and cookies first.

Complex nested structures? The parsing might produce garbage.

Anti-scraping measures? You'll get blocked or served different content.

For simple, static HTML tables on public pages, pd.read_html() is great. For everything else, you need alternatives.
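A quick check before reaching for heavier tools: count the <table> tags in the raw HTML the server returns. Here's a minimal sketch using only the standard library. The count_tables helper and both sample pages are made up for illustration. If it reports 0 for a page where you can see a table in the browser, the table is probably being built by JavaScript.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method
        if tag == "table":
            self.count += 1


def count_tables(html: str) -> int:
    """Return how many <table> elements appear in the raw HTML."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical pages: one static, one that builds its content with a script
static_page = "<html><body><table><tr><td>1</td></tr></table></body></html>"
js_page = "<html><body><div id='app'></div><script>render()</script></body></html>"

print(count_tables(static_page))  # 1
print(count_tables(js_page))      # 0
```

In practice you would pass in the text of the response you fetched. Nested tables are counted individually, so a count above 1 doesn't always mean several separate datasets.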

The Requests + BeautifulSoup Approach

When pd.read_html() fails, the next step is usually:

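import requests
from bs4 import BeautifulSoup
import pandas as pd

response = requests.get('https://example.com/page', timeout=10)
response.raise_for_status()  # Stop early on 4xx/5xx responses
soup = BeautifulSoup(response.content, 'html.parser')

# find() returns None when nothing matches, so check before using the result
table = soup.find('table', {'class': 'data-table'})
if table is None:
    raise ValueError('No table with class "data-table" found')

# Extract headers (assumes the first row is the header row)
all_rows = table.find_all('tr')
headers = [cell.get_text(strip=True) for cell in all_rows[0].find_all(['th', 'td'])]

# Extract rows, keeping any row-header <th> cells so columns stay aligned
rows = []
for tr in all_rows[1:]:
    row = [cell.get_text(strip=True) for cell in tr.find_all(['td', 'th'])]
    if row:
        rows.append(row)

df = pd.DataFrame(rows, columns=headers)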

This gives you more control. You can target specific tables, handle edge cases, and clean data during extraction.
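Cells pulled out one by one are all strings, so the cleaning usually happens right after extraction. A minimal sketch: the sample rows below are invented stand-ins for scraped cells, and the comma thousands separator and the dash for missing values are assumptions about the source's formatting.

```python
import pandas as pd

# Hypothetical rows, shaped like what a cell-by-cell scrape produces
headers = ["Country", "GDP (USD bn)"]
rows = [["Atlantis", "1,234.5"], ["Utopia", "-"], ["El Dorado", "987"]]

df = pd.DataFrame(rows, columns=headers)

# Strip thousands separators, then coerce anything non-numeric to NaN
gdp = df["GDP (USD bn)"].str.replace(",", "", regex=False)
df["GDP (USD bn)"] = pd.to_numeric(gdp, errors="coerce")

print(df.dtypes)
```

With errors="coerce", unparseable cells become NaN instead of raising, which is handy for placeholders but can also hide real parsing problems, so it's worth checking how many values went missing.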

The downsides:

  • More code to write and maintain
  • Still can't handle JavaScript-rendered content
  • Breaks when the website structure changes

The Selenium Nuclear Option

For JavaScript-heavy sites:

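from io import StringIO

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
try:
    driver.get('https://example.com/page')

    # Wait up to 10 seconds for JavaScript to render the table
    # (more reliable than a fixed time.sleep)
    table = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, 'table.data-table'))
    )

    # One option: hand the rendered table's HTML to pandas
    # (the BeautifulSoup extraction logic works here too)
    df = pd.read_html(StringIO(table.get_attribute('outerHTML')))[0]
finally:
    driver.quit()  # Close the browser even if something above fails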

This works on almost anything. The browser renders the page fully, JavaScript and all, then you extract the data.

The cost:

  • Slow (seconds per page instead of milliseconds)
  • Requires browser driver setup
  • Resource-heavy
  • Feels like overkill for grabbing one table

The Lazy Way (My Favorite)

Here's what I actually do most of the time:

  1. Open the page in my browser
  2. Export the table to CSV with a browser extension
  3. Load it in Pandas:
df = pd.read_csv('exported_table.csv')

Done. No scraping code. No debugging HTML selectors. No handling edge cases in Python.

The browser already rendered the JavaScript. The browser extension handles the HTML parsing and data cleaning. I get a clean CSV file that Pandas reads without issues.
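If the exported file still has formatted numbers or placeholder strings in it, pd.read_csv() can deal with most of that at load time. A small sketch: the CSV text is invented and stands in for an exported file, and the thousands and na_values settings are assumptions about how the source formats its data.

```python
from io import StringIO

import pandas as pd

# Stand-in for the contents of an exported file such as exported_table.csv
csv_text = '''Country,Population,Area
Atlantis,"1,234,567",..
Utopia,"89,000",42
'''

df = pd.read_csv(
    StringIO(csv_text),
    thousands=",",       # parse "1,234,567" as the number 1234567
    na_values=[".."],    # treat this placeholder as a missing value
)

print(df)
```

For a real export, replace StringIO(csv_text) with the file path.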

I use HTML Table Exporter for this: it detects tables automatically and exports to CSV, JSON, or Excel with one click. Runs locally, no data uploaded anywhere.

For more details on JSON workflows, see Export Web Tables to JSON for Python & Pandas.

But there are other tools too. The point is: sometimes the fastest path to a DataFrame isn't through Python.

When to Use What

Here's my decision tree:

Use pd.read_html() when:

  • Simple static HTML table
  • Public page, no authentication
  • You need to automate repeated fetches

Use BeautifulSoup when:

  • pd.read_html() fails
  • You need precise control over extraction
  • Table structure is unusual

Use Selenium when:

  • JavaScript renders the table
  • You need to interact with the page first
  • Automation is required

Use browser export when:

  • One-time data grab
  • Complex page that would be painful to scrape
  • You want the data in 30 seconds, not 30 minutes

A Practical Example

Let's say you want World Bank GDP data that's displayed in a table on their site.

The pd.read_html() approach:

import pandas as pd

url = 'https://data.worldbank.org/indicator/NY.GDP.MKTP.CD'
tables = pd.read_html(url)

# Pray that it worked and figure out which table index you need
for i, table in enumerate(tables):
    print(f"Table {i}: {table.shape}")

Sometimes this works. Sometimes the table is loaded via JavaScript and you get nothing useful.
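When it does return something, you can pick the table by the columns you expect rather than eyeballing shapes. A sketch with a hypothetical helper, pick_table. The column names and the sample DataFrames are invented and stand in for whatever pd.read_html() actually returned.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame containing all required columns, else None."""
    required = set(required_columns)
    for table in tables:
        # Column labels may not be strings (e.g. integers when there is no header)
        if required.issubset({str(col) for col in table.columns}):
            return table
    return None


# Stand-ins for the list that pd.read_html() would return
tables = [
    pd.DataFrame({"Menu": ["Home", "About"]}),
    pd.DataFrame({"Country": ["Atlantis"], "2022": [1.5]}),
]

gdp = pick_table(tables, ["Country", "2022"])
print(gdp)
```

Returning None instead of raising lets the caller decide whether a missing table means "try Selenium" or "give up".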

The browser export approach:

  1. Open the page
  2. Click the export button in your extension
  3. Choose CSV
  4. df = pd.read_csv('world_bank_gdp.csv')

Total time: about 30 seconds.

I used to feel like this was "cheating". Real data engineers write scrapers, right? But then I realized that getting the data quickly and accurately is the actual goal. The method is just a means to an end.

The Takeaway

pd.read_html() is underrated for simple cases. Browser-based export is underrated for complex cases. And writing a custom scraper should be your last resort, not your first instinct.

Match the tool to the job. Your time is worth more than proving you can parse HTML in Python.


Learn more about HTML Table Exporter or try it free on the Chrome Web Store. What's your go-to method for grabbing web tables? Let me know in the comments.
