circobit
From Web Table to Pandas DataFrame in 30 Seconds

You found the perfect dataset on a website. Now you need it in Pandas.

The traditional approach:

import pandas as pd

# Hope the website structure is simple
tables = pd.read_html('https://example.com/data')

# Guess which table you want
df = tables[0]  # Maybe? Let's see...

# Discover the problems
print(df.dtypes)
# Everything is 'object' (string)
# Numbers have commas
# Dates are unparseable
# Column names have spaces

# Spend 30 minutes cleaning...

Let me show you a faster way.

The Problem with pd.read_html()

Pandas' read_html() is convenient but limited:

  1. No table selection — It grabs all tables. You guess which index you need.

  2. No cleaning — Numbers like "1,234,567" stay as strings.

  3. CORS issues — Many sites block programmatic access.

  4. JavaScript rendering — Dynamic tables don't exist in the raw HTML.

  5. Authentication — Can't access logged-in content.

For quick scripts, it works. For real analysis, you need something better.

The 30-Second Workflow

Here's what I actually do:

Step 1: Export from Browser (5 seconds)

Using HTML Table Exporter:

  1. Click the table I want
  2. Click the extension icon
  3. Select the "For Pandas" profile
  4. Export the highlighted table as CSV

The extension sees exactly what your browser sees—JavaScript-rendered content, authenticated pages, everything.

Step 2: Load in Pandas (5 seconds)

import pandas as pd

df = pd.read_csv('export.csv')
print(df.dtypes)

That's it. The data is already clean.

What "Clean" Actually Means

When I export with the "For Pandas" profile, the extension handles:

Number Normalization

Before: "1.234.567,89" (European format)
After:  1234567.89     (float)

Before: "$1,234.56"
After:  1234.56

The CSV contains normalized numbers that Pandas parses correctly:

# Without cleaning:
df['revenue'].sum()  # Strings concatenate: '1,234.562,345.67...'

# With cleaning:
df['revenue'].sum()  # 4892341.50 ✓
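If you're stuck with an uncleaned export, this kind of normalization can be approximated by hand. A minimal sketch (the sample values are made up for illustration) that distinguishes European from US number formats by which separator appears last:

```python
import pandas as pd

# Hypothetical fallback: normalize mixed-format number strings
# when no cleaning profile was applied at export time.
raw = pd.Series(["1.234.567,89", "2.000,50", "$1,234.56"])

def to_float(s: str) -> float:
    s = s.strip().lstrip("$€£¥")
    if "," in s and s.rfind(",") > s.rfind("."):
        # European style: '.' thousands separator, ',' decimal
        s = s.replace(".", "").replace(",", ".")
    else:
        # US style: ',' thousands separator
        s = s.replace(",", "")
    return float(s)

clean = raw.map(to_float)
print(clean.tolist())  # [1234567.89, 2000.5, 1234.56]
```

The "separator that appears last is the decimal" heuristic covers most real tables, but ambiguous values like "1,234" (is it 1234 or 1.234?) are exactly why a cleaning profile at export time is less fragile.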

Boolean Conversion

Before: "Yes", "No", "Y", "N", "True", "False"
After:  true, false
# Filter works immediately
active_users = df[df['is_active'] == True]
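The same mapping is easy to reproduce in pandas if needed. A sketch (sample values invented) that folds assorted truthy/falsy strings into pandas' nullable boolean dtype:

```python
import pandas as pd

# Illustrative mapping of mixed yes/no strings to booleans,
# mirroring what the cleaning profile does at export time.
bool_map = {"yes": True, "no": False,
            "y": True, "n": False,
            "true": True, "false": False}

s = pd.Series(["Yes", "N", "TRUE", "no"])
flags = s.str.lower().map(bool_map).astype("boolean")
print(flags.tolist())  # [True, False, True, False]
```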

Null Handling

Before: "-", "N/A", "n/a", "", "null", "—"
After:  (empty, parsed as NaN)
# Null detection works
df['optional_field'].isna().sum()  # Correct count
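For CSVs that still carry placeholder nulls, `read_csv` can be told about them at load time via its `na_values` argument. A small sketch (the CSV text is inlined for illustration):

```python
import io
import pandas as pd

# Hypothetical two-row CSV with a '-' placeholder for a missing value.
csv_text = "player,goals\nSalah,14\nUnknown,-\n"

# na_values extends pandas' built-in list of null markers.
df = pd.read_csv(io.StringIO(csv_text), na_values=["-", "N/A", "n/a", "—"])
print(df["goals"].isna().sum())  # 1
```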

Snake Case Headers

Before: "Revenue ($M)", "User Count", "Growth Rate %"
After:  revenue_m, user_count, growth_rate
# Clean column access
df['revenue_m']  # Instead of df['Revenue ($M)']
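The header transformation itself is a short regex pipeline if you ever need to replicate it. A sketch using the example headers above:

```python
import pandas as pd

# Snake-case messy headers: lowercase, collapse non-alphanumeric
# runs into underscores, trim the edges.
df = pd.DataFrame(columns=["Revenue ($M)", "User Count", "Growth Rate %"])
df.columns = (df.columns
    .str.lower()
    .str.replace(r"[^a-z0-9]+", "_", regex=True)
    .str.strip("_"))
print(df.columns.tolist())  # ['revenue_m', 'user_count', 'growth_rate']
```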

Real Example: FBRef Football Stats

Let's say I want Premier League player statistics from FBRef.

The Old Way

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the right table (there are many)
table = soup.find('table', {'id': 'stats_standard'})

# Parse it manually because the headers are complex
# FBRef uses grouped headers: "Playing Time" spans multiple columns
# This breaks pd.read_html()

# 45 minutes later...

The New Way

  1. Open FBRef in browser
  2. Click extension → select table → export with "For Pandas" profile
  3. Load:
df = pd.read_csv('fbref_stats.csv')
print(df.columns.tolist())
# ['player', 'nation', 'squad', 'playing_time_mp', 
#  'playing_time_starts', 'performance_gls', ...]

The grouped headers ("Playing Time", "Performance") are merged with sub-headers automatically.
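If you do go the `pd.read_html()` route on a page like this, grouped headers come back as a two-level MultiIndex that you have to flatten yourself. A sketch of what that merge amounts to (column tuples invented to mirror FBRef's layout):

```python
import pandas as pd

# Simulate the MultiIndex columns pd.read_html produces for
# grouped headers like "Playing Time" spanning several columns.
cols = pd.MultiIndex.from_tuples([
    ("Playing Time", "MP"),
    ("Playing Time", "Starts"),
    ("Performance", "Gls"),
])
df = pd.DataFrame([[10, 8, 3]], columns=cols)

# Flatten each (group, sub-header) tuple into one snake_case name.
df.columns = ["_".join(level.lower().replace(" ", "_") for level in c)
              for c in df.columns]
print(df.columns.tolist())
# ['playing_time_mp', 'playing_time_starts', 'performance_gls']
```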

The Code I No Longer Write

Here's cleanup code I used to write for every web scrape:

import numpy as np
import pandas as pd


def clean_web_data(df):
    """The function I no longer need."""

    # Fix null placeholders first so they don't block numeric parsing
    null_values = ['', '-', 'N/A', 'n/a', 'null', 'NULL', '—', '–']
    df = df.replace(null_values, np.nan)

    # Fix number columns
    for col in df.select_dtypes(include='object'):
        try:
            # Remove currency symbols
            cleaned = df[col].str.replace(r'[$€£¥]', '', regex=True)
            # Remove thousands separators
            cleaned = cleaned.str.replace(',', '')
            # Raises if the column isn't actually numeric
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            pass

    # Fix boolean columns (only string columns have the .str accessor)
    bool_map = {
        'yes': True, 'no': False,
        'true': True, 'false': False,
        'y': True, 'n': False,
        '1': True, '0': False,
    }
    for col in df.select_dtypes(include='object'):
        lowered = df[col].str.lower()
        if lowered.isin(bool_map.keys()).all():
            df[col] = lowered.map(bool_map)

    # Fix column names
    df.columns = (df.columns
        .str.lower()
        .str.replace(r'[^a-z0-9]+', '_', regex=True)
        .str.strip('_'))

    return df

This function ran on every dataset. Now the export handles it.

When to Use What

  Scenario                  Best approach
  One-time analysis         Browser export → CSV → Pandas
  Repeated scraping         Python script with requests
  JavaScript-heavy sites    Browser export (sees rendered content)
  Authenticated data        Browser export (uses your session)
  API available             Use the API directly
  Simple static table       pd.read_html() is fine

Pro Tip: JSON for Complex Data

For nested or typed data, export as JSON:

import pandas as pd
import json

with open('export.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
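When the exported JSON contains nested objects, `pd.json_normalize` flattens them into dotted column names instead of leaving dicts inside cells. A sketch with made-up records:

```python
import pandas as pd

# Hypothetical nested export: per-player stats as inner objects.
data = [
    {"player": "Salah", "stats": {"goals": 14, "assists": 9}},
    {"player": "Haaland", "stats": {"goals": 18, "assists": 5}},
]

# json_normalize flattens nested keys into 'stats.goals' etc.
df = pd.json_normalize(data)
print(df.columns.tolist())  # ['player', 'stats.goals', 'stats.assists']
```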

The JSON export preserves types:

  • Numbers as numbers (not strings)
  • Booleans as booleans
  • Nulls as null
print(df.dtypes)
# player         object
# goals          int64   # Already numeric!
# is_starter     bool    # Already boolean!
# injury_date    object  # Can parse as datetime
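That last column is a one-liner to upgrade. A sketch (the column name follows the dtypes listing above; the dates are invented):

```python
import pandas as pd

# Dates arrive as strings; pd.to_datetime converts the column in place.
df = pd.DataFrame({"injury_date": ["2024-01-15", "2024-03-02"]})
df["injury_date"] = pd.to_datetime(df["injury_date"])
print(df["injury_date"].dtype)  # datetime64[ns]
```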

For more details on JSON workflows, see Export Web Tables to JSON for Python & Pandas.

The Workflow Summary

Old workflow (30+ minutes):

  1. Write scraping script
  2. Handle CORS/auth issues
  3. Parse complex HTML
  4. Clean numbers
  5. Clean booleans
  6. Clean nulls
  7. Fix column names
  8. Debug edge cases
  9. Finally: analyze

New workflow (30 seconds):

  1. Click extension
  2. Export with cleaning profile
  3. pd.read_csv()
  4. Analyze

Try It

  1. Install HTML Table Exporter
  2. Find a table you want to analyze
  3. Export with cleaning presets
  4. Load in Pandas

The free version handles basic exports. PRO adds the cleaning presets and profiles for Pandas-optimized output.

Learn more at gauchogrid.com/html-table-exporter or try it on the Chrome Web Store.


What's your current workflow for getting web data into Pandas? I'm curious how much time you spend on the cleaning step. Drop a comment below.
