You found the perfect dataset on a website. Now you need it in Pandas.
The traditional approach:
```python
import pandas as pd

# Hope the website structure is simple
tables = pd.read_html('https://example.com/data')

# Guess which table you want
df = tables[0]  # Maybe? Let's see...

# Discover the problems
print(df.dtypes)
# Everything is 'object' (string)
# Numbers have commas
# Dates are unparseable
# Column names have spaces

# Spend 30 minutes cleaning...
```
Let me show you a faster way.
The Problem with pd.read_html()
Pandas' read_html() is convenient but limited:
- No table selection — it grabs every table on the page; you guess which index you need.
- No cleaning — numbers like "1,234,567" stay as strings.
- CORS issues — many sites block programmatic access.
- JavaScript rendering — dynamic tables don't exist in the raw HTML.
- Authentication — it can't access logged-in content.
For quick scripts, it works. For real analysis, you need something better.
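To make the "everything is a string" problem concrete, here's a minimal sketch using a hand-built Series standing in for a column as read_html typically returns it:

```python
import pandas as pd

# Stand-in for a scraped column: display strings, not numbers
raw = pd.Series(["1,234,567", "2,000", "N/A"])
print(raw.dtype)  # object: every value is a string

# A naive numeric conversion chokes on separators and placeholders
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().all())  # True: all three values are lost
```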
The 30-Second Workflow
Here's what I actually do:
Step 1: Export from Browser (5 seconds)
Using HTML Table Exporter:
- Click the table I want
- Click the extension icon
- Select "For Pandas" profile
- Export the highlighted table as CSV
The extension sees exactly what your browser sees—JavaScript-rendered content, authenticated pages, everything.
Step 2: Load in Pandas (5 seconds)
```python
import pandas as pd

df = pd.read_csv('export.csv')
print(df.dtypes)
```
That's it. The data is already clean.
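The one thing a CSV can't encode is the datetime dtype, and read_csv handles that at load time with parse_dates. A minimal sketch with hypothetical columns (an inline string stands in for export.csv):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for export.csv; 'signed_on' is a hypothetical date column
csv_text = "player,signed_on,fee_m\nKane,2023-08-12,100.0\nSaka,2019-07-01,0.0\n"

df = pd.read_csv(StringIO(csv_text), parse_dates=["signed_on"])
print(df.dtypes)  # signed_on parses as datetime64, fee_m as float64
```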
What "Clean" Actually Means
When I export with the "For Pandas" profile, the extension handles:
Number Normalization
Before: "1.234.567,89" (European format)
After: 1234567.89 (float)
Before: "$1,234.56"
After: 1234.56
The CSV contains normalized numbers that Pandas parses correctly:
```python
# Without cleaning:
df['revenue'].sum()  # TypeError: can only concatenate str

# With cleaning:
df['revenue'].sum()  # 4892341.50 ✓
```
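For illustration, the normalization amounts to roughly this (a sketch of assumed logic, not the extension's actual code):

```python
def normalize_number(text: str) -> float:
    """Rough sketch: strip currency symbols, detect the decimal convention."""
    s = text.strip().lstrip("$€£¥")
    if "," in s and "." in s and s.rfind(",") > s.rfind("."):
        # European style: "." is the thousands separator, "," the decimal
        s = s.replace(".", "").replace(",", ".")
    else:
        # US style: "," is the thousands separator
        s = s.replace(",", "")
    return float(s)

print(normalize_number("1.234.567,89"))  # 1234567.89
print(normalize_number("$1,234.56"))     # 1234.56
```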
Boolean Conversion
Before: "Yes", "No", "Y", "N", "True", "False"
After: true, false
```python
# Filter works immediately
active_users = df[df['is_active'] == True]
```
Null Handling
Before: "-", "N/A", "n/a", "", "null", "—"
After: (empty, parsed as NaN)
```python
# Null detection works
df['optional_field'].isna().sum()  # Correct count
```
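If you ever load a CSV that wasn't pre-cleaned, read_csv can approximate the same behavior via na_values (placeholder list taken from above; the column names here are made up):

```python
import pandas as pd
from io import StringIO

raw_csv = "name,score\nalice,10\nbob,-\ncarol,N/A\n"

# Extra strings to treat as NaN, on top of pandas' defaults
df = pd.read_csv(StringIO(raw_csv), na_values=["-", "N/A", "n/a", "null", "—"])
print(df["score"].isna().sum())  # 2
```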
Snake Case Headers
Before: "Revenue ($M)", "User Count", "Growth Rate %"
After: revenue_m, user_count, growth_rate
```python
# Clean column access
df['revenue_m']  # Instead of df['Revenue ($M)']
```
Real Example: FBRef Football Stats
Let's say I want Premier League player statistics from FBRef.
The Old Way
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the right table (there are many)
table = soup.find('table', {'id': 'stats_standard'})

# Parse it manually because the headers are complex.
# FBRef uses grouped headers: "Playing Time" spans multiple columns,
# which breaks pd.read_html().

# 45 minutes later...
```
The New Way
- Open FBRef in browser
- Click extension → select table → export with "For Pandas" profile
- Load:
```python
import pandas as pd

df = pd.read_csv('fbref_stats.csv')
print(df.columns.tolist())
# ['player', 'nation', 'squad', 'playing_time_mp',
#  'playing_time_starts', 'performance_gls', ...]
```
The grouped headers ("Playing Time", "Performance") are merged with sub-headers automatically.
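Once the headers are flat, derived stats are one-liners. A sketch with made-up numbers, assuming the merged column names shown above:

```python
import pandas as pd

# Illustrative rows; column names assumed to match the export above
df = pd.DataFrame({
    "player": ["Haaland", "Salah"],
    "playing_time_mp": [31, 32],
    "performance_gls": [27, 18],
})

# Goals per match from the merged "Playing Time" / "Performance" columns
df["gls_per_match"] = df["performance_gls"] / df["playing_time_mp"]
print(df.sort_values("gls_per_match", ascending=False))
```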
The Code I No Longer Write
Here's cleanup code I used to write for every web scrape:
```python
import numpy as np
import pandas as pd

def clean_web_data(df):
    """The function I no longer need."""
    # Fix number columns
    for col in df.select_dtypes(include='object'):
        # Remove currency symbols
        cleaned = df[col].str.replace(r'[$€£¥]', '', regex=True)
        # Remove thousands separators
        cleaned = cleaned.str.replace(',', '')
        # Convert; leave the column alone if it isn't numeric
        try:
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            pass

    # Fix boolean columns
    bool_map = {
        'yes': True, 'no': False,
        'true': True, 'false': False,
        'y': True, 'n': False,
        '1': True, '0': False,
    }
    for col in df.select_dtypes(include='object'):
        lowered = df[col].str.lower()
        if lowered.isin(bool_map).all():
            df[col] = lowered.map(bool_map)

    # Fix null values
    null_values = ['', '-', 'N/A', 'n/a', 'null', 'NULL', '—', '–']
    df = df.replace(null_values, np.nan)

    # Fix column names
    df.columns = (df.columns
                  .str.lower()
                  .str.replace(r'[^a-z0-9]+', '_', regex=True)
                  .str.strip('_'))
    return df
```
This function ran on every dataset. Now the export handles it.
When to Use What
| Scenario | Best Approach |
|---|---|
| One-time analysis | Browser export → CSV → Pandas |
| Repeated scraping | Python script with requests |
| JavaScript-heavy sites | Browser export (sees rendered content) |
| Authenticated data | Browser export (uses your session) |
| API available | Use the API directly |
| Simple static table | pd.read_html() is fine |
Pro Tip: JSON for Complex Data
For nested or typed data, export as JSON:
```python
import json

import pandas as pd

with open('export.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
```
The JSON export preserves types:
- Numbers as numbers (not strings)
- Booleans as booleans
- Nulls as null
```python
print(df.dtypes)
# player         object
# goals           int64    # Already numeric!
# is_starter       bool    # Already boolean!
# injury_date    object    # Can parse as datetime
```
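If the export contains nested objects, pd.DataFrame keeps them as dicts in a single column; pd.json_normalize flattens them into their own columns instead. A sketch with hypothetical records:

```python
import pandas as pd

# Hypothetical nested records, as a JSON export of a richer table might look
data = [
    {"player": "Saka", "stats": {"goals": 14, "assists": 9}},
    {"player": "Rice", "stats": {"goals": 7, "assists": 2}},
]

# Nested dicts become dotted column names: stats.goals, stats.assists
df = pd.json_normalize(data)
print(df.columns.tolist())  # ['player', 'stats.goals', 'stats.assists']
```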
For more details on JSON workflows, see Export Web Tables to JSON for Python & Pandas.
The Workflow Summary
Old workflow (30+ minutes):
- Write scraping script
- Handle CORS/auth issues
- Parse complex HTML
- Clean numbers
- Clean booleans
- Clean nulls
- Fix column names
- Debug edge cases
- Finally: analyze
New workflow (30 seconds):
- Click extension
- Export with cleaning profile
- pd.read_csv()
- Analyze
Try It
- Install HTML Table Exporter
- Find a table you want to analyze
- Export with cleaning presets
- Load in Pandas
The free version handles basic exports. PRO adds the cleaning presets and profiles for Pandas-optimized output.
Learn more at gauchogrid.com/html-table-exporter or try it on the Chrome Web Store.
What's your current workflow for getting web data into Pandas? I'm curious how much time you spend on the cleaning step. Drop a comment below.