You found the perfect dataset on a website. Now you need it in Pandas.
The traditional approach:
```python
import pandas as pd

# Hope the website structure is simple
tables = pd.read_html('https://example.com/data')

# Guess which table you want
df = tables[0]  # Maybe? Let's see...

# Discover the problems
print(df.dtypes)
# Everything is 'object' (string)
# Numbers have commas
# Dates are unparseable
# Column names have spaces

# Spend 30 minutes cleaning...
```
Let me show you a faster way.
The Problem with pd.read_html()
Pandas' read_html() is convenient but limited:
- No table selection — it grabs every table on the page; you guess which index you need.
- No cleaning — numbers like "1,234,567" stay as strings.
- CORS issues — many sites block programmatic access.
- JavaScript rendering — dynamic tables don't exist in the raw HTML.
- Authentication — it can't access logged-in content.
For quick scripts, it works. For real analysis, you need something better.
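To make the "everything is a string" problem concrete, here's a minimal sketch using a hand-built Series standing in for a column as read_html typically returns it:

```python
import pandas as pd

# Stand-in for a scraped column: display strings, not numbers
raw = pd.Series(["1,234,567", "2,000", "N/A"])
print(raw.dtype)  # object: every value is a string

# A naive numeric conversion chokes on separators and placeholders
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric.isna().all())  # True: all three values are lost
```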
The 30-Second Workflow
Here's what I actually do:
Step 1: Export from Browser (5 seconds)
Using HTML Table Exporter:
- Click the table I want
- Click the extension icon
- Select "For Pandas" profile
- Export the highlighted table as CSV
The extension sees exactly what your browser sees—JavaScript-rendered content, authenticated pages, everything.
Step 2: Load in Pandas (5 seconds)
```python
import pandas as pd

df = pd.read_csv('export.csv')
print(df.dtypes)
```
That's it. The data is already clean.
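The one thing a CSV can't encode is the datetime dtype, and read_csv handles that at load time with parse_dates. A minimal sketch with hypothetical columns (an inline string stands in for export.csv):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for export.csv; 'signed_on' is a hypothetical date column
csv_text = "player,signed_on,fee_m\nKane,2023-08-12,100.0\nSaka,2019-07-01,0.0\n"

df = pd.read_csv(StringIO(csv_text), parse_dates=["signed_on"])
print(df.dtypes)  # signed_on parses as datetime64, fee_m as float64
```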
What "Clean" Actually Means
When I export with the "For Pandas" profile, the extension handles:
Number Normalization
Before: "1.234.567,89" (European format)
After: 1234567.89 (float)
Before: "$1,234.56"
After: 1234.56
The CSV contains normalized numbers that Pandas parses correctly:
```python
# Without cleaning:
df['revenue'].sum()  # TypeError: can only concatenate str

# With cleaning:
df['revenue'].sum()  # 4892341.50 ✓
```
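For illustration, the normalization amounts to roughly this (a sketch of assumed logic, not the extension's actual code):

```python
def normalize_number(text: str) -> float:
    """Rough sketch: strip currency symbols, detect the decimal convention."""
    s = text.strip().lstrip("$€£¥")
    if "," in s and "." in s and s.rfind(",") > s.rfind("."):
        # European style: "." is the thousands separator, "," the decimal
        s = s.replace(".", "").replace(",", ".")
    else:
        # US style: "," is the thousands separator
        s = s.replace(",", "")
    return float(s)

print(normalize_number("1.234.567,89"))  # 1234567.89
print(normalize_number("$1,234.56"))     # 1234.56
```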
Boolean Conversion
Before: "Yes", "No", "Y", "N", "True", "False"
After: true, false
```python
# Filter works immediately
active_users = df[df['is_active'] == True]
```
Null Handling
Before: "-", "N/A", "n/a", "", "null", "—"
After: (empty, parsed as NaN)
```python
# Null detection works
df['optional_field'].isna().sum()  # Correct count
```
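If you ever load a CSV that wasn't pre-cleaned, read_csv can approximate the same behavior via na_values (placeholder list taken from above; the column names here are made up):

```python
import pandas as pd
from io import StringIO

raw_csv = "name,score\nalice,10\nbob,-\ncarol,N/A\n"

# Extra strings to treat as NaN, on top of pandas' defaults
df = pd.read_csv(StringIO(raw_csv), na_values=["-", "N/A", "n/a", "null", "—"])
print(df["score"].isna().sum())  # 2
```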
Snake Case Headers
Before: "Revenue ($M)", "User Count", "Growth Rate %"
After: revenue_m, user_count, growth_rate
```python
# Clean column access
df['revenue_m']  # Instead of df['Revenue ($M)']
```
Real Example: FBRef Football Stats
Let's say I want Premier League player statistics from FBRef.
The Old Way
```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://fbref.com/en/comps/9/stats/Premier-League-Stats'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the right table (there are many)
table = soup.find('table', {'id': 'stats_standard'})

# Parse it manually because the headers are complex.
# FBRef uses grouped headers: "Playing Time" spans multiple columns,
# which breaks pd.read_html().

# 45 minutes later...
```
The New Way
- Open FBRef in browser
- Click extension → select table → export with "For Pandas" profile
- Load:
```python
import pandas as pd

df = pd.read_csv('fbref_stats.csv')
print(df.columns.tolist())
# ['player', 'nation', 'squad', 'playing_time_mp',
#  'playing_time_starts', 'performance_gls', ...]
```
The grouped headers ("Playing Time", "Performance") are merged with sub-headers automatically.
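Once the headers are flat, derived stats are one-liners. A sketch with made-up numbers, assuming the merged column names shown above:

```python
import pandas as pd

# Illustrative rows; column names assumed to match the export above
df = pd.DataFrame({
    "player": ["Haaland", "Salah"],
    "playing_time_mp": [31, 32],
    "performance_gls": [27, 18],
})

# Goals per match from the merged "Playing Time" / "Performance" columns
df["gls_per_match"] = df["performance_gls"] / df["playing_time_mp"]
print(df.sort_values("gls_per_match", ascending=False))
```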
The Code I No Longer Write
Here's cleanup code I used to write for every web scrape:
```python
import numpy as np
import pandas as pd

def clean_web_data(df):
    """The function I no longer need."""
    # Fix number columns
    for col in df.select_dtypes(include='object'):
        # Remove currency symbols
        cleaned = df[col].str.replace(r'[$€£¥]', '', regex=True)
        # Remove thousands separators
        cleaned = cleaned.str.replace(',', '')
        # Convert; leave the column alone if it isn't numeric
        try:
            df[col] = pd.to_numeric(cleaned)
        except (ValueError, TypeError):
            pass

    # Fix boolean columns
    bool_map = {
        'yes': True, 'no': False,
        'true': True, 'false': False,
        'y': True, 'n': False,
        '1': True, '0': False,
    }
    for col in df.select_dtypes(include='object'):
        lowered = df[col].str.lower()
        if lowered.isin(bool_map).all():
            df[col] = lowered.map(bool_map)

    # Fix null values
    null_values = ['', '-', 'N/A', 'n/a', 'null', 'NULL', '—', '–']
    df = df.replace(null_values, np.nan)

    # Fix column names
    df.columns = (df.columns
                  .str.lower()
                  .str.replace(r'[^a-z0-9]+', '_', regex=True)
                  .str.strip('_'))
    return df
```
This function ran on every dataset. Now the export handles it.
When to Use What
| Scenario | Best Approach |
|---|---|
| One-time analysis | Browser export → CSV → Pandas |
| Repeated scraping | Python script with requests |
| JavaScript-heavy sites | Browser export (sees rendered content) |
| Authenticated data | Browser export (uses your session) |
| API available | Use the API directly |
| Simple static table | pd.read_html() is fine |
Pro Tip: JSON for Complex Data
For nested or typed data, export as JSON:
```python
import json

import pandas as pd

with open('export.json') as f:
    data = json.load(f)

df = pd.DataFrame(data)
```
The JSON export preserves types:
- Numbers as numbers (not strings)
- Booleans as booleans
- Nulls as null
```python
print(df.dtypes)
# player         object
# goals           int64    # Already numeric!
# is_starter       bool    # Already boolean!
# injury_date    object    # Can parse as datetime
```
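If the export contains nested objects, pd.DataFrame keeps them as dicts in a single column; pd.json_normalize flattens them into their own columns instead. A sketch with hypothetical records:

```python
import pandas as pd

# Hypothetical nested records, as a JSON export of a richer table might look
data = [
    {"player": "Saka", "stats": {"goals": 14, "assists": 9}},
    {"player": "Rice", "stats": {"goals": 7, "assists": 2}},
]

# Nested dicts become dotted column names: stats.goals, stats.assists
df = pd.json_normalize(data)
print(df.columns.tolist())  # ['player', 'stats.goals', 'stats.assists']
```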
For more details on JSON workflows, see Export Web Tables to JSON for Python & Pandas.
The Workflow Summary
Old workflow (30+ minutes):
- Write scraping script
- Handle CORS/auth issues
- Parse complex HTML
- Clean numbers
- Clean booleans
- Clean nulls
- Fix column names
- Debug edge cases
- Finally: analyze
New workflow (30 seconds):
- Click extension
- Export with cleaning profile
- pd.read_csv()
- Analyze
Try It
- Install HTML Table Exporter
- Find a table you want to analyze
- Export with cleaning presets
- Load in Pandas
The free version handles basic exports. PRO adds the cleaning presets and profiles for Pandas-optimized output.
Learn more at gauchogrid.com/html-table-exporter or try it on the Chrome Web Store.
What's your current workflow for getting web data into Pandas? I'm curious how much time you spend on the cleaning step. Drop a comment below.