DEV Community

Cover image for Extracting Financial Tables: Stocks, ETFs, and Market Data
circobit
circobit

Posted on

Extracting Financial Tables: Stocks, ETFs, and Market Data

Financial websites are goldmines of tabular data. Stock prices, ETF holdings, market indices, earnings reports—all sitting in HTML tables, waiting to be analyzed.

But getting that data into a usable format isn't always straightforward.

This guide covers the common challenges with financial tables and practical solutions for extracting them cleanly.

Why Financial Tables Are Tricky

Financial data has quirks that break naive extraction methods.

Mixed Number Formats

US sites show 1,234.56. European sites show 1.234,56. Some financial portals mix both depending on the data source.

Copy-paste into Excel often mangles these. A thousand becomes a decimal. A date becomes a number. Chaos.

Real-Time Updates

Many financial tables update dynamically. The DOM changes after page load. If you scrape too early, you get stale data or empty cells.

Nested Structures

Holdings tables often have expandable rows. A fund's top holdings might show 10 rows, with a "Show all 50" button hiding the rest. The hidden data exists in the page but isn't visible.

Percentage and Currency Symbols

+2.34% and $1,234 look fine to humans. To a spreadsheet, they're text strings that won't sort or calculate properly.

Common Financial Data Sources

Yahoo Finance

Stock quotes, historical data, portfolio holdings. Tables are relatively clean but update dynamically.

Tip: Wait for the page to fully load before extracting. The initial render often shows loading placeholders.

Google Finance

Minimal tables, mostly cards and charts. Less useful for bulk extraction.

For a detailed walkthrough, see our guide on exporting Google Finance tables to Excel.

Morningstar

Fund holdings, performance data, sector breakdowns. Tables are well-structured but often paginated.

Tip: Look for "Show All" or pagination controls. The first page might only show top 10 holdings.

SEC EDGAR

Filings contain tables, but they're embedded in complex documents. 10-K and 10-Q reports have financial statements as HTML tables.

Tip: The tables are there, but surrounded by legal text. Identify the specific table you need before extracting.

MarketWatch, Investing.com, Reuters

News-focused but include data tables for quotes, earnings, and economic calendars.

Extraction Strategies

Strategy 1: Direct Export

The simplest approach—if you just need the data once.

Tools like HTML Table Exporter detect tables on the page and export directly to CSV or Excel. No code required.

For financial data specifically:

  • Use cleaning presets that normalize number formats
  • Export to CSV if you'll process further; Excel if it's the final destination
  • Check that dynamic content has loaded before exporting

Strategy 2: Python + Pandas

For repeated extraction or integration into pipelines.

import pandas as pd

# Basic extraction
tables = pd.read_html('https://example.com/stock-data')
df = tables[0]  # First table on page

# With header specification
df = pd.read_html(url, header=0)[0]

# Handling encoding
df = pd.read_html(url, encoding='utf-8')[0]
Enter fullscreen mode Exit fullscreen mode

Limitations: read_html doesn't execute JavaScript. Dynamic tables won't load.

Strategy 3: Selenium for Dynamic Content

When tables load via JavaScript:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

driver = webdriver.Chrome()
driver.get('https://example.com/stock-data')

# Wait for table to load
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "table"))
)

# Extract
html = driver.page_source
tables = pd.read_html(html)
Enter fullscreen mode Exit fullscreen mode

Trade-off: More setup, but handles JavaScript-rendered content.

Cleaning Financial Data

Raw exports usually need cleanup.

Removing Currency Symbols

df['Price'] = df['Price'].replace('[\$,]', '', regex=True).astype(float)
Enter fullscreen mode Exit fullscreen mode

Handling Percentages

df['Change'] = df['Change'].str.rstrip('%').astype(float) / 100
Enter fullscreen mode Exit fullscreen mode

Normalizing Number Formats

For European format (1.234,56 → 1234.56):

df['Value'] = df['Value'].str.replace('.', '', regex=False)
df['Value'] = df['Value'].str.replace(',', '.', regex=False).astype(float)
Enter fullscreen mode Exit fullscreen mode

Parsing Dates

Financial sites use inconsistent date formats.

df['Date'] = pd.to_datetime(df['Date'], format='mixed')
Enter fullscreen mode Exit fullscreen mode

Practical Example: ETF Holdings

Let's say you want a fund's holdings for analysis.

The manual way:

  1. Visit the fund page
  2. Find the holdings table
  3. Copy-paste into Excel
  4. Spend 20 minutes fixing formatting

The efficient way:

  1. Navigate to the holdings table
  2. Export directly to Excel with number normalization
  3. Done

The difference compounds. If you track multiple funds monthly, automation saves hours.

Edge Cases to Watch

Tables Split Across Tabs

Some sites show different data (Holdings, Performance, Risk) in tabs that load different tables. Each tab is a separate extraction.

Footnotes and Annotations

Financial tables love asterisks and daggers. 1,234* might mean "estimated" but breaks numeric parsing.

Solution: Strip non-numeric suffixes before conversion.

Header Rows Repeated

Long tables sometimes repeat headers mid-table for readability. Your extraction might include duplicate header rows as data.

Solution: Filter rows where all values match the header.

When to Automate vs. When to Export Manually

Manual export works when:

  • One-time analysis
  • Irregular schedule
  • Different tables each time

Automation works when:

  • Daily/weekly recurring pulls
  • Same tables, same structure
  • Integration into larger pipelines

For most analysts doing occasional research, manual export with a good tool is faster than building and maintaining scripts.

Summary

Financial tables are valuable but messy. The key challenges are number formats, dynamic loading, and inconsistent structure.

For quick extractions, browser-based tools handle most cases. For pipelines, Python with appropriate waits and cleaning handles the rest.

The goal isn't perfect automation—it's getting clean data with minimal friction.


Need to export financial tables quickly? Learn more at gauchogrid.com/html-table-exporter or try HTML Table Exporter free on the Chrome Web Store.

Top comments (0)