circobit

Posted on May 20

HTML Tables with Hidden Data: Scraping What You Can't See

#webdev #javascript #tutorial

The table shows 10 columns. You export it. The CSV has 10 columns.

But the page has 15 columns of data. Where did the other 5 go?

HTML tables often contain more data than what's visible. Hidden columns, data attributes, collapsed rows—all invisible to basic extraction methods.

Here's how to find and extract the data you can't see.

Types of Hidden Data

1. CSS-Hidden Columns

The simplest case: columns exist in the DOM but are hidden with CSS.

<th style="display: none;">Internal ID</th>
<td style="display: none;">12345</td>

Or via classes:

.hidden-column { display: none; }

Why sites do this: Mobile responsiveness (hide columns on small screens), internal data for JavaScript, progressive disclosure.

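How to detect: Open DevTools, inspect the table, and look for cells whose styles include display: none or visibility: hidden. Check the Computed pane as well as the inline style attribute, since the rule may come from a class.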
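If you have the page source saved, you can roughly approximate the same check outside the browser. The sketch below uses only Python's standard-library html.parser and flags cells hidden by an inline style. It does not evaluate stylesheets, so class-based hiding such as .hidden-column is missed. The names HiddenCellFinder and find_hidden_cells are made up for this example.

```python
from html.parser import HTMLParser
import re

# Matches inline declarations such as "display: none" or "visibility:hidden"
HIDDEN_STYLE = re.compile(r"(display\s*:\s*none|visibility\s*:\s*hidden)", re.I)


class HiddenCellFinder(HTMLParser):
    """Collects the text of <td>/<th> cells hidden by an inline style."""

    def __init__(self):
        super().__init__()
        self.hidden_cells = []   # text of each hidden cell, in document order
        self._capturing = False  # True while inside a hidden cell
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            style = dict(attrs).get("style") or ""
            self._capturing = bool(HIDDEN_STYLE.search(style))
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._capturing:
            self.hidden_cells.append("".join(self._buffer).strip())
            self._capturing = False


def find_hidden_cells(html):
    """Return the text of every cell whose inline style hides it."""
    finder = HiddenCellFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_cells


sample = """
<table>
  <tr><th>Name</th><th style="display: none;">Internal ID</th></tr>
  <tr><td>Laptop</td><td style="display: none;">12345</td></tr>
</table>
"""
print(find_hidden_cells(sample))  # ['Internal ID', '12345']
```

For class-based hiding you still need the browser, or a tool that applies the page's CSS.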

2. Data Attributes

HTML5 allows custom data-* attributes on any element. Tables use these to store metadata that JavaScript accesses but users don't see.

<tr data-id="12345" data-category="electronics" data-stock="47">
  <td>Laptop</td>
  <td>$999</td>
</tr>

The visible table shows "Laptop" and "$999". But the row carries three extra data points.

Common data attributes:

data-id — internal identifier
data-sort-value — numeric value for sorting (when display shows "Jan 2024" but sort needs 202401)
data-raw — unformatted value (when display shows "$1.2M" but data is 1200000)
data-href — link URL

3. Title and Tooltip Attributes

The title attribute provides hover text that often contains additional information.

<td title="Updated: 2024-01-15 14:32:00 UTC">Jan 15</td>

The cell displays "Jan 15". The full timestamp is in the tooltip.
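Once you have the title string, it usually needs parsing before it is useful. Here is a minimal sketch that assumes the tooltip always follows the "Updated: YYYY-MM-DD HH:MM:SS UTC" pattern from the example. Real sites vary, so treat the format string as an assumption. The function name parse_title_timestamp is hypothetical.

```python
from datetime import datetime, timezone


def parse_title_timestamp(title):
    """Parse a tooltip like 'Updated: 2024-01-15 14:32:00 UTC' into an aware datetime."""
    # Drop the assumed "Updated:" label and the trailing "UTC" marker
    raw = title.removeprefix("Updated:").strip().removesuffix("UTC").strip()
    return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


ts = parse_title_timestamp("Updated: 2024-01-15 14:32:00 UTC")
print(ts.isoformat())  # 2024-01-15T14:32:00+00:00
```

str.removeprefix and str.removesuffix require Python 3.9 or newer.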

4. Collapsed/Expandable Rows

Tables with drill-down functionality hide detail rows until clicked.

<tr class="parent-row">
  <td>Category A</td>
  <td>$50,000</td>
</tr>
<tr class="child-row" style="display: none;">
  <td>— Subcategory A1</td>
  <td>$30,000</td>
</tr>
<tr class="child-row" style="display: none;">
  <td>— Subcategory A2</td>
  <td>$20,000</td>
</tr>

The data exists. It's just not expanded.
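Because the child rows are already in the markup, a parser that ignores CSS can read them and rebuild the hierarchy. The sketch below uses the standard-library html.parser and assumes the parent-row and child-row class names from the example above. Other sites will use different names. RowCollector and group_children are illustrative names.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collects (row_class, [cell texts]) for every <tr>, visible or not."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append((dict(attrs).get("class") or "", []))
        elif tag in ("td", "th") and self.rows:
            self.rows[-1][1].append("")  # start a new cell in the current row
            self._in_cell = True

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][1][-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False


def group_children(html):
    """Nest each child-row under the closest preceding non-child row."""
    collector = RowCollector()
    collector.feed(html)
    collector.close()
    groups = []
    for row_class, cells in collector.rows:
        cells = [c.strip() for c in cells]
        if "child-row" in row_class.split() and groups:
            groups[-1]["children"].append(cells)
        else:
            groups.append({"cells": cells, "children": []})
    return groups


sample = """
<table>
<tr class="parent-row"><td>Category A</td><td>$50,000</td></tr>
<tr class="child-row" style="display: none;"><td>Subcategory A1</td><td>$30,000</td></tr>
<tr class="child-row" style="display: none;"><td>Subcategory A2</td><td>$20,000</td></tr>
</table>
"""
for group in group_children(sample):
    print(group["cells"], "->", group["children"])
```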

5. Lazy-Loaded Content

Some tables show a subset and load more via JavaScript when you scroll or click "Load More."

<table id="results">
  <!-- First 20 rows rendered -->
</table>
<button onclick="loadMore()">Show More Results</button>

The remaining data isn't in the DOM until you trigger loading.

Extracting Hidden Data

Extracting CSS-Hidden Columns

With JavaScript (in browser console):

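// Force-show cells hidden by an inline style OR by a CSS class.
// Clearing el.style.display alone would not override a rule like .hidden-column.
document.querySelectorAll('td, th').forEach(el => {
  if (getComputedStyle(el).display === 'none') {
    el.style.setProperty('display', 'table-cell', 'important');
  }
});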

Now export normally—the columns are visible.

With Python + BeautifulSoup:

BeautifulSoup ignores CSS. It extracts all elements regardless of visibility.

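from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string (e.g. read from a saved file)
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table')

for row in table.find_all('tr'):
    # find_all returns every cell in the markup, whether or not CSS hides it
    cells = [cell.get_text(strip=True) for cell in row.find_all(['td', 'th'])]
    print(cells)  # Includes hidden columns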

Extracting Data Attributes

JavaScript approach:

const rows = document.querySelectorAll('table tr');
const data = [];

rows.forEach(row => {
  const rowData = {
    // Visible text
    cells: Array.from(row.querySelectorAll('td')).map(td => td.textContent),
    // Data attributes
    id: row.dataset.id,
    category: row.dataset.category,
    stock: row.dataset.stock
  };
  data.push(rowData);
});

console.log(JSON.stringify(data, null, 2));

Python approach:

for row in table.find_all('tr'):
    # Get data attributes
    attrs = {k: v for k, v in row.attrs.items() if k.startswith('data-')}

    # Get cell text
    cells = [td.get_text(strip=True) for td in row.find_all('td')]

    print(f"Data: {attrs}, Cells: {cells}")
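The loop above only prints each row. To export instead, one option is to flatten the data attributes into extra CSV columns next to the visible cells. The sketch below takes (attrs, cells) pairs shaped like the ones the loop produces. The rows_to_csv helper and the Product and Price headers are assumptions made for illustration.

```python
import csv
import io


def rows_to_csv(rows, headers):
    """rows: list of (attrs, cells) pairs; headers: names for the visible columns."""
    # Collect every data-* key seen on any row, in first-seen order
    attr_keys = []
    for attrs, _ in rows:
        for key in attrs:
            if key not in attr_keys:
                attr_keys.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=headers + attr_keys, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for attrs, cells in rows:
        record = dict(zip(headers, cells))  # visible columns
        record.update(attrs)                # hidden data-* columns
        writer.writerow(record)
    return out.getvalue()


rows = [
    (
        {"data-id": "12345", "data-category": "electronics", "data-stock": "47"},
        ["Laptop", "$999"],
    ),
]
print(rows_to_csv(rows, ["Product", "Price"]))
# Product,Price,data-id,data-category,data-stock
# Laptop,$999,12345,electronics,47
```

Rows that lack a given attribute get an empty cell in that column (restval="").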

Extracting Title Attributes

const cells = document.querySelectorAll('table td');
cells.forEach(cell => {
  if (cell.title) {
    console.log(`Display: ${cell.textContent}, Full: ${cell.title}`);
  }
});

Expanding Collapsed Rows

Option 1: Click all expanders

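// Find and click all expand buttons.
// These selectors are examples; check the site's actual class names in DevTools.
document.querySelectorAll('.expand-btn, .toggle-row').forEach(btn => {
  btn.click();
});

// Wait for the DOM to update, then export.
// 1000 ms is a guess; use a longer delay if rows load over the network.
setTimeout(() => {
  // Export logic here
}, 1000);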

Option 2: Remove hidden class

document.querySelectorAll('.child-row, .detail-row').forEach(row => {
  row.style.display = '';
  row.classList.remove('hidden', 'collapsed');
});

Triggering Lazy Load

This is trickier—you need to simulate the action that triggers loading.

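// Scroll to bottom to trigger infinite scroll
window.scrollTo(0, document.body.scrollHeight);

// Or click "Load More" until the button is disabled or removed from the page.
// Re-query on every pass: a stale reference to a removed button would loop forever.
// Top-level await works in the DevTools console; elsewhere, wrap this in an async function.
let loadMore;
while ((loadMore = document.querySelector('.load-more-btn')) && !loadMore.disabled) {
  loadMore.click();
  await new Promise(r => setTimeout(r, 500)); // Wait for load
}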

Real-World Example: Sort Values

Sports statistics tables often display formatted numbers but sort by raw values.

<td data-sort="1234567">1.23M</td>
<td data-sort="20240115">Jan 15, 2024</td>
<td data-sort="0.347">34.7%</td>

If you only extract the visible text, you get formatted strings that won't sort or calculate correctly.

Extraction with both values:

const cells = document.querySelectorAll('table td');
const data = Array.from(cells).map(cell => ({
  display: cell.textContent.trim(),
  sortValue: cell.dataset.sort || cell.textContent.trim()
}));
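To see why the raw value matters, compare sorting by display text with sorting by the sort value. This is a small Python sketch. Only the first record comes from the example above. The other two are made up for illustration.

```python
cells = [
    {"display": "1.23M", "sort": "1234567"},
    {"display": "987K", "sort": "987000"},     # illustrative value
    {"display": "12.5M", "sort": "12500000"},  # illustrative value
]

# Sorting the formatted strings compares them character by character
by_text = sorted(cells, key=lambda c: c["display"])
# Sorting the raw values compares them numerically
by_value = sorted(cells, key=lambda c: float(c["sort"]))

print([c["display"] for c in by_text])   # ['1.23M', '12.5M', '987K']
print([c["display"] for c in by_value])  # ['987K', '1.23M', '12.5M']
```

The text sort puts 987K last even though it is the smallest number, which is the kind of error you get when only the visible text is exported.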

When Browser Extensions Help

Good table export tools handle some of this automatically:

CSS-hidden columns can be included via option
Data attributes can be extracted as additional columns
Number normalization can parse "$1.23M" into 1230000
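If your tool does not normalize numbers, a small post-processing step can do it. The following is a sketch of one possible approach, not how any particular extension works. It assumes an optional currency symbol, comma thousands separators, an optional K/M/B suffix and an optional percent sign. It uses Decimal so that 1.23 × 1,000,000 comes out as exactly 1230000. The normalize_number name is hypothetical.

```python
import re
from decimal import Decimal

SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
PATTERN = re.compile(r"[$€£]?(-?\d+(?:\.\d+)?)([KMB]?)(%?)", re.I)


def normalize_number(text):
    """Convert strings like '$1.23M', '$50,000' or '34.7%' to a Decimal, or None."""
    cleaned = text.strip().replace(",", "")  # drop thousands separators
    match = PATTERN.fullmatch(cleaned)
    if not match:
        return None  # not a number in a format this sketch recognizes
    number, suffix, percent = match.groups()
    value = Decimal(number) * SUFFIXES.get(suffix.upper(), 1)
    return value / 100 if percent else value


print(normalize_number("$1.23M"))   # 1230000.00
print(normalize_number("$50,000"))  # 50000
print(normalize_number("34.7%"))    # 0.347
print(normalize_number("Jan 15"))   # None
```

Locales that use a comma as the decimal separator would need different handling.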

HTML Table Exporter extracts visible content with proper handling of merged cells and formatting. For more complex extraction needs without code, see our guide on scraping tables from websites without code.

For data attributes and deeply hidden content, you may need the JavaScript approaches above.

The Inspection Workflow

Before extracting any table:

1. Open DevTools (F12)
2. Inspect the table element
3. Check for:
   - display: none on columns
   - data-* attributes on rows/cells
   - title attributes with extra info
   - Hidden child rows
   - "Load more" buttons
4. Decide: Is the visible data enough, or do you need the hidden data?

Most of the time, visible data is sufficient. But when it's not, knowing how to find and extract hidden data makes the difference.

Summary

HTML tables can contain multiple layers of data:

Layer	How to Access
Visible text	Standard export
CSS-hidden columns	Remove display:none or use BeautifulSoup
Data attributes	JavaScript dataset or Python attrs
Title/tooltips	Extract title attribute
Collapsed rows	Expand or remove hidden class
Lazy-loaded	Trigger loading first

The data is usually there. You just need to know where to look.

Need to export the data you can see? Learn more at gauchogrid.com/html-table-exporter or try HTML Table Exporter free on the Chrome Web Store.

DEV Community