<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: hottbunny</title>
    <description>The latest articles on DEV Community by hottbunny (@hottbunny).</description>
    <link>https://dev.to/hottbunny</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3942572%2F0f2c1cd1-b577-4e1c-bd4b-e69bcf39de83.jpeg</url>
      <title>DEV Community: hottbunny</title>
      <link>https://dev.to/hottbunny</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hottbunny"/>
    <language>en</language>
    <item>
      <title>DOM Accessibility Tree Extraction: A Reliable Method for LLMs on Dynamic Web Tables</title>
      <dc:creator>hottbunny</dc:creator>
      <pubDate>Wed, 20 May 2026 15:07:18 +0000</pubDate>
      <link>https://dev.to/hottbunny/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web-tables-1j5k</link>
      <guid>https://dev.to/hottbunny/dom-accessibility-tree-extraction-a-reliable-method-for-llms-on-dynamic-web-tables-1j5k</guid>
      <description>&lt;p&gt;&lt;strong&gt;Status:&lt;/strong&gt; Current best available technique as of 2026. Treat as standard practice, not a workaround.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Three naive approaches fail on modern sites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;view-source / static fetch&lt;/strong&gt; — returns server HTML before JavaScript runs. JS-rendered tables show only empty &lt;code&gt;&amp;lt;tbody&amp;gt;&lt;/code&gt; tags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot + OCR&lt;/strong&gt; — slow, pixel-dependent, brittle, compounds errors on numeric data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Screenshot + vision model&lt;/strong&gt; — expensive, context-limited, fails on tables larger than one viewport.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Root cause:&lt;/strong&gt; The web has shifted to client-side rendering. Data lives in JavaScript runtime state, not HTML.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Method
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Intuition:&lt;/strong&gt; This is the programmatic equivalent of a manual workflow: highlight the table → copy → paste into Notepad → import into Excel → delete irrelevant columns → sort and count.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load page in headless browser (Playwright recommended) — JavaScript executes, table renders&lt;/li&gt;
&lt;li&gt;Interact with any dropdowns or filters, then wait for &lt;code&gt;networkidle&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Call &lt;code&gt;inner_text()&lt;/code&gt; on the table element&lt;/li&gt;
&lt;li&gt;Write extracted text to file (audit trail, enables re-parsing)&lt;/li&gt;
&lt;li&gt;Parse in Python — split on newlines/tabs, cast numerics, filter and count&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Why It Works
&lt;/h2&gt;

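&lt;p&gt;The rendered DOM, like the accessibility tree the browser derives from it, is structured, semantic, independent of pixels, and already parsed by the browser, so reading it is fast. Because &lt;code&gt;inner_text()&lt;/code&gt; reads the text directly instead of transcribing it from an image, numbers arrive without OCR errors.&lt;/p&gt;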
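
&lt;p&gt;As a minimal sketch of why this text is easy to parse: for a rendered HTML table, &lt;code&gt;inner_text()&lt;/code&gt; generally returns one line per row with cells separated by tabs, so the standard &lt;code&gt;csv&lt;/code&gt; module can read it. The sample string and the &lt;code&gt;parse_table_text&lt;/code&gt; helper below are hypothetical illustrations, not output captured from a real site.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
import csv
import io


def parse_table_text(table_text):
    """Split tab/newline-delimited table text into a header row and data rows."""
    reader = csv.reader(
        io.StringIO(table_text.strip()),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters in cells as plain text
    )
    # Drop blank lines and trim stray whitespace around each cell.
    rows = [[cell.strip() for cell in row] for row in reader if any(row)]
    return rows[0], rows[1:]


# Hypothetical sample shaped like inner_text() output for a three-column table.
sample = "City\tTime\tTemp\nOslo\t10:00\t28 °F\nCairo\t11:00\t101 °F\n"
header, rows = parse_table_text(sample)
print(header)
print(rows)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the header separate from the data rows makes it possible to look up a column by name instead of relying on a fixed index.&lt;/p&gt;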

&lt;h2&gt;
  
  
  Pseudocode
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
from playwright.sync_api import sync_playwright
import re

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(URL, wait_until="networkidle")

    page.select_option("select#view-filter", label="All Cities")
    page.wait_for_load_state("networkidle")

    table_text = page.query_selector("table").inner_text()
    browser.close()

with open("table_output.txt", "w") as f:
    f.write(table_text)

lines = table_text.strip().split("\n")
rows = [line.split("\t") for line in lines[1:] if line.strip()]
temps = [float(re.sub(r"[^\d.\-]", "", r[2])) for r in rows if r[2].strip()]
print(f"Below 32°F: {sum(t &amp;lt; 32 for t in temps)}")
print(f"Above 100°F: {sum(t &amp;gt; 100 for t in temps)}")


Real Example
Source: timeanddate.com, 472-city weather table, “Somewhat Popular” view.
• Execution time: ~8 seconds
• Cities below 32°F: 47
• Cities above 100°F: 12
• OCR errors: 0
Limitations
• Requires real browser runtime (Playwright/Puppeteer)
• Some sites block headless automation
• Canvas-rendered tables require  page.accessibility.snapshot()  fallback
• Infinite scroll requires simulating scroll events
• Always prefer an official API if one exists
Full writeup with detailed tips and examples on GitHub: https://github.com/hottbunny/LLM-AI-Perplexity-Skills-and-Updates/blob/hottbunny-tested-works-htmlsearchtablecrawldataretrivalskill/dom_extraction_method.md



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
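
&lt;p&gt;For the infinite-scroll limitation, one possible approach is to keep scrolling until the rendered row count stops growing, and only then extract the text. This is a rough sketch under assumptions: the &lt;code&gt;scroll_until_stable&lt;/code&gt; helper, the &lt;code&gt;table tbody tr&lt;/code&gt; selector, and the pause and round limits are illustrative choices, not part of the method as tested above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
def scroll_until_stable(page, row_selector="table tbody tr", max_rounds=20, pause_ms=500):
    """Scroll a Playwright page until no new rows appear; return the final row count."""
    previous = -1
    for _ in range(max_rounds):
        count = page.locator(row_selector).count()
        if count == previous:
            break  # No new rows since the last scroll, so assume loading is done.
        previous = count
        page.mouse.wheel(0, 4000)        # Scroll down to trigger lazy loading.
        page.wait_for_timeout(pause_ms)  # Give the page time to fetch and render.
    return page.locator(row_selector).count()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The helper would be called after &lt;code&gt;page.goto(...)&lt;/code&gt; and before &lt;code&gt;inner_text()&lt;/code&gt;. The &lt;code&gt;max_rounds&lt;/code&gt; cap keeps it from looping forever on feeds that never end.&lt;/p&gt;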

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
