<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bonzai2Carn</title>
    <description>The latest articles on DEV Community by Bonzai2Carn (@bonzai2carn).</description>
    <link>https://dev.to/bonzai2carn</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3821639%2F5996c0fb-fd15-4c12-9a9b-9216046f0bfb.png</url>
      <title>DEV Community: Bonzai2Carn</title>
      <link>https://dev.to/bonzai2carn</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bonzai2carn"/>
    <language>en</language>
    <item>
      <title>The Empty Quadrant: Mapping the Design Space of Frontend PDF Extraction</title>
      <dc:creator>Bonzai2Carn</dc:creator>
      <pubDate>Thu, 14 May 2026 13:30:00 +0000</pubDate>
      <link>https://dev.to/bonzai2carn/the-empty-quadrant-mapping-the-design-space-of-frontend-pdf-extraction-166g</link>
      <guid>https://dev.to/bonzai2carn/the-empty-quadrant-mapping-the-design-space-of-frontend-pdf-extraction-166g</guid>
      <description>&lt;p&gt;A user asked me a sharp question yesterday:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Looking at your extraction pipeline, pdfjs + geometryWorker + lattice + visualGridMapper, what makes this any different from any other extraction approach for frontend only, no backend or compiled engine?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It's the right question to ask any author of a tool. So I sat down and surveyed the space honestly. What I found was more interesting than my gut answer.&lt;/p&gt;

&lt;p&gt;The pipeline isn't different because of clever algorithms. The lattice reconstruction is the same lattice reconstruction every server-side tool uses. The KD-tree proximity is a textbook nearest-neighbor query. Y-band paragraph clustering is in a 1996 paper. &lt;strong&gt;The math is borrowed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What's different is the &lt;em&gt;quadrant of the design space&lt;/em&gt; the pipeline occupies, and the architectural commitments it took to land there.&lt;/p&gt;

&lt;p&gt;This post maps that design space. It catalogs what's already in each cell, identifies the empty one, and explains why it stayed empty long enough for a niche to form.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The 2×2 grid
&lt;/h2&gt;

&lt;p&gt;Two axes describe almost every PDF extraction project I've encountered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approach axis&lt;/strong&gt;: deterministic vs. ML-based.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output axis&lt;/strong&gt;: visual fidelity vs. semantic structure.&lt;/li&gt;
&lt;/ul&gt;

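&lt;p&gt;Plot them and you get four cells. The map below also carries a third row, text-only stream extraction, for tools that deliver neither fidelity nor structure.&lt;br&gt;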
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                 DETERMINISTIC                 ML-BASED
                ───────────────                ────────
  SEMANTIC      pdfplumber, Tabula,            Adobe Extract API,
  STRUCTURE     Camelot, PyMuPDF               Textract, Azure DI,
  (backend)                                     transformers.js + layout
                                                models (frontend)

  VISUAL        pdf2htmlEX                     —
  FIDELITY      (frontend WASM)
  (frontend)

  TEXT-ONLY     pdfreader, pdf-extract,        tesseract.js
  STREAM        the naive getTextContent()     (OCR over rendered canvas)
                recipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three observations fall out of this map immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend dominates the deterministic-structural cell.&lt;/strong&gt; Everything serious about extracting structure from PDFs without ML lives on a server. pdfplumber, Tabula, Camelot, PyMuPDF: mostly Python (Tabula is Java, and PyMuPDF wraps the C library MuPDF), all backend, all built on years of accumulated implementation knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Frontend is well-represented but compromised.&lt;/strong&gt; Each frontend project gives up something significant. &lt;code&gt;pdf2htmlEX&lt;/code&gt; reproduces visual appearance perfectly but ships zero semantic structure. &lt;code&gt;tesseract.js&lt;/code&gt; works on scanned PDFs but throws away the native text layer that digital PDFs hand you for free. The transformers.js + layout-model approach handles weird documents but ships multi-megabyte model weights and opaque failure modes. The naive &lt;code&gt;getTextContent()&lt;/code&gt; recipe and its Y-clustering descendants give you a flat blob and don't read the operator list at all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There's a frontend cell that's empty.&lt;/strong&gt; Deterministic. Structural. No ML weights. No raster step. No backend.&lt;/p&gt;

&lt;p&gt;That empty cell is where this pipeline sits.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The naive 95 percent
&lt;/h2&gt;

&lt;p&gt;Before we look at what fills the four occupied cells, it's worth establishing the baseline. Roughly 95 percent of frontend PDF extraction code in the wild does this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pdfjsLib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getDocument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bytes&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;promise&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;numPages&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pdf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getTextContent&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;it&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works on a memo. It collapses on a two-column research paper. It liquefies on a table. It can't tell a heading from a paragraph. It has no concept of reading order on a complex page.&lt;/p&gt;

&lt;p&gt;Everything beyond this baseline is a project trying harder. There are four serious such projects. None of them sits in the deterministic-structural-frontend cell.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The four occupied frontend cells
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cell A: &lt;code&gt;pdf2htmlEX&lt;/code&gt; — visual fidelity, no semantics
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;pdf2htmlEX&lt;/code&gt; is an old C++ project that reaches the browser as a WASM port. It walks the PDF and emits absolutely-positioned &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt;s that visually reproduce the source.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"position:absolute; top:124px; left:88px; font-size:11pt"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;A table cell&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"position:absolute; top:124px; left:240px; font-size:11pt"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Another&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;style=&lt;/span&gt;&lt;span class="s"&gt;"position:absolute; top:124px; left:392px; font-size:11pt"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Cell&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to render the PDF in a browser and let the user select text, this is unbeatable. If you want any semantic structure (a &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;, an &lt;code&gt;&amp;lt;h1&amp;gt;&lt;/code&gt;, a paragraph block), you're back to scraping divs by their bounding boxes — the same problem the user started with.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cell B: &lt;code&gt;tesseract.js&lt;/code&gt; — OCR
&lt;/h3&gt;

&lt;p&gt;Render each page to canvas. Run OCR on the canvas. Get text + bounding boxes back.&lt;/p&gt;

&lt;p&gt;This is the right answer for &lt;strong&gt;scanned&lt;/strong&gt; PDFs that have no native text layer. It's the wrong answer for digital PDFs that already have perfect text. You're feeding selectable text through an image-to-text model and getting a degraded copy of what was already there. Plus a 2MB WASM payload, plus seconds-per-page latency.&lt;/p&gt;
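&lt;p&gt;One way to avoid OCR on digital PDFs is to check for a native text layer first. The sketch below is a minimal, hypothetical helper (the name &lt;code&gt;hasNativeText&lt;/code&gt; and the &lt;code&gt;minChars&lt;/code&gt; threshold are assumptions, not part of any library); it expects the object that pdf.js resolves from &lt;code&gt;page.getTextContent()&lt;/code&gt;.&lt;/p&gt;

```javascript
// Hypothetical helper: decide whether a page needs OCR at all.
// `textContent` is the object resolved by pdf.js page.getTextContent().
function hasNativeText(textContent, minChars = 1) {
  let chars = 0;
  for (const item of textContent.items) {
    // Marked-content entries carry no `str`; count only real text runs.
    if (typeof item.str === 'string') chars += item.str.trim().length;
  }
  return chars >= minChars;
}
```

&lt;p&gt;If this returns &lt;code&gt;false&lt;/code&gt; the page is probably a scan and OCR is the reasonable fallback; otherwise the native text is already there to use.&lt;/p&gt;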

&lt;h3&gt;
  
  
  Cell C: &lt;code&gt;transformers.js&lt;/code&gt; + layout models — ML-based structural
&lt;/h3&gt;

&lt;p&gt;Load a layout-aware model (DocLayout-YOLO, LayoutLM, or similar) into the browser via ONNX or transformers.js. Render each page to canvas. Run inference. Get back labeled regions: &lt;code&gt;TABLE&lt;/code&gt;, &lt;code&gt;TEXT&lt;/code&gt;, &lt;code&gt;FIGURE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is where the modern industry is heading. It works on weird, varied document types. It generalizes. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model weights are megabytes (DocLayout-YOLO Nano alone is ~6MB ONNX).&lt;/li&gt;
&lt;li&gt;First inference takes seconds.&lt;/li&gt;
&lt;li&gt;Failure modes are opaque — when the model misclassifies, you have no levers.&lt;/li&gt;
&lt;li&gt;You're shipping an ML inference engine to do something that, for digital PDFs, can be done with pure geometry.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cell D: &lt;code&gt;pdfreader&lt;/code&gt;, &lt;code&gt;pdf-extract&lt;/code&gt;, and friends — text-only Y-clustering
&lt;/h3&gt;

&lt;p&gt;These libraries take &lt;code&gt;getTextContent()&lt;/code&gt; items, cluster them by Y position, sort by X, and produce slightly more structured output than the flat-blob recipe.&lt;/p&gt;

&lt;p&gt;The fundamental limit: they only consume the text content. They never call &lt;code&gt;getOperatorList()&lt;/code&gt;. They cannot see vector lines. They cannot detect a table border, distinguish an underline from a horizontal rule, or recognize a chart axis. Their world is text and only text.&lt;/p&gt;

&lt;p&gt;For prose-heavy documents, that's fine. For anything with tables, they degrade to row-smashing.&lt;/p&gt;
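&lt;p&gt;To make the limit concrete, here is a rough sketch of the Y-clustering idea, not the code of any particular library. The function name &lt;code&gt;clusterIntoLines&lt;/code&gt; and the &lt;code&gt;yTolerance&lt;/code&gt; default are assumptions; the items are assumed to look like pdf.js text items, with &lt;code&gt;transform[4]&lt;/code&gt; and &lt;code&gt;transform[5]&lt;/code&gt; holding x and y in PDF user space.&lt;/p&gt;

```javascript
// Hypothetical sketch of Y-band clustering over pdf.js-style text items.
// Each item is assumed to expose `str` and `transform` ([a, b, c, d, e, f]).
function clusterIntoLines(items, yTolerance = 2) {
  // PDF user space grows upward, so descending y reads top to bottom.
  const sorted = [...items].sort((p, q) => q.transform[5] - p.transform[5]);
  const lines = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const last = lines[lines.length - 1];
    if (last === undefined || Math.abs(last.y - y) > yTolerance) {
      lines.push({ y, items: [item] }); // start a new band
    } else {
      last.items.push(item); // same band as the previous item
    }
  }
  // Within each band, order left to right and join with spaces.
  return lines.map(line =>
    line.items
      .sort((p, q) => p.transform[4] - q.transform[4])
      .map(it => it.str)
      .join(' ')
  );
}
```

&lt;p&gt;Note what the last step does to a table row: three cells on the same baseline come back as one space-joined string, which is the row-smashing described above.&lt;/p&gt;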




&lt;h2&gt;
  
  
  4. The empty cell, and what fills it
&lt;/h2&gt;

&lt;p&gt;The deterministic-structural-frontend cell asks for a tool that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Runs entirely in the browser. No server.&lt;/li&gt;
&lt;li&gt;Ships no ML model weights. Determinism via geometry.&lt;/li&gt;
&lt;li&gt;Reads the operator list, not just the text content. Vector-aware.&lt;/li&gt;
&lt;li&gt;Outputs semantic structure: tables with topology, headings, paragraphs, lists, reading order.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To fill it, this pipeline does the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  4.1 CTM-baked vector segments
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ctmAdapter.js (simplified)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;fnArray&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fnArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;OPS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;save&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;    &lt;span class="nx"&gt;ctmStack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fnArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;OPS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;restore&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;ctm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ctmStack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fnArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;OPS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;ctm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mulMatrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ctm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;argsArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fnArray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;OPS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;constructPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Walk subpaths, transform each point through CTM × viewport.transform,&lt;/span&gt;
    &lt;span class="c1"&gt;// emit normalized H/V segment records.&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the move that puts the pipeline in a different category from cells B and D. We don't just consume text. We consume the operator list and reconstruct the page's vector skeleton in viewport coordinates. We can &lt;em&gt;see&lt;/em&gt; the table borders before any text math runs.&lt;/p&gt;
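&lt;p&gt;The matrix arithmetic behind that loop is small. The following is a sketch under assumptions rather than the pipeline's actual code: &lt;code&gt;mulMatrix&lt;/code&gt; mirrors the name used above, while &lt;code&gt;applyMatrix&lt;/code&gt;, &lt;code&gt;toSegment&lt;/code&gt;, the record shape, and the &lt;code&gt;eps&lt;/code&gt; tolerance are hypothetical. Matrices use the six-element PDF form &lt;code&gt;[a, b, c, d, e, f]&lt;/code&gt;.&lt;/p&gt;

```javascript
// Hypothetical helpers; PDF matrices are [a, b, c, d, e, f].
// mulMatrix(m1, m2) yields a matrix that applies m2 first, then m1.
function mulMatrix(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Map a point through a matrix.
function applyMatrix(m, [x, y]) {
  return [m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]];
}

// Classify a transformed segment as horizontal, vertical, or neither.
function toSegment(p1, p2, eps = 0.5) {
  const [x1, y1] = p1;
  const [x2, y2] = p2;
  if (Math.abs(y1 - y2) > eps) {
    if (Math.abs(x1 - x2) > eps) return null; // diagonal: not a grid line
    return { kind: 'V', x: (x1 + x2) / 2, yMin: Math.min(y1, y2), yMax: Math.max(y1, y2) };
  }
  return { kind: 'H', y: (y1 + y2) / 2, xMin: Math.min(x1, x2), xMax: Math.max(x1, x2) };
}
```

&lt;p&gt;With a flip matrix such as &lt;code&gt;[1, 0, 0, -1, 0, pageHeight]&lt;/code&gt; standing in for the viewport transform, composing it with the CTM once and mapping every path point through the result puts lines and text in the same coordinate space.&lt;/p&gt;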

&lt;h3&gt;
  
  
  4.2 Region-typed classification &lt;em&gt;before&lt;/em&gt; extraction
&lt;/h3&gt;

&lt;p&gt;Most pipelines run sequential passes: find tables, find paragraphs, find lists. Each pass works against the full text pool. Then you deduplicate at the end and hope the passes didn't disagree.&lt;/p&gt;

&lt;p&gt;This pipeline does the opposite. Classify regions first, then route scoped text into each region's specialist extractor. The mechanism is a single &lt;code&gt;assignedTextIndices&lt;/code&gt; set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lattice&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;lattices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tableTextIndices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;textMeta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;assignedTextIndices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// skip consumed&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;insideBBox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lattice&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tablePad&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;tableTextIndices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;assignedTextIndices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// mark as consumed&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;regions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;lattice&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;textItemIndices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tableTextIndices&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// later: paragraph/heading/list passes only see un-consumed text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The invariant is: &lt;strong&gt;a text item belongs to exactly one region.&lt;/strong&gt; No leakage by construction. The bug class of "table text accidentally in a paragraph" is preempted, not patched.&lt;/p&gt;
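&lt;p&gt;Two small helpers show how the later passes and a sanity check could lean on that set. This is a sketch only: &lt;code&gt;unassignedText&lt;/code&gt; and &lt;code&gt;findDoubleAssigned&lt;/code&gt; are hypothetical names, and the shapes of &lt;code&gt;textMeta&lt;/code&gt; and &lt;code&gt;regions&lt;/code&gt; are assumed from the snippet above.&lt;/p&gt;

```javascript
// Hypothetical continuation: feed only unconsumed items to later passes.
function unassignedText(textMeta, assignedTextIndices) {
  return textMeta.filter(tm => !assignedTextIndices.has(tm.idx));
}

// Sanity check for the invariant: list any index claimed by two regions.
function findDoubleAssigned(regions) {
  const seen = new Set();
  const dupes = new Set();
  for (const region of regions) {
    for (const idx of region.textItemIndices) {
      if (seen.has(idx)) dupes.add(idx);
      seen.add(idx);
    }
  }
  return [...dupes];
}
```

&lt;p&gt;An empty array from &lt;code&gt;findDoubleAssigned&lt;/code&gt; is a cheap assertion to keep in a test suite for the one-region-per-item rule.&lt;/p&gt;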

&lt;h3&gt;
  
  
  4.3 Underline-vs-border discrimination
&lt;/h3&gt;

&lt;p&gt;A naive lattice reconstructor sees every horizontal line and tries to use it as a table border. This produces phantom 1×1 tables under every underlined heading.&lt;/p&gt;

&lt;p&gt;We classify each H-segment against the text baselines with a KD-tree-style proximity test, written here as a plain nested scan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hSegs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
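  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// Assumes each segment also carries x1/x2 endpoints, like y1/y2.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hXMin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hXMax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hLen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;hXMax&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;hXMin&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;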
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;textMeta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;yDist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;hY&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;yDist&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;yDist&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
        &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vx&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;hXMax&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vWidth&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;hXMin&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
        &lt;span class="nx"&gt;hLen&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vWidth&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;underlineSegIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a horizontal line sits between 1px above and 5px below a text baseline, overlaps that text's X-span (with 2px of slack), and is shorter than 2.5× the text's width, it's an underline. Tag it. Remove it from the table-detection pool. ~99% of phantom tables disappear.&lt;/p&gt;

&lt;p&gt;I have not seen another browser-side PDF extractor that does this. Tabula has equivalents on the backend. On the frontend, every other tool I've audited just hands all H-lines to the lattice and lives with phantom tables.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Topological cell-merge inference
&lt;/h3&gt;

&lt;p&gt;Naive table extractors detect cell merges by visual whitespace heuristics ("if these two cells have no visible boundary between their text, they're merged"). This is unreliable. Tables with thin internal borders look unmerged but are; tables with wide cell padding look merged but aren't.&lt;/p&gt;

&lt;p&gt;This pipeline asks the geometry directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;vLinePresent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;vLines&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;yA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;yB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;eps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;vLines&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;
    &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;x&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
    &lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;yMin&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;yA&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;eps&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;
    &lt;span class="nx"&gt;l&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;yMax&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;yB&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;eps&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is there an actual merged vertical-line record at this X position spanning [yA, yB]? If yes, the cell boundary exists; the cells are separate. If no, extend the colspan. Topological, not visual.&lt;/p&gt;
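&lt;p&gt;The "extend the colspan" step can be sketched as a short loop. This is an illustration under assumptions, not the pipeline's implementation: &lt;code&gt;colspanFrom&lt;/code&gt; is a hypothetical name, &lt;code&gt;cols&lt;/code&gt; is assumed to hold the X positions of the grid's column lines from left to right, and &lt;code&gt;hasBoundary&lt;/code&gt; is a callback assumed to wrap a check like &lt;code&gt;vLinePresent&lt;/code&gt;.&lt;/p&gt;

```javascript
// Hypothetical sketch: grow a cell's colspan until a real boundary is found.
// `cols` holds the X positions of the grid's column lines, left to right.
// `hasBoundary(x, yA, yB)` is assumed to wrap a check such as vLinePresent().
function colspanFrom(cols, startCol, yTop, yBottom, hasBoundary) {
  let span = 1;
  // cols[startCol + span] is the right edge of the current merged run;
  // stop once that edge is the table's outer border.
  while (cols.length - 1 > startCol + span) {
    if (hasBoundary(cols[startCol + span], yTop, yBottom)) break;
    span += 1;
  }
  return span;
}
```

&lt;p&gt;A caller would pass something like &lt;code&gt;(x, yA, yB) =&amp;gt; vLinePresent(vLines, x, yA, yB, eps)&lt;/code&gt; as the callback; the same loop run over row lines and H-segments gives the rowspan.&lt;/p&gt;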

&lt;h3&gt;
  
  
  4.5 Nearest-cell Euclidean snap
&lt;/h3&gt;

&lt;p&gt;Strict point-in-box assignment drops text whose origin is 0.1px outside a cell, which is common because PDF rendering coordinates have jitter. We instead measure the Euclidean distance from the text origin to each cell's rectangle (zero when the point is inside) and snap to the nearest cell, with a 15px snap threshold:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;bestR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bestC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;minDist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;Infinity&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;ri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;ri&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;numRows&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;ri&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;ci&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;ci&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;numCols&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;ci&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;sx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;cols&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ci&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ri&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;sy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sy&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;ri&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sqrt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dx&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;dx&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;dy&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="nx"&gt;dy&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dist&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;minDist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;minDist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;dist&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;bestR&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ri&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;bestC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;ci&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;minDist&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;cells&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;bestR&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;bestC&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(...);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Magnetic, not literal. Coordinate jitter doesn't drop data.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.6 Worker-isolated full pipeline
&lt;/h3&gt;

&lt;p&gt;Most browser PDF extractors run on the main thread. The geometry pipeline here loads PDF.js as a &lt;em&gt;nested worker&lt;/em&gt; inside the geometry worker. CTM baking, lattice reconstruction, classification, assembly — all off the main thread. The UI stays responsive on a 200-page document.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.7 Per-page streaming
&lt;/h3&gt;

&lt;p&gt;Naive extractors accumulate the whole document into one structured-clone payload at the end. That dies on large PDFs with stack-overflow errors in &lt;code&gt;postMessage&lt;/code&gt;. We emit per-page &lt;code&gt;'page'&lt;/code&gt; messages from the worker, the main thread accumulates incrementally, and the UI can show progressive results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;postMessage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;page&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;page&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;html&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;tables&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tableCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
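&lt;p&gt;For the receiving side, here is a minimal sketch of how the main thread might accumulate those messages. The &lt;code&gt;createPageAccumulator&lt;/code&gt; name and its shape are assumptions for illustration, not the project's actual API; only the message fields (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;page&lt;/code&gt;, &lt;code&gt;html&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;tables&lt;/code&gt;) come from the snippet above.&lt;/p&gt;

```javascript
// Hypothetical main-thread accumulator for the worker's per-page messages.
function createPageAccumulator() {
  const pages = new Map(); // page number -> { html, text, tables }

  return {
    // Wire up as: worker.onmessage = (event) => acc.handle(event.data);
    handle(msg) {
      if (msg.type !== 'page') return false; // ignore other message types
      pages.set(msg.page, { html: msg.html, text: msg.text, tables: msg.tables });
      return true;
    },
    // Join pages in page order, regardless of arrival order.
    html() {
      return [...pages.keys()]
        .sort((a, b) => a - b)
        .map((p) => pages.get(p).html)
        .join('\n');
    },
    get pageCount() {
      return pages.size;
    },
    get tableTotal() {
      return [...pages.values()].reduce((sum, p) => sum + p.tables, 0);
    },
  };
}
```

&lt;p&gt;Because each message carries one page, no single structured-clone payload grows with document length, and the UI can re-render from &lt;code&gt;html()&lt;/code&gt; after every message.&lt;/p&gt;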



&lt;p&gt;Not algorithmic novelty. Engineering discipline that lets the architecture survive 76-page technical manuals.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.8 VisualGridMapper as a downstream operator
&lt;/h3&gt;

&lt;p&gt;The output isn't a dead &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt; string. It's a live HTML table that we can immediately remap into a Cartesian array using &lt;code&gt;VisualGridMapper&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mapper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;VisualGridMapper&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;table&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// mapper.grid[row][col] now holds origin/spanned cell metadata.&lt;/span&gt;
&lt;span class="c1"&gt;// Transposes, merges, splits all become matrix operations.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the bridge into the table-formatter half of the platform. Other extractors stop at "here's a &lt;code&gt;&amp;lt;table&amp;gt;&lt;/code&gt;." We hand the user something they can keep manipulating mathematically.&lt;/p&gt;
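&lt;p&gt;To make "matrix operations" concrete, here is a minimal sketch of a transpose over a dense grid. It assumes a simplified grid of plain labels in which a spanned slot repeats its origin cell; the real &lt;code&gt;mapper.grid&lt;/code&gt; holds cell metadata, and &lt;code&gt;transposeGrid&lt;/code&gt; is a hypothetical helper, not part of &lt;code&gt;VisualGridMapper&lt;/code&gt;.&lt;/p&gt;

```javascript
// Transpose a dense rectangular grid: out[c][r] = grid[r][c].
function transposeGrid(grid) {
  if (grid.length === 0) return [];
  return grid[0].map((_, c) => grid.map((row) => row[c]));
}

// A 2x3 grid where spanned slots repeat their origin cell's label.
const grid = [
  ['A', 'B', 'B'],
  ['A', 'C', 'D'],
];
const flipped = transposeGrid(grid); // 3 rows, 2 columns
```

&lt;p&gt;Once every visual slot is addressable by row and column, the transpose is a plain index swap; spans only need to be re-derived when the grid is written back to HTML.&lt;/p&gt;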




&lt;h2&gt;
  
  
  5. What's borrowed and what's new
&lt;/h2&gt;

&lt;p&gt;Worth being honest about which pieces of this are original engineering versus academic standard:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Borrowed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The lattice algorithm itself — intersection clustering, row/column projection. Same as Tabula, Camelot, pdfplumber.&lt;/li&gt;
&lt;li&gt;Y-band paragraph clustering — pdfminer-style, in academic literature since the 90s.&lt;/li&gt;
&lt;li&gt;XY-cut column detection — known since the 80s.&lt;/li&gt;
&lt;li&gt;KD-tree spatial indexing — textbook.&lt;/li&gt;
&lt;li&gt;DOMPurify, jQuery, Monaco — off-the-shelf.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Original to this pipeline (or unusual in the niche):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The full assembly running in a Web Worker on top of PDF.js as a nested worker.&lt;/li&gt;
&lt;li&gt;The non-overlapping-region invariant via &lt;code&gt;assignedTextIndices&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The underline-discrimination heuristic with the specific 0–5px / 2.5×-width thresholds.&lt;/li&gt;
&lt;li&gt;The coordinate-space discipline: storing both &lt;code&gt;vWidth/vFont&lt;/code&gt; (viewport) and &lt;code&gt;width/fontSize&lt;/code&gt; (PDF points) on every text-meta record, with explicit comments about which to use where.&lt;/li&gt;
&lt;li&gt;The per-page streaming pattern that survives 100+ page documents.&lt;/li&gt;
&lt;li&gt;The integration with &lt;code&gt;VisualGridMapper&lt;/code&gt; for downstream mathematical manipulation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pipeline is a composition. The composition is the contribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Why this cell stayed empty
&lt;/h2&gt;

&lt;p&gt;If the deterministic-structural-frontend cell is valuable, why hadn't anyone filled it?&lt;/p&gt;

&lt;p&gt;Three reasons, in order of how convincing each one is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Economics push toward backend.&lt;/strong&gt; If you have a use case that needs structural PDF extraction, you almost certainly have a server. The serious tools live in Python and have for a decade. There's no incentive to port them unless you specifically need data to stay on the client device — which is a real but niche requirement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Existing frontend tools are anchored to other quadrants.&lt;/strong&gt; &lt;code&gt;pdf2htmlEX&lt;/code&gt; is committed to visual fidelity. &lt;code&gt;tesseract.js&lt;/code&gt; is committed to OCR. The transformers.js camp is committed to ML generalization. Each is well-architected for its quadrant and would require an architectural rewrite to drift into the deterministic-structural cell. Nobody had a reason to do that work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pieces are scattered.&lt;/strong&gt; PDF.js gives you the operator list but assumes you'll use it for rendering. Lattice algorithms are described in papers, not packaged as npm modules. KD-tree libraries assume preformatted data. Web Worker isolation has its own ergonomic learning curve. Climbing the staircase to assemble all of these is real engineering work, and unless you have a strong reason to be in this exact cell, the cost-benefit doesn't pencil out.&lt;/p&gt;

&lt;p&gt;We had a reason. The platform we're building is browser-native by &lt;em&gt;commitment&lt;/em&gt;, not accident. Every other tool in our pipeline runs in the browser. Sending PDFs to a server for structural extraction would have broken the architectural model. So we climbed.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The lesson above the niche
&lt;/h2&gt;

&lt;p&gt;There's a generalization worth saying out loud, because it applies far beyond PDF tooling.&lt;/p&gt;

&lt;p&gt;When you wonder whether you're reinventing a wheel, do the survey. But ask the right question. The question is not &lt;em&gt;"has anyone solved this problem?"&lt;/em&gt; — the answer to that is almost always yes, somewhere. The question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What set of constraints does my version satisfy that nobody else's version satisfies?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Constraints are commitments. &lt;em&gt;No backend. No model weights. Worker isolation. Deterministic output. Per-page streaming. Open source.&lt;/em&gt; Each one is a deliberate refusal of a path other people took.&lt;/p&gt;

&lt;p&gt;The intersection of constraints is where new niches live. The math you use &lt;em&gt;inside&lt;/em&gt; that intersection is often the same math everyone else uses. That's fine. The originality isn't in the math. It's in the negative space: the things you said no to.&lt;/p&gt;

&lt;p&gt;The pipeline isn't different because the algorithms are different. It's different because of where it runs and what it refuses to be.&lt;/p&gt;




&lt;p&gt;The full pipeline is open source as part of the &lt;a href="https://github.com/carnworkstudios" rel="noopener noreferrer"&gt;GINEXYS&lt;/a&gt; project. If you find a fifth camp I missed, or if you've built something that fills the empty quadrant differently, the issue tracker is open. I'm specifically curious whether anyone else has implemented in-browser CTM baking against pdfjs-dist's operator list — that piece felt the loneliest in my survey.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>pdf</category>
      <category>architecture</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to Stop PDF Parsers from Hallucinating Tables out of Thin Air</title>
      <dc:creator>Bonzai2Carn</dc:creator>
      <pubDate>Tue, 12 May 2026 15:54:05 +0000</pubDate>
      <link>https://dev.to/bonzai2carn/how-to-stop-pdf-parsers-from-hallucinating-tables-out-of-thin-air-n0</link>
      <guid>https://dev.to/bonzai2carn/how-to-stop-pdf-parsers-from-hallucinating-tables-out-of-thin-air-n0</guid>
      <description>&lt;p&gt;PDF extraction is usually blind. &lt;/p&gt;

&lt;p&gt;If you've ever tried to write a script to scrape a PDF, you know exactly what I mean. You run the PDF through a generic text extractor, and instead of a clean table, you get a jammed wall of text where the columns are violently shoved into a single vertical stack. &lt;/p&gt;

&lt;p&gt;Or worse, you try to use a table extractor, and it hallucinates tables everywhere. See a bold heading with an underline? The parser thinks that's a 1x1 table. See a horizontal divider between paragraphs? Boom, phantom table. &lt;/p&gt;

&lt;p&gt;Why does this happen? Because most PDF parsers process the document in a strict, sequential pipeline. They look at all the lines. They look at all the text. And they just smash them together.&lt;/p&gt;

&lt;p&gt;I got tired of this. So I re-engineered the extraction pipeline in our PDF processor to stop reading the document like a machine, and start &lt;em&gt;seeing&lt;/em&gt; it like a human.&lt;/p&gt;

&lt;p&gt;Here is the math behind Context-Aware PDF Extraction.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Blind Extraction Problem
&lt;/h2&gt;

&lt;p&gt;Previously, our extraction pipeline worked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find all horizontal and vertical line segments (&lt;code&gt;H-segs&lt;/code&gt; and &lt;code&gt;V-segs&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Run them through a &lt;code&gt;LatticeReconstructor&lt;/code&gt; to find intersecting grids.&lt;/li&gt;
&lt;li&gt;Treat every grid as a table.&lt;/li&gt;
&lt;li&gt;Dump all the text in the document into those grids using a strict "is this point inside this box" check.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This was a disaster for documents that mixed paragraphs with tables. &lt;/p&gt;

&lt;p&gt;If a paragraph had a decorative underline, the &lt;code&gt;LatticeReconstructor&lt;/code&gt; would see the H-line, panic, and try to build a table out of it. &lt;br&gt;
If text was slightly offset inside a table cell due to coordinate jitter, the "point-in-box" check would fail, and the text would just vanish from the output.&lt;/p&gt;

&lt;p&gt;I needed the parser to understand &lt;em&gt;context&lt;/em&gt;. &lt;/p&gt;


&lt;h2&gt;
  
  
  2. Enter the Context Classifier
&lt;/h2&gt;

&lt;p&gt;To fix this, I built the &lt;code&gt;contextClassifier&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Instead of treating the PDF as a bucket of shapes and text, the &lt;code&gt;contextClassifier&lt;/code&gt; walks the document and groups every single item into spatially bounded, typed regions: &lt;code&gt;TABLE&lt;/code&gt;, &lt;code&gt;PARAGRAPH&lt;/code&gt;, &lt;code&gt;HEADING&lt;/code&gt;, &lt;code&gt;LIST&lt;/code&gt;, and &lt;code&gt;IMAGE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But how do you tell a machine the difference between a table border and a decorative underline? &lt;/p&gt;

&lt;p&gt;You use proximity math.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Proximity check (a plain nested scan here): does an H-line sit just under a text baseline?&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hSegs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y1&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;y2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;textMeta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;yDist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;hY&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 

        &lt;span class="c1"&gt;// Underline: line is 0–5px below the text baseline (with 1px of slack above it)&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;yDist&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;yDist&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nf"&gt;overlappingXSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;underlineSegIds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a horizontal line sits 0 to 5 pixels below a text baseline and its horizontal span roughly matches the text's, it's not a table border. It's an underline. &lt;/p&gt;

&lt;p&gt;By tagging and removing these underlines &lt;em&gt;before&lt;/em&gt; we run the table reconstruction, we eliminate 99% of phantom tables. &lt;/p&gt;
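&lt;p&gt;The snippet above calls &lt;code&gt;overlappingXSpan&lt;/code&gt; without showing it. Here is one way such a helper could look, as a sketch under stated assumptions: the field names (&lt;code&gt;vx&lt;/code&gt;, &lt;code&gt;vWidth&lt;/code&gt;, &lt;code&gt;x1&lt;/code&gt;, &lt;code&gt;x2&lt;/code&gt;), the 50% coverage requirement, and the default width ratio are all illustrative choices, not the project's confirmed implementation.&lt;/p&gt;

```javascript
// Hypothetical helper: does the H-segment cover the text run horizontally
// without being much wider than it? Field names are assumptions.
function overlappingXSpan(tm, h, maxWidthRatio = 2.5) {
  if (!(tm.vWidth > 0)) return false; // degenerate text item

  const textLeft = tm.vx;
  const textRight = tm.vx + tm.vWidth;
  const lineLeft = Math.min(h.x1, h.x2);
  const lineRight = Math.max(h.x1, h.x2);

  // Length of the shared horizontal interval (0 when they do not touch).
  const overlap = Math.max(
    0,
    Math.min(textRight, lineRight) - Math.max(textLeft, lineLeft)
  );
  const lineWidth = lineRight - lineLeft;

  const coversText = overlap >= 0.5 * tm.vWidth;              // assumed coverage rule
  const notTooWide = maxWidthRatio * tm.vWidth >= lineWidth;  // assumed width cap
  return coversText ? notTooWide : false;
}
```

&lt;p&gt;The width cap is what separates an underline from a full-width table rule that merely happens to pass beneath a heading.&lt;/p&gt;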




&lt;h2&gt;
  
  
  3. Scoping the Text (No More Collisions)
&lt;/h2&gt;

&lt;p&gt;Once the tables are detected, we calculate the exact bounding box of the table grid. &lt;/p&gt;

&lt;p&gt;Instead of throwing all the document's text at the table builder, the classifier scoops up &lt;em&gt;only&lt;/em&gt; the text items that physically live inside that bounding box.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tableTextIndices&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;textMeta&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;insideBBox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;bbox&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;tableTextIndices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;assignedTextIndices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// Mark as consumed!&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This does two things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It guarantees that table text doesn't accidentally leak into paragraphs.&lt;/li&gt;
&lt;li&gt;It guarantees that paragraph text doesn't get sucked into table cells.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once a text item is claimed by a region, it's marked as consumed. &lt;/p&gt;
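&lt;p&gt;A minimal sketch of that claim-once rule follows. The &lt;code&gt;insideBBox&lt;/code&gt; body, the bbox field names (&lt;code&gt;left&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;right&lt;/code&gt;, &lt;code&gt;bottom&lt;/code&gt;), and the &lt;code&gt;claimText&lt;/code&gt; wrapper are assumptions for illustration. The explicit skip of already-assigned items is added here to make the invariant visible; it is not shown in the loop above.&lt;/p&gt;

```javascript
// Hypothetical point-in-rectangle test; the bbox field names are assumptions.
function insideBBox(x, y, bbox, pad = 0) {
  const inX = x >= bbox.left - pad ? bbox.right + pad >= x : false;
  const inY = y >= bbox.top - pad ? bbox.bottom + pad >= y : false;
  return inX ? inY : false;
}

// Claim every not-yet-assigned text item that falls inside the region's box.
function claimText(textMeta, bbox, assigned) {
  const claimed = [];
  for (const tm of textMeta) {
    if (assigned.has(tm.idx)) continue; // already consumed by another region
    if (!insideBBox(tm.vx, tm.vy, bbox)) continue;
    claimed.push(tm.idx);
    assigned.add(tm.idx); // mark as consumed
  }
  return claimed;
}
```

&lt;p&gt;Running &lt;code&gt;claimText&lt;/code&gt; for tables first and paragraphs second, against one shared &lt;code&gt;Set&lt;/code&gt;, is enough to keep the regions non-overlapping.&lt;/p&gt;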




&lt;h2&gt;
  
  
  4. The Nearest-Cell Proximity Assignment
&lt;/h2&gt;

&lt;p&gt;Even with scoped text, getting the text into the correct table cell was still failing due to PDF rendering quirks. A cell edge might sit at &lt;code&gt;x: 10.5&lt;/code&gt; while the text starts at &lt;code&gt;x: 10.4&lt;/code&gt;. A strict bounding box check would drop the text.&lt;/p&gt;

&lt;p&gt;I ripped out the strict containment checks and replaced them with a nearest-neighbor proximity model. &lt;/p&gt;

&lt;p&gt;For every piece of text, we find its nearest cell center using Euclidean distance. If it's within a 15px threshold, it snaps into place. No more jitter. No more dropped data.&lt;/p&gt;
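&lt;p&gt;Here is a minimal sketch of the snapping idea. It assumes the grid is given as sorted boundary arrays (&lt;code&gt;cols&lt;/code&gt;, &lt;code&gt;rows&lt;/code&gt;), and it measures distance to the nearest cell rectangle (zero when the point is already inside) rather than to the cell center, so that large cells stay within the threshold. The &lt;code&gt;snapToCell&lt;/code&gt; name and signature are hypothetical.&lt;/p&gt;

```javascript
// Snap a point to the nearest grid cell, or return null when nothing is close.
// cols / rows are sorted boundary coordinates (one more entry than cells).
function snapToCell(sx, sy, cols, rows, maxDist = 15) {
  let best = null;
  let minDist = Infinity;
  rows.slice(0, -1).forEach((top, ri) => {
    cols.slice(0, -1).forEach((left, ci) => {
      // Per-axis gap to the cell rectangle; 0 when inside on that axis.
      const dx = Math.max(left - sx, 0, sx - cols[ci + 1]);
      const dy = Math.max(top - sy, 0, sy - rows[ri + 1]);
      const dist = Math.hypot(dx, dy);
      if (minDist > dist) {
        minDist = dist;
        best = { row: ri, col: ci };
      }
    });
  });
  return maxDist > minDist ? best : null;
}
```

&lt;p&gt;With a column boundary at 10.5, a text item at 10.4 is 0.1px outside the first cell, which is well inside the 15px threshold, so it snaps instead of vanishing.&lt;/p&gt;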




&lt;h2&gt;
  
  
  5. The Page Assembler
&lt;/h2&gt;

&lt;p&gt;Finally, the &lt;code&gt;pageAssembler&lt;/code&gt; takes over. &lt;/p&gt;

&lt;p&gt;It receives an array of perfectly classified, non-overlapping regions. It sorts them top-to-bottom based on their Y-coordinates. &lt;/p&gt;

&lt;p&gt;Then, it just iterates through them and calls the right extractor:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If it's a &lt;code&gt;TABLE&lt;/code&gt;, it sends the scoped text to the &lt;code&gt;tableBuilder&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it's a &lt;code&gt;HEADING&lt;/code&gt;, it wraps it in an &lt;code&gt;&amp;lt;h3&amp;gt;&lt;/code&gt; or &lt;code&gt;&amp;lt;h4&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If it's a &lt;code&gt;LIST&lt;/code&gt;, it strips the bullet points and outputs clean &lt;code&gt;&amp;lt;ul&amp;gt;&amp;lt;li&amp;gt;&lt;/code&gt; tags.&lt;/li&gt;
&lt;li&gt;If it's a &lt;code&gt;PARAGRAPH&lt;/code&gt;, it sends it to the &lt;code&gt;textRebuilder&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
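&lt;p&gt;The sort-then-dispatch step can be sketched as follows. Everything here is a stand-in: the region shape (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;) is assumed, and the handlers return simple tag/text records in place of the real &lt;code&gt;tableBuilder&lt;/code&gt; and &lt;code&gt;textRebuilder&lt;/code&gt; output, which this article does not show.&lt;/p&gt;

```javascript
// Hypothetical handlers keyed by region type; each returns a tag/text record.
const handlers = new Map([
  ['TABLE', (r) => ({ tag: 'table', text: r.text })],
  ['HEADING', (r) => ({ tag: 'h3', text: r.text })],
  // Strip a leading bullet glyph before emitting the list item text.
  ['LIST', (r) => ({ tag: 'ul', text: r.text.replace(/^[•\-*]\s*/, '') })],
  ['PARAGRAPH', (r) => ({ tag: 'p', text: r.text })],
]);

function assemblePage(regions) {
  return [...regions]
    .sort((a, b) => a.top - b.top)         // viewport Y grows downward
    .filter((r) => handlers.has(r.type))   // skip types without a handler
    .map((r) => handlers.get(r.type)(r));
}
```

&lt;p&gt;Sorting a copy keeps the classifier's output untouched, and keeping the handlers in a lookup table means a new region type needs one new entry rather than another branch.&lt;/p&gt;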

&lt;p&gt;The result? True document reading order. &lt;/p&gt;

&lt;p&gt;You upload a messy, complex PDF filled with tables, paragraphs, and lists. The pipeline classifies it, scopes the data, and spits out clean, semantically correct HTML. &lt;/p&gt;

&lt;p&gt;No backend processing. No AI hallucination. Just pure, deterministic math running directly in your browser using &lt;code&gt;pdfjs-dist&lt;/code&gt; and vanilla JS. &lt;/p&gt;

&lt;p&gt;The PDF is finally readable.&lt;/p&gt;

&lt;p&gt;You can find the repo at &lt;a href="https://github.com/carnworkstudios/doc-extractor" rel="noopener noreferrer"&gt;doc-extractor&lt;/a&gt; or give it a try at &lt;a href="https://ginexys.com/tools/pdf-processor" rel="noopener noreferrer"&gt;Ginexys&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>pdf</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Cleaning Broken HTML Tables from PDFs, Scrapes, and Legacy Exports in Vanilla JS</title>
      <dc:creator>Bonzai2Carn</dc:creator>
      <pubDate>Fri, 10 Apr 2026 05:30:18 +0000</pubDate>
      <link>https://dev.to/bonzai2carn/cleaning-broken-html-tables-from-pdfs-scrapes-and-legacy-exports-in-vanilla-js-1pfp</link>
      <guid>https://dev.to/bonzai2carn/cleaning-broken-html-tables-from-pdfs-scrapes-and-legacy-exports-in-vanilla-js-1pfp</guid>
      <description>&lt;p&gt;HTML tables are liars. &lt;/p&gt;

&lt;p&gt;If you haven't worked deeply with HTML tables, you might think a table is just a simple 2D array: &lt;code&gt;table[row][col]&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;The moment an HTML table introduces a &lt;code&gt;colspan&lt;/code&gt; or a &lt;code&gt;rowspan&lt;/code&gt;, the visual &lt;code&gt;(x, y)&lt;/code&gt; coordinate of a cell completely detaches from its DOM hierarchy. If row 1 has a cell with &lt;code&gt;colspan="3"&lt;/code&gt;, then the second &lt;code&gt;&amp;lt;td&amp;gt;&lt;/code&gt; in that row is visually in column 4, but programmatically it is &lt;code&gt;children[1]&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;If you try to write a "select column" function by just iterating through &lt;code&gt;tr &amp;gt; td:nth-child(n)&lt;/code&gt;, your highlighting will look like abstract art the second it hits a merged cell.&lt;/p&gt;

&lt;p&gt;I learned that the hard way.&lt;/p&gt;

&lt;p&gt;If you work with scraped tables, PDF exports, legacy system data, or just need to clean up HTML tables before dropping them into a docs platform, this is for you.&lt;/p&gt;

&lt;p&gt;What started as a small utility for cleaning up scraped tables eventually became &lt;strong&gt;&lt;a href="https://ginexys.com/tools/table-formatter/" rel="noopener noreferrer"&gt;TAFNE - Table Formatter and Node Editor&lt;/a&gt;&lt;/strong&gt;, a browser-based Table IDE for reshaping broken tabular data and exporting it into useful formats. The hardest part wasn’t rendering the table. It was teaching the browser how to understand the table the way a human does.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Didn't Work
&lt;/h3&gt;

&lt;p&gt;My first attempt was just checking &lt;code&gt;.prev()&lt;/code&gt; and &lt;code&gt;.next()&lt;/code&gt; and trying to keep a running tally of offset index values. &lt;/p&gt;




&lt;h2&gt;
  
  
  1. The Problem Space
&lt;/h2&gt;

&lt;p&gt;Try to write a function that highlights an entire column when you hover over a table header. If the table represents a perfectly flat 2D array, it’s trivial: loop through every &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt; and add a CSS class to &lt;code&gt;children[colIndex]&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;But what if you are given this table?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;table&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"messy-table"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&lt;/span&gt; &lt;span class="na"&gt;rowspan=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;A&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&lt;/span&gt; &lt;span class="na"&gt;colspan=&lt;/span&gt;&lt;span class="s"&gt;"2"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;B&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;tr&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;C&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;td&amp;gt;&lt;/span&gt;D&lt;span class="nt"&gt;&amp;lt;/td&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/tr&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/table&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visually, this is a 2x3 grid. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Row 1, Col 1 is &lt;code&gt;A&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Row 1, Cols 2 and 3 are both &lt;code&gt;B&lt;/code&gt; (because of &lt;code&gt;colspan&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Row 2, Col 1 is &lt;em&gt;also&lt;/em&gt; &lt;code&gt;A&lt;/code&gt; (because of &lt;code&gt;rowspan&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Row 2, Col 2 is &lt;code&gt;C&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;Row 2, Col 3 is &lt;code&gt;D&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But programmatically? &lt;code&gt;C&lt;/code&gt; is &lt;code&gt;tr[1].children[0]&lt;/code&gt;. It thinks it's in the first column, but visually it sits in the second. &lt;/p&gt;

&lt;p&gt;My initial approach of checking &lt;code&gt;.prev()&lt;/code&gt; and &lt;code&gt;.next()&lt;/code&gt; and keeping a running tally of offset index values was naive. This completely breaks when a cell has both &lt;code&gt;colspan&lt;/code&gt; and &lt;code&gt;rowspan&lt;/code&gt; acting simultaneously, or when consecutive cells in a row have varying spans. The edge cases are endless. &lt;/p&gt;

&lt;p&gt;I needed a topographic map of the DOM, not just a DOM tree.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Solution: The &lt;code&gt;VisualGridMapper&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;To perform complex UI actions like drag-and-drop or matrix transposition on a table, you need to translate the DOM into a strict, predictable Cartesian plane. &lt;/p&gt;

&lt;p&gt;I built a class called the &lt;code&gt;VisualGridMapper&lt;/code&gt;. Its sole job is to walk the table once and build a dense 2D array (&lt;code&gt;grid[row][col]&lt;/code&gt;) that maps absolute visual coordinates back to their origin node.&lt;/p&gt;

&lt;p&gt;Here is a simplified look at the mapping logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VisualGridMapper&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt; &lt;span class="c1"&gt;// 2D array: grid[row][col]&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cellMap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Map&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// DOM Element -&amp;gt; visual properties&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mapTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$table&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nf"&gt;mapTable&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$table&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentRow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nx"&gt;$table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tr&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;rIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRow&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRow&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;
            &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;currentCol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tr&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;td, th&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;cIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;$cell&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;$&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rSpan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;rowspan&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cSpan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parseInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;$cell&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;colspan&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

                &lt;span class="c1"&gt;// EDGE CASE: Skip cells that are already occupied by &lt;/span&gt;
                &lt;span class="c1"&gt;// a rowspan from a previous row&lt;/span&gt;
                &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRow&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;currentCol&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="nx"&gt;currentCol&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;

                &lt;span class="c1"&gt;// Record the origin node&lt;/span&gt;
                &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cellData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="na"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="na"&gt;isOrigin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// This is the actual DOM node&lt;/span&gt;
                    &lt;span class="na"&gt;startRow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentRow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="na"&gt;startCol&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;currentCol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="na"&gt;rowspan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;rSpan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="na"&gt;colspan&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cSpan&lt;/span&gt;
                &lt;span class="p"&gt;};&lt;/span&gt;

                &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cellMap&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cellData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

                &lt;span class="c1"&gt;// Fill the physical space in our 2D array&lt;/span&gt;
                &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;rSpan&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;cSpan&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRow&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRow&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

                        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grid&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;currentRow&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="nx"&gt;currentCol&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                            &lt;span class="na"&gt;element&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;cell&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="na"&gt;isOrigin&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;c&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="p"&gt;};&lt;/span&gt;
                    &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="nx"&gt;currentCol&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;cSpan&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="p"&gt;});&lt;/span&gt;
            &lt;span class="nx"&gt;currentRow&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Handling the "Ghost Cell" Edge Case
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;while (this.grid[currentRow][currentCol])&lt;/code&gt; loop is the crucial edge case handler. &lt;br&gt;
As the parser moves through a &lt;code&gt;&amp;lt;tr&amp;gt;&lt;/code&gt;, it checks the map to see if the current visual column is already physically occupied by an element from a row &lt;em&gt;above&lt;/em&gt; it stretching down. If it is, the pointer advances silently, bumping the current row's children to the right so they align with their true visual placement.&lt;/p&gt;
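
&lt;p&gt;To watch the skip in isolation, here is a minimal DOM-free sketch. The &lt;code&gt;buildGrid&lt;/code&gt; and &lt;code&gt;range&lt;/code&gt; helpers and the &lt;code&gt;{ id, rowspan, colspan }&lt;/code&gt; row shape are hypothetical stand-ins for the jQuery traversal above, not code from TAFNE itself.&lt;/p&gt;

```javascript
// Hypothetical helper: [0, 1, ..., n - 1].
const range = (n) => Array.from({ length: n }, (_, i) => i);

// DOM-free stand-in for mapTable: each row is an array of
// { id, rowspan, colspan } objects in source order (an assumed input shape).
function buildGrid(rows) {
    const grid = [];
    rows.forEach((cells, rowIndex) => {
        if (!grid[rowIndex]) grid[rowIndex] = [];
        let col = 0;
        cells.forEach((cell) => {
            const rowspan = cell.rowspan || 1;
            const colspan = cell.colspan || 1;
            // Ghost-cell skip: move past slots already claimed by a rowspan
            // that started in an earlier row.
            while (grid[rowIndex][col]) col++;
            for (const r of range(rowspan)) {
                if (!grid[rowIndex + r]) grid[rowIndex + r] = [];
                for (const c of range(colspan)) {
                    grid[rowIndex + r][col + c] = {
                        id: cell.id,
                        // Both offsets are zero only at the top-left slot.
                        isOrigin: r + c === 0
                    };
                }
            }
            col += colspan;
        });
    });
    return grid;
}

// Row 0: A spans two rows, then B. Row 1 lists only C in the markup.
const ghostDemo = buildGrid([
    [{ id: 'A', rowspan: 2 }, { id: 'B' }],
    [{ id: 'C' }]
]);
const ghostIds = ghostDemo.map((row) => row.map((slot) => slot.id));
console.log(JSON.stringify(ghostIds)); // [["A","B"],["A","C"]]
```

&lt;p&gt;Even though C is the first child of the second row in the markup, it lands in visual column 1 because A's ghost already occupies column 0.&lt;/p&gt;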

&lt;h3&gt;
  
  
  The Letdown that Became a Superpower
&lt;/h3&gt;

&lt;p&gt;Building this mapping layer was tedious. But once it existed, something amazing happened: &lt;strong&gt;complex table mutations fell out for free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Want to transpose a table? I didn't need to write complex DOM-shuffling logic. I just ran a standard matrix transpose on my &lt;code&gt;VisualGridMapper&lt;/code&gt; array (&lt;code&gt;[row][col]&lt;/code&gt; becomes &lt;code&gt;[col][row]&lt;/code&gt;) and swapped each cell's &lt;code&gt;rowspan&lt;/code&gt; and &lt;code&gt;colspan&lt;/code&gt; values. Merging and splitting cells work the same way: every table mutation is now a matrix problem, with no need to worry about sequentially re-rendering the DOM. Linear algebra solved the UI problem.&lt;/p&gt;
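
&lt;p&gt;A rough sketch of that transpose, assuming the grid is rectangular and each origin record carries the same fields as &lt;code&gt;cellData&lt;/code&gt; above. The &lt;code&gt;transposeGrid&lt;/code&gt; and &lt;code&gt;transposeCell&lt;/code&gt; names are hypothetical and the slots are plain strings for brevity, so treat this as an illustration and not TAFNE's actual implementation.&lt;/p&gt;

```javascript
// Hypothetical sketch: flip a rectangular grid so [row][col] becomes [col][row].
function transposeGrid(grid) {
    const width = Math.max(0, ...grid.map((row) => row.length));
    return Array.from({ length: width }, (_, col) =>
        grid.map((row) => row[col])
    );
}

// Swap the coordinates and spans of one origin-cell record.
function transposeCell(cellData) {
    return {
        ...cellData,
        startRow: cellData.startCol,
        startCol: cellData.startRow,
        rowspan: cellData.colspan,
        colspan: cellData.rowspan
    };
}

// A spans two columns and B spans two rows in this 2 x 3 layout.
const wideGrid = [
    ['A', 'A', 'B'],
    ['C', 'D', 'B']
];
const tallGrid = transposeGrid(wideGrid);
console.log(JSON.stringify(tallGrid)); // [["A","C"],["A","D"],["B","B"]]

const cellA = { id: 'A', startRow: 0, startCol: 0, rowspan: 1, colspan: 2 };
const flippedA = transposeCell(cellA);
console.log(flippedA.rowspan, flippedA.colspan); // 2 1
```

&lt;p&gt;After the flip, A occupies two rows of one column, which is exactly what swapping its &lt;code&gt;rowspan&lt;/code&gt; and &lt;code&gt;colspan&lt;/code&gt; describes.&lt;/p&gt;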




&lt;h2&gt;
  
  
  3. Why This Tool Exists
&lt;/h2&gt;

&lt;p&gt;TAFNE was built specifically for developers, data analysts, and technical writers: people who deal with messy tabular data and need a cleaner way to work with it.&lt;/p&gt;

&lt;p&gt;You paste or load CSV, ASCII, plain-text, or HTML table data, and TAFNE uses the resulting &lt;code&gt;VisualGridMapper&lt;/code&gt; grid to generate multiple formats directly in an embedded &lt;strong&gt;Monaco Editor&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It currently supports exports like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Markdown, for GitHub READMEs and docs.&lt;/li&gt;
&lt;li&gt;JSON, for structured data pipelines or API work.&lt;/li&gt;
&lt;li&gt;HTML, for clean table output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SQL&lt;/strong&gt;, which became the most useful export for me. Paste in a messy CSV and the tool can infer headers, generate a &lt;code&gt;CREATE TABLE&lt;/code&gt; statement, and produce the corresponding &lt;code&gt;INSERT INTO&lt;/code&gt; statements with escaped values.&lt;/li&gt;
&lt;/ul&gt;
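
&lt;p&gt;The SQL export can be approximated in a few lines. The sketch below is illustrative only: &lt;code&gt;toSql&lt;/code&gt;, &lt;code&gt;sqlLiteral&lt;/code&gt;, and &lt;code&gt;sqlIdentifier&lt;/code&gt; are hypothetical names, every column is typed &lt;code&gt;TEXT&lt;/code&gt; instead of being inferred, and the first row is assumed to be the header. TAFNE's real emitter may differ.&lt;/p&gt;

```javascript
// Escape a value as a SQL string literal by doubling single quotes.
// Missing values become NULL (an assumption made for this sketch).
function sqlLiteral(value) {
    if (value === null || value === undefined) return 'NULL';
    return `'${String(value).replace(/'/g, "''")}'`;
}

// Quote an identifier with double quotes, doubling any embedded quote.
function sqlIdentifier(name) {
    return `"${String(name).trim().replace(/"/g, '""')}"`;
}

// rows[0] is treated as the header row; every column is typed TEXT here.
function toSql(tableName, rows) {
    const [headers, ...records] = rows;
    const table = sqlIdentifier(tableName);
    const columns = headers.map(sqlIdentifier);
    const columnDefs = columns.map((column) => `${column} TEXT`).join(', ');
    const create = `CREATE TABLE ${table} (${columnDefs});`;
    const inserts = records.map((record) => {
        const values = headers.map((_, i) => sqlLiteral(record[i])).join(', ');
        return `INSERT INTO ${table} (${columns.join(', ')}) VALUES (${values});`;
    });
    return [create, ...inserts].join('\n');
}

const sqlDemo = toSql('people', [
    ['name', 'note'],
    ["O'Brien", 'ok'],
    ['Ann'] // short row: the missing cell becomes NULL
]);
console.log(sqlDemo);
```

&lt;p&gt;Doubling single quotes follows standard SQL string-literal escaping. Dialects that also treat backslashes as escape characters need extra handling, and parameterized queries remain the safer route when the data goes straight into a live database.&lt;/p&gt;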

&lt;p&gt;You can go from a mangled PDF scrape to a populated database backend in about 8 seconds, without writing a single line of backend parsing logic.&lt;/p&gt;

&lt;p&gt;I'm still working on more import and export formats, such as LaTeX and Excel. You can support the development of TAFNE by checking out the project on &lt;a href="https://github.com/carnworkstudios/TAFNE" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. The Architecture Choice
&lt;/h2&gt;

&lt;p&gt;The entire editor is built with vanilla JavaScript and jQuery.&lt;/p&gt;

&lt;p&gt;That wasn’t a nostalgic decision. It came out of the constraints of the tool itself.&lt;/p&gt;

&lt;p&gt;I wanted the simplest possible setup: something you could open locally, run without a build step, and use without sending data to a backend. For a tool that may handle financial tables, internal reports, or scraped documents, local-first matters. The data should stay on the machine.&lt;/p&gt;

&lt;p&gt;There was also a more practical reason: the DOM is already the thing I was trying to control.&lt;/p&gt;

&lt;p&gt;For this kind of table manipulation, I didn’t want to constantly translate between a virtual state model and the browser’s actual structure. The table itself is the structure. So instead of forcing the problem into a framework-shaped box, I let the browser do what it was already good at, and used the mapper only when I needed to reason about the table mathematically.&lt;/p&gt;

&lt;p&gt;That choice came with tradeoffs, of course.&lt;/p&gt;

&lt;p&gt;Without framework lifecycles, I had to be much more disciplined about cleanup. Event handlers had to be namespaced carefully. Re-rendering meant I had to think hard about stale listeners. Undo and redo also took more manual work, because I couldn’t lean on immutable state patterns to do the bookkeeping for me.&lt;/p&gt;

&lt;p&gt;But the tradeoff felt worth it for this project.&lt;/p&gt;
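
&lt;p&gt;To make the manual undo/redo bookkeeping concrete, here is one common way to do it without immutable state: a bounded stack of serialized snapshots. The &lt;code&gt;SnapshotHistory&lt;/code&gt; class, its &lt;code&gt;limit&lt;/code&gt; parameter, and the idea of storing whole-table strings are assumptions for illustration, not a description of how TAFNE does it.&lt;/p&gt;

```javascript
// Hypothetical snapshot history: stores serialized table states (for example
// the table's outerHTML) and moves a cursor back and forth between them.
class SnapshotHistory {
    constructor(initialState, limit = 50) {
        this.states = [initialState];
        this.cursor = 0;
        this.limit = limit;
    }

    // Record a new state and discard any redo branch beyond the cursor.
    push(state) {
        this.states = this.states.slice(0, this.cursor + 1);
        this.states.push(state);
        // Drop the oldest snapshots once the limit is exceeded.
        if (this.states.length > this.limit) {
            this.states = this.states.slice(this.states.length - this.limit);
        }
        this.cursor = this.states.length - 1;
    }

    // Step back one snapshot, stopping at the oldest one.
    undo() {
        if (this.cursor > 0) this.cursor--;
        return this.states[this.cursor];
    }

    // Step forward one snapshot, stopping at the newest one.
    redo() {
        if (this.states.length - 1 > this.cursor) this.cursor++;
        return this.states[this.cursor];
    }
}

const history = new SnapshotHistory('v1');
history.push('v2');
history.push('v3');
const afterUndo = history.undo();     // 'v2'
const afterRedo = history.redo();     // 'v3'
history.undo();
history.push('v2b');                  // replaces the 'v3' redo branch
const afterBranch = history.redo();   // 'v2b' (nothing newer to redo)
console.log(afterUndo, afterRedo, afterBranch);
```

&lt;p&gt;Whole-table snapshots trade memory for simplicity, which is why the sketch caps the stack with &lt;code&gt;limit&lt;/code&gt;.&lt;/p&gt;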




&lt;h2&gt;
  
  
  5. What I Learned
&lt;/h2&gt;

&lt;p&gt;The biggest lesson was that HTML tables are more than markup. If you want to make them editable, mergeable, splittable, or transposable, you need to stop treating them like a flat list and start treating them like a coordinate system.&lt;/p&gt;

&lt;p&gt;That change in perspective unlocked the whole engine.&lt;/p&gt;

&lt;p&gt;I didn’t begin with a grand plan to build a visual table IDE. I started with a broken problem, tried a few awkward fixes, and eventually found that the cleanest solution was to map the DOM into a visual grid first, then operate on that model instead of fighting the browser directly.&lt;/p&gt;

&lt;p&gt;That’s usually how these tools come together: not through one elegant insight, but through a series of small, stubborn corrections until the structure finally makes sense.&lt;/p&gt;

&lt;p&gt;The SQL emitter and the &lt;code&gt;VisualGridMapper&lt;/code&gt; are both open source on GitHub: &lt;a href="https://github.com/carnworkstudios/TAFNE" rel="noopener noreferrer"&gt;carnworkstudios/TAFNE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I'd genuinely like feedback on the type inference logic. If you've solved similar problems differently, tell me in the comments or open an issue on the repo.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>html</category>
    </item>
  </channel>
</rss>
