Bonzai2Carn

Cleaning Broken HTML Tables from PDFs, Scrapes, and Legacy Exports in Vanilla JS

HTML tables are liars.

If you haven't worked deeply with HTML tables, you might think a table is just a simple 2D array: table[row][col].

The moment an HTML table introduces a colspan or a rowspan, the visual (x, y) coordinate of a cell detaches from its position in the DOM. If the first cell in row 1 has colspan="3", then the second <td> in that row sits visually in column 4, but programmatically it is still children[1].

If you try to write a "select column" function by just iterating through tr > td:nth-child(n), your highlighting will look like abstract art the second it hits a merged cell.
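To make that drift concrete, here is a minimal sketch that computes where each cell of a single row lands visually when only colspan is in play. The visualColumns helper and its plain-number input are illustrative assumptions, not part of TAFNE, and rowspan is deliberately ignored here.

```javascript
// Hypothetical helper: given the colspan of each cell in one row (DOM order),
// return the zero-based visual column where each cell starts.
// Assumes no rowspan from an earlier row reaches into this row.
function visualColumns(colspans) {
    const starts = [];
    let col = 0;
    for (const span of colspans) {
        starts.push(col);
        col += Math.max(1, span || 1); // missing or invalid spans count as 1
    }
    return starts;
}

// The first cell has colspan="3", so the second cell (DOM index 1)
// starts at visual index 3, which is the fourth column.
console.log(visualColumns([3, 1, 1])); // [ 0, 3, 4 ]
```

An nth-child selector only ever sees the DOM index, which is why it drifts as soon as one of those spans is larger than 1.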

I learned that the hard way.

If you work with scraped tables, PDF exports, legacy system data, or just need to clean up HTML tables before dropping them into a docs platform, this is for you.

What started as a small utility for cleaning up scraped tables eventually became TAFNE (Table Formatter and Node Editor), a browser-based Table IDE for reshaping broken tabular data and exporting it into useful formats. The hardest part wasn’t rendering the table. It was teaching the browser to understand the table the way a human does.

What Didn't Work

My first attempt was just checking .prev() and .next() and trying to keep a running tally of offset index values.


1. The Problem Space

Try to write a function that highlights an entire column when you hover over a table header. If the table represents a perfectly flat 2D array, it’s trivial: loop through every <tr> and add a CSS class to children[colIndex] (children rather than childNodes, because childNodes also counts the whitespace text nodes between cells).

But what if you are given this table?

<table id="messy-table">
  <tr>
    <td rowspan="2">A</td>
    <td colspan="2">B</td>
  </tr>
  <tr>
    <td>C</td>
    <td>D</td>
  </tr>
</table>

Visually, this is a grid of 2 rows by 3 columns.

  • Row 1, Col 1 is A
  • Row 1, Cols 2 and 3 are B (because of colspan)
  • Row 2, Col 1 is also A (because of rowspan)
  • Row 2, Col 2 is C
  • Row 2, Col 3 is D

But programmatically? C is tr[1].children[0]. The DOM puts it first in its row, but visually it sits in the second column.

Checking .prev() and .next() while keeping a running tally of offsets was naive. It breaks as soon as a cell has both a colspan and a rowspan, or when consecutive cells in a row have different spans. The edge cases are endless.

I needed a topographic map of the DOM, not just a DOM tree.


2. The Solution: The VisualGridMapper

To perform complex UI actions like drag-and-drop or matrix transposition on a table, you need to translate the DOM into a strict, predictable Cartesian plane.

I built a class called the VisualGridMapper. Its sole job is to walk the table once and build a dense 2D array (grid[row][col]) that maps absolute visual coordinates back to their origin node.

Here is a simplified look at the mapping logic:

class VisualGridMapper {
    constructor($table) {
        this.grid = []; // 2D array: grid[row][col]
        this.cellMap = new Map(); // DOM Element -> visual properties
        this.mapTable($table);
    }

    mapTable($table) {
        let currentRow = 0;

        $table.find('tr').each((rIndex, tr) => {
            if (!this.grid[currentRow]) this.grid[currentRow] = [];
            let currentCol = 0;

            $(tr).find('td, th').each((cIndex, cell) => {
                const $cell = $(cell);
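                // Explicit radix; missing, zero, negative, or invalid spans fall back to 1
                const rSpan = Math.max(1, parseInt($cell.attr('rowspan'), 10) || 1);
                const cSpan = Math.max(1, parseInt($cell.attr('colspan'), 10) || 1);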

                // EDGE CASE: Skip cells that are already occupied by 
                // a rowspan from a previous row
                while (this.grid[currentRow][currentCol]) {
                    currentCol++;
                }

                // Record the origin node
                const cellData = {
                    element: cell,
                    isOrigin: true, // This is the actual DOM node
                    startRow: currentRow,
                    startCol: currentCol,
                    rowspan: rSpan,
                    colspan: cSpan
                };

                this.cellMap.set(cell, cellData);

                // Fill the physical space in our 2D array
                for (let r = 0; r < rSpan; r++) {
                    for (let c = 0; c < cSpan; c++) {
                        if (!this.grid[currentRow + r]) this.grid[currentRow + r] = [];

                        this.grid[currentRow + r][currentCol + c] = {
                            element: cell,
                            isOrigin: (r === 0 && c === 0)
                        };
                    }
                }
                currentCol += cSpan;
            });
            currentRow++;
        });
    }
}

Handling the "Ghost Cell" Edge Case

The while (this.grid[currentRow][currentCol]) loop is the crucial edge case handler.
As the parser moves through a <tr>, it checks the map to see if the current visual column is already physically occupied by an element from a row above it stretching down. If it is, the pointer advances silently, bumping the current row's children to the right so they align with their true visual placement.
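To watch the ghost-cell skip happen without a browser, here is a DOM-free sketch of the same walk, run against the messy table from section 1. The plain-object input shape and the buildGrid and cellsInColumn names are assumptions made for this demo, not TAFNE's actual API.

```javascript
// DOM-free sketch of the same walk. Each row is an array of plain objects
// ({ label, rowspan, colspan }); this input shape is an assumption for the demo.
function buildGrid(rows) {
    const grid = [];
    rows.forEach((cells, r) => {
        if (!grid[r]) grid[r] = [];
        let col = 0;
        for (const cell of cells) {
            const rSpan = cell.rowspan || 1;
            const cSpan = cell.colspan || 1;
            // Skip slots already claimed by a rowspan from a row above
            while (grid[r][col]) col++;
            for (let dr = 0; dr < rSpan; dr++) {
                if (!grid[r + dr]) grid[r + dr] = [];
                for (let dc = 0; dc < cSpan; dc++) {
                    grid[r + dr][col + dc] = { cell, isOrigin: dr === 0 && dc === 0 };
                }
            }
            col += cSpan;
        }
    });
    return grid;
}

// Hypothetical column lookup: every distinct cell touching visual column `col`
function cellsInColumn(grid, col) {
    const slots = grid.map(row => row[col]).filter(Boolean);
    return [...new Set(slots.map(slot => slot.cell))];
}

// The messy table from section 1
const messy = [
    [{ label: 'A', rowspan: 2 }, { label: 'B', colspan: 2 }],
    [{ label: 'C' }, { label: 'D' }],
];
const grid = buildGrid(messy);
console.log(grid.map(row => row.map(slot => slot.cell.label)));
// [ [ 'A', 'B', 'B' ], [ 'A', 'C', 'D' ] ]
console.log(cellsInColumn(grid, 1).map(c => c.label)); // [ 'B', 'C' ]
```

When the walk reaches the second row, slot 0 is already occupied by A, so C is bumped to visual column 1. The column lookup is the hover-highlight problem from section 1: it returns each element once, even when a span covers several slots.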

The Letdown that Became a Superpower

Building this mapping layer was tedious. But once it existed, something amazing happened: complex table mutations fell out for free.

Want to transpose a table? I didn't need to write complex DOM-shuffling logic. I just ran a standard matrix transpose on my VisualGridMapper array ([row][col] becomes [col][row]) and swapped each cell's rowspan and colspan values. Transposing, merging cells, splitting cells: all table mutations are now matrix problems. No worries about the complexities of sequentially re-rendering the DOM. Linear algebra solved the UI problem.
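As a rough sketch of that transpose, assume each origin cell is stored with the same fields as cellData in the mapper (startRow, startCol, rowspan, colspan). The transposeCells and rasterize names are hypothetical, and rasterize exists only to print the result as a dense grid of labels.

```javascript
// Sketch: transpose a table described by its origin cells in visual coordinates.
function transposeCells(cells) {
    return cells.map(c => ({
        label: c.label,
        startRow: c.startCol,   // [row][col] becomes [col][row]
        startCol: c.startRow,
        rowspan: c.colspan,     // spans swap along with the axes
        colspan: c.rowspan,
    }));
}

// Paint cells back into a dense 2D array of labels so the result is visible
function rasterize(cells) {
    const grid = [];
    for (const c of cells) {
        for (let r = c.startRow; r < c.startRow + c.rowspan; r++) {
            if (!grid[r]) grid[r] = [];
            for (let col = c.startCol; col < c.startCol + c.colspan; col++) {
                grid[r][col] = c.label;
            }
        }
    }
    return grid;
}

// The messy table from section 1, as origin cells
const cells = [
    { label: 'A', startRow: 0, startCol: 0, rowspan: 2, colspan: 1 },
    { label: 'B', startRow: 0, startCol: 1, rowspan: 1, colspan: 2 },
    { label: 'C', startRow: 1, startCol: 1, rowspan: 1, colspan: 1 },
    { label: 'D', startRow: 1, startCol: 2, rowspan: 1, colspan: 1 },
];
console.log(rasterize(cells));
// [ [ 'A', 'B', 'B' ], [ 'A', 'C', 'D' ] ]
console.log(rasterize(transposeCells(cells)));
// [ [ 'A', 'A' ], [ 'B', 'C' ], [ 'B', 'D' ] ]
```

Writing the result back to HTML then comes down to sorting the origin cells by startRow and startCol and emitting one <tr> per row.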


3. Why This Tool Exists

TAFNE was built specifically for developers, data analysts, and technical writers: people who deal with messy tabular data and need a cleaner way to work with it.

You paste or load CSV, ASCII, plain text, or HTML, and TAFNE runs it through the VisualGridMapper and generates multiple formats directly in an embedded Monaco Editor.

It currently supports exports like:

  • Markdown, for GitHub READMEs and docs.
  • JSON, for structured data pipelines or API work.
  • HTML, for clean table output.
  • SQL, which became the most useful export for me. Paste in a messy CSV and the tool can infer headers, generate a CREATE TABLE statement, and produce the corresponding INSERT INTO statements with escaped values.
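The SQL export can be sketched roughly like this. This is not TAFNE's emitter: it assumes the first row holds the headers, types every column as TEXT instead of inferring types, treats empty strings as NULL, and escapes values by doubling single quotes. The toSql, sanitizeIdent, and sqlValue names are hypothetical.

```javascript
// Turn a header into a safe identifier: letters, digits, underscores only
function sanitizeIdent(name) {
    const cleaned = String(name).trim().replace(/[^A-Za-z0-9_]+/g, '_').toLowerCase();
    // Prefix names that are empty or would start with a digit
    return /^[0-9]/.test(cleaned) || cleaned === '' ? `col_${cleaned}` : cleaned;
}

// Quote a value as a SQL string literal; empty values become NULL (an assumption)
function sqlValue(value) {
    if (value === null || value === undefined || value === '') return 'NULL';
    return `'${String(value).replace(/'/g, "''")}'`;
}

// rows[0] is treated as the header row; every column is typed TEXT
function toSql(tableName, rows) {
    const [headers, ...data] = rows;
    const table = sanitizeIdent(tableName);
    const cols = headers.map(sanitizeIdent);
    const create = `CREATE TABLE ${table} (\n${cols.map(c => `    ${c} TEXT`).join(',\n')}\n);`;
    const inserts = data.map(row => {
        const values = cols.map((_, i) => sqlValue(row[i])).join(', ');
        return `INSERT INTO ${table} (${cols.join(', ')}) VALUES (${values});`;
    });
    return [create, ...inserts].join('\n');
}

console.log(toSql('parts', [
    ['Part No', 'Description'],
    ['A-100', "Driver's side bracket"],
    ['B-200', ''],
]));
```

The apostrophe in the second row comes out as Driver''s, and the empty description becomes NULL. Duplicate header names and reserved words are not handled in this sketch.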

You can go from a mangled PDF scrape to a populated database backend in about 8 seconds, without writing a single line of backend parsing logic.

I'm still working on more import and export formats, such as LaTeX and Excel. You can support the development of TAFNE by checking out the repo on GitHub.

4. The Architecture Choice

The entire editor is built with Vanilla JavaScript and jQuery.

That wasn’t a nostalgic decision. It came out of the constraints of the tool itself.

I wanted the simplest possible setup: something you could open locally, run without a build step, and use without sending data to a backend. For a tool that may handle financial tables, internal reports, or scraped documents, local-first matters. The data should stay on the machine.

There was also a more practical reason: the DOM is already the thing I was trying to control.

For this kind of table manipulation, I didn’t want to constantly translate between a virtual state model and the browser’s actual structure. The table itself is the structure. So instead of forcing the problem into a framework-shaped box, I let the browser do what it was already good at, and used the mapper only when I needed to reason about the table mathematically.

That choice came with tradeoffs, of course.

Without framework lifecycles, I had to be much more disciplined about cleanup. Event handlers had to be namespaced carefully. Re-rendering meant I had to think hard about stale listeners. Undo and redo also took more manual work, because I couldn’t lean on immutable state patterns to do the bookkeeping for me.
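To give a sense of that manual bookkeeping, here is a minimal sketch of a snapshot-based undo/redo history. Storing the table's serialized HTML as the snapshot is an assumption made for illustration, and the SnapshotHistory class and its method names are hypothetical rather than taken from TAFNE.

```javascript
// Sketch of a manual undo/redo history built on whole-state snapshots
class SnapshotHistory {
    constructor(initial, limit = 50) {
        this.past = [];          // snapshots before the current state
        this.present = initial;  // the state currently on screen
        this.future = [];        // undone snapshots, available for redo
        this.limit = limit;      // cap on stored undo steps
    }

    // Record a new state; any redo branch is discarded
    push(snapshot) {
        this.past.push(this.present);
        if (this.past.length > this.limit) this.past.shift();
        this.present = snapshot;
        this.future = [];
    }

    undo() {
        if (this.past.length === 0) return this.present;
        this.future.push(this.present);
        this.present = this.past.pop();
        return this.present;
    }

    redo() {
        if (this.future.length === 0) return this.present;
        this.past.push(this.present);
        this.present = this.future.pop();
        return this.present;
    }
}

const undoStack = new SnapshotHistory('<table>v1</table>');
undoStack.push('<table>v2</table>');
undoStack.push('<table>v3</table>');
console.log(undoStack.undo()); // <table>v2</table>
console.log(undoStack.undo()); // <table>v1</table>
console.log(undoStack.redo()); // <table>v2</table>
```

Restoring a snapshot replaces the table's DOM, which is where the stale-listener problem comes from. jQuery's event namespaces help here: handlers bound with .on('click.someNamespace', ...) can all be removed in one call with .off('.someNamespace') before the table is rebuilt.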

But the tradeoff felt worth it for this project.


5. What I Learned

The biggest lesson was that HTML tables are more than markup. If you want to make them editable, mergeable, split-able, or transposable, you need to stop treating them like a flat list and start treating them like a coordinate system.

That change in perspective unlocked the whole engine.

I didn’t begin with a grand plan to build a visual table IDE. I started with a broken problem, tried a few awkward fixes, and eventually found that the cleanest solution was to map the DOM into a visual grid first, then operate on that model instead of fighting the browser directly.

That’s usually how these tools come together: not through one elegant insight, but through a series of small, stubborn corrections until the structure finally makes sense.

The SQL emitter and the VisualGridMapper are both open source on GitHub: carnworkstudios/TAFNE.

I'd genuinely like feedback on the type inference logic. If you've solved similar problems differently, tell me in the comments or open an issue on the repo.
