DEV Community: Bonzai2Carn

Beyond the Naive Regex: Proper PDF Font Style Extraction

Bonzai2Carn — Tue, 14 Jul 2026 13:47:42 +0000

TLDR: Build a fontStyleMap from page.commonObjs in the geometry worker. Each font name resolves to {bold, italic} flags from the actual parsed font descriptor. Merge onto textMeta items. Check transform[2] (shear component) for synthetic italic. textRebuilder wraps styled runs in <strong>/<em>/<u>.

Repo: tools/pdf-processor

The Problem

PDF.js gives each text item a fontName string. These look like ABCDEF+TimesNewRomanPS-BoldMT, which is a 6-character subset prefix followed by a PostScript-style variant name.

The prefix changes on every export. You cannot reliably regex-match the family before stripping it.

The Naive Attempt

Strip the prefix, then regex-match the remainder:

/bold|heavy|black/i.test(name.replace(/^[A-Z]{6}\+/, ''))

This works for well-named fonts. It fails silently for synthetic fonts named Font12 or F1, fonts with non-English variant names, and PDFs where the exporter normalized the font name.

page.commonObjs: The Authoritative Source

PDF.js exposes parsed font objects through page.commonObjs. Each font object has .bold and .italic boolean properties computed from the font's actual glyph metrics and descriptor, not its name. This is the ground truth.

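const fontStyleMap = {};
const uniqueFontNames = [...new Set(textContent.items.map(i => i.fontName).filter(Boolean))];
for (const fn of uniqueFontNames) {
    // commonObjs.get() throws on an id that is not resolved yet, so guard with has().
    // Fonts are resolved once the page's operator list has been built.
    if (!page.commonObjs.has(fn)) continue;
    const obj = page.commonObjs.get(fn);
    if (!obj) continue;
    // Drop the 6-letter subset prefix before falling back to name matching.
    const cleaned = (obj.name || fn).replace(/^[A-Z]{6}\+/, '');
    fontStyleMap[fn] = {
        bold:   !!obj.bold   || /bold|heavy|black/i.test(cleaned),
        italic: !!obj.italic || /italic|oblique|slanted/i.test(cleaned),
    };
}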

The || fallback: trust the font object first, fall back to the cleaned name for fonts where .bold is not set by the parser.

Synthetic Italic: The Shear Transform

Some PDFs produce italic-looking text by applying a shear matrix to an upright font rather than loading an actual italic variant.

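The text item's transform array is [a, b, c, d, e, f]. The c component is horizontal shear. On unrotated text (b ≈ 0), a non-zero c means the glyphs are slanted. Rotated text also has a non-zero c, so the check excludes it.

const [, b, c] = item.transform;
// Rotation changes b and c together; a pure shear leaves b at 0.
const syntheticItalic = Math.abs(b) < 0.01 && Math.abs(c) > 0.01;
if (syntheticItalic) meta.italic = true;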

This catches faux italic rendering that the font object alone would miss.
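A fixed threshold on c is in scaled units, so its meaning depends on the font size. A minimal sketch of an angle-based variant follows; detectSyntheticItalic and its 3° default are hypothetical, not part of the pipeline. It assumes an unrotated baseline has b close to 0, and it treats rotated text as not italic.

```javascript
// Hypothetical helper: classify a text-item transform [a, b, c, d, e, f].
// Assumption: rotation moves b and c together (b = -c), while a pure
// shear leaves b near 0 and only changes c.
function detectSyntheticItalic(transform, minSlantDeg = 3) {
    const [a, b, c, d] = transform;
    const scale = Math.hypot(a, b) || 1;           // horizontal scale, roughly the font size
    const rotated = Math.abs(b) / scale > 0.01;    // tilted baseline, so not a plain shear
    if (rotated) return false;
    // Slant of the glyph's vertical axis relative to upright, in degrees.
    const slantDeg = Math.atan2(Math.abs(c), Math.abs(d)) * 180 / Math.PI;
    return slantDeg >= minSlantDeg && slantDeg < 45;
}
```

Measuring the slant in degrees keeps the threshold independent of font size: a 12° oblique passes at both 8pt and 24pt.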

Underlines: Vector Segment Pairing

Underlines in PDFs are separate vector line segments drawn beneath text, not a font property. ctmAdapter.js classifies horizontal segments with a Y position within ~0.35× font size below a text item as underlines. The textMeta item receives underlined: true.
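The pairing rule can be sketched as below. This is not the code in ctmAdapter.js: the item and segment shapes, the function names, and the 80% horizontal coverage requirement are assumptions made for illustration. Only the 0.35× font size drop comes from the description above. Coordinates are assumed to be in viewport space, where y grows downward.

```javascript
// Hypothetical shapes: item = { x, width, baselineY, fontSize },
// segment = { x1, y1, x2, y2 }. Viewport space: y grows downward.
function isUnderlineFor(item, seg, maxDropRatio = 0.35) {
    const horizontal = Math.abs(seg.y2 - seg.y1) < 0.5;
    if (!horizontal) return false;
    const drop = seg.y1 - item.baselineY;            // > 0 means below the baseline
    if (drop < 0 || drop > item.fontSize * maxDropRatio) return false;
    // Assumption: the segment must cover most of the item's width.
    const left = Math.max(item.x, Math.min(seg.x1, seg.x2));
    const right = Math.min(item.x + item.width, Math.max(seg.x1, seg.x2));
    return right - left >= item.width * 0.8;
}

// Sets underlined: true on every item that has a matching segment.
function markUnderlines(textMeta, hSegs) {
    for (const item of textMeta) {
        if (hSegs.some(seg => isUnderlineFor(item, seg))) item.underlined = true;
    }
    return textMeta;
}
```

Requiring the segment to sit below the baseline keeps strikethrough lines and table rules above the text from being classified as underlines.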

Propagating to the Renderer

Flags flow: geometryWorker → textMeta → _scopeItems → textItems passed to textRebuilder. The rebuilder groups consecutive same-style items into runs:

function _wrapInlineStyle(text, style) {
    let html = _escHtml(text);
    if (style.underlined) html = `<u>${html}</u>`;
    if (style.italic)     html = `<em>${html}</em>`;
    if (style.bold)       html = `<strong>${html}</strong>`;
    return html;
}

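Nesting order: each wrap encloses the previous one, so underline ends up innermost and bold outermost, as in <strong><em><u>text</u></em></strong>. Browsers render any nesting order of these three tags identically; a fixed order just keeps the output consistent.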
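The grouping step that feeds the wrapper can be sketched as follows. The item shape { str, bold, italic, underlined } and the names styleKey and groupStyleRuns are assumptions for illustration; the real textRebuilder may group differently.

```javascript
// Hypothetical item shape: { str, bold, italic, underlined }.
// Builds a 3-character key such as "100" for bold-only text.
function styleKey(item) {
    return `${+!!item.bold}${+!!item.italic}${+!!item.underlined}`;
}

// Merge consecutive items that share the same style into one run.
function groupStyleRuns(items) {
    const runs = [];
    for (const item of items) {
        const key = styleKey(item);
        const last = runs[runs.length - 1];
        if (last && last.key === key) {
            last.text += item.str;
        } else {
            runs.push({
                key,
                text: item.str,
                style: { bold: !!item.bold, italic: !!item.italic, underlined: !!item.underlined },
            });
        }
    }
    return runs;
}
```

Each run's text and style can then be passed to _wrapInlineStyle, so a styled phrase split across several text items produces one pair of tags instead of one pair per item.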

Result

A line like WARNING Do not proceed without reading the SAFETY section in a PDF might produce:

<strong>WARNING</strong> Do not proceed without reading the <em>SAFETY</em> section

No OCR pass. No ML font classifier. Just font metadata that was already inside the PDF.

Diagnosing a Failing PDF Extraction Pipeline

Bonzai2Carn — Sat, 11 Jul 2026 14:30:00 +0000

TLDR: We stress-tested a deterministic column detection pipeline against a LaTeX academic paper. The hypothesis going in was wrong. The real failure mode was found in the data, not in the algorithm. Here is what we assumed, what we found, and where the work actually needs to happen.


The Setup

The PDF extraction pipeline uses a bipartite band partition algorithm to detect 2-column layouts. It was built against two test documents: an Amazon earnings release (single-column financial) and a Siemens engineering manual (2-column, rich path geometry). Both work correctly.

The third document, a LaTeX academic paper, was always expected to be a harder case. But the failure mode we expected was wrong.

The Wrong Hypothesis

LaTeX PDFs have a lot of math. Math characters are typeset using font metrics where the advance width (how far the cursor moves after placing a glyph) does not match the ink width (how wide the character actually is). The assumption was:

getTextContent() text items have advance-width-based widths
Math glyphs have inflated advances
Many math items → median font size pulled toward math sizes (~7pt)
All thresholds calibrated from the body font size (PageScale S) are miscalibrated
Column detection breaks because the gutter threshold is wrong

This is a reasonable hypothesis. It is also completely incorrect for this document.

The diagnostic measured median S and mode S for every page. The divergence was 0.1pt. On some pages, mode was 9.5pt and median was 9.4pt. These are not meaningfully different. The calibration path is not the failure.

The Real Failure

After ruling out calibration, we looked at the actual items being processed.

For a standard 2-column LaTeX paper, the column gutter is roughly at X≈310 (out of a ~620px viewport). The bipartite algorithm's fallback path, which runs when interval merge finds no clean gap, walks candidate X values and counts how many items cross each. The split point should be the X where the fewest items cross.

On raiko-aistats-12.pdf, the fallback finds that every candidate X has the same high crossing count. Zero splits detected.

Why? PDF.js getTextContent() for this document does not expose individual math characters as separate items. Entire display-math equations arrive as single text items. A display-math block is full-width: it spans from the left column's left edge to the right column's right edge, crossing X≈310 along with X≈100, X≈150, and every other candidate.

When the fallback scan runs, it counts one or two wide equation items as crossing all candidates. No candidate has a lower crossing count than any other. No split is selected.

The gutter is real. The columns are real. The equations just happen to sit across the gutter and overwhelm the crossing count.

What Survived

The interval merge stage is correct. It correctly finds no clean gap because the equation items span the full page width in getTextContent(): there is no gap to find in the item X-extents. The problem is not in interval merge; it is in how the fallback scan treats anomalously wide items.

The calibration work (mode-S vs median-S) is still worth doing. The 0.1pt divergence on these three documents doesn't mean the fix is unnecessary. It means these three documents happen not to stress the calibration path. A document with 60% subscripts would.

The three-tier architecture (getStructTree → getOperatorList → getTextContent) is correct as a long-term model. But for these three specific documents, the cascade always ends in the same place: Tier 3 (bipartite fallback) is the only active path. The struct tree is absent from all three. Full-height vertical column rules are absent from all three.

The Actual Fix

The fix is a pre-filter on the fallback crossing scan. Items where vWidth > S * 4 are anomalously wide relative to the body font size. They are display-math blocks, full-width images, or full-width headers, not normal paragraph text. They should not contribute to the crossing count in the fallback scan because they are not evidence about where the column boundary is.

Filter them before the crossing scan runs. The surviving narrow items will have the correct gap pattern. Splits will be found.

This is a two-line change in _detectPageColumns. It does not touch the interval merge path, the three gates, or the bipartite structure.
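A minimal sketch of the pre-filter, separate from the real _detectPageColumns: the helper names, the vX field, and the sample page are made up for illustration. vWidth, S, and the factor of 4 are taken from the description above.

```javascript
// Hypothetical item shape: { vX, vWidth } in viewport units. S is the body
// font size (PageScale S); the factor of 4 is the threshold named above.
function dropAnomalouslyWide(items, S, factor = 4) {
    return items.filter(it => it.vWidth <= S * factor);
}

// Number of items whose horizontal extent strictly contains x.
function countCrossings(items, x) {
    return items.filter(it => it.vX < x && it.vX + it.vWidth > x).length;
}

// Made-up page: short runs in two columns plus one full-width equation item.
const items = [
    { vX: 60,  vWidth: 30 }, { vX: 200, vWidth: 35 },   // left column
    { vX: 330, vWidth: 30 }, { vX: 500, vWidth: 38 },   // right column
    { vX: 55,  vWidth: 510 },                           // display-math block
];
const S = 10;
const gutterX = 310;

console.log(countCrossings(items, gutterX));                          // equation item crosses the gutter
console.log(countCrossings(dropAnomalouslyWide(items, S), gutterX));  // gutter is clear after the filter
```

The filter only affects which items vote in the crossing scan. The wide items stay in the page's item list and are still laid out afterwards.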

What the Architecture Document Changed

Before this session: the pipeline had one active code path for all documents, calibrated by a median that is theoretically wrong for math-heavy documents.

After this session: the architecture document defines a three-tier model, a diagnostic harness confirms which tier is active per document, and the specific failure mode for LaTeX papers is root-caused to item width anomalies, not calibration.

The heavy restructure (structTreeReader, ctmAdapter extensions, tiered classifyPage) is still ahead. But the next actionable fix, the one that makes raiko-aistats work, is the display-math pre-filter, not the restructure.

The Three-Tier PDF Extraction Model: Demystifying PDF.js

Bonzai2Carn — Thu, 09 Jul 2026 14:30:00 +0000

TLDR: Every manufactured PDF (not scanned) has three fidelity levels available to a browser-side extractor: a semantic structure tree, a geometric paint stream, and a text convenience API derived from the paint stream. Most tools use only the third. The correct architecture reads them top-down and exits as soon as a tier produces a complete answer.


The API Hierarchy

PDF.js exposes three document-reading APIs. They are not three ways of reading the same data. They are three different data sources at different levels of the PDF format.

getTextContent()

This is a convenience API. It returns positioned text items: { str, transform, width, height } for every glyph run on the page. You get text, position, and a typographic advance width.

The important fact: this API reads the same page content stream that getOperatorList() exposes. PDF.js walks the stream's operators internally, collects text paint operators (Tj, TJ, ', "), applies the current text matrix and CTM, and packages the results as text items. getTextContent() does not read a different part of the PDF. It is a processed view of the paint stream.

This derivation has a cost: advance widths are typographic, not ink widths. For most text, these are the same. For math, they are not. A subscript character with a large italic correction has an advance width that includes white space intended for the next character. Equation blocks composed from individual glyphs may have items whose total advance width is wider than their actual ink.

More critically for extraction: PDF.js may not produce individual items for every character in a math equation. Display-math blocks from LaTeX can arrive as single items spanning the full equation width. The individual character positions are not surfaced.

getOperatorList()

This is the raw paint stream. Every drawing command the PDF renderer would execute: move-to, line-to, curve-to, rectangle, fill, stroke, set-color, set-font, save, restore. The CTM stack is implicit: q pushes, Q pops, cm multiplies into the current matrix.

This is the ground truth for geometry. Table lines, box borders, background fills, and column rules are all explicit path operators here. Nothing is inferred.

Text paint operators (Tj, TJ) appear in the operator list with the current text matrix applied. These are the same operators getTextContent() is built from. The difference: in a tagged PDF, the operator list also shows the BDC...EMC marked content blocks that wrap each paint operation, and the BDC property list carries a Marked Content ID (MCID). A plain BMC block has a tag but no property list, so it carries no MCID.

getStructTree()

This is not derived from the paint stream. It is a separate data structure, the logical structure tree, referenced from the document catalog through its StructTreeRoot entry. It encodes the semantic role of every tagged element: Table, TR, TD, TH, P, H1–H6, Figure, Formula, L (list), LI.

Each leaf node in the structure tree carries an MCID. Each BDC operator that opens a tagged block in the operator list carries the same MCID. Joining the two gives you: every glyph run → its semantic role.

This is the data source that most extractors never read.

The MCID Join

The join between structure tree and operator list is the technical centerpiece of Tier 1.

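Walk the operator list once. Maintain a MCID stack: push on every BMC/BDC, pop on EMC. When you encounter a text paint operator (Tj, TJ), record the MCID of the nearest enclosing block that has one.

const mcidStack = [];              // one entry per open marked-content block
const opIndexToMcid = new Map();   // operator index -> MCID

for (let i = 0; i < opList.fnArray.length; i++) {
    const fn = opList.fnArray[i];
    const args = opList.argsArray[i];

    if (fn === OPS.beginMarkedContent || fn === OPS.beginMarkedContentProps) {
        // Push for every opener, even one without an MCID, so that each
        // EMC pops the block it actually closes.
        const props = fn === OPS.beginMarkedContentProps ? args[1] : undefined;
        // Depending on the PDF.js version, args[1] may be the MCID itself
        // or an object carrying it; accept both.
        const mcid = typeof props === 'number' ? props : props?.MCID;
        mcidStack.push(mcid ?? null);
    } else if (fn === OPS.endMarkedContent) {
        mcidStack.pop();
    } else if (fn === OPS.showText || fn === OPS.showSpacedText) {
        // Nearest enclosing block that carries an MCID.
        const currentMcid = mcidStack.findLast(m => m !== null);
        if (currentMcid !== undefined) {
            opIndexToMcid.set(i, currentMcid);
        }
    }
}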

Then walk the structure tree. Each Table node contains TR nodes, which contain TD nodes. Each TD node has an MCID. Map that MCID to the text items collected above.

Result: every text item has a semantic role. Tables fall out as TD nodes grouped by TR grouped by Table. No column detection. No stream detection. No threshold.
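The tree walk can be sketched as below. The node shapes are an assumption modeled on what getStructTree() returns: elements as { role, children } and leaves as { type: 'content', id }. collectContentIds and collectTables are hypothetical names, and how a leaf id maps back to an MCID is left out because it depends on the PDF.js version.

```javascript
// Assumed node shapes: element = { role, children }, leaf = { type: 'content', id }.
function collectContentIds(node, out = []) {
    if (node.type === 'content') out.push(node.id);
    for (const child of node.children || []) collectContentIds(child, out);
    return out;
}

// Returns one entry per Table: an array of rows, each row an array of
// cells, each cell an array of content ids.
function collectTables(node, tables = []) {
    if (node.role === 'Table') {
        const rows = [];
        const visit = n => {
            if (n.role === 'TR') {
                rows.push((n.children || [])
                    .filter(c => c.role === 'TD' || c.role === 'TH')
                    .map(cell => collectContentIds(cell)));
            } else {
                for (const c of n.children || []) visit(c);   // THead / TBody / TFoot wrappers
            }
        };
        visit(node);
        tables.push(rows);
        return tables;            // tables nested inside cells are not expanded here
    }
    for (const child of node.children || []) collectTables(child, tables);
    return tables;
}
```

Looking each content id up in the map built from the operator list then attaches the cell's text items to its row and column position.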

Tier 2: What ctmAdapter Is Already Close To Providing

The current pipeline reads the operator list via ctmAdapter.js, which emits subpath records and filled rectangles. It already collects vSegs (vertical segments).

The missing piece: nobody checks whether any vSeg spans the full content height. A vertical rule in an engineering manual or newsletter that runs from top to bottom of the content area is explicit column geometry. It is more reliable than any inference from text positions.

const contentHeight = contentBottom - contentTop;
const columnRules = vSegs.filter(s => {
    const len = Math.abs(s.y2 - s.y1);
    const midX = (s.x1 + s.x2) / 2;
    return len >= contentHeight * 0.60
        && midX >= vpWidth * 0.10
        && midX <= vpWidth * 0.90;
});

If columnRules.length > 0, their X positions are used directly as column splits. The bipartite algorithm is skipped entirely. A geometric fact is used as a geometric fact.

Two other operator list signals are currently discarded:

Clip stack: W/W* operators define clip regions. Some PDFs clip each column to its column rectangle. Adjacent clip regions with a gap encode a column layout without any inference.
Paint order: a text item painted after a filled rectangle is visually on top of that rectangle. When two filled regions overlap the same text item, paint order disambiguates which region the text belongs to.

Mode-S vs Median-S

PageScale S is the body font size, which is the calibration constant from which all thresholds are derived. Currently: S = median(vFont) across all text items on the page.

For LaTeX papers with subscripts, superscripts, and equation characters, the distribution of font sizes is not unimodal. There is a cluster of body-text items at 10pt and a long tail of math characters at 6–7pt. The median of this distribution is pulled toward the tail.

The fix is the mode, computed on 0.5pt bins:

const bins = new Map();
for (const tm of textMeta) {
    const bin = Math.round(tm.vFont * 2) / 2;
    bins.set(bin, (bins.get(bin) || 0) + 1);
}
let modeFont = 12, modeCount = 0;
for (const [bin, count] of bins) {
    if (count > modeCount) { modeCount = count; modeFont = bin; }
}
const S = modeFont;

The mode is the body text size on any well-structured document. Subscripts are a tail in the frequency distribution, not the center of mass.
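To see how far the two statistics can diverge, here is a small self-contained comparison on synthetic data. The size distribution is made up to represent a math-heavy page and is not measured from any of the test documents; median and binnedMode are illustrative helpers.

```javascript
function median(values) {
    const s = [...values].sort((a, b) => a - b);
    const mid = Math.floor(s.length / 2);
    return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Most frequent value after rounding to bins of binSize points.
function binnedMode(values, binSize = 0.5) {
    const bins = new Map();
    for (const v of values) {
        const bin = Math.round(v / binSize) * binSize;
        bins.set(bin, (bins.get(bin) || 0) + 1);
    }
    let best = null, bestCount = 0;
    for (const [bin, count] of bins) {
        if (count > bestCount) { bestCount = count; best = bin; }
    }
    return best;
}

// Synthetic page: 45 body items at 10pt, 30 at 7pt, 25 at 6pt.
const sizes = [...Array(45).fill(10), ...Array(30).fill(7), ...Array(25).fill(6)];
console.log(median(sizes), binnedMode(sizes));   // prints "7 10"
```

The mode only recovers the body size while body text remains the single largest bin. On a page where one math size outnumbers the body text, it would fail in the same way the median does.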

What the Diagnostic Found

Running all three test PDFs through a read-only diagnostic harness produced:

PDF	Struct tree	Column rules	Mode-S vs Median-S	Tier 3 result
AMZN (financial)	Absent	None	Identical	0 splits ✓
59MN7C (engineering)	Absent	None found (diagnostic param bug)	Identical	Splits detected ✓
raiko-aistats (LaTeX)	Absent	None	0.1pt divergence	0 splits ✗

The tiered architecture is correct in design. For the current test documents, all three PDFs fall through to Tier 3. The struct tree is not used because it is not present. Column rules are not used because none are present.

The failure on raiko-aistats is entirely within Tier 3: the fallback crossing scan is not pre-filtering anomalously wide items. That is the next fix.

The Hierarchy Summary

getStructTree()       # semantic ground truth, MCID-joined to paint stream
getOperatorList()     # geometric ground truth, path operators + text operators
getTextContent()      # derived convenience, lossy (no MCID, typographic widths)

A PDF exported as tagged PDF from Word, InDesign, or a publishing tool has a populated struct tree; untagged exports, like all three test documents above, do not. Every manufactured PDF (not scanned) has a full operator list. getTextContent() is the right tool for 80% of cases where you just need text and approximate positions.

For extraction that needs to correctly classify tables, columns, and reading order across arbitrary document types, you need all three APIs and a cascade strategy for which one to trust.

Why Web Workers Swallow Your Stacktraces (And How to Write Specs to Fix It)

Bonzai2Carn — Tue, 07 Jul 2026 14:30:00 +0000

TLDR: A spec that says "the function checks r.bbox.x < pageWidth / 2" without saying where pageWidth comes from is not a spec. It is a description of behavior with a hidden dependency. Every architecture document I have written with that pattern has produced at least one ReferenceError in implementation.

The Pattern That Keeps Failing

Architecture documents are good at describing what code should do. They are bad at describing where values come from and which function boundaries they cross.

The page assembly refactor spec said:

Check for FEATURE_LAYOUT: 2 cols, left is all visual, right is text. Compare each region's r.bbox.x against pageWidth / 2.

That is a correct behavioral description. It says nothing about execution context. The implementer writes it into _detectAutoZones, a module-level function, and references pageWidth by name. The spec did not say pageWidth needed to be passed as a parameter. The spec did not say _detectAutoZones was a module-level function. Both facts were implicit.

The result: ReferenceError: pageWidth is not defined on every PDF load, with a stacktrace that points at the worker message handler instead of the actual line.

Why This Keeps Happening

Architecture documents are written in prose. Prose does not have a type system. A sentence like "the function uses pageWidth" can mean:

The function receives pageWidth as a parameter.
The function reads pageWidth from a module-level variable.
The function is nested inside a caller that defines pageWidth and closes over it.
The function reads pageWidth from an object passed in.

All four are syntactically valid JavaScript. The prose does not distinguish between them. The implementer picks one and moves on.

When the spec is written by the same person who will implement it, option 3 is tempting because it is the least-friction path. The value is just "there" without needing to thread it through function signatures. It works when the function is actually nested. It throws when the function ends up at module level for any reason (extracted for reuse, moved for readability, placed outside the call site by default).

What Correct Specs Look Like

A spec that is actually implementable names the function signature:

_detectAutoZones(regions, numCols, pageWidth): pageWidth is viewport.width || 612, passed from assemblePage.

That is one extra parameter and one clause saying where it comes from. It removes all ambiguity. The implementer knows the parameter needs to exist. The reviewer can check that the call site passes it. The bug does not happen.

The discipline: any time you write "the function uses X" in an architecture document, immediately write where X comes from. If it is a parameter, name it in the signature. If it is a module constant, name the constant. If it is a derived value, show the derivation.
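Written as code, the same discipline might look like the sketch below. Only the signature and the viewport.width || 612 derivation come from the spec line quoted above. The function bodies, the region shape with its kind field, and the return values are placeholders invented for illustration.

```javascript
/**
 * @param {Array<{bbox: {x: number}, kind: string}>} regions  hypothetical region shape
 * @param {number} numCols    column count detected for the page
 * @param {number} pageWidth  viewport.width || 612, passed from assemblePage
 * @returns {boolean} true when the page looks like FEATURE_LAYOUT
 */
function _detectAutoZones(regions, numCols, pageWidth) {
    if (numCols !== 2) return false;
    const mid = pageWidth / 2;
    const left = regions.filter(r => r.bbox.x < mid);
    const right = regions.filter(r => r.bbox.x >= mid);
    // FEATURE_LAYOUT: left column is all visual, right column is all text.
    return left.length > 0 && right.length > 0
        && left.every(r => r.kind === 'visual')
        && right.every(r => r.kind === 'text');
}

function assemblePage(viewport, regions, numCols) {
    const pageWidth = viewport.width || 612;   // the derivation lives at the call site
    return { featureLayout: _detectAutoZones(regions, numCols, pageWidth) };
}
```

Because pageWidth is a parameter, moving _detectAutoZones to module level or into another file cannot turn it into a free identifier.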

The Stacktrace Problem

ReferenceErrors in Web Workers have a specific failure mode when the worker wraps its message handler in a catch-all: the worker's error handler catches the error, serializes only err.message (a string), and posts it to the main thread. The stack is not forwarded. The main thread reconstructs a new Error from the message string and throws it. DevTools shows the main thread throw, not the worker's.

This means a ReferenceError in a worker looks like an error in the worker message handler, not in the actual throwing function. The real location is invisible. You find it by grepping for the identifier named in the error message.

This is a general problem with any architecture that serializes errors across execution boundaries (workers, iframes, service workers, error-catching middleware). The message string is preserved. The stack is not. Every ReferenceError in such a system requires a grep rather than a stacktrace to locate.
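One way to keep the location is to forward the stack as plain data alongside the message. The sketch below assumes a message envelope of the form { type: 'error', error }; serializeError, rehydrateError, and handle are hypothetical names, not the pipeline's actual handler.

```javascript
// Worker side: turn an Error into a plain object that survives postMessage.
function serializeError(err) {
    return { name: err.name, message: err.message, stack: err.stack };
}

// Main-thread side: rebuild an Error that keeps the worker's stack text.
function rehydrateError(payload) {
    const err = new Error(payload.message);
    err.name = payload.name;
    err.stack = `${payload.stack}\n    [rethrown on main thread]`;
    return err;
}

// In the worker (sketch, `handle` is a placeholder for the real work):
//   self.onmessage = (e) => {
//       try { handle(e.data); }
//       catch (err) { self.postMessage({ type: 'error', error: serializeError(err) }); }
//   };
//
// On the main thread:
//   worker.onmessage = (e) => {
//       if (e.data.type === 'error') throw rehydrateError(e.data.error);
//   };
```

With the stack text forwarded, the rethrown error names the function that actually threw, so the grep becomes a fallback instead of the only tool.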

The Uncomfortable Implication

If your architecture document has not specified where every value used by every function comes from, your implementation will have bugs that grep finds faster than debuggers. That is not a criticism of the implementer. It is a criticism of the spec.

The solution is not more thorough prose. It is specifying function signatures explicitly, the same way you would in TypeScript or a statically-typed language. Write the types. Write the parameter names. Write where the values come from. Three lines of explicit signature beats three paragraphs of behavioral description every time.

Web Components vs. Iframes: A Hard Lesson in DOM Isolation Barriers

Bonzai2Carn — Sat, 04 Jul 2026 14:30:00 +0000

TLDR

A custom element that fetched a canonical app's HTML and swapped document.body.innerHTML looked clean on the surface. It worked until it didn't: the swap raced with the existing app's event handlers, producing a tool that rendered correctly but did nothing. The correct pattern, an iframe pointing at the canonical URL with ?view=, was already in use by the VS Code extension. It took three weeks to apply the same answer to the web.

Ginexys

The Assumption That Seemed Reasonable

Ten SEO landing pages need unique <head> metadata but should load the same tool. A custom element that fetches the canonical tool HTML and injects it into the current page would deduplicate the tool code while keeping per-page metadata. One component, ten thin wrapper pages. Fewer moving parts.

How It Broke

The custom element ran document.body.innerHTML = canonicalBody.innerHTML. This replaced the body after the scripts that loaded the tool had already attached their event handlers to the original DOM.

The app init pattern was DOMContentLoaded → attach handlers → ready. After the innerHTML swap, the DOM the handlers were attached to was gone. The new DOM from the canonical body had no handlers attached. Sometimes the app's deferred script loaded against the old body and ran to completion. Sometimes it ran against the new body. Sometimes it ran twice. The outcome was non-deterministic.

Symptoms: tabs switched. File dialogs opened. Buttons were visually interactive. But clicking "process file" extracted nothing. Clicking "export" produced nothing. Clicking a gated feature opened nothing. No console errors. Everything looked correct and did nothing.

Why It Was Hard to Find

The canonical URL was always used for testing. /tools/pdf-processor/ loaded fine, worked correctly, passed every test.

A Vite plugin had been added to redirect the root canonical URL to /editor/ to prevent a PDF.js worker from hitting the root URL and receiving HTML instead of JavaScript. This redirect routed all user traffic through the wrapper. The standalone test path stopped existing. The bug only appeared on the path users actually took, and that path looked correct visually.

What Was Thrown Away

Three web component files, 8KB of code. The approach was not wrong in principle. It was wrong for an app that does imperative DOM initialization. A custom element that injects static HTML into a page is fine. A custom element that injects a running app into a page, with scripts that have already begun attaching handlers, is not.

Also discarded: the assumption that "deduplicate HTML" means "inject HTML". Deduplication of a running app means isolation, not injection.

What Replaced It

Each SEO landing page became a thin HTML with unique metadata and a single full-viewport iframe:

<iframe src="/tools/pdf-processor/?view=editor"
        style="width:100%;height:100vh;border:none;"
        allow="..."></iframe>

The canonical app loads inside its own document. Its scripts initialize against its own DOM. Nothing races. The ?view=editor query parameter activates the right tab. No DOM swap, no event handler collision.
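The URL-parameter handshake can be sketched in a few lines. The allow-list of views and both function names are assumptions for illustration; only the ?view=editor parameter and the /tools/pdf-processor/ path come from the snippet above.

```javascript
// Assumption: the app knows a fixed set of view names.
const KNOWN_VIEWS = new Set(['editor']);

// Wrapper side: build the iframe src for a landing page.
function buildEmbedSrc(canonicalPath, view) {
    if (!KNOWN_VIEWS.has(view)) throw new Error(`unknown view: ${view}`);
    const params = new URLSearchParams({ view });
    return `${canonicalPath}?${params}`;
}

// App side: read the requested view during normal init,
// e.g. readRequestedView(window.location.search).
function readRequestedView(search, fallback = 'editor') {
    const view = new URLSearchParams(search).get('view');
    return KNOWN_VIEWS.has(view) ? view : fallback;
}
```

The wrapper only ever writes a URL, and the app only ever reads its own location, so neither side touches the other's DOM.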

The VS Code extension had been doing this correctly since the beginning: load the canonical index.html into a webview, inject a global for mode selection, let the app init normally. Three weeks later, the web followed the same pattern.

The Lesson

When you need to embed an app inside another document, use an isolation boundary that the browser already provides. The iframe is the answer. Web components are for components, meaning UI elements that render within the host document's DOM. An existing running app with its own initialization sequence is not a component. Load it in its own document and talk to it via postMessage or URL parameters.

The sign that you are solving the wrong problem: when your custom element needs to replace document.body.innerHTML, you are trying to do iframe isolation without the isolation.

Seven Table Parsers, One Interface: Designing a Table Formatter and Node Editor (TAFNE)

Bonzai2Carn — Thu, 02 Jul 2026 14:30:00 +0000


The first question most people ask when they see TAFNE accept HTML, CSV, TSV, Markdown, JSON, ASCII art tables, and SQL INSERT statements from the same input field is: how does it know which one you pasted?

The short answer: it doesn't always. You tell it. Or the file extension tells it. Or for unknown text files, it makes a content-based guess.

The longer answer is that each of these seven formats requires a genuinely different parsing approach, and the differences are interesting.

The Dispatcher

The entry point is a switch statement in parseInput():

switch (inputType) {
    case 'html':     tableHtml = parseHtmlInput(inputData); break;
    case 'ascii':    tableHtml = parseAsciiInput(inputData); break;
    case 'csv':      tableHtml = parseCsvInput(inputData); break;
    case 'text':     tableHtml = parseTextInput(inputData); break;
    case 'markdown': tableHtml = parseMarkdownInput(inputData); break;
    case 'json':     tableHtml = parseJsonInput(inputData); break;
    case 'sql':      tableHtml = parseSqlInput(inputData); break;
}

Every branch takes the same input: a raw text string. Every branch produces the same output: an HTML table string. What happens in between is completely different.

Two Parsing Strategies

The seven parsers split into two fundamental camps.

Semantic parsers rely on explicit structure. The format declares its own structure using tags, keys, or reserved keywords. The parser just has to read that declaration.

Heuristic parsers rely on pattern recognition. The format is a convention, not a formal specification. The parser makes educated guesses based on what it sees.

HTML and JSON are semantic. CSV, TSV, and ASCII are heuristic. Markdown and SQL sit in the middle.

The Semantic Parsers

The HTML parser uses a regex to find table elements:

const tablePattern = /<table[\s\S]*?<\/table>/gi;
const matches = html.match(tablePattern);

The structure is already there. The parser's job is to extract and normalize it, not reconstruct it.

The JSON parser reads explicit keys and values:

let data;
try { data = JSON.parse(json); } catch (e) { ... }
const headers = Object.keys(data[0]);

The column names are the object keys. The rows are the array elements. No guessing involved. If the input is valid JSON, the structure is unambiguous.

The JSON parser also handles a common real-world case: JSON that wraps the array in an outer object. If JSON.parse returns an object rather than an array, the parser looks for the first key whose value is a non-empty array:

const arrayKey = Object.keys(data).find(k => Array.isArray(data[k]) && data[k].length > 0);
data = arrayKey ? data[arrayKey] : [];

This handles API responses that return { "results": [...] } or { "data": [...] } without requiring the user to drill into the JSON manually.

The Heuristic Parsers

The CSV parser assumes commas separate columns and newlines separate rows. That's it.

const cells = line.split(',').map(cell => cell.trim().replace(/^"|"$/g, ''));

Quotes are stripped from around cells. The first row is assumed to be headers. This works for the vast majority of real CSV files, but it will fail on CSV files that contain commas inside quoted fields (a common edge case in RFC 4180-compliant CSV). The current parser treats every comma as a delimiter regardless of quoting context. For most practical use cases, this is fine.
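For inputs where that limitation matters, a quote-aware split is a small state machine. The sketch below is not TAFNE's parser; splitCsvLine is a hypothetical drop-in for the line.split(',') call. It still works one line at a time, so quoted fields that contain line breaks are out of scope.

```javascript
// Split one CSV line, honoring RFC 4180 quoting: commas inside quotes are
// data, and a doubled quote ("") inside a quoted field is a literal quote.
function splitCsvLine(line) {
    const cells = [];
    let cell = '';
    let inQuotes = false;
    for (let i = 0; i < line.length; i++) {
        const ch = line[i];
        if (inQuotes) {
            if (ch === '"' && line[i + 1] === '"') { cell += '"'; i++; }   // escaped quote
            else if (ch === '"') inQuotes = false;                          // closing quote
            else cell += ch;
        } else if (ch === '"') {
            inQuotes = true;
        } else if (ch === ',') {
            cells.push(cell.trim());
            cell = '';
        } else {
            cell += ch;
        }
    }
    cells.push(cell.trim());
    return cells;
}
```

For example, splitCsvLine('a,"b, c",d') yields three cells with the comma preserved in the second, where the plain split would produce four.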

The ASCII parser skips separator lines:

if (line.includes('+---') || line.includes('+===')) return;
const cells = line.split('|').filter(cell => cell.trim());

ASCII tables use +---+---+ for horizontal separators and | val | val | for data rows. The parser identifies separators by looking for the +--- pattern, skips them, and splits data rows on pipes.

The State Machine: Markdown

Markdown tables have three types of lines: the header row, the separator row, and data rows. The separator row is syntactically distinct but carries no data.

if (/^\|?[\s|:\-]+\|?$/.test(line)) return; // separator

let headerDone = false; // must be let: the flag is reassigned after the header row
// ...
tableHtml += headerDone ? `<td>${cell}</td>` : `<th>${cell}</th>`;
headerDone = true;

The headerDone flag is a minimal state machine. Before processing the first data row, cells become <th>. After, they become <td>. The separator line is identified by a regex that matches lines containing only pipes, spaces, colons, and dashes.

The Hunter: SQL

The SQL parser is the most technically specific. It uses a capturing regex to scan for INSERT statements:

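const insertRe = /INSERT\s+INTO\s+\S+\s*\(([^)]+)\)\s*VALUES\s*\(([^)]+)\)/gi;
let headers = null;
const rows = [];
let match;
while ((match = insertRe.exec(sql)) !== null) {
    if (!headers) {
        // Column names come from the first INSERT; strip identifier quoting.
        headers = match[1].split(',').map(c => c.trim().replace(/[`"']/g, ''));
    }
    // Strip the surrounding quotes, then unescape doubled single quotes.
    const vals = match[2].split(',').map(v => v.trim().replace(/^'|'$/g, '').replace(/''/g, "'"));
    rows.push(vals);
}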

The exec method called in a loop advances the regex cursor after each match. Column names come from the first match's first capture group. Values come from every match's second group.

Single quotes around values are stripped. Escaped single quotes ('') are converted back to single quotes. The result is clean cell values, one row per INSERT.

Auto-Detection for File Loads

When a file is loaded instead of pasted, the format is detected from the extension. For .txt files and unknowns:

if (text.includes('\t')) {
    tableHtml = parseTextInput(text);
} else if (text.includes(',')) {
    tableHtml = parseCsvInput(text);
} else {
    tableHtml = parseTextInput(text);
}

Tabs win over commas. If neither is present, the tab-delimited parser handles it anyway, which will at least produce a single-column table from the line breaks.

Seven formats. One interface. The parsers are the wall between "raw text someone pasted" and "a table you can edit."

Source: github.com/carnworkstudios/TAFNE

Why We Treat HTML as a CAD Format for PDF (And Why It Works)

Bonzai2Carn — Tue, 30 Jun 2026 04:00:00 +0000

Most PDF-to-HTML tools stop at extraction. You get a dump of text, maybe some tables, and a "download HTML" button. That's the end of the story.

We didn't stop there. And the reason is simple: extracted HTML is not a document you're done with. It's a document you're about to edit.


The problem with "extracted output"

When you extract a PDF, you get a structural snapshot of the original. That snapshot is close to what you want, but rarely exactly what you want. Tables have merged cells that should be split. Headings got classified as paragraphs. A two-column layout that made sense in print looks wrong on a screen. A numbered list starts at 3 because the PDF had a callout box in between.

Most tools hand you this output and say: open it in Word, fix it there.

That's a context switch. Every context switch is friction. Friction compounds.

HTML is already a spatial document format

Here's what most people don't realize: HTML rendered in a browser is a box model. Every element (every heading, paragraph, table, callout) is a box with dimensions, position, and CSS-computed layout. The browser calculates all of this automatically.

That box model is essentially a CAD coordinate system. You already have:

Positioned containers (zones, columns, regions)
Reflowable layout (CSS Grid)
Semantic element types (headings, paragraphs, lists, tables)
A full editing surface (contenteditable)

What was missing was the interaction layer to treat it like one.

Two modes, one surface

The Doc tab in Ginexys PDF Processor has two modes on the same surface:

Edit Mode: the existing contenteditable surface. Click into text, type, use the formatting toolbar. The browser handles all the text editing mechanics. This is what you use when you're making content corrections.

Selection Mode: a layout editing layer. Click "Select" in the toolbar. Now every extracted zone and region gets a drag handle. You can:

Drag zones to reorder sections of the page
Drag individual regions (headings, paragraphs, tables) within or across zones
Marquee-select multiple regions and group them into a new zone
Right-click any element and choose "Edit Code" to see and edit its raw HTML in a Monaco editor

Switch back to Edit Mode with the same button. The two modes share the same DOM, with no conversion and no re-render.

Why not absolute positioning?

The obvious CAD metaphor is Figma: drag elements freely, place them anywhere. We explicitly chose not to do this.

The reason is that our output is HTML, and HTML in a browser is a flow document. Absolute positioning breaks that. An absolutely-positioned element is outside the flow: it doesn't affect other elements, doesn't respond to container resizes, and doesn't export correctly to Markdown or XML or DOC.

Drag-to-reorder in document flow is more useful than drag-to-anywhere. You're reorganizing a document, not designing a poster.

Edit Code: the escape hatch

Every extracted element has an "Edit Code" option in the right-click menu. This opens a Monaco editor dialog with the element's raw outerHTML. You can:

Add a CSS class
Change the tag from <h4> to <h3>
Rewrite a paragraph's content entirely
Fix a table cell that parsed incorrectly
Add an attribute

Click Apply. The element is replaced in the live DOM. The change propagates to the Monaco source editor and the Visual Diff tab automatically.

This is the escape hatch that makes the higher-level tools trustworthy. If the drag handles can't express what you need, the code editor can.

The export chain closes the loop

Selection Mode and Edit Code aren't decorative. Every change you make in the Doc tab flows through the same sync coordinator (applyHtmlEverywhere) that the Monaco source editor uses. When you export:

HTML: the edited DOM, with images inlined as base64
Markdown: converted from the live DOM structure (real GFM: pipe tables, ### headings, - bullets)
XML: semantic tree (<heading level="3">, <table><row><cell>)
DOC: Office Open XML envelope, opens in Word/LibreOffice/Google Docs
PDF: browser print dialog, scoped to the Doc content

What you see is what you export.
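
To make the Markdown bullet concrete, here is a minimal sketch of turning table rows into a GFM pipe table. It works on plain arrays rather than the live DOM, and rowsToGfmTable is a hypothetical name, not the exporter in the repo.

```javascript
// Hypothetical helper: rows is an array of arrays, first row is the header.
function rowsToGfmTable(rows) {
    if (rows.length === 0) return '';
    // Pipes would end the cell early and newlines would end the row, so neutralize both.
    const escapeCell = (cell) => String(cell).replace(/\|/g, '\\|').replace(/\r?\n/g, ' ');
    const line = (cells) => `| ${cells.map(escapeCell).join(' | ')} |`;
    const [header, ...body] = rows;
    const divider = `| ${header.map(() => '---').join(' | ')} |`;
    return [line(header), divider, ...body.map(line)].join('\n');
}
```

A real exporter would read the rows from the table element's cells and would also need a policy for merged cells, which pipe tables cannot express.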

SiaS: the tool works without the service

This is the SiaS (Software-in-a-Service) model applied. The offline tool, which includes geometry extraction, Doc editing, Selection Mode, Edit Code, and all five export formats, works entirely without an account, without a server, without any network connection.

The AI layer (Docling-powered extraction, GINEX schema analysis) sits on top. It makes the tool smarter. But the tool is already useful without it.

The CAD layer is part of the base tool. It always will be.

Ginexys PDF Processor is available at ginexys.com. Free, offline, no account required.

Debugging Mobile Drag & CSS Specificity in a Real-Time PDF Diff Tool

Bonzai2Carn — Sat, 27 Jun 2026 14:30:00 +0000

TLDR: Three UI problems, three different root causes. The column detection fix was algorithmic. The mobile drag was an axis-detection oversight. The CSS specificity bug was a cascade law I already know but applied wrong under time pressure.

PDF Processor
Repo

What We Set Out To Do

Four tasks entered this session:

Fix raiko-aistats: 0 column splits on all 9 pages of a two-column LaTeX PDF.
Fix visual-diff mobile drag: stacked layout (< 1024px) still read horizontal clientX even though the drag axis was now vertical.
Add touch support to the compare diff resizer: mouse-only, no mobile drag.
Redesign the diff tab chrome: two rows of controls eating 82px before any content appeared.

The column detection fix was already covered in its own post-mortem. This one is about everything else.

The Mobile Drag Problem

The visual-diff layout switches to flex-direction: column at 1024px. The original initDividerResize() always read clientX and called outerWidth() on the first pane. Neither is meaningful when the axis is vertical.

The fix sounds simple: detect which axis the layout is using. The trap was how to detect it. You cannot use a window width check because the breakpoint is a CSS media query and can be overridden. The correct source of truth is:

getComputedStyle($layout[0]).flexDirection === 'column'

Computed style reads what CSS actually applied, not what the JavaScript thinks the breakpoint should be. This check runs at drag start, not at init, so it handles viewport resizes between page load and drag attempt.

The same check drives the cursor choice (row-resize vs. col-resize) and whether the percentage in flex: 0 0 ${topPct}% is computed from the pane's height or its width.

The Diff Tab Redesign

Two problems with the original design:

Two separate rows of controls (mode tabs + a toolbar row) consumed ~82px before any content.
The layout and precision controls were visually grouped but semantically separated.

The redesign collapses everything into a single 36px bar. Three pill groups (Rich/Plain, Split/Unified, Word/Char) sit left-aligned in a flex row. Stats (N added, N removed) sit right-aligned. The pill group uses a container background with a raised active-pill shadow, which is the standard segmented control pattern.

This required no HTML restructuring of the diff panels themselves. Only the chrome above them changed.

The CSS Specificity Disaster

After the redesign, #view-diff started rendering on top of every other tab. The panels were supposed to be display: none when inactive. The diff panel was always visible.

Initial read of the user's report: "It's not the height. It's either you rename it view-diff where the name that is being referred to is probably compare-diff or something like that."

That sentence is about an ID mismatch hypothesis. I checked the IDs. They matched. The real culprit was in the CSS cascade.

The .view-panel rule sets display: none on all panels. A later rule .diff-layout set display: flex. These are both single-class selectors with equal specificity. Source order breaks the tie. .diff-layout appears later in the file. It wins. Every .view-panel.diff-layout element gets display: flex whether it is the active tab or not.

The fix is one selector wide: scope the rule to .view-panel.diff-layout and drop display: flex from it.

/* Before: overrides display:none for all panels */
.diff-layout {
    display: flex;
    flex-direction: column;
}

/* After: only sets flex direction, never fights display:none */
.view-panel.diff-layout {
    flex-direction: column;
}

The rewritten rule no longer declares display, so it cannot override display: none on inactive tabs. The active-tab rule, which is two-class (0,2,0), still beats the single-class .view-panel rule (0,1,0) and shows the panel when it needs to.

What Failed

Wrong hypothesis first. The user's wording pointed toward an ID mismatch. I checked IDs first. That was the wrong tree. The cascade investigation was second. In hindsight: "always visible" is a specificity smell, not a naming smell.

The display: flex on a layout helper class. Adding layout properties to a semantic class that gets applied alongside view-panel is the setup for this exact problem. A layout class should set layout properties (direction, wrap, gap). It should not set display unless it is the element that owns the display decision. .view-panel owns the display decision here. .diff-layout does not.

What Survived

Pill group segmented controls are a permanent pattern in this codebase now.
Axis-detecting drag via getComputedStyle is the correct approach for responsive dividers.
Touch support via { passive: false } and e.touches?.[0] ?? e is now consistent across both dividers.

The session closed with a clean build. The four items that entered finished as fixed.

Under the Hood: Drag, Touch, and CSS Cascade in a Real Diff UI

Bonzai2Carn — Thu, 25 Jun 2026 14:30:00 +0000

TLDR: Three interconnected UI problems reveal how layout-aware drag detection, unified touch/mouse event handling, and CSS specificity interact when you redesign a panel that lives inside a visibility-toggled tab system.

Schema Editor
Repo

The Responsive Drag Problem

The visual-diff layout uses flex-direction: row on wide screens and flex-direction: column on mobile (< 1024px). The divider between the two panes needs to do different things depending on which axis is active.

Naive approach and why it fails

A window-width check:

if (window.innerWidth <= 1024) { /* vertical */ } else { /* horizontal */ }

This breaks if the user resizes the window after the divider was initialized. It also breaks if the breakpoint is overridden by a more specific CSS rule. Window width is not the source of truth here. CSS is.

Correct approach: read computed flex direction

function isStacked() {
    return getComputedStyle($layout[0]).flexDirection === 'column';
}

getComputedStyle returns the resolved value after all cascades and media queries have applied. Reading it inside startDrag() means it reflects the layout at the moment the user puts a finger or pointer on the divider, not the layout at page load.

Unified event position extraction

function getEventPos(e) {
    const src = e.touches?.[0] ?? e;
    return isStacked() ? src.clientY : src.clientX;
}

Mouse events and touch events have the same coordinate fields once you extract the first touch from the list. The optional chaining ?.[0] safely returns undefined for mouse events, which then falls through to ?? e (the event itself). This is equivalent to a ternary but shorter and avoids importing lodash or a touch-helper library.
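
One caveat if the same helper is reused at the end of a drag: on touchend the touches list is empty, and the lifted finger is reported in changedTouches instead. A minimal sketch of a variant that covers that case, with the axis passed in so it can be tested without a DOM (eventPos is a hypothetical name):

```javascript
// Sketch: same idea as getEventPos, plus a changedTouches fallback so it
// also returns a coordinate inside a touchend handler.
function eventPos(e, stacked) {
    const src = e.touches?.[0] ?? e.changedTouches?.[0] ?? e;
    return stacked ? src.clientY : src.clientX;
}
```

Plain objects are enough to exercise it: a mouse-like { clientX, clientY }, a touch-like { touches: [...] }, and a touchend-like { touches: [], changedTouches: [...] }.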

Dimension tracking

At drag start, capture the current first-pane dimension:

const $first = $layout.find('.vd-pane').first();
startSize = isStacked() ? $first.outerHeight() : $first.outerWidth();

During drag, compute the new size clamped to a min and max to prevent panes from collapsing:

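// Measure along the active axis: height when stacked, width when side by side.
const total = isStacked() ? $layout.outerHeight() : $layout.outerWidth();
const newSize = Math.max(120, Math.min(total - 120, startSize + delta));
const topPct = (newSize / total) * 100;
$panes.eq(0).css('flex', `0 0 ${topPct}%`);
$panes.eq(1).css('flex', `0 0 ${100 - topPct}%`);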

Using flex: 0 0 N% instead of width: N% or height: N% works on both axis orientations because the flex shorthand sets the flex-basis. The browser maps flex-basis to the main axis automatically, using width in row layouts and height in column layouts.
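
The clamping arithmetic is easier to check when it is separated from jQuery. A small sketch, assuming the same 120px minimum on each side; computeSplit and its parameter names are hypothetical:

```javascript
// Returns the flex-basis percentages for both panes after a drag of `delta` px.
// `total` is the layout's size along the active axis (height or width).
function computeSplit(startSize, delta, total, minPane = 120) {
    const size = Math.max(minPane, Math.min(total - minPane, startSize + delta));
    const firstPct = (size / total) * 100;
    return { firstPct, secondPct: 100 - firstPct };
}
```

The caller only has to pick the axis once, for startSize and total. The percentages then go straight into flex: 0 0 N% on each pane.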

Touch Event Registration

The dragging pattern requires three event pairs:

Phase	Mouse	Touch
Start	`mousedown`	`touchstart`
Move	`mousemove`	`touchmove`
End	`mouseup`	`touchend`

Why { passive: false } matters

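Most browsers (Chrome and Firefox among them) treat touchstart and touchmove listeners registered on window, document, or document.body as passive by default, to keep scrolling smooth. In a passive listener, e.preventDefault() is ignored. Without an effective preventDefault() on touchmove, the browser scrolls the page while the drag handler runs.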

Registering touch events with { passive: false } tells the browser this listener may prevent default:

$divider[0].addEventListener('touchstart', startDrag, { passive: false });
document.addEventListener('touchmove', doDrag, { passive: false });
document.addEventListener('touchend', endDrag);

Note: touchend does not need { passive: false } because we never prevent default on it.

Why touchmove goes on document, not the divider

If the user's finger moves faster than the browser can process drag events, the pointer position can leave the divider element. If the listener is only on the divider, you lose the event mid-drag. Attaching touchmove and touchend to document ensures the drag completes correctly even if the pointer drifts off the handle.
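
Put together, the registration can be sketched as one function that wires both input types and returns a cleanup callback. This is an illustration under assumptions, not the code from the repo: attachDividerDrag is a hypothetical name, the targets are injected so the function can be exercised without a browser, and the touchcancel binding is an addition the post does not mention.

```javascript
// Wires mouse and touch drag listeners. `divider` and `doc` only need
// addEventListener/removeEventListener, so fakes work in tests.
function attachDividerDrag(divider, doc, onMove) {
    let dragging = false;
    const active = { passive: false };

    const start = (e) => { dragging = true; e.preventDefault(); };
    const move = (e) => {
        if (!dragging) return;
        e.preventDefault(); // keeps the page from scrolling during a touch drag
        onMove(e);
    };
    const end = () => { dragging = false; };

    const bindings = [
        [divider, 'mousedown', start],
        [divider, 'touchstart', start, active],
        [doc, 'mousemove', move],
        [doc, 'touchmove', move, active],
        [doc, 'mouseup', end],
        [doc, 'touchend', end],
        [doc, 'touchcancel', end], // assumption: treat an interrupted touch as drag end
    ];
    for (const [target, type, fn, opts] of bindings) {
        target.addEventListener(type, fn, opts);
    }
    // Cleanup: remove every listener that was added above.
    return () => {
        for (const [target, type, fn, opts] of bindings) {
            target.removeEventListener(type, fn, opts);
        }
    };
}
```

In the page this would be called as attachDividerDrag($divider[0], document, doDrag). Keeping the bindings in one table makes it hard to add a listener without also removing it.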

The Diff Tab CSS Architecture

The 36px diff bar

The original diff chrome had two rows: a tab row (Rich/Plain) and a toolbar row (Split/Unified, Word/Char, stats). Total height: ~82px.

The redesign collapses this into a single div.diff-bar at height: 36px. The bar is a flex container:

.diff-bar {
    display: flex;
    align-items: center;
    justify-content: space-between;
    height: 36px;
    flex-shrink: 0;
    background: var(--toolbar-bg);
    border-bottom: 1px solid var(--border-dark);
}

flex-shrink: 0 prevents the bar from compressing when the diff workspace below it is larger than the available height. height: 36px is an explicit ceiling, not min-height, because the bar should never grow.

Pill group segmented control

.diff-pill-group {
    display: flex;
    align-items: center;
    background: var(--border);
    border-radius: 6px;
    padding: 2px;
    gap: 1px;
}
.diff-pill {
    height: 22px;
    padding: 0 8px;
    border-radius: 4px;
    background: transparent;
    color: var(--text-muted);
    font-size: 11px;
    font-weight: 500;
    transition: background .1s, color .1s;
}
.diff-pill.active {
    background: var(--surface);
    color: var(--accent-dark);
    box-shadow: 0 1px 3px rgba(0,0,0,.10);
}

The container background: var(--border) serves as the "track" color. The active pill lifts out of it with background: var(--surface) and a shadow. This is the same pattern Apple uses for segmented controls in UIKit: it reads as a single control, not a group of buttons.

The Specificity Bug

Setup

The tab visibility system works like this:

/* line ~645 in styles.css */
.view-panel {
    display: none;
}

/* active panel overrides per JS */
.view-panel.active {
    display: flex;
}

The diff layout helper class was added to make the diff panel a column-direction flex container:

/* line ~888 in styles.css: WRONG */
.diff-layout {
    display: flex;
    flex-direction: column;
}

Why it overwrote display: none

CSS specificity score for .view-panel is (0, 1, 0). CSS specificity score for .diff-layout is (0, 1, 0). Equal specificity. Source order breaks the tie. .diff-layout appears later in the file. It wins. Every element with class diff-layout gets display: flex regardless of any earlier display: none.
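
The tie-break can be reproduced in a few lines. The sketch below is a toy model, not a CSS engine: it only understands selectors built from #ids, .classes, and tag names, and it assumes every rule passed in already matches the element. The function names are hypothetical.

```javascript
// Specificity as [ids, classes, tags] for simple selectors only.
// No attribute selectors, pseudo-classes, or combinators.
function specificity(selector) {
    const ids = (selector.match(/#[\w-]+/g) || []).length;
    const classes = (selector.match(/\.[\w-]+/g) || []).length;
    const tags = (selector.match(/(^|\s)[a-zA-Z][\w-]*/g) || []).length;
    return [ids, classes, tags];
}

function compareSpecificity(a, b) {
    for (let i = 0; i < 3; i++) {
        if (a[i] !== b[i]) return a[i] - b[i];
    }
    return 0;
}

// Returns the winning declaration for one property: highest specificity
// first, then latest source order. `rules` must be in stylesheet order.
function resolve(rules, property) {
    let winner = null;
    for (const rule of rules) {
        if (!(property in rule.declarations)) continue;
        const spec = specificity(rule.selector);
        // >= 0 lets a later rule win a tie, which is the source-order tie-break.
        if (!winner || compareSpecificity(spec, winner.spec) >= 0) {
            winner = { selector: rule.selector, spec, value: rule.declarations[property] };
        }
    }
    return winner;
}
```

Feeding it .view-panel { display: none } followed by .diff-layout { display: flex } resolves display to flex. Replacing the second rule with .view-panel.diff-layout { flex-direction: column } leaves display at none, because that rule no longer declares display at all.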

Fix

/* CORRECT */
.view-panel.diff-layout {
    flex-direction: column;
}

Two-class specificity score is (0, 2, 0). This beats the single-class .view-panel rule when both apply. More importantly: removing display: flex from this rule means it never fights the visibility system at all. .view-panel.diff-layout now only sets direction, which does nothing unless the element is already displayed.

The .view-panel.active rule (also (0, 2, 0)) applies when JS adds the active class, and it sets display: flex. The two (0, 2, 0) rules never compete: one declares display, the other declares flex-direction, so source order between them does not matter.

The lesson

Layout helper classes should not set display. The element that controls visibility owns the display property. A class that sets layout direction on a visibility-toggled element should leave display out of its rule entirely; matching the visibility rule's specificity only hands the decision to source order.

Most PDF Extractors Use the Wrong API: Here’s What We Built Instead

Bonzai2Carn — Tue, 23 Jun 2026 14:30:00 +0000

TLDR: PDF.js exposes three data sources at three fidelity levels. The industry default is the one that was built as a convenience wrapper for the other two. This is not laziness, because there are real reasons it happened, but it is the root cause of why most frontend PDF extraction breaks on academic papers, publications, and anything that isn't a corporate report.

PDF Processor
Repo

The Hierarchy Nobody Talks About

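When people say "PDF extraction," they usually mean getTextContent() or its equivalent in another library. Text items, positions, advance widths. This is the layer that pdf-parse and almost every browser-side PDF tool reads, and it is the same kind of data that Python libraries such as pdfplumber and PyMuPDF expose by default.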

Here is what getTextContent() actually is: a derived, post-processed view of getOperatorList(). PDF.js collects text paint operators from the raw operator stream, applies the current CTM, and packages the results. It is not reading a different part of the PDF. It is giving you a processed version of data that is already available in a more complete form.

Above that: getStructTree(). Not derived from the paint stream at all. It reads the logical structure tree that the document catalog points to (StructTreeRoot). Tables, paragraphs, headings, figures, formulas. Every glyph run tagged with its semantic role, linked to the paint stream via Marked Content IDs.

The hierarchy is:

getStructTree()     # what the document means
getOperatorList()   # what the document draws
getTextContent()    # a filtered view of what the document draws

Most tools use the third one.

Why This Happened

There are real reasons getTextContent() became the default:

It is good enough for 80% of documents. Corporate reports, legal briefs, and simple technical manuals have straightforward text flows. getTextContent() gives you positioned text items and that is enough to reconstruct paragraphs and headers.

The struct tree is frequently wrong. Word exports tag table cells as <P>. InDesign creates arbitrary nesting that reflects layer creation order, not reading order. A tool that trusts the struct tree on arbitrary input will fail on a significant fraction of documents.

The MCID join is not automatic. PDF.js does not give you "text item → struct tree node" in one call. You have to walk the operator list, maintain an MCID stack across each BMC/BDC open and EMC close, record the current MCID for each text paint op, and join that to the struct tree. That is non-trivial to implement correctly.

Toolchain inertia. PDFBox, pdfminer, and the other foundational tools are 10–15 years old. They prioritized the text content API. Everything built on top of them inherited the same priority.

These are valid reasons. They are also not the same as "getTextContent() is correct."
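
The stack logic behind the MCID join can be sketched on a simplified operator shape. Everything here is an assumption for illustration: the { op, mcid, text } records are not the PDF.js operator list format, the function names are hypothetical, and the "innermost tagged sequence wins" rule is a simplification.

```javascript
// Simplified, hypothetical operator shape:
//   { op: 'BDC', mcid } | { op: 'BMC' } | { op: 'EMC' } | { op: 'Tj', text }
// This only illustrates the stack walk; real operator lists look different.
function joinTextToMcid(operators) {
    const stack = [];
    const runs = [];
    for (const item of operators) {
        if (item.op === 'BDC' || item.op === 'BMC') {
            // BMC has no property list, so it never carries an MCID.
            const hasMcid = item.op === 'BDC' && Number.isInteger(item.mcid);
            stack.push(hasMcid ? item.mcid : null);
        } else if (item.op === 'EMC') {
            stack.pop();
        } else if (item.op === 'Tj') {
            // Innermost enclosing MCID wins; null means untagged content.
            let mcid = null;
            for (let i = stack.length - 1; i >= 0; i--) {
                if (stack[i] !== null) { mcid = stack[i]; break; }
            }
            runs.push({ text: item.text, mcid });
        }
    }
    return runs;
}

// Joins text runs to struct tree roles via a Map of MCID -> role.
function attachRoles(runs, roleByMcid) {
    return runs.map((run) => ({ ...run, role: roleByMcid.get(run.mcid) ?? null }));
}
```

The role map would come from walking the struct tree and collecting each leaf's MCID. Runs with a null role are the ones that still need geometric inference.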

What You Miss

When you use only getTextContent(), you miss:

Table structure. The struct tree gives you Table → TR → TD directly. getTextContent() gives you positioned text items that happen to be inside table cells. You have to infer the table grid from item positions, which requires heuristics and thresholds and fails on borderless tables.

Display-math blocks. LaTeX equation environments produce glyph runs that PDF.js collapses into single items in getTextContent(). The full equation arrives as one item whose width spans the display block. Individual characters are not surfaced. Trying to detect column boundaries on a LaTeX paper using item X-extents will find that display equations bridge every candidate column gap.

Column geometry. Multi-column layouts in publishing tools often include explicit vertical rules, which are path operators drawing a line at the column boundary. These are in getOperatorList(). They are not in getTextContent(). Column detection from text positions is an inference. Column detection from an explicit vertical rule at the same X is a fact.

Reading order. getTextContent() returns items in paint order, not reading order. For a 2-column document, that might be reading order, or it might not, depending on how the PDF was authored. The struct tree, for well-tagged documents, returns leaves in reading order by design.
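
For the column geometry case, the check itself is small once line segments have been pulled out of the path operators. The sketch below starts from that point. The segment shape, the function name, and both thresholds are assumptions chosen for illustration, not values from the repo.

```javascript
// Hypothetical input: straight segments already extracted from path
// operators, in page coordinates: { x1, y1, x2, y2 }.
function findColumnRules(segments, pageHeight, { minCoverage = 0.6, maxSkew = 1 } = {}) {
    return segments
        // Near-vertical: both endpoints share (almost) the same X.
        .filter((s) => Math.abs(s.x1 - s.x2) <= maxSkew)
        // Tall enough to be a column rule rather than a table border or tick mark.
        .filter((s) => Math.abs(s.y1 - s.y2) >= pageHeight * minCoverage)
        .map((s) => (s.x1 + s.x2) / 2)
        .sort((a, b) => a - b);
}
```

Each returned X is a candidate column boundary that needs no inference from text positions.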

The Cascade Is Not Optional

The correct architecture is a cascade:

Try getStructTree(). If table regions are present, extract them directly. No column detection needed.
Try getOperatorList() geometry: full-height vertical rules, clip stack. If column rules are present, use them directly. No text-based inference needed.
Fall through to getTextContent() with geometric inference (bipartite partition, stream detection). This is correct for untagged documents with minimal path geometry.

This is not three times the work. Tiers 1 and 2 are fast exits. If the struct tree has tables, you skip all the geometry inference for those zones. If a vertical rule is present, you skip the bipartite algorithm. The fallback (Tier 3) only runs when no higher-fidelity signal is available, which is most documents today, but not most well-authored documents.
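
The fast-exit shape of the cascade can be sketched as a loop over tiers. The tier functions here are hypothetical stand-ins. In a real pipeline they would wrap page.getStructTree(), page.getOperatorList(), and page.getTextContent(), all of which return promises, which is why the loop is async.

```javascript
// Each tier is { name, extract }, where extract(page) resolves to a result
// or to null when that tier has no usable signal. Tiers run in order and
// the first non-null result wins, so later tiers are never evaluated.
async function runCascade(page, tiers) {
    for (const { name, extract } of tiers) {
        const result = await extract(page);
        if (result) return { tier: name, result };
    }
    return { tier: null, result: null };
}
```

Because the loop returns on the first hit, the cost of the higher tiers is only paid on documents that lack a struct tree or vector rules.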

The Uncomfortable Part

Running the cascade as a diagnostic on three test PDFs found that all three PDFs fall through to Tier 3. No struct tree, no vertical column rules, in any of them.

This could be read as: the cascade doesn't help for documents people actually use.

The correct reading is: the test suite is three PDFs, and all three happen to be untagged. Amazon earnings releases, Siemens engineering manuals, and LaTeX preprints produce no struct tree output by default. But PDFs exported from Microsoft Word with the "Document structure tags for accessibility" option, from Adobe Acrobat with the accessibility features enabled, or from InDesign with the tagging export all produce struct trees.

The cascade will be exercised when the document population expands. The diagnostic confirms the fallback is correct. The architecture is in place. The next step is the display-math filter: two lines in the fallback scan that make the LaTeX failure case work without touching anything else.

Why Splitting a 2,500-Line File Broke Our Architecture

Bonzai2Carn — Sat, 20 Jun 2026 14:30:00 +0000

TLDR

A single 2500-line index.html with all JS inline worked. Splitting it into modules surfaced three classes of bugs: arrow functions with broken $(this), initialization order errors from circular-looking imports, and implicit state dependencies that were invisible inside a shared closure. The bugs were not created by the split. They were revealed by it.

Schema CAD Editor

The Assumption That Seemed Reasonable

A large single-file codebase is a starting point. You iterate fast, everything is in one place, there are no import errors, no build steps. When the codebase is ready for production, you split it into proper modules. Splitting is a cleanup task, not a risky refactor.

This assumption is correct about the first part: monolith development is fast. It is wrong about the second: splitting is not cleanup. It is a refactor that must handle every bug the monolith hid.

When It Failed

The first failure was the trace wire button. The click handler used an arrow function with $(this):

$('#traceWireBtn').on('click', () => {
    const mode = $(this).data('mode'); // arrow function: 'this' is not the button
    self.setMode(mode);
});

In the monolith, $(this) in an arrow function refers to the outer this, which was the window-level module object. The data('mode') attribute happened to exist on the module object from a previous assignment. The handler worked.

After the split, the module object no longer had data('mode'). The attribute was on the button element. mode was undefined. The trace wire button silently did nothing.
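
The binding difference can be reproduced without jQuery. In this sketch, fire() is a hypothetical stand-in for the way jQuery invokes a handler with this set to the element, and the objects are plain stand-ins for the button and the module.

```javascript
// Stand-in for how jQuery calls a handler: `this` is bound to the element.
function fire(handler, element) {
    return handler.call(element, { currentTarget: element });
}

const button = { label: 'button' };

const moduleObject = {
    label: 'module',
    // Arrow function: `this` is inherited from makeArrowHandler, where it is
    // moduleObject. The .call() inside fire() cannot rebind it.
    makeArrowHandler() {
        return () => this.label;
    },
    // Regular function: `this` is whatever the caller binds, here the element.
    makeRegularHandler() {
        return function () { return this.label; };
    },
    // Arrow function that reads the element from the event instead of `this`.
    makeEventHandler() {
        return (e) => e.currentTarget.label;
    },
};
```

The arrow handler reports the module, the other two report the button. Reading e.currentTarget keeps arrow syntax available without depending on this at all.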

The second failure was accordion panels. Their toggle logic was:

$('.accordion-header').on('click', function() {
    $(this).next('.accordion-body').slideToggle();
    // 'this' needed to be the clicked header
});

This used a regular function correctly. But the CSS for .active state was in a different section of the monolith that was moved to a separate CSS file during the split. The toggle added .active to the element; the CSS for .active was in a file that was not loaded at that point in the HTML. The panels visually appeared broken.

The third failure was theme initialization. The dark mode toggle set a class on document.body. After the split, the toggle module loaded before the theme initialization module. The theme module read localStorage.getItem('theme') and set the class. The toggle module read the current class from document.body to set its initial state. Because initialization order was not guaranteed, the toggle sometimes read the class before the theme module set it.

What Was Actually Wrong

Shared closure scope in the monolith masked three categories of problems:

this binding: Arrow functions used $(this) and happened to work because the outer this contained the expected data. The coincidence ended when modules changed what this referred to.
Load order: The monolith was a single script block. Everything initialized in order. Modules loaded in any order the HTML specified. Implicit dependencies on load order became explicit failures.
CSS scope: CSS was inline in the same file as the JS. After extraction to separate CSS files, rules needed to be included in the right order. Two rules in the monolith with accidental order-dependency broke when they were in different files with different load positions.

What Got Deleted

The monolith itself. 2500 lines of HTML/CSS/JS became 11 files: 2 CSS files and 9 JS modules organized into core/, canvas/, and features/ directories.

The deletion also cleared the test surface for the three bug classes: the broken arrow functions, the initialization race, and the CSS load-order issues were all visible and fixable once isolated into their own files.

What Replaced It

Module files with explicit exports and imports. Each module owns its state. Imports are explicit. The initialization order is determined by a top-level init function in core/svgEditor.js that calls each module's init in the correct sequence.

The arrow function handlers were converted to regular function expressions. The CSS was loaded in a fixed order by the HTML. The initialization race was resolved by explicit sequencing.
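
The post describes a hand-written init sequence. One variation, sketched here under assumptions, is to have each module declare what it depends on and derive the order from that. The module shape { name, deps, init } and the function name are hypothetical, not the structure of core/svgEditor.js.

```javascript
// Runs each module's init() after all of its declared dependencies.
// Throws on unknown or circular dependencies instead of guessing.
function initInOrder(modules) {
    const byName = new Map(modules.map((m) => [m.name, m]));
    const done = new Set();
    const visiting = new Set();
    const order = [];

    function visit(name) {
        if (done.has(name)) return;
        if (visiting.has(name)) throw new Error(`Circular dependency at ${name}`);
        const mod = byName.get(name);
        if (!mod) throw new Error(`Unknown module ${name}`);
        visiting.add(name);
        for (const dep of mod.deps ?? []) visit(dep);
        visiting.delete(name);
        mod.init();
        done.add(name);
        order.push(name);
    }

    for (const m of modules) visit(m.name);
    return order; // the sequence in which init() actually ran
}
```

With the toggle declaring a dependency on the theme module, the theme class is always set on document.body before the toggle reads it, regardless of the order the modules are listed in.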

The Lesson

A monolith does not hide bugs by preventing them. It hides them by providing the environment they need to not manifest. When the environment changes (a module split), the bugs become visible. Splitting is not the cause. The delay is.

Building a High-Performance CAD Engine in Vanilla JavaScript (No Frameworks)

Bonzai2Carn — Thu, 18 Jun 2026 14:30:00 +0000

The modern web is built on frameworks. React, Vue, and Svelte have made building UI easier, but they’ve also made us comfortable with a certain amount of "abstraction tax."

When I started building the Schema Editor, a browser-native tool for electrical and architectural schematics, I hit a wall with that tax almost immediately.

CAD tools are different from standard CRUD apps. You aren't just clicking buttons; you're manipulating thousands of SVG elements, running real-time pathfinding algorithms, and handling complex coordinate transformations (Tilt, Yaw, Perspective) all at once.

The Bottleneck of Re-rendering

In a framework-based app, state changes trigger a re-render. If I move a component in a diagram, I have to update its position, recalculate all its connected "Manhattan" wires, and re-draw the selection handles.

Doing this through a virtual DOM or a reactive dependency graph adds milliseconds of latency per frame. On a complex diagram, that’s the difference between 60fps and a stuttering mess.

The Solution: Direct DOM & Specialized Kernels

I decided to build the engine with Zero Dependencies. Just pure Vanilla JS and the SVG DOM.

Direct Manipulation: Instead of waiting for a framework to batch updates, we update SVG attributes (x, y, d) directly in the mousemove handler.
Spatial Indexing: To handle "hit detection" (knowing which wire you're clicking), we don't iterate through every element. We use a custom KD-Tree spatial index that lets us query the canvas in O(log n) time on average for a balanced tree.
Geometry Pipeline: We built a 4-phase geometry pipeline that handles everything from coordinate snapping to 3D perspective transformations before the data even touches the DOM.
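
For readers who have not met the structure, here is a minimal 2D KD-tree with a nearest-point query. It is a generic textbook sketch, not the editor's index: the { x, y } point shape and both function names are assumptions, and real wire hit-testing would query segments rather than points.

```javascript
// Builds a 2D KD-tree. Splits alternate between x and y at each depth,
// using the median point so the tree stays balanced.
function buildKdTree(points, depth = 0) {
    if (points.length === 0) return null;
    const axis = depth % 2 === 0 ? 'x' : 'y';
    const sorted = [...points].sort((a, b) => a[axis] - b[axis]);
    const mid = Math.floor(sorted.length / 2);
    return {
        point: sorted[mid],
        axis,
        left: buildKdTree(sorted.slice(0, mid), depth + 1),
        right: buildKdTree(sorted.slice(mid + 1), depth + 1),
    };
}

// Returns { point, dist2 } for the stored point closest to `target`.
function nearest(node, target, best = { point: null, dist2: Infinity }) {
    if (!node) return best;
    const dx = node.point.x - target.x;
    const dy = node.point.y - target.y;
    const dist2 = dx * dx + dy * dy;
    if (dist2 < best.dist2) best = { point: node.point, dist2 };

    // Descend first into the side of the split that contains the target.
    const diff = target[node.axis] - node.point[node.axis];
    const [near, far] = diff < 0 ? [node.left, node.right] : [node.right, node.left];
    best = nearest(near, target, best);
    // Cross the splitting line only if a closer point could lie beyond it.
    if (diff * diff < best.dist2) best = nearest(far, target, best);
    return best;
}
```

The pruning test in the last step is what lets most of the tree go unvisited on a typical query.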

The Result

The result is an editor that feels like a native desktop application. It’s light (under 100KB gzipped), starts instantly, and works perfectly on mobile browsers where CPU resources are limited.

It also makes the code incredibly approachable for contributors. You don’t need to learn a specific framework’s lifecycle or build system to add a new symbol to our Electrical, Software, or Construction kits. You just need to know JavaScript.

Join the Project

Schema Editor is free and open source. We're building a tool that respects the performance requirements of engineering while embracing the accessibility of the web.

Check out the code and the live demo here:
github.com/carnworkstudios/schema-editor
