I've been building software for over 20 years across banking, healthcare, financial SaaS, and media. At Stuff, one of New Zealand's largest news platforms, I worked on systems that consumed and produced structured content at scale. In banking and healthcare, I dealt with legacy systems that predate REST APIs entirely: fixed-format financial reports, statement exports, internal tools that print structured text to stdout because that was the interface in 2003 and nobody has touched it since.
Back in 2012 I worked as a remote contractor for SITA, a Canadian company in the aviation industry, parsing CSV flight fare data and loading it into a database. Hundreds of files, each with a slightly different layout depending on the source system, each needing to be mapped to the same schema. We wrote a lot of custom parsing code in Java for what was, in the end, just structured text with a consistent shape.
The problem of "here is some structured text, I need the data inside it" is one of the most persistent problems in software. And the tooling for it, honestly, has not kept up.
A few years ago it found me again. I was working on a side project that needed to extract structured data from a set of web pages. The pages were clearly generated from a template. Every one of them had the same structure. Just different data.
So I reached for the standard tools.
The cheerio phase
If you've scraped HTML with cheerio, you know the drill. You open DevTools, hunt for a CSS selector that's stable enough to rely on, and write something like this:
const name = $(".product-title h1").text().trim();
const price = $(".price-box .current-price").text().replace("$", "").trim();
const brand = $(".specs-table tr:nth-child(2) td:last-child").text().trim();
Fine for one or two fields. But the page I was working with had nested data: a list of items, each with its own sub-fields. Now I'm writing loops, mapping over $(".review-list li"), extracting children by index, trimming whitespace everywhere, and the code looks nothing like the data I'm trying to produce.
And it's fragile. The site updates a class name, rearranges a div, and your selectors silently return empty strings. You don't find out until your pipeline starts producing garbage.
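One way to surface that failure earlier is to validate the extracted fields before they enter the pipeline. A minimal sketch, where requireNonEmpty is a hypothetical helper and the scraped object stands in for whatever the selectors returned:

```typescript
// Hypothetical guard: fail loudly when a selector-based extraction
// quietly produced empty strings.
function requireNonEmpty(fields: Record<string, string>): Record<string, string> {
  const empty = Object.keys(fields).filter((key) => fields[key].trim() === "");
  if (empty.length > 0) {
    throw new Error("Empty fields (selectors may be stale): " + empty.join(", "));
  }
  return fields;
}

// Stand-in for the values the selectors returned after a site redesign.
const scraped = { name: "Sony WH-1000XM5", price: "", brand: "" };

let failure = "";
try {
  requireNonEmpty(scraped);
} catch (error) {
  failure = error instanceof Error ? error.message : String(error);
}
console.log(failure); // Empty fields (selectors may be stale): price, brand
```

This only detects the breakage. It does nothing to make the selectors themselves less brittle.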
The puppeteer detour
Puppeteer can fetch the page, but it still doesn't help with extracting the data. On top of that, you're spinning up a full headless Chromium instance to read a page that is, in the end, just text. The startup time, the memory, the flakiness of waitForSelector all felt like overkill for what I was trying to do.
In enterprise and regulated environments this gets worse. Security policies, sandboxed build agents, and locked-down CI pipelines often make running a headless browser simply not an option. I've been in those environments. A zero-dependency, pure-text approach is not just nicer, it's sometimes the only viable one.
Both tools (cheerio and puppeteer) had another limitation: they're HTML-only. But the world is full of structured text that isn't HTML. Log files. Emails. Markdown documents. Fixed-format financial reports. API responses rendered as plain text. I wanted something that could work on all of it.
A different way to think about it
At some point, frustrated and staring at yet another brittle CSS selector, I caught myself thinking about the pages differently.
EJS is a popular JavaScript template engine. Servers use it to generate HTML by calling something like ejs.render(template, data), where the template is an HTML file with placeholders and the data is the object that fills them in. The template already encodes all the rules for how data maps to text.
If the forward direction exists (template + data = text) then the reverse must be possible too. Text + template = data.
I had never seen this idea anywhere. I googled around, found nothing. The concept felt obvious in retrospect but apparently nobody had built it, at least not for general text.
I made a mental note and moved on. Life got busy. The idea sat in the back of my head for years.
Building it (with a little help)
Recently, with AI tooling finally good enough to act as a real coding partner, I sat down and built it in a day.
The first version was naive. I just tried to convert the EJS template directly into a regex and match it against the rendered string. It worked for simple cases and completely fell apart the moment loops entered the picture.
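To make the naive approach concrete, here is a rough sketch of the idea, not the actual first version of the library. naiveReverse and escapeRegex are hypothetical helpers: literal text is escaped, each output tag becomes a lazy capture group, and everything else in the template is treated as plain text.

```typescript
// Naive sketch: turn a flat EJS-style template into a single regex.
// Hypothetical helper names; this is not the reverse-ejs API.
function escapeRegex(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function naiveReverse(template: string, text: string): Record<string, string> | null {
  const tag = /<%=\s*([\w.]+)\s*%>/g;
  const names: string[] = [];
  let pattern = "";
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(template)) !== null) {
    // Literal text before the tag, then one lazy capture group for the tag.
    pattern += escapeRegex(template.slice(last, m.index)) + "([\\s\\S]*?)";
    names.push(m[1]);
    last = m.index + m[0].length;
  }
  pattern += escapeRegex(template.slice(last));
  const match = new RegExp("^" + pattern + "$").exec(text);
  if (match === null) return null;
  const result: Record<string, string> = {};
  names.forEach((name, i) => {
    result[name] = match[i + 1];
  });
  return result;
}

const flat = naiveReverse(
  '<h1><%= name %></h1><span class="price">$<%= price %></span>',
  '<h1>Sony WH-1000XM5</h1><span class="price">$348.00</span>'
);
console.log(flat); // { name: "Sony WH-1000XM5", price: "348.00" }

// A loop tag is just more literal text to this version, so nothing matches.
const looped = naiveReverse(
  "<% xs.forEach(x => { %><li><%= x %></li><% }) %>",
  "<li>a</li><li>b</li>"
);
console.log(looped); // null
```

The second call shows the failure mode: the sketch has no notion that the forEach tag means "repeat what follows".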
The breakthrough was realising I needed an AST first.
AST stands for Abstract Syntax Tree, a concept borrowed from how compilers work. When a compiler reads your code, it doesn't just scan it left to right as a string of characters. It first parses the text into a tree structure that represents the meaning of the code: this block is a function, inside it there's a loop, inside the loop there's a variable assignment, and so on. Each node in the tree is a meaningful piece of the program, and the relationships between nodes capture how those pieces nest and relate to each other. The word "abstract" means it strips away irrelevant details like whitespace and punctuation and keeps only the structure that matters.
For reverse-ejs, this meant I couldn't just scan the EJS template as text and replace tags with regex patterns on the fly. I needed to first parse the whole template into a tree that captured its real structure: here is a literal HTML chunk, here is a variable output tag, here is a loop that contains more literal HTML and more variable tags inside it, here is a conditional with two branches. Only once that tree existed could I walk it and generate the right regex for each node.
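A minimal sketch of what such a tree could look like, under simplifying assumptions: it only understands output tags and loops written as list.forEach(item => { ... }), and leaves conditionals out. The node shapes and parseTemplate are illustrative names, not the library's internals.

```typescript
// Illustrative node types for a tiny subset of EJS.
type LoopNode = { kind: "loop"; list: string; item: string; body: TemplateNode[] };
type TemplateNode =
  | { kind: "literal"; text: string }
  | { kind: "output"; path: string }
  | LoopNode;

function parseTemplate(template: string): TemplateNode[] {
  const root: TemplateNode[] = [];
  // Stack of "where new nodes go": the root list, or the body of an open loop.
  const stack: TemplateNode[][] = [root];
  const tag = /<%(=?)([\s\S]*?)%>/g;
  let last = 0;
  let m: RegExpExecArray | null;
  while ((m = tag.exec(template)) !== null) {
    const current = stack[stack.length - 1];
    if (m.index > last) {
      current.push({ kind: "literal", text: template.slice(last, m.index) });
    }
    const code = m[2].trim();
    if (m[1] === "=") {
      current.push({ kind: "output", path: code });
    } else {
      const open = /^([\w.]+)\.forEach\(\s*\(?\s*(\w+)\s*\)?\s*=>\s*\{$/.exec(code);
      if (open !== null) {
        const loop: LoopNode = { kind: "loop", list: open[1], item: open[2], body: [] };
        current.push(loop);
        stack.push(loop.body); // following nodes belong to the loop body
      } else if (/^\}\s*\)\s*;?$/.test(code) && stack.length > 1) {
        stack.pop(); // closing "})" ends the innermost loop
      } else {
        throw new Error("Unsupported tag in this sketch: " + code);
      }
    }
    last = m.index + m[0].length;
  }
  if (last < template.length) {
    stack[stack.length - 1].push({ kind: "literal", text: template.slice(last) });
  }
  if (stack.length !== 1) throw new Error("Unclosed loop");
  return root;
}

const ast = parseTemplate(
  "<ul><% reviews.forEach(review => { %><li><%= review.author %></li><% }) %></ul>"
);
console.log(ast.map((node) => node.kind)); // ["literal", "loop", "literal"]
```

The loop node owns its own list of children, which is exactly the nesting information a flat string scan loses.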
Once that clicked, everything else fell into place. Loops became repeating capture groups. Conditionals became alternations. Nested objects became dot-notation in the capture group names.
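Here is one way the tree walk could work, as a sketch rather than the library's real algorithm. JavaScript regexes only keep the last iteration of a repeated group, so this version captures the whole loop region once and then re-scans it item by item. The tree is built by hand, all names are hypothetical, and keys are kept flat by dropping the loop variable prefix.

```typescript
type LoopNode = { kind: "loop"; list: string; item: string; body: TemplateNode[] };
type TemplateNode =
  | { kind: "literal"; text: string }
  | { kind: "output"; path: string }
  | LoopNode;
type Extracted = { [key: string]: string | Extracted[] };

function escapeRegex(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// With capture=true, every top-level output tag and loop gets exactly one group.
function toPattern(nodes: TemplateNode[], capture: boolean): string {
  return nodes
    .map((node) => {
      if (node.kind === "literal") return escapeRegex(node.text);
      if (node.kind === "output") return capture ? "([\\s\\S]*?)" : "[\\s\\S]*?";
      const repeated = "(?:" + toPattern(node.body, false) + ")*";
      return capture ? "(" + repeated + ")" : repeated;
    })
    .join("");
}

function extract(nodes: TemplateNode[], text: string, itemName?: string): Extracted | null {
  const match = new RegExp("^" + toPattern(nodes, true) + "$").exec(text);
  if (match === null) return null;
  const result: Extracted = {};
  let group = 1;
  for (const node of nodes) {
    if (node.kind === "literal") continue;
    const captured = match[group++];
    if (node.kind === "output") {
      // Inside a loop, "review.author" is stored as "author" on the item.
      const prefix = itemName === undefined ? "" : itemName + ".";
      const key =
        prefix !== "" && node.path.startsWith(prefix) ? node.path.slice(prefix.length) : node.path;
      result[key] = captured;
    } else {
      result[node.list] = extractItems(node, captured);
    }
  }
  return result;
}

// Re-scan the captured loop region one iteration at a time.
function extractItems(loop: LoopNode, text: string): Extracted[] {
  const one = new RegExp(toPattern(loop.body, false), "y");
  const items: Extracted[] = [];
  let pos = 0;
  while (pos < text.length) {
    one.lastIndex = pos;
    const m = one.exec(text);
    if (m === null || m[0].length === 0) break;
    const item = extract(loop.body, m[0], loop.item);
    if (item !== null) items.push(item);
    pos += m[0].length;
  }
  return items;
}

const tree: TemplateNode[] = [
  { kind: "literal", text: "<h1>" },
  { kind: "output", path: "name" },
  { kind: "literal", text: "</h1><ul>" },
  {
    kind: "loop",
    list: "reviews",
    item: "review",
    body: [
      { kind: "literal", text: "<li>" },
      { kind: "output", path: "review.author" },
      { kind: "literal", text: ": " },
      { kind: "output", path: "review.text" },
      { kind: "literal", text: "</li>" },
    ],
  },
  { kind: "literal", text: "</ul>" },
];

const page = "<h1>Sony WH-1000XM5</h1><ul><li>Alice: Great</li><li>Bob: Solid</li></ul>";
const extracted = extract(tree, page);
console.log(JSON.stringify(extracted));
// {"name":"Sony WH-1000XM5","reviews":[{"author":"Alice","text":"Great"},{"author":"Bob","text":"Solid"}]}
```

The sketch ignores conditionals and real nested-object paths, and its lazy matching can mis-split items whose values contain the surrounding literal text.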
The first time it worked end-to-end, I pasted in a product page's HTML, wrote a quick EJS template, ran the function, and got back a clean JSON object. I knew the approach was right.
npm install reverse-ejs
import { reverseEjs } from "reverse-ejs";
const template = `
<div class="product">
<h1><%= name %></h1>
<span class="price">$<%= price %></span>
<% reviews.forEach(review => { %>
<li><strong><%= review.author %></strong>: <%= review.text %></li>
<% }) %>
</div>`;
const html = await (await fetch(url)).text(); // url: the product page address (placeholder)
const data = reverseEjs(template, html, { flexibleWhitespace: true });
// {
// name: "Sony WH-1000XM5",
// price: "348.00",
// reviews: [
// { author: "Alice", text: "Best headphones I've ever owned." },
// { author: "Bob", text: "Great sound quality." }
// ]
// }
The template is the schema. You write what the page looks like, and you get back the data that produced it.
It works on more than HTML
The library never assumes its input is HTML. It just matches text against a template, which means it works on anything with a consistent structure.
Remember the SITA project from 2012? The fare files were not simple one-row-per-record CSVs. Each fare was spread across four consecutive lines, one for each record type: route, pricing, rules, and dates. They looked like this:
RTE,AC,YYZ,GRU,YLOWCA,Y,OW
PRC,542.00,98.40,45.00,685.40,CAD
RUL,21D,7,30,NO,NO
DAT,2012-03-01,2012-05-31,2012-01-15,2012-05-25
RTE,AC,YYZ,GRU,BLOWCA,Y,OW
PRC,489.00,98.40,45.00,632.40,CAD
RUL,14D,7,30,NO,YES
DAT,2012-03-01,2012-05-31,2012-01-15,2012-05-25
RTE,AC,YVR,LHR,YFLEX,C,RT
PRC,1240.00,187.60,95.00,1522.60,CAD
RUL,NONE,NONE,365,YES,NO
DAT,2012-04-01,2012-06-30,2012-02-01,2012-06-15
A standard CSV parser reads one row at a time. It has no concept of a record that spans multiple lines with different structures. You end up writing a state machine: read a line, check the prefix, decide which object you are building, accumulate fields, flush when you hit the next RTE line. We wrote exactly that, and it was fiddly and brittle.
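For comparison, the kind of state machine described above might look roughly like this. It is a reconstruction for illustration, not the original Java code, and the Fare shape simply keeps each record type's raw fields.

```typescript
// Illustrative shape: raw fields per record type, without the line prefix.
interface Fare {
  route: string[];
  pricing: string[];
  rules: string[];
  dates: string[];
}

function parseFares(file: string): Fare[] {
  const fares: Fare[] = [];
  let current: Fare | null = null;
  for (const rawLine of file.split("\n")) {
    const line = rawLine.trim();
    if (line === "") continue;
    const [prefix, ...fields] = line.split(",");
    if (prefix === "RTE") {
      // A new RTE line flushes the fare being built and starts the next one.
      if (current !== null) fares.push(current);
      current = { route: fields, pricing: [], rules: [], dates: [] };
    } else if (current === null) {
      throw new Error("Record line before any RTE line: " + line);
    } else if (prefix === "PRC") {
      current.pricing = fields;
    } else if (prefix === "RUL") {
      current.rules = fields;
    } else if (prefix === "DAT") {
      current.dates = fields;
    } else {
      throw new Error("Unknown record type: " + prefix);
    }
  }
  if (current !== null) fares.push(current); // flush the last fare
  return fares;
}

const sample = [
  "RTE,AC,YYZ,GRU,YLOWCA,Y,OW",
  "PRC,542.00,98.40,45.00,685.40,CAD",
  "RUL,21D,7,30,NO,NO",
  "DAT,2012-03-01,2012-05-31,2012-01-15,2012-05-25",
  "RTE,AC,YYZ,GRU,BLOWCA,Y,OW",
  "PRC,489.00,98.40,45.00,632.40,CAD",
  "RUL,14D,7,30,NO,YES",
  "DAT,2012-03-01,2012-05-31,2012-01-15,2012-05-25",
].join("\n");

const parsedFares = parseFares(sample);
console.log(parsedFares.length); // 2
console.log(parsedFares[1].rules); // ["14D", "7", "30", "NO", "YES"]
```

Even in this trimmed form, the mapping from field position to meaning (which index is the carrier, which is the total) still has to be written separately.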
With reverse-ejs, the whole file becomes a single template:
const template = `<% fares.forEach(f => { %>RTE,<%= f.carrier %>,<%= f.origin %>,<%= f.destination %>,<%= f.fareBasis %>,<%= f.cabin %>,<%= f.direction %>
PRC,<%= f.baseFare %>,<%= f.taxes %>,<%= f.surcharge %>,<%= f.total %>,<%= f.currency %>
RUL,<%= f.advancePurchase %>,<%= f.minStay %>,<%= f.maxStay %>,<%= f.refundable %>,<%= f.changeable %>
DAT,<%= f.validFrom %>,<%= f.validTo %>,<%= f.firstSale %>,<%= f.lastSale %>
<% }) %>`;
reverseEjs(template, fareFile, {
types: { baseFare: "number", taxes: "number", surcharge: "number", total: "number" },
});
// {
// fares: [
// { carrier: "AC", origin: "YYZ", destination: "GRU", fareBasis: "YLOWCA",
// cabin: "Y", direction: "OW", baseFare: 542.00, taxes: 98.40,
// surcharge: 45.00, total: 685.40, currency: "CAD",
// advancePurchase: "21D", minStay: "7", maxStay: "30",
// refundable: "NO", changeable: "NO",
// validFrom: "2012-03-01", validTo: "2012-05-31",
// firstSale: "2012-01-15", lastSale: "2012-05-25" },
// ...
// ]
// }
No state machine. No prefix checking. No wondering whether the RUL line you just read belongs to the fare above or below it. The template describes the shape of the data and the library figures the rest out. I genuinely wish this had existed back then.
And it goes further than CSV. In financial services I've seen reports like this coming out of legacy systems, with no API and no JSON, just text that has been printed the same way for fifteen years:
ACCOUNT SUMMARY - 2026-04-10
================================
Account: ACC-00123 John Smith
Currency: USD
TRANSACTIONS
------------
2026-04-08 PAYMENT RECEIVED + 5,000.00 Balance: 12,430.50
2026-04-09 WIRE TRANSFER OUT - 1,200.00 Balance: 11,230.50
2026-04-10 SERVICE FEE - 15.00 Balance: 11,215.50
Here is what extracting it looks like:
const template = `ACCOUNT SUMMARY - <%= date %>
================================
Account: <%= accountId %> <%= accountName %>
Currency: <%= currency %>
TRANSACTIONS
------------
<% transactions.forEach(t => { %><%= t.date %> <%= t.description %> <%= t.amount %> Balance: <%= t.balance %>
<% }) %>`;
reverseEjs(template, report);
// {
// date: "2026-04-10",
// accountId: "ACC-00123",
// accountName: "John Smith",
// currency: "USD",
// transactions: [
// { date: "2026-04-08", description: "PAYMENT RECEIVED ", amount: "+ 5,000.00", balance: "12,430.50" },
// { date: "2026-04-09", description: "WIRE TRANSFER OUT ", amount: "- 1,200.00", balance: "11,230.50" },
// { date: "2026-04-10", description: "SERVICE FEE ", amount: "- 15.00", balance: "11,215.50" },
// ]
// }
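The amounts and balances come back as display strings such as "+ 5,000.00", so a small post-processing step is usually needed before doing arithmetic. A sketch of one, assuming US-style formatting (comma as thousands separator, dot as decimal point); parseAmount is a hypothetical helper, not part of the library:

```typescript
// Hypothetical helper: "+ 5,000.00" -> 5000, "- 1,200.00" -> -1200.
// Assumes comma thousands separators and a dot decimal point.
function parseAmount(raw: string): number {
  const cleaned = raw.replace(/[\s,]/g, "");
  const value = Number(cleaned);
  if (cleaned === "" || Number.isNaN(value)) {
    throw new Error("Not a recognisable amount: " + JSON.stringify(raw));
  }
  return value;
}

console.log(parseAmount("+ 5,000.00")); // 5000
console.log(parseAmount("- 1,200.00")); // -1200
console.log(parseAmount("12,430.50")); // 12430.5
```

For real money handling you would likely convert to integer cents or a decimal type rather than a floating-point number.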
Log files are another common case. Say you have a file like this:
[INFO] 2026-04-10T08:12:01Z api-gateway: server started on port 3000
[INFO] 2026-04-10T08:12:45Z auth-service: user login successful
[WARN] 2026-04-10T08:13:10Z api-gateway: response time exceeded threshold
[ERROR] 2026-04-10T08:14:22Z db-service: connection pool exhausted
[INFO] 2026-04-10T08:15:05Z auth-service: token refreshed
One template, the whole file:
const template = `<% entries.forEach(e => { %>[<%= e.level %>] <%= e.timestamp %> <%= e.service %>: <%= e.message %>
<% }) %>`;
reverseEjs(template, logFile);
// {
// entries: [
// { level: "INFO", timestamp: "2026-04-10T08:12:01Z", service: "api-gateway", message: "server started on port 3000" },
// { level: "INFO", timestamp: "2026-04-10T08:12:45Z", service: "auth-service", message: "user login successful" },
// { level: "WARN", timestamp: "2026-04-10T08:13:10Z", service: "api-gateway", message: "response time exceeded threshold" },
// { level: "ERROR", timestamp: "2026-04-10T08:14:22Z", service: "db-service", message: "connection pool exhausted" },
// { level: "INFO", timestamp: "2026-04-10T08:15:05Z", service: "auth-service", message: "token refreshed" },
// ]
// }
Same idea applies to emails, Markdown documents, CLI output. Anything you can describe with an EJS template.
The practical bits
There are a few features worth knowing about for real-world use.
Compiled templates: when you're processing many pages against the same template, compile once and reuse:
import { compileTemplate } from "reverse-ejs";
const compiled = compileTemplate(template);
for (const html of pages) {
const data = compiled.match(html);
}
Safe mode: for scraping pipelines where some pages won't match:
const data = reverseEjs(template, html, { safe: true });
if (data === null) {
// fall back to your cheerio extractor
}
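The fallback pattern generalises to a chain of extractors tried in order. A small sketch with stand-in extractors: the strict one plays the role of a template match that returns null on a miss, the loose one plays the role of a selector-based fallback. Neither uses the real reverse-ejs or cheerio APIs.

```typescript
type Extractor<T> = (input: string) => T | null;

// Try each extractor in order and return the first non-null result.
function firstMatch<T>(extractors: Extractor<T>[], input: string): T | null {
  for (const extractor of extractors) {
    const result = extractor(input);
    if (result !== null) return result;
  }
  return null;
}

interface Product {
  name: string;
}

// Stand-in for a strict template match: the whole input must fit.
const strict: Extractor<Product> = (html) => {
  const m = /^<h1>(.*)<\/h1>$/.exec(html);
  return m === null ? null : { name: m[1] };
};

// Stand-in for a looser, selector-style fallback.
const loose: Extractor<Product> = (html) => {
  const m = /<h1[^>]*>(.*?)<\/h1>/.exec(html);
  return m === null ? null : { name: m[1] };
};

const withBanner = '<div class="banner"></div><h1 class="t">Sony WH-1000XM5</h1>';
const product = firstMatch([strict, loose], withBanner);
console.log(product); // { name: "Sony WH-1000XM5" }
```

Logging which extractor succeeded is a cheap way to notice when the primary template has started missing.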
Type coercion: extracted values come back as strings by default, but you can ask for numbers, booleans, or dates:
reverseEjs(template, html, { types: { price: "number", inStock: "boolean" } });
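To show what coercion of this kind involves, here is a standalone sketch. The parsing rules are my assumptions for illustration (case-insensitive "true"/"false", anything the Date constructor accepts) and may differ from what the library actually does.

```typescript
type TargetType = "number" | "boolean" | "date";

// Illustrative coercion rules; the library's own rules may differ.
function coerce(raw: string, type: TargetType): number | boolean | Date {
  const value = raw.trim();
  if (type === "number") {
    const n = Number(value);
    if (value === "" || Number.isNaN(n)) throw new Error("Not a number: " + raw);
    return n;
  }
  if (type === "boolean") {
    if (value.toLowerCase() === "true") return true;
    if (value.toLowerCase() === "false") return false;
    throw new Error("Not a boolean: " + raw);
  }
  const d = new Date(value);
  if (Number.isNaN(d.getTime())) throw new Error("Not a date: " + raw);
  return d;
}

console.log(coerce("348.00", "number")); // 348
console.log(coerce("true", "boolean")); // true
console.log((coerce("2026-04-10", "date") as Date).toISOString()); // 2026-04-10T00:00:00.000Z
```

Throwing on unparseable input is a deliberate choice in this sketch: a silent NaN tends to travel further down a pipeline than an exception does.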
Zero dependencies. 20KB. TypeScript-native. Runs in Node.js, Bun, Deno, and the browser.
What it doesn't do
I want to be honest about the limitations, because they matter for deciding when to reach for this versus cheerio.
If the site uses heavy JavaScript rendering (a React SPA, for example), you still need Puppeteer or Playwright to get the HTML first. reverse-ejs takes over after that.
If two variables appear side by side with no text between them (<%= firstName %><%= lastName %>), the split point is ambiguous and the library returns them as a single combined value. Add a separator in the template and you're fine.
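The ambiguity is easy to see in isolation: with nothing between two tags, every split of the combined text is an equally valid match. A short sketch (possibleSplits is a hypothetical helper, used only to count the candidates):

```typescript
// Every way to divide one captured string between two adjacent tags.
function possibleSplits(combined: string): Array<[string, string]> {
  const splits: Array<[string, string]> = [];
  for (let i = 0; i <= combined.length; i++) {
    splits.push([combined.slice(0, i), combined.slice(i)]);
  }
  return splits;
}

const candidates = possibleSplits("JohnSmith");
console.log(candidates.length); // 10

// With a separator in the template, the literal anchors the split point.
const separated = /^(.*?) (.*)$/.exec("John Smith");
console.log(separated === null ? null : [separated[1], separated[2]]); // ["John", "Smith"]
```

Nothing in the text itself says which of the ten candidates is right, which is why a literal separator in the template is the fix.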
And if the site randomly changes its HTML structure between renders (A/B tests, CMS quirks, injected banners), the match may fail. safe: true combined with a cheerio fallback handles this gracefully.
Try it
I built a playground where you can paste your HTML and template and see the extraction live in your browser. No install, no account, your HTML never leaves your device:
👉 lucasrainett.github.io/reverse-ejs
If it solves a problem for you, I'd genuinely appreciate a star on GitHub. It's the clearest signal that the idea is useful to people beyond me.
👉 github.com/lucasrainett/reverse-ejs
And I'm curious: have you run into this problem before? What have you been using to extract structured data from template-rendered pages? Drop a comment, I'd love to know what cases I haven't thought of yet.