Turning a 4,000-node DOM into 40 components: the hard part of website-to-React

#webdev #javascript #react #nextjs

In my last post I broke down why "convert this website to React" is so much harder than scraping HTML, and I ended on the one problem I called the hardest: turning a rendered page into actual components. This post is about that problem specifically, because it's the difference between output a developer keeps and output they delete.
Here's a 2-minute demo of the tool doing the full conversion first, so you have context:
https://www.loom.com/share/00f1a18348a34770a77bb3d2b79ef641
The gap nobody warns you about
A rendered marketing page is routinely a few thousand DOM nodes. Wrapper inside wrapper inside wrapper, most of them carrying nothing but a class name and a single style.
The version of that page a human would actually write in React is maybe a few dozen components. A Nav, a Hero, a FeatureCard reused six times, a Footer.
Getting the styles right (which I covered last time, reading computed styles off the live page) gives you a page that looks correct. But if you emit it as one giant blob that mirrors the DOM one-to-one, it looks right and is completely unmaintainable. Nobody can edit it. The whole value of "get the React code" disappears the moment the React code is unreadable.
So the real task is this: from nothing but a rendered DOM, recover the component structure the original developer probably had in their head. The page does not tell you where the components are. You have to infer it.
Why this is genuinely hard
The information you want was destroyed before you ever saw the page. The original dev wrote a component; by the time it renders in the browser, that's just three more <div>s with no label saying "these belong together." Component names, props, boundaries, reuse, all of it gets compiled away into flat HTML.
So you're reverse-engineering intent from the only signals that survive into the rendered output:

Repetition. The strongest signal by far. If the same DOM shape appears several times with different content inside it, that is almost always one component rendered in a loop.
Structural rhythm. Sibling elements that share a layout pattern tend to be peers: list items, grid cells, nav links.
Semantic landmarks. Landmark elements (such as <header>, <nav>, <main>, and <footer>) and ARIA roles are rare gifts. They're the few places the original structure leaks through, and they make excellent hard boundaries.
Visual grouping. Large shifts in layout, spacing, or background color usually mark where one section ends and the next begins.

None of these is reliable on its own. Combined, they're enough to make a good guess.
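
To make the repetition signal concrete, here's a minimal sketch in TypeScript. It is not url2code's actual code: the `SketchNode` type, the `shapeOf` and `findRepeatedShapes` names, and the `minCount` threshold of 3 are all assumptions made up for illustration. The idea is only that a subtree's "shape" can be written as a string of tag names and nesting, with text ignored, so that identical shapes can be counted.

```typescript
// Hypothetical, simplified node type. A real tool would read this off the live DOM.
interface SketchNode {
  tag: string;
  text?: string;
  children: SketchNode[];
}

// Structural signature: tag names and nesting only. Text is deliberately ignored,
// so two cards with different copy produce the same signature.
function shapeOf(node: SketchNode): string {
  const inner = node.children.map(shapeOf).join(",");
  return inner ? `${node.tag}(${inner})` : node.tag;
}

// Group every non-leaf subtree by its shape and keep the shapes that occur
// at least `minCount` times (an assumed threshold).
function findRepeatedShapes(
  root: SketchNode,
  minCount = 3
): Map<string, SketchNode[]> {
  const byShape = new Map<string, SketchNode[]>();
  const visit = (node: SketchNode): void => {
    if (node.children.length > 0) {
      const key = shapeOf(node);
      const bucket = byShape.get(key) || [];
      bucket.push(node);
      byShape.set(key, bucket);
    }
    node.children.forEach(visit);
  };
  visit(root);

  const repeated = new Map<string, SketchNode[]>();
  byShape.forEach((nodes, shape) => {
    if (nodes.length >= minCount) repeated.set(shape, nodes);
  });
  return repeated;
}

// Usage: three cards with the same structure and different titles.
const card = (title: string): SketchNode => ({
  tag: "div",
  children: [
    { tag: "h3", text: title, children: [] },
    { tag: "p", text: "Some copy", children: [] },
  ],
});
const page: SketchNode = {
  tag: "section",
  children: [card("Fast"), card("Simple"), card("Cheap")],
};
const repeated = findRepeatedShapes(page);
// repeated has one entry, keyed "div(h3,p)", holding the three card nodes
```

This version recomputes signatures on every visit, which is fine for a sketch but wasteful on a page with thousands of nodes. It also only catches exact structural matches; cards that differ by one optional element would need a fuzzier comparison.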
How the conversion approaches it
The structure recovery runs on top of those signals rather than on raw markup:

Detect repeated subtrees. Walk the tree, look at the shape of each subtree rather than its text, and flag shapes that recur. Recurring shapes become candidate reusable components, and the differing content inside them becomes the data those components render.
Treat semantic tags as boundaries. Where the page uses real landmarks, honor them as the seams between top-level sections instead of guessing across them.
Collapse dead wrappers. A chain of single-child <div>s that exist only to hold one style gets flattened, so the output doesn't inherit someone else's nesting habits.
Cap the depth. Past a certain nesting level the output is forced to stay flat and readable, because a faithful-but-unreadable tree defeats the purpose.
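
As a rough illustration of the wrapper-collapsing step above, here's a sketch in TypeScript. The `WrapNode` type and both function names are hypothetical, and two things in it are my assumptions rather than anything confirmed about the tool: that a "dead wrapper" means a div with exactly one child and no text of its own, and that the wrapper's classes get carried over to the surviving child.

```typescript
// Hypothetical, simplified node type for illustration only.
interface WrapNode {
  tag: string;
  classes: string[];
  text?: string;
  children: WrapNode[];
}

// Assumption: a dead wrapper is a div with exactly one child and no text.
function isDeadWrapper(node: WrapNode): boolean {
  return node.tag === "div" && node.children.length === 1 && !node.text;
}

// Skip down through a chain of dead wrappers, carrying their class names
// onto the first node that is not a dead wrapper, then recurse into its children.
function collapseWrappers(node: WrapNode): WrapNode {
  let current = node;
  let inherited: string[] = [];
  while (isDeadWrapper(current)) {
    inherited = inherited.concat(current.classes);
    current = current.children[0];
  }
  return {
    ...current,
    classes: inherited.concat(current.classes),
    children: current.children.map(collapseWrappers),
  };
}

// Usage: two wrappers around a single link.
const nested: WrapNode = {
  tag: "div",
  classes: ["outer"],
  children: [
    {
      tag: "div",
      classes: ["inner"],
      children: [{ tag: "a", classes: ["btn"], text: "Sign up", children: [] }],
    },
  ],
};
const flat = collapseWrappers(nested);
// flat is the <a> node with classes ["outer", "inner", "btn"]
```

Moving classes onto the child is a simplification. Padding on a wrapper and padding on its child don't always render the same way, so a real implementation would have to check that the merged styles still produce the same layout before dropping the wrapper.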

The aim throughout is output that reads like something a developer would have written, not a literal transcript of the DOM.
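
The "semantic tags as boundaries" step can be sketched the same way. Again this is an illustration under assumptions, not the tool's implementation: `LandmarkNode`, `isLandmark`, and `topLevelSections` are hypothetical names, and the specific tag and role lists below are my choice of common HTML landmark elements and ARIA landmark roles.

```typescript
// Hypothetical, simplified node type for illustration only.
interface LandmarkNode {
  tag: string;
  role?: string;
  children: LandmarkNode[];
}

// Assumed lists: common landmark elements and ARIA landmark roles.
const LANDMARK_TAGS = new Set(["header", "nav", "main", "footer", "aside"]);
const LANDMARK_ROLES = new Set([
  "banner",
  "navigation",
  "main",
  "contentinfo",
  "complementary",
]);

function isLandmark(node: LandmarkNode): boolean {
  return (
    LANDMARK_TAGS.has(node.tag) ||
    (node.role !== undefined && LANDMARK_ROLES.has(node.role))
  );
}

// Walk down from the root and stop at the first landmark on each branch.
// Each landmark found this way is treated as one top-level section.
function topLevelSections(root: LandmarkNode): LandmarkNode[] {
  if (isLandmark(root)) return [root];
  const sections: LandmarkNode[] = [];
  for (const child of root.children) {
    sections.push(...topLevelSections(child));
  }
  return sections;
}

// Usage: landmarks sitting under a non-semantic mount-point div.
const body: LandmarkNode = {
  tag: "body",
  children: [
    {
      tag: "div",
      children: [
        { tag: "header", children: [{ tag: "nav", children: [] }] },
        { tag: "div", role: "main", children: [] },
        { tag: "footer", children: [] },
      ],
    },
  ],
};
const sections = topLevelSections(body);
// sections holds the header, the div with role="main", and the footer
```

Note that the nested <nav> is not reported separately, because the walk stops at the enclosing <header>. On a page with no landmarks at all this returns an empty list, which is exactly the case where the visual-grouping guesses have to take over.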
The trade-off at the center of it
Every decision here is a tension between two failure modes.
Too literal and you get the giant blob: technically faithful, practically useless.
Too aggressive and you "helpfully" merge things that shouldn't be merged, invent components that don't match how the page really works, and produce clean-looking code that's quietly wrong.
The honest target is "a sensible starting point a developer can finish," not "exactly what the original team wrote," because the second is genuinely unknowable from rendered output alone. I'd rather hand someone a structure that's 80% right and obvious to fix than one that's clever and misleading.
Where it's solid, and where it isn't
Repetition detection is the strong part. It reliably pulls out the cards, list items, and nav links that make up most of a page's reusable structure. Section boundaries drawn from semantic tags are dependable too.
The weak part is pages built with little or no semantic structure and deeply nested custom layouts, where the visual-grouping guesses get shakier. That's where most of my current work is going.
This is still the piece of url2code I'm actively improving, and I'm genuinely not sure I've found the best approach. If you've worked on DOM-to-component inference, layout segmentation, or anything adjacent, I'd really like to hear how you'd attack it.
Try it / break it
url2code converts a website URL into a Next.js + Tailwind project: paste a URL, preview the rebuilt page, download the code. It's in free closed beta.
If you do site rebuilds or migrations, throw a real page at it and tell me where the componentization falls apart, because those are the exact cases I'm tuning against. Comment here or find it at https://url2code.net.
