ETL is a solved problem in most of the software world. Data teams have been extracting, transforming, and loading structured data between systems for decades. The tooling is mature. The patterns are well documented. Nobody writes a custom script to move CRM records between platforms anymore.
Website content is a different story.
We recently finished a migration that moved over 31,000 content entries from HubSpot CMS to Contentstack. 57 content types. 11 languages. 48 database tables. And 166 modules, many of them duplicates that had been built over years by different teams without any shared governance.
That project forced us to think about website content the same way data engineers think about warehouse migrations. And the biggest lesson was not about the tools we used. It was about why those tools did not exist in the first place.
Website content is not flat data
The reason ETL tooling exists for databases and CRMs but not for websites comes down to structure.
A CRM record is a row. It has fields. Name, email, lifecycle stage. The schema is predictable and mostly flat. Moving records between systems is a mapping exercise. Match the fields, run the transformation, load the output.
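To see how simple flat-record ETL really is, here is a minimal sketch in Python. The field names are invented for illustration and do not match any real CRM's schema; the whole "transformation" is a rename map.

```python
# Flat-record ETL: moving CRM contacts between systems is a
# field-name mapping. Field names here are illustrative only.

FIELD_MAP = {
    "firstname": "first_name",
    "email": "email_address",
    "lifecyclestage": "stage",
}

def transform_contact(source_record: dict) -> dict:
    """Map source field names to destination field names."""
    return {dest: source_record.get(src) for src, dest in FIELD_MAP.items()}

contact = {"firstname": "Ada", "email": "ada@example.com", "lifecyclestage": "customer"}
print(transform_contact(contact))
# {'first_name': 'Ada', 'email_address': 'ada@example.com', 'stage': 'customer'}
```

That is the entire problem for flat data: match, rename, load.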
A website page is not a row. A single page might contain a hero section, a rich text block with embedded images, a call-to-action component, a form, a footer pulled from a global template, SEO metadata in a settings panel, and a URL path that determines where the page lives in the site architecture.
Every one of those elements is stored differently depending on the CMS. HubSpot organizes content inside modules within layout sections. Contentstack uses modular blocks within content types. WordPress uses its own block system. There is no universal content model.
So when you try to move a page from one platform to another, you are not mapping fields. You are translating an entire architectural model.
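A toy example makes the difference concrete. Both dictionaries below are hypothetical, heavily simplified shapes; neither matches a real HubSpot or Contentstack payload. The point is that the translation is structural: you are rebuilding a tree in a different architecture, not renaming columns.

```python
# A page as nested structure in a source CMS (shape is invented).
source_page = {
    "slug": "/pricing",
    "layout_sections": [
        {"module": "hero_banner_v2", "params": {"heading": "Pricing"}},
        {"module": "cta_simple", "params": {"label": "Start trial"}},
    ],
}

# Map each source module to a destination block type.
MODULE_TO_BLOCK = {"hero_banner_v2": "hero", "cta_simple": "cta"}

def translate_page(page: dict) -> dict:
    """Rebuild the page in a destination modular-block shape."""
    blocks = [
        {MODULE_TO_BLOCK[s["module"]]: s["params"]}
        for s in page["layout_sections"]
    ]
    return {"url": page["slug"], "modular_blocks": blocks}

print(translate_page(source_page))
# {'url': '/pricing', 'modular_blocks': [{'hero': {'heading': 'Pricing'}},
#  {'cta': {'label': 'Start trial'}}]}
```

Even in this stripped-down sketch, the mapping is between component types and nesting conventions, not between fields.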
The module consolidation problem
This is where our migration got interesting.
The source HubSpot portal had 166 modules. On the surface, that sounds like a rich component library. In reality, it was architectural debt.
Five different accordion modules. Four hero banners. Three CTA components doing the same thing with slightly different field names. Modules created by different developers at different times, with no documentation and no naming convention.
Before we could transform anything for the destination platform, we had to understand what we actually had. That meant auditing every module, grouping duplicates, and deciding which ones to keep and which ones to consolidate. 166 became 40. Not by cutting content, but by recognizing that five versions of the same component should be one configurable component.
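One practical way to surface those duplicates is to group modules by a normalized field signature: two modules with the same set of fields are likely the same component under different names. The module names and fields below are invented examples, not our actual portal's inventory.

```python
# Group modules by a normalized field signature to find
# consolidation candidates. Names and fields are invented.
from collections import defaultdict

modules = {
    "accordion_v1": ["title", "items"],
    "faq_accordion": ["title", "items"],
    "hero_2019": ["heading", "image", "cta_label"],
    "hero_new": ["heading", "image", "cta_label"],
    "cta_simple": ["label", "url"],
}

def signature(fields: list) -> tuple:
    """Order-independent fingerprint of a module's fields."""
    return tuple(sorted(fields))

groups = defaultdict(list)
for name, fields in modules.items():
    groups[signature(fields)].append(name)

# Any group with more than one module is a consolidation candidate.
candidates = [names for names in groups.values() if len(names) > 1]
print(candidates)
# [['accordion_v1', 'faq_accordion'], ['hero_2019', 'hero_new']]
```

In practice the signatures are fuzzier (near-identical field names, one extra optional field), so this only produces a shortlist for a human audit, not the final answer.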
This is the step that most migration plans skip entirely. Teams jump straight from "export the content" to "import it into the new platform" without ever examining the content model itself. And that is how you end up migrating architectural debt from one CMS to another.
Why the extract phase is more than an export
When data teams extract records from a source system, they pull the full schema. Every field, every relationship, every metadata attribute. They do not just grab the display values and hope for the best.
Website teams routinely do the opposite. The typical CMS migration starts with a CSV export. Page title, URL, body text, maybe meta description. That covers maybe 30 percent of what a page actually contains.
The modules, the template assignments, the localization relationships between pages and their translated variants, the redirect rules, the database tables that power dynamic content: none of that shows up in a standard export.
A proper extraction means pulling the full content architecture. Every content type, every component, every relationship. You need to see the entire system before you can plan the transformation, which is exactly why having a spreadsheet view of your CMS matters so much. It turns the invisible into something you can inspect, filter, and plan around.
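The spreadsheet view is really just a flattening step: every nested element of every page becomes a row you can filter and count. Here is a sketch of that idea with a hypothetical page shape; a real extractor would walk the CMS API instead of a local dict.

```python
# Flatten a page's nested elements into inspectable rows
# (the "spreadsheet view"). The page shape is hypothetical.

page = {
    "slug": "/about",
    "seo": {"meta_title": "About us"},
    "sections": [
        {"module": "hero_banner", "fields": {"heading": "About"}},
        {"module": "cta_simple", "fields": {"label": "Contact"}},
    ],
}

def flatten(page: dict) -> list:
    """Emit one row per element so nothing stays invisible."""
    rows = [
        {"page": page["slug"], "kind": "seo", "name": k, "value": v}
        for k, v in page["seo"].items()
    ]
    for section in page["sections"]:
        rows.append({
            "page": page["slug"],
            "kind": "module",
            "name": section["module"],
            "value": section["fields"],
        })
    return rows

for row in flatten(page):
    print(row)
```

Once content is in row form, the audit questions become trivial queries: how many pages use which module, which pages have no meta title, which modules never appear at all.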
The loading order problem
Even after you extract and transform everything, loading content into a new platform is not as simple as pushing entries in.
Content has dependencies. A blog post references an author. That author needs to exist in the destination system before the post can be loaded. A page uses a hero component that references an image asset. That asset needs to be loaded before the component. A landing page includes a form that references a workflow. The workflow needs to exist first.
In our 31,000-entry migration, we used a two-pass approach. First pass loaded all standalone entries, assets, and authors. Second pass loaded pages and posts with their references, because by then everything they pointed to already existed.
This dependency ordering is a solved problem in data engineering. Database migrations handle foreign key relationships routinely. But in the CMS world, there is no standard tool that understands content dependencies well enough to handle the ordering automatically. Every migration team figures this out from scratch.
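The two-pass approach is a special case of topological sorting over content dependencies, which Python's standard library handles directly. The entry IDs and references below are invented; in a real migration you would build this graph from the extracted reference fields.

```python
# Dependency-ordered loading as a topological sort.
# Entry IDs and their references are invented examples.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# entry -> the entries it references (which must be loaded first)
deps = {
    "post:launch": {"author:ada", "asset:hero.png"},
    "page:home": {"asset:hero.png", "form:signup"},
    "author:ada": set(),
    "asset:hero.png": set(),
    "form:signup": set(),
}

load_order = list(TopologicalSorter(deps).static_order())
print(load_order)
# Standalone entries (authors, assets, forms) come out before
# anything that references them.
```

The two-pass scheme worked for us because our dependency graph was only two levels deep; a sort like this generalizes to deeper chains, and it raises an error on circular references, which you want to discover before the load, not during it.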
What we are building toward
At Smuves, we started with the extract layer: connecting to a HubSpot portal and pulling all content into a structured, editable view. That naturally led to auditing, understanding what a site contains before making changes. And auditing naturally led to migration, moving content between platforms with the structural relationships intact.
The thread connecting all of it is the ETL pattern. Extract everything. Transform it into the right shape. Load it with the dependencies in order.
The website world is about a decade behind the data world on tooling for this. But the pattern is the same. And the more teams start treating their website content as structured data rather than a collection of individual pages, the faster the tooling will catch up.
If you are planning a CMS migration, start with the extraction. Get the full picture of what you have before you write a single migration plan. The module audit alone will change your timeline estimate.