DEV Community

Smuves

Building an ETL Pipeline for Website Content With the HubSpot API

If you have ever tried to move a large HubSpot CMS site to another platform, you know the drill. Export what you can to CSV. Write scripts for what the export misses. Manually handle the edge cases. Repeat until the deadline passes.

The frustrating part is that the pattern for doing this well already exists. ETL (Extract, Transform, Load) is the backbone of every data engineering pipeline. Pull data from a source, reshape it, push it to the destination. There are mature tools for this in every other data domain. CRM migrations, analytics pipelines, event data syncs. All of them use ETL.

Website content migrations do not. But they should.

The extraction problem

The HubSpot CMS API gives you access to pages, blog posts, authors, tags, redirects, and HubDB tables. On the surface, extraction looks straightforward. Hit the endpoints, paginate through the results, dump everything to JSON.
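That pagination loop is worth sketching, because HubSpot's v3 listing endpoints use a cursor in `paging.next.after` rather than page numbers. This is a minimal sketch with a stand-in `fetch_page` function instead of a real authenticated HTTP call:

```python
import json

def extract_all(fetch_page):
    """Paginate through a cursor-based v3 listing endpoint and yield every result.

    `fetch_page` stands in for a real HTTP call to a HubSpot listing
    endpoint; it takes an `after` cursor and returns the parsed JSON body.
    """
    after = None
    while True:
        body = fetch_page(after=after)
        yield from body["results"]
        # HubSpot v3 APIs signal more data via paging.next.after
        paging = body.get("paging", {})
        if "next" not in paging:
            break
        after = paging["next"]["after"]

# Fake two-page response to illustrate the loop (no network calls)
_pages = [
    {"results": [{"id": "1"}, {"id": "2"}], "paging": {"next": {"after": "2"}}},
    {"results": [{"id": "3"}], "paging": {}},
]

def fake_fetch(after=None):
    return _pages[0] if after is None else _pages[1]

all_ids = [p["id"] for p in extract_all(fake_fetch)]
print(json.dumps(all_ids))  # ["1", "2", "3"]
```

In a real run, `fetch_page` would wrap your HTTP client with the access token and dump each raw page to disk before anything else touches it.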

In practice, extraction is where most people underestimate the scope. A single page object from the HubSpot API contains the metadata you would expect: title, slug, meta description, publish date. But the actual content lives in widget containers and module fields that are nested several levels deep. A hero module might reference an image by file ID. A CTA might reference a form by GUID. A rich text field might contain embedded HubL tokens that only resolve in the HubSpot rendering engine.

A real extraction needs to capture all of this. Not just the top-level fields, but the full module tree, the asset references, the template associations, and the URL structure. If you only extract what the CSV export gives you, you are missing the majority of what makes each page actually work.
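Capturing those references means walking the nested module tree rather than reading top-level fields. A minimal recursive walk might look like this; the `file_id` and `form_guid` field names are illustrative, since real field names depend on each module's definition:

```python
def collect_references(node, refs=None):
    """Recursively walk a page's module tree and collect asset/form references.

    `file_id` and `form_guid` are hypothetical field names; real HubSpot
    module fields vary by module definition.
    """
    if refs is None:
        refs = {"files": set(), "forms": set()}
    if isinstance(node, dict):
        if "file_id" in node:
            refs["files"].add(node["file_id"])
        if "form_guid" in node:
            refs["forms"].add(node["form_guid"])
        for value in node.values():
            collect_references(value, refs)
    elif isinstance(node, list):
        for item in node:
            collect_references(item, refs)
    return refs

page = {
    "widgets": {
        "hero": {"body": {"image": {"file_id": 101}}},
        "cta": {"body": {"form_guid": "abc-123"}},
    }
}
refs = collect_references(page)
print(sorted(refs["files"]), sorted(refs["forms"]))  # [101] ['abc-123']
```

The output of this pass doubles as your asset manifest: every file and form the loading phase will need to create before the pages that reference them.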

The transformation layer

This is where the real work happens. Transformation is not reformatting. It is restructuring.

On a migration project involving 31,000 entries, the source site had 166 modules. Five different accordion implementations built at different times by different developers. Four hero banner variants with slightly different field schemas. The transformation step consolidated these down to 40 target components.

What this means in practice is building a mapping layer between source module types and target component types. Every source module has a different field schema, and the target component needs a unified schema. One accordion variant might store items with fields called "title" and "body" while another uses "tab_label" and "tab_content." Both need to become a single consistent structure in the destination.
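The mapping layer can be as simple as a per-module field map applied uniformly. The module and field names below are hypothetical, but the pattern is the point: many source schemas, one target schema.

```python
# Per-source-module field maps: source field name -> unified field name.
# Module names and fields here are hypothetical examples of the pattern.
FIELD_MAPS = {
    "accordion_v1": {"title": "heading", "body": "content"},
    "accordion_v2": {"tab_label": "heading", "tab_content": "content"},
}

def transform_items(module_type, items):
    """Normalize one source module's items into the unified target schema."""
    field_map = FIELD_MAPS[module_type]
    return [
        {target: item[source] for source, target in field_map.items()}
        for item in items
    ]

a = transform_items("accordion_v1", [{"title": "FAQ", "body": "Answer"}])
b = transform_items("accordion_v2", [{"tab_label": "FAQ", "tab_content": "Answer"}])
assert a == b == [{"heading": "FAQ", "content": "Answer"}]
```

Keeping the maps as data rather than code also makes the consolidation decisions reviewable: a content strategist can audit the table without reading the pipeline.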

The complexity compounds when you factor in localization. The source site in this project used a page-per-language model across 11 languages. The destination platform used locale-based content entries. Every single page relationship had to be remapped, not just translated, but structurally reorganized.
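One way to sketch that restructuring: collapse the page-per-language records into a single entry per translation group, keyed by locale. The `translation_group` and `language` field names are hypothetical stand-ins for whatever links the source language variants together:

```python
def group_by_translation(pages):
    """Collapse page-per-language records into one locale-keyed entry each.

    `translation_group` and `language` are hypothetical field names for
    whatever links the source language variants together.
    """
    entries = {}
    for page in pages:
        entry = entries.setdefault(page["translation_group"], {})
        entry[page["language"]] = {"title": page["title"], "slug": page["slug"]}
    return entries

pages = [
    {"translation_group": "pricing", "language": "en",
     "title": "Pricing", "slug": "/pricing"},
    {"translation_group": "pricing", "language": "de",
     "title": "Preise", "slug": "/de/preise"},
]
entries = group_by_translation(pages)
print(sorted(entries["pricing"]))  # ['de', 'en']
```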

Multiply that by every module type and every content type, and you start to see why this phase takes longer than the extraction and loading combined. The transformation layer is where migrations succeed or fail, and it is the part that gets the least planning.

The loading order

Loading transformed content into the destination platform is not a simple bulk insert. Content has dependencies that dictate the order of operations.

A blog post references an author. That author needs to exist as a content entry in the destination before the post can be created. A page uses a hero module that references an image asset. That asset needs to be uploaded first. A navigation component references pages by their URL path. Those pages need to exist before the navigation can be built.

The general sequence is: assets first (images, files, documents), then global content (headers, footers, navigation), then taxonomy entries (tags, categories, authors), then the actual content entries with their module references, then redirects mapping old URLs to new ones, and finally cross-references between content entries.
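That sequence can be enforced with a staged loader, so no batch runs before the stages it depends on. This is a sketch where `create` stands in for the destination platform's write API:

```python
# Stages in dependency order; each stage may reference entries created
# in earlier stages, never in later ones.
LOAD_STAGES = [
    "assets",      # images, files, documents
    "globals",     # headers, footers, navigation
    "taxonomy",    # tags, categories, authors
    "content",     # pages and posts with their module references
    "redirects",   # old URL -> new URL mappings
    "cross_refs",  # links between content entries
]

def load_all(batches, create):
    """Push each batch in stage order; `create` stands in for the
    destination platform's write API."""
    for stage in LOAD_STAGES:
        for item in batches.get(stage, []):
            create(stage, item)

created = []
load_all(
    {"content": ["post-1"], "assets": ["hero.png"], "taxonomy": ["author-jane"]},
    lambda stage, item: created.append((stage, item)),
)
print(created)  # assets first, then taxonomy, then content
```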

Getting this wrong means broken references, missing images, and pages that render with placeholder content in production. On a site with 31,000 entries and 48 HubDB tables, the dependency graph is not something you can hold in your head. It needs to be mapped explicitly.
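Mapping it explicitly can be as cheap as feeding the reference data into Python's standard-library `graphlib`, which computes a safe creation order. The entry names below are made up:

```python
from graphlib import TopologicalSorter

# Hypothetical slice of the dependency graph:
# each key depends on the entries in its value set.
deps = {
    "post:launch": {"author:jane", "asset:hero.png"},
    "nav:main": {"page:home", "page:pricing"},
    "page:home": {"asset:logo.svg"},
}

order = list(TopologicalSorter(deps).static_order())
# Dependencies always come before their dependents:
assert order.index("author:jane") < order.index("post:launch")
assert order.index("page:home") < order.index("nav:main")
```

A cycle in the references (two entries linking to each other) surfaces here as a `CycleError` at planning time instead of a broken page in production.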

Why this pattern matters

The point here is not that you should build a custom ETL pipeline from scratch every time you migrate a website. The point is that website content migration is fundamentally an ETL problem, and treating it as one gives you a structured, repeatable approach instead of ad hoc scripts and spreadsheets.

The data engineering world formalized ETL decades ago. Tools were built around the pattern. Roles were created. Best practices emerged. Content engineering is going through the same evolution right now.

Tools like Smuves are starting to bring that structured approach to HubSpot CMS specifically, giving teams a way to extract, inspect, and work with their content as structured data rather than one page at a time. The extraction layer becomes visible. The transformation decisions become auditable. The loading order becomes repeatable.

If you are planning a migration, or even just a large-scale content operation like a sitewide metadata update, thinking in ETL terms will save you from the most common failure mode: underestimating the transformation layer.

The extraction is the easy part. The loading is mechanical. The transformation is where the real engineering happens.
