
Smuves

Posted on • Originally published at smuves.com

How We Built a Module Audit Script for a 166-Component HubSpot Site

When a team tells you their HubSpot site has about 40 modules, and the actual count turns out to be 166, the first problem is not cleaning it up. The first problem is just figuring out what is there.

This post is about the audit script we built to answer that question. Not a polished product. Just a script that ran against the HubSpot Design Manager and the CMS API and spit out a CSV we could actually work from.

If you are staring at a HubSpot portal that has accumulated years of components and nobody on the current team remembers what half of them do, this is the approach that got us to a consolidation map.

Why the Design Manager UI is not enough

HubSpot's Design Manager shows you a list of modules. You can click into each one, see the fields, see the template that renders it, and get a sense of what it was built for.

What it does not show you is usage. There is no native "where is this module referenced" view. If you want to know whether a module is live on any page or if it is just sitting in the library unused, you have to cross-reference against every page, every blog post, every email template, and every landing page that could possibly use it.

For 40 modules that is annoying. For 166 modules it is a week of work.

That is the gap the script filled. Pull the full module list via the API, pull the full page list, cross-reference them, and output a single file that tells you which modules are actually in use and which are orphaned.

We wrote about why this kind of sprawl happens in the first place in this breakdown of CMS governance gaps. The short version is that nobody owns the component library in most companies, so new modules get added faster than old ones get retired. Over a few years you end up with five banner variants that do the same thing.

The script does not solve that problem. It just makes the mess visible.

Step one: Pulling the module list

The HubSpot CMS API has an endpoint for design manager assets. You hit /cms/v3/source-code/published/content with the right scope and you get back the module definitions, including the HubL name, the fields, the created date, and the last modified date.

The pagination is standard. Page through until you get an empty response. Dump everything into a list of dicts, one per module.
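The loop itself is nothing special. Here is a minimal sketch, with the actual HTTP request abstracted behind a hypothetical `fetch_page(offset)` helper that returns one batch of module dicts (an empty batch signals the end):

```python
def paginate(fetch_page):
    """Page through an API listing until an empty response.

    fetch_page(offset) is assumed to return a list of module dicts;
    an empty list signals the end of the results.
    """
    results = []
    offset = 0
    while True:
        batch = fetch_page(offset)
        if not batch:
            break
        results.extend(batch)
        offset += len(batch)
    return results
```

The same loop works for the page and blog post listings later, which is why it is worth pulling out as a helper.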

A few things to extract for each module:

  • The HubL name (this is the identifier you will use to grep against page source)
  • The module label (what shows up in the Design Manager UI)
  • The field list (what inputs the module accepts)
  • The template path
  • The created and last modified timestamps

That last one is more useful than it sounds. A module that has not been modified in four years and was built by a developer who no longer works at the company is a very different candidate for consolidation than one that was updated last quarter.
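If you want the script to surface that signal directly, a small helper can flag stale modules from the timestamps. A sketch, assuming the timestamps come back as epoch milliseconds; the two-year cutoff is our own arbitrary choice, not anything the API defines:

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=2 * 365)  # arbitrary two-year cutoff

def is_stale(last_modified_ms, now=None):
    """Treat a module as stale if it has not been touched in ~two years.

    last_modified_ms is assumed to be an epoch-milliseconds timestamp.
    """
    now = now or datetime.now(timezone.utc)
    modified = datetime.fromtimestamp(last_modified_ms / 1000, tz=timezone.utc)
    return now - modified > STALE_AFTER
```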

Step two: Pulling the page and post inventory

You need a full inventory of everywhere a module could be used. For a typical HubSpot portal that means:

  • Site pages
  • Landing pages
  • Blog posts
  • Email templates
  • Drag-and-drop templates

The Pages API and the Blog Posts API both return the serialized layout data for each page. Inside that layout data are module references, usually by HubL name. This is the thing you want to grep against.

Pull everything, paginate through it, and dump the layout JSON for each page into a local store. For large portals this takes a while. The portal we audited had around 32,000 pages, so we ran this in batches and cached aggressively.
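A minimal version of that cache, assuming a hypothetical `fetch_layout(page_id)` helper that wraps the actual API call and returns the layout as a dict:

```python
import json
from pathlib import Path

CACHE_DIR = Path("page_cache")  # hypothetical local store
CACHE_DIR.mkdir(exist_ok=True)

def cached_layout(page_id, fetch_layout):
    """Fetch a page's layout JSON once, then reuse the local copy.

    Re-running the audit reads from disk instead of hitting the API
    again, which matters when the portal has tens of thousands of pages.
    """
    cache_file = CACHE_DIR / f"{page_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    layout = fetch_layout(page_id)
    cache_file.write_text(json.dumps(layout))
    return layout
```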

One gotcha. Some pages reference modules inline via HubL in custom templates, not through the drag-and-drop layout. If you only look at the layout JSON you will miss those. You also need to pull the template source and grep for module tags there. Something like this:

```python
import re

# Matches inline HubL module tags like {% module "hero_banner" ... %}
HUBL_MODULE_PATTERN = re.compile(
    r'{%\s*module\s+[\'"]?([a-zA-Z0-9_\-]+)[\'"]?',
    re.IGNORECASE
)

def extract_module_refs(template_source):
    """Return the set of module names referenced in a template's source."""
    return set(HUBL_MODULE_PATTERN.findall(template_source))
```

That regex is not bulletproof but it caught enough of the inline references to be worth including.

Step three: Cross-referencing

Now you have two things. A list of modules that exist. And a list of pages, posts, and templates with the modules they reference.

The cross-reference is just a lookup. For every module in the library, check whether its HubL name appears in any page layout or template source. Build a usage map.

```python
from collections import defaultdict

# module HubL name -> list of pages that reference it
usage_map = defaultdict(list)

for page in all_pages:
    for module_ref in page.module_refs:
        usage_map[module_ref].append({
            'page_id': page.id,
            'page_url': page.url,
            'page_type': page.type,
        })

# Bucket every module in the library by how often it is referenced
for module in all_modules:
    if module.hubl_name not in usage_map:
        module.status = 'orphaned'
    elif len(usage_map[module.hubl_name]) == 1:
        module.status = 'single_use'
    else:
        module.status = 'active'
```

Three buckets. Orphaned (not referenced anywhere, deletion candidate). Single-use (referenced on exactly one page, probably consolidation candidate). Active (referenced on multiple pages, needs more careful review).

For the 166-module portal, the rough breakdown was as follows:

  • 58 orphaned modules (not referenced anywhere in the current site)
  • 31 single-use modules (most were custom builds for specific landing pages)
  • 77 active modules (referenced across multiple pages)

That 58 number was the first real "oh" moment. A third of the library was dead code. Nobody had known because nobody had looked.

Step four: Similarity detection

The harder question is not which modules are unused. It is which modules are duplicative.

This is where the field list comparison becomes useful. If two modules have near-identical field schemas (same field names, same types, same structure), they are probably doing the same thing even if they have different names.

The approach we used was a simple Jaccard similarity on field signatures.

```python
def field_signature(module):
    # Order-independent signature: the set of (name, type) pairs
    return frozenset(
        (field.name, field.type) for field in module.fields
    )

def jaccard(a, b):
    # Jaccard similarity: intersection over union of the two sets
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

similarity_pairs = []
for i, mod_a in enumerate(modules):
    for mod_b in modules[i + 1:]:
        sig_a = field_signature(mod_a)
        sig_b = field_signature(mod_b)
        score = jaccard(sig_a, sig_b)
        if score >= 0.8:
            similarity_pairs.append((mod_a, mod_b, score))
```

Any pair with a similarity score above 0.8 is worth a human review. The script does not decide they are duplicates. It flags them for a person to look at.

When we ran this against the 166-module portal, we got 43 similarity pairs above the threshold. After manual review, 31 of those pairs turned out to be true duplicates that could be consolidated. The rest were intentional variants with meaningful differences.

This is where automation stops being useful. A script can tell you two modules have the same field schema. It cannot tell you whether one renders a rounded button and the other renders a square button, or whether one has a max-width constraint and the other does not. That review has to happen in a browser with a real human looking at the output.

Step five: The consolidation map

The output of all this is a single CSV with one row per module. Columns:

  • Module name
  • Usage status (orphaned, single_use, active)
  • Usage count
  • List of pages where it is referenced
  • Similar modules flagged for review
  • Consolidation recommendation

That last column is the one that actually drives decisions. For each module, a human reviews the data and marks it as one of:

  • Keep (this is a canonical version, keep it as-is)
  • Merge into X (this is duplicative of module X, migrate all references and delete)
  • Delete (orphaned or deprecated, safe to remove)
  • Review (unclear, needs more investigation)
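The export itself is a few lines with Python's stdlib `csv` module. A sketch, assuming the audit data ends up in dicts with these keys (our own naming, not anything HubSpot defines); list-valued columns are joined with "; " so the file stays flat, and the recommendation defaults to "review" until a human fills it in:

```python
import csv

def write_audit_csv(modules, path="module_audit.csv"):
    """Flatten the audit results into a one-row-per-module CSV."""
    columns = ["name", "status", "usage_count", "pages",
               "similar", "recommendation"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        for mod in modules:
            writer.writerow({
                "name": mod["name"],
                "status": mod["status"],
                "usage_count": mod["usage_count"],
                # Join list columns so each module stays on one row
                "pages": "; ".join(mod["pages"]),
                "similar": "; ".join(mod["similar"]),
                "recommendation": mod.get("recommendation", "review"),
            })
```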

For the 166-module portal, the final breakdown was 40 keeps, 78 merges, 35 deletes, and 13 reviews that got resolved in follow-up discussions. 166 modules became 40. Same rendered output on every live page.

What the script cannot do

The audit script is a discovery tool, not a decision tool. It tells you what is there. It does not tell you what should be there.

The consolidation decisions still require human judgment. Is this module still needed for campaigns that happen twice a year? Does the marketing team have plans to use it in Q4? Is the styling intentional or is it drift?

The script also does not handle the actual migration. Once you decide module A should be merged into module B, you still have to update every page that references A, swap it for B, and test that the page renders correctly. That is a separate piece of tooling.

What the script does is shrink the decision space. Instead of staring at 166 modules and trying to figure out where to start, you have a ranked list of candidates with usage data attached. You can work through it systematically.

Would we open source it?

Probably not. The script is too tightly coupled to the specific project we built it for, and rewriting it to be generically useful would take more effort than the value justifies for a standalone tool.

But the approach is the important part, not the code. If you are facing a similar audit, the steps are straightforward:

  1. Pull the full module inventory via the API
  2. Pull the full page and template inventory via the API
  3. Cross-reference to build a usage map
  4. Run similarity detection on field signatures
  5. Export the result as a flat CSV
  6. Review the CSV with a human who knows the site

That process is what we are productizing at Smuves as part of our migration tooling. The goal is to make this kind of audit something a marketing ops person can run in an afternoon, not something that requires writing a custom script every time.

The underlying problem is not hard. It just requires someone to actually look at the library.
