What I learned wiring JSON-LD structured data audits into a post-deploy CI step

#webdev #githubactions #astro #tutorial

The conclusion first: JSON-LD structured data is one of those things that can vanish from your site without breaking anything visible. The Astro build succeeds. The Cloudflare Pages deploy completes. The page renders fine in a browser. But inside the <script type="application/ld+json"> block — what Googlebot reads to decide whether your page qualifies for rich results — something went wrong, and you won't know until Search Console flags it weeks later.

I added a post-deploy audit step to my CI pipeline that catches this kind of regression in under 60 seconds. Here's how the script works, what it found on first run, and where the approach falls short.

Why structured data breaks silently in Astro SSG

My three directory sites — aiappdex.com, findindiegame.com, ossfind.com — are fully static Astro 5 SSG builds deployed to Cloudflare Pages. Structured data lives in the <head> of each page, injected by layout components. No server-side rendering, no dynamic injection.

The schema types in use:

SoftwareApplication + BreadcrumbList on aiappdex.com model pages
VideoGame + BreadcrumbList on findindiegame.com game pages
ItemList + BreadcrumbList on ossfind.com alternatives pages
WebSite on all homepages

These all come from Astro layout components. When I add a new slot, reorganize the <head>, or extract shared layout logic, the JSON-LD block can disappear. The Astro compiler doesn't validate structured data. The build step doesn't check it. The deploy succeeds. Nothing errors.

This matters especially for statically generated sites, where build time is the only chance to get the output right: there is no server to validate or repair it at runtime. If a template change drops the VideoGame schema from 2,000 game pages, the damage is done by the time the deploy finishes.

I mentioned in a weekly recap that I suspected some pages had malformed FAQ JSON-LD. That was the nudge to actually build the check.

What the audit script checks

scripts/audit-jsonld.mjs defines a table of expectations per site:

const SITES = [
  {
    host: "aiappdex.com",
    homepage: { path: "/", expectedTypes: ["WebSite"] },
    detail: {
      pathRegex: /\/models\//,
      expectedTypes: ["SoftwareApplication", "BreadcrumbList"],
    },
  },
  {
    host: "findindiegame.com",
    homepage: { path: "/", expectedTypes: ["WebSite"] },
    detail: {
      pathRegex: /\/games\//,
      expectedTypes: ["VideoGame", "BreadcrumbList"],
    },
  },
  {
    host: "ossfind.com",
    homepage: { path: "/", expectedTypes: ["WebSite"] },
    detail: {
      pathRegex: /\/alternatives\//,
      expectedTypes: ["ItemList", "BreadcrumbList"],
    },
  },
];

For each site, the script fetches the homepage and two sample detail pages, extracts all JSON-LD blocks, collects the @type values present, and reports any expected type that's missing.
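
The comparison step can be sketched roughly as follows. This is a minimal sketch, not the script's actual code: `missingTypes` is a hypothetical helper name, and it assumes `items` is the array of JSON-LD nodes produced by the extraction step. It also treats an array-valued `@type` as several types, which schema.org permits.

```javascript
// Hypothetical helper: returns the expected types that no node on the page declares.
function missingTypes(items, expectedTypes) {
  const present = new Set();
  for (const item of items) {
    // "@type" may be a single string or an array of strings.
    const types = Array.isArray(item["@type"]) ? item["@type"] : [item["@type"]];
    for (const t of types) present.add(t);
  }
  return expectedTypes.filter((t) => !present.has(t));
}

// Example: a game page that lost its BreadcrumbList block (made-up data).
const sampleItems = [{ "@type": "VideoGame", name: "Example Game" }];
console.log(missingTypes(sampleItems, ["VideoGame", "BreadcrumbList"]));
// prints [ 'BreadcrumbList' ]
```

An empty result means the page passed; anything else is the list to print in the log.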

It runs against live deployed pages, not build output. If Cloudflare returns a cached version of the old page, this catches it. If a CDN edge is serving different HTML than origin, this catches it. Testing the build artifact catches template errors earlier, but not deployment and caching issues — and those are real failure modes I've hit before with Cloudflare Pages.

How it discovers live pages from the sitemap

Instead of hardcoding detail page paths, the script reads the live sitemap to find real pages:

async function discoverDetailPaths(host, regex, count = 2) {
  try {
    const sitemap = await fetch(`https://${host}/sitemap-0.xml`).then(r => r.text());
    const urls = [...sitemap.matchAll(/<loc>([^<]+)<\/loc>/g)].map(m => m[1]);
    return urls.filter(u => regex.test(u)).slice(0, count).map(u => new URL(u).pathname);
  } catch {
    return [];
  }
}

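The filename sitemap-0.xml is intentional, and I covered the background earlier in this series. @astrojs/sitemap writes page URLs to numbered files starting at /sitemap-0.xml, while /sitemap-index.xml only lists those files; a site under the per-file entry limit (45,000 by default) gets a single numbered file. Pointing this function at /sitemap-index.xml would make discovery fail silently: no <loc> entry would match pathRegex, so the audit would check no detail pages at all.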

Filtering against pathRegex finds actual model/game/alternative pages that exist in the current production deployment. The script checks two samples per site per run, always the first two matches in sitemap order, which is fast but not exhaustive.
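
One way to make a discovery failure visible instead of silent is to separate parsing from fetching and report why no paths came back. The sketch below is an illustration under assumptions rather than the script's code: `selectDetailPaths` and its `reason` strings are hypothetical, the example URLs are placeholders, and it reads `<loc>` entries the same way the function above does.

```javascript
// Hypothetical helper: picks sample detail paths from sitemap XML and
// explains the result when nothing usable is found.
function selectDetailPaths(sitemapXml, regex, count = 2) {
  const urls = [...sitemapXml.matchAll(/<loc>([^<]+)<\/loc>/g)].map((m) => m[1].trim());
  if (urls.length === 0) {
    return { paths: [], reason: "no <loc> entries (wrong file or error page?)" };
  }
  // A sitemap index lists child sitemap files, not pages.
  if (/<sitemapindex[\s>]/i.test(sitemapXml)) {
    return { paths: [], reason: "sitemap index: fetch the child sitemaps instead" };
  }
  const paths = urls
    .filter((u) => regex.test(u))
    .slice(0, count)
    .map((u) => new URL(u).pathname);
  return { paths, reason: paths.length ? "ok" : "no URL matched the detail pattern" };
}

// Placeholder sitemap content for illustration only.
const xml = `<urlset>
  <url><loc>https://findindiegame.com/</loc></url>
  <url><loc>https://findindiegame.com/games/example-a/</loc></url>
  <url><loc>https://findindiegame.com/games/example-b/</loc></url>
  <url><loc>https://findindiegame.com/games/example-c/</loc></url>
</urlset>`;
console.log(selectDetailPaths(xml, /\/games\//));
```

Logging the `reason` whenever `paths` is empty would turn "checked zero detail pages" into a line someone can notice in the CI output.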

Extracting JSON-LD and handling @graph

The extraction is a regex over the HTML, with one non-obvious case: the @graph unwrapping.

function extractJsonLd(html) {
  const matches = [
    ...html.matchAll(
      /<script[^>]+type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi,
    ),
  ];
  const items = [];
  for (const m of matches) {
    try {
      const parsed = JSON.parse(m[1].trim());
      const arr = Array.isArray(parsed) ? parsed : [parsed];
      for (const node of arr) {
        if (!node) continue;
        if (node["@graph"]) {
          // @graph is usually an array, but a single object is also valid JSON-LD.
          for (const sub of [node["@graph"]].flat()) {
            if (sub && sub["@type"]) items.push(sub);
          }
        } else if (node["@type"]) {
          items.push(node);
        }
      }
    } catch (e) {
      items.push({ "@type": "_PARSE_ERROR", error: String(e).slice(0, 80) });
    }
  }
  return items;
}

Some generators bundle multiple schema objects inside a top-level @graph array. Google treats each item in @graph as a separate entity — the audit does the same. If structured data has @graph: [{ "@type": "VideoGame" }, { "@type": "BreadcrumbList" }], both types are extracted and validated individually.

Parse errors surface as a _PARSE_ERROR entry. This catches malformed JSON before it reaches the @type check — useful if a template interpolation injects an unescaped quote into the JSON block.
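
To make the unescaped-quote failure concrete, the sketch below contrasts naive string interpolation with serializing the whole object through `JSON.stringify`. It is an illustration under assumptions, not the sites' actual template code: `naiveJsonLd`, `safeJsonLd`, and `parses` are hypothetical names, and the game title is made up.

```javascript
// Naive approach: interpolate a value straight into a JSON string.
function naiveJsonLd(name) {
  return `{"@context":"https://schema.org","@type":"VideoGame","name":"${name}"}`;
}

// Safer approach: build an object and let JSON.stringify handle escaping.
// Replacing "<" also keeps a stray "</script>" in the data from closing the tag early.
function safeJsonLd(name) {
  const data = { "@context": "https://schema.org", "@type": "VideoGame", name };
  return JSON.stringify(data).replace(/</g, "\\u003c");
}

// Returns true when the text parses as JSON, false otherwise.
function parses(text) {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

const title = 'The "Last" Door'; // made-up title containing double quotes
console.log(parses(naiveJsonLd(title))); // false: the audit would report _PARSE_ERROR
console.log(parses(safeJsonLd(title))); // true
```

The audit only detects the first case after the fact; serializing through `JSON.stringify` in the layout is what prevents it.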

Adding it to the CI pipeline as a non-fatal step

I wired it into publish-articles.yml — the same pipeline that handles article distribution across Dev.to, Hashnode, and Bluesky:

- name: Audit JSON-LD (non-fatal)
  run: node scripts/audit-jsonld.mjs || echo "JSON-LD audit reported issues (non-fatal)"

The || fallback is the key design decision. Because echo exits 0, the step always succeeds, so a failing audit never blocks article publishing. Issues appear in the Actions log, but nothing downstream is halted.

This mirrors how I handled the Bluesky image upload timing issue: add the check first, observe what it reports in real conditions, fix the underlying problems, then tighten the failure mode. Making a new check fatal immediately guarantees you'll be debugging a blocked pipeline at the worst moment.

Once all three sites audit clean on every run, I'll drop the || and let a missing BreadcrumbList fail the workflow. Not yet.
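
When the time comes to tighten the failure mode, one alternative to editing the workflow line is to let the script choose its own exit code. The sketch below assumes a hypothetical result shape and a hypothetical AUDIT_STRICT environment variable; neither exists in the current script or workflow, and the URLs are placeholders.

```javascript
// Hypothetical result shape: one entry per audited URL,
// { url: string, missing: string[] }.
function exitCodeFor(results, strict) {
  const failures = results.filter((r) => r.missing.length > 0);
  for (const f of failures) {
    console.log(`MISSING ${f.url}: ${f.missing.join(", ")}`);
  }
  // In non-strict mode the problems are logged but the step still passes.
  return strict && failures.length > 0 ? 1 : 0;
}

// Placeholder results for illustration only.
const sampleResults = [
  { url: "https://ossfind.com/alternatives/example/", missing: ["ItemList"] },
  { url: "https://ossfind.com/", missing: [] },
];

// AUDIT_STRICT is an assumed variable name, not something the workflow defines today.
const strict = typeof process !== "undefined" && process.env.AUDIT_STRICT === "1";
console.log("exit code would be", exitCodeFor(sampleResults, strict));
```

In a real script the returned value would be assigned to `process.exitCode`, which lets the same file run in report-only mode first and in failing mode later without touching the audit logic.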

What it found on first run

Three issues surfaced immediately:

ossfind.com alternatives pages: missing ItemList

Expected — I hadn't added ItemList schema to the ossfind alternatives layout yet. The audit turned "I should add structured data to ossfind someday" into a concrete, CI-visible task.

findindiegame.com homepage: http:// in WebSite @id

The @id field in the WebSite block was http://findindiegame.com. I had copied a schema template and missed updating the protocol. Nothing breaks visibly (the page renders correctly and the structured data is syntactically valid), but the @id no longer matches the https:// canonical URL that Googlebot sees for the page.

aiappdex.com model pages: name field used raw HuggingFace model ID

The name field in SoftwareApplication schema contained "meta-llama/Llama-3.1-8B-Instruct" — the raw database ID — instead of the human-readable "Llama 3.1 8B Instruct" that appears in the page <h1>. Both values were available in the Astro component, but the template was pulling from the wrong field.

| Site | Issue | Status |
| --- | --- | --- |
| ossfind.com | Missing `ItemList` on alternatives pages | Backlog |
| findindiegame.com | `http://` in WebSite `@id` | Fixed |
| aiappdex.com | `name` used raw model ID instead of display name | Fixed |

Issues 2 and 3 were genuine bugs I wouldn't have found otherwise. Neither showed up in the Astro build, the Cloudflare deploy log, or any browser-level review. They surfaced on the first run because the audit made me look at the structured data the way Googlebot reads it: as text inside a <script> tag, not as something the browser renders.

What I'd add next

URL self-consistency check. The http:// bug was caught by manual inspection of the reported types. A systematic check would verify that every url or @id field in structured data matches the actual canonical URL of the page — so that class of error gets caught automatically.
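
A rough sketch of what such a check might look like follows. `findUrlMismatches` is a hypothetical helper, not part of the current script. It assumes that comparing scheme and host is enough, because `@id` fragments and links to other pages on the same site are legitimate; a `url` that intentionally points off-site would need an allowlist.

```javascript
// Hypothetical helper: walks JSON-LD nodes and flags "url" / "@id" values
// whose scheme or host differs from the page's canonical URL.
function findUrlMismatches(items, canonicalUrl) {
  const canonical = new URL(canonicalUrl);
  const problems = [];
  const visit = (value, path) => {
    if (Array.isArray(value)) {
      value.forEach((v, i) => visit(v, `${path}[${i}]`));
    } else if (value && typeof value === "object") {
      for (const [key, v] of Object.entries(value)) {
        const here = path ? `${path}.${key}` : key;
        if ((key === "url" || key === "@id") && typeof v === "string") {
          let parsed;
          try {
            parsed = new URL(v, canonical); // relative values resolve against the page
          } catch {
            problems.push({ path: here, value: v, reason: "unparseable" });
            continue;
          }
          if (parsed.protocol !== canonical.protocol) {
            problems.push({ path: here, value: v, reason: "scheme differs" });
          } else if (parsed.host !== canonical.host) {
            problems.push({ path: here, value: v, reason: "host differs" });
          }
        } else {
          visit(v, here); // recurse into nested objects and arrays
        }
      }
    }
  };
  visit(items, "");
  return problems;
}

// Reproduces the homepage bug described earlier in this post.
const homepageItems = [
  { "@type": "WebSite", "@id": "http://findindiegame.com", url: "https://findindiegame.com/" },
];
console.log(findUrlMismatches(homepageItems, "https://findindiegame.com/"));
```

Run against the homepage example, this reports the `@id` entry with the reason "scheme differs" and leaves the correct `url` alone.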

aggregateRating on VideoGame pages. The Steam review data is already in the Turso database: total_reviews, total_positive, review_score. Once I emit aggregateRating structured data, the audit should verify it's present and well-formed on game pages.
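
A well-formedness rule for that could look something like the sketch below. `checkAggregateRating` is a hypothetical helper, and the sample node is made up: expressing Steam data as a percentage with bestRating 100 and worstRating 0 is one possible mapping I'm assuming for illustration, not a decision the site has made. The fallback bounds of 5 and 1 follow the schema.org defaults for bestRating and worstRating.

```javascript
// Hypothetical check: returns a list of problems with a node's aggregateRating.
function checkAggregateRating(node) {
  const rating = node.aggregateRating;
  if (!rating || typeof rating !== "object") return ["aggregateRating missing"];
  const problems = [];
  if (rating["@type"] !== "AggregateRating") problems.push("@type is not AggregateRating");

  const raw = rating.ratingValue;
  const value = Number(raw);
  const hasValue = raw != null && raw !== "" && Number.isFinite(value);
  if (!hasValue) problems.push("ratingValue missing or not numeric");

  // Either ratingCount or reviewCount should be a positive integer.
  const count = Number(rating.ratingCount ?? rating.reviewCount);
  if (!Number.isInteger(count) || count <= 0) {
    problems.push("ratingCount/reviewCount missing or not a positive integer");
  }

  const best = Number(rating.bestRating ?? 5);
  const worst = Number(rating.worstRating ?? 1);
  if (hasValue && (value < worst || value > best)) {
    problems.push("ratingValue outside worstRating..bestRating");
  }
  return problems;
}

// Made-up node: 90 positive out of 120 reviews, expressed as a percentage.
const gameNode = {
  "@type": "VideoGame",
  aggregateRating: {
    "@type": "AggregateRating",
    ratingValue: 75,
    ratingCount: 120,
    bestRating: 100,
    worstRating: 0,
  },
};
console.log(checkAggregateRating(gameNode)); // prints []
```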

FAQPage schema. I want to add FAQ sections to top model pages on aiappdex.com. Once added, the audit needs a validation rule for those pages.

Running against build output before deploy. The current approach finds issues after they're live. Running the same extraction logic against the Astro build output — with a local astro preview server in CI — would catch template regressions pre-deploy. That adds CI complexity I'm not ready to take on; post-deploy detection is good enough for now.

The limit worth stating plainly: the audit checks 2 sample pages per site per run. It doesn't catch issues that only affect specific page types, rare edge cases in the data, or pages that happen not to be in the sitemap sample. It's a smoke test, not a full validation suite.