Mitrasish

Posted on Jul 5 • Originally published at trylyra.ai

GitHub Actions SEO: gate PRs on broken links and schema

#seo #github #webdev #devops

Originally published on the Lyra blog.

Code review is good at catching logic bugs. SEO bugs are different: a broken canonical does not throw a build error, a dead external link does not fail a type check, and a malformed JSON-LD block does not appear in a diff in any way that signals a problem. They ship quietly. You find out weeks later from Search Console.

The fix is a GitHub Actions SEO workflow that gates every blog PR automatically. Four jobs check broken links, meta and canonical correctness, JSON-LD validity, and a Lighthouse performance budget. The merge button stays red until all four pass.

This is the workflow, job by job.

What a blog PR can ship that code review misses

A code reviewer checking a blog post looks at the prose: is the structure right, does the intro land, are the claims defensible? Nobody in that review is clicking every external link, validating the canonical, or running the new hero image through a performance budget. Those checks are not part of the review process. CI makes them automatic.

Broken external links nobody clicked during editorial review

External links rot. A link that resolved when the author found the source may have moved, been renamed, or started returning a 404 by the time the post ships. Nobody in editorial review clicks every citation in a 2,000-word post. A CI job does.

A missing or self-conflicting canonical that splits your ranking signal

The canonical tag tells Google which URL to credit when the same or similar content appears at multiple addresses. In a Next.js App Router site, pages generate their canonical via generateMetadata. The common failure mode is a page that inherits a canonical from a parent layout instead of setting its own, producing a post whose canonical points at /blog/ rather than /blog/your-post-slug/. Astro's sitemap integration has its own version of this failure mode, covered in our Astro vs Next.js SEO comparison, so the check below is worth adapting rather than skipping if you are on Astro instead.

The page renders without error, silently sending its ranking signal to the wrong URL.
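One way to avoid the inherited-canonical trap is to have every post route compute its own canonical. The following is a minimal sketch, assuming the Next.js Metadata API's alternates.canonical field; buildPostMetadata, the SITE_URL value, and the route path in the comment are hypothetical names chosen for illustration, not part of the original setup.

```javascript
// Placeholder domain; replace with your own site URL.
const SITE_URL = 'https://yoursite.com';

// Hypothetical helper: returns the metadata object a post route could
// return from generateMetadata, with a self-referencing canonical.
function buildPostMetadata(slug, siteUrl = SITE_URL) {
  // Strip stray leading/trailing slashes so the URL has exactly one of each.
  const cleanSlug = String(slug).replace(/^\/+|\/+$/g, '');
  return {
    alternates: {
      canonical: `${siteUrl}/blog/${cleanSlug}/`,
    },
  };
}

// In app/blog/[slug]/page.js (assumed route layout) this might be wired as:
//   export async function generateMetadata({ params }) {
//     const { slug } = await params;
//     return buildPostMetadata(slug);
//   }
console.log(buildPostMetadata('your-post-slug').alternates.canonical);
```

Because the canonical is built from the slug rather than inherited, a parent layout's value cannot leak into the post.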

Malformed JSON-LD that silently forfeits rich-result eligibility

Nestlé measured that pages appearing as rich results in Google Search have an 82% higher click-through rate than non-rich-result pages, a figure cited in Google's structured data documentation. A Milestone Internet study of 4.5 million queries measured 58 clicks per 100 queries for rich results against 41 for standard results. A single malformed property in the JSON-LD block, a date string in the wrong format, or a missing required field silently disqualifies the page from rich-result consideration. The structured data is rendered in the HTML; it just does not validate.

Lighthouse runs around 8 automated SEO audits per page, and none of them validate JSON-LD content. A separate validation step closes that gap.
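To make the failure mode concrete, here is a rough local pre-check, not a substitute for a real validator. It assumes each post carries a single JSON-LD object with headline and datePublished fields; that minimum is an assumption for this sketch, and findJsonLdProblems is a hypothetical helper name.

```javascript
// Extract every <script type="application/ld+json"> block from an HTML
// string and report basic problems. Returns an array of messages.
function findJsonLdProblems(html) {
  const problems = [];
  const blockPattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [...html.matchAll(blockPattern)].map((m) => m[1]);

  if (blocks.length === 0) problems.push('no JSON-LD block found');

  for (const [index, raw] of blocks.entries()) {
    let data;
    try {
      data = JSON.parse(raw);
    } catch (err) {
      problems.push(`block ${index}: invalid JSON (${err.message})`);
      continue;
    }
    if (data === null || typeof data !== 'object') {
      problems.push(`block ${index}: not a JSON object`);
      continue;
    }
    // Assumed minimum fields for this sketch.
    if (!data.headline) problems.push(`block ${index}: missing headline`);
    if (!/^\d{4}-\d{2}-\d{2}/.test(data.datePublished ?? '')) {
      problems.push(`block ${index}: datePublished is not in ISO 8601 form`);
    }
  }
  return problems;
}

// Sample input with a date string in the wrong format.
const sample =
  '<script type="application/ld+json">{"headline":"Hi","datePublished":"July 5"}</script>';
console.log(findJsonLdProblems(sample));
```

The page in the sample renders fine in a browser; only a check that parses the block notices the date.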

A new hero image that blows your Lighthouse budget

Google's Core Web Vitals thresholds are LCP under 2.5 seconds, CLS under 0.1, and INP under 200 milliseconds. Roughly half of all tracked origins pass all three, per 2025 Web Almanac data, with desktop (56%) outperforming mobile (48%).

A PR that adds a 3MB PNG where a 200KB WebP should be can push LCP over threshold, but the build succeeds and the post looks fine locally. The regression only surfaces in Search Console weeks later.
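The thresholds above reduce to a simple comparison. A small sketch, with made-up measurements and a hypothetical failingVitals helper, shows how a single oversized image can fail one metric while the others stay green:

```javascript
// Google's passing thresholds as quoted above.
const THRESHOLDS = { lcpMs: 2500, cls: 0.1, inpMs: 200 };

// Returns the names of the metrics that exceed their threshold.
// A metric missing from `measured` is treated as passing.
function failingVitals(measured) {
  return Object.keys(THRESHOLDS).filter(
    (metric) => measured[metric] > THRESHOLDS[metric]
  );
}

// Hypothetical numbers for a post with a heavy hero image.
console.log(failingVitals({ lcpMs: 4100, cls: 0.02, inpMs: 180 })); // [ 'lcpMs' ]
```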

The GitHub Actions SEO workflow: four checks, one file

All four jobs live in .github/workflows/blog-seo.yml. The workflow triggers on pull requests that change files in content/blog/, so it only runs when content changes:

name: Blog SEO checks

on:
  pull_request:
    paths:
      - 'content/blog/**'
      - '.github/workflows/blog-seo.yml'

Job 1: broken links - lychee-action scans Markdown files before the build

lychee-action wraps lychee, a link checker written in Rust. The lychee project benchmarks it at 576 links in about 60 seconds on the analysis-tools-dev/static-analysis repository; throughput varies by repo size and link distribution, but most blogs with a few dozen posts complete in well under two minutes. It reads Markdown files directly and does not require a running server, so it can complete before any build step.

jobs:
  broken-links:
    name: Broken links
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v7

      - name: Check links
        uses: lycheeverse/lychee-action@v2
        with:
          args: --verbose --no-progress 'content/blog/**/*.md'
          fail: true
          jobSummary: true

fail: true exits with a non-zero code on any broken link, which fails the job. jobSummary: true writes the full report to the GitHub Actions job summary, accessible from the PR's check status.

Add a .lycheeignore at the repo root for URLs to exclude, one regex per line:

# Localhost references in code blocks
http://localhost
# Web archive links
https://web.archive.org
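Since each line is a regex rather than a literal prefix, it helps to see what the patterns actually match. The sketch below approximates the behavior in JavaScript, assuming unanchored matching; lychee itself uses Rust's regex engine, so treat this as an illustration only. parseIgnoreFile and isIgnored are hypothetical names.

```javascript
// The same content as the .lycheeignore example above.
const IGNORE_FILE = [
  '# Localhost references in code blocks',
  'http://localhost',
  '# Web archive links',
  'https://web.archive.org',
].join('\n');

// Turn the file into a list of regexes, skipping blanks and comments.
function parseIgnoreFile(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line !== '' && !line.startsWith('#'))
    .map((pattern) => new RegExp(pattern));
}

// A URL is excluded when any pattern matches anywhere inside it.
function isIgnored(url, patterns) {
  return patterns.some((pattern) => pattern.test(url));
}

const patterns = parseIgnoreFile(IGNORE_FILE);
console.log(isIgnored('http://localhost:3000/blog/', patterns)); // true
console.log(isIgnored('https://example.com/post', patterns)); // false
```

An unescaped dot matches any character, so web.archive.org is slightly broader than it looks; escaping the dots tightens the pattern if that ever matters.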

Job 2: meta, canonical, and OG tags - parse built HTML after next build

There is no off-the-shelf action for meta-tag validation on a Next.js App Router site, so this job builds the site and runs a short Node script against the HTML output. The script checks each page for a <meta name="description">, a <link rel="canonical"> that matches the page's own URL, and basic Open Graph tags.

After validating, the job uploads the build as an artifact. The JSON-LD and Lighthouse jobs download it instead of rebuilding, so all three validate the same output and CI time does not multiply with each additional check:

  meta-tags:
    name: Meta and canonical tags
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v7

      - uses: actions/setup-node@v6
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - name: Cache Next.js build
        uses: actions/cache@v6
        with:
          path: .next/cache
          key: ${{ runner.os }}-nextjs-${{ hashFiles('**/package-lock.json') }}

      - name: Build
        run: npx next build
        env:
          NODE_ENV: production

      - name: Check meta and canonical tags
        run: node scripts/check-meta.mjs

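      - name: Upload build artifact
        uses: actions/upload-artifact@v7
        with:
          name: next-build
          path: |
            .next/
            public/
          # .next/ is a dot-directory; recent upload-artifact releases skip
          # hidden files unless this is set.
          include-hidden-files: true
          retention-days: 1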

Setting process.exitCode = 1 instead of calling process.exit(1) immediately lets the script report every failure across all pages in a single run rather than stopping at the first hit. Create scripts/check-meta.mjs in your repo:

// scripts/check-meta.mjs
import { readdir, readFile } from 'node:fs/promises';
import { join, resolve } from 'node:path';

const SITE_URL = process.env.SITE_URL ?? 'https://yoursite.com';
const BLOG_DIR = resolve('.next/server/app/blog');

async function walk(dir) {
  const entries = await readdir(dir, { withFileTypes: true });
  const files = [];
  for (const entry of entries) {
    const full = join(dir, entry.name);
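    if (entry.isDirectory()) {
      files.push(...await walk(full));
    } else if (entry.name.endsWith('.html')) {
      // Accept both <slug>/page.html and <slug>.html; the layout of
      // .next/server/app can differ between Next.js versions.
      files.push(full);
    }
  }
  return files;
}

async function checkPage(htmlPath) {
  // Derive the slug from the path relative to BLOG_DIR for either layout.
  const slug = htmlPath
    .replace(BLOG_DIR + '/', '')
    .replace(/(\/page)?\.html$/, '');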
  const html = await readFile(htmlPath, 'utf8');
  const expectedUrl = `${SITE_URL}/blog/${slug}/`;
  let ok = true;

  const description =
    html.match(/<meta[^>]+name="description"[^>]+content="([^"]+)"/i)?.[1] ??
    html.match(/<meta[^>]+content="([^"]+)"[^>]+name="description"/i)?.[1] ??
    null;

  if (!description) {
    console.error(`[FAIL] Missing meta description: /blog/${slug}/`);
    process.exitCode = 1;
    ok = false;
  }

  const canonical =
    html.match(/<link[^>]+rel="canonical"[^>]+href="([^"]+)"/i)?.[1] ??
    html.match(/<link[^>]+href="([^"]+)"[^>]+rel="canonical"/i)?.[1] ??
    null;

  if (!canonical || canonical !== expectedUrl) {
    console.error(`[FAIL] Canonical mismatch: /blog/${slug}/`);
    console.error(`  Expected: ${expectedUrl}`);
    console.error(`  Found:    ${canonical ?? 'missing'}`);
    process.exitCode = 1;
    ok = false;
  }

  const ogTitle =
    html.match(/<meta[^>]+property="og:title"[^>]+content="([^"]+)"/i)?.[1] ??
    html.match(/<meta[^>]+content="([^"]+)"[^>]+property="og:title"/i)?.[1] ??
    null;

  if (!ogTitle) {
    console.error(`[FAIL] Missing og:title: /blog/${slug}/`);
    process.exitCode = 1;
    ok = false;
  }

  if (ok) console.log(`[OK]   /blog/${slug}/`);
}

const files = await walk(BLOG_DIR).catch(() => []);

if (files.length === 0) {
  console.error('[FAIL] No HTML found in .next/server/app/blog - run next build first');
  process.exitCode = 1;
} else {
  await Promise.all(files.map(checkPage));
}

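walk recurses the App Router build directory and collects the pre-rendered HTML file for each post, and checkPage derives the slug from each file's path. The exact layout under .next/server/app/blog/ depends on the Next.js version (<slug>/page.html in some builds, <slug>.html in others), so inspect your own build output and confirm the filename match in walk covers it; if it does not, the script reports that no HTML was found. checkPage reads each file, runs all three checks without short-circuiting, and logs every failure before the process exits. Set SITE_URL via the environment (or hardcode your domain) to match the canonical your generateMetadata produces. The expected URL ends in a trailing slash, so drop the final / from expectedUrl if your site emits canonicals without one.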
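The regexes in check-meta.mjs expect double-quoted attributes in one of two orders. If your HTML output differs, a slightly more tolerant lookup might look like the sketch below. readTagAttr is a hypothetical helper, and it is still regex-based, so it will not cope with a literal > inside an attribute value; a real HTML parser is the sturdier option.

```javascript
// Find the first <tagName> whose `matchAttr` equals `matchValue` and
// return the value of `wantAttr`, or null when nothing matches.
// Works for any attribute order and for single or double quotes.
function readTagAttr(html, tagName, matchAttr, matchValue, wantAttr) {
  const tagPattern = new RegExp(`<${tagName}\\b[^>]*>`, 'gi');
  for (const [tag] of html.matchAll(tagPattern)) {
    const attrs = {};
    const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    for (const m of tag.matchAll(attrPattern)) {
      attrs[m[1].toLowerCase()] = m[2] ?? m[3];
    }
    if (attrs[matchAttr] === matchValue) return attrs[wantAttr] ?? null;
  }
  return null;
}

// Hypothetical markup with href before rel and single quotes on the meta tag.
const html =
  '<link href="https://yoursite.com/blog/a/" rel="canonical"/>' +
  "<meta content='A short summary' name='description'>";
console.log(readTagAttr(html, 'link', 'rel', 'canonical', 'href'));
console.log(readTagAttr(html, 'meta', 'name', 'description', 'content'));
```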

Job 3: JSON-LD linting - schemar posts pass/fail results as a sticky PR comment

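Schemar (johnnyreilly/schemar) wraps the Schema Markup Validator. It accepts a list of URLs, checks the JSON-LD on each against Schema.org's rules, and returns pass/fail results. Because the validator is a hosted service, it is worth confirming in a test PR that it can read the URLs you pass; if it cannot reach a localhost address on the runner, point it at a preview deployment URL instead. Combine Schemar with marocchino/sticky-pull-request-comment to keep the validation output as a single updating comment on the PR rather than a new comment on every push.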

This job downloads the build artifact from the meta-tags job rather than rebuilding from scratch. The needs: meta-tags dependency controls ordering; the artifact carries the actual output.

The job also needs the slug of the post being reviewed. Rather than hardcoding it, a get-slug step extracts it from the git diff: the slug is the first changed .md file under content/blog/ with the directory and extension stripped. The step assumes one post per PR; if no Markdown file changed, the slug comes out empty.

  json-ld:
    name: JSON-LD validation
    runs-on: ubuntu-latest
    needs: meta-tags
    steps:
      - uses: actions/checkout@v7

      - uses: actions/setup-node@v6
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - name: Download build artifact
        uses: actions/download-artifact@v8
        with:
          name: next-build

      - name: Get new post slug
        id: slug
        run: |
          git fetch origin ${{ github.base_ref }}:refs/remotes/origin/${{ github.base_ref }}
          SLUG=$(git diff --name-only origin/${{ github.base_ref }}..HEAD \
            -- 'content/blog/' | grep '\.md$' | head -1 \
            | sed 's|content/blog/||; s|\.md$||')
          echo "slug=${SLUG}" >> $GITHUB_OUTPUT

      - name: Start preview server
        run: npx next start &

      - name: Wait for server
        run: npx wait-on http://localhost:3000

      - name: Validate JSON-LD
        id: schemar
        uses: johnnyreilly/schemar@v0.1.1
        with:
          urls: "http://localhost:3000/blog/${{ steps.slug.outputs.slug }}/"

      - name: Format results as markdown
        id: format
        if: always()
        uses: actions/github-script@v9
        env:
          # Pass the output through the environment so an empty or unusual
          # value cannot break the script's syntax.
          RESULTS: ${{ steps.schemar.outputs.results }}
        with:
          script: |
            const results = JSON.parse(process.env.RESULTS || '[]');
            const lines = results.map((r) =>
              `${r.processedValidationResult.success ? '🟢' : '🔴'} ${r.url}: ${r.processedValidationResult.resultText}`
            );
            if (lines.length === 0) lines.push('No results returned by the validator.');
            core.setOutput('comment', ['### JSON-LD validation', ...lines].join('\n'));

      - name: Post results as sticky PR comment
        if: always()
        uses: marocchino/sticky-pull-request-comment@v3
        with:
          header: json-ld-validation
          message: ${{ steps.format.outputs.comment }}

The fetch line passes an explicit refspec, ${{ github.base_ref }}:refs/remotes/origin/${{ github.base_ref }}, instead of running a bare git fetch origin main. On pull_request events, actions/checkout@v7 defaults to a shallow, single-branch clone of the PR's merge commit, so a bare fetch only populates FETCH_HEAD and leaves no local origin/main ref for the diff to compare against. The explicit refspec creates that ref directly.

That still is not enough on its own. The depth-1 clone holds a single commit with no parent history, so git cannot find a common ancestor between origin/main and HEAD locally. A three-dot diff (origin/main...HEAD), which compares against the merge base, fails with fatal: no merge base in that state. The two-dot form above (origin/main..HEAD) compares the two tips directly and does not need one, so it works regardless of the checkout's fetch depth.
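The grep, head, and sed pipeline takes the first changed Markdown file and yields an empty string when there is none. The same extraction can be sketched as a function that returns every slug, leaving the caller to decide what zero or several changed posts should mean. slugsFromDiff is a hypothetical helper, and the sample diff output is invented for illustration.

```javascript
// `diffOutput` is the raw text printed by `git diff --name-only`.
// Returns the slug of every changed Markdown file under `contentDir`.
function slugsFromDiff(diffOutput, contentDir = 'content/blog/') {
  return diffOutput
    .split('\n')
    .map((line) => line.trim())
    .filter((path) => path.startsWith(contentDir) && path.endsWith('.md'))
    .map((path) => path.slice(contentDir.length, -'.md'.length));
}

// Invented sample: one workflow change and two posts in the same PR.
const sampleDiff = [
  '.github/workflows/blog-seo.yml',
  'content/blog/my-post.md',
  'content/blog/other-post.md',
].join('\n');

console.log(slugsFromDiff(sampleDiff)); // [ 'my-post', 'other-post' ]
```

An empty array is the case to watch: the workflow also triggers on changes to blog-seo.yml itself, and a PR that touches only that file has no post URL to validate.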

The header param on the sticky comment means each new push overwrites the previous result in place. The PR timeline stays clean.

Schemar's results output is Result[], a JSON array, not pre-formatted markdown: each entry carries a url and a processedValidationResult object with success and resultText fields, confirmed in schemar's action.yml. Passing that array straight to message posts raw JSON on the PR. The actions/github-script step in between maps each result to a one-line pass/fail row before it reaches the sticky comment, which is the same shape johnnyreilly's own writeup of the action uses for its PR comments.

Job 4: Lighthouse budget - serve the build locally, assert on LCP, CLS, and INP

treosh/lighthouse-ci-action runs Lighthouse CI against a locally served build and fails the job when any assertion falls below threshold.

Like the JSON-LD job, this downloads the artifact rather than running another build. It also uses the same get-slug step to discover the post URL from the diff, then generates .lighthouserc.json on the fly so no file needs manual editing per PR:

  lighthouse:
    name: Lighthouse budget
    runs-on: ubuntu-latest
    needs: meta-tags
    steps:
      - uses: actions/checkout@v7

      - uses: actions/setup-node@v6
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - name: Download build artifact
        uses: actions/download-artifact@v8
        with:
          name: next-build

      - name: Get new post slug
        id: slug
        run: |
          git fetch origin ${{ github.base_ref }}:refs/remotes/origin/${{ github.base_ref }}
          SLUG=$(git diff --name-only origin/${{ github.base_ref }}..HEAD \
            -- 'content/blog/' | grep '\.md$' | head -1 \
            | sed 's|content/blog/||; s|\.md$||')
          echo "slug=${SLUG}" >> $GITHUB_OUTPUT

      - name: Generate .lighthouserc.json
        run: |
          cat > .lighthouserc.json << EOF
          {
            "ci": {
              "collect": {
                "url": ["http://localhost:3000/blog/${{ steps.slug.outputs.slug }}/"],
                "startServerCommand": "npx next start",
                "startServerReadyPattern": "Ready in"
              },
              "assert": {
                "assertions": {
                  "largest-contentful-paint": ["error", { "maxNumericValue": 2500 }],
                  "cumulative-layout-shift": ["error", { "maxNumericValue": 0.1 }],
                  "total-blocking-time": ["warn", { "maxNumericValue": 300 }]
                }
              }
            }
          }
          EOF

      - name: Run Lighthouse CI
        uses: treosh/lighthouse-ci-action@v12
        with:
          uploadArtifacts: true
          temporaryPublicStorage: true
          configPath: .lighthouserc.json

The Generate .lighthouserc.json step uses a heredoc where ${{ steps.slug.outputs.slug }} is substituted by the Actions runner before the shell executes - so the generated file contains the literal slug, not a variable reference. LCP under 2500ms and CLS under 0.1 are Google's passing thresholds. Using "error" rather than "warn" is what causes the job to fail. Total blocking time is the closest lab-measurable proxy for INP; "warn" surfaces problems without blocking the merge on what is an approximation of a field metric. Tighten or relax as the site's performance baseline becomes clearer.
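An alternative to the heredoc, if shell quoting ever becomes a nuisance, is to build the same config as an object and serialize it. This is a sketch under the same assumptions as the workflow (port 3000, npx next start); buildLighthouseConfig is a hypothetical helper, and the ready pattern is left out for brevity.

```javascript
// Build the Lighthouse CI config for one post as a plain object.
function buildLighthouseConfig(slug, port = 3000) {
  if (!slug) throw new Error('slug is empty: no changed post found');
  return {
    ci: {
      collect: {
        url: [`http://localhost:${port}/blog/${slug}/`],
        startServerCommand: 'npx next start',
      },
      assert: {
        assertions: {
          // "error" fails the job; "warn" only reports.
          'largest-contentful-paint': ['error', { maxNumericValue: 2500 }],
          'cumulative-layout-shift': ['error', { maxNumericValue: 0.1 }],
          'total-blocking-time': ['warn', { maxNumericValue: 300 }],
        },
      },
    },
  };
}

// Serialize for writing to .lighthouserc.json.
console.log(JSON.stringify(buildLighthouseConfig('my-post'), null, 2));
```

Failing early on an empty slug avoids auditing /blog// by accident when the diff contains no post.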

Wiring it into the PR so the merge button stays red

The four jobs above produce check runs on every PR. By default, GitHub does not prevent merging when a check fails. One configuration step makes them binding.

Required status checks in branch protection - the one setting that makes everything above binding

Go to repository Settings, then Branches. Add a branch protection rule for the branch content merges into, typically main. Under "Require status checks to pass before merging", add all four job names:

Broken links
Meta and canonical tags
JSON-LD validation
Lighthouse budget

With these set as required, the merge button stays disabled until all four pass. A single failure keeps the PR locked regardless of approvals. The same branch protection rule is also what stops any GitHub App, including a well-behaved AI writer, from pushing straight to main: it forces a Contents-write token through this exact PR path, which is worth checking alongside the app's actual permission grant.

Without this step, the entire setup is advisory: the checks run and report, but nothing actually blocks the merge. This is the step most workflow tutorials omit.
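The same setting can be scripted instead of clicked. The sketch below only builds the request for GitHub's REST endpoint for updating branch protection; it sends nothing. The owner and repo values are placeholders, buildProtectionRequest is a hypothetical helper, and the null and true values are choices made for this example. That endpoint replaces the whole protection rule, so merge in any review requirements you already have before using something like this.

```javascript
// The four job names from the workflow above.
const REQUIRED_CHECKS = [
  'Broken links',
  'Meta and canonical tags',
  'JSON-LD validation',
  'Lighthouse budget',
];

// Build (but do not send) the branch protection update request.
function buildProtectionRequest(owner, repo, branch, checks = REQUIRED_CHECKS) {
  return {
    method: 'PUT',
    url: `https://api.github.com/repos/${owner}/${repo}/branches/${branch}/protection`,
    body: {
      required_status_checks: { strict: false, contexts: checks },
      enforce_admins: true, // example choice: admins are gated too
      required_pull_request_reviews: null, // example choice: no review rule
      restrictions: null, // example choice: no push restrictions
    },
  };
}

// Placeholder owner and repo names.
const request = buildProtectionRequest('your-org', 'your-blog', 'main');
console.log(request.url);
console.log(JSON.stringify(request.body, null, 2));
```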

Surfacing failures inline with sticky PR comments

The Schemar job's sticky comment puts JSON-LD results directly on the PR, with no trip to the Actions run page. For the other three jobs, the detail is one click away from each check's status link: lychee writes its report to the job summary (jobSummary: true), the meta-tag script prints to the job log, and the Lighthouse action uploads its reports (uploadArtifacts and temporaryPublicStorage).

Make the meta-tag script output specific enough to act on immediately:

[FAIL] Missing meta description: /blog/new-post-slug/
[FAIL] Canonical mismatch: /blog/new-post-slug/
  Expected: https://yoursite.com/blog/new-post-slug/
  Found:    https://yoursite.com/blog/

Keeping it zero-maintenance: .lycheeignore, pinned action versions, and caching the build

Three habits prevent the workflow from becoming a source of noise.

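Pin action versions to major version tags (@v2, @v7, @v12). Floating refs such as a default branch (@main) can break without warning when upstream ships a breaking change; a major tag still moves, but only within that major version. For stricter guarantees, pin to a full commit SHA. Check release pages when onboarding a new action; marocchino/sticky-pull-request-comment is at v3, for example.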
Share the build output. The cache step in the meta-tags job preserves .next/cache between workflow runs, and the artifact upload carries the final output to the JSON-LD and Lighthouse jobs - one build per PR, three jobs consuming it.
Keep .lycheeignore current. As the blog grows, more code-block URLs and archived-page references need exclusion. A stale file generates false failures that train the team to dismiss CI output; update it when adding an exclusion-worthy URL.

Where the green check ends and editorial judgment begins

What four passing jobs actually confirm - and what they cannot

A green run confirms:

No link in the blog's Markdown files returns a 4xx or 5xx response
Every pre-rendered blog page has a meta description, a self-referencing canonical, and an og:title tag
The structured data on the changed post validates against Schema.org
The changed post clears the LCP and CLS thresholds under lab conditions

What it does not confirm: whether the facts are correct, whether the post answers the question it sets up, or whether the prose is worth reading. CI has no opinion about those things.

This is the same division that makes automated content creation work without creating editorial risk: automate every check that has a clear pass/fail definition, leave judgment to people with context.

The split that works: CI owns technical correctness, humans own voice and facts

When a PR reaches human review with all four checks green, the reviewer does not need to wonder whether the canonical is pointing at itself or whether the link to the case study still resolves. CI answered those questions. The reviewer can focus on what CI cannot check: accuracy, voice, and whether the post actually serves the reader.

The pipeline then has three layers: Lyra verifies claims and links in a fact-checking step before opening the PR, CI gates the technical surface, and human review handles editorial judgment. All three pass before the post ships.

For teams using a PR-based AI blog writer where an agent produces the first draft, the CI gate is especially useful. The agent drafts fast, the checks run in parallel, and the reviewer sees a PR already validated on both the technical and factual axes. Internal linking automation also benefits directly: the broken-link job confirms that any new cross-links added to a post actually resolve before they ship.

I'm building Lyra, an autonomous blog writer that writes in your blog's voice, fact-checks every claim, and opens a pull request you review. This post comes from her blog, where we publish what we learn running the pipeline. Happy to answer questions in the comments.
