DEV Community

Hackceleration
Hackceleration

Posted on • Originally published at hackceleration.com

Building an Automated Broken Link Checker with n8n and Google Sheets

Dead links damage SEO rankings and frustrate users, but manually checking hundreds of pages is impractical. This n8n workflow automatically scans your entire website daily, tests every internal link using HTTP HEAD requests, and logs broken URLs to Google Sheets with their source pages. Here's how to architect this integration in n8n.

Architecture Overview

The workflow operates in two connected parts:

Main Workflow (Orchestration)

1. Schedule Trigger (daily midnight)
2. Fetch XML sitemap
3. Parse XML to JSON
4. Extract page URLs
5. For each page:
   - Download HTML content
   - Extract all internal links (regex filtering)
6. Send aggregated links to sub-workflow webhook
7. Create dated Google Sheets report
8. Move report to dedicated Drive folder
Enter fullscreen mode Exit fullscreen mode

Sub-Workflow (Link Testing)

1. Receive links via webhook
2. For each link:
   - Send HTTP HEAD request
   - Check status code
   - If ≠ 200: Log to Google Sheets
3. Return results to main workflow
Enter fullscreen mode Exit fullscreen mode

Why this architecture? Separating orchestration from testing prevents timeout issues on large sites. The main workflow completes quickly after delegating link testing to the webhook-triggered sub-workflow, which processes links asynchronously. Using HTTP HEAD requests instead of GET requests reduces bandwidth by ~95% since we only need response headers, not full page content.

API Integration Deep-Dive

XML Sitemap Fetching

The workflow starts by downloading your sitemap—a structured list of all pages you want indexed:

// HTTP Request node configuration
GET https://yourdomain.com/sitemap.xml
Headers: {}
Authentication: None
Enter fullscreen mode Exit fullscreen mode

Response structure (XML):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/page1</loc>
    <lastmod>2025-01-15</lastmod>
  </url>
  <url>
    <loc>https://yourdomain.com/page2</loc>
  </url>
</urlset>
Enter fullscreen mode Exit fullscreen mode

The XML to JSON node converts this to:

{
  "urlset": {
    "url": [
      {"loc": "https://yourdomain.com/page1"},
      {"loc": "https://yourdomain.com/page2"}
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

What happens when sitemap is missing? The HTTP Request node returns a 404 error and the workflow fails. Always verify your sitemap URL before first run—common paths include /sitemap.xml, /sitemap_index.xml, or /page-sitemap.xml for WordPress sites with SEO plugins.

Link Extraction with Smart Filtering

The Code node extracts URLs from HTML using regex, then filters aggressively:

// Extract all URLs from HTML
const html = $input.item.json.data;
const urlRegex = /href=["']([^"']+)["']|src=["']([^"']+)["']/gi;
const matches = [...html.matchAll(urlRegex)];

// Deduplicate
const uniqueUrls = [...new Set(matches.map(m => m[1] || m[2]))];

// Filter CDN and static assets
const cdnPatterns = [
  /cloudflare\.com/,
  /cloudfront\.net/,
  /googleapis\.com/,
  /gstatic\.com/
];

const staticExtensions = /\.(jpg|jpeg|png|gif|css|js|woff|woff2|ttf)$/i;

const filteredUrls = uniqueUrls.filter(url => 
  !cdnPatterns.some(pattern => pattern.test(url)) &&
  !staticExtensions.test(url)
);
Enter fullscreen mode Exit fullscreen mode

Why filter CDN resources? CDN URLs rarely break (they're managed by providers like Cloudflare), and testing them wastes execution time. Static assets like images or fonts breaking won't affect SEO as critically as page-level 404s. This filtering reduces checks by 60-80% on typical sites.

HTTP HEAD Requests for Link Testing

Instead of downloading full pages, the workflow sends HEAD requests:

// HTTP Request node (sub-workflow)
HEAD {{ $json.url }}
Headers: {}
Ignore SSL Issues: true
Enter fullscreen mode Exit fullscreen mode

Response example (200 OK):

{
  "statusCode": 200,
  "headers": {
    "content-type": "text/html",
    "content-length": "15234"
  }
}
Enter fullscreen mode Exit fullscreen mode

Response example (404 Not Found):

{
  "statusCode": 404,
  "headers": {
    "content-type": "text/html"
  }
}
Enter fullscreen mode Exit fullscreen mode

HEAD vs GET performance: A typical page GET request transfers 500KB-2MB. A HEAD request for the same URL transfers <1KB (just headers). For 1000 links, that's 500MB vs 500KB—a 1000x bandwidth reduction.

Google Sheets Logging

Broken links get written to Google Sheets using the Append Row operation:

// Google Sheets API call (abstracted by n8n node)
POST https://sheets.googleapis.com/v4/spreadsheets/{spreadsheetId}/values/Sheet1!A1:append
Authorization: Bearer {oauth_token}
Body: {
  "values": [[
    "https://yourdomain.com/source-page",
    "https://yourdomain.com/broken-link"
  ]]
}
Enter fullscreen mode Exit fullscreen mode

Authentication: The node uses OAuth2 credentials configured in n8n. You'll need to authorize n8n to access your Google account once—the refresh token handles subsequent authentications automatically.

Rate limits: Google Sheets API allows 100 requests per 100 seconds per user. With one append per broken link, you can log ~6000 broken links per hour before hitting limits. For most sites, this is excessive headroom.

Implementation Gotchas

Handling Missing or Malformed Sitemaps

If your sitemap URL returns 404, the workflow fails immediately. Add error handling by setting "Continue On Fail" in the HTTP Request node, then add an IF node checking for statusCode === 200 before proceeding.

Webhook URL Configuration

The sub-workflow webhook must be accessible from your main workflow. If running n8n locally, use a tunneling service like ngrok for testing. In production, ensure your n8n instance has a public URL or internal network routing configured.

Dealing with Rate Limiting

Some sites implement aggressive rate limiting. If testing 1000+ links triggers IP blocks, add a Wait node between HTTP HEAD requests (50-100ms delay). This increases execution time but prevents false positives from rate limit 429 errors.

SSL Certificate Errors

The workflow enables "Ignore SSL Issues" to handle sites with expired certificates. This means you'll flag links as broken based on HTTP status codes, not SSL validity. If you need to exclude SSL errors, disable this option—but be prepared for false positives.

Authentication Errors

Google API credentials can expire or revoke. If the workflow suddenly fails with 401 errors, re-authenticate your Google Sheets credential in n8n's credential manager. The OAuth2 flow will refresh your access token.

Cost Optimization

n8n cloud pricing is based on workflow executions and data processed. To minimize costs:

  • Run the workflow during off-peak hours (midnight reduces server load competition)
  • Increase the Schedule Trigger interval to weekly instead of daily for smaller sites
  • Use the batch size parameter to process multiple links per execution rather than one-by-one

Common Failure Modes

Memory exhaustion: Sites with 10,000+ pages can exhaust n8n's memory during HTML parsing. Solution: Add pagination to the Split In Batches node (process 100 pages per batch, loop until complete).

Webhook timeout: If link testing takes >60 seconds, the webhook times out. Solution: Return an immediate 202 Accepted response, then continue processing asynchronously without waiting for completion.

Duplicate logging: If the workflow re-runs while still processing, you'll get duplicate Google Sheets rows. Solution: Add a Set node at the start that checks for existing executions using n8n's workflow API before proceeding.

Prerequisites

Required accounts and tools:

  • n8n instance (self-hosted or cloud) with active workflow execution
  • Google account with Sheets and Drive access
  • XML sitemap accessible at public URL
  • Google Drive folder for storing reports

API credentials needed:

  • Google Sheets OAuth2 credential in n8n
  • Google Drive OAuth2 credential in n8n

Estimated API costs:

  • Google Sheets API: Free (within 100 requests/100 seconds quota)
  • Google Drive API: Free (within 1000 requests/100 seconds quota)
  • n8n cloud: ~$20/month for 5000 workflow executions (if using hosted version)

Links to official documentation:

Get the Complete Workflow Configuration

This tutorial covers the API integration architecture and core logic for building a broken link checker in n8n. For the complete workflow JSON with all node configurations, visual workflow canvas screenshots, and a video walkthrough of the setup process, check out the full implementation guide.

Top comments (0)