Datawinder

Posted on Jun 10 • Edited on Jun 22 • Originally published at datawinder.hashnode.dev

Building a Lean, Single-Worker Broken URL Monitor for Data Pipelines

#webscraping #datapipeline #devtools #apify

The Technical Problem: Websites Drift, Pipelines Don't Know

Long-running scraping pipelines have a structural assumption baked in: the URLs you configured last month still resolve today. That assumption is wrong more often than you'd expect.

Sites reorganize their URL structure during CMS migrations. Documentation pages get archived or consolidated. Blog posts get unpublished. Product pages disappear. This is called site drift — the slow, continuous decay of a website's link graph over time — and it's completely normal behavior from the target site's perspective. From your pipeline's perspective it's a quiet source of wasted work.

The failure mode looks like this: your scheduled scraper fires, constructs its list of target URLs from a cached sitemap or a hardcoded config, and dispatches requests to all of them. Some of those URLs now return 404 Not Found or 500 Internal Server Error. The pipeline either silently swallows the errors, logs them somewhere nobody checks, or — worse — passes empty response bodies downstream into your parser, which produces garbage records. Your data store fills with empty or malformed entries. Compute units are consumed for zero useful output.

At small scale, this is a minor annoyance. At any meaningful schedule frequency — hourly, daily, continuous — it compounds into a real cost problem. You're paying for bandwidth and execution time on requests you already know are going to fail, because nobody built a gate to check first.

Open-Source Shortcut: If you want to skip the setup and see this asynchronous connection pool logic running directly in your local terminal, I’ve open-sourced the lean, parameter-driven codebase. You can clone it straight from GitHub: lean-sitemap-monitor.

The Resource Constraint: Why You Don't Need a Distributed System For This

The instinctive, over-engineered response to this problem is a Redis queue holding URL state, a database tracking historical status codes per endpoint, a separate worker process polling for changes, and a notification layer sitting on top of all of it. That architecture exists in enterprise SEO tooling and costs $99–$300/month to run as a managed service.

For a solo developer or a small pipeline, that's the wrong answer on every axis. It's expensive to run, painful to maintain, and solves a much harder version of the problem than you actually have.

The right mental model here is simpler: you need a scheduled, single-loop execution that reads a known list of URLs, checks each one, and reports what's broken. No persistent state beyond the last run's output. No complex graph traversal. No distributed coordination.

A contained, single-worker monitor has a near-zero infrastructure footprint. It runs, produces a report, and exits. The scheduling layer — a cron job, a CI pipeline trigger, an Apify schedule — is entirely separate from the execution logic. Keeping those two concerns decoupled is what makes the tool cheap to operate and easy to reason about.

The Core Mechanics: How to Make It Efficient

Given the constraint of a single-loop executor, three engineering decisions determine whether the tool is actually useful or just technically correct.

1. A Single Entry Point: Sitemap Ingestion

Instead of maintaining a manually curated list of URLs or building a crawler that discovers pages by following links, the monitor reads directly from the target site's sitemap.xml. A sitemap is a structured, flat inventory of every URL the site owner considers canonical — exactly the list you want to check. Parsing it once at the start of each run gives you a complete, authoritative URL set without any graph traversal or state management overhead.

from apify_client import ApifyClient

# Initialize the client with your Apify API token
client = ApifyClient("<YOUR_API_TOKEN>")

# One entry point: the sitemap URL.
# The actor parses it into a flat URL list and loads it straight into the check queue.
# All other parameters have sensible defaults — override only what you need.
run_input = {
    "sitemapUrl": "https://example.com/sitemap.xml",
    "requestMethod": "head",      # HEAD only fetches status headers, not the full page body
    "followRedirects": True,      # Track redirect chains to confirm final destination status
    "timeoutMs": 10000,           # Drop any request that hasn't responded within 10 seconds
    "maxConcurrency": 10          # Max simultaneous in-flight requests — keeps memory and rate limits sane
}

# Run the actor and wait for it to finish
run = client.actor("datawinder/broken-url-monitor").call(run_input=run_input)

# Results come back as dataset items — one output record per run
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    if item.get("baseline"):
        print("Baseline established. Monitor is active for next run.")
    elif item.get("unchanged"):
        print(f"No changes. {item.get('unchangedCount', 0)} URLs confirmed healthy.")
    else:
        critical = item.get("changes", {}).get("critical", [])
        if critical:
            print(f"{len(critical)} dead URLs detected:")
            for change in critical:
                print(f"  {change['url']} — was {change['previous']['status']}, now {change['current']['status']}")
        else:
            print("Changes detected but none critical. Check warning and info tiers.")

This also means the URL list stays current automatically. When the site adds or removes pages, the sitemap reflects it. You're not maintaining a separate config file that drifts out of sync with reality.
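If you'd rather see what sitemap ingestion boils down to, here is a minimal standard-library sketch. It assumes a plain urlset sitemap in the sitemaps.org namespace and does not handle sitemap index files. The function name extract_sitemap_urls is hypothetical, and this is not necessarily how the actor parses sitemaps internally.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry, in document order.

    Hypothetical helper for illustration; handles a flat <urlset> only.
    """
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls


# Tiny inline sitemap so the sketch runs without network access.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

print(extract_sitemap_urls(sample))
```

The result is the flat URL list that every later step works from.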

2. Protocol Optimization: HEAD Requests, Not GET

This is the single most impactful efficiency decision in the whole tool. A standard GET request downloads the full HTTP response — status line, headers, and the entire response body. For a documentation page, that might be 80–200KB of HTML you have no use for. Multiply that by 500 URLs and you've downloaded 40–100MB of content just to check whether those pages exist.

A HEAD request asks for the response headers only. The server returns the status code — 200 OK, 301 Moved Permanently, 404 Not Found, 500 Internal Server Error — without the body. The transfer cost is negligible. You get exactly the signal you need: is this URL alive or dead.

The followRedirects flag handles the case where a URL has moved rather than died. A 301 redirect isn't necessarily a broken link — it might be a canonical URL change where the content still exists at a new location. Following the redirect chain to the final destination status code is what distinguishes "this page moved" from "this page is gone."

The one edge case: some servers reject HEAD requests and return 405 Method Not Allowed. When that happens, the requestMethod input can be toggled to "get" as a fallback. That's a configuration decision, not a code change.
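For a local experiment, the HEAD-first check with a GET fallback might look roughly like this. It is a sketch using urllib from the standard library, not the actor's code; fetch_status and check_url are hypothetical names. Unlike the actor's configuration toggle, this version retries automatically on a 405, which is a deliberate simplification. Connection errors and timeouts are left to propagate.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, method: str = "HEAD", timeout_s: float = 10.0) -> int:
    """Send one request and return the HTTP status code without reading the body."""
    request = urllib.request.Request(url, method=method)
    try:
        # urlopen follows redirects by default, so this is the final status.
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx responses; the code itself is the signal we want.
        return err.code


def check_url(url: str, fetch: Callable[[str, str], int] = fetch_status) -> int:
    """Try HEAD first; retry once with GET if the server answers 405."""
    status = fetch(url, "HEAD")
    if status == 405:
        status = fetch(url, "GET")
    return status
```

Calling check_url("https://example.com/") performs a real request. The fetch parameter exists so the fallback logic can be exercised with a stub instead of the network.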

3. Fail-Safe Boundaries: Timeouts and Concurrency

Two parameters keep the single-loop execution from becoming a liability.

timeoutMs (default: 10,000ms) is a per-request hard cutoff. Without it, a single hanging socket — a server that accepts the connection but never responds — can stall the entire execution thread waiting indefinitely. With it, any request that doesn't respond within 10 seconds is marked as timed out and the loop moves on. The pipeline doesn't hang. The report still generates.

maxConcurrency (default: 10) controls how many requests are in flight simultaneously. This serves two purposes. First, it prevents local memory exhaustion: opening 500 simultaneous connections is a fast way to OOM a small worker. Second, it keeps the request rate polite enough that the target server doesn't rate-limit or block the monitor. With ten requests in flight, a 500-URL sitemap takes roughly 50 rounds, so it finishes in under a minute as long as responses average about a second or less. That is fast enough to be practical and conservative enough to avoid triggering most rate limiters.

Together these two parameters define the execution envelope. The monitor runs fast, doesn't hang, and doesn't get itself blocked.
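Here is one way that execution envelope could be sketched with asyncio: a semaphore caps the in-flight requests and asyncio.wait_for enforces the per-request cutoff. The names check_all and fake_check are hypothetical, and the real HTTP call is replaced by a stub so the example runs offline. Treat it as an illustration of the idea, not the actor's implementation.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def check_all(
    urls: list[str],
    check: Callable[[str], Awaitable[int]],
    max_concurrency: int = 10,
    timeout_s: float = 10.0,
) -> dict[str, Optional[int]]:
    """Check every URL with at most max_concurrency checks in flight.

    A value of None means the check did not finish within timeout_s.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> tuple[str, Optional[int]]:
        async with semaphore:  # waits here while the pool is full
            try:
                return url, await asyncio.wait_for(check(url), timeout_s)
            except asyncio.TimeoutError:
                return url, None  # record the timeout and move on

    pairs = await asyncio.gather(*(guarded(u) for u in urls))
    return dict(pairs)


async def fake_check(url: str) -> int:
    """Stand-in for a real HTTP check: URLs containing 'slow' hang past the cutoff."""
    await asyncio.sleep(5 if "slow" in url else 0.01)
    return 200


if __name__ == "__main__":
    demo = ["https://example.com/a", "https://example.com/slow"]
    print(asyncio.run(check_all(demo, fake_check, timeout_s=0.1)))
```

The timeout clock starts only once a check has acquired its slot, so time spent waiting in the queue does not count against the request.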

The Implementation: What the Output Looks Like

Running the monitor produces a structured JSON report. On first run, it establishes a baseline:

{
  "baseline": true,
  "summary": {
    "total": 84,
    "ok": 84,
    "redirect": 0,
    "clientError": 0,
    "serverError": 0
  },
  "message": "Baseline stored. Monitoring is now active."
}
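As a rough illustration of where those summary counts could come from, the sketch below buckets status codes by their standard HTTP class. The mapping of classes to the ok, redirect, clientError, and serverError keys is my assumption based on the field names, and bucket and summarize are hypothetical helpers.

```python
from collections import Counter


def bucket(status: int) -> str:
    """Map an HTTP status code to a summary key (assumed mapping by status class)."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if 400 <= status < 500:
        return "clientError"
    if 500 <= status < 600:
        return "serverError"
    raise ValueError(f"unexpected status code: {status}")


def summarize(statuses: dict[str, int]) -> dict[str, int]:
    """Build a summary shaped like the baseline report from a url -> status mapping."""
    counts = Counter(bucket(code) for code in statuses.values())
    return {
        "total": len(statuses),
        "ok": counts["ok"],
        "redirect": counts["redirect"],
        "clientError": counts["clientError"],
        "serverError": counts["serverError"],
    }


print(summarize({"https://example.com/": 200, "https://example.com/old": 404}))
```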

On subsequent runs, it diffs against that baseline and surfaces only what changed:

{
  "baseline": false,
  "summary": { "total": 84, "ok": 82, "errors": 2 },
  "changes": {
    "critical": [
      {
        "url": "https://example.com/target-page",
        "previous": { "status": 200 },
        "current": { "status": 404 }
      }
    ],
    "warning": [],
    "info": []
  },
  "unchangedCount": 82
}

changes.critical is the actionable list: URLs that were previously healthy and are now returning errors. That's the array you pipe into your alerting logic or your pipeline's pre-flight gate. The URLs counted in unchangedCount match their baseline status, so they need no further attention downstream.

The severity tiers (critical, warning, info) let you tune how aggressively you respond. A critical — a 200 that became a 404 — is worth blocking a pipeline run over. A warning — a timestamp regression or a minor metadata shift — probably isn't.
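To make the diffing step concrete, here is a simplified sketch that compares two url-to-status mappings and emits the same shape as the report above. The tiering rule is an assumption for illustration: a change from below 400 to 400 or above is critical, and any other status change lands in info. The warning tier stays empty because the metadata it depends on is not collected here. diff_against_baseline is a hypothetical name.

```python
def diff_against_baseline(previous: dict[str, int], current: dict[str, int]) -> dict:
    """Compare current statuses to the baseline and group the changes by severity."""
    critical, info = [], []
    unchanged = 0
    for url, now in current.items():
        before = previous.get(url)
        if before is None:
            continue  # new URL: nothing in the baseline to compare against
        if before == now:
            unchanged += 1
            continue
        change = {"url": url, "previous": {"status": before}, "current": {"status": now}}
        if before < 400 <= now:
            critical.append(change)  # was healthy, now an error
        else:
            info.append(change)  # assumed tier for every other status change
    return {
        "baseline": False,
        "changes": {"critical": critical, "warning": [], "info": info},
        "unchangedCount": unchanged,
    }


baseline = {"https://example.com/a": 200, "https://example.com/b": 200}
latest = {"https://example.com/a": 200, "https://example.com/b": 404}
print(diff_against_baseline(baseline, latest))
```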

Scaling Up to Cloud-Level Data Auditing

Building an asynchronous concurrency loop locally is a good way to understand how to get the most throughput out of your machine. However, running status checks over thousands of URLs on a daily schedule consumes significant local network bandwidth, risks exhausting memory when a large sitemap is loaded as a single XML string, and requires your machine or a dedicated server to stay awake.

If you want a hands-off, production-grade upgrade, check out the Official Broken URL Monitor Actor on Apify. It offloads the entire processing loop to a serverless cloud instance, uses streaming XML processing to avoid memory spikes on very large sitemaps, and gives you visual health dashboards with plug-and-play Slack and email alerts for pennies per execution.

Wrapping Up

This exact logic is packaged into the broken-url-monitor Actor on Apify. It takes a sitemap URL as input, runs the HEAD request loop with the parameters described above, persists the baseline between runs on Apify's infrastructure, and returns the structured diff. No server to maintain, no state database to manage, no $99/month SEO platform subscription.

The actor runs for literal pennies per execution on a 500-URL sitemap. Schedule it ahead of your main scraping pipeline and use the changes.critical array as a pre-flight check. If it's empty, proceed. If it's not, fix the dead URLs before wasting a full pipeline run on them.
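A pre-flight gate along those lines can be very small. The sketch below reads the changes.critical array from a report item and turns it into an exit code that a scheduler or CI step could act on. The helper names dead_urls and preflight_exit_code are hypothetical, and the example report simply mirrors the sample output shown earlier.

```python
def dead_urls(report: dict) -> list[str]:
    """Return the URLs listed under changes.critical (empty for baseline runs)."""
    critical = report.get("changes", {}).get("critical", [])
    return [change["url"] for change in critical]


def preflight_exit_code(report: dict) -> int:
    """Return 0 when the pipeline may proceed, 1 when dead URLs need fixing first."""
    dead = dead_urls(report)
    for url in dead:
        print(f"dead URL, fix before scraping: {url}")
    return 1 if dead else 0


# Example report mirroring the sample diff output.
example_report = {
    "baseline": False,
    "changes": {
        "critical": [
            {
                "url": "https://example.com/target-page",
                "previous": {"status": 200},
                "current": {"status": 404},
            }
        ],
        "warning": [],
        "info": [],
    },
    "unchangedCount": 82,
}

print(preflight_exit_code(example_report))
```

In a wrapper script, passing the return value to sys.exit lets the scheduler skip the main scraping run whenever the gate fails.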

The schemas and source are on the Datawinder Labs GitHub if you want to look under the hood or adapt the logic for your own use case.
