NexGenData

Posted on Jun 29 • Originally published at thenextgennexus.com

Detect Web Vitals Regressions in Production Before Real Users Notice

#api #webscraping #opensource #python

Synthetic monitoring is one of those budget lines that scales linearly with how seriously a company takes web performance, and then jumps to a different curve entirely the moment somebody adds Real User Monitoring. A typical DataDog Synthetics setup for a marketing site at a Series B startup runs USD 12 per browser test per month. Multiply by 8 critical landing pages, 2 geographic regions, and a 5-minute interval, and the line item lands at roughly USD 1,920 per month before you have written a single assertion. Add DataDog RUM at USD 1.50 per 1,000 sessions and a moderately trafficked marketing site adds another USD 800 to USD 2,500. Now you are paying USD 3,000 to USD 4,500 per month to find out that your marketing team's hero video swap regressed LCP from 2.1s to 3.4s.

For product surfaces and checkout funnels, DataDog and New Relic earn their keep. They give you distributed tracing, error correlation, deployment markers, RUM with session replay, and the kind of forensic depth you need when a checkout abandonment alert fires at 2am. But for marketing pages, blog content, docs sites, and any surface where the question you actually want to answer is "did our last deploy hurt Core Web Vitals on the pages Google measures for ranking?" — synthetic APM is wildly over-instrumented and over-priced.

The pragmatic alternative for the marketing-and-content surface area: a scheduled Lighthouse run on your top 50 URLs, plus a per-deploy Lighthouse run on the 8 pages that actually matter, with a Slack alert when any Core Web Vital crosses Google's "good" threshold. Total cost: under USD 50 per month. Implementation time: an afternoon for the GitHub Actions workflow plus a day for the dashboarding.

This post walks through that build end to end. We will compare the cost and capability tradeoffs against DataDog Synthetics, New Relic Browser, and SpeedCurve LUX; show a production-ready GitHub Actions workflow that runs Lighthouse on every prod deploy and posts to Slack on regression; cover the threshold tuning that prevents alert fatigue; and lay out the schema for storing the time series so you can spot trends instead of just thresholds.

Why "marketing perf" is its own monitoring problem

Three things make marketing-page perf monitoring different from product-app perf monitoring:

Deploy frequency is bursty, not continuous. Most marketing sites deploy through a CMS or a static site generator with PRs from contractors or marketing ops. You get 3-15 deploys per day during a campaign push and zero for two weeks afterwards. With APM tools that bill on test count or session volume, you are paying for the dead air.
The metric that matters is the lab metric, not the field metric. Google's CrUX (Chrome User Experience Report) data is the field source Google uses for ranking, but it is aggregated over a trailing 28-day window and is bucketed by origin (or, for pages with enough traffic, by URL), so a regression takes days to weeks to show up clearly. What you actually want during incident triage is "did the last deploy regress LCP on /pricing?" That is a lab Lighthouse question, not a RUM question. RUM tells you what users saw last week. Lab tells you what changed in the build that landed at 14:47.
Most marketing-page perf regressions are introduced by content edits, not code. A marketer drops in a 4MB hero image instead of a 400KB one. The CMS publishes it without optimization. CWV tanks. Because the content was published outside the engineering deploy pipeline, the APM tool has no deployment marker to correlate the regression with, and the alert sits in a channel for three weeks before anyone connects the dots.

A lightweight Lighthouse-on-deploy check plus an hourly scheduled audit catches all three failure modes for a fraction of the synthetic-monitoring spend.

Cost comparison: synthetic monitoring vs. scheduled Lighthouse

For a representative scenario — 50 URLs monitored hourly, 8 critical URLs audited on every deploy with roughly 10 deploys per day — here is what the major options cost in 2026:

| Tool | Pricing | Monthly cost (50 URLs hourly + 8 per-deploy) |
| --- | --- | --- |
| DataDog Synthetic Browser Tests | USD 12/test/month | USD 600+ for 50 hourly tests, plus roughly USD 300 for the per-deploy runs as triggered tests |
| New Relic Synthetic Browser Monitor | USD 0.0050/check (Advanced) | USD 180/month for 36,000 hourly checks |
| SpeedCurve LUX Lite | USD 134/month for 25K monthly checks | USD 134/month, plus separate Synthetic plan for per-deploy |
| SpeedCurve Synthetic Pro | USD 414/month for 50K checks | USD 414/month |
| Calibre Standard | USD 273/month for 4K pages | USD 273/month |
| Lighthouse CI on self-hosted infra | t3.medium plus storage | USD 45/month plus DevOps time and on-call burden |
| Page Speed Analyzer actor (scheduled + per-deploy) | USD 0.05/page audited | USD 38/month for 36K hourly + 2.4K per-deploy audits |

The actor route comes in roughly an order of magnitude under DataDog and SpeedCurve for the same audit volume. The tradeoff is that you do not get session replay, distributed tracing, or any of the APM goodies. For marketing-page perf monitoring, you do not need them.
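The audit volumes behind the table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes a 30-day month; the function and field names are illustrative and not part of any vendor API.

```javascript
// Monthly audit volume for the scenario above, assuming a 30-day month.
const DAYS_PER_MONTH = 30;

function monthlyAudits({ scheduledUrls, runsPerDay, deployUrls, deploysPerDay }) {
  const scheduled = scheduledUrls * runsPerDay * DAYS_PER_MONTH;
  const perDeploy = deployUrls * deploysPerDay * DAYS_PER_MONTH;
  return { scheduled, perDeploy, total: scheduled + perDeploy };
}

const volume = monthlyAudits({
  scheduledUrls: 50, runsPerDay: 24,  // 50 URLs, audited hourly
  deployUrls: 8, deploysPerDay: 10,   // 8 critical URLs, roughly 10 deploys/day
});
console.log(volume); // { scheduled: 36000, perDeploy: 2400, total: 38400 }

// Per-check pricing (the New Relic row): checks x unit price, rounded to whole USD.
const newRelicUsd = Math.round(volume.scheduled * 0.005);
console.log(newRelicUsd); // 180
```

Swapping in your own URL counts and deploy cadence gives the volume to multiply against any per-check or per-audit price.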

The bulk Lighthouse runner

The NexGenData Page Speed Analyzer is an Apify actor that wraps the Google PageSpeed Insights API for bulk runs. You feed it a list of URLs, optionally a Google API key (recommended; the free tier gets you 25,000 audits per day per Cloud project), and it returns a JSON record per URL with the full Lighthouse report: Performance, Accessibility, Best Practices, SEO, and PWA scores plus the lab performance metrics (FCP, LCP, CLS, TBT, TTI). Of these, only LCP and CLS are Core Web Vitals; the third, INP, is a field metric that a lab run cannot report. The actor supports mobile and desktop emulation and handles the rate limiting, retries, and parallelism that you would otherwise have to build yourself.

Compared to standing up Lighthouse CI on your own infrastructure, the operational savings are non-trivial. Lighthouse CI is open source and good, but it is also yet another service for your platform team to monitor, patch, and rotate Chromium versions for. The actor abstracts all of that behind an HTTP API, which means your CI workflow is a single curl to start a run and a poll loop to fetch results.

Architecture overview

Here is the full pipeline. There are two trigger paths: scheduled (hourly perf snapshot) and on-deploy (immediate post-release validation).


    [Scheduled trigger, hourly]
                |
                v
    [Page Speed Analyzer actor]
                |
                v
    [Apify dataset, JSON]
                |
                v
    [Postgres time-series table]
                |
                v
    [Grafana dashboard + Slack alert]


    [Production deploy, prod webhook]
                |
                v
    [GitHub Actions workflow]
                |
                v
    [Page Speed Analyzer actor, sync run]
                |
                v
    [Threshold check in workflow step]
                |
                +--> Pass: post green check to deployment
                +--> Fail: post Slack alert + open PR comment

The two paths share the same actor and the same storage table, which keeps the pipeline simple. The on-deploy path is blocking (the workflow polls until the audit finishes) and the scheduled path is fire-and-forget (a webhook on actor completion writes to Postgres).

The GitHub Actions workflow

This is the workflow we run on every production deploy of the marketing site. It runs after the deploy step succeeds, audits 8 critical URLs against Google's published Core Web Vitals thresholds (LCP at most 2.5s, CLS at most 0.1, and INP at most 200ms count as "good"; INP replaced FID as a Core Web Vital in March 2024), and either passes the deploy or fires a Slack alert.


    name: Post-deploy Lighthouse audit

    on:
      workflow_run:
        workflows: ["Deploy production"]
        types: [completed]

    jobs:
      lighthouse:
        if: ${{ github.event.workflow_run.conclusion == 'success' }}
        runs-on: ubuntu-latest
        steps:
          - name: Trigger bulk audit
            id: audit
            env:
              APIFY_TOKEN: ${{ secrets.APIFY_TOKEN }}
              PSI_KEY: ${{ secrets.PSI_API_KEY }}
            run: |
              RUN_ID=$(curl -s -X POST \
                "https://api.apify.com/v2/acts/nexgendata~page-speed-analyzer/runs?token=$APIFY_TOKEN" \
                -H 'Content-Type: application/json' \
                -d '{
                  "urls": [
                    "https://www.example.com/",
                    "https://www.example.com/pricing",
                    "https://www.example.com/product",
                    "https://www.example.com/enterprise",
                    "https://www.example.com/docs",
                    "https://www.example.com/blog",
                    "https://www.example.com/signup",
                    "https://www.example.com/login"
                  ],
                  "strategy": "mobile",
                  "apiKey": "'$PSI_KEY'",
                  "categories": ["performance", "accessibility", "best-practices", "seo"],
                  "metadata": {
                    "trigger": "deploy",
                    "commit_sha": "${{ github.event.workflow_run.head_sha }}",
                    "deploy_run_id": "${{ github.event.workflow_run.id }}"
                  }
                }' | jq -r '.data.id')

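              # Poll for completion (the run-sync endpoints cap at 5 minutes)
              ATTEMPTS=0
              while true; do
                STATUS=$(curl -s \
                  "https://api.apify.com/v2/actor-runs/$RUN_ID?token=$APIFY_TOKEN" \
                  | jq -r '.data.status')
                if [ "$STATUS" = "SUCCEEDED" ]; then break; fi
                # TIMED-OUT is also a terminal status; without it the loop never exits
                if [ "$STATUS" = "FAILED" ] || [ "$STATUS" = "ABORTED" ] || [ "$STATUS" = "TIMED-OUT" ]; then
                  echo "Audit run $RUN_ID ended with status $STATUS"
                  exit 1
                fi
                # Give up after 60 polls (about 10 minutes) instead of hanging the job
                ATTEMPTS=$((ATTEMPTS + 1))
                if [ "$ATTEMPTS" -ge 60 ]; then
                  echo "Audit run $RUN_ID still $STATUS after 10 minutes, giving up"
                  exit 1
                fi
                sleep 10
              done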

              # Pull dataset
              DATASET_ID=$(curl -s \
                "https://api.apify.com/v2/actor-runs/$RUN_ID?token=$APIFY_TOKEN" \
                | jq -r '.data.defaultDatasetId')
              curl -s \
                "https://api.apify.com/v2/datasets/$DATASET_ID/items?token=$APIFY_TOKEN&format=json" \
                > audit.json

              echo "audit_file=audit.json" >> $GITHUB_OUTPUT

          - name: Threshold check
            run: |
              node <<'EOF'
              const fs = require('fs');
              const results = JSON.parse(fs.readFileSync('audit.json', 'utf8'));
              const failures = [];
              for (const r of results) {
                const lcp = r.audits['largest-contentful-paint']?.numericValue;
                const cls = r.audits['cumulative-layout-shift']?.numericValue;
                const tbt = r.audits['total-blocking-time']?.numericValue;
                const perf = r.categories.performance.score * 100;
                if (lcp > 2500) failures.push(`${r.requestedUrl}: LCP ${Math.round(lcp)}ms (target <2500)`);
                if (cls > 0.1) failures.push(`${r.requestedUrl}: CLS ${cls.toFixed(3)} (target <0.1)`);
                if (tbt > 200) failures.push(`${r.requestedUrl}: TBT ${Math.round(tbt)}ms (target <200, INP proxy)`);
                if (perf < 75) failures.push(`${r.requestedUrl}: Performance score ${Math.round(perf)} (target >=75)`);
              }
              if (failures.length) {
                fs.writeFileSync('failures.txt', failures.join('\n'));
                process.exit(1);
              }
              EOF

          - name: Slack alert on regression
            if: failure()
            env:
              SLACK_WEBHOOK: ${{ secrets.SLACK_PERF_WEBHOOK }}
            run: |
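              # failures.txt only exists when the threshold check itself failed
              if [ -f failures.txt ]; then
                MESSAGE=$(cat failures.txt)
              else
                MESSAGE="Audit did not complete; see the workflow logs"
              fi
              # Build the payload with jq so newlines and quotes are JSON-escaped
              jq -n \
                --arg title "Deploy ${{ github.event.workflow_run.head_sha }} regressed Core Web Vitals" \
                --arg msg "$MESSAGE" \
                '{text: $title, attachments: [{color: "danger", text: $msg}]}' \
                | curl -X POST -H 'Content-Type: application/json' -d @- "$SLACK_WEBHOOK"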

A few details worth flagging:

The workflow triggers on workflow_run rather than push or deployment, and the job-level if guard checks that the deploy workflow's conclusion was success. workflow_run fires on every completion, so without the guard a failed deploy would also fire a perf alert.
The poll loop is intentional. Apify supports synchronous run endpoints (/run-sync) but they have a 5-minute timeout that can bite when one of your URLs is slow. The async pattern with a poll handles 30-second runs and 4-minute runs identically.
The threshold check uses Google's published "good" thresholds for LCP and CLS. For the third Core Web Vital (INP), the actor reports TBT as the lab proxy, because INP depends on real user interactions and a Lighthouse page-load audit has none to measure. web.dev recommends TBT as the lab stand-in for INP, and Lighthouse scores a mobile TBT of 200ms or less as good, which is why the check uses 200ms. Treat it as a correlated signal, not an INP measurement.
The metadata.commit_sha field is critical for postmortem analysis. When a regression alert fires, the first thing you want is the diff between the regression deploy and the previous green deploy.
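One gap in the inline threshold check is that a missing metric (for example, an audit that errored) evaluates as undefined > 2500, which is false, so the URL silently passes. Below is a minimal sketch of the same check as a function that reports missing metrics instead. It assumes the record shape the workflow already relies on (requestedUrl, audits, categories); checkRecord and LIMITS are hypothetical names, not part of the actor.

```javascript
// Same limits as the workflow step above.
const LIMITS = { lcpMs: 2500, cls: 0.1, tbtMs: 200, minPerfScore: 75 };

function checkRecord(r, limits = LIMITS) {
  const failures = [];
  const metric = (id) => r.audits?.[id]?.numericValue;

  // Record a failure if the value is missing OR breaches its limit.
  const check = (label, value, isBad, fmt) => {
    if (typeof value !== 'number' || Number.isNaN(value)) {
      failures.push(`${r.requestedUrl}: ${label} missing from audit`);
    } else if (isBad(value)) {
      failures.push(`${r.requestedUrl}: ${label} ${fmt(value)}`);
    }
  };

  check('LCP', metric('largest-contentful-paint'),
    (v) => v > limits.lcpMs, (v) => `${Math.round(v)}ms`);
  check('CLS', metric('cumulative-layout-shift'),
    (v) => v > limits.cls, (v) => v.toFixed(3));
  check('TBT', metric('total-blocking-time'),
    (v) => v > limits.tbtMs, (v) => `${Math.round(v)}ms`);
  check('Performance score', r.categories?.performance?.score,
    (v) => v * 100 < limits.minPerfScore, (v) => `${Math.round(v * 100)}`);
  return failures;
}

// Example record: slow LCP, and the TBT audit is absent.
const sample = {
  requestedUrl: 'https://www.example.com/pricing',
  audits: {
    'largest-contentful-paint': { numericValue: 3400 },
    'cumulative-layout-shift': { numericValue: 0.02 },
  },
  categories: { performance: { score: 0.81 } },
};
const sampleFailures = checkRecord(sample);
console.log(sampleFailures);
```

For the sample record this reports the LCP breach and flags TBT as missing, where the inline version would have reported only LCP.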

Threshold tuning to avoid alert fatigue

Naive threshold alerts are how perf monitoring earns the reputation of being noise. The workflow above uses absolute thresholds (LCP > 2500ms = fail), which is the right starting point but will fire on a single bad audit caused by a transient PSI quirk. Three refinements that get the noise down to roughly 1 false positive per month:

1. Median of 3 runs. Lighthouse run-to-run variance on the same URL is typically ±10-15% on LCP and ±5 points on the performance score. A single audit can fire spuriously. Adjust the actor input to run each URL 3 times and take the median:


    {
      "urls": [...],
      "runsPerUrl": 3,
      "aggregation": "median"
    }

The actor returns one record per URL with the median values across the 3 runs. Cost triples but you go from "fires once a week" to "fires once a quarter."
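If you would rather aggregate on your side (for example, when calling PSI directly), the median-of-N step is small. A sketch follows; medianOf and aggregateRuns are hypothetical helpers, not actor options, and the sample numbers are made up for illustration.

```javascript
function medianOf(values) {
  if (values.length === 0) throw new Error('medianOf needs at least one value');
  const sorted = [...values].sort((a, b) => a - b); // numeric sort, not lexicographic
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Collapse several runs of one URL into a single record of per-metric medians.
function aggregateRuns(runs, metrics = ['lcp', 'cls', 'tbt']) {
  const out = {};
  for (const m of metrics) out[m] = medianOf(runs.map((run) => run[m]));
  return out;
}

const runs = [
  { lcp: 2350, cls: 0.04, tbt: 180 },
  { lcp: 2710, cls: 0.05, tbt: 150 }, // one noisy LCP sample above 2500ms
  { lcp: 2290, cls: 0.04, tbt: 210 },
];
console.log(aggregateRuns(runs)); // { lcp: 2350, cls: 0.04, tbt: 180 }
```

The noisy 2710ms run would have failed a single-audit gate on its own; the median stays under the 2500ms limit.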

2. Compare against last green deploy, not absolute threshold. Google's "good" thresholds are right for the long-term goal but wrong for deploy-gate alerting. If your baseline LCP is 1.8s and you regress to 2.3s, that is a 28% regression that the absolute-threshold check will miss because 2.3s is still "good." Pull the median LCP from the last green deploy and alert on a 15% regression or 300ms absolute regression, whichever is larger:


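    // All values in milliseconds; fetchLastGreenAuditFromPostgres is your own lookup helper.
    const baseline = await fetchLastGreenAuditFromPostgres(url);
    // "Whichever is larger": the allowed increase is max(15% of baseline, 300ms)
    const allowedIncrease = Math.max(baseline.lcp * 0.15, 300);
    const regressed = current.lcp - baseline.lcp > allowedIncrease;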

This is the single biggest win in alert quality. It changes the alert from "this page is slow" (which the team already knows) to "this deploy made this page slower" (which the team needs to act on).
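Read literally, "15% or 300ms, whichever is larger" means the allowed increase is the maximum of the two. Here is a self-contained sketch of that rule with a few worked cases; isLcpRegression is a hypothetical name and the defaults simply mirror the numbers in the text.

```javascript
// All values in milliseconds.
// Allowed increase = max(pct of baseline, absolute floor).
function isLcpRegression(currentMs, baselineMs, { pct = 0.15, absMs = 300 } = {}) {
  const allowedIncrease = Math.max(baselineMs * pct, absMs);
  return currentMs - baselineMs > allowedIncrease;
}

// Baseline 1800ms -> 2300ms: +500ms exceeds max(270, 300), so it alerts
// even though 2300ms is still under Google's 2500ms "good" cutoff.
console.log(isLcpRegression(2300, 1800)); // true

// Baseline 1800ms -> 2080ms: +280ms is over 15% but under the 300ms floor.
console.log(isLcpRegression(2080, 1800)); // false

// Baseline 3000ms -> 3500ms: +500ms exceeds max(450, 300).
console.log(isLcpRegression(3500, 3000)); // true
```

The 300ms floor matters most on fast pages, where 15% of the baseline is smaller than ordinary run-to-run variance.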

3. Don't alert on a single audit taken right after deploy. CDN cache purges and edge propagation take 10-90 seconds depending on your provider, so the first audit may hit a cold or stale edge. Run the audit, wait 60 seconds, run again, and only alert if the second run still fails.

Storing the time series

The Apify dataset is your raw event log. For trend analysis and the "last green deploy" lookup above, you want the data in Postgres or a time-series database. The schema we use:


    CREATE TABLE web_vitals_audits (
      id BIGSERIAL PRIMARY KEY,
      audit_ts TIMESTAMPTZ NOT NULL DEFAULT NOW(),
      trigger TEXT NOT NULL,             -- 'scheduled' | 'deploy'
      commit_sha TEXT,
      url TEXT NOT NULL,
      strategy TEXT NOT NULL,            -- 'mobile' | 'desktop'
      performance_score INT,
      accessibility_score INT,
      best_practices_score INT,
      seo_score INT,
      fcp_ms INT,
      lcp_ms INT,
      cls NUMERIC(6,4),
      tbt_ms INT,
      tti_ms INT,
      speed_index INT,
      apify_run_id TEXT,
      raw_audit JSONB
    );

    CREATE INDEX ON web_vitals_audits (url, audit_ts DESC);
    CREATE INDEX ON web_vitals_audits (commit_sha) WHERE commit_sha IS NOT NULL;
    CREATE INDEX ON web_vitals_audits (trigger, audit_ts DESC);

The raw_audit JSONB column stores the full Lighthouse report, which is critical for diagnostics. When an alert fires you want to be able to pull audits['unused-javascript'], audits['render-blocking-resources'], and audits['largest-contentful-paint-element'] without re-running the audit. A year of hourly audits across 50 URLs comes out to roughly 6GB with the raw JSONB, which fits comfortably on a small managed Postgres instance.

For ingestion, set an Apify webhook on ACTOR.RUN.SUCCEEDED pointed at a small ingestion endpoint (a 50-line Cloud Function or Lambda) that pulls the dataset and inserts. You do not need Kafka for this volume.
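The core of that ingestion function is mapping one dataset record onto the table's columns. A minimal sketch, assuming the record shape used in the workflow (requestedUrl, audits, categories) and that the actor passes the standard Lighthouse audit IDs through unchanged (interactive is Lighthouse's ID for TTI); toRow and its options are hypothetical names.

```javascript
// Map one audit record to a row for web_vitals_audits.
// Missing metrics become null instead of throwing.
function toRow(record, { trigger, commitSha = null, strategy = 'mobile', runId = null }) {
  const ms = (id) => {
    const v = record.audits?.[id]?.numericValue;
    return typeof v === 'number' ? Math.round(v) : null;
  };
  const score = (category) => {
    const s = record.categories?.[category]?.score; // Lighthouse scores are 0..1
    return typeof s === 'number' ? Math.round(s * 100) : null;
  };
  const cls = record.audits?.['cumulative-layout-shift']?.numericValue;

  return {
    trigger,
    commit_sha: commitSha,
    url: record.requestedUrl,
    strategy,
    performance_score: score('performance'),
    accessibility_score: score('accessibility'),
    best_practices_score: score('best-practices'),
    seo_score: score('seo'),
    fcp_ms: ms('first-contentful-paint'),
    lcp_ms: ms('largest-contentful-paint'),
    cls: typeof cls === 'number' ? Number(cls.toFixed(4)) : null, // NUMERIC(6,4)
    tbt_ms: ms('total-blocking-time'),
    tti_ms: ms('interactive'),
    speed_index: ms('speed-index'),
    apify_run_id: runId,
    raw_audit: record, // stored whole in the JSONB column
  };
}

const exampleRow = toRow(
  {
    requestedUrl: 'https://www.example.com/pricing',
    audits: {
      'largest-contentful-paint': { numericValue: 2143.6 },
      'cumulative-layout-shift': { numericValue: 0.031249 },
    },
    categories: { performance: { score: 0.81 } },
  },
  { trigger: 'deploy', commitSha: 'abc123' },
);
console.log(exampleRow.lcp_ms, exampleRow.cls, exampleRow.performance_score, exampleRow.tbt_ms);
// 2144 0.0312 81 null
```

Pass the resulting object to whatever Postgres client you use as a parameterized insert; audit_ts falls back to the column default.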

Grafana dashboard

The dashboard we run has 4 panels:

Per-URL LCP time series — last 7 days, p50 across all hourly audits, with a deploy marker overlay pulled from the commit_sha column. The deploy markers are the most important visual element. Most regressions visually correlate with a deploy marker on this chart.
Performance score heatmap — 50 URLs on the y-axis, 24 hours on the x-axis, colored by performance score. A bad deploy shows up as a vertical red stripe. A bad page shows up as a horizontal red stripe. Different action implied.
Regression count by week — number of alerts fired, week over week. If this is trending up, the team needs a perf budget conversation, not more monitoring.
Per-deploy delta table — for each of the last 20 production deploys, the delta in p50 LCP, CLS, and performance score vs. the previous deploy. Sortable by regression magnitude.

All four panels query the Postgres table directly. No special time-series database required.
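The per-deploy delta panel is the least obvious of the four. In Grafana it would most likely be a SQL query with a window function; the sketch below shows the same logic in JavaScript so the steps are explicit. It assumes rows shaped like the table's commit_sha, audit_ts, and lcp_ms columns, filtered to deploy-triggered audits; perDeployDeltas is a hypothetical helper and the sample rows are made up.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Group audits by deploy (in time order), take the p50 LCP per deploy,
// then diff each deploy against the one before it.
function perDeployDeltas(rows) {
  const sorted = [...rows].sort((a, b) => new Date(a.audit_ts) - new Date(b.audit_ts));
  const byDeploy = new Map(); // Map keeps insertion order, i.e. deploy order
  for (const row of sorted) {
    if (!byDeploy.has(row.commit_sha)) byDeploy.set(row.commit_sha, []);
    byDeploy.get(row.commit_sha).push(row.lcp_ms);
  }
  const deploys = [...byDeploy].map(([commit_sha, values]) => ({
    commit_sha,
    p50_lcp_ms: median(values),
  }));
  return deploys.map((d, i) => ({
    ...d,
    delta_ms: i === 0 ? null : d.p50_lcp_ms - deploys[i - 1].p50_lcp_ms,
  }));
}

const rows = [
  { commit_sha: 'ccc', audit_ts: '2025-01-01T12:00:00Z', lcp_ms: 2400 },
  { commit_sha: 'ccc', audit_ts: '2025-01-01T12:00:00Z', lcp_ms: 2600 },
  { commit_sha: 'ccc', audit_ts: '2025-01-01T12:00:00Z', lcp_ms: 2900 },
  { commit_sha: 'aaa', audit_ts: '2025-01-01T10:00:00Z', lcp_ms: 1800 },
  { commit_sha: 'aaa', audit_ts: '2025-01-01T10:00:00Z', lcp_ms: 1900 },
  { commit_sha: 'aaa', audit_ts: '2025-01-01T10:00:00Z', lcp_ms: 2000 },
  { commit_sha: 'bbb', audit_ts: '2025-01-01T11:00:00Z', lcp_ms: 1850 },
  { commit_sha: 'bbb', audit_ts: '2025-01-01T11:00:00Z', lcp_ms: 1950 },
  { commit_sha: 'bbb', audit_ts: '2025-01-01T11:00:00Z', lcp_ms: 2050 },
];
const deltas = perDeployDeltas(rows);
console.log(deltas.map((d) => `${d.commit_sha}: ${d.delta_ms}`));
// [ 'aaa: null', 'bbb: 50', 'ccc: 650' ]
```

Sorting the result by delta_ms descending gives the "sortable by regression magnitude" view; here deploy ccc stands out with a 650ms jump.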

When to escalate to a real APM tool

The pipeline above does not replace DataDog or New Relic for everything. The right model is layered:

Page Speed Analyzer + scheduled audits: marketing site, blog, docs, public landing pages. Lab data, deploy gates, trend monitoring.
DataDog RUM or Sentry Performance: app surfaces, checkout funnel, authenticated dashboards. Field data, session replay, error correlation.
DataDog Synthetics or Checkly: critical user journeys (login flow, checkout) where you need synthetic transactions, not just page loads.
OpenTelemetry traces: backend latency root-cause when a frontend regression turns out to be a slow API.

Treat the Lighthouse pipeline as your perimeter monitoring and the heavier APM tools as your interior monitoring. Most teams over-instrument the perimeter and under-instrument the interior because synthetic monitoring vendors price them the same way.

Operational notes from running this in production

A few things that are not obvious until you have run this for 6 months:

Mobile audits are noisier than desktop. PSI's mobile emulation throttles CPU 4x and network to slow 4G. Run-to-run variance on mobile LCP is roughly 2x desktop. Tune thresholds accordingly or only alert on desktop.
Geographic variance is real. PSI runs from a Google datacenter. If your audience is not in North America, lab numbers will differ from CrUX field data. Cross-reference both.
CrUX field data should inform your alert thresholds. If your CrUX p75 LCP is 2.4s, an absolute lab threshold of 2.5s will alert constantly. Set lab thresholds against your current field reality, not Google's idealized cutoff.

Get started

The Page Speed Analyzer actor is at apify.com/nexgendata/page-speed-analyzer. Create a free Apify account, supply your Google PSI API key, and you can have the GitHub Actions workflow above running in your repo this afternoon. The Postgres ingestion and Grafana dashboard are a one-day follow-on.

For broader site-reliability tooling on the same pricing model, the NexGenData library has companions worth knowing about:

DNS Propagation Checker for DNS migration validation and multi-region resolution checks
DMARC Bulk Auditor for email infrastructure monitoring across owned domains
Website Content Crawler for sitemap auditing and content drift detection on the same surfaces you are perf-monitoring
Contact Info Scraper for keeping incident-contact directories current across vendors and partners

The full catalog is at apify.com/nexgendata. Pay-per-event pricing across all of them, which means you can stand up a respectable platform-engineering tooling stack for less than the monthly cost of one DataDog seat.

Marketing-page perf monitoring does not need to be a recurring USD 4,000 line item. A scheduled actor, a roughly 100-line GitHub Actions workflow, and a Postgres table cover the use case at 1% of the cost. The savings are not even the most interesting part. The most interesting part is that you stop ignoring perf alerts because you are no longer paying enough money to feel obligated to.
