DEV Community: Mike

Your URL-fetching service is an SSRF engine - here's the two-layer guard we shipped

Mike — Wed, 15 Jul 2026 06:03:39 +0000

If your service takes a URL from a user and fetches it server-side - a monitor, a link previewer, a webhook tester, an "import from URL" button - you've built a server-side request forgery (SSRF) engine whether you meant to or not. NorthDuty is a website health monitor: you give it a URL, we load it in a real Chromium browser and report back. That is exactly the shape of the problem.

The attack

Our server has a network position your users don't. It runs in a VPC. It can reach the cloud metadata endpoint at 169.254.169.254. It can reach internal services on 10.x and 192.168.x that never touch the public internet.

So an attacker signs up and adds a monitor for:

http://169.254.169.254/latest/meta-data/iam/security-credentials/

If we naively fetch that and hand back the response — or even just its status and timing — they've pulled temporary IAM credentials for the role running the fetch. The same trick reaches internal admin panels, databases with HTTP interfaces, and localhost services bound only to loopback.

Why the obvious fix falls short

The first instinct is a blocklist on the hostname string: reject localhost, 127.0.0.1, maybe 169.254.169.254. Two things break it.

It's ranges, not addresses. Loopback isn't one IP, it's 127.0.0.0/8. Private space is three CIDR blocks. Add carrier-grade NAT (100.64.0.0/10), link-local (169.254.0.0/16, which contains the metadata IP), and every IPv6 equivalent. And ::ffff:127.0.0.1 is loopback written as an IPv4-mapped IPv6 address — a substring match on "127.0.0.1" won't catch ::ffff:7f00:1, which is the same address. You have to parse and classify the resolved IP, not pattern-match the text someone typed.

Time-of-check vs time-of-use. Even a correct IP classifier run once is beatable with DNS rebinding. The attacker controls evil.example.com. When your check resolves it, it answers with a public IP — pass. Milliseconds later, when the browser actually connects, the same hostname resolves to 10.0.0.1. Your check and your fetch saw different answers.

How we actually guard it

Two layers, and a request only proceeds if both allow it.

1. Pre-navigation. Before the browser touches the URL, resolve the hostname and classify the resolved IP against every blocked range. The IPv4 CIDRs we reject:

const IPV4_BLOCKED_CIDRS = [
  ['0.0.0.0', 8], ['10.0.0.0', 8], ['100.64.0.0', 10],
  ['127.0.0.0', 8], ['169.254.0.0', 16], ['172.16.0.0', 12],
  ['192.168.0.0', 16], ['198.18.0.0', 15], // + docs / multicast / reserved
];
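For illustration, here is a minimal sketch of how a list like that could be checked. The helper names `ipv4ToInt` and `isBlockedIPv4` are made up for this example, and treating malformed input as blocked is an assumption, not necessarily what our production guard does. The list is repeated (without the elided docs / multicast / reserved ranges) so the snippet runs on its own.

```javascript
// Same ranges as listed above, repeated so this sketch is self-contained.
const IPV4_BLOCKED_CIDRS = [
  ['0.0.0.0', 8], ['10.0.0.0', 8], ['100.64.0.0', 10],
  ['127.0.0.0', 8], ['169.254.0.0', 16], ['172.16.0.0', 12],
  ['192.168.0.0', 16], ['198.18.0.0', 15],
];

// Hypothetical helper: dotted quad -> unsigned 32-bit integer, or null if malformed.
function ipv4ToInt(ip) {
  const parts = ip.split('.');
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet; // plain arithmetic avoids signed 32-bit surprises
  }
  return value;
}

// Hypothetical helper: true when `ip` falls inside any blocked range.
// Assumption: anything that does not parse is treated as blocked (fail closed).
function isBlockedIPv4(ip) {
  const addr = ipv4ToInt(ip);
  if (addr === null) return true;
  return IPV4_BLOCKED_CIDRS.some(([base, prefixBits]) => {
    const start = ipv4ToInt(base);
    const size = 2 ** (32 - prefixBits); // number of addresses in the range
    return addr >= start && addr < start + size;
  });
}
```

With this, `isBlockedIPv4('169.254.169.254')` and `isBlockedIPv4('127.0.0.2')` are both rejected by range, while a public address such as `8.8.8.8` passes.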

IPv6 gets expanded to its eight 16-bit hextets first — including any embedded IPv4 tail — so equivalent notations collapse to one canonical form before we classify them.
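A rough sketch of that expansion step, under some assumptions: `expandIPv6` and `mappedIPv4` are hypothetical names, zone IDs such as `%eth0` are not handled, and anything that fails to parse comes back as `null` so the caller can refuse it.

```javascript
// Hypothetical helper: expand an IPv6 literal to eight 16-bit hextets (as numbers).
// Returns null for input it cannot parse.
function expandIPv6(input) {
  let text = input.toLowerCase();

  // Convert an embedded IPv4 tail (e.g. ::ffff:127.0.0.1) into two hextets.
  const lastColon = text.lastIndexOf(':');
  const tail = text.slice(lastColon + 1);
  if (tail.includes('.')) {
    const parts = tail.split('.');
    if (parts.length !== 4 || parts.some((p) => !/^\d{1,3}$/.test(p) || Number(p) > 255)) {
      return null;
    }
    const octets = parts.map(Number);
    const hi = ((octets[0] << 8) | octets[1]).toString(16);
    const lo = ((octets[2] << 8) | octets[3]).toString(16);
    text = `${text.slice(0, lastColon + 1)}${hi}:${lo}`;
  }

  const halves = text.split('::');
  if (halves.length > 2) return null; // at most one "::" is allowed
  const compressed = halves.length === 2;
  const head = halves[0] ? halves[0].split(':') : [];
  const rest = compressed && halves[1] ? halves[1].split(':') : [];
  const missing = 8 - head.length - rest.length;
  if (compressed ? missing < 1 : missing !== 0) return null;

  const groups = [...head, ...Array(compressed ? missing : 0).fill('0'), ...rest];
  if (groups.some((g) => !/^[0-9a-f]{1,4}$/.test(g))) return null;
  return groups.map((g) => parseInt(g, 16));
}

// Hypothetical helper: for an IPv4-mapped address (::ffff:a.b.c.d), return the
// embedded IPv4 address as a dotted quad; otherwise return null.
function mappedIPv4(hextets) {
  const isMapped = hextets.slice(0, 5).every((h) => h === 0) && hextets[5] === 0xffff;
  if (!isMapped) return null;
  const [hi, lo] = hextets.slice(6);
  return [hi >> 8, hi & 0xff, lo >> 8, lo & 0xff].join('.');
}
```

Both `::ffff:127.0.0.1` and `::ffff:7f00:1` expand to the same eight hextets here, and `mappedIPv4` hands back `127.0.0.1`, which can then go through the same IPv4 range check as any other address.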

2. Per-request. A page isn't one fetch. It pulls in scripts, images, XHRs: dozens of URLs you never typed. So we attach a route handler in Playwright that re-runs the same IP check on every sub-resource hostname, caching per host. This is the DNS-rebinding defense: the check runs again as each request is issued, not once upfront, so a hostname that flips to a private IP between the first resolution and a later request gets caught on use. The per-host cache has to be short-lived for this to hold, because a long-lived cached verdict reopens the same gap.
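As a sketch of what such a handler might look like: `makeSsrfRouteHandler`, the `isBlockedAddress` classifier it takes, the 5-second cache TTL, and the choice to refuse every non-http(s) scheme are all assumptions for this example. The Playwright calls it relies on are `route.request().url()`, `route.abort()`, and `route.continue()`. The lookup here is Node's own resolver, injected so it can be swapped out in tests.

```javascript
const dns = require('node:dns');

// Hypothetical factory: returns a handler suitable for Playwright's page.route().
// `isBlockedAddress` is an assumed classifier (IP string -> boolean).
function makeSsrfRouteHandler(isBlockedAddress, lookup = dns.promises.lookup) {
  const verdictByHost = new Map(); // host -> { blocked, expiresAt }
  const CACHE_TTL_MS = 5000;       // assumed value, kept short so rebinding gets re-checked

  async function hostIsBlocked(host) {
    const cached = verdictByHost.get(host);
    if (cached && cached.expiresAt > Date.now()) return cached.blocked;
    let blocked = true; // fail closed if resolution throws
    try {
      const answers = await lookup(host, { all: true });
      // Block if there is no answer, or if ANY answer lands in a blocked range.
      blocked = answers.length === 0 || answers.some((a) => isBlockedAddress(a.address));
    } catch {
      // keep blocked = true
    }
    verdictByHost.set(host, { blocked, expiresAt: Date.now() + CACHE_TTL_MS });
    return blocked;
  }

  return async function handle(route) {
    let url;
    try {
      url = new URL(route.request().url());
    } catch {
      return route.abort('blockedbyclient');
    }
    // Assumption: anything that is not http(s) is refused outright.
    if (url.protocol !== 'http:' && url.protocol !== 'https:') {
      return route.abort('blockedbyclient');
    }
    // URL.hostname keeps the brackets around IPv6 literals; strip them before lookup.
    const host = url.hostname.replace(/^\[|\]$/g, '');
    if (await hostIsBlocked(host)) return route.abort('blockedbyclient');
    return route.continue();
  };
}
```

It would be attached with something like `await page.route('**/*', makeSsrfRouteHandler(isBlockedAddress))` before navigating, so the first document request and every sub-resource go through the same check.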

There's exactly one escape hatch: ALLOW_PRIVATE_URLS=true, which disables both layers for local development and nowhere else.

The part nobody warns you about

This guard is security-critical and it lives in two separate services — the screenshot worker and the user-flow worker, both driving Playwright. Copy-pasting a security control into two codebases is how you end up with two codebases that slowly disagree. One gets a fix for a new IPv6 edge case; the other doesn't; six months later one service is exploitable, the other isn't, and nobody remembers why.

Our answer is blunt: the shared core file has to stay byte-for-byte identical between the two repos. A CI job fetches the sibling repo's copy and fails the build on a single byte of difference. Not "logically equivalent" — identical. Fix one and CI stays red until you've copied it verbatim into the other. It's crude, but a security check that silently diverges across services is worse than one you deliberately maintain in two places.
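The comparison itself can be tiny. This is a sketch rather than our actual CI job: `firstDifference` is a hypothetical helper, and it assumes the sibling repo's copy has already been fetched to a local path.

```javascript
const fs = require('node:fs');

// Hypothetical CI helper: returns null when the two files are byte-identical,
// otherwise the offset of the first byte where they differ.
function firstDifference(pathA, pathB) {
  const a = fs.readFileSync(pathA);
  const b = fs.readFileSync(pathB);
  if (a.equals(b)) return null;
  const limit = Math.min(a.length, b.length);
  for (let i = 0; i < limit; i++) {
    if (a[i] !== b[i]) return i;
  }
  return limit; // one file is a strict prefix of the other
}
```

A CI step would call it on the two copies and exit non-zero whenever the result is not `null`. Reporting the byte offset is a small convenience: it tells whoever broke the build where to start looking.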

Takeaway

If you fetch user-supplied URLs server-side, assume you're running an SSRF proxy until you've proven otherwise. Classify resolved IPs against ranges, not a handful of literal addresses. Check at connect time, not just upfront, or DNS rebinding walks straight through. And if the guard has to exist in more than one service, make divergence a build failure — not a code-review hope.

This is roughly why the SSRF guard was one of the first things we built into NorthDuty instead of one of the last.

Stop Trusting Screenshots: Why Visual Regression Monitoring Cries Wolf (and How to Fix It)

Mike — Sun, 05 Jul 2026 12:22:52 +0000

Last month our visual-diff monitor flagged 47 changes on a client's homepage in one run. Forty-six of them were a rotating testimonial carousel that happened to land on a different slide each time the page was captured. One was real.

If you've built or used any screenshot-based monitoring, you already know this problem. Two screenshots of the exact same, unchanged page rarely match pixel-for-pixel. Carousels rotate. Cookie banners fade in on a timer. Lazy-loaded images pop in a beat late. Ads shift half a pixel. Fonts render with slightly different anti-aliasing depending on what else the browser was doing. Diff two raw captures and you get a wall of "changes," and within a week nobody on the team opens the alert anymore.

Why the obvious fixes don't work

The first instinct is usually to loosen the pixel-diff threshold. That just trades false positives for false negatives - now a genuinely moved button or a broken layout has to clear the same bar as carousel noise, so you miss the thing you built the tool to catch in the first place.

The second instinct is manual exclusion zones: tell the tool to ignore the carousel <div>, the ad slot, the cookie banner. This works until the page changes - a redesign moves the carousel, a new banner ships with a different selector, and you're back to noisy alerts plus a pile of dead config nobody remembers writing.

The third "fix" is tolerating the noise, which is what most teams actually do in practice, and it's a big part of why visual regression tooling has a reputation for being more trouble than it's worth.

Make the page prove it's stable before you trust anything about it

The fix that actually moved the needle for us wasn't a smarter diff algorithm. It was refusing to treat a single screenshot as ground truth at all.

Before any comparison happens, the page goes through a stabilization pass: known cookie/consent overlays get removed (we track a couple hundred variants at this point — cookie banner vendors are not standardized), carousels and video get paused, lazy-loaded images get force-loaded, and the page gets scrolled in passes to trigger anything that only renders on scroll.
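To make the in-page part of that pass concrete, here is a simplified sketch. The selectors are placeholders, `stabilizeDom` is a hypothetical name, and it only covers overlay removal, pausing media elements, and switching lazy images to eager loading. Carousel pausing and scroll passes depend on the page and are left out.

```javascript
// Placeholder selectors for illustration; real consent overlays vary widely.
const OVERLAY_SELECTORS = ['#cookie-banner', '.consent-overlay'];

// Hypothetical helper. Meant to run inside the page, where `document` exists,
// but it accepts any object exposing querySelectorAll so it can be tested.
function stabilizeDom(overlaySelectors, doc = document) {
  const summary = { overlaysRemoved: 0, mediaPaused: 0, imagesForced: 0 };

  // Remove known cookie/consent overlays.
  for (const selector of overlaySelectors) {
    for (const el of doc.querySelectorAll(selector)) {
      el.remove();
      summary.overlaysRemoved++;
    }
  }

  // Pause anything that plays on its own.
  for (const media of doc.querySelectorAll('video, audio')) {
    media.pause();
    summary.mediaPaused++;
  }

  // Ask the browser to load lazy images now instead of on scroll.
  for (const img of doc.querySelectorAll('img[loading="lazy"]')) {
    img.loading = 'eager';
    summary.imagesForced++;
  }

  return summary;
}
```

In Playwright this could be run with `await page.evaluate(stabilizeDom, OVERLAY_SELECTORS)`, which serializes the function into the page and passes the selector list as its single argument.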

That alone helps, but it doesn't prove the page is actually settled. So after stabilization, we capture the page twice in a row and diff those two captures against each other, using the same production differ we use for real comparisons. If more than a small threshold of pixels changed between two captures that are supposed to be identical, the page isn't stable yet - something is still animating, loading, or rotating.

Roughly, the logic looks like this:

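const STABILITY_ATTEMPTS = 2;            // stabilization attempts before the job fails
const STABILITY_THRESHOLD_PERCENT = 0.1; // max % of pixels allowed to differ between the two captures

class UnstablePageError extends Error {}

// stabilizePage, screenshot, and diffPercent are pipeline helpers defined elsewhere.
async function captureStable(page) {
  for (let attempt = 0; attempt < STABILITY_ATTEMPTS; attempt++) {
    await stabilizePage(page); // remove overlays, pause media, force-load lazy content
    const shotA = await screenshot(page);
    const shotB = await screenshot(page);
    const changedPct = diffPercent(shotA, shotB);

    if (changedPct <= STABILITY_THRESHOLD_PERCENT) {
      return shotB; // page proved it can hold still, so it is safe to compare against baseline
    }
    // still moving: loop around and stabilize again before giving up
  }
  throw new UnstablePageError(`page did not settle after ${STABILITY_ATTEMPTS} attempts`);
}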

Our threshold is 0.1% changed pixels, and we allow two stabilization attempts before failing the job outright rather than uploading a screenshot we don't trust. A failed stability check is a signal in itself — it usually means the page has something genuinely hard to capture (an ad network with aggressive refresh, a video background, an A/B test swapping content client-side) and it's better to surface that than to silently pass along a noisy baseline.

Only a screenshot that survives its own self-check gets compared against yesterday's baseline.
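For a sense of what the percentage means, here is a deliberately naive stand-in for the differ. It is not the production one: it assumes both captures have already been decoded to same-sized RGBA buffers, and the `tolerance` parameter is an invented per-channel slack.

```javascript
// Hypothetical diffPercent: percentage of pixels that differ between two
// same-sized RGBA buffers (4 bytes per pixel).
function diffPercent(rgbaA, rgbaB, tolerance = 0) {
  if (rgbaA.length !== rgbaB.length || rgbaA.length % 4 !== 0) {
    throw new Error('buffers must be RGBA and the same size');
  }
  const totalPixels = rgbaA.length / 4;
  if (totalPixels === 0) return 0;

  let changed = 0;
  for (let i = 0; i < rgbaA.length; i += 4) {
    // A pixel counts as changed if any channel moves by more than `tolerance`.
    for (let c = 0; c < 4; c++) {
      if (Math.abs(rgbaA[i + c] - rgbaB[i + c]) > tolerance) {
        changed++;
        break;
      }
    }
  }
  return (changed * 100) / totalPixels;
}
```

At a 0.1% threshold, a 1000-pixel image may differ in one pixel and still count as stable. On a full-page capture of a few million pixels that is a few thousand pixels, which is enough room for anti-aliasing jitter but not for a carousel slide.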

The alignment problem nobody mentions

Even once you trust both images, naive pixel diffing has a second failure mode: it assumes the two screenshots are pixel-aligned. In practice, content shifts vertically all the time for legitimate reasons - someone adds an announcement banner, a cookie notice that failed to get removed pushes everything down 40px, or a section above the fold gets taller. Diff that directly and you get a false positive across the entire page below the shift, even though nothing actually changed except its position.

Our differ handles this by hashing horizontal strips of the current image (a cheap perceptual hash, not a cryptographic one) and searching for the best-matching row in the baseline using a mix of exact hash hits, Hamming-distance neighbors, and a pixel-difference-validated seed search. It always resolves to some baseline row, so the real pixel comparison never runs against an arbitrarily clamped position. After the raw pixel diff, small isolated diff regions (under about 8 pixels) get dropped as noise, and the surviving regions get dilated slightly and rendered as a yellow-outlined overlay on a desaturated background — so a human reviewing the alert can immediately see what changed without hunting for it in a full-color before/after.
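The strip-hash idea can be shown in a much-reduced form. This sketch only covers a 32-bit average hash per strip and a nearest-by-Hamming-distance lookup. The function names, the 32-column bucketing, and the tie-break toward the expected position are assumptions for illustration, and the pixel-validated seed search is not reproduced.

```javascript
// Hypothetical 32-bit average hash of one horizontal strip of a grayscale image.
// `gray` holds one byte per pixel, row-major. The strip is split into 32 columns;
// each bit records whether that column is brighter than the strip's mean.
function stripHash(gray, width, top, stripHeight) {
  const BITS = 32;
  const sums = new Array(BITS).fill(0);
  const counts = new Array(BITS).fill(0);
  for (let y = top; y < top + stripHeight; y++) {
    for (let x = 0; x < width; x++) {
      const bucket = Math.floor((x * BITS) / width);
      sums[bucket] += gray[y * width + x];
      counts[bucket]++;
    }
  }
  const means = sums.map((sum, i) => (counts[i] ? sum / counts[i] : 0));
  const overall = means.reduce((a, b) => a + b, 0) / BITS;
  let hash = 0;
  for (let i = 0; i < BITS; i++) {
    if (means[i] > overall) hash = (hash | (1 << i)) >>> 0; // keep it unsigned
  }
  return hash;
}

// Number of differing bits between two 32-bit hashes.
function hammingDistance(a, b) {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x !== 0) {
    x &= x - 1; // clears the lowest set bit
    count++;
  }
  return count;
}

// Find the baseline strip whose hash is closest to `currentHash`. For a non-empty
// baseline it always returns some index; ties go to the one nearest `expectedIndex`.
function bestMatchingStrip(currentHash, baselineHashes, expectedIndex) {
  let best = -1;
  let bestDistance = Infinity;
  baselineHashes.forEach((hash, index) => {
    const distance = hammingDistance(currentHash, hash);
    const nearerTie = distance === bestDistance &&
      Math.abs(index - expectedIndex) < Math.abs(best - expectedIndex);
    if (distance < bestDistance || nearerTie) {
      best = index;
      bestDistance = distance;
    }
  });
  return { index: best, distance: bestDistance };
}
```

The point of the tie-break is that pages contain many near-identical strips (blank margins, repeated cards), so when hashes cannot tell them apart, the least surprising answer is the one that implies the smallest shift.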

None of these tuning constants — the stability threshold, the retry count, the noise-component cutoff — are arbitrary. They came from running the pipeline against real, noisy production pages and adjusting until the false-positive rate actually dropped instead of just moving around.

Takeaway

Most of the hard problems in visual regression monitoring aren't about detecting pixel differences — pixelmatch and friends solve that part in a few lines. The hard problem is deciding which differences are worth waking someone up for, and that requires the tool to be skeptical of its own inputs first. Verify that a page can hold still before you trust any diff computed from it. That one change did more for our false-positive rate than any threshold or alignment tuning we tried on top of it.

This is the stability-check pipeline behind NorthDuty's visual-diff monitoring, if you want to see it end to end rather than reimplement it.

What Building Website Monitoring Taught Me About Silent Failures

Mike — Sat, 13 Jun 2026 20:58:18 +0000

When I started building NorthDuty, I assumed website monitoring would be about the obvious failures.

The site is down.

The SSL certificate expired.

DNS is broken.

The server returns 500.

That kind of thing.

And yes, those problems are important. If a site is completely down, you want to know as soon as possible.

But the more I worked on monitoring, the more I came to realize that the most irritating failures are not the loud ones. They are the silent ones.

The website answers. The status code is acceptable. The home page loads, at least technically. Nothing looks broken to a simple HTTP check.

But for the user, something important is gone.

`200 OK` can lie to you

One of the first lessons was that 200 OK does not mean “the website works.”

It only means the server returned a response.

That is simple to say, but easy to lose sight of when you're building monitoring. A lot of uptime checks stop right there: they make a request, check the status code, maybe look at response time, and call the site healthy.

The problem is that users do not experience status codes.

They experience pages.

A frontend app can return 200 OK and the screen can stay blank because the JavaScript crashed. A dashboard can load the shell but never fetch the data. A pricing page can render, but the CTA button can be invisible because of a CSS change. A checkout flow can fail after the first step, while the homepage looks 100% fine.

From the monitor’s point of view, everything passed.

From the user’s point of view, the product is broken.

That gap is where a lot of silent failures live.

The page can be alive and broken at the same time

This was probably the biggest mental shift for me.

Before I added deeper checks, I thought of websites in black and white: up or down.

Now I think there is a large middle area.

A site can be “up” but unusable.

A page can load but be empty.

A form can appear but not submit.

A login page can work while the logged-in app is broken.

A marketing site can be perfect while the actual product is failing.

That middle area is dangerous because it does not always trigger alarms.

Monitor just the homepage and you might overlook the dashboard. Monitor just the API and you might overlook the frontend. Monitor just response codes and you might overlook rendering issues. Look only at logs and you might overlook what the user actually saw.

This is why I started caring more about screenshots, visual diffs, and user flows.

Not because they are fancy features, but because they answer a better question:

What did the user actually get?

Screenshots are surprisingly honest

A screenshot is not perfect, but it is hard to argue with.

If the page is blank, the screenshot shows it.

If the layout exploded, the screenshot shows it.

If a cookie banner covers half the screen, the screenshot shows it.

If the main content never appears, the screenshot shows it.

Adding screenshot-based monitoring changed my perspective on failures. Logs are good, but logs say what the system thought happened. Screenshots say what the user likely saw.

That difference matters.

Of course, visual monitoring has its own problems. Pages are messy. Dates change. Animations move. Fonts load at slightly different times. Ads and third-party widgets do whatever they want. If you alert on every pixel difference, you will create noise fast.

Detecting that something changed is easy. The hard part is knowing which changes matter.

A visual diff that reports “this page changed by 0.3%” is not inherently useful. A visual diff that highlights “the hero section has disappeared”, “the layout has shifted”, or “the app has rendered an error screen” is quite useful.

That kind of failure is easy to miss with traditional uptime monitoring.

User flows matter more than pages

Another thing I learned: it's good to monitor a page, but it's better to monitor a flow.

Most visitors don't come just to look at a product. They want to do something.

They want to log in.

Create a project.

Submit a form.

Check a report.

Finish checkout.

Invite a teammate.

If one step in that path breaks, the product is broken for that user.

This is particularly true for SaaS products. The homepage can look great while the app itself is unusable. The API can be up while authentication is broken. The dashboard can load while the main action button does nothing.

A single URL check will not catch that.

That's why I think user-flow monitoring is closer to actual reliability. It doesn't ask, "Does this URL respond?" It asks, "Can someone still complete the thing they came here to do?"

That is a much better question.

Silent failures are expensive because users rarely report them

This part is easy to underestimate.

When something breaks, we like to imagine users will tell us.

Sometimes they do. Most of the time, they do not.

They refresh.

They try again.

They assume the product is unreliable.

They leave.

If they're still shopping around, you may never know you lost them. If they're a customer, they may quietly lose faith. If they were ready to buy, they may decide not to.

That's what makes silent failures so aggravating. They rarely cause a big fuss. They just quietly erode confidence.

And because they are quiet, they can stay hidden longer than they should.

What I monitor differently now

Building website monitoring made me less satisfied with basic uptime checks.

I still believe they're required. Status codes, SSL certs, DNS, response time and obvious server failures should be monitored.

But I would not stop there.

Now I care about things like:

Did the page actually render?
Is the screen blank?
Did the frontend throw an error?
Did the visual layout change in an unexpected way?
Are important elements still visible?
Can the main user flow still complete?
Is the mobile version still usable?
Are third-party scripts slowing down or blocking the page?
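
A few of those questions translate directly into code. The sketch below assumes a browser-based check has already recorded a snapshot of one page load. The `findSilentFailures` name and every field on `snapshot` are invented for this example and are not NorthDuty's data model.

```javascript
// Hypothetical: turn one recorded page load into a list of user-visible failures.
// Assumed snapshot shape:
//   status           HTTP status of the main document
//   visibleText      text the user could actually see
//   pageErrors       uncaught frontend errors collected during load
//   requiredElements map of selector -> whether it was visible
function findSilentFailures(snapshot) {
  const failures = [];

  if (snapshot.status < 200 || snapshot.status >= 400) {
    failures.push(`bad status ${snapshot.status}`);
  }
  if (snapshot.visibleText.trim().length === 0) {
    failures.push('page rendered blank');
  }
  if (snapshot.pageErrors.length > 0) {
    failures.push(`${snapshot.pageErrors.length} uncaught frontend error(s)`);
  }
  for (const [selector, visible] of Object.entries(snapshot.requiredElements)) {
    if (!visible) failures.push(`required element not visible: ${selector}`);
  }

  return failures;
}
```

A page that returns 200 with an empty body, one JavaScript error, and an invisible CTA yields three failures here, where a status-code check would have reported none. With Playwright, the inputs could plausibly come from the `pageerror` event, `page.innerText('body')`, and `locator.isVisible()`.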

These checks are closer to what users actually experience.

Because at the end of the day, nobody cares that your server technically responded if the page they needed is broken.

The main lesson

The most valuable lesson NorthDuty has taught me is that "up" is not the same as "working".

A website can be online and still fail its users.

That's easy to say, but it changes the way you approach monitoring. You stop focusing only on the infrastructure and start focusing on the experience. You stop asking whether the server responded and start asking whether the product still works from the outside.

Silent failures are hard because they do not announce themselves clearly.

You have to go looking for them.

And once you start looking at sites that way, basic uptime monitoring starts to look like only the first layer.
