<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Benjamin</title>
    <description>The latest articles on DEV Community by Benjamin (@howwow-2000).</description>
    <link>https://dev.to/howwow-2000</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3838343%2Ff9819a71-22fe-4aa6-8616-200b94d25d0c.png</url>
      <title>DEV Community: Benjamin</title>
      <link>https://dev.to/howwow-2000</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/howwow-2000"/>
    <language>en</language>
    <item>
      <title>I Tested My Security Scanner on 500 Sites and Found It Was Lying About 158 of Them</title>
      <dc:creator>Benjamin</dc:creator>
      <pubDate>Tue, 24 Mar 2026 18:22:47 +0000</pubDate>
      <link>https://dev.to/howwow-2000/i-built-77-tests-for-my-security-scanner-and-found-a-production-bug-in-10-minutes-j80</link>
      <guid>https://dev.to/howwow-2000/i-built-77-tests-for-my-security-scanner-and-found-a-production-bug-in-10-minutes-j80</guid>
      <description>&lt;p&gt;Two days ago I published how I rebuilt my scoring from scratch. I recalibrated 20+ finding severities against CVSS and Bugcrowd, built SPA detection, and aligned with industry standards. Users confirmed the fixes worked.&lt;/p&gt;

&lt;p&gt;Then I decided to actually test whether my scanner tells the truth.&lt;/p&gt;

&lt;p&gt;Not "scan a few sites and eyeball the results." Real testing. A/B simulations on every scan in my database. Ground truth verification with actual HTTP requests. Gaming attacks against my own scoring.&lt;/p&gt;

&lt;p&gt;I tested 500+ sites in a single session. Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 1: I sent real PUT/DELETE/PATCH requests to 158 sites
&lt;/h2&gt;

&lt;p&gt;My scanner flagged 158 sites for "Dangerous HTTP methods enabled: PUT, DELETE, PATCH." That's a real security finding. If your server accepts DELETE requests without authentication, someone can delete your data.&lt;/p&gt;

&lt;p&gt;Except I never verified whether those methods were actually enabled. The scanner sent the request, saw a response status below 400, and concluded "method allowed."&lt;/p&gt;

&lt;p&gt;So I sent real PUT, DELETE, and PATCH requests to all 158 sites and recorded what came back.&lt;/p&gt;
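&lt;p&gt;The per-response classification looked roughly like this (a simplified sketch; &lt;code&gt;classify&lt;/code&gt; and &lt;code&gt;shellHtml&lt;/code&gt; are illustrative names, not my production code):&lt;/p&gt;

```javascript
// Simplified sketch of the per-site classification. classify() and
// shellHtml are illustrative names; the real run recorded raw responses.
function classify(status, contentType, body, shellHtml) {
  if (status === 405 || status === 404 || status === 400) {
    return 'blocked';      // the method was already rejected
  }
  if (status >= 400) {
    return 'error';        // other 4xx/5xx: still not "enabled"
  }
  if (status >= 300) {
    return 'redirect';     // 301/302/308 catch-all, not acceptance
  }
  if (contentType.includes('text/html')) {
    return 'spa-shell';    // the SPA answered with its HTML homepage
  }
  if (body === shellHtml) {
    return 'spa-shell';    // byte-identical to the catch-all shell
  }
  return 'enabled';        // a real endpoint accepted the method
}
```

&lt;p&gt;Only responses landing in the last branch would count as true positives. Across 158 sites, none did.&lt;/p&gt;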

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What happened&lt;/th&gt;
&lt;th&gt;Sites&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;False positive: server returned HTML (SPA catch-all)&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;60%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive: server redirected (301/302/308)&lt;/td&gt;
&lt;td&gt;56&lt;/td&gt;
&lt;td&gt;35%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Already blocked (405/400/404) but scanner flagged anyway&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actually enabled (real API accepting the method)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Zero. Not one site had PUT, DELETE, or PATCH actually enabled on its homepage.&lt;/p&gt;

&lt;p&gt;The root cause: Single Page Applications return &lt;code&gt;200 OK&lt;/code&gt; with their HTML shell for any HTTP method, not just GET. A Next.js app, a React SPA on Vercel, a Nuxt site on Netlify — they all respond to &lt;code&gt;PUT /&lt;/code&gt; with their homepage. My scanner saw &lt;code&gt;200 OK&lt;/code&gt; and concluded the method was "allowed."&lt;/p&gt;

&lt;p&gt;Same problem with redirects. A server that answers every request with a 301 redirect isn't "accepting PUT." It's redirecting everything. But 301 is less than 400, so my check passed it.&lt;/p&gt;

&lt;p&gt;One user, Antoine, told me on LinkedIn: "Next.js App Router only exposes what you explicitly export. Since your route files only have GET and POST, those other methods automatically return 405. That was the only false positive. Everything else was real and we fixed it."&lt;/p&gt;

&lt;p&gt;That last sentence matters. The other findings were real. The HTTP methods check was the one that lied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; I removed the HTTP methods check entirely. Not fixed. Removed. A check with 0% true positives has no business being in a security report. I'll reintroduce it when I have a reliable way to test methods on actual API endpoints, not homepages.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 2: My scanner was blind to 82% of external scripts
&lt;/h2&gt;

&lt;p&gt;Mozilla Observatory checks whether your external scripts have Subresource Integrity (SRI) hashes. If a CDN gets compromised, SRI prevents the tampered script from running.&lt;/p&gt;

&lt;p&gt;Observatory flags SRI issues on about 93% of sites. My scanner: 18%.&lt;/p&gt;

&lt;p&gt;I assumed this was a detection quality gap. I was wrong. It was a plumbing bug.&lt;/p&gt;

&lt;p&gt;My SRI check runs on the initial HTML returned by the server. But modern frameworks (React, Next.js, Vue, Nuxt) don't put script tags in the HTML. They inject them dynamically after the page loads. My scanner was checking a page with zero script tags and concluding "no SRI issues."&lt;/p&gt;

&lt;p&gt;The irony: I already collect the rendered DOM. My headless browser (Cloudflare Puppeteer) renders the page, executes JavaScript, and returns the final HTML. I use it for tech stack detection. I just never connected it to the SRI check.&lt;/p&gt;

&lt;p&gt;The fix: one line. &lt;code&gt;checkSRI(renderedHtml || htmlBody)&lt;/code&gt; instead of &lt;code&gt;checkSRI(htmlBody)&lt;/code&gt;. The rendered HTML sees the scripts that frameworks inject. The initial HTML doesn't.&lt;/p&gt;
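&lt;p&gt;Sketched out, the shape of that fix looks like this (script extraction is abstracted away; here the inputs are pre-extracted script descriptors rather than raw HTML, and the names are illustrative):&lt;/p&gt;

```javascript
// Shape of the one-line fix: prefer the rendered DOM, because frameworks
// inject their script tags after load. Illustrative sketch, not the
// production code.
function findMissingSRI(scripts) {
  return scripts
    .filter(s => Boolean(s.src))     // scripts loaded by src (not inline)
    .filter(s => !s.integrity)       // no integrity hash present
    .map(s => ({ findingId: 'sri-missing', src: s.src }));
}

function checkSRI(renderedScripts, initialScripts) {
  // renderedScripts comes from the headless browser; initialScripts from
  // the raw HTML body. Fall back only when rendering produced nothing.
  const scripts = renderedScripts.length > 0 ? renderedScripts : initialScripts;
  return findMissingSRI(scripts);
}
```

&lt;p&gt;Giving the finding an explicit ID also matters: a finding without one can't be mapped in a benchmark.&lt;/p&gt;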

&lt;p&gt;I also found a secondary bug: the SRI finding didn't have a findingId. My benchmark tool mapped Observatory's SRI test to &lt;code&gt;sri-missing&lt;/code&gt;, but no finding ever had that ID. The 5% agreement rate I reported in my benchmark wasn't a detection gap. It was a broken mapping.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 3: I scanned Amazon, Netflix, and Reddit
&lt;/h2&gt;

&lt;p&gt;I had built a prototype for contextual scoring: adjust finding severity based on how much JavaScript a site loads. A static portfolio with zero scripts shouldn't be penalized the same way as a site with 12 third-party trackers.&lt;/p&gt;

&lt;p&gt;The prototype worked on my existing 362 user scans. 57% saw their score improve. Zero degraded. Average improvement: +0.78 points. 100 grade changes, all upward.&lt;/p&gt;

&lt;p&gt;Then my CTO asked: "How many of those 362 sites classify as 'complex'?"&lt;/p&gt;

&lt;p&gt;Zero. Every single user in my database has a simple site. My contextual scoring had only been tested in one direction: making simple sites look better. I had no data on whether it correctly maintained severity for complex sites.&lt;/p&gt;

&lt;p&gt;So I scanned 140 of the biggest sites on the internet: Amazon, Netflix, Reddit, GitHub, Stripe, Shopify, Notion, Figma, Airbnb, Booking.com, and 130 more.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Surface Level&lt;/th&gt;
&lt;th&gt;Sites&lt;/th&gt;
&lt;th&gt;%&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MINIMAL (0 external scripts detected)&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;24%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LOW (1-5 scripts)&lt;/td&gt;
&lt;td&gt;106&lt;/td&gt;
&lt;td&gt;76%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MEDIUM (6+ scripts)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HIGH (eval/dangerous patterns)&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Amazon, Netflix, and Reddit all classified as "simple." Zero external scripts detected.&lt;/p&gt;

&lt;p&gt;This is the same renderedHtml bug. My script count comes from the SRI check, which only sees the initial HTML. Amazon loads 50+ scripts dynamically. My scanner sees one or two.&lt;/p&gt;

&lt;p&gt;What this meant for contextual scoring: If I deployed it as designed, Amazon and a personal blog would get the same "low surface" bonus. A missing CSP on Amazon would be downgraded to informational, same as a missing CSP on a one-page HTML portfolio. That's not contextual scoring. That's a bug that happens to look like a feature.&lt;/p&gt;

&lt;p&gt;My CTO put it clearly: "The dev indie who sees his score go up without doing anything gets a false sense of security. That's worse than a false positive. A false positive wastes time. A false sense of security wastes vigilance."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision:&lt;/strong&gt; Contextual scoring stays on hold. Not because it's a bad idea, but because the current design rewards inaction on simple sites instead of rewarding action. And the data pipeline can't distinguish simple sites from complex ones yet. Both problems need to be solved before deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Test 4: I tried to game my own score
&lt;/h2&gt;

&lt;p&gt;An IEEE paper, Lepochat et al. (2025) "One Does Not Simply Score a Website", found that security scoring algorithms can be trivially gamed. I cited this paper in my previous article as a methodological reference. My CTO pointed out I was using it as decoration, not as a constraint.&lt;/p&gt;

&lt;p&gt;So I tested it. I simulated 12 gaming scenarios: adding security headers with empty, invalid, or actively harmful values. The question: does the score go up without security going up?&lt;/p&gt;

&lt;p&gt;Results: 9 out of 12 gaming attempts succeeded.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gaming attempt&lt;/th&gt;
&lt;th&gt;Score change&lt;/th&gt;
&lt;th&gt;Actual security&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;All headers empty/invalid at once&lt;/td&gt;
&lt;td&gt;+2.3 (7.7 to 10.0)&lt;/td&gt;
&lt;td&gt;Zero. Some actively worse.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strict-Transport-Security: max-age=0&lt;/td&gt;
&lt;td&gt;+0.8&lt;/td&gt;
&lt;td&gt;Worse. Tells browsers to remove HSTS protection.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Referrer-Policy: unsafe-url&lt;/td&gt;
&lt;td&gt;+0.3&lt;/td&gt;
&lt;td&gt;Worse. Leaks full URLs to third parties.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content-Security-Policy: (empty)&lt;/td&gt;
&lt;td&gt;+0.3&lt;/td&gt;
&lt;td&gt;Zero. Browser ignores empty CSP.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Content-Security-Policy: default-src *&lt;/td&gt;
&lt;td&gt;+0.3&lt;/td&gt;
&lt;td&gt;Zero. Allows everything.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Frame-Options: ALLOWALL&lt;/td&gt;
&lt;td&gt;+0.3&lt;/td&gt;
&lt;td&gt;Zero. Not a valid value. Browser ignores it.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;X-Content-Type-Options: yes-please-sniff&lt;/td&gt;
&lt;td&gt;+0.3&lt;/td&gt;
&lt;td&gt;Zero. Only nosniff is valid.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The worst case: add every header with a garbage value and your score goes from 7.7 to a perfect 10. No security improvement. Some headers actively degrade your security.&lt;/p&gt;

&lt;p&gt;HSTS: max-age=0 is the most damaging. It tells browsers to stop enforcing HTTPS. My scanner sees "HSTS header present" and removes the finding. The score goes up. The protection goes down.&lt;/p&gt;

&lt;p&gt;The root cause: My scanner checks whether headers exist. It doesn't validate their values. &lt;code&gt;Content-Security-Policy:&lt;/code&gt; (empty string) passes the "has CSP" check. &lt;code&gt;X-Frame-Options: ALLOWALL&lt;/code&gt; passes the "has X-Frame-Options" check. The check is &lt;code&gt;header in response&lt;/code&gt;, not &lt;code&gt;header is correctly configured&lt;/code&gt;.&lt;/p&gt;
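&lt;p&gt;Here's what value validation looks like for one header, as a hedged sketch (&lt;code&gt;validateHSTS&lt;/code&gt; is a hypothetical helper, and the six-month minimum is my assumption, in line with common hardening guidance):&lt;/p&gt;

```javascript
// Hypothetical helper: validate the HSTS value instead of its mere
// presence. The six-month threshold is an assumption; only the shape
// of the check matters here.
function validateHSTS(value) {
  if (!value) {
    return { present: false, effective: false };
  }
  const match = value.match(/max-age=(\d+)/i);
  if (!match) {
    return { present: true, effective: false };  // malformed or empty policy
  }
  const maxAge = parseInt(match[1], 10);
  // max-age=0 tells browsers to DELETE their HSTS entry: worse than absent
  return { present: true, effective: maxAge >= 15552000 };
}
```

&lt;p&gt;Scoring &lt;code&gt;effective&lt;/code&gt; instead of &lt;code&gt;present&lt;/code&gt; kills both the empty-header and the &lt;code&gt;max-age=0&lt;/code&gt; gaming moves.&lt;/p&gt;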




&lt;h2&gt;
  
  
  What I'm changing
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Headers are no longer in the score
&lt;/h3&gt;

&lt;p&gt;This was the hardest decision. Headers were 6 of my ~50 findings. They're the easiest thing for a user to fix. They show up on every scan.&lt;/p&gt;

&lt;p&gt;But they're gameable. A check that can be passed by adding an empty header doesn't belong in a security score. And a check that rewards &lt;code&gt;max-age=0&lt;/code&gt; is actively harmful.&lt;/p&gt;

&lt;p&gt;Headers still appear in your report as a checklist. "Your site has CSP: yes/no." "Your HSTS max-age is 31536000." But they don't affect your score.&lt;/p&gt;

&lt;p&gt;The score now reflects only findings with ground truth: exposed files, JavaScript secrets, SSL configuration, cookie security, SRI, CORS misconfigurations, backend permission bypasses. Things I can verify are real.&lt;/p&gt;

&lt;h3&gt;
  
  
  SRI uses the rendered DOM
&lt;/h3&gt;

&lt;p&gt;The SRI check now runs on the fully rendered page from my headless browser instead of the initial HTML. This means it sees the scripts that React, Next.js, Vue, and other frameworks inject dynamically.&lt;/p&gt;

&lt;p&gt;Expected impact: detection goes from 18% to an estimated 60-80% of external scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  HTTP methods check is gone
&lt;/h3&gt;

&lt;p&gt;Removed entirely. 158 false positives, zero true positives. The check tested the homepage URL with alternative HTTP methods. Every modern web framework returns 200 or redirects for any method on the root path. The check was structurally incapable of producing true positives.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Test your assumptions with real requests
&lt;/h3&gt;

&lt;p&gt;I had "automated tests" that validated the HTTP methods analyzer. The tests passed. The analyzer correctly identified status &amp;lt; 400 as "method allowed." The logic was correct. The assumption was wrong.&lt;/p&gt;

&lt;p&gt;The only way I found this was by sending actual PUT requests to actual sites and looking at what came back. 200 with HTML. 301 redirect. Not a single JSON API response. The tests were green. The check was broken.&lt;/p&gt;

&lt;h3&gt;
  
  
  A score that can be gamed is worse than no score
&lt;/h3&gt;

&lt;p&gt;If someone can improve their score by adding &lt;code&gt;Content-Security-Policy:&lt;/code&gt; (empty), the score doesn't measure security. It measures header count. And if &lt;code&gt;Strict-Transport-Security: max-age=0&lt;/code&gt; improves the score while removing protection, the score is actively misleading.&lt;/p&gt;

&lt;p&gt;Lepochat et al. warned about this. I cited their paper. I didn't test against it until my CTO asked.&lt;/p&gt;

&lt;h3&gt;
  
  
  "0 degradations" is not a result
&lt;/h3&gt;

&lt;p&gt;My contextual scoring prototype showed 57% of sites improved, 0% degraded. I presented this as validation. My CTO pointed out it was a tautology: the algorithm was designed to only lower penalties, and my entire dataset was simple sites. Of course nothing degraded. The test that matters is whether complex sites keep their full severity. I couldn't run that test because my scanner can't tell the difference between a blog and Amazon.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transparency is a tool, not a shield
&lt;/h3&gt;

&lt;p&gt;My first article built trust by being honest about my scanner's limitations. That trust could become a shield for future decisions that haven't been validated. "Benji is transparent, so his contextual scoring must be solid." It wasn't. The design favored the wrong users.&lt;/p&gt;

&lt;p&gt;The fix isn't less transparency. It's testing every claim before publishing it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The method
&lt;/h2&gt;

&lt;p&gt;If you're building a scoring system, here's the testing approach that caught these bugs:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ground truth verification.&lt;/strong&gt; Don't just check that your logic is correct. Verify that your assumptions about the real world are correct. Send real requests. Compare with real data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A/B simulation on existing data.&lt;/strong&gt; Before changing anything in production, replay every historical scan through the new algorithm. Measure what changes. If 0% degrade, ask whether that's a result or a design property.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gaming tests.&lt;/strong&gt; Try to improve the score without improving security. If you can, the score doesn't measure what it claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benchmark against a reference, but know what you're benchmarking.&lt;/strong&gt; I used Mozilla Observatory as a reference for 342 sites. That helped find the SRI mapping bug. But I also used Observatory correlation as a quality metric for my scoring, which doesn't make sense if I'm intentionally measuring different things.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.first.org/cvss/specification-document" rel="noopener noreferrer"&gt;CVSS v3.1 Specification&lt;/a&gt; (FIRST.org)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bugcrowd.com/vulnerability-rating-taxonomy" rel="noopener noreferrer"&gt;Bugcrowd Vulnerability Rating Taxonomy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ieeexplore.ieee.org/document/11129585/" rel="noopener noreferrer"&gt;Lepochat et al. (2025) "One Does Not Simply Score a Website"&lt;/a&gt; (IEEE WTMC)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.mozilla.org/en-US/observatory/docs/tests_and_scoring" rel="noopener noreferrer"&gt;Mozilla Observatory Scoring&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://owasp.org/www-project-web-security-testing-guide/" rel="noopener noreferrer"&gt;OWASP Web Security Testing Guide&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;AmIHackable scans your website from the outside — the same perspective an attacker has. &lt;a href="https://amihackable.dev" rel="noopener noreferrer"&gt;Scan your site now.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cybersecurity</category>
      <category>infosec</category>
      <category>security</category>
      <category>testing</category>
    </item>
    <item>
      <title>How I Score Your Website's Security (And Why I Rebuilt It From Scratch)</title>
      <dc:creator>Benjamin</dc:creator>
      <pubDate>Sun, 22 Mar 2026 12:34:51 +0000</pubDate>
      <link>https://dev.to/howwow-2000/how-i-score-your-websites-security-and-why-i-rebuilt-it-from-scratch-440c</link>
      <guid>https://dev.to/howwow-2000/how-i-score-your-websites-security-and-why-i-rebuilt-it-from-scratch-440c</guid>
<description>&lt;p&gt;&lt;a href="https://amihackable.dev/learn/how-i-score-your-website-security" rel="noopener noreferrer"&gt;AmIHackable?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I tested my scanner against Mozilla Observatory on 229 sites, found I was wrong on 40% of them, and rebuilt my entire approach. Here's everything I learned.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem: security scores that lie
&lt;/h2&gt;

&lt;p&gt;I built AmIHackable to give developers a clear picture of their website's security. Paste your URL, get a score, fix what matters. Simple.&lt;/p&gt;

&lt;p&gt;Except it wasn't simple. Users started telling me things like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"My site is a React SPA on Netlify. Your scanner says I have WordPress, PHP, and an exposed &lt;code&gt;.env&lt;/code&gt; file. None of that is true."&lt;/p&gt;

&lt;p&gt;"You gave me 3/10 but Mozilla Observatory gives me B+. Your score is misleading for a site with TLS 1.3, solid auth, and zero XSS surface."&lt;/p&gt;

&lt;p&gt;"The scanner flagged &lt;code&gt;dangerouslySetInnerHTML&lt;/code&gt; as an XSS risk — but that string doesn't exist anywhere in my code. It's in React's own bundle."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These weren't edge cases. When I dug into the data, I found systematic problems.&lt;/p&gt;

&lt;p&gt;I compared my scores against &lt;a href="https://developer.mozilla.org/en-US/observatory" rel="noopener noreferrer"&gt;Mozilla Observatory&lt;/a&gt; on 229 real sites. The results were uncomfortable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sites that Observatory rated &lt;strong&gt;A+&lt;/strong&gt; were getting &lt;strong&gt;D&lt;/strong&gt; from me&lt;/li&gt;
&lt;li&gt;Sites that Observatory rated &lt;strong&gt;F&lt;/strong&gt; were getting &lt;strong&gt;A+&lt;/strong&gt; from me&lt;/li&gt;
&lt;li&gt;Overall correlation: &lt;strong&gt;56%&lt;/strong&gt; — barely better than flipping a coin between two adjacent grades&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I had two opposite problems at the same time: too harsh on well-configured sites, too lenient on poorly-configured ones.&lt;/p&gt;




&lt;h2&gt;
  
  
  What went wrong
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Problem 1: SPA false positives
&lt;/h3&gt;

&lt;p&gt;Modern web apps use Single Page Application architecture — one HTML file serves all routes. When you request &lt;code&gt;/actuator/env&lt;/code&gt; or &lt;code&gt;/.env&lt;/code&gt; on a Netlify SPA, you get a &lt;code&gt;200 OK&lt;/code&gt; with the app's homepage. My scanner saw &lt;code&gt;200 OK&lt;/code&gt; and concluded the file was accessible.&lt;/p&gt;

&lt;p&gt;Result: a React site with zero backend vulnerabilities gets flagged for Spring Boot actuator endpoints, PHP config files, and WordPress REST APIs. The score tanks to F.&lt;/p&gt;

&lt;p&gt;One user scanned his SPA on Netlify, got 0/10 with 23 findings — 15 of which were phantom API endpoints that didn't exist. He tried again a minute later. Same result. He probably left thinking the tool was broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Before checking sensitive paths, I now probe a random nonsensical URL. If the site returns &lt;code&gt;200 OK&lt;/code&gt; with HTML (the SPA shell), I know it's a catch-all. Every subsequent check compares the response body against this fingerprint — if they match, it's the same SPA shell, not a real file.&lt;/p&gt;
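&lt;p&gt;A minimal sketch of that probe-and-fingerprint logic (function names and the response shape are illustrative, not the production code):&lt;/p&gt;

```javascript
// Illustrative sketch of the catch-all probe. The response objects are
// simplified; the real scanner compares full response bodies.
function catchAllFingerprint(probeResponse) {
  // probeResponse is the reply to a random, nonsensical URL
  if (probeResponse.status === 200) {
    if (probeResponse.contentType.includes('text/html')) {
      return probeResponse.body;   // catch-all detected: remember the shell
    }
  }
  return null;                     // genuine 404 behaviour, no fingerprint
}

function isRealFinding(response, fingerprint) {
  if (response.status !== 200) {
    return false;                  // nothing accessible at this path
  }
  if (fingerprint !== null) {
    if (response.body === fingerprint) {
      return false;                // just the SPA shell again, not a file
    }
  }
  return true;                     // a genuinely accessible resource
}
```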

&lt;p&gt;That same Netlify site now scores &lt;strong&gt;8.2/10 (B)&lt;/strong&gt; with 3 real findings instead of 23 false ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 2: severity inflation
&lt;/h3&gt;

&lt;p&gt;When I first built the scanner, I made a deliberate choice: &lt;strong&gt;rate conservatively, alert too much rather than too little.&lt;/strong&gt; I figured it was better to flag a missing CSP as High and have a user add it, than to call it Low and have them ignore it.&lt;/p&gt;

&lt;p&gt;That logic felt responsible. But it backfired completely.&lt;/p&gt;

&lt;p&gt;Missing CSP? High. Missing X-Frame-Options? Medium. Missing HSTS? High. Session cookie without HttpOnly? High. My scanner was screaming "danger" at sites that were fundamentally fine — just missing some defense-in-depth layers.&lt;/p&gt;

&lt;p&gt;When I researched how the security industry actually rates these findings, I realized how far off I was:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Finding&lt;/th&gt;
&lt;th&gt;My initial rating&lt;/th&gt;
&lt;th&gt;Industry consensus&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Missing CSP&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt; (CVSS 2.1-3.1)&lt;/td&gt;
&lt;td&gt;Tenable, Acunetix, Bugcrowd VRT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing HSTS&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Medium&lt;/strong&gt; (CVSS 4.8-6.5)&lt;/td&gt;
&lt;td&gt;Tenable, Probely, OWASP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing X-Frame-Options&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt; (CVSS 2.1-4.3)&lt;/td&gt;
&lt;td&gt;Bugcrowd P4-P5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session cookie no HttpOnly&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Low&lt;/strong&gt; (CVSS 2.0-3.5)&lt;/td&gt;
&lt;td&gt;Requires existing XSS to exploit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing Referrer-Policy&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Informational&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bugcrowd P5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source maps exposed&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Medium&lt;/strong&gt; (CVSS 3.5-5.3)&lt;/td&gt;
&lt;td&gt;Info disclosure, not directly exploitable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight came from a user who put it perfectly: &lt;em&gt;"The headers are nice-to-have, not vulnerabilities. A 3/10 score is misleading for a site with TLS 1.3, solid auth, and tested sanitization."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;He was right. &lt;strong&gt;A missing security header is the absence of a mitigation, not the presence of a vulnerability.&lt;/strong&gt; A missing CSP doesn't create XSS — it removes a layer of defense against XSS if one already exists. Those are two fundamentally different things.&lt;/p&gt;

&lt;p&gt;Professional penetration testers and bug bounty platforms (&lt;a href="https://bugcrowd.com/vulnerability-rating-taxonomy" rel="noopener noreferrer"&gt;Bugcrowd VRT&lt;/a&gt;, &lt;a href="https://hackerone.com" rel="noopener noreferrer"&gt;HackerOne&lt;/a&gt;) consistently rate missing headers as P4-P5 (Low/Informational) across millions of real submissions. I was rating them as active threats.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 3: no detection context
&lt;/h3&gt;

&lt;p&gt;My scanner treated every finding identically regardless of what the site actually does. A missing CSP on a static portfolio with zero JavaScript gets the same severity as a missing CSP on an e-commerce site loading 12 third-party scripts. Those aren't the same risk. But my score said they were.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I rebuilt
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Findings first, score second
&lt;/h3&gt;

&lt;p&gt;The most important lesson from user feedback: &lt;strong&gt;nobody complained about the score formula. They complained about individual findings being wrong.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A perfectly calibrated scoring model applied to false findings produces false scores. I invested most of my effort into making every finding defensible before touching the scoring.&lt;/p&gt;

&lt;p&gt;Changes deployed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SPA catch-all detection&lt;/strong&gt; eliminates false positives on Netlify, Vercel, and Cloudflare Pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bundle-aware XSS detection&lt;/strong&gt; skips framework internals (React, Vite, Next.js bundles use &lt;code&gt;dangerouslySetInnerHTML&lt;/code&gt; internally — flagging it was misleading)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Platform-aware email checks&lt;/strong&gt; skip SPF/DMARC on &lt;code&gt;*.netlify.app&lt;/code&gt;, &lt;code&gt;*.vercel.app&lt;/code&gt; and similar — you don't control that DNS, so the finding isn't actionable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Industry-calibrated severities&lt;/strong&gt; for all 20+ finding types, each sourced from CVSS v3.1, Bugcrowd VRT, and OWASP WSTG&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technology detection
&lt;/h3&gt;

&lt;p&gt;My tech detection is powered by &lt;a href="https://www.wappalyzer.com/" rel="noopener noreferrer"&gt;Wappalyzer&lt;/a&gt;'s open-source database (7,500+ technologies) combined with headless browser execution via Cloudflare Workers. I now pass JavaScript global variables detected through real browser execution and DNS records (MX, TXT, NS) to the detection engine — revealing hosting providers, email services, and CDN layers without additional requests.&lt;/p&gt;
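&lt;p&gt;Conceptually, the matching works like this (a toy sketch; the two fingerprints below are invented for the example, real patterns come from Wappalyzer's database):&lt;/p&gt;

```javascript
// Toy sketch of pattern matching over JS globals and DNS NS records.
// These two fingerprints are invented for the example; real ones come
// from Wappalyzer's open-source database.
const patterns = [
  { tech: 'React', jsGlobals: ['React', '__REACT_DEVTOOLS_GLOBAL_HOOK__'] },
  { tech: 'ExampleHost', dnsNs: ['ns1.example-host.invalid'] },
];

function detect(globals, nsRecords) {
  const found = [];
  for (const p of patterns) {
    const jsHit = (p.jsGlobals || []).some(g => globals.includes(g));
    const dnsHit = (p.dnsNs || []).some(ns => nsRecords.includes(ns));
    if (jsHit || dnsHit) {
      found.push(p.tech);
    }
  }
  return found;
}
```

&lt;p&gt;The point of feeding in browser-executed globals and DNS records is that neither requires an extra request: both are already collected during the scan.&lt;/p&gt;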

&lt;p&gt;I benchmarked my detection against the real Wappalyzer browser extension on 445 sites. On a real-world test like Doctolib, I match 5 out of 5 of Wappalyzer's key detections (Rails, Cloudflare, Sentry, Didomi, Bot Management) and catch 2 extras (Ruby runtime, Google Tag Manager). There are gaps in smaller libraries (Preact, PDF.js) that I'm closing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scoring methodology
&lt;/h3&gt;

&lt;p&gt;Each finding's severity is now aligned with industry standards:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Severity&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;th&gt;Based on&lt;/th&gt;
&lt;th&gt;Examples&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Critical&lt;/td&gt;
&lt;td&gt;Immediate risk, exploitable now&lt;/td&gt;
&lt;td&gt;CVSS 9.0-10.0&lt;/td&gt;
&lt;td&gt;Exposed &lt;code&gt;.env&lt;/code&gt; with credentials, SSL failure, database RLS bypass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Significant weakness&lt;/td&gt;
&lt;td&gt;CVSS 7.0-8.9&lt;/td&gt;
&lt;td&gt;Secrets in JS, weak TLS protocol&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Real but conditional risk&lt;/td&gt;
&lt;td&gt;CVSS 4.0-6.9&lt;/td&gt;
&lt;td&gt;Missing HSTS, open redirect, session cookie without Secure flag&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Defense-in-depth gap&lt;/td&gt;
&lt;td&gt;CVSS 0.1-3.9&lt;/td&gt;
&lt;td&gt;Missing CSP, missing X-Frame-Options, XSS code patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Info&lt;/td&gt;
&lt;td&gt;Context, no action needed&lt;/td&gt;
&lt;td&gt;CVSS 0&lt;/td&gt;
&lt;td&gt;SSL configured correctly, Observatory grade, detected tech stack&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every severity decision is traceable to a published source: &lt;a href="https://www.first.org/cvss/specification-document" rel="noopener noreferrer"&gt;CVSS v3.1&lt;/a&gt;, &lt;a href="https://bugcrowd.com/vulnerability-rating-taxonomy" rel="noopener noreferrer"&gt;Bugcrowd VRT&lt;/a&gt;, &lt;a href="https://owasp.org/www-project-web-security-testing-guide/" rel="noopener noreferrer"&gt;OWASP WSTG&lt;/a&gt;, or the &lt;a href="https://cwe.mitre.org/top25/" rel="noopener noreferrer"&gt;CWE Top 25&lt;/a&gt;. I keep the full mapping documented internally — every finding has a CWE ID, a WSTG test reference, and a rationale.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I detect that others can't
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Adaptive Backend Probing.&lt;/strong&gt; When my scanner detects backend credentials in your bundled JavaScript — Supabase URLs and anon keys, Firebase configs — it doesn't just flag "key exposed." It automatically tests whether that key can actually access your data. Are your database tables visible? Is Row Level Security properly configured? Can anonymous users read data they shouldn't?&lt;/p&gt;

&lt;p&gt;This transforms a finding from "your key is visible in the JS" into "your key allows reading the users table without authentication."&lt;/p&gt;
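&lt;p&gt;A hedged sketch of the idea (the endpoint shape follows Supabase's public PostgREST convention; &lt;code&gt;buildProbe&lt;/code&gt; and &lt;code&gt;classifyProbe&lt;/code&gt; are illustrative names, not my production code):&lt;/p&gt;

```javascript
// Sketch of adaptive backend probing for a Supabase key found in a
// JS bundle. The URL shape follows Supabase's PostgREST REST API;
// everything else is illustrative.
function buildProbe(supabaseUrl, anonKey, table) {
  return {
    url: supabaseUrl + '/rest/v1/' + table + '?select=*',
    headers: { apikey: anonKey, Authorization: 'Bearer ' + anonKey },
  };
}

function classifyProbe(status, rows) {
  if (status === 200) {
    if (rows.length > 0) {
      return 'data-readable';      // RLS missing or misconfigured
    }
    return 'table-visible';        // schema leaks, but rows are blocked
  }
  if (status === 401 || status === 403) {
    return 'rls-enforced';         // key is exposed but access is denied
  }
  return 'inconclusive';
}
```

&lt;p&gt;The classification is what turns "key visible" into a finding with a concrete severity.&lt;/p&gt;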

&lt;p&gt;&lt;strong&gt;SPA-Aware Scanning.&lt;/strong&gt; I detect Single Page Application routing and adapt all checks accordingly. Sites on Netlify, Vercel, and Cloudflare Pages no longer get flagged for sensitive files and API endpoints that are actually just the SPA shell returning &lt;code&gt;200 OK&lt;/code&gt; for everything.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'm honest about
&lt;/h2&gt;

&lt;h3&gt;
  
  
  This score is not a risk prediction
&lt;/h3&gt;

&lt;p&gt;My score measures your &lt;strong&gt;observable security posture from the outside&lt;/strong&gt;. It doesn't predict whether you'll be breached. A site with a perfect score can have SQL injection in its login form — I can't see that without access to your code.&lt;/p&gt;

&lt;p&gt;Think of it as a health checkup, not a diagnosis. It tells you what's visible and what to fix first.&lt;/p&gt;

&lt;h3&gt;
  
  
  I measure more than Observatory — and that creates divergence
&lt;/h3&gt;

&lt;p&gt;Mozilla Observatory runs 10 checks, nearly all of them security headers. I run 50+. My scores won't always match Observatory — and that's intentional. A site with perfect headers but an exposed &lt;code&gt;.env&lt;/code&gt; file gets A+ from Observatory and a much lower score from me. I think that's the right behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Missing mitigations ≠ vulnerabilities
&lt;/h3&gt;

&lt;p&gt;This is worth repeating: a missing CSP header does not make your site vulnerable to XSS. It removes a layer of defense. I score it accordingly — as Low, not Critical.&lt;/p&gt;

&lt;p&gt;If you see a Low finding for a missing header, it means "this would strengthen your security posture" — not "you're being hacked right now."&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next: contextual scoring
&lt;/h2&gt;

&lt;p&gt;Right now, every missing CSP gets the same severity. But a missing CSP on a static portfolio with zero JavaScript is effectively Informational — there's nothing to protect against. The same missing CSP on a site loading Google Tag Manager, Stripe.js, and Intercom has real consequences — those third-party scripts are exactly the attack surface that CSP is designed to control.&lt;/p&gt;

&lt;p&gt;I already detect your tech stack, your third-party scripts, your framework. The next step is using that context to adjust severity dynamically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 third-party scripts, no inline JS → CSP missing is &lt;strong&gt;Info&lt;/strong&gt; (nothing to protect)&lt;/li&gt;
&lt;li&gt;1-5 scripts, no inline → CSP missing is &lt;strong&gt;Low&lt;/strong&gt; (limited surface)&lt;/li&gt;
&lt;li&gt;6+ third-party scripts or inline JS → CSP missing is &lt;strong&gt;Medium&lt;/strong&gt; (real attack surface)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;eval()&lt;/code&gt; or &lt;code&gt;dangerouslySetInnerHTML&lt;/code&gt; with user input → CSP missing is &lt;strong&gt;High&lt;/strong&gt; (active risk)&lt;/li&gt;
&lt;/ul&gt;
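&lt;p&gt;As a sketch, the tiers above reduce to a small lookup. The function name and inputs are hypothetical; the real version would feed in the detected tech stack and script inventory:&lt;/p&gt;

```python
# Sketch of the contextual CSP severity tiers from the list above.
# Thresholds mirror the bullet points; names are illustrative only.
def missing_csp_severity(third_party_scripts: int,
                         has_inline_js: bool,
                         has_dangerous_sinks: bool) -> str:
    """Map observed attack surface to a severity for a missing CSP."""
    if has_dangerous_sinks:  # eval()/dangerouslySetInnerHTML with user input
        return "High"
    if third_party_scripts >= 6 or has_inline_js:
        return "Medium"
    if third_party_scripts >= 1:
        return "Low"
    return "Info"  # static site, nothing for a CSP to protect

print(missing_csp_severity(0, False, False))  # Info
print(missing_csp_severity(3, False, False))  # Low
print(missing_csp_severity(7, False, False))  # Medium
print(missing_csp_severity(2, False, True))   # High
```

&lt;p&gt;The interesting part isn't the lookup itself — it's gathering the inputs reliably, which is where the existing tech-stack detection comes in.&lt;/p&gt;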

&lt;p&gt;No scanner does this today. Observatory, SecurityHeaders, Qualys — they all score findings in isolation. I'm building toward a score that understands your actual attack surface.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;My methodology is built on published standards, not arbitrary choices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.first.org/cvss/specification-document" rel="noopener noreferrer"&gt;CVSS v3.1 Specification&lt;/a&gt; — Severity ranges for individual findings (FIRST.org)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://bugcrowd.com/vulnerability-rating-taxonomy" rel="noopener noreferrer"&gt;Bugcrowd Vulnerability Rating Taxonomy&lt;/a&gt; — P1-P5 severity from millions of real bug bounty submissions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://owasp.org/www-project-web-security-testing-guide/" rel="noopener noreferrer"&gt;OWASP Web Security Testing Guide&lt;/a&gt; — Test categorization and methodology&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://cwe.mitre.org/top25/" rel="noopener noreferrer"&gt;CWE Top 25 2024&lt;/a&gt; — Common weakness scoring&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://developer.mozilla.org/en-US/observatory/docs/tests_and_scoring" rel="noopener noreferrer"&gt;Mozilla Observatory Scoring&lt;/a&gt; — Reference baseline for header grading&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/ssllabs/research/wiki/SSL-Server-Rating-Guide" rel="noopener noreferrer"&gt;Qualys SSL Labs Rating Guide&lt;/a&gt; — Grade cap methodology for SSL&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.oecd.org/en/publications/handbook-on-constructing-composite-indicators-methodology-and-user-guide_9789264043466-en.html" rel="noopener noreferrer"&gt;OECD/JRC Handbook on Composite Indicators&lt;/a&gt; — Composite scoring framework&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://ieeexplore.ieee.org/document/11129585/" rel="noopener noreferrer"&gt;Lepochat et al. (2025) "One Does Not Simply Score a Website"&lt;/a&gt; — IEEE critique of website scoring algorithms&lt;/li&gt;
&lt;li&gt;Tenable, Acunetix, Probely — Individual finding severity benchmarks (&lt;a href="https://www.tenable.com/plugins/was/112551" rel="noopener noreferrer"&gt;CSP&lt;/a&gt;, &lt;a href="https://www.tenable.com/plugins/was/98056" rel="noopener noreferrer"&gt;HSTS&lt;/a&gt;, &lt;a href="https://www.acunetix.com/vulnerabilities/web/javascript-source-map-detected/" rel="noopener noreferrer"&gt;Source Maps&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;AmIHackable scans your website from the outside — the same perspective an attacker has. It detects your technology stack, checks your security configuration, and gives you a prioritized list of what to fix. &lt;a href="https://amihackable.dev" rel="noopener noreferrer"&gt;Scan your site now.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
