DEV Community

ricco020
Posted on • Originally published at anonymflow.com

7 methodology choices that separate a useful VPN benchmark from a marketing chart

TL;DR

I ran a lot of VPN benchmarks this year. Some are below the noise floor. Some are signal. The difference comes down to whether the measurement is reproducible by someone who didn't run it. Here are the 7 methodology choices that separate a useful VPN benchmark from a marketing chart.

I documented this protocol on AnonymFlow's VPN leak audit methodology — this post is the dev-focused condensation.

1. Define "success" in writing BEFORE you run the test

A 90% success rate that requires multiple reconnects is not the same as a 95% rate that just works.

If you redefine success mid-test to fit the data, you have an opinion, not a measurement.

For VPN streaming sessions, our definition was:

  • Catalog displays localized regional content (not the "original country" fallback)
  • One HD video starts within 30 seconds
  • No proxy-type error message during the first 60 seconds
  • Throughput ≥ 5 Mbps sustained

A session that misses even ONE of those criteria = failed. No partial credit.
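One way to keep that definition from drifting is to freeze it in code before the first session runs. The following is a minimal sketch, not the tooling used for the study: the four criteria and their thresholds come from the list above, while the `Session` class, its field names, and the failure labels are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds taken from the written definition of success.
MAX_START_SECONDS = 30.0
MIN_THROUGHPUT_MBPS = 5.0


@dataclass(frozen=True)
class Session:
    # Hypothetical field names; only the criteria themselves come from the post.
    localized_catalog: bool               # regional catalog shown, not the fallback
    video_start_seconds: Optional[float]  # None if the HD video never started
    proxy_error_in_first_60s: bool
    sustained_throughput_mbps: float


def failed_criteria(s: Session) -> List[str]:
    """Return the label of every criterion the session missed."""
    reasons = []
    if not s.localized_catalog:
        reasons.append("catalog_not_localized")
    if s.video_start_seconds is None or s.video_start_seconds > MAX_START_SECONDS:
        reasons.append("slow_or_no_start")
    if s.proxy_error_in_first_60s:
        reasons.append("proxy_error")
    if s.sustained_throughput_mbps < MIN_THROUGHPUT_MBPS:
        reasons.append("low_throughput")
    return reasons


def is_success(s: Session) -> bool:
    """No partial credit: a single missed criterion fails the session."""
    return not failed_criteria(s)
```

Returning the list of missed criteria, rather than a bare boolean, keeps the error type available for the raw log without changing the pass/fail rule.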

2. Pre-commit to a sample size

Before knowing the variance, before running anything: pick n and stick to it.

For our 2,850-session VPN streaming study, we picked n = 95 per VPN × service before knowing what the noise floor would be. The 95% confidence interval on a binomial proportion with n = 95 depends on the observed rate: using the normal approximation, it is roughly ±6 percentage points at 90%, ±9 at 75%, and ±10 in the worst case at 50%. That's enough to distinguish 90% from 75%, not enough to distinguish 90% from 85%.
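The interval width for n = 95 can be checked in a few lines. This is a sketch using the normal (Wald) approximation, plus the Wilson score interval as an alternative that behaves better near 0% and 100%. The function names are illustrative, and the post does not say which interval was used for the published figures.

```python
import math
from typing import Tuple

Z_95 = 1.959964  # two-sided 95% quantile of the standard normal


def wald_half_width(p: float, n: int, z: float = Z_95) -> float:
    """Half-width of the normal-approximation interval for a proportion p."""
    return z * math.sqrt(p * (1.0 - p) / n)


def wilson_interval(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for an observed count of successes out of n."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - margin, center + margin


if __name__ == "__main__":
    for p in (0.5, 0.75, 0.9):
        print(f"p={p:.2f}  half-width = {100 * wald_half_width(p, 95):.1f} pp")
    print(wilson_interval(86, 95))
```

The half-width shrinks as the observed rate moves away from 50%, which is why a single "±" figure for a given n is only a rough guide.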

If you stop the test "when the result looks clean", you have selection bias.

3. Distribute over time slots

Most VPN benchmarks run "10 sessions in a row at 3 PM on a Tuesday" and call it a day. This catches one time slot's worth of routing.

Our protocol: 3 time slots per day — 9:00 AM, 2:00 PM, 9:30 PM Europe/Paris — to capture evening peak congestion, weekday routing differences, and timezone-shifted routing on the providers' end.
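A schedule like this is easy to generate ahead of time so that nobody picks slots by hand. The sketch below only produces the timestamps; the three slot times and the Europe/Paris zone come from the protocol, while the function names and the choice to log in UTC are assumptions. It relies on the standard-library `zoneinfo` module (Python 3.9+), which needs time zone data on the machine (on some systems that means installing the `tzdata` package).

```python
from datetime import date, datetime, time, timedelta, timezone
from typing import List
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")
# The three daily slots named in the protocol.
SLOTS = (time(9, 0), time(14, 0), time(21, 30))


def schedule(start: date, days: int) -> List[datetime]:
    """Return timezone-aware session start times, three per day."""
    planned = []
    for offset in range(days):
        day = start + timedelta(days=offset)
        for slot in SLOTS:
            planned.append(datetime.combine(day, slot, tzinfo=PARIS))
    return planned


def to_utc_iso(ts: datetime) -> str:
    """Render a timestamp in UTC so logs stay unambiguous across DST changes."""
    return ts.astimezone(timezone.utc).isoformat()


if __name__ == "__main__":
    for ts in schedule(date(2024, 1, 15), 2):
        print(ts.isoformat(), "->", to_utc_iso(ts))
```

Anchoring the slots to local Paris time while logging in UTC means the 9:30 PM slot stays at local evening peak even when daylight saving time shifts the UTC offset.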

The same applies to disk recovery benchmarks (TRIM behavior differs based on cumulative writes since the last scheduled fstrim run) and VPS benchmarks (Contabo Nuremberg gets different transit at 9 AM vs 11 PM CET).

4. Log raw observations, not aggregates

The aggregate "90.5% recovery rate" is derived. The 100 per-file MD5 hash comparisons are raw.
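For the disk recovery case, the raw layer could look like one row per file, each carrying both hashes and a boolean. This is a minimal sketch assuming the original and recovered files sit in two directories under the same file names; that directory layout, the function names, and the row keys are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_recovered(original_dir: Path, recovered_dir: Path) -> List[Dict]:
    """Return one raw observation per original file, sorted by file name."""
    rows = []
    for original in sorted(p for p in original_dir.iterdir() if p.is_file()):
        recovered = recovered_dir / original.name
        expected = md5_of(original)
        # A file that was not recovered at all is logged with no hash.
        actual = md5_of(recovered) if recovered.is_file() else None
        rows.append({
            "file": original.name,
            "expected_md5": expected,
            "actual_md5": actual,
            "match": actual == expected,
        })
    return rows
```

The recovery rate is then just the share of rows where `match` is true, and anyone holding the rows can recompute it.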

If you only publish aggregates, a reader can't recompute with a different definition of success. They have to take your word for it.

Publish:

  • Per-session raw boolean (success / failure)
  • Per-session ancillary measurements (latency, throughput, error type)
  • Software versions, hardware identifiers, network conditions

Then your own conclusion section is one possible interpretation, not the only one.
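To make that concrete, here is a sketch of what recomputing with a different definition looks like once per-session rows exist. The four rows are invented placeholders, not results from any study, and the column names and the "lenient" rule are hypothetical.

```python
import csv
import io
from typing import Callable, Dict, List

# Placeholder rows for illustration only; NOT measured data.
RAW_CSV = """session_id,vpn,service,success,start_seconds,throughput_mbps,error_type
1,vpn_a,service_x,1,8.4,42.0,
2,vpn_a,service_x,0,41.2,38.5,slow_start
3,vpn_a,service_x,1,12.9,6.1,
4,vpn_a,service_x,0,,0.0,proxy_error
"""

Row = Dict[str, str]


def load_rows(text: str) -> List[Row]:
    """Parse the raw per-session log into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def success_rate(rows: List[Row], predicate: Callable[[Row], bool]) -> float:
    """Share of sessions that satisfy a given definition of success."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


def published(r: Row) -> bool:
    """The author's own verdict, as logged in the success column."""
    return r["success"] == "1"


def lenient(r: Row) -> bool:
    """A reader's alternative: tolerate a start time of up to 45 seconds."""
    if r["error_type"] == "proxy_error" or r["start_seconds"] == "":
        return False
    return float(r["start_seconds"]) <= 45.0 and float(r["throughput_mbps"]) >= 5.0


if __name__ == "__main__":
    rows = load_rows(RAW_CSV)
    print("published definition:", success_rate(rows, published))
    print("lenient definition:  ", success_rate(rows, lenient))
```

With only an aggregate published, the second number is impossible to compute. With the rows published, it takes a reader a few minutes.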

5. Acknowledge biases in writing

Every measurement has biases. The honest move is to document them up front. Our 4 standard documented biases:

  1. Geographic bias — all tests from one location (Paris 15th in our case)
  2. Temporal bias — the test window doesn't cover all seasons (no World Cup, no Black Friday)
  3. Single-operator bias — one tester = one browser fingerprint
  4. Affiliation bias — we earn commission on NordVPN; this is disclosed and the methodological response is publishing all raw figures

A reader can decide which biases matter for their use case. That's only possible if biases are listed.

6. Make reproduction cheap

If reproducing your benchmark requires $5,000 of specialized hardware, no one will do it. If it requires a standard Contabo VPS (about $5/month) and a stopwatch, dozens will.

Our benchmark setups:

  • VPN streaming: MacBook Pro M2 + Orange 1 Gbps fiber Paris + iPhone 15 Pro for mobile clients
  • Disk recovery: Samsung 870 EVO 250 GB SSD + WD Blue 1 TB HDD (consumer hardware, $80 + $35 retail)
  • WireGuard benchmark: Contabo VPS S ($4.50/month) + Linux 6.5 (standard distro)

Total reproduction cost for the 3 studies: under $500.

7. Publish to Zenodo or OSF with a DOI

A GitHub repo can be deleted. A published Zenodo deposit keeps its DOI and cannot simply be deleted by its uploader.

Once you have the raw data + methodology in a citable format, journalists, researchers, and SEO competitors who try to debunk your numbers have to point at specific cells in your CSV, not at "well their methodology was sketchy".

DOIs we published:

What this isn't

This isn't about being "scientific" in a pretentious way. It's about making the benchmark debunkable — that's the test.

A benchmark you can't debunk because you didn't publish enough data isn't trustworthy. It's just an opinion.


The full leak audit methodology is at anonymflow.com/en/blog/vpn-leak-audit-protocol — 12 reproducible steps with the criteria each step must meet.
