Benchmark Deep Dive: Stochastic Analysis of 35,008 Ground-Truth Samples

SQEval Benchmark Deep Dive: Stochastic Analysis of 35K Ground-Truth Samples Research & Data Science Benchmark Deep Dive:Stochastic Analysis of 35,008 Samples Full statistical breakdown of SQEval heuristic engine accuracy — percentiles, variance, per-category MAE, and distribution analysis. By Anton Gorshkov · March 28, 2026 · Engine v1.15.1-STABLE MAE 7.27% 5K dataset Std Dev 5.01% Low variance Median 6.5% Error midpoint P99 19% 99th percentile This is the first time we're publishing a complete stochastic breakdown of the SQEval heuristic engine. We ran two parallel benchmark suites — a 5,003-sample standard set and a 30,005-sample massive set — both backed by real ground-truth data categorized across 7 quality tiers. Here's what the numbers tell us. 1. Error Distribution & Percentiles The error distribution is right-skewed: most predictions land close to ground truth, with a long but thin tail. The median error (6.5%) sits well below the mean (7.27%), confirming the skew. Error Percentile Distribution 0% 5% 10% 15% 22% 1.25% 2.58% 6.5% 11% 14.5% 16.5% 19% P10 P25 P50 P75 P90 P95 P99 Absolute error at each percentile (lower is better) Key Insight 95% of all predictions fall within 16.5 points of ground truth. The median error of 6.5% means that for a typical page, the engine's score lands within half a letter grade of the human-rated target. 2. Scale Stability: 5K vs 30K A core question: does accuracy degrade at scale? We ran identical analysis logic on two independently generated ground-truth sets. The answer is a clear no. Metric 5,003 Samples 30,005 Samples Delta MAE 7.27% 7.31% +0.04 Std Dev 5.01% 5.05% +0.04 Variance 25.11 25.47 +0.36 Median Error 6.5% 6.5% 0.00 Avg System Score 63.42 63.35 -0.07 Max Deviation 22% 22% 0.00 96% Scale-Invariant Accuracy Scaling from 5K to 30K samples increased MAE by only 0.04 percentage points. All percentile boundaries held identical. The engine is deterministic and scale-invariant. 3. Per-Category Accuracy Not all content types are equal. High-quality news and authority sites are easiest to score accurately, while spam and low-quality content produces wider variance. MAE by Category (30K dataset) High Quality News 3.08% (n=4,357) Authority / Gov / Edu 3.84% (n=1,508) Platform / Social / General 7.75% (n=19,230) Low Quality / Spam 10.38% (n=4,890) YMYL / Medical 12.00% (n=11) Education / Reference 15% Tech / Search 16% Global MAE (7.27%) 3.08% Best accuracy: High Quality News & Expertise. Structured content with clear E-E-A-T signals is easiest for the heuristic engine to evaluate. 10.38% Hardest category: Low Quality / Spam (large sample). Deliberately deceptive content blurs the signal boundary between low-quality and mid-tier pages. 4. Dataset Composition The ground-truth dataset mirrors real web distribution. Platform/Social sites dominate at 64%, reflecting the actual internet landscape. 35K samples 64.1% Platform / Social / General (19,230) 16.3% Low Quality / Spam (4,890) 14.5% High Quality News (4,357) 5.0% Authority / Gov / Edu (1,508) <0.1% YMYL, Education, Tech (20) 5. Full Stochastic Summary Statistic Value (5K) Value (30K) Mean (MAE)7.27%7.31% Standard Deviation5.01%5.05% Variance25.1125.47 Median (P50)6.5%6.5% P101.25%1.25% P252.58%2.58% P7511%11% P9014.5%14.5% P9516.5%16.5% P9919%19% Max Deviation22%22% Methodology All benchmarks use real ground-truth data with human-rated target scores across 7 quality categories. HTML is mocked by category to isolate heuristic engine logic from network variability. The engine version tested is v1.15.1-STABLE. Score range: 0–100. Tests executed on March 28, 2026. Try SQEval on your own site Run a real-time quality audit backed by the same engine. START AUDIT →

DEV Community

Benchmark Deep Dive: Stochastic Analysis of 35,008 Ground-Truth Samples

Top comments (0)