Kwansub Yun
AI SLOP Detector v3.1: Three Formula Refinements and the Adversarial Tester That Found Them

We shipped v2.9.0 with a scoring engine we trusted. We ran tests. Everything passed.

Then we built a tool specifically designed to find cases where the score was less precise than it could be, and it found three.

This is the story of v3.1.0. And the patch that followed six hours later.




Glossary: internal terminology used throughout this post

| Term | What it means |
| --- | --- |
| Deficit score | The final output of the scorer. 0 = structurally clean, 100 = critical. Derived as 100 × (1 - GQG). |
| GQG | Geometric Quality Gate. A weighted geometric mean of LDR, Inflation quality, DDC, and Purity. The single formula the scorer evaluates. |
| LDR | Logic Density Ratio. Ratio of executable logic lines to total lines. Low LDR = file is mostly stubs, blanks, or comments. |
| Inflation | Metric that flags jargon-heavy docstrings unsupported by actual code complexity. A 2-line function with a 30-line docstring using 12 buzzwords scores badly. |
| DDC | Dead/Duplicate Code ratio. Tracks unreachable paths, copy-pasted blocks, phantom imports. |
| Purity | Pattern hit rate. How many structural anti-patterns (god functions, stub returns, nested complexity) fire on the file. |
| Cyclomatic Complexity (CC) | Count of independent code paths. A straight-line function = CC 1. Each if, for, while, except adds 1. |
| fhval | flamehaven-validator. An external tool that interrogates the scorer from outside the codebase. Its purpose is to catch cases where internal test consistency masquerades as correctness. |
| SPAR | Subcommand of fhval. Adversarial regression loop with three layers. Tests whether the scorer measures what it claims to measure. |
| JSD | Jensen-Shannon Divergence. A symmetric, bounded (0–1) measure of divergence between two probability distributions. Used here to compare AST node-type histograms between functions. |
| AST | Abstract Syntax Tree. The parsed structure of source code. An if statement, a return, a function call each become typed nodes. |
| function_clone_cluster | New pattern in v3.1.0. Detects files where many functions share near-identical AST structure: the fragmented god function evasion pattern. |
| placeholder_variable_naming | New pattern in v3.1.0. Detects vocabulary-clean code with zero semantic content: single-letter parameter floods, sequential numbered variables. |
| AM/GM gap | Core refinement in v3.1.0. The calibrator used an arithmetic mean (simpler approximation); the scorer uses a geometric mean (the precise target formula). Aligning them closes a ~5–7 pt estimation gap on uneven files. |

Quick context

AI SLOP Detector is a static analyzer that measures structural code quality, not style or formatting. It scores each file across four dimensions and assigns a deficit between 0 (clean) and 100 (critical):

| Dimension | What it measures |
| --- | --- |
| LDR | Ratio of executable logic to total lines |
| Inflation | Jargon, docstring bloat, unsupported claims |
| DDC | Unreachable paths, copy-pasted blocks |
| Purity | Pattern hit rate (stubs, god functions, etc.) |

These four numbers feed a single formula, a weighted geometric mean, called the GQG. The output is the deficit score: 100 × (1 - GQG).

The calibrator's job is to find the best weights for that formula by searching over thousands of known cases.
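As a sketch of that pipeline (the dimension names and equal weights here are illustrative, not the shipped defaults), the GQG and deficit computation can be written as:

```python
from math import exp, log

def gqg_deficit(dims, weights, eps=1e-4):
    # Weighted geometric mean computed in log space, floored at eps so a
    # zero-valued dimension cannot send the logarithm to negative infinity.
    total_w = sum(weights.values())
    log_sum = sum(w * log(max(eps, dims[k])) for k, w in weights.items())
    gqg = exp(log_sum / total_w)
    return min(100.0, 100.0 * (1.0 - gqg))

# One very weak dimension drags the whole product down:
deficit = gqg_deficit(
    {"ldr": 0.9, "inflation_quality": 0.1, "ddc": 0.8, "purity": 0.9},
    {"ldr": 0.25, "inflation_quality": 0.25, "ddc": 0.25, "purity": 0.25},
)
```

The eps floor mirrors the 1e-4 clamp that appears later in the calibrator fix.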


Before v3.1.0: the self-scan


We don't ship a version without running the detector against itself. Before cutting v3.1.0, we ran v3.0.3, a structural debt reduction pass on the three highest-deficit files in the codebase.

Self-scan before: avg_deficit=23.57, 15 deficit files, status=suspicious
Self-scan after:  avg_deficit=20.33, 12 deficit files, status=clean

analysis/cross_file.py dropped from 70.3 to 28.7 (critical → clean). ci_gate.py from 69.3 to 22.3. cli.py from 68.4 to 20.9. The fixes were mechanical: extracted nested closures to private methods, replaced if/elif/else dispatch chains with dict dispatch, removed re-declared constants.

The point is not that these numbers are good. It's that the tool had to earn its own PASS before we shipped the version that refines the formula. Shipping a scoring engine while your own codebase sits at suspicious would have been its own kind of slop.


The adversarial tester: fhval SPAR

In a previous post we described fhval, the flamehaven-validator. The core concern: when every tool in an ecosystem is built by the same person against the same baseline, internal consistency can masquerade as correctness. Passing your own tests proves nothing about whether your tests are asking the right questions.

For v3.1.0 we added a spar subcommand, an adversarial regression loop that interrogates the scorer from the outside. Running SPAR against the v3.0.x scorer:

SPAR score: 55 / 100  [FAIL]

Layer A anomalies:
  A3 stub_class_8_methods     expected >= 30  got 20.0  [ANOMALY]
  A4 fragmented_god_function  expected >= 10  got  0.0  [ANOMALY]
  A5 vocab_clean_meaningless  expected >=  8  got  0.0  [ANOMALY]

Layer C blind spots:
  C2 inflation_blindspot      [BLIND_SPOT]
  C3 ddc_annotation_gap       [BLIND_SPOT]

Three gaps. Two documented scope limits. Score: 55 FAIL.

Each gap pointed at a specific detection weakness. The SPAR methodology itself (how Layers A, B, and C work, and why adversarial ground truth is hard to author from inside the codebase) is a separate topic covered in tomorrow's post. Here we focus on what the gaps told us and what we changed.


Refinement 1: The calibrator and scorer were using different formulas


The scorer computes a weighted geometric mean. The calibrator, which finds the optimal weights, was computing a weighted arithmetic mean as its optimization target.

Those are not the same thing, and for a quality gate, the difference is structural.

Consider a file with three dimension scores: LDR=0.9 (good), inflation_quality=0.1 (very bad), DDC=0.8 (good).

| Formula | Calculation | Result | Deficit |
| --- | --- | --- | --- |
| Arithmetic mean | (0.9 + 0.1 + 0.8) / 3 | 0.60 | 40 |
| Geometric mean | (0.9 × 0.1 × 0.8) ^ (1/3) | 0.42 | 58 |

The arithmetic mean gives deficit=40. The geometric mean gives deficit=58. The gap is 18 points: not rounding error, but structural. The geometric mean amplifies weak dimensions because one bad score pulls the entire product down. The arithmetic mean averages over them.

The scorer uses the geometric mean for good reason: a file that scores well on most dimensions but contains zero actual logic (all docstrings) should not score deficit=30. It should score much higher. The formula enforces that.

The first-generation calibrator used an arithmetic mean as a simpler starting approximation. So it was finding weights that minimize error against a different objective than the one the scorer actually computes. The result: roughly 5–7 points of underestimation on files with uneven dimension profiles, which are precisely the target of this tool.

The AM ≥ GM inequality means the calibrator's scores were always optimistic. For balanced files (all dimensions similar) the gap is small and harmless. For uneven files it was systematic, and those are the cases that matter most.
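The table above can be reproduced directly:

```python
import math

scores = [0.9, 0.1, 0.8]  # LDR, inflation_quality, DDC from the example above

am = sum(scores) / len(scores)                # arithmetic mean: 0.60
gm = math.prod(scores) ** (1 / len(scores))   # geometric mean: ~0.416

am_deficit = 100 * (1 - am)  # 40.0
gm_deficit = 100 * (1 - gm)  # ~58.4, the target the scorer actually computes
```

The 18-point spread is exactly the AM/GM gap the calibrator was optimizing on the wrong side of.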

Refinement:

# Before (calibrator _recompute_deficit)
quality = (w_ldr * ldr + w_inflation * (1 - inflation_n) + w_ddc * ddc) / total_w

# After: mirrors the scorer's GQG formula exactly
from math import exp, log
gqg = exp(
    (
        w_ldr * log(max(1e-4, ldr))
        + w_inflation * log(max(1e-4, 1.0 - inflation_n))
        + w_ddc * log(max(1e-4, ddc))
    )
    / total_w
)
deficit = min(100.0, 100.0 * (1.0 - gqg))

This is why SPAR anomaly A3 (stub_class_8_methods) jumped from deficit 20.0 to 40.0: the stub class had heavily uneven dimensions, and the geometric mean scored it correctly once the calibrator was trained against the right target.


Refinement 2: The complexity modifier had a dead zone at the common end


The inflation metric applies a complexity modifier to penalize functions that are simultaneously simple and jargon-heavy, a common pattern in AI-generated code: a two-line function surrounded by an elaborate docstring.

The first-generation modifier formula:

# Before
complexity_modifier = max(1.0, 1.0 + (avg_complexity - 3.0) / 10.0)

For CC=1: 1.0 + (1-3)/10 = 0.8 → max(1.0, 0.8) = 1.0
For CC=2: 1.0 + (2-3)/10 = 0.9 → max(1.0, 0.9) = 1.0
For CC=3: 1.0 + (3-3)/10 = 1.0 → max(1.0, 1.0) = 1.0

CC=1, 2, and 3 all received the same modifier: 1.0. This meant simple functions, the three most common complexity levels, paid no complexity premium on inflation, regardless of how jargon-heavy they were. The modifier only activated from CC=4 upward.

Simple jargon-heavy functions are the most common AI code signature. The formula was least sensitive precisely where it needed to be most sensitive.

# After: CC=1 is the baseline, not CC=3
complexity_modifier = max(1.0, 1.0 + (avg_complexity - 1.0) / 10.0)

Now CC=2 gets a 1.10× modifier and CC=3 gets 1.20×. The penalty scales upward from the simplest meaningful function.
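Side by side, with hypothetical function names wrapping the before/after formulas quoted above:

```python
def modifier_old(avg_complexity):
    # Before: baseline at CC=3, so CC=1..3 all clamp to 1.0 (the dead zone)
    return max(1.0, 1.0 + (avg_complexity - 3.0) / 10.0)

def modifier_new(avg_complexity):
    # After: baseline at CC=1; the premium scales from the simplest function up
    return max(1.0, 1.0 + (avg_complexity - 1.0) / 10.0)

for cc in (1, 2, 3, 4):
    print(f"CC={cc}: old={modifier_old(cc):.2f}  new={modifier_new(cc):.2f}")
```

The old formula is flat across the three most common complexity levels; the new one differentiates them.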


Refinement 3: Purity weight was documented but not connected


The GQG formula includes a purity dimension:

# Before
w_pur = 0.10  # hardcoded constant
final_score = gqg_score * (1.0 - w_pur * purity_penalty)

.slopconfig.yaml had a weights.purity field. The calibrator's weight search had a purity parameter. Neither was connected to this constant โ€” users could configure weights.purity: 0.20 and nothing would change.

# After
w_pur = weights.get("purity", 0.10)  # default unchanged; now configurable

One line. The config surface now matches the implementation.


Two new detection patterns

Stub evasion: empty container returns

The existing return_constant_stub pattern caught return True, return 0, and return "string", but not return {}, return [], return (), or return set(). These are equally common stub patterns in class skeletons:

class DataProcessor:
    def get_results(self) -> dict:
        return {}  # was not flagged before

    def list_items(self) -> list:
        return []  # was not flagged before

Both are now caught by return_constant_stub and interface_only_class.
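A minimal sketch of how such a check can be expressed with the stdlib ast module (my illustration, not the tool's actual implementation; ast.unparse requires Python 3.9+):

```python
import ast

EMPTY_RETURNS = {"{}", "[]", "()", "set()"}

def empty_container_stubs(source: str) -> list:
    """Names of functions whose only real statement returns an empty container."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Drop a leading docstring so `def f(): "doc"; return {}` still counts.
        body = [s for s in node.body
                if not (isinstance(s, ast.Expr) and isinstance(s.value, ast.Constant))]
        if (len(body) == 1 and isinstance(body[0], ast.Return)
                and body[0].value is not None
                and ast.unparse(body[0].value) in EMPTY_RETURNS):
            hits.append(node.name)
    return hits
```

Comparing the unparsed return value against a small string set keeps the check trivially extensible to other stub constants.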


Fragmented god function: AST clone detection


SPAR anomaly A4 was a file with 12 one-liner helper functions:

def _compute_r1(x): return x * 1.1
def _compute_r2(x): return x * 1.2
def _compute_r3(x): return x * 1.3
# ... through r12

Each function individually looks clean: low complexity, no nesting, short. No single function exceeds any per-function threshold. But collectively this is a decomposed god function: a large computation split into structurally identical fragments that evade per-function gates.

The new pattern: function_clone_cluster.

How it works. For each file, build a 30-dimensional histogram of AST node types for every function: how many If nodes, Return nodes, Call nodes, BinOp nodes, and so on. The histogram is normalized to a probability distribution. Then compute pairwise Jensen-Shannon Divergence between all function pairs. JSD is bounded between 0 and 1. Two functions with near-identical AST structure produce JSD close to 0.

Functions with JSD < 0.05 get an edge in a graph. BFS finds connected components. The largest component is the clone cluster.

Thresholds:
  >= 6 functions in cluster: CRITICAL
  >= 4 functions in cluster: HIGH

Why JSD and not simpler metrics. Cosine similarity or Euclidean distance on raw histograms don't handle sparse distributions well: short functions have mostly empty histograms, and small absolute differences dominate. JSD compares distributions rather than raw vectors and stays stable when most histogram dimensions are near zero. It also has an upper bound of 1, which makes the 0.05 threshold interpretable rather than dataset-dependent.

The JSD threshold (0.05) was calibrated against the internal test corpus. It will produce false positives on files with many similar utility functions โ€” for example, a large set of _validate_field_X() validators that are structurally identical by design. Adjust via --config if needed.
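Putting the described pieces together (per-function histograms, pairwise JSD, connected components), here is a self-contained sketch. It is my simplification: the real pattern uses a fixed 30-dimensional histogram, while this version histograms whatever node types appear, and all function names are mine.

```python
import ast
import math
from collections import Counter
from itertools import combinations

def node_histogram(fn):
    """Normalized AST node-type histogram for one function."""
    counts = Counter(type(n).__name__ for n in ast.walk(fn))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def jsd(p, q):
    """Jensen-Shannon divergence with base-2 logs, so the result is in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def clone_cluster(source, threshold=0.05):
    """Names in the largest connected component of near-identical functions."""
    fns = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]
    hists = [node_histogram(f) for f in fns]
    adj = {i: set() for i in range(len(fns))}
    for i, j in combinations(range(len(fns)), 2):
        if jsd(hists[i], hists[j]) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    best, seen = [], set()
    for start in adj:
        if start in seen or not adj[start]:
            continue
        comp, queue = [], [start]
        seen.add(start)
        while queue:  # BFS over clone edges
            u = queue.pop(0)
            comp.append(u)
            for v in adj[u] - seen:
                seen.add(v)
                queue.append(v)
        if len(comp) > len(best):
            best = comp
    return [fns[i].name for i in sorted(best)]
```

Structurally identical one-liners produce identical histograms (JSD = 0), so they all land in one component, while a function with different control flow stays outside it.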


Placeholder variable naming (v1.0)

SPAR anomaly A5 was vocabulary-clean code with zero semantic content:

def aggregate(a, b, c, d, e, f, g):
    r1 = a + b
    r2 = r1 * c
    r3 = r2 - d
    # ... through r12
    return r12

No buzzwords. No docstring bloat. Every traditional linter passes this. The new placeholder_variable_naming pattern applies two checks:

  1. Single-letter parameter density: 5 or more single-letter parameters (excluding self, cls, _) → HIGH.
  2. Sequential numbered variables: a run of 8 or more → HIGH; 4 or more → MEDIUM.

This is v1.0: it detects naming style, not semantic quality. Known false positive zone: scientific and math libraries legitimately use single-letter conventions (x, y, z, mu, sigma). Suppress with domain_overrides in .slopconfig.yaml.
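A sketch of the two checks (function and helper names like longest_sequential_run are hypothetical; the thresholds are the ones listed above):

```python
import ast
import re

def longest_sequential_run(names):
    """Length of the longest run like r1, r2, r3 sharing one prefix."""
    best = run = 0
    prev = None
    for name in names:
        m = re.fullmatch(r"([A-Za-z_]+?)(\d+)", name)
        if m and prev and m.group(1) == prev[0] and int(m.group(2)) == prev[1] + 1:
            run += 1
        elif m:
            run = 1
        else:
            run = 0
        prev = (m.group(1), int(m.group(2))) if m else None
        best = max(best, run)
    return best

def placeholder_flags(source):
    issues = []
    for fn in ast.walk(ast.parse(source)):
        if not isinstance(fn, ast.FunctionDef):
            continue
        # Check 1: single-letter parameters; "_" excluded explicitly
        # (self/cls already fail the length-1 test).
        singles = [a.arg for a in fn.args.args
                   if len(a.arg) == 1 and a.arg != "_"]
        if len(singles) >= 5:
            issues.append((fn.name, "single_letter_params", "HIGH"))
        # Check 2: sequential numbered assignment targets (r1, r2, ...).
        targets = [t.id for stmt in ast.walk(fn) if isinstance(stmt, ast.Assign)
                   for t in stmt.targets if isinstance(t, ast.Name)]
        run = longest_sequential_run(targets)
        if run >= 8:
            issues.append((fn.name, "sequential_vars", "HIGH"))
        elif run >= 4:
            issues.append((fn.name, "sequential_vars", "MEDIUM"))
    return issues
```

Note that both checks read only names, never semantics, which is exactly the v1.0 limitation called out above.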


SPAR result after v3.1.0

SPAR score: 85 / 100  [PASS]

Layer A: 5/5 anchors consistent
Layer B: 4 documented limitations (no regressions)
Layer C: C2 inflation_blindspot [BLIND_SPOT - known scope limit]
         C3 ddc_annotation_gap  [BLIND_SPOT - known scope limit]

55 → 85: PASS.

The two remaining blind spots are not gaps to close; they're the documented scope limits of static analysis: a tool that reads the AST cannot determine whether arithmetic is semantically meaningful, or whether annotation-heavy imports serve a real runtime purpose. Those require a different class of model. Documenting the ceiling is part of the job.

The full SPAR methodology (how Layers A, B, and C work, why Layer A ground truth is hard to author from inside the codebase, and what "validating the validator" means in practice) is covered in tomorrow's post.


v3.1.1: the self-inspection patch


v3.1.0 and v3.1.1 shipped on the same day. The clone detection pattern introduced in v3.1.0 had a visibility gap: function_clone_cluster fired in the Issues section but produced no signal in the Core Metrics table. A community issue caught it within hours.

But before cutting v3.1.1, we ran the tool against itself, and the new patterns found something:

placeholder.py    deficit: 70.3  [CRITICAL_DEFICIT]
python_advanced.py  deficit: 74.0  [CRITICAL_DEFICIT]

Both files are part of the detection engine itself. Root cause: check_node methods with cyclomatic complexity of 20–31, caused by compound boolean logic that had accumulated across releases. The tool was flagging its own pattern implementations as having the exact complexity problems it was designed to detect.

We extracted four module-level helpers in placeholder.py (_strip_docstring, _has_abstractmethod, _empty_container_repr, _is_placeholder_stmt) and added _make_god_issue() and _collect_numbered_vars() to python_advanced.py. Each check_node method went from 20–70 lines to 8–15. The detector earned its own PASS before shipping the patch.

placeholder.py      70.3 → 43.7  [CRITICAL → SUSPICIOUS]
python_advanced.py  74.0 → 66.7  [CRITICAL → INFLATED_SIGNAL]

Additional v3.1.1 refinements:

  • Clone Detection row added to Core Metrics table (CRITICAL/PASS at a glance).
  • Table style unified to box.ROUNDED across all project output (was mixing three styles).
  • VS Code extension: extractJson() now strips [INFO] log lines before JSON.parse; previously, CLI log output appearing alongside the JSON caused silent parse failures. Workspace analysis was replaced with a QuickPick list of deficit files sorted by score; clicking one opens the file in the editor.

If you installed 3.1.0, upgrade to 3.1.1 before using clone detection in CI.


How this fits alongside existing tools


| Tool | Approach | What it sees that others don't |
| --- | --- | --- |
| Semgrep | Pattern-matching on AST | Rule violations you've pre-authored |
| SonarQube | Cognitive complexity, duplication, coverage | Complexity and coverage gaps, not structural emptiness |
| Radon | Cyclomatic complexity | Raw CC values; used internally by AI SLOP Detector |
| Bandit | Security rules | Security vulnerabilities |
| mutmut / cosmic-ray | Mutation testing | Whether your test suite catches real bugs |
| AI SLOP Detector | Metric-based structural analysis | Docstring theater, stub pipelines, fragmented logic, phantom imports |

The key gap: a file can be fully SonarQube-clean while containing zero actual logic (all stubs, all docstrings, all type annotations). Cognitive complexity doesn't measure whether the complexity is real. LDR does. Inflation does.
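To make that concrete, here is a deliberately naive toy LDR (my illustration; the shipped metric is more involved than counting statement lines):

```python
import ast

# Statement types treated as "executable logic" in this toy version.
LOGIC_NODES = (ast.Assign, ast.AugAssign, ast.Return, ast.If, ast.For,
               ast.While, ast.Raise, ast.Call)

def toy_ldr(source):
    """Fraction of non-blank lines that start an executable-logic statement."""
    non_blank = [ln for ln in source.splitlines() if ln.strip()]
    logic_lines = {n.lineno for n in ast.walk(ast.parse(source))
                   if isinstance(n, LOGIC_NODES)}
    return len(logic_lines) / len(non_blank) if non_blank else 0.0

stub = 'def process(data):\n    """Orchestrates the pipeline."""\n    pass\n'
real = "def add(a, b):\n    result = a + b\n    return result\n"
```

The stub is lint-clean, typed, and documented, yet its toy LDR is exactly zero; the two-line real function scores 2/3.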

The complementary tool here is mutation testing. SPAR tests whether the scorer measures what it claims. Mutation testing tests whether your tests catch what they claim to catch. Both are adversarial approaches to the meta-problem: how do you validate the validator?


Score evolution

If you're running AI SLOP Detector on an existing project, upgrading to 3.1.x will change your scores. The formula alignment in Refinement 1 increases the deficit on files with uneven dimension profiles, typically by 3–8 points. This is not drift; it's the scorer becoming more precise in the region where it matters most. Files that were borderline suspicious may move into inflated_signal. Check your CI threshold after upgrading.

Previous scores were valid estimates produced by the first-generation model. v3.1.x scores are tighter estimates with better sensitivity where dimensions are uneven, which is precisely the profile of AI-generated code.


Honest limitations

  • function_clone_cluster threshold (JSD < 0.05) was calibrated against the internal test corpus. It will fire false positives on legitimate utility function clusters. Adjust via --config.
  • placeholder_variable_naming v1.0 has no semantic context. def distance(x, y, z) is legitimate; the pattern doesn't know that.
  • SPAR score 85 means five ground truth anchors pass and eight of ten Layer C probes hold. The space of evasion patterns is open-ended. More in tomorrow's SPAR post.
  • The Layer A corpus is internally authored. External adversarial contributions would make it stronger.

Install / upgrade

pip install ai-slop-detector==3.1.1
# or
pip install --upgrade ai-slop-detector

VS Code extension: search "AI SLOP Detector" in Extensions, or install from VSIX:

code --install-extension vscode-slop-detector-3.1.1.vsix
# Scan a project
slop-detector --project ./your-project

# Machine-readable output
slop-detector --project ./your-project --json | jq '.file_results[] | {file: .file_path, deficit: .deficit_score}'

GitHub: flamehaven01/AI-SLOP-Detector

