Kwansub Yun

The Tool That Turned on Itself: AI-Slop-Detector v2.9.0 → v2.9.1

v2.8.0 fixed the math.
v2.9.0 gave the tool memory.
v2.9.1 was the uncomfortable version where we ran the detector on its own source code, and then had to actually fix what it found.
Here's the full story.


1. v2.9.0: Just one more thing

After shipping v2.8.0, I looked at the codebase and had the thought that's never good:

"This is almost there. Just one more thing."

Three "one more things" later, v2.9.0 was done.


1) Problem 1: The tool had no memory

Tracking Direction, Not Just Scores

Every run produced a score. That score disappeared.

You had no way to know if a file was getting better or worse. No way to know if the AI had been touching the same file repeatedly, each time nudging the deficit score up a little further.

Most linters (static analysis tools that check code for problems without running it) work this way: scan, report, forget. For tracking AI-generated code quality over time, that's not enough. The direction of change matters as much as the score itself.

The fix: SQLite auto-recording on every run. (SQLite is a lightweight, file-based database: no server required, just a single file on disk.)

# history.py - schema kept flat and queryable
_SCHEMA = """
CREATE TABLE IF NOT EXISTS history (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp       TEXT    NOT NULL,
    file_path       TEXT    NOT NULL,
    file_hash       TEXT    NOT NULL,   -- SHA256 prefix: only records on change
    deficit_score   REAL    NOT NULL,
    ldr_score       REAL    NOT NULL,
    inflation_score REAL    NOT NULL,
    ddc_usage_ratio REAL    NOT NULL,
    pattern_count   INTEGER NOT NULL,
    grade           TEXT    NOT NULL,
    git_commit      TEXT,               -- branch + commit auto-captured
    git_branch      TEXT
);
"""

Two things worth noting:

  • file_hash means we only write a new row when the file content actually changes. You're not logging every invocation; you're logging every meaningful change.

  • git_commit and git_branch are captured automatically. So when you look at a quality regression (a point where the score got worse), you can tie it to the exact commit that introduced it.
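The hash gate can be sketched in a few lines. This is illustrative, not the tool's actual code, and the 16-character prefix length is an assumption (the schema comment only says "SHA256 prefix"):

```python
import hashlib

def content_hash(source: str, prefix_len: int = 16) -> str:
    """SHA-256 prefix of file content; used to skip recording unchanged files."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()[:prefix_len]

last_hash = content_hash("def f(): pass\n")   # hash stored from the previous run
new_hash = content_hash("def f(): pass\n")    # hash of the file as it is now
should_record = new_hash != last_hash          # identical content: no new row
```

Only when the hashes differ does a new history row get written.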

The DB also auto-migrates on first run using safe ALTER TABLE (a standard SQL command that adds new columns without destroying existing data), so there's no manual schema management if you're upgrading from an older version.
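A minimal sketch of what such an additive migration looks like, assuming the git columns are the ones being added (the helper name is hypothetical, not from the codebase):

```python
import sqlite3

def migrate(conn: sqlite3.Connection) -> None:
    """Additive migration: add any missing columns, never drop existing data."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(history)")}
    # Columns a pre-v2.9.0 database would be missing
    for column, decl in (("git_commit", "TEXT"), ("git_branch", "TEXT")):
        if column not in existing:
            conn.execute(f"ALTER TABLE history ADD COLUMN {column} {decl}")
    conn.commit()
```

Because ALTER TABLE ... ADD COLUMN only appends, existing rows are untouched; the new columns are simply NULL for old entries.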

def detect_regression(self, file_path: str, window: int = 5) -> dict:
    entries = self.get_file_history(file_path, limit=window)
    if len(entries) < 2:
        return {"has_regression": False}

    latest   = entries[0].deficit_score
    baseline = entries[-1].deficit_score  # oldest in window

    delta = latest - baseline  # positive = score went up = got worse

    return {
        "has_regression": delta > 10.0,
        "delta": delta,
        "latest": latest,
        "baseline": baseline,
    }
$ slop-detector my_module.py --show-history

  my_module.py - Last 10 runs
  ────────────────────────────────────────────────
  2026-03-08 09:12  deficit=18.4  grade=B   ← today (worse)
  2026-03-07 22:41  deficit=12.1  grade=A
  2026-03-07 14:03  deficit=11.8  grade=A
  ────────────────────────────────────────────────
  [!] Regression detected: +6.3 over last 3 runs

You're no longer looking at a score. You're looking at a direction.


2) Problem 2: AI confidently imports packages that don't exist

Catching Hallucinated AI Imports

This one surprised me more than it should have.

import advanced_nlp_toolkit          # doesn't exist
from ml_utils import SmartPredictor  # doesn't exist
import dataforge.pipeline as dfp     # doesn't exist

def process(text):
    model = SmartPredictor.load("bert-optimized")
    return advanced_nlp_toolkit.analyze(text)

Syntactically valid. Plausible-looking. Broken at runtime.

AI models generate package names that sound right. They don't verify the packages exist. This is one of the more insidious failure modes: code review usually focuses on logic, not on whether the imports resolve.

Existing unused-import detectors won't catch this either. They check whether the import is used in the file. We check whether the import exists in the environment.

The fix: a four-layer resolution index.

def _module_exists(name: str) -> bool:
    # Layer 1: sys.builtin_module_names - C-compiled modules (sys, builtins, _io)
    #           find_spec returns None for these, so they must be checked first
    # Layer 2: sys.stdlib_module_names - full stdlib (Python 3.10+)
    #           covers _thread, _collections_abc, and other internal modules
    # Layer 3: importlib.metadata.packages_distributions()
    #           handles PIL (Pillow), cv2 (opencv-python) - install name ≠ import name
    #           also normalizes my-lib → my_lib
    # Layer 4: importlib.util.find_spec fallback
    #           slower filesystem check for namespace packages and editable installs
    ...
| Layer | Why it's needed |
| --- | --- |
| builtin_module_names | C modules have no file; find_spec returns None |
| stdlib_module_names | Covers _thread, _collections_abc, internal stdlib |
| packages_distributions | PIL ≠ Pillow, cv2 ≠ opencv-python |
| find_spec fallback | Editable installs, namespace packages |

One important detail: relative imports are excluded by design.

A relative import (like from . import utils) references another file within the same project; it's not a third-party package and can't be phantom. Regex would accidentally flag these as missing packages. AST parsing knows the difference because it reads the actual syntax (node.level > 0 means "this import is relative"), not just the text.

import ast

tree = ast.parse(source_code)  # source_code: the file's contents as a string
for node in ast.walk(tree):
    if isinstance(node, ast.ImportFrom):
        if node.level > 0:   # from . import x  or  from .. import y
            continue         # local file - skip

The detector errs toward false negatives (missing a real problem) on resolution errors rather than false positives (wrongly flagging valid code). In plain terms: if the tool isn't sure whether a package exists, it assumes it does. A false alarm on a legitimate import would erode trust in the tool faster than a missed phantom.


3) Problem 3: ML accuracy of 1.000 is a bug, not a result

Why 100% ML Accuracy Is a Bug

After integrating the ML pipeline in v2.8.0:

RandomForest: Accuracy=1.000, Precision=1.000, Recall=1.000, F1=1.000

In machine learning, Accuracy measures how often the model is right overall, Precision measures how rarely it raises false alarms, Recall measures how rarely it misses real problems, and F1 is a combined score. Getting 1.000 on all four simultaneously is not impressive; it is a red flag.

Not a success. A data leakage problem. Data leakage means the model was accidentally given information during training that it wouldn't have in real use โ€” making the results look far better than they actually are.

Labels (the "correct answers" used to train the model) were generated from deficit_score >= 30. Features (the input signals the model learns from) were ldr_score, inflation_score, and ddc_score โ€” the exact components that sum to produce deficit_score. The model learned to reproduce an addition formula. We trained a calculator.
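The leak is easy to demonstrate with toy numbers (all scores below are made up for illustration):

```python
# Hypothetical per-file scores: deficit = ldr + inflation + ddc
features = [(12.0, 9.5, 6.0), (20.0, 8.0, 7.5), (3.0, 2.0, 1.0), (15.0, 10.0, 9.0)]

labels = [sum(f) >= 30.0 for f in features]   # what we asked the model to predict
learned = [sum(f) >= 30.0 for f in features]  # what the model actually learned: the formula

accuracy = sum(p == y for p, y in zip(learned, labels)) / len(labels)
# accuracy is 1.0 on ANY data, because the target is a function of the features
```

When the label is computable from the features by construction, perfect scores measure nothing.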


2. The v2.9.0 fixes

Real codebase data. CodeSearchNet (a public dataset of 500k Python functions scraped from open-source repositories) has structure and patterns that our synthetic generator can't produce. Even with self-supervised labels (meaning we let our own math engine label the data rather than humans), a genuinely different distribution (variety of code styles and patterns) forces the model to generalize beyond simply memorizing our formula.

def load_codesearchnet(self) -> list[RealSample]:
    ds = load_dataset("code_search_net", "python", split="train")
    return list(self._label_stream(
        ((row["func_code_string"], row.get("func_name", "")) for row in ds),
        source="code_search_net",
    ))

History data as training signal. Files that go from deficit=42 to deficit=0 across real development runs represent longitudinal change that no generator can produce. We added --export-history specifically so this data can feed back into the ML pipeline.

slop-detector --export-history training_data.jsonl
# → JSONL ready for DatasetLoader.load_jsonl()

One other change: RandomForestClassifier now trains with class_weight="balanced". The real slop rate in public codebases is around 4%. Without balancing, the model just predicts "clean" for everything and gets 96% accuracy, which is useless.
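The arithmetic behind that claim, using the balanced-weight formula scikit-learn documents (n_samples / (n_classes * count per class)); the counts are illustrative:

```python
from collections import Counter

# With a ~4% slop rate, the degenerate "always predict clean" model scores 96%
n = 1000
labels = [1] * 40 + [0] * 960      # 1 = slop, 0 = clean
always_clean = [0] * n
accuracy = sum(p == y for p, y in zip(always_clean, labels)) / n   # 0.96

# class_weight="balanced" reweights each class by n / (n_classes * count)
counts = Counter(labels)
weights = {cls: n / (2 * counts[cls]) for cls in counts}
# weights[1] == 12.5, weights[0] ≈ 0.52: each slop sample carries 24x the weight
```

With those weights, ignoring the minority class becomes expensive during training instead of free.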


3. v2.9.1: We ran it on ourselves

Dogfooding: Running v2.9.0 on Itself

A few hours after shipping v2.9.0, we did the obvious thing: ran slop-detector --project src/ on the detector's own source.

It found three files above the threshold.

cli.py                    deficit=53.5   SUSPICIOUS
registry.py               deficit=39.5   SUSPICIOUS
question_generator.py     deficit=30.0   SUSPICIOUS

We had to fix our own code. Here's what that looked like.

We also updated the README during this process to reflect what the self-inspection made obvious:

"Authorship is irrelevant. The code speaks for itself."

It didn't flag those functions because AI wrote them. It flagged them because they were too complex. That's the whole idea.


1) cli.py (53.5 → 29.1): five god functions decomposed

Deconstructing the God Functions

print_rich_report, main, generate_markdown_report, generate_text_report, and _handle_output each exceeded the complexity-10 / 50-line threshold that triggers the god_function pattern. A "god function" is a function that tries to do everything: too long, too complex, impossible to test or reason about in isolation.

Extracted 9 single-responsibility helpers:

| Original | Extracted helpers |
| --- | --- |
| print_rich_report() | _build_rich_summary_tables, _build_rich_files_table, _render_rich_project, _build_single_file_content, _append_pattern_issues_rich, _render_rich_single_file |
| main() | _build_arg_parser, _evaluate_ci_gate, _run_optional_features |
| _handle_output() | _write_file |
| generate_markdown_report() | _md_summary_section, _md_test_evidence_section, _md_findings_section |

main() complexity dropped from 25 → 14. Every function now fits within limits. The irony of a slop detector with god functions wasn't subtle.


2) registry.py (39.5 → clean): global statement + DDC false positive

Fixing False Positives and Anti-Patterns

Two separate issues flagged here.

The global statement. registry.py used a lazy-init singleton pattern, meaning a single shared instance of an object, created only on first use and stored in a global variable. The pattern detector flags global statements as structural anti-patterns (they create hidden shared state that makes code harder to test and reason about), and correctly so. We replaced it with eager module-level initialization (creating the object once when the module loads), which also happens to be thread-safe.

DDC false positive. DDC (Dependency-to-Code ratio) measures what fraction of imported packages are actually used. BasePattern was imported only for type annotations: hints that tell other developers and type-checking tools what data types a function expects, but which are never executed at runtime. The usage checker correctly skips annotation-only usage when calculating DDC, which meant it scored the import as 0% used: a false alarm.

The fix: move annotation-only imports under if TYPE_CHECKING:, a Python convention that says "only process this import when a type checker is running, not at runtime."

from __future__ import annotations
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from slop_detector.patterns.base import BasePattern

DDC now recognizes them as type-checking imports, excluded from the usage ratio by design.


3) question_generator.py (30.0 → clean): same DDC fix + Python 3.8 compatibility

The Math Found the Problems

Same TYPE_CHECKING fix for FileAnalysis.

Additionally: the file had drifted to Python 3.10+ union syntax, writing int | None instead of Optional[int] to mean "this can be an integer or nothing." The newer syntax is cleaner, but it breaks on Python 3.8 and 3.9. We converted back to Optional[int], Optional[str], List[Q]. The project targets python >= 3.8, and that constraint has to hold.
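A minimal before/after illustration of the compatible form (first_score is a made-up helper, not from the codebase):

```python
from typing import List, Optional

def first_score(scores: List[float]) -> Optional[float]:
    """3.8-compatible annotations: Optional[float], not the 3.10-only float | None."""
    return scores[0] if scores else None
```

On 3.10+ both spellings work; on 3.8/3.9 only the typing module forms parse, so a project targeting >= 3.8 has to stick with them (short of quoting every annotation).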


4) The final numbers

| Metric | v2.9.0 | v2.9.1 |
| --- | --- | --- |
| Deficit files | 3 | 0 |
| Avg deficit score | 11.65 | 9.57 |
| Weighted deficit score | 15.88 | 12.42 |

188 tests. Zero regressions. Under 5 seconds.


4. What this looks like in practice

pip install ai-slop-detector==2.9.1

# Analyze + auto-record history
slop-detector your_file.py

# See per-file trend
slop-detector your_file.py --show-history

# 7-day project aggregate
slop-detector --history-trends

# Export for ML training
slop-detector --export-history training_data.jsonl

# Opt out of recording
slop-detector your_file.py --no-history

5. Honest limitations

Flamehaven's working principle is straightforward: honesty first, then trust. So before the final thoughts, here is what this tool cannot honestly claim yet.


1) This is a one-person project, and the default weights reflect that.

The core scoring formula:

# src/slop_detector/config.py
"weights": {"ldr": 0.40, "inflation": 0.30, "ddc": 0.30}

These numbers were not derived statistically. They were tuned against Flamehaven's internal codebase and hardcoded. They work well for what we build. For a framework like Django or Spring, where boilerplate (repetitive structural code that the framework requires, not that the developer chose) is a given, the ldr weight of 0.40 will likely generate false positives (incorrectly flagging legitimate code as problematic). This is a known and honest limitation of a tool built by one developer against one architecture.

The mitigation exists: .slopconfig.yaml lets you adjust weights per project. But the burden is currently on the user to know they need to. That's not good enough for a general-purpose gate, and we know it.
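What that per-project override might look like. This is a hypothetical sketch: the key names mirror the weights block in config.py, but the exact .slopconfig.yaml schema is not documented here:

```yaml
# .slopconfig.yaml - hypothetical shape, mirroring the defaults in config.py
weights:
  ldr: 0.25        # lowered for boilerplate-heavy frameworks
  inflation: 0.40
  ddc: 0.35        # weights should still sum to 1.0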


2) The ML pipeline took a different path, and we think it's the right one.

The tool includes an optional machine learning layer that can score files alongside the main math-based engine. The honest problem: the training labels were generated by the math engine itself (deficit_score >= 30 = slop). The features fed to the ML model are the exact components that produce that score. In practice, the model learned to approximate the formula it was trained on โ€” not to detect anything independently. We trained a calculator.

# src/slop_detector/ml/__init__.py
"""EXPERIMENTAL: ML-based classification. Requires [ml] extras. Prototype status."""

The obvious fix would be: pull real-world code from a large public dataset (we tried CodeSearchNet: 500k Python functions), label it "AI-generated" vs "human-written," and train a classifier. Accuracy goes up. But here's why we didn't go that route:

The entire premise of this tool is that authorship doesn't matter. A god function written by a senior engineer at 2am is just as problematic as one generated by Copilot. If we train a model to detect "AI code," we're working directly against our own stated philosophy:

"Authorship is irrelevant. The code speaks for itself."

So instead, we redirected the ML toward a different question entirely: not "who wrote this?" but "is this getting worse?"

# Not: "is this AI code?"
# But: "has this file degraded since last time?"
delta = latest_deficit_score - baseline_deficit_score
return {"has_regression": delta > 10.0, "delta": delta}

A file that goes from deficit=12 to deficit=28 across three editing sessions is a signal worth catching, regardless of whether a human or an AI made those edits. That's the signal we want the ML to learn from: real usage history, not authorship labels.

The ML module stays EXPERIMENTAL until enough history data accumulates from real usage to train against. We are not there yet. We will say so plainly.


3) JavaScript and TypeScript analysis is shallower than for Python, and that matters.

The tool's core strength is that it reads code structure, not code text. For Python, this means parsing the actual Abstract Syntax Tree (AST), the logical skeleton of the code, to measure things like how deeply nested a function is, or how complex its control flow is. Text-based tools (like standard linters) can miss these structural problems entirely.
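To make "reads structure, not text" concrete, here is a toy nesting-depth measure over the Python AST. max_nesting is illustrative, not the detector's actual metric:

```python
import ast

def max_nesting(source: str) -> int:
    """Max depth of nested control-flow blocks: a signal regex matching can't see."""
    blocks = (ast.If, ast.For, ast.While, ast.Try, ast.With)

    def depth(node: ast.AST, d: int = 0) -> int:
        d += isinstance(node, blocks)  # entering a control-flow block adds one level
        return max([d] + [depth(child, d) for child in ast.iter_child_nodes(node)])

    return depth(ast.parse(source))

max_nesting("for i in x:\n    if i:\n        while i:\n            pass")  # → 3
```

A regex can spot the keywords for, if, and while in that snippet, but only a parse tree knows they are nested three deep rather than sitting side by side.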

For JavaScript and TypeScript, we achieve the same depth only when an optional library called tree-sitter is installed. Without it, the analyzer falls back to regex pattern matching:

# src/slop_detector/languages/js_analyzer.py
# Fallback (regex mode, zero-dependency):
def _analyze_regex(self, ...):

Regex can find suspicious text patterns. It cannot accurately measure nesting depth or cognitive complexity (how hard a function is for a human brain to follow: the number of branches, loops, and conditions that must be held in mind simultaneously). This means a JS/TS file with deeply nested god functions may score clean if tree-sitter is absent, which directly contradicts the promise that this engine reads structure, not appearance.

It is the most technically inconsistent part of the current codebase. A patch is targeted for v2.9.2.


4) There is no wild benchmark.

188 tests pass at 100%. Every one of them was written against fixtures we created: hand-crafted worst-case examples designed specifically to be caught. There is no published result of running this tool against Django, FastAPI, NumPy, or any large external codebase we did not build.

Until that exists, "this tool works" means "this tool works on code that looks like ours." That is a meaningful claim. It is not a universal one.

A wild benchmark against at least one major open-source Python project is the first prerequisite before v3.0.


6. Final thoughts

We shipped v2.9.0 to give the tool memory. We shipped v2.9.1 because the tool used that memory against us.

And somewhere in the middle of patching our own god functions, the README update started to feel less like a tagline and more like a description of what actually happened:

"Catches the slop that AI produces - before it reaches production. Authorship is irrelevant. The code speaks for itself."

A lot of "AI code quality" tooling frames the question as did a human write this or did an AI? We spent two releases proving that's the wrong question. cli.py had five god functions. The tool found them. Whether Copilot wrote them or we did at 2am is beside the point.

The score is the score.

Over time, the history database becomes a longitudinal record (a timeline of how code quality actually moves, not just what it scores at a single point) of how code evolves under AI-assisted development. The ML pipeline now has a path to train on real signals, not synthetic ones we constructed ourselves. And the self-inspection result is the clearest proof of concept we've had: the math found the problems, and we fixed them.

That's the whole idea. It was always the whole idea.


7. Repository & Documentation

VS Code Extension - install directly from the marketplace

GitHub - open source, zero core dependencies.

pip install ai-slop-detector


Authorship is Irrelevant


What part of your codebase would you least want a structural analyzer to look at? That's probably where you should start.

Top comments (0)