Kwansub Yun

Posted on • Originally published at flamehaven.space

It Gets Smarter Every Scan: AI-SLOP Detector v3.5.0 and the Self-Calibration Loop


Previously: 🔻 v3.1.0 – Three Formula Refinements and the Adversarial Tester That Found Them ·
🔻 v2.9.0/v2.9.1 – The Tool That Turned On Itself

By late 2025, everyone was building with AI. A weekend was enough to launch a SaaS app, and by Monday it was already on Product Hunt. The code looked finished, the UI worked, and the demo landed. That was also the problem.

In 2026, some of the consequences started arriving in public. Exposed databases, weak security boundaries, brittle automation, and production systems that looked polished enough to ship but had clearly not been understood at the level their surface confidence implied. Not every one of those failures belongs to static analysis, and it would be too easy to pretend otherwise. But many of them still point to the same upstream condition: code that looks complete long before it deserves trust.

That is the layer this release is about.


The breach is the headline. The review gap is the story.

Structurally plausible, functionally thin

A missing security rule is not the same thing as a stubbed auth function. A runtime-only bug is not the same thing as a phantom import.

A broken architecture is not the same thing as a buzzword-heavy helper. These are different failure classes, and any serious tool has to respect that difference.

output scales while oversight stagnates

What they often share, though, is the review environment that let them through. AI increased output volume, increased speed, and increased surface polish.

Review depth did not increase with it. That matters because AI-generated code has a very recognizable habit: it often looks complete before it is complete.

It compiles. It passes tests. It sounds like it knows what it is doing. Then you open the function.

from typing import Any, Dict

def calculate_quality_score(data: Dict[str, Any]) -> float:
    """
    Advanced multi-dimensional quality assessment using
    proprietary algorithms with statistical normalization,
    entropy-based weighting, and dynamic threshold calibration.
    Returns a score between 0 and 100.
    """
    # TODO: implement the actual algorithm
    return 85.0

This is not noisy code. It is confident emptiness. In an analytics path, it becomes false certainty. In a payment path, it becomes a defect. In an auth path, it becomes risk.

The issue is not that AI writes ugly code. The issue is that AI reliably produces code that is structurally plausible while functionally thin.

That is a narrower claim than “AI is dangerous,” but it is also far more useful.


We ran into this ourselves

4-dimensional weighted geometric mean

This did not begin as a theory about other people’s repos. It began when we found a flaw in our own scoring model. Back in v2.8.0, we discovered that our formula was accidentally rewarding spaghetti code.

A large god function could sometimes look healthier than a small clean function because complexity was dividing the penalty instead of amplifying it.

That was backwards, so the math changed.

AI-SLOP Detector now evaluates four dimensions:

  • LDR for logic density
  • Inflation for jargon density relative to real logic
  • DDC for dependency usage rather than dependency presence
  • Purity for critical structural defects that should drag the whole score down

These are combined with a weighted geometric mean, not an arithmetic average.

Why that matters:

  • one strong-looking axis should not be able to hide a collapsed one
  • a polished docstring should not rescue empty logic
  • if one important dimension fails, the whole score should feel it

That is the scoring philosophy underneath the tool. But even that was not enough.
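The combination step can be sketched in a few lines. This is an illustrative reconstruction of a weighted geometric mean over the four dimensions named above; the weight values and the 0-to-1 score scale are invented for this sketch, not the detector's real configuration.

```python
import math

# Hypothetical weights -- the real detector tunes these per project.
WEIGHTS = {"ldr": 0.35, "inflation": 0.20, "ddc": 0.20, "purity": 0.25}

def weighted_geometric_mean(scores: dict) -> float:
    """Combine per-dimension scores (each in 0..1) into one score.

    A geometric mean multiplies weighted factors instead of averaging
    them, so one collapsed dimension drags the whole result down: a
    polished axis cannot hide an empty one.
    """
    total_weight = sum(WEIGHTS.values())
    log_sum = sum(
        WEIGHTS[dim] * math.log(max(scores[dim], 1e-9))  # guard against log(0)
        for dim in WEIGHTS
    )
    return math.exp(log_sum / total_weight)

# A file with strong surface metrics but near-zero logic density:
hollow = {"ldr": 0.05, "inflation": 0.9, "ddc": 0.9, "purity": 0.9}
balanced = {"ldr": 0.7, "inflation": 0.7, "ddc": 0.7, "purity": 0.7}

print(weighted_geometric_mean(hollow) < weighted_geometric_mean(balanced))  # True
```

An arithmetic average of the hollow file would land around 0.60 and look passable; the geometric mean lands near 0.33, because the collapsed LDR axis is felt by the whole score.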


Static analyzers have a threshold problem

Take a perfectly legitimate ML helper:

from typing import Any, Dict, List

import torch
from transformers import PreTrainedTokenizer

def prepare_training_batch(
    raw_samples: List[Dict[str, Any]],
    tokenizer: PreTrainedTokenizer,
    max_length: int = 512,
) -> Dict[str, torch.Tensor]:
    """
    Tokenize and pad samples for transformer training.
    Handles attention mask generation and HuggingFace
    tokenizer conventions for batch encoding.
    """
    return tokenizer(
        [s["text"] for s in raw_samples],
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )

There is nothing wrong with this code. But a generic detector may still overreact, because terms like tokenizer, attention mask, and HuggingFace can look suspicious if the analyzer does not understand the domain it is scanning. In a real ML codebase, those terms are normal. In a CRUD backend, some of them may be genuine anomaly signals.

That is the threshold problem. The same threshold can be wrong in one codebase and exactly right in another. A universal threshold sounds elegant, but real repositories are local. They have habits, idioms, and boilerplate that are legitimate inside one domain and suspicious inside another.

So the next problem became obvious: the tool had to learn the project it was scanning. That is the real center of v3.5.0.
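One way to picture "learning the project" is vocabulary that is normal in one domain and anomalous in another. The sketch below is invented for illustration; the term lists and the scoring rule are assumptions, not the detector's actual internals.

```python
# Hypothetical domain vocabularies -- invented for this sketch.
DOMAIN_VOCAB = {
    "ml": {"tokenizer", "attention", "embedding", "huggingface"},
    "crud-backend": {"endpoint", "migration", "serializer"},
}

def jargon_anomaly_score(terms: list, domain: str) -> float:
    """Fraction of buzzword-like terms that are foreign to the domain.

    In an ML repo, 'tokenizer' is ordinary vocabulary; in a CRUD
    backend the same word counts toward the anomaly score.
    """
    vocab = DOMAIN_VOCAB.get(domain, set())
    if not terms:
        return 0.0
    foreign = [t for t in terms if t.lower() not in vocab]
    return len(foreign) / len(terms)

terms = ["tokenizer", "attention", "huggingface"]
print(jargon_anomaly_score(terms, "ml"))            # 0.0 -- normal idiom
print(jargon_anomaly_score(terms, "crud-backend"))  # 1.0 -- anomaly signal
```

The same three words produce opposite signals depending on the project, which is exactly why a universal threshold cannot be right everywhere.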


What AI-SLOP Detector actually does

scanning for structural integrity

AI-SLOP Detector is a static analyzer built to catch a specific defect class that shows up repeatedly in AI-generated code: unimplemented stubs, disconnected pipelines, phantom imports, clone-shaped emptiness, placeholder-heavy production paths, and jargon inflation that outruns the actual logic. It is not a style linter, not a full security scanner, and not a runtime verifier. It is a detector for structural hollowness.

That distinction matters because it keeps the claim honest. The tool is not trying to solve every production risk. It is trying to catch one layer that becomes more expensive as AI output scales faster than human review.

pip install ai-slop-detector
slop-detector --init
slop-detector --project .

The workflow is the product story

why universal rules fail real repositories

What makes this release interesting is that it is not just “more patterns” or “more language support.” It is a workflow story.

The detector now has a real loop. It scans the file, classifies its role, computes the 4D score, applies structural pattern penalties, and writes the result to history. Then, once enough repeated scans exist, it revisits that history, extracts behavioral signals, tunes the weights inside bounded domain-aware limits, updates the configuration, and keeps scanning. That is the release.


That final stretch is what changed this from “detector upgrade” into “adaptive detector.” The tool no longer only evaluates code. It also learns from what happens after evaluation.


Self-calibration is the real headline

Mechanical self-calibration

Every scan is recorded to a local SQLite history database. That history is not just there for reporting. It becomes the signal surface for the next tuning step. Once enough repeated scans accumulate, the detector begins asking a simple question: when this file was flagged, what happened next?

That produces two behavior-derived event types. An improvement event means the file was flagged, later changed, and its deficit dropped meaningfully. A false-positive candidate means the file was flagged, then scanned again with the same content and little meaningful score movement.
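The two event types described above can be sketched as a comparison of consecutive scans of the same file. The tuple shape, thresholds, and function name here are invented for illustration; only the improvement / false-positive-candidate distinction comes from the release itself.

```python
def classify_event(prev, curr, deficit_drop=10.0):
    """Compare two consecutive scans of one file.

    prev and curr are (content_hash, deficit) pairs -- a hypothetical
    shape for rows in the scan-history database.

    'improvement': the file changed and its deficit dropped meaningfully.
    'fp_candidate': the file was rescanned unchanged with little movement.
    """
    prev_hash, prev_deficit = prev
    curr_hash, curr_deficit = curr
    if prev_hash != curr_hash and prev_deficit - curr_deficit >= deficit_drop:
        return "improvement"
    if prev_hash == curr_hash and abs(prev_deficit - curr_deficit) < 1.0:
        return "fp_candidate"
    return "inconclusive"

print(classify_event(("a1", 40.0), ("b2", 12.0)))  # improvement
print(classify_event(("a1", 40.0), ("a1", 40.0)))  # fp_candidate
```

Note the third, implicit outcome: a file that changed but barely moved is neither label, so it contributes no tuning signal.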

That difference is more important than it sounds. A lot of “self-improving” systems quietly learn from their own outputs. They mark something suspicious, then later use that same judgment as the truth signal for tuning. The system becomes better at agreeing with itself. That is not calibration. That is self-imitation with cleaner packaging.

v3.5.0 tries to avoid that trap. Its labels are not taken from the scoring formula. They are inferred from developer behavior around repeated scans. The formula says, “this looks suspicious.” The next run reveals whether a human treated that suspicion as real.

That signal is not perfect. An unchanged file is not always a false positive. It may be legacy code, low priority, or simply out of scope. But it is still a healthier signal than teaching the formula to imitate its own prior outputs.


What the loop actually looks like

The loop is not mystical. It is mechanical. Repeated scans accumulate, improvement and likely-FP events are extracted, candidate weight sets are evaluated, the search is bounded around the project’s current domain anchor, and if a strong enough winner appears, the config gets updated. If a calibrated weight drifts too far from the domain anchor, the system emits a warning.


This is what makes the title true. It gets smarter every scan, not because a hidden model is hallucinating taste, but because repeated use creates a bounded feedback loop. That is much less magical, and much more trustworthy.
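A bounded search of this kind can be sketched in miniature. Everything below is an assumption for illustration: the toy objective, the single tunable weight, the step size, and the drift bound all stand in for whatever the real calibrator does.

```python
def evaluate(weight, events):
    # Toy objective, invented for the sketch: reward a weight that would
    # have flagged 'improvement' files and penalize one that would have
    # flagged 'fp_candidate' files.
    hits = sum(1 for kind, signal in events
               if kind == "improvement" and weight * signal > 0.5)
    misses = sum(1 for kind, signal in events
                 if kind == "fp_candidate" and weight * signal > 0.5)
    return hits - misses

def calibrate(current, anchor, events, max_drift=0.15, step=0.05):
    """Try small perturbations of one weight, bounded around the anchor."""
    candidates = [current - step, current, current + step]
    # Bounded search: never leave the domain anchor's neighborhood.
    candidates = [w for w in candidates if abs(w - anchor) <= max_drift]
    return max(candidates, key=lambda w: evaluate(w, events))

events = [("improvement", 0.85), ("improvement", 0.85), ("fp_candidate", 0.7)]
print(calibrate(0.55, anchor=0.5, events=events))  # nudges toward 0.6
```

The point of the bound is the same as in the release: behavior-derived events can nudge the weight, but they can never drag it out of the neighborhood the domain anchor defines.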


Why --init matters more now

There is another reason the calibration story works better in v3.5.0. The detector no longer starts from a generic nowhere. --init now performs domain-aware bootstrap, detects the likely project type, and seeds the starting weights accordingly. That means calibration starts near the right neighborhood instead of wandering across the whole map.

That improves the first week of use, not just the tenth. And that matters, because bad first impressions kill adaptive tools. If the detector is only smart after a month of annoying you, it will never survive long enough to get smart.

Good initialization is not a convenience feature. It is part of whether the loop can gather clean signal at all.
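What a domain-aware bootstrap could look like, in miniature: detect a likely project type from marker files, then seed the weight anchor from it. The marker files, domain names, and seed values below are invented for this sketch; they are not the tool's real heuristics.

```python
from pathlib import Path

# Hypothetical per-domain seed weights -- invented for illustration.
SEED_WEIGHTS = {
    "ml":      {"ldr": 0.30, "inflation": 0.15, "ddc": 0.25, "purity": 0.30},
    "web":     {"ldr": 0.35, "inflation": 0.25, "ddc": 0.15, "purity": 0.25},
    "default": {"ldr": 0.25, "inflation": 0.25, "ddc": 0.25, "purity": 0.25},
}

def detect_domain(project: Path) -> str:
    """Guess the project type from common marker files."""
    reqs = project / "requirements.txt"
    if reqs.exists() and any(
        lib in reqs.read_text() for lib in ("torch", "transformers", "sklearn")
    ):
        return "ml"
    if (project / "package.json").exists():
        return "web"
    return "default"

def bootstrap(project: Path) -> dict:
    domain = detect_domain(project)
    # Calibration later starts from this anchor instead of a generic zero point.
    return {"domain": domain, "weights": SEED_WEIGHTS[domain]}
```

Starting from a domain anchor is what lets the calibration loop converge on clean signal in the first week instead of the first month.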


JS, TS, and Go are not side quests

v3.5.0 also expands analysis coverage to Go, JS, JSX, TS, and TSX. That is useful on its own, but the deeper significance is architectural. Structurally hollow AI-generated code is not a Python-only phenomenon. If the detector’s long-term direction is project-local calibration rather than one-size-fits-all scoring, then wider language support is not a side feature. It is the natural expansion of the same idea.

Different languages. Same review gap. Same loop.


The honest boundary

This tool still does not close every gap. It will not fix missing infrastructure controls, catch every runtime bug, prove the architecture is correct, or replace security review. A clean structural profile is not proof of safety.

What it can do is narrow one expensive blind spot: the distance between code that looks finished and code that carries enough actual logic to deserve confidence. That is a smaller claim than “AI risk solved,” but it is also the kind of claim that survives production better.


Why this matters now

AI has made software generation dramatically cheaper. It has not made understanding cheaper. That difference is where governance debt begins to accumulate.

If teams can now generate far more code than they can truly review, then the review stack needs tools that operate below style and above syntax. Not tools that ask whether the code is pretty, but tools that ask whether the implementation carries enough substance for the confidence wrapped around it.

That is the space AI-SLOP Detector is trying to occupy. Not the whole problem. Just one layer that became impossible to ignore.


Quick start

pip install ai-slop-detector
cd my-project/
slop-detector --init
slop-detector --project .

Fix what clearly deserves fixing. Leave legitimate idioms alone. Then keep scanning. If the loop is doing its job, the next pass should know your codebase a little better than the first one did.


GitHub: flamehaven01/AI-SLOP-Detector
