
Kwansub Yun


I Built an Ecosystem of 46 AI-Assisted Repos. Then I Realized It Might Be Eating Itself.


I want to tell you about a problem I didn't see coming.
In retrospect, it was inevitable. It's not a bug. It's not a security vulnerability. It's a structural blind spot, one that grows quietly as your tools get more sophisticated. And the more sophisticated they get, the harder it becomes to see.


1. What I've Been Building

The Baseline: Compounding Complexity


A bit of context. The problem only makes sense with it.

For the past few years I've been developing a personal AI development framework. Not a product. A working methodology and a set of tools that build on each other.

  • Stage one.
    Before "AI agents" and "skills files" were common vocabulary, I was writing structured markdown documents fed directly to language models. Not prompts. Contracts. Documents that told the model how to think about a specific domain. If you've worked with system prompts or agent instruction files — that's the concept. I was doing an early version of it.

  • Stage two.
    Those documents became Python code. Ideas that previously only existed as model instructions got turned into actual running software. Over time this grew into 46 GitHub repositories. Tools for RAG pipelines, biological simulation, physics modeling, code analysis, medical AI governance, and more. Each repository is a crystallized idea that used to just be a document.

  • Stage three.
    This is where the complexity really climbed. I started building reasoning engines. Tools that don't just run — they analyze, score, and certify other tools. A code quality analyzer using custom mathematical models to detect AI-generated slop. A certification platform that runs automated audits on software projects. A governance engine for medical AI triage systems.

These aren't toy projects. They have test suites, CI pipelines, versioned releases, schema contracts, and governance documentation. Some are being evaluated for clinical contexts. They're real. And they're deeply interconnected.

That interconnection is where the problem lives.


2. The Biology Lesson I Should Have Taken More Seriously

The Illusion of Cohesion


In genetics, inbreeding depression is what happens when a population reproduces only within itself for too long.

The mechanism is simple. Every organism carries harmful genetic variants. Most stay dormant because they're recessive: causing damage takes two copies, one from each parent. In a genetically diverse population, that rarely happens. In an inbred population, everyone shares the same ancestors. The odds climb. Traits that should stay suppressed start expressing.

Biological Inbreeding Depression


The insidious part: the population doesn't look sick. Internal coherence is maintained. Animals look normal. Systems function. The weakness only becomes visible under environmental stress. And by then it's structural. Not fixable with a quick patch.

I'd been thinking about my toolchain in terms of "integration" and "cohesion." These are supposed to be good things in software. But there's a point where cohesion becomes something else.

Here's what my toolchain actually looked like:

Mathematical model design  → built on my framework's philosophy
Validation criteria        → built by my quality analyzer
Validation execution       → uses algorithms from my reasoning engine
Result interpretation      → outputs fed back into my own scoring system

From input to output: the same set of assumptions. Every tool built by the same person. The same mental model. Calibrated against the same baseline.

The pathology: inbreeding depression


That's fine — until you need to know whether the mental model itself is wrong.

And this is where AI-assisted development makes it worse. LLMs are good at one thing: producing outputs that conform to whatever assumptions you gave them.

Hand a model your internal scoring criteria. Ask it to evaluate code. It won't tell you the criteria are wrong. It will produce a confident, well-structured, internally consistent evaluation — calibrated exactly to your assumptions.

Normal code throws errors when something breaks. AI polishes over the errors and makes them presentable. The inbreeding doesn't just persist. It gets formatted nicely and returned as JSON.


3. The Moment I Couldn't Ignore It

The Diagnostic Event


I have a tool called ai-slop-detector. It uses several metrics — logic density, import usage ratios, AST node distribution, inflation detection — combined with a weighted geometric mean to produce a quality score for Python code. The goal: detect AI-generated code that looks fine but is structurally hollow.
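The actual metrics and weights are internal, but the combining step is easy to illustrate. Here is a minimal sketch of a weighted geometric mean; the metric names and weights below are placeholders I made up, not the real model's values:

```python
import math

def weighted_geometric_mean(scores: dict, weights: dict) -> float:
    """Combine per-metric scores in (0, 1] into one quality score."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    # Work in log space; clamp each score to avoid log(0).
    log_sum = sum(w * math.log(max(scores[k], 1e-9)) for k, w in weights.items())
    return math.exp(log_sum)

# Placeholder weights and metrics (NOT the real model's values).
weights = {"logic_density": 0.40, "inflation": 0.25,
           "dependency": 0.25, "purity": 0.10}
scores = {"logic_density": 0.90, "inflation": 0.80,
          "dependency": 0.95, "purity": 0.70}

print(round(weighted_geometric_mean(scores, weights), 3))
```

The geometric mean is a deliberate choice for this kind of scoring: unlike an arithmetic mean, a single near-zero dimension drags the whole score toward zero, so a file can't buy back one structural failure with strength elsewhere.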

I ran this tool against SIDRCE SaaS. One of my other projects. A certification and audit platform.

The tool flagged a set of imports as suspicious. Potentially phantom dependencies — imported but not meaningfully used.

  • My conclusion: the tool was wrong. The imports were standard FastAPI absolute imports (from app.module import thing). They look unused to a static analyzer because FastAPI wires them up dynamically at runtime. The tool was doing static analysis on code that only makes sense in a dynamic execution context. The fix: a config override to suppress those specific findings.

  • That conclusion was technically correct. The imports were legitimate. A version mismatch between the analyzer and the platform caused real noise. A patch was developed. The issue was properly resolved.

But here's what bothered me about the process.

The question "is the tool right, or is our code right?" was answered entirely by me. The person who built both the tool and the code. There was no external reference point. No mechanism that could produce the conclusion "you're rationalizing."

When you're the judge and the defendant, even correct verdicts are epistemically shaky.


4. What Was Actually Missing

Intentional External Friction


I sat down and listed it out:

What existed                            What was absent
Mathematical scoring model              Any external criterion to falsify that model
Internal self-validation                A validator built on different assumptions
Sophisticated internal tools            Collision with tools from outside the ecosystem
Strong integration between components   Intentional friction from outside

The phrase I kept coming back to:

Intentional external friction.

Every healthy system needs inputs it didn't generate itself. Inputs that don't already agree with its assumptions. A code review where everyone thinks identically isn't a code review. It's a ritual. A test suite that only tests what developers thought to test doesn't catch unknown unknowns. An AI evaluation framework calibrated on its own outputs is a closed loop.

My toolchain was a closed loop. It had gotten sophisticated enough to look like it wasn't.


5. Building flamehaven-validator

Introducing fhval


I built fhval — flamehaven-validator — as a standalone tool with one hard constraint: it shares zero imports with the Flamehaven ecosystem. It calls other tools as subprocesses. Reads their JSON output. Doesn't know how my internal scoring formulas work. Reasons only from outputs, not internals.

Four modules. Four different attack angles on the same problem.

1) Delta Gate — Before/After Measurement



The simplest piece. Arguably the most immediately useful.

gate = DeltaGate(
    measure_cmd=["slop-detector", "--project", "{corpus}", "--json"]
)

snapshot_before = gate.snapshot(corpus_path)
# ... make your change ...
snapshot_after  = gate.snapshot(corpus_path)

report = gate.compare(snapshot_before, snapshot_after)
if not report.passed:
    raise SystemExit("Change degraded scores — rejected.")

Before any change, take a snapshot. After the change, take another. If the aggregate quality score goes down, the CI gate blocks the merge. No philosophy required. Just arithmetic.

This sounds trivial. It isn't. The number of times I'd made a "refactoring" that improved readability while quietly degrading measured quality — and never noticed — is uncomfortable to count.
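Under the hood, a gate like this can be little more than a subprocess call and a JSON parse. A sketch, not the real implementation: the `{corpus}` placeholder convention follows the usage above, but the top-level `"score"` field is an assumption, and `compare` here returns a bare bool where the real tool returns a report object.

```python
import json
import subprocess

class DeltaGate:
    """Minimal sketch: snapshot a quality score before and after a change."""

    def __init__(self, measure_cmd, tolerance=0.0):
        self.measure_cmd = measure_cmd
        self.tolerance = tolerance  # how much regression is still acceptable

    def snapshot(self, corpus: str) -> float:
        # Substitute the corpus path into the command template, run the
        # measuring tool, and pull one aggregate number out of its JSON.
        cmd = [arg.replace("{corpus}", corpus) for arg in self.measure_cmd]
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return json.loads(out.stdout)["score"]  # assumed top-level "score" key

    def compare(self, before: float, after: float) -> bool:
        # Pass only if the score did not drop by more than the tolerance.
        return after >= before - self.tolerance
```

The point is that the gate never needs to understand the scoring model. It only needs two numbers and an inequality.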


2) Adversarial Search — Code Designed to Fool the Scorer



For each metric my quality tool measures, I wrote code that should score badly but is specifically constructed to look good to the formula.

The LDR attack. Logic Density Ratio measures how much of a file is actual logic versus whitespace and structure. High density should mean good code.

def a(): return 1
def b(): return 2
def c(): return 3
# ... 20 functions ...
def t(): return 20

Very high density. Completely useless. Does the scorer catch it?

The annotation trick. My dependency metric tracks whether imported modules are actually used at runtime. Type annotations look like usage. They aren't.

from typing import Dict, Optional

import os
import json
import pathlib

def process(
    path: Optional[os.PathLike],
    config: Dict[str, json.JSONDecodeError],
) -> None:
    pass  # nothing actually runs

If the scorer counts annotation references as runtime usage, this file scores clean. Zero real dependencies. Full marks.
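Closing this gap is feasible with plain AST work: blank out the annotation subtrees first, and only then count name references. A sketch of the idea, not the detector's actual implementation:

```python
import ast

def phantom_imports(source: str) -> set:
    """Imported names with no runtime usage, ignoring annotation references."""
    tree = ast.parse(source)

    # 1. Collect everything that gets imported.
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported |= {a.asname or a.name.split(".")[0] for a in node.names}
        elif isinstance(node, ast.ImportFrom):
            imported |= {a.asname or a.name for a in node.names}

    # 2. Blank out annotations so they no longer count as "usage".
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            node.returns = None
            for arg in node.args.args + node.args.posonlyargs + node.args.kwonlyargs:
                arg.annotation = None
        elif isinstance(node, ast.AnnAssign):
            node.annotation = ast.Constant(value=None)

    # 3. Whatever names remain are real references.
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return imported - used

tricked = """\
import os
import json
import pathlib

def process(path: os.PathLike, config: json.JSONDecodeError) -> None:
    pass
"""
print(sorted(phantom_imports(tricked)))
```

Run against the trick file above, all three imports surface as phantoms, because every reference lives inside an annotation.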

The purity bypass. My tool penalizes files with critical structural patterns — bare except: clauses, extreme nesting. But you can be structurally terrible without triggering any named pattern.

def deeply_nested(data):
    result = []
    for a in data:
        for b in a:
            for c in b:
                for d in c:
                    if d:
                        for e in d:
                            if e > 0:
                                result.append(e)
    return result

Six levels deep. Radon flags it immediately for cyclomatic complexity. My purity metric, with no named pattern triggered, gives it a 1.0.

Each case returns evaded=True if the scorer calls it clean. Not a hypothesis. A measured false negative.


3) External Baseline — Tools With No Shared DNA



This is the direct structural response to the inbreeding problem.

The same files get run through pylint and radon. Standard Python tools. Built by people who've never heard of my scoring model. Then the results are compared.

slop-detector: CLEAN  (deficit score: 18.4)
pylint:        4.2/10 ← disagrees
radon CC:      B      ← agrees

When they disagree, the disagreement is recorded as explanation_needed: True. It stays open. My tool doesn't resolve it. It gets flagged for human review.

There are two directions of disagreement. They're not symmetric.

  • My tool is stricter than pylint → Maybe my model is over-penalizing something
  • Pylint is stricter than my tool → My tool has a blind spot ← this is the dangerous one

The second category is exactly what the inbreeding problem produces. An external tool catches something the internal model was designed not to see — because you didn't know it was a problem when you built the model.
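The bookkeeping for this can stay tiny. Here is a sketch of the disagreement classifier: the `disagree_slop_better` label matches the one the validator logs, but the opposite label, the 7.0/10 pylint cutoff, and the record fields are placeholders of mine.

```python
def classify_disagreement(internal_clean: bool, pylint_score: float,
                          threshold: float = 7.0) -> dict:
    """Record (never resolve) a verdict mismatch between unrelated tools.

    `threshold` is a placeholder for where pylint's 0-10 score counts as clean.
    """
    external_clean = pylint_score >= threshold
    if internal_clean and not external_clean:
        direction = "disagree_slop_better"    # external tool is stricter: blind spot?
    elif external_clean and not internal_clean:
        direction = "disagree_slop_stricter"  # internal model may over-penalize
    else:
        direction = "agree"
    return {
        "pylint_score": pylint_score,
        "internal_clean": internal_clean,
        "direction": direction,
        "explanation_needed": direction != "agree",
    }
```

Note what the function does not do: it never overwrites either verdict. Resolution is left to a human, which is the whole point.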


4) Assumption Breaker — Making Implicit Math Explicit



My scoring model uses a geometric mean to combine four quality dimensions: logic density, inflation, dependency coherence, and structural purity.

The Assumption Breaker tests the named mathematical properties of that formula. Not the implementation. The math itself.

Each quality dimension has a weight. Structural purity — which penalizes things like extreme nesting and dangerous exception handling — has a weight of 0.10. The Assumption Breaker forced me to confront what that actually means:

Purity weight = 0.10

If purity alone collapses (worst possible structural patterns),
the quality floor is approximately 0.40.

Translation: no matter how structurally horrific a file is,
if it doesn't trigger a named critical pattern,
it cannot drag the quality score below ~40% on its own.

Is that right? Maybe. Purity was assigned a lower weight deliberately. But before writing this test, this property was invisible. It emerged from the math without ever being a conscious design decision. Now it's documented. Testable. Falsifiable.

If someone later argues purity should be enforced more strongly, there's a concrete parameter to change and a test that shows exactly what happens when you change it.
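The floor itself falls straight out of the geometric mean: with every other dimension perfect, the worst purity can do to the score is purity_min raised to its 0.10 weight. A sketch that reproduces the number; the 1e-4 clamp is my assumption to land near the ~0.40 figure, not a published constant of the model.

```python
PURITY_WEIGHT = 0.10
SCORE_FLOOR = 1e-4  # assumed clamp so the geometric mean never hits exactly zero

def quality(purity: float, others: float = 1.0) -> float:
    """Weighted geometric mean with the three non-purity dimensions held equal
    (their weights summing to 0.90)."""
    purity = max(purity, SCORE_FLOOR)
    return (others ** 0.90) * (purity ** PURITY_WEIGHT)

# Purity fully collapsed, everything else perfect:
print(round(quality(0.0), 2))
```

With these assumptions, 1e-4 ** 0.10 ≈ 0.398, which is where the "cannot drop below ~40% on its own" property comes from. Changing either the weight or the clamp moves the floor, and a test pinned to this function shows exactly how.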


6. What This Actually Achieves — And What It Doesn't

What it actually does:

The delta gate works. It has blocked commits. The adversarial cases produce measurable false negatives — real gaps in the scoring model, now visible rather than latent. The external baseline generates genuine disagreements that get logged and tracked.

The SIDRCE-ASD version conflict I described earlier? Found and resolved via a patch. The process worked. The friction was real.

What it doesn't fully solve:

When disagree_slop_better fires — when pylint is stricter than my tool — someone still has to decide what to do. That someone is me. The window is open. But looking through it is still a choice.

If the consistent response is "our model has more domain context than pylint" — which is sometimes true — the window becomes decorative. The log fills with explained-away disagreements. Nothing changes.

I don't have a clean answer. What I have is a system that makes suppression a deliberate act rather than a default outcome. The disagreement is recorded. Dismissing it requires a conscious decision, not a passive one.

That's not nothing. But it's not the end of the problem either.


7. Why This Is Probably Your Problem Too

The Resulting System

If you're building tools that evaluate other tools, you have a version of this.

If you're building an AI evaluation framework, a code quality platform, a security scanner, a test coverage enforcer — especially one that will eventually run on codebases that include the tool itself — you have a version of this.

The standard answer is "write more tests." But tests written by the same team, against the same assumptions, using the same mental model of what failure looks like, are not external validation. They're confirmation that the team's mental model is internally consistent. That's different.

External validation requires things genuinely outside your model:

  • Tools built by people solving different problems
  • Metrics not designed to agree with your metrics
  • Failure modes that weren't in your threat model
  • Conclusions you didn't want to reach

The biological parallel isn't decorative. The mechanism is the same. Shared assumptions accumulate silently. Internal consistency masquerades as correctness. The failure mode only becomes visible under pressure you didn't anticipate.

One question worth asking about your own toolchain: if your core model is wrong in a systematic way, what would cause you to find out? Is there anything in your current setup that could produce that conclusion? Or does every feedback path eventually route back through assumptions you built?


8. The Question I'm Still Sitting With

The Lingering Vulnerability


The hardest part of building fhval wasn't technical. The adversarial cases are straightforward to write. The external baseline is a few subprocess calls. The delta gate is arithmetic.

The hard part: what do you do when the external tool is right and you genuinely don't want it to be?

When pylint scores something 3.2 and your model says clean — and you've spent months calibrating that model — the available move is to explain why pylint is wrong here, in this context, for this type of code. Sometimes that explanation is correct. The FastAPI import case was real.

But "context matters" is also the most available rationalization for a model that will never update.

I don't have a resolution. I have a system that records the question and keeps it open. Whether that's enough depends on how the disagreements get handled over time. That's a human problem, not a technical one.

If you're working on evaluation infrastructure, AI governance, or any system that ends up validating itself — I'm curious whether you've hit this. And what you did about it.


flamehaven-validator is an internal tool. Core algorithmic IP is not published. The architecture and design patterns described here are the substance of this post.

If this resonates with work you're doing, I'm interested in the conversation.

