Anton Fedotov

Why we didn’t use an LLM-first approach for architectural drift detection

LLMs are very good at a lot of things in software development.

They can explain code, summarize pull requests, suggest fixes, and point out suspicious logic. For many review tasks, they are genuinely useful.

But when we started working on architectural drift detection, we ran into a different kind of problem.

Architectural drift is usually not a single “bad line of code”.
It is a gradual shift in the shape of a codebase:

  • boundaries get blurred,
  • hidden coupling appears,
  • new state starts leaking into places that used to stay simple,
  • control flow becomes more irregular,
  • repo-specific patterns quietly erode over time.

And that is where an LLM-first approach started to feel like the wrong primary layer.

The core issue: architectural drift is not just code understanding

A generic LLM can read code and reason about it.

But architectural drift is not only about understanding what a piece of code does.
It is about understanding whether a change is structurally abnormal for this repository.

That distinction matters.

A pattern can be valid in isolation and still be a bad architectural move in a specific repo.

For example:

  • introducing a new abstraction where the repo has stayed intentionally simple,
  • adding hidden state into an area that has historically stayed stateless,
  • crossing a module boundary that the team has treated as stable,
  • making a PR that is locally reasonable but globally erosive.

An LLM can often describe such code.
But detecting that it is out of character for this codebase is a different task.

Why LLM-first review was not enough for us

1. The decision is often local, but the damage is global

Large language models are very strong at local reasoning over the code they can see.

But architecture is a global property.
A pull request can look fine line by line while still moving the whole system in the wrong direction.

That is why drift often survives normal review:
tests pass, the code works, nothing looks obviously broken — but the shape of the system worsens.

2. Repo-specific baselines matter more than general code knowledge

Most AI review tools are built around broad priors learned from many repositories.

That is useful for generic review.
It is less useful for questions like:

  • “Is this kind of abstraction typical here?”
  • “Does this boundary crossing fit the historical structure of this repo?”
  • “Is this new dependency normal for this subsystem?”
  • “Is this complexity spike expected here or is it architectural drift?”

Those are not universal questions.
They are baseline questions.
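To make "baseline question" concrete, here is a minimal sketch of one of them — "does this boundary crossing fit the historical structure of this repo?" — phrased as a check of a PR's new dependency edges against edges the repo has actually contained before. The module names and the `unusual_edges` helper are hypothetical illustrations, not Revieko's implementation; a real detector would also weight edges by frequency and recency.

```python
def unusual_edges(historical_edges, pr_edges):
    """Return dependency edges this repo has never contained before."""
    return sorted(set(pr_edges) - set(historical_edges))

# Edges observed across the repo's history: (importing module, imported module).
history = {
    ("api", "service"),
    ("service", "storage"),
    ("api", "auth"),
}

# This PR adds a direct api -> storage import, skipping the service layer.
pr = [("api", "auth"), ("api", "storage")]

print(unusual_edges(history, pr))  # -> [('api', 'storage')]
```

Note that `("api", "storage")` is perfectly valid code in isolation — it only becomes a signal because this repo has historically routed that access through `service`.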

3. Drift detection needs stability, not just plausible reasoning

For architecture work, noisy comments are deadly.

If the system raises too many vague or unstable warnings, teams stop trusting it very quickly.

We needed a layer that behaves more like structural instrumentation:
repeatable, calibrated, and tied to measurable deviation — not just a smart narrative about code.

4. Explanation and detection are different jobs

LLMs are often excellent at the explanation half:
articulating why something may be risky.

But the detection half — consistently flagging structural deviation relative to a repo baseline — is a separate problem.

We found it useful to separate those two jobs instead of forcing one model to do both.

What we built instead

We built a non-linguistic structural layer first.

The idea is simple:

  1. learn the repository’s structural baseline,
  2. compare each PR against that baseline,
  3. score the deviation,
  4. surface a short risk summary and a few hotspots directly in the PR.
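The four steps above can be sketched end to end. This is a deliberately simplified illustration under stated assumptions — the metric names, the z-score deviation measure, and the `2.0` threshold are all placeholders, not PhaseBrain's actual model:

```python
import statistics
from dataclasses import dataclass

@dataclass
class Finding:
    metric: str
    score: float

def learn_baseline(history):
    """Step 1: summarize each structural metric's historical distribution."""
    return {
        metric: (statistics.mean(vals), statistics.stdev(vals))
        for metric, vals in history.items()
    }

def score_pr(baseline, pr_metrics):
    """Steps 2-3: compare a PR against the baseline and score the deviation."""
    findings = []
    for metric, value in pr_metrics.items():
        mean, stdev = baseline[metric]
        z = abs(value - mean) / stdev if stdev else 0.0
        findings.append(Finding(metric, round(z, 2)))
    return sorted(findings, key=lambda f: f.score, reverse=True)

def summarize(findings, threshold=2.0, top=3):
    """Step 4: a short risk summary plus a few hotspots for the PR."""
    hotspots = [f for f in findings if f.score >= threshold][:top]
    risk = "drift suspected" if hotspots else "within baseline"
    return risk, hotspots

# Hypothetical per-PR structural metrics for one subsystem.
history = {
    "cross_module_imports": [1, 2, 1, 2, 1, 2],
    "new_stateful_fields":  [0, 0, 1, 0, 0, 1],
}
baseline = learn_baseline(history)
risk, hotspots = summarize(score_pr(baseline, {
    "cross_module_imports": 7,  # an unusual boundary crossing for this repo
    "new_stateful_fields":  0,
}))
print(risk, [f.metric for f in hotspots])
# -> drift suspected ['cross_module_imports']
```

The key property is that the same PR metrics scored against a different repo's history would produce a different answer — the signal is relative to the baseline, not absolute.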

In our case, this became PhaseBrain inside Revieko.

The model is not trying to replace LLMs.
It is trying to do something narrower and more structural:
track roles, boundaries, deviations, and coherence in the evolution of a repo.

That gives us a better primary signal for architectural drift.

Then, if needed, language models can sit on top of that signal and help explain it.

Our view now

For this problem, LLMs are useful — but not as the foundation.

They are strong explainers.
They are not the best primary detector of repo-specific architectural drift.

Architectural drift is less about “what does this code mean?”
and more about
“what does this change do to the structure of this system over time?”

That pushed us toward a structural model first, and a language layer second.

That is the architecture we ended up building.

If you work on long-lived repos, I’d be very interested in your view:

  • Have you seen PRs that looked reasonable locally but still degraded system structure?
  • Do you think architectural drift is better modeled as a structural signal than as a pure language task?

Revieko:
https://synqra.tech/revieko
