Aniekan Okono

Posted on Jun 10

Why I stopped using LLMs to generate code (and what I use instead)

#ai #devtools #codegen #architecture

I want to be precise about what I mean by that title, because it's easy to read it as anti-AI and it isn't.

I use LLMs every day. I used one to write this article's first outline. I use them to parse prose, extract structure, and summarise documents. They're genuinely useful for that.

What I stopped using them for is generating application code — the part where the output needs to be correct, reproducible, and deployable without a week of cleanup.

Here's why, and what I built instead.

The problem with LLM-generated code isn't the code itself

When Bolt, Lovable, or v0 generate a frontend for you, the output often looks impressive. Clean components, reasonable naming, something that runs on first try. The demo works.

Then you try to deploy it.

The database schema is wrong, or missing entirely. There's no auth, or auth that stores tokens in localStorage, where any script injected through an XSS hole can read them. Multi-tenancy doesn't exist: every query returns every user's data. The OpenAPI spec doesn't match the routes. The migrations aren't there.

None of these are small things. They're the things that take 4–8 weeks to fix before you can show the app to a real user.

The reason this happens is structural, not incidental. An LLM keeps no state beyond its context window, so it holds no persistent model of your system. Ask an LLM to add an endpoint, and it will. Ask it to fix a bug in that endpoint, and it will, without awareness of what the fix broke downstream. Ask it to add multi-tenancy, and it might touch 60% of the places that need changing and miss the rest.

This isn't a failure of the models. It's a consequence of using a tool designed for language generation to do something that requires deterministic, system-wide consistency.

What code generation actually requires

Think about what a production-ready codebase actually is. It's not a collection of files that individually look reasonable. It's a system where:

The data model drives the migrations, which drive the API shape, which drives the frontend types
Auth is implemented consistently across every route, not just the ones you remembered to mention
Every query is scoped to the correct tenant
Compliance constraints (GDPR consent flags, audit trails, HIPAA access controls) are woven through the data layer, not bolted on as an afterthought
The infrastructure config matches the application config

For all of that to be correct, the generator needs a complete, coherent model of the system before it writes a single line. An LLM working from a PRD in natural language doesn't have that model. It infers it, incompletely, from what you wrote.

The alternative: Model-to-Text generation

Model-driven architecture has existed for decades. The idea: define your system formally first — entities, relationships, capabilities, constraints — and then generate the implementation from that model.

The key property is determinism. Given the same model, you always get the same output. The generator isn't guessing. It's applying a set of transformation rules to a structured input.
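
To make the determinism point concrete, here's a minimal sketch. It is not Archiet's engine: the `MODEL` dictionary, the template, and the `render` and `digest` functions are hypothetical names made up for illustration. The idea is only that a generator built from fixed templates and structured input has nothing in it that can vary between runs.

```python
import hashlib
from string import Template

# Hypothetical model: one entity with typed fields (an assumption for this sketch).
MODEL = {
    "entity": "Invoice",
    "fields": [("id", "int"), ("amount_cents", "int"), ("customer_email", "str")],
}

# A fixed transformation rule: the template never changes between runs.
CLASS_TEMPLATE = Template("class $name:\n$body")


def render(model: dict) -> str:
    """Apply the template to the model. No randomness, no network, no clock."""
    body = "\n".join(f"    {name}: {type_}" for name, type_ in model["fields"])
    return CLASS_TEMPLATE.substitute(name=model["entity"], body=body) + "\n"


def digest(text: str) -> str:
    """Fingerprint the output so two runs can be compared byte for byte."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = render(MODEL)
second = render(MODEL)
print(digest(first) == digest(second))
```

Comparing digests like this is also a cheap regression test for a generator: if the model didn't change and the hash did, a template changed.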

This is the approach I took when I built Archiet.

The workflow looks like this:

You provide a PRD (plain prose — what the system does, who uses it, what the rules are)
An LLM parses the PRD into a formal schema — what we call the genome: entities, screens, business rules, capabilities, compliance requirements
A deterministic Model-to-Text engine reads the genome and renders a production-ready ZIP

The LLM is used exactly where it's good: turning unstructured prose into structured data. The code generation step — where reproducibility and correctness matter — uses no LLMs at all.
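
The hand-off between the two halves can be sketched like this. Treat it as an illustration under assumptions, not the real genome schema: `Field`, `Entity`, `ALLOWED_TYPES`, and `parse_genome` are hypothetical, and the JSON string stands in for whatever the LLM step returns. The point is that the LLM's output gets validated into a typed structure before any generator touches it.

```python
import json
from dataclasses import dataclass

# Assumed type vocabulary for this sketch only.
ALLOWED_TYPES = {"str", "int", "bool", "datetime"}


@dataclass(frozen=True)
class Field:
    name: str
    type: str


@dataclass(frozen=True)
class Entity:
    name: str
    fields: tuple


def parse_genome(raw_json: str) -> list:
    """Validate LLM-produced JSON and reject anything outside the schema."""
    data = json.loads(raw_json)
    entities = []
    for item in data["entities"]:
        fields = []
        for spec in item["fields"]:
            if spec["type"] not in ALLOWED_TYPES:
                raise ValueError(
                    f"{item['name']}.{spec['name']}: unknown type {spec['type']!r}"
                )
            fields.append(Field(spec["name"], spec["type"]))
        entities.append(Entity(item["name"], tuple(fields)))
    return entities


# Stand-in for what the LLM step might return for a one-entity PRD.
llm_output = (
    '{"entities": [{"name": "Patient", '
    '"fields": [{"name": "email", "type": "str"}]}]}'
)
genome = parse_genome(llm_output)
```

Anything the parser rejects goes back to the LLM step (or to a human), so ambiguity is resolved before generation starts rather than discovered in the generated code.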

What "production-ready" actually means in practice

This is where I'll be specific, because "production-ready" gets thrown around loosely.

In Archiet's output, every generated ZIP includes:

Data layer

Alembic migrations generated directly from the entity model — you don't write them, they're derived
Multi-tenant organisation scoping on every query — not added later, baked into the base query class
No raw SQL strings; parameterised queries throughout
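
Here's roughly what "baked into the base query class" means, as a sketch using the standard-library `sqlite3` module instead of whatever ORM the generated code actually uses. `TenantScopedRepo`, the `invoices` table, and the column names are hypothetical. The tenant filter lives in one place, and the value is bound as a parameter instead of being formatted into the SQL string.

```python
import sqlite3


class TenantScopedRepo:
    """Every read goes through here, so the org filter cannot be forgotten."""

    def __init__(self, conn: sqlite3.Connection, org_id: int):
        self.conn = conn
        self.org_id = org_id

    def list_invoices(self) -> list:
        # Parameterised query: org_id is bound by the driver, never interpolated.
        cur = self.conn.execute(
            "SELECT id, amount_cents FROM invoices WHERE org_id = ? ORDER BY id",
            (self.org_id,),
        )
        return cur.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices ("
    "id INTEGER PRIMARY KEY, org_id INTEGER NOT NULL, amount_cents INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO invoices (id, org_id, amount_cents) VALUES (?, ?, ?)",
    [(1, 10, 500), (2, 20, 900), (3, 10, 1200)],
)

org_10_rows = TenantScopedRepo(conn, org_id=10).list_invoices()
org_20_rows = TenantScopedRepo(conn, org_id=20).list_invoices()
```

With this data, organisation 10 sees only its own two invoices and organisation 20 sees only its one, with no per-call filtering code for a developer (or an LLM) to forget.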

Auth

HTTPOnly cookie-based sessions (no localStorage tokens)
CSRF protection enabled by default
Role-based access control generated from the capability model
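
For the cookie part, a minimal sketch with Python's standard `http.cookies` module shows the attributes that matter. The `build_session_cookie` function and the cookie name `session` are my own choices for illustration, not names taken from Archiet's output.

```python
import secrets
from http.cookies import SimpleCookie


def build_session_cookie(session_id: str) -> str:
    """Return a Set-Cookie header value for a session cookie."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    cookie["session"]["httponly"] = True   # not readable from JavaScript
    cookie["session"]["secure"] = True     # only sent over HTTPS
    cookie["session"]["samesite"] = "Lax"  # limits cross-site sends
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


# A random, unguessable identifier; the session data itself stays server-side.
header_value = build_session_cookie(secrets.token_urlsafe(32))
```

Because the cookie is `HttpOnly`, an injected script can't read the session identifier the way it could read a token sitting in localStorage. `SameSite` helps against cross-site requests but doesn't replace the CSRF protection listed above.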

API

OpenAPI spec that is always in sync with the routes, because both are generated from the same source
Consistent error response shapes across every endpoint
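
The "same source" idea is easy to show in miniature. In this sketch the `ENDPOINTS` list, `build_route_table`, and `build_openapi` are hypothetical, and the OpenAPI document is heavily abbreviated (no schemas or parameters). Both the route table and the spec are derived from one list, so they can't drift apart.

```python
# Single source of truth for this sketch: a plain list of endpoint definitions.
ENDPOINTS = [
    {"method": "get", "path": "/invoices", "summary": "List invoices"},
    {"method": "post", "path": "/invoices", "summary": "Create an invoice"},
    {"method": "get", "path": "/customers", "summary": "List customers"},
]


def build_route_table(endpoints: list) -> list:
    """(METHOD, path) pairs that a router would register."""
    return sorted((e["method"].upper(), e["path"]) for e in endpoints)


def build_openapi(endpoints: list, title: str = "Example API", version: str = "0.1.0") -> dict:
    """Abbreviated OpenAPI document built from the same endpoint list."""
    paths = {}
    for e in endpoints:
        paths.setdefault(e["path"], {})[e["method"]] = {
            "summary": e["summary"],
            "responses": {"200": {"description": "OK"}},
        }
    return {
        "openapi": "3.0.3",
        "info": {"title": title, "version": version},
        "paths": paths,
    }


routes = build_route_table(ENDPOINTS)
spec = build_openapi(ENDPOINTS)

# Read the routes back out of the spec to confirm the two views agree.
spec_routes = sorted(
    (method.upper(), path) for path, ops in spec["paths"].items() for method in ops
)
print(routes == spec_routes)
```

The comparison at the end is the kind of check you'd otherwise have to run in CI against hand-written routes; here it holds by construction.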

Compliance

GDPR, HIPAA, SOC 2, DORA, and EU AI Act overlays available, not as checklists but as actual implementation patterns woven into the generated code
Consent flags, audit trail tables, data retention hooks — present in the output, not left as an exercise for the reader

Quality gate

Every ZIP scores ≥80/100 before delivery
Any hardcoded secret or unfilled placeholder hard-blocks the release
Generated apps are booted in a sandbox (E2B) and tested before the customer downloads them
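
A hard-blocking gate can be sketched in a few lines. The two regular expressions, `find_blockers`, and `release_allowed` are assumptions for illustration; a real gate would need a much broader rule set, and I'm not showing how the score itself is computed. Only the 80-point threshold comes from the list above.

```python
import re

# Hypothetical patterns for this sketch; deliberately narrow.
BLOCKING_PATTERNS = {
    "hardcoded secret": re.compile(
        r"(?i)\b(?:api_key|secret|password)\s*=\s*[\"'][^\"']+[\"']"
    ),
    "unfilled placeholder": re.compile(r"\{\{\s*\w+\s*\}\}|\bTODO\b|\bCHANGEME\b"),
}


def find_blockers(files: dict) -> list:
    """Return (filename, line_number, reason) for every blocking finding."""
    findings = []
    for filename, text in sorted(files.items()):
        for line_number, line in enumerate(text.splitlines(), start=1):
            for reason, pattern in BLOCKING_PATTERNS.items():
                if pattern.search(line):
                    findings.append((filename, line_number, reason))
    return findings


def release_allowed(files: dict, score: int, threshold: int = 80) -> bool:
    """Block on any finding; otherwise require the score to meet the threshold."""
    return not find_blockers(files) and score >= threshold


clean = {"settings.py": 'import os\nSECRET_KEY = os.environ["SECRET_KEY"]\n'}
dirty = {"settings.py": 'API_KEY = "sk-live-123"\nDB_NAME = "{{ db_name }}"\n'}
```

The design choice worth noting is that blockers and the score are separate: a ZIP with a perfect score still doesn't ship if a single secret or placeholder is found.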

The unusual part: the open spec

One decision I made early: publish the formal specification that underpins all of this as an open Apache-2.0 standard — archimate-codegen-spec.

The genome schema, the capability catalogue, the ArchiMate-to-genome mapping rules — all public and auditable independently of Archiet. Archiet is the reference implementation, but the spec belongs to the community.

The reason: if you're going to trust a tool to generate production code, you should be able to inspect the rules it's applying. A black-box generator you can't audit is a liability in any compliance-heavy context.

Where LLMs still belong in this pipeline

I want to be clear that I'm not arguing against LLMs in software development. The PRD parsing step is genuinely hard to do without one, and the quality of that parsing directly affects the quality of the output.

What I'm arguing is that there's a category error in using LLMs for the code generation step specifically. The properties you need from a code generator — determinism, consistency, auditability, reproducibility — are exactly the properties that LLMs are architecturally unable to provide.

The right tool for parsing ambiguous human language into structured data: LLM.
The right tool for transforming a complete, formal system model into consistent, correct code: deterministic template engine.

Using the same tool for both because it can do both is like using a hammer to drive screws because you don't want to switch tools. It works well enough until it doesn't, and when it doesn't, the failure is hard to diagnose.

What this means for your projects

If you're building something where correctness matters — fintech, healthtech, anything with compliance requirements, anything multi-tenant, anything that needs to pass a security audit — the cleanup cost of LLM-generated code is a real project risk, not a theoretical one.

The alternative isn't to write everything by hand. It's to separate the concerns: use AI for the part that is genuinely hard for rule-based software (understanding your intent), and use deterministic generation for the parts where machines are genuinely better than humans (applying rules consistently at scale).

That's the architecture that lets you go from PRD to deployable ZIP without a cleanup sprint.

If you're curious about the technical details of the genome schema or the M2T engine design, I'm happy to go into them in the comments. And if you want to try it: archiet.com (free tier available).

Top comments (1)

Adam Lewis • Jun 10

Agree the cause is the missing system model rather than the model being weak at code. I'd take a different turn at the conclusion though. You don't have to drop the LLM to fix it. Keep the schema as the single source of truth in the repo and derive the migrations, API and types from it, then back it with checks that fail when a query isn't tenant-scoped or a token lands in localStorage. You get the consistency you're after and still use the LLM for the part it's genuinely good at, turning your prose into that structured model in the first place.

prickles.org/tenet/schema-sovereig...