I want to be precise about what I mean by that title, because it's easy to read it as anti-AI and it isn't.
I use LLMs every day. I used one to write this article's first outline. I use them to parse prose, extract structure, and summarise documents. They're genuinely useful for that.
What I stopped using them for is generating application code — the part where the output needs to be correct, reproducible, and deployable without a week of cleanup.
Here's why, and what I built instead.
The problem with LLM-generated code isn't the code itself
When Bolt, Lovable, or v0 generate a frontend for you, the output often looks impressive. Clean components, reasonable naming, something that runs on first try. The demo works.
Then you try to deploy it.
The database schema is wrong, or missing entirely. There's no auth, or there's auth that stores tokens in localStorage, where any script running on the page can read them, so one XSS bug is enough to steal a session. Multi-tenancy doesn't exist: every query returns every user's data. The OpenAPI spec doesn't match the routes. The migrations aren't there.
None of these are small things. They're the things that take 4–8 weeks to fix before you can show the app to a real user.
The reason this happens is structural, not incidental. LLMs are stateless between calls. They don't hold a persistent model of your system; all they know about it is whatever fits in the context window of the current request. Ask an LLM to add an endpoint, and it will. Ask it to fix a bug in that endpoint, and it will, without awareness of what the fix broke downstream. Ask it to add multi-tenancy, and it might touch 60% of the places that need changing and miss the rest.
This isn't a failure of the models. It's a consequence of using a tool designed for language generation to do something that requires deterministic, system-wide consistency.
What code generation actually requires
Think about what a production-ready codebase actually is. It's not a collection of files that individually look reasonable. It's a system where:
- The data model drives the migrations, which drive the API shape, which drives the frontend types
- Auth is implemented consistently across every route, not just the ones you remembered to mention
- Every query is scoped to the correct tenant
- Compliance constraints (GDPR consent flags, audit trails, HIPAA access controls) are woven through the data layer, not bolted on as an afterthought
- The infrastructure config matches the application config
For all of that to be correct, the generator needs a complete, coherent model of the system before it writes a single line. An LLM working from a PRD in natural language doesn't have that model. It infers it, incompletely, from what you wrote.
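To make the "one model drives everything" idea concrete, here is a minimal sketch in which a single entity definition produces both a table definition and a frontend type. The `ENTITY` format, the type maps, and both function names are assumptions made up for this illustration, not Archiet's genome or engine.

```python
# Hypothetical single source of truth: one entity with typed fields.
ENTITY = {
    "name": "Invoice",
    "fields": [("id", "int"), ("total_cents", "int"), ("paid", "bool")],
}

# Illustrative type maps; a real generator would cover far more types.
SQL_TYPES = {"int": "INTEGER", "str": "TEXT", "bool": "BOOLEAN"}
TS_TYPES = {"int": "number", "str": "string", "bool": "boolean"}


def to_create_table(entity: dict) -> str:
    """Derive the table definition from the entity model."""
    cols = ",\n".join(
        f"  {name} {SQL_TYPES[type_]} NOT NULL" for name, type_ in entity["fields"]
    )
    return f"CREATE TABLE {entity['name'].lower()} (\n{cols}\n);"


def to_ts_interface(entity: dict) -> str:
    """Derive the frontend type from the same entity model."""
    members = "\n".join(
        f"  {name}: {TS_TYPES[type_]};" for name, type_ in entity["fields"]
    )
    return f"export interface {entity['name']} {{\n{members}\n}}"


print(to_create_table(ENTITY))
print(to_ts_interface(ENTITY))
```

Because both outputs are computed from `ENTITY`, renaming or retyping a field changes the schema and the frontend type together. Neither can drift from the other.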
The alternative: Model-to-Text generation
Model-driven architecture has existed for decades. The idea: define your system formally first — entities, relationships, capabilities, constraints — and then generate the implementation from that model.
The key property is determinism. Given the same model, you always get the same output. The generator isn't guessing. It's applying a set of transformation rules to a structured input.
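As a rough illustration of that property, the sketch below renders a toy model through a fixed template and checks that two runs produce byte-identical output. The `MODEL` shape and the `render_model` function are hypothetical stand-ins for this example; a real Model-to-Text engine would apply many more rules.

```python
import hashlib
from string import Template

# Hypothetical toy model: one entity with typed fields.
MODEL = {
    "entity": "Invoice",
    "fields": [("id", "int"), ("amount_cents", "int"), ("customer", "str")],
}

CLASS_TEMPLATE = Template(
    "from dataclasses import dataclass\n\n\n"
    "@dataclass\n"
    "class $name:\n"
    "$body\n"
)


def render_model(model: dict) -> str:
    """Apply fixed transformation rules to a structured model."""
    body = "\n".join(f"    {name}: {type_}" for name, type_ in model["fields"])
    return CLASS_TEMPLATE.substitute(name=model["entity"], body=body)


def digest(text: str) -> str:
    """SHA-256 of the rendered text, used to compare runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


first = render_model(MODEL)
second = render_model(MODEL)

# Same model in, byte-identical text out.
assert digest(first) == digest(second)
print(first)
```

There is no sampling step anywhere in `render_model`, so the digest is a stable fingerprint of the model. If the output changes, either the model or the template changed.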
This is the approach I took when I built Archiet.
The workflow looks like this:
- You provide a PRD (plain prose — what the system does, who uses it, what the rules are)
- An LLM parses the PRD into a formal schema — what we call the genome: entities, screens, business rules, capabilities, compliance requirements
- A deterministic Model-to-Text engine reads the genome and renders a production-ready ZIP
The LLM is used exactly where it's good: turning unstructured prose into structured data. The code generation step — where reproducibility and correctness matter — uses no LLMs at all.
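The hand-off between the two steps can be sketched as a strict validation boundary: whatever the LLM returns is parsed into typed, immutable objects, and anything the generator cannot trust is rejected before generation starts. The `Genome` and `Entity` shapes, `ALLOWED_TYPES`, and `parse_genome` below are assumptions for illustration and are far smaller than the real genome schema. The LLM call itself is replaced by a fixed JSON string.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    fields: tuple[tuple[str, str], ...]


@dataclass(frozen=True)
class Genome:
    entities: tuple[Entity, ...]
    capabilities: tuple[str, ...]


# Illustrative whitelist of field types the generator knows how to render.
ALLOWED_TYPES = {"int", "str", "bool", "datetime"}


def parse_genome(raw_json: str) -> Genome:
    """Validate LLM output and reject anything the generator cannot trust."""
    data = json.loads(raw_json)
    entities = []
    for item in data["entities"]:
        fields = tuple((f["name"], f["type"]) for f in item["fields"])
        for field_name, field_type in fields:
            if field_type not in ALLOWED_TYPES:
                raise ValueError(
                    f"{item['name']}.{field_name}: unknown type {field_type!r}"
                )
        entities.append(Entity(item["name"], fields))
    return Genome(tuple(entities), tuple(data.get("capabilities", [])))


# Stand-in for the LLM step: a real pipeline would get this string from a model.
llm_output = (
    '{"entities": [{"name": "Patient", "fields": [{"name": "id", "type": "int"}]}],'
    ' "capabilities": ["audit_trail"]}'
)
genome = parse_genome(llm_output)
print(genome)
```

Everything after `parse_genome` works on a frozen, validated structure, so the non-deterministic step stays confined to one place and its output can be reviewed before any code is rendered.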
What "production-ready" actually means in practice
This is where I'll be specific, because "production-ready" gets thrown around loosely.
In Archiet's output, every generated ZIP includes:
Data layer
- Alembic migrations generated directly from the entity model — you don't write them, they're derived
- Multi-tenant organisation scoping on every query — not added later, baked into the base query class
- No raw SQL strings; parameterised queries throughout
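A minimal sketch of what "baked into the base query class" can look like, assuming a hypothetical `TenantScopedQuery` class and an `org_id` column on every table. It uses the standard-library `sqlite3` module only to stay self-contained; it is not the generated code, which by the description above goes through an ORM layer and not hand-written SQL.

```python
import sqlite3


class TenantScopedQuery:
    """Hypothetical base class: every read goes through select(), which always filters by org."""

    def __init__(self, conn: sqlite3.Connection, org_id: int):
        self.conn = conn
        self.org_id = org_id

    def select(self, table: str, columns: tuple[str, ...]) -> list[tuple]:
        # Table and column names are assumed to come from the generator's model,
        # never from user input. The tenant id is bound as a query parameter.
        sql = f"SELECT {', '.join(columns)} FROM {table} WHERE org_id = ? ORDER BY id"
        return self.conn.execute(sql, (self.org_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice (id INTEGER PRIMARY KEY, org_id INTEGER NOT NULL, total INTEGER)"
)
conn.executemany(
    "INSERT INTO invoice (org_id, total) VALUES (?, ?)",
    [(1, 100), (1, 250), (2, 999)],
)

org_one = TenantScopedQuery(conn, org_id=1)
print(org_one.select("invoice", ("id", "total")))  # only org 1's rows
```

The point of the design is that there is no code path that reads `invoice` without an `org_id` filter. Forgetting the filter is not something a caller can do.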
Auth
- HTTPOnly cookie-based sessions (no localStorage tokens)
- CSRF protection enabled by default
- Role-based access control generated from the capability model
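For the cookie part, here is a small sketch of a session cookie that page scripts cannot read, built with Python's standard `http.cookies` module. The cookie name `session`, the one-hour lifetime, and the `SameSite=Lax` choice are assumptions for the example and are not taken from Archiet's output.

```python
import secrets
from http.cookies import SimpleCookie


def session_cookie_header(session_id: str, max_age_seconds: int = 3600) -> str:
    """Build the value of a Set-Cookie header for a session cookie."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    morsel = cookie["session"]
    morsel["httponly"] = True   # not readable from document.cookie
    morsel["secure"] = True     # only sent over HTTPS
    morsel["samesite"] = "Lax"  # limits cross-site sends
    morsel["path"] = "/"
    morsel["max-age"] = max_age_seconds
    return morsel.OutputString()


header_value = session_cookie_header(secrets.token_urlsafe(32))
print("Set-Cookie:", header_value)
```

With `HttpOnly` set, an injected script has no API to read the session identifier, which is the difference from a token sitting in localStorage.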
API
- OpenAPI spec that is always in sync with the routes, because both are generated from the same source
- Consistent error response shapes across every endpoint
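The "same source" idea can be sketched in a few lines: one list of endpoint definitions feeds both the route table and the OpenAPI `paths` object, so the two cannot disagree. The `Endpoint` dataclass and the two builder functions are hypothetical names for this illustration, and the spec is reduced to a minimal OpenAPI 3.0 skeleton.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    method: str
    path: str
    summary: str


# Hypothetical endpoint definitions; in a generator these would come from the model.
ENDPOINTS = (
    Endpoint("get", "/invoices", "List invoices"),
    Endpoint("post", "/invoices", "Create an invoice"),
    Endpoint("get", "/invoices/{invoice_id}", "Fetch one invoice"),
)


def build_routes(endpoints) -> set[tuple[str, str]]:
    """The (METHOD, path) pairs a router would register."""
    return {(e.method.upper(), e.path) for e in endpoints}


def build_openapi(endpoints) -> dict:
    """A minimal OpenAPI document derived from the same definitions."""
    paths: dict = {}
    for e in endpoints:
        paths.setdefault(e.path, {})[e.method] = {
            "summary": e.summary,
            "responses": {"200": {"description": "OK"}},
        }
    return {
        "openapi": "3.0.3",
        "info": {"title": "Example API", "version": "1.0.0"},
        "paths": paths,
    }


routes = build_routes(ENDPOINTS)
spec = build_openapi(ENDPOINTS)

# Every documented operation is a registered route, and vice versa.
documented = {(m.upper(), p) for p, ops in spec["paths"].items() for m in ops}
assert documented == routes
```

The final assertion is the property that hand-maintained specs tend to lose over time. Here it holds by construction.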
Compliance
- GDPR, HIPAA, SOC 2, DORA, and EU AI Act overlays available, not as checklists but as implementation patterns woven into the generated code
- Consent flags, audit trail tables, data retention hooks — present in the output, not left as an exercise for the reader
Quality gate
- Every ZIP scores ≥80/100 before delivery
- Any hardcoded secret or unfilled placeholder hard-blocks the release
- Generated apps are booted in a sandbox (E2B) and tested before the customer downloads them
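The hard-block rule for secrets and placeholders can be illustrated with a small scanner over the generated files. The two regular expressions, the `find_blockers` function, and the sample files are assumptions invented for this sketch. A real gate would use a much larger rule set, and nothing here reflects how the 0–100 score is computed.

```python
import re

# Illustrative patterns only; a production gate would need many more rules.
BLOCKING_PATTERNS = {
    "unfilled placeholder": re.compile(r"\{\{\s*\w+\s*\}\}|TODO_FILL_ME"),
    "hardcoded secret": re.compile(
        r"(?i)(secret|password|api_key)\s*=\s*['\"][^'\"]+['\"]"
    ),
}


def find_blockers(files: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line number, reason) for every line that must block a release."""
    blockers = []
    for path in sorted(files):
        for line_no, line in enumerate(files[path].splitlines(), start=1):
            for label, pattern in BLOCKING_PATTERNS.items():
                if pattern.search(line):
                    blockers.append((path, line_no, label))
    return blockers


# Hypothetical generated output: one clean file, one with two problems.
generated = {
    "app/config.py": 'import os\nSECRET_KEY = os.environ["SECRET_KEY"]\n',
    "app/mail.py": 'SMTP_PASSWORD = "hunter2"\nSENDER = "{{ sender_address }}"\n',
}

blockers = find_blockers(generated)
release_allowed = not blockers
for path, line_no, reason in blockers:
    print(f"{path}:{line_no}: {reason}")
print("release allowed:", release_allowed)
```

Reading the secret from the environment in `app/config.py` passes, while the literal password and the leftover template marker in `app/mail.py` each produce a blocker, so the release flag ends up false.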
The unusual part: the open spec
One decision I made early: publish the formal specification that underpins all of this as an open Apache-2.0 standard — archimate-codegen-spec.
The genome schema, the capability catalogue, the ArchiMate-to-genome mapping rules — all public and auditable independently of Archiet. Archiet is the reference implementation, but the spec belongs to the community.
The reason: if you're going to trust a tool to generate production code, you should be able to inspect the rules it's applying. A black-box generator you can't audit is a liability in any compliance-heavy context.
Where LLMs still belong in this pipeline
I want to be clear that I'm not arguing against LLMs in software development. The PRD parsing step is genuinely hard to do without one, and the quality of that parsing directly affects the quality of the output.
What I'm arguing is that there's a category error in using LLMs for the code generation step specifically. The properties you need from a code generator (determinism, consistency, auditability, reproducibility) are exactly the properties that a model sampling its output token by token cannot guarantee.
The right tool for parsing ambiguous human language into structured data: LLM.
The right tool for transforming a complete, formal system model into consistent, correct code: deterministic template engine.
Using the same tool for both because it can do both is like using a hammer to drive screws because you don't want to switch tools. It works well enough until it doesn't, and when it doesn't the failure is hard to diagnose.
What this means for your projects
If you're building something where correctness matters — fintech, healthtech, anything with compliance requirements, anything multi-tenant, anything that needs to pass a security audit — the cleanup cost of LLM-generated code is a real project risk, not a theoretical one.
The alternative isn't to write everything by hand. It's to separate the concerns: use AI for the part that is genuinely hard for rule-based software (understanding your intent), and use deterministic generation for the part where rule-based software is genuinely better than humans (applying rules consistently at scale).
That's the architecture that lets you go from PRD to deployable ZIP without a cleanup sprint.
If you're curious about the technical details of the genome schema or the M2T engine design, I'm happy to go into them in the comments. And if you want to try it, archiet.com has a free tier.
Top comments (1)
Agree the cause is the missing system model rather than the model being weak at code. I'd take a different turn at the conclusion though. You don't have to drop the LLM to fix it. Keep the schema as the single source of truth in the repo and derive the migrations, API and types from it, then back it with checks that fail when a query isn't tenant-scoped or a token lands in localStorage. You get the consistency you're after and still use the LLM for the part it's genuinely good at, turning your prose into that structured model in the first place.
prickles.org/tenet/schema-sovereig...