I've been building web apps for years and the thing that always bothered me is how much ceremony goes into something that should be simple. A task list with auth shouldn't need 15 files across 3 directories with 200 lines of config.
So I built Kilnx, a declarative backend language. 27 keywords, compiles to a single binary, SQL inline, HTML as output. At some point I started wondering: if the language is this small, can a tiny local LLM write it? A model that fits on a phone?
I ran the benchmark. Kilnx won every round.
## What Kilnx looks like
A complete app with auth, pagination, htmx, and a SQLite database:
```
config
  database: "sqlite://app.db"
  port: 8080
  secret: env SECRET_KEY required

model user
  name: text required
  email: email unique
  password: password required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks requires auth
  query tasks: SELECT id, title, done FROM task
               WHERE owner = :current_user.id
               ORDER BY created DESC paginate 20
  html
    {{each tasks}}
    <tr>
      <td>{title}</td>
      <td>{{if done}}Yes{{end}}</td>
      <td>
        <button hx-post="/tasks/{id}/delete"
                hx-target="closest tr"
                hx-swap="outerHTML">Delete</button>
      </td>
    </tr>
    {{end}}

action /tasks/create method POST requires auth
  validate task
  query: INSERT INTO task (title, owner)
         VALUES (:title, :current_user.id)
  redirect /tasks
```
`kilnx build app.kilnx -o myapp` gives you a ~15MB binary. Registration, login with bcrypt, sessions, CSRF, validation, pagination, htmx inline delete. No framework, no ORM, no `node_modules`.
## The question
The Kilnx grammar fits in 400 lines of docs. Express, Django, and Node.js each have thousands of pages of documentation, dozens of APIs, and multiple ways to do the same thing.
I wanted to know if that difference in surface area shows up when you ask small LLMs to generate code. Not GPT-4 or Claude, but models you run on a laptop with Ollama. Models between 1B and 7B parameters.
## Setup
I wrote 10 equivalent tasks across four stacks (Kilnx, Express, Django, vanilla Node.js):
| # | Task | Difficulty |
|---|---|---|
| 1 | Hello World page | trivial |
| 2 | User model definition | easy |
| 3 | Page with database query | easy |
| 4 | Create with validation | medium |
| 5 | Auth + protected route | medium |
| 6 | Delete with htmx response | medium |
| 7 | SSE notifications | medium |
| 8 | Chat websocket | hard |
| 9 | Stripe webhook | hard |
| 10 | Complete mini app | hard |
Five models, three families, all local:
| Model | Parameters | Disk |
|---|---|---|
| Qwen 2.5 7B | 7B | 4.7 GB |
| Qwen 2.5 3B | 3B | 1.9 GB |
| Qwen 2.5 1.5B | 1.5B | 986 MB |
| Phi-4 Mini | 3.8B | 2.5 GB |
| Llama 3.2 1B | 1B | 1.3 GB |
Three validation passes on every output:
- Keyword matching - does the code contain the structural elements the task requires?
- Syntax check - `kilnx check` (semantic analysis), `node --check`, Python `compile`
- LLM-as-judge - Qwen 7B rating syntax/completeness/correctness/idiom (0-3 each)
Every combination ran 3 times. 600 generations, 600 judge evaluations.
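The run matrix and the first validation pass are simple enough to sketch. This is a hypothetical reconstruction for illustration, not the actual harness; the task, stack, and model identifiers are my own placeholder names:

```python
from itertools import product

# Hypothetical reconstruction of the benchmark matrix described above.
TASKS = range(1, 11)  # the 10 tasks from the table
STACKS = ["kilnx", "express", "django", "node"]
MODELS = ["qwen2.5:7b", "qwen2.5:3b", "qwen2.5:1.5b", "phi4-mini", "llama3.2:1b"]
RUNS = range(3)       # every combination ran 3 times

matrix = list(product(TASKS, STACKS, MODELS, RUNS))
print(len(matrix))  # 10 * 4 * 5 * 3 = 600 generations

def keyword_score(output: str, required: list[str]) -> float:
    """Pass 1: fraction of required structural elements present
    in the generated code (a plain substring check here)."""
    return sum(kw in output for kw in required) / len(required)

# e.g. a protected page should contain these structural elements:
sample = "page /tasks requires auth\n  query tasks: select id, title from task"
print(keyword_score(sample, ["page", "requires auth", "query"]))  # 1.0
```

The real harness presumably uses per-task keyword lists tuned to each stack; the point is only that pass 1 is a cheap structural check, not a semantic one.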
## About fairness
This is important. Kilnx has never appeared in any training dataset. Zero .kilnx files exist on the internet outside my repo. Express and Django have millions of code examples baked into every LLM's weights.
I gave the models the Kilnx grammar reference (11.7K chars) as prompt context. Express, Django, and Node got no reference docs because they don't need them.
If anything, this setup gives the established frameworks a huge advantage. They've been pre-trained on the entire Stack Overflow + GitHub history. Kilnx gets one document.
## The numbers
### Structural correctness (keyword score, averaged over 3 runs)
| Model | Kilnx | Express | Node.js | Django |
|---|---|---|---|---|
| Qwen 2.5 7B | 100% | 88% | 93% | 83% |
| Qwen 2.5 3B | 99% | 88% | 89% | 87% |
| Qwen 2.5 1.5B | 99% | 85% | 87% | 74% |
| Phi-4 Mini | 98% | 88% | 93% | 85% |
| Llama 3.2 1B | 90% | 78% | 77% | 77% |
Qwen 3B, a 1.9 GB model, scores 99% on Kilnx, a language it has never encountered. The same model gets 87% on Django, a framework it has seen millions of times during training.
When you shrink from 7B down to 1B, Kilnx drops 10 points. Node.js drops 16. The simpler grammar holds up better as the model gets dumber.
### Tokens per task (completion only)
| Framework | Qwen 7B | Qwen 3B | Qwen 1.5B | Phi-4 Mini |
|---|---|---|---|---|
| Kilnx | 105 | 112 | 111 | 95 |
| Django | 195 | 226 | 152 | 199 |
| Express | 302 | 349 | 265 | 315 |
| Node.js | 347 | 381 | 333 | 490 |
That's roughly 3x fewer tokens than Express or Node. This is not a style difference; it's the same functionality. A chat websocket in Kilnx is ~110 tokens:
```
socket /chat/:room requires auth
  on connect
    query: select body, author.name, created from chat_message
           where room = :room
           order by created desc
           limit 50
    send history
  on message
    validate
      body: required max 500
    query: insert into chat_message (body, author, room)
           values (:body, :current_user.id, :room)
    broadcast to :room fragment chat-bubble
```
The Express version of the same task runs ~420 tokens of socket.io setup, middleware, database calls, and room management.
## Session economics
Kilnx has a cost that Express doesn't: the grammar reference takes ~3,100 prompt tokens. But that's loaded once per session. The per-task completion cost is what scales.
Cumulative tokens over a real session with Qwen 3B:
| Tasks | Kilnx | Express | Node.js |
|---|---|---|---|
| 1 | 3,269 | 464 | 501 |
| 10 | 4,277 | 4,640 | 5,010 |
| 25 | 5,957 | 11,600 | 12,525 |
| 50 | 8,757 | 23,200 | 25,050 |
| 100 | 14,357 | 46,400 | 50,100 |
Kilnx becomes cheaper than Express at task 9. By the end of a workday (call it 50-100 tasks with a copilot), you've used 71% fewer tokens. If you're paying per token on an API, that's real money. If you're running locally, it's real time.
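The arithmetic behind that crossover is worth making explicit. Here is a minimal cost model, assuming (consistent with the tables above) a one-time ~3,157-token grammar prompt for Kilnx plus the Qwen 3B per-task completion costs, and no fixed overhead for the other stacks:

```python
# Session cost model inferred from the Qwen 3B table above.
KILNX_PROMPT = 3_157    # one-time grammar reference (~3,100 tokens plus overhead)
KILNX_PER_TASK = 112    # completion tokens per task
EXPRESS_PER_TASK = 464
NODE_PER_TASK = 501

def kilnx_cost(n: int) -> int:
    return KILNX_PROMPT + KILNX_PER_TASK * n

def express_cost(n: int) -> int:
    return EXPRESS_PER_TASK * n

# Reproduces the table rows:
print(kilnx_cost(10), express_cost(10))    # 4277 4640
print(kilnx_cost(100), express_cost(100))  # 14357 46400

# First task count where Kilnx is cheaper than Express:
crossover = next(n for n in range(1, 1000) if kilnx_cost(n) < express_cost(n))
print(crossover)  # 9
```

At task 8 the grammar prompt still dominates (4,053 vs 3,712); at task 9 the lower per-task cost takes over and never gives it back.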
## Raw output from the 3B
No editing, no cherry-picking. This is Qwen 2.5 3B (1.9 GB on disk) generating a complete app from scratch, having seen the Kilnx grammar for the first time in the prompt:
```
config
  database: env DATABASE_URL default "sqlite://app.db"
  port: 8080
  secret: env SECRET_KEY required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks layout main requires auth
  query tasks: select id, title, done from task
               where owner = :current_user.id
               order by created desc
               paginate 20
  html
    <input type="search" name="q" placeholder="Search tasks..."
           hx-get="/tasks" hx-trigger="keyup changed delay:300ms"
           hx-target="#task-list">
    <table id="task-list">
      <tr><th>Title</th><th>Done</th><th></th></tr>
      {{each tasks}}
      <tr>
        <td>{title}</td>
        <td>{{if done}}Yes{{end}}</td>
        <td><button hx-post="/tasks/{id}/delete"
                    hx-target="closest tr"
                    hx-swap="outerHTML">Delete</button></td>
      </tr>
      {{end}}
    </table>

action /tasks/create method POST requires auth
  validate task
  query: insert into task (title, owner)
         values (:title, :current_user.id)
  redirect /tasks
```
Auth, pagination, htmx search with debounce, inline delete, form validation. It even added the search input on its own; that wasn't in the prompt.
## Why I think this happens
Express forces the model to make a lot of decisions. CommonJS or ESM? Which middleware in what order? Prisma or Sequelize or raw queries? Passport or express-session or JWT? EJS or Pug or Handlebars? Each fork is a place where a small model can pick wrong.
Kilnx has one way to do each thing. One keyword for auth, one keyword for pages, one for actions. The model doesn't pick between approaches because there's only one approach. The decision space is so small that even a 1B model mostly gets it right.
I don't think this is unique to Kilnx. Any DSL with a tight, regular grammar would probably show the same pattern. The surface area of a language directly predicts how well small models can generate it. I haven't seen anyone optimize for that yet.
## What I'd do with this
If you're an indie dev or a solo founder shipping CRUD apps:
A 3B model running locally gives you 99% accuracy on Kilnx with no API costs, no internet, no privacy concerns. The 7B hits 100%. You don't need to send your code to OpenAI to get a working backend.
If you're using a paid API, the 71% token reduction over a session adds up fast. Especially if you're iterating on features all day.
If you're just curious, the whole language is 27 keywords. You can read the grammar in 10 minutes.
## Links
```sh
curl -fsSL https://raw.githubusercontent.com/kilnx-org/kilnx/main/install.sh | sh
kilnx run app.kilnx
```