I've been building web apps for years and the thing that always bothered me is how much ceremony goes into something that should be simple. A task list with auth shouldn't need 15 files across 3 directories with 200 lines of config.
So I built Kilnx, a declarative backend language. 27 keywords, compiles to a single binary, SQL inline, HTML as output. At some point I started wondering: if the language is this small, can a tiny local LLM write it? A model that fits on a phone?
I ran the benchmark. Kilnx won every round.
## What Kilnx looks like
A complete app with auth, pagination, htmx, and a SQLite database:
```
config
  database: "sqlite://app.db"
  port: 8080
  secret: env SECRET_KEY required

model user
  name: text required
  email: email unique
  password: password required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks requires auth
  query tasks: SELECT id, title, done FROM task
               WHERE owner = :current_user.id
               ORDER BY created DESC paginate 20
  html
    {{each tasks}}
    <tr>
      <td>{title}</td>
      <td>{{if done}}Yes{{end}}</td>
      <td>
        <button hx-post="/tasks/{id}/delete"
                hx-target="closest tr"
                hx-swap="outerHTML">Delete</button>
      </td>
    </tr>
    {{end}}

action /tasks/create method POST requires auth
  validate task
  query: INSERT INTO task (title, owner)
         VALUES (:title, :current_user.id)
  redirect /tasks
```
`kilnx build app.kilnx -o myapp` gives you a ~15MB binary. Registration, login with bcrypt, sessions, CSRF, validation, pagination, htmx inline delete. No framework, no ORM, no `node_modules`.
## The question
The Kilnx grammar fits in 400 lines of docs. Express, Django, and Node.js each have thousands of pages of documentation, dozens of APIs, and multiple ways to do the same thing.
I wanted to know if that difference in surface area shows up when you ask small LLMs to generate code. Not GPT-4 or Claude, but models you run on a laptop with Ollama. Models between 1B and 7B parameters.
## Setup
I wrote 10 equivalent tasks across four stacks (Kilnx, Express, Django, vanilla Node.js):
| # | Task | Difficulty |
|---|---|---|
| 1 | Hello World page | trivial |
| 2 | User model definition | easy |
| 3 | Page with database query | easy |
| 4 | Create with validation | medium |
| 5 | Auth + protected route | medium |
| 6 | Delete with htmx response | medium |
| 7 | SSE notifications | medium |
| 8 | Chat websocket | hard |
| 9 | Stripe webhook | hard |
| 10 | Complete mini app | hard |
Five models, three families, all local:
| Model | Parameters | Disk |
|---|---|---|
| Qwen 2.5 7B | 7B | 4.7 GB |
| Qwen 2.5 3B | 3B | 1.9 GB |
| Qwen 2.5 1.5B | 1.5B | 986 MB |
| Phi-4 Mini | 3.8B | 2.5 GB |
| Llama 3.2 1B | 1B | 1.3 GB |
Three validation passes on every output:
- Keyword matching - does the code contain the structural elements the task requires?
- Syntax check - `kilnx check` (semantic analysis), `node --check`, Python `compile`
- LLM-as-judge - Qwen 7B rating syntax/completeness/correctness/idiom (0-3 each)
Every combination ran 3 times. 600 generations, 600 judge evaluations.
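The run matrix and the first validation pass are simple enough to sketch. This is a hypothetical reconstruction for illustration, not the actual harness; the task, stack, and model identifiers are my own placeholder names:

```python
from itertools import product

# Hypothetical reconstruction of the benchmark matrix described above.
TASKS = range(1, 11)  # the 10 tasks from the table
STACKS = ["kilnx", "express", "django", "node"]
MODELS = ["qwen2.5:7b", "qwen2.5:3b", "qwen2.5:1.5b", "phi4-mini", "llama3.2:1b"]
RUNS = range(3)       # every combination ran 3 times

matrix = list(product(TASKS, STACKS, MODELS, RUNS))
print(len(matrix))  # 10 * 4 * 5 * 3 = 600 generations

def keyword_score(output: str, required: list[str]) -> float:
    """Pass 1: fraction of required structural elements present
    in the generated code (a plain substring check here)."""
    return sum(kw in output for kw in required) / len(required)

# e.g. a protected page should contain these structural elements:
sample = "page /tasks requires auth\n  query tasks: select id, title from task"
print(keyword_score(sample, ["page", "requires auth", "query"]))  # 1.0
```

The real harness presumably uses per-task keyword lists tuned to each stack; the point is only that pass 1 is a cheap structural check, not a semantic one.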
## About fairness
This is important. Kilnx has never appeared in any training dataset. Zero .kilnx files exist on the internet outside my repo. Express and Django have millions of code examples baked into every LLM's weights.
I gave the models the Kilnx grammar reference (11.7K chars) as prompt context. Express, Django, and Node got no reference docs because they don't need them.
If anything, this setup gives the established frameworks a huge advantage. They've been pre-trained on the entire Stack Overflow + GitHub history. Kilnx gets one document.
## The numbers
### Structural correctness (keyword score, averaged over 3 runs)
| Model | Kilnx | Express | Node.js | Django |
|---|---|---|---|---|
| Qwen 2.5 7B | 100% | 88% | 93% | 83% |
| Qwen 2.5 3B | 99% | 88% | 89% | 87% |
| Qwen 2.5 1.5B | 99% | 85% | 87% | 74% |
| Phi-4 Mini | 98% | 88% | 93% | 85% |
| Llama 3.2 1B | 90% | 78% | 77% | 77% |
Qwen 3B, a 1.9 GB model, scores 99% on Kilnx, a language it has never encountered. The same model gets 87% on Django, a framework it has seen millions of times during training.
When you shrink from 7B down to 1B, Kilnx drops 10 points. Node.js drops 16. The simpler grammar holds up better as the model gets dumber.
### Tokens per task (completion only)
| Framework | Qwen 7B | Qwen 3B | Qwen 1.5B | Phi-4 Mini |
|---|---|---|---|---|
| Kilnx | 105 | 112 | 111 | 95 |
| Django | 195 | 226 | 152 | 199 |
| Express | 302 | 349 | 265 | 315 |
| Node.js | 347 | 381 | 333 | 490 |
That's roughly 3x fewer tokens than Express or Node. This is not a style difference; it's the same functionality. A chat websocket in Kilnx is ~110 tokens:
```
socket /chat/:room requires auth
  on connect
    query: select body, author.name, created from chat_message
           where room = :room
           order by created desc
           limit 50
    send history
  on message
    validate
      body: required max 500
    query: insert into chat_message (body, author, room)
           values (:body, :current_user.id, :room)
    broadcast to :room fragment chat-bubble
```
The Express version of the same task runs ~420 tokens of socket.io setup, middleware, database calls, and room management.
## Session economics
Kilnx has a cost that Express doesn't: the grammar reference takes ~3,100 prompt tokens. But that's loaded once per session. The per-task completion cost is what scales.
Cumulative tokens over a real session with Qwen 3B:
| Tasks | Kilnx | Express | Node.js |
|---|---|---|---|
| 1 | 3,269 | 464 | 501 |
| 10 | 4,277 | 4,640 | 5,010 |
| 25 | 5,957 | 11,600 | 12,525 |
| 50 | 8,757 | 23,200 | 25,050 |
| 100 | 14,357 | 46,400 | 50,100 |
Kilnx becomes cheaper than Express at task 9. By the end of a workday (call it 50-100 tasks with a copilot), you've used 71% fewer tokens. If you're paying per token on an API, that's real money. If you're running locally, it's real time.
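The arithmetic behind that crossover is worth making explicit. Here is a minimal cost model, assuming (consistent with the tables above) a one-time ~3,157-token grammar prompt for Kilnx plus the Qwen 3B per-task completion costs, and no fixed overhead for the other stacks:

```python
# Session cost model inferred from the Qwen 3B table above.
KILNX_PROMPT = 3_157    # one-time grammar reference (~3,100 tokens plus overhead)
KILNX_PER_TASK = 112    # completion tokens per task
EXPRESS_PER_TASK = 464
NODE_PER_TASK = 501

def kilnx_cost(n: int) -> int:
    return KILNX_PROMPT + KILNX_PER_TASK * n

def express_cost(n: int) -> int:
    return EXPRESS_PER_TASK * n

# Reproduces the table rows:
print(kilnx_cost(10), express_cost(10))    # 4277 4640
print(kilnx_cost(100), express_cost(100))  # 14357 46400

# First task count where Kilnx is cheaper than Express:
crossover = next(n for n in range(1, 1000) if kilnx_cost(n) < express_cost(n))
print(crossover)  # 9
```

At task 8 the grammar prompt still dominates (4,053 vs 3,712); at task 9 the lower per-task cost takes over and never gives it back.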
## Raw output from the 3B
No editing, no cherry-picking. This is Qwen 2.5 3B (1.9 GB on disk) generating a complete app from scratch, having seen the Kilnx grammar for the first time in the prompt:
```
config
  database: env DATABASE_URL default "sqlite://app.db"
  port: 8080
  secret: env SECRET_KEY required

model task
  title: text required
  done: bool default false
  owner: user required
  created: timestamp auto

auth
  table: user
  identity: email
  password: password
  login: /login
  after login: /tasks

page /tasks layout main requires auth
  query tasks: select id, title, done from task
               where owner = :current_user.id
               order by created desc
               paginate 20
  html
    <input type="search" name="q" placeholder="Search tasks..."
           hx-get="/tasks" hx-trigger="keyup changed delay:300ms"
           hx-target="#task-list">
    <table id="task-list">
      <tr><th>Title</th><th>Done</th><th></th></tr>
      {{each tasks}}
      <tr>
        <td>{title}</td>
        <td>{{if done}}Yes{{end}}</td>
        <td><button hx-post="/tasks/{id}/delete"
                    hx-target="closest tr"
                    hx-swap="outerHTML">Delete</button></td>
      </tr>
      {{end}}
    </table>

action /tasks/create method POST requires auth
  validate task
  query: insert into task (title, owner)
         values (:title, :current_user.id)
  redirect /tasks
```
Auth, pagination, htmx search with debounce, inline delete, form validation. It even added the search input on its own; that wasn't in the prompt.
## Why I think this happens
Express forces the model to make a lot of decisions. CommonJS or ESM? Which middleware in what order? Prisma or Sequelize or raw queries? Passport or express-session or JWT? EJS or Pug or Handlebars? Each fork is a place where a small model can pick wrong.
Kilnx has one way to do each thing. One keyword for auth, one keyword for pages, one for actions. The model doesn't pick between approaches because there's only one approach. The decision space is so small that even a 1B model mostly gets it right.
I don't think this is unique to Kilnx. Any DSL with a tight, regular grammar would probably show the same pattern. The surface area of a language directly predicts how well small models can generate it. I haven't seen anyone optimize for that yet.
## What I'd do with this
If you're an indie dev or a solo founder shipping CRUD apps:
A 3B model running locally gives you 99% accuracy on Kilnx with no API costs, no internet, no privacy concerns. The 7B hits 100%. You don't need to send your code to OpenAI to get a working backend.
If you're using a paid API, the 71% token reduction over a session adds up fast. Especially if you're iterating on features all day.
If you're just curious, the whole language is 27 keywords. You can read the grammar in 10 minutes.
## Links
```sh
curl -fsSL https://raw.githubusercontent.com/kilnx-org/kilnx/main/install.sh | sh
kilnx run app.kilnx
```