
James Randall

Posted on • Originally published at jamesdrandall.com

Teaching an LLM to Write Assembly: GBNF-Constrained Generation for a Custom 8-Bit CPU

Over the past few weeks I’ve been building a fully-playable 8-bit virtual console from scratch: CPU, instruction set, assembler, sprite system, IDE, the lot. One of the more interesting side quests has been teaching an LLM to generate valid assembly targeting my CPU. You can follow the full build in my YouTube series here. It's called Building a Virtual 8-Bit Console with an AI Assistant; we're up to about part 14 and the series is still ongoing. The source code for the project can be found on GitHub here, snapshotted at the tag where I added this work.

If you’ve ever tried asking a model to generate domain-specific code, you’ll know what happens: it writes things that look right, but your parser throws it back at you in disgust.

I wanted something different. I needed this to work. I wanted the model to only produce tokens that my assembler would accept.

That led me into the world of GBNF — a compact grammar notation supported by llama.cpp and other inference runtimes, which lets you force the model to emit only syntactically valid output.

In this post I’m going to walk through:

  • What GBNF actually is
  • How to design a grammar for an assembly-like DSL
  • How to plug it into llama.cpp for constrained generation
  • What worked, what failed, and what surprised me
  • How it performs when generating real code for a real CPU

And it's worth emphasising that this isn’t a theoretical post. It's the exact approach I’m using in my IDE to let developers (or AIs) write programs for the console I’m building.

Even if you're not generating code for a fictional CPU, the principles here apply to any domain where you need guaranteed-valid output: config files, structured data, DSLs, test scripts, game engines, or anything brittle.

Why We Need Grammar-Constrained Generation

LLMs are pattern machines and predictors, not parsers. They’re very good at capturing intent (“draw a sprite at (10,10)”), but surprisingly bad at following the exact syntax rules of a DSL or assembly language — especially a brand-new one they’ve never seen before.

When I first experimented with generating code for my CPU, I tried Qwen locally without any grammar constraints. It was… catastrophic. The model produced:

  • hallucinated opcodes
  • invented addressing modes
  • malformed instructions
  • missing commas and stray punctuation
  • registers that don’t exist
  • syntax that looked plausibly “assembly-ish” but was entirely useless

Claude Sonnet 4.5 did a much better job, and with some prompt tweaking it produced mostly-correct code — but even then it still made small syntactic mistakes that would immediately break the pipeline.

And assembly is brittle:

  • one stray comma and everything breaks
  • an unknown opcode and the assembler bails
  • a missing operand and the whole program collapses

If you simply ask an LLM to “write some assembly,” particularly for an instruction set it cannot have been trained on because you literally just invented it, you get output that looks plausible but fails the moment you feed it into a real assembler.

Grammar-constrained generation changes the game.

Instead of cleaning up a mess afterwards, you make the mess impossible by ensuring the model can only emit legal token sequences. You remove an entire class of errors up front.

You still have to worry about semantics (whether the code actually does the right thing — I’ll cover that in a future post), but at least you’re no longer fighting the syntax.

Yes, what I'm doing here is a slightly off-the-wall demo — generating assembly for a fictional CPU — but the problem generalises. If you’ve ever tried generating anything complex, brittle, and highly specific, you’ve probably hit the same wall. And even if you can coax a bigger model like Claude into doing this cleanly, grammar-constrained generation lets you drop down to smaller, cheaper, local models while still getting reliable, structured output.


What GBNF Is and Why It Works

GBNF is a small grammar format that describes which token sequences are valid in your language. Several inference runtimes (including llama.cpp and vLLM) can use it during decoding to constrain what the model is allowed to output. It's worth pointing out that other runtimes use different mechanisms, but the fundamental approach is the same; it's mostly the expression of the grammar that differs.

At a high level:

  • you describe your language using GBNF: rules, terminals, non-terminals
  • the runtime loads this grammar
  • during generation, it masks out any tokens that would violate the grammar
  • the LLM is effectively forced to walk only along valid paths in your language

This does not magically make the model understand your DSL or, in my case, assembly. It simply means:

  • if your grammar says opcodes are LOAD | STORE | ADD | SUB,
  • the model cannot invent LOADX or SUBTRACT or MOVE.
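To make the masking idea concrete, here's a toy TypeScript sketch of how a constrained decoder filters candidates. It works character by character for simplicity (real runtimes like llama.cpp mask at the token level against the full grammar); the opcode list is the one from the bullets above.

```typescript
// Toy model of grammar masking: given the output so far, only characters
// that keep it a valid prefix of some opcode stay allowed; everything else
// is masked out before sampling. Real runtimes do this per token, per rule.
const OPCODES = ["LOAD", "STORE", "ADD", "SUB"];

function allowedNextChars(generated: string): Set<string> {
  const allowed = new Set<string>();
  for (const op of OPCODES) {
    if (op.startsWith(generated) && generated.length < op.length) {
      allowed.add(op[generated.length]);
    }
  }
  return allowed;
}
```

After emitting `LOA`, the only legal continuation is `D`; and once `LOAD` is complete nothing may follow it within this rule, so `LOADX` simply cannot be sampled.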

Here's a simplified excerpt from my actual grammar:

root ::= line*
line ::= ws* statement? comment? eol

statement ::= instruction | directive
instruction ::= opcode-noarg
              | opcode-single ws+ operand
              | opcode-double ws+ operand ws* "," ws* operand

opcode-noarg ::= "NOP"i | "RET"i | "RTI"i | "SEI"i | "CLI"i
opcode-single ::= "PUSH"i | "POP"i | "INC"i | "DEC"i | "JMP"i | "CALL"i
opcode-double ::= "LD"i | "ST"i | "MOV"i | "ADD"i | "SUB"i | "AND"i | "OR"i | "XOR"i | "CMP"i

operand ::= immediate | register | memory-ref | identifier
immediate ::= "#" (number | identifier)
register ::= "R"i [0-5] | "SP"i | "PC"i
memory-ref ::= "[" ws* (number | identifier) ws* "]"

number ::= "$" [0-9a-fA-F]+ | [0-9]+
identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
comment ::= ws* ";" [^\r\n]*
ws ::= [ \t]+
eol ::= "\r"? "\n" | "\r"

The "i" suffix makes matching case-insensitive, so LD, ld, and Ld are all valid. The important part is that the grammar mirrors exactly what my assembler expects to see.

GBNF sits in a sweet spot: simpler than a full PEG/EBNF parser, but expressive enough to model a real assembly language.
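To show what the excerpt actually admits, here's a hedged TypeScript sketch of the two-operand instruction rule expressed as a regex. It mirrors the `opcode-double`, `operand`, and `comment` rules above, but it's an illustration of the grammar's shape, not my assembler's actual parser.

```typescript
// Operand alternatives from the grammar: #immediate, register, [memory-ref],
// or bare identifier. Hex numbers use the $ prefix, as in the GBNF excerpt.
const OPERAND = String.raw`(?:#(?:\$[0-9a-fA-F]+|\d+|[a-zA-Z_]\w*)|R[0-5]|SP|PC|\[\s*(?:\$[0-9a-fA-F]+|\d+|[a-zA-Z_]\w*)\s*\]|[a-zA-Z_]\w*)`;

// opcode-double ws+ operand ws* "," ws* operand, plus an optional comment.
// The "i" flag plays the role of GBNF's "i" suffix (case-insensitive).
const DOUBLE = new RegExp(
  String.raw`^\s*(?:LD|ST|MOV|ADD|SUB|AND|OR|XOR|CMP)\s+${OPERAND}\s*,\s*${OPERAND}\s*(?:;[^\r\n]*)?$`,
  "i"
);
```

`LD R0, #42` and `st r1, [VIDEO_MODE]` both match; `LOADX R0, #42` and a line with a missing comma do not, which is exactly the class of error the grammar rules out.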


Designing a Grammar for a Real Instruction Set

I've been taking a spec-first approach to the 8-bit console, so I already had comprehensive specifications for the assembler and the CPU. I also had an AI cheatsheet that I'd been feeding to Claude to help it understand the assembler and hardware, plus a good set of working example code that I'd already written.

And so a great way to design the grammar was to start with the things I had.

In practice, that meant giving the source material to Claude, having it draft the grammar, and then reviewing the result myself.

The resulting grammar file can be found here.

This came out pretty much spot on first time and I was able to review it line by line and validate it against my examples.


Using GBNF with llama.cpp

Once you have a grammar file, wiring it into llama.cpp is straightforward. The llama.cpp server exposes a /completion endpoint that accepts a grammar parameter containing your GBNF grammar as a string.

Here's the actual TypeScript code I'm using in the IDE:

async generateWithGrammar(prompt: string, grammar: string): Promise<string> {
  const response = await fetch(`${this.codegenUrl}/completion`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      prompt,
      n_predict: 4096,
      grammar,
      temperature: 0.7,
      stop: ['\n\n\n'],
    }),
  });

  if (!response.ok) {
    const errorText = await response.text();
    throw new Error(`llama.cpp grammar generation failed: ${response.status} ${errorText}`);
  }

  const result = await response.json();
  return result.content;
}

The key parameters:

  • prompt — the full prompt including any context and instructions
  • grammar — the GBNF grammar as a string (I load the file and pass its contents)
  • n_predict — maximum tokens to generate (4096 gives plenty of room for longer programs)
  • temperature — I use 0.7 for some creativity while staying coherent
  • stop — I use triple newlines as a stop sequence to end generation cleanly
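Since the request body is plain JSON, it's easy to build and inspect before sending. A small sketch mirroring the call above; the field names are the llama.cpp server's /completion parameters, but this helper function itself is hypothetical:

```typescript
// Build the /completion request body without the network call, so the
// payload (including the grammar string) can be logged or inspected first.
function buildCompletionBody(prompt: string, grammar: string): string {
  return JSON.stringify({
    prompt,
    n_predict: 4096, // room for longer programs
    grammar, // GBNF grammar file contents, passed as a plain string
    temperature: 0.7,
    stop: ["\n\n\n"], // triple newline ends generation cleanly
  });
}
```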

The interesting moment is the first time the model produces perfectly valid assembly on its first attempt, because the grammar has done the heavy lifting. After using Qwen without constrained generation and seeing an absolute mass of errors, it was genuinely revelatory to see perfectly valid assembly generated by the same model.


Where GBNF Helps (and Where It Doesn't)

So while grammar-constrained generation is powerful, it's important to be clear about what it doesn't buy you.

Where GBNF helps

Syntactic correctness — This is the big one. Every line the model emits will parse. No more LOADX when you only have LD. No more forgetting commas between operands. No more invented addressing modes like [R0+R1] when your CPU doesn't support indexed addressing. The grammar makes these errors structurally impossible. Of course I could have errors in my grammar.... turtles all the way down.

Preventing hallucinated opcodes — Without constraints, models love to invent plausible-sounding instructions. I saw Qwen generate SUBTRACT, MOVE, JUMP, and LOADB — none of which exist in my instruction set. My CPU is 6502-inspired with a RISC twist, and I also saw it using 6502 opcodes like LDA and STX. With the grammar in place, the model can only pick from the opcodes I've explicitly defined: LD, ST, ADD, SUB, and so on.

Proper operand structure — My CPU has specific rules: immediates start with #, memory references use square brackets, hex numbers use $ prefix. The grammar enforces all of this. You can't accidentally write LD R0, 42 when you meant LD R0, #42. Qwen with grammar does a better job than Claude without in this regard; I find Claude trips up over addressing mode structures.

Valid registers and directives — I only have R0 through R5, plus SP and PC. The grammar won't let the model reference R6 or AX or any other register that doesn't exist. Same for assembler directives — only .org, .db, .dw, and the others I've defined.

Consistent formatting — The output follows a predictable structure. Comments always start with ;. Labels always end with :. Whitespace is handled consistently. This might seem minor, but it makes the generated code much easier to read and debug.

Where GBNF doesn't help

Semantic correctness — The grammar ensures you write valid assembly, not correct assembly. If you ask the model to add two numbers and it subtracts them instead, the grammar won't catch that. SUB R0, R1, R2 is perfectly valid syntax even when you wanted ADD.

Algorithmic quality — The model might generate a working sprite-drawing routine that's absurdly inefficient — using a loop where a single instruction would do, or loading the same value from memory repeatedly instead of keeping it in a register. The grammar has no opinion on this.

Efficiency — Related to the above: the model doesn't understand your CPU's pipeline, how many cycles each instruction takes, or what operations are expensive. It might generate code that works but runs slowly. For my virtual console this matters less than for real hardware, but it's still a limitation.

Calling conventions — If your architecture has conventions about which registers are caller-saved versus callee-saved, or how to pass arguments, the grammar won't enforce them. The model might trash R5 in a subroutine without saving it first, breaking code that calls it.

Higher-level logic — The model doesn't actually understand what your program is supposed to do. It's pattern-matching based on your prompt and its training. It might write syntactically perfect assembly that does something completely different from what you asked for.

The bottom line

GBNF is a guardrail, not a full program verifier. It solves the "will this even assemble?" problem completely, which is genuinely valuable — you eliminate an entire class of errors up front. But it leaves the "does this code actually work?" problem entirely to you (or to an agentic feedback loop, which I'll discuss in a later section).


Real Results (What the Model Actually Produced)

Let's look at what the model actually generates. I asked it to write a simple program: clear the screen and draw a red square (16x16) in the center of the screen. This is a good test because it requires understanding the video memory layout, coordinate calculations, and the 4bpp pixel packing format.

Here's the exact output from Qwen with grammar constraints:

; Clear Screen and Draw Red Square Example
; Demonstrates clear screen and pixel drawing subroutine in Video Mode 0 (256x160 @ 4bpp)
;
; Subroutine: draw_pixel
;   Inputs: R0 = X coordinate, R1 = Y coordinate, R2 = color (palette index)
;   Draws a single pixel at (X,Y) with the specified color

.org $0B80

; Constants
.define VIDEO_MODE $0101
.define PALETTE_RAM $0200
.define FRAMEBUFFER_START $B000

; Entry point
main:
    ; Set video mode to 0 (256x160 @ 4bpp)
    LD R0, #0
    ST R0, [VIDEO_MODE]

    ; Clear screen to black (color 0)
    CALL clear_screen

    ; Draw 16x16 red square at the center (X=112, Y=72)
    LD R2, #2          ; Color 2 (red)
    LD R1, #72
    LD R3, #16         ; Loop counter for y-axis
.y_loop:
    LD R0, #112        ; X = 112
    LD R4, #16         ; Loop counter for x-axis
.x_loop:
    CALL draw_pixel
    INC R0
    DEC R4
    BRNZ .x_loop
    INC R1
    DEC R3
    BRNZ .y_loop

    ; Infinite loop - program done
done:
    JMP done

; Subroutine: Clear screen to color 0
; Uses register pairs to iterate through framebuffer
clear_screen:
    PUSH R0
    PUSH R2
    PUSH R3
    PUSH R4

    ; R2:R3 will be the address pointer
    LD R2, #$B0        ; High byte of framebuffer start
    LD R3, #$00        ; Low byte
    LD R4, #$50        ; 80 pages (0x5000 bytes = 20480 bytes for mode 0)
    LD R0, #0          ; Color to write (0 = black)

.outer:
.inner:
    ST R0, [R2:R3]     ; Write byte at R2:R3
    INC R3             ; Increment low byte
    BRNZ .inner        ; Loop until R3 wraps to 0

    INC R2             ; Increment high byte (next page)
    DEC R4             ; Decrement page counter
    BRNZ .outer        ; Loop until all pages done

    POP R4
    POP R3
    POP R2
    POP R0
    RET

; Subroutine: Draw pixel at (X, Y) with color
; Inputs: R0 = X coordinate (0-255)
;         R1 = Y coordinate (0-159)
;         R2 = color (palette index 0-15)
;
; Mode 0 is 4bpp, so 2 pixels per byte:
;   - Even X: high nibble
;   - Odd X: low nibble
; Address calculation: 0xB000 + (Y * 128) + (X / 2)
draw_pixel:
    PUSH R0
    PUSH R1
    PUSH R2
    PUSH R3
    PUSH R4
    PUSH R5

    ; Bounds check Y
    CMP R1, #160
    BRC .exit          ; Exit if Y >= 160

    ; Calculate address: 0xB000 + (Y * 128) + (X / 2)
    ; Y * 128 = Y << 7
    ; We'll use R3:R4 as a 16-bit value for the address offset

    LD R3, #0          ; R3 = high byte of offset (initially 0)
    MOV R4, R1         ; R4 = low byte (Y value)

    ; Shift R3:R4 left by 7 positions (multiply Y by 128)
    SHL R4
    ROL R3
    SHL R4
    ROL R3
    SHL R4
    ROL R3
    SHL R4
    ROL R3
    SHL R4
    ROL R3
    SHL R4
    ROL R3
    SHL R4
    ROL R3

    ; Add X/2 to the low byte
    MOV R5, R0         ; Copy X to R5
    SHR R5             ; R5 = X / 2
    ADD R4, R5         ; Add to low byte
    BRC .carry         ; Check for carry
    JMP .no_carry

.carry:
    INC R3             ; Propagate carry to high byte

.no_carry:
    ; Add framebuffer base address (0xB000) to high byte
    ADD R3, #$B0

    ; Now R3:R4 contains the final address
    ; Load the current byte at this address
    LD R5, [R3:R4]

    ; Check if X is even or odd
    AND R0, #1         ; R0 = X & 1 (0 if even, 1 if odd)
    BRNZ .odd_pixel

    ; Even pixel: modify high nibble
    AND R5, #$0F       ; Clear high nibble, keep low nibble
    SHL R2, #4         ; Shift color to high nibble
    OR R5, R2          ; Combine
    JMP .write_byte

.odd_pixel:
    ; Odd pixel: modify low nibble
    AND R5, #$F0       ; Clear low nibble, keep high nibble
    AND R2, #$0F       ; Ensure color is only in low nibble
    OR R5, R2          ; Combine

.write_byte:
    ; Write the modified byte back
    ST R5, [R3:R4]

.exit:
    POP R5
    POP R4
    POP R3
    POP R2
    POP R1
    POP R0
    RET

What the model got right

Perfect syntax — Every single line assembles without error. The grammar did its job: no hallucinated opcodes, no malformed operands, no missing commas. This is the same model that was generating SUBTRACT and LDA before I added constraints. I'm actually yet to see any assembly syntax errors.

Sensible structure — The code follows good assembly conventions: constants defined at the top, a clear entry point, subroutine with proper prologue/epilogue (push/pop all used registers), comments explaining the logic.

Correct algorithm — The pixel-drawing logic is actually sound. It correctly calculates the framebuffer address using Y * 128 + X / 2, handles the 4bpp packing (even pixels in high nibble, odd in low), and preserves the other pixel in the byte when writing.

Hardware-aware — The model understood that my console uses memory-mapped video at $B000, that mode 0 is 256x160 at 4 bits per pixel, and that the palette index for red is 2. All of this came from the context I provided in the prompt which includes the AI cheatsheet I shared earlier.
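The address and packing arithmetic the generated routine relies on is easy to sanity-check on the host. A TypeScript sketch of the same formulas (mode 0: 256x160 @ 4bpp, framebuffer at $B000, 128 bytes per row); this is verification scaffolding, not console code:

```typescript
const FRAMEBUFFER_START = 0xb000;

// Address calculation from the generated code: $B000 + (Y * 128) + (X / 2)
function pixelAddress(x: number, y: number): number {
  return FRAMEBUFFER_START + y * 128 + (x >> 1);
}

// 4bpp packing: 2 pixels per byte, even X in the high nibble, odd X in the
// low nibble; the other pixel's nibble is preserved, as the routine does.
function packPixel(existing: number, x: number, color: number): number {
  return x % 2 === 0
    ? (existing & 0x0f) | ((color & 0x0f) << 4)
    : (existing & 0xf0) | (color & 0x0f);
}
```

For the square's first pixel at (112, 72), this lands 9,272 bytes into the framebuffer with the color in the high nibble, matching what the assembly computes with its shift-and-add sequence.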

What surprised me

Color selection — I didn't expect it to find the red color. The system uses custom palettes and although there is a default, I'm surprised it found this. Possibly from an example, or possibly chance!

Screen clearing (or not) — If I generate the code multiple times, sometimes it will include code for clearing the screen and sometimes not. If I ask it to add screen-clearing code, it successfully does so.

Did it actually work?

Yes. I loaded this into the console's IDE, assembled it, and ran it. Or rather, my agentic chat assistant did all that for me. A red square appeared dead center of the screen, around (128, 80). No modifications required. When I added screen clearing this worked too.

[Screenshot: the red square drawn by the AI-generated assembly code]

This is the thing that still surprises me: a (comparatively) small model running locally, with the right constraints and context, can generate working assembly for a CPU that didn't exist until I built it.


Next Steps: Combining GBNF with Agentic Behaviours

I've not really covered this here, but I've already combined this model with agentic behaviour in my console's IDE, which can:

  • Assemble and check for errors (not yet had any of those!)
  • Access a library of example programs
  • Run the assembled program
  • Pause the CPU and inspect the registers and memory
  • Capture a screenshot and inspect it

The agent is exposed through a chat interface. Oddly enough I don't want the chat talking to me in opcodes, so we only use the grammar when we're doing code generation. Because I'm running this locally I'm also using different models for chat and code generation; the code generation model has a bigger context window, as I need to pass it the cheat sheet and other materials describing the surrounding hardware.

The library of example programs has proven essential to the LLM's ability to generate useful assembly code. I see it both copying subsections of the examples verbatim and essentially creating derived works.

Combined with the agentic tools, the constrained generation starts to feel like a real assistant rather than a suggestion generator.


Closing

GBNF doesn’t make an LLM understand your language — but it forces it to follow the rules. For a custom 8-bit console, that’s exactly what you need. And if you're working with agentic systems that need to generate precise outputs, it's probably what you need too.

It’s a satisfying intersection of old-school compiler techniques and modern LLM tooling.
