LuaJIT is a better LLM runtime than Python

The title is clickbait, forgive me. You clicked, time to read :D


Not really sure how to start this blog post, so I'm just gonna wing it.

I have 32 GB of RAM, a Ryzen 9 9950X and an RTX 3060.

Not enough to run 14B in Q8 without CPU offload, and definitely not enough to let Python eat half my memory before the model is even loaded. I've been writing Lua for nearly 14 years, and I have a strong taste for tools that don't leave a huge footprint on my system: few packages, little RAM, little noise. Pretty much the opposite of the Python ecosystem.

So while looking for an inference pipeline that fits these constraints, I went through the usual suspects: llama.cpp at the bottom of the stack, llama-cpp-python right above it, and then the holy grail... llama-cpp-lua... nope. Nothing. No serious Lua wrapper, no complete pipeline, just an old abandoned POC rotting in some corner of GitHub.

Strange... because LuaJIT with FFI is exactly the tool you'd pick today if you had to write a performant C binding: near-zero overhead, clean syntax, aggressive JIT.
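
To make the "clean syntax, near-zero overhead" point concrete, this is what a LuaJIT FFI binding to a plain C function looks like. It's a generic libc example, not ion7 code: you declare the signature once and call straight into C, with no glue module in between.

```lua
local ffi = require("ffi")

-- Declare the C signature once; LuaJIT parses it and the JIT compiles
-- calls into what is essentially a direct C call, no wrapper layer.
ffi.cdef[[
  size_t strlen(const char *s);
]]

print(tonumber(ffi.C.strlen("hello from LuaJIT")))  --> 17
```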

Alright. I'm used to painful stacks. AzerothCore, Eluna, WoW modding for... too many years now...

I'll admit though, I wasn't really up for spending 18 hours a day coding a system from scratch, especially since I'm more interested in architecture than raw implementation these days. Lucky for us, we get to enjoy a beautiful tool to speed up the boring parts: AI. So with my buddy Claude and a fair amount of varied prompts, we laid the foundations.

ion7-core

The result is ion7-core: a LuaJIT FFI binding that exposes llama.cpp 1:1 in Lua. 210+ documented functions, zero malloc in the per-token hot path, OpenBLAS plugged in via FFI for matrix math. It's clean — no HTTP, no JSON serialization, no Python between you and the model.

With the base in place, I wanted an honest benchmark. And let me get this out of the way upfront: ion7 does not run models "faster" than Python. It can't: we're calling the exact same library, and a transformer's forward pass is not negotiable; it takes what it takes, end of story.

What is negotiable is everything else: the call cost of binding operations, idle memory pressure, stray allocations. That's where the real gap is.

CPU backend, Ministral 8B Q8, three fresh runs per backend (process restarted every time):

| Metric | ion7 | lcpp native | lcpp-python |
| --- | --- | --- | --- |
| gen_tps (tok/s) | 15.80 | 15.85 | 15.52 |
| peak RSS (MB) | 3,969 | 3,962 | 6,953 |
| detokenize (calls/s) | 7.58 M | 9.71 M | 55.97 k |
| piece (calls/s) | 676 M | 500 M | 2.88 M |
| is_eog (calls/s) | 331 M | 1.00 G | 4.09 M |
| page-faults / 1k tok | 26.8 k | 26.7 k | 185.8 k |

Bench caveats — read before throwing tomatoes
This is a homemade bench, so take it with the usual grain of salt: the methodology has room for improvement, the hardware is a single setup (Ryzen 9 9950X / RTX 3060 / 32 GB DDR5), and I'm open to any critique or repro that would qualify these numbers.

Three runs per backend, fresh process each time, same GGUF file, same prompt, same seed where applicable. Tier-3 micro-bench numbers (detokenize, piece, is_eog) are JIT-warm per-call costs measured at the binding layer.


That said, three trends stand out clearly enough to talk about.

RAM. llama-cpp-python uses noticeably more at peak, about 3 GB more than ion7 on this run, for the same job. And it's not the model: the model lives in the GGUF buffer managed by llama.cpp, identical in all three cases. Those 3 GB are the CPython interpreter, its imports, its ref-counting, its ecosystem. On a Raspberry Pi, in a game mod, in a desktop binary you want to ship, that's a problem.

Binding-layer call cost. Detokenization runs at several million calls per second in LuaJIT versus a few tens of thousands in Python. The order of magnitude is massive (could be human error in there), but it's not magic: Python pays a Python ↔ C string round-trip on every token, plus the GIL, plus ref-counting, every single time. LuaJIT manipulates the C buffer directly through FFI, and the JIT inlines whatever it can.
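
To illustrate why that gap is structural rather than magic, here is the general shape of the buffer-reuse pattern: one C buffer allocated up front, and every call writes into it, so the hot path creates no intermediate string objects. snprintf stands in for the real llama.cpp detokenize call here; this is a sketch of the pattern, not ion7's actual code.

```lua
local ffi = require("ffi")

ffi.cdef[[
  int snprintf(char *str, size_t size, const char *fmt, ...);
]]

-- One C buffer, allocated once and reused for every call. The per-call
-- cost is a plain C call plus ffi.string over the bytes actually written.
local buf = ffi.new("char[?]", 256)

local function demo_piece(s)
  local n = ffi.C.snprintf(buf, 256, "piece:%s", s)
  return ffi.string(buf, n)
end

print(demo_piece("hello"))  --> piece:hello
```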

Page faults. Python racks up noticeably more per 1k tokens, roughly a factor of 7 on this run. That's memory it allocates and frees for nothing, while the GPU watches the train go by.

And the fact that ion7 doesn't beat native lcpp is exactly what you'd expect: same library, just a thinner binding than a Python wrapper. We're not doing miracles here, we're doing well-called C.

I also have a Vulkan bench showing ion7 and native lcpp side by side with the same ratio, but I'm leaving it out of this post on purpose: the llama-cpp-python build I used wasn't compiled with Vulkan support, so comparing it to GPU backends would be unfair. And I don't really like broken charts.

The stack on top

Having llama.cpp accessible from Lua is nice. But naked, it doesn't really do anything useful. So three modules sit on top.

As the French saying goes: "La technique c'est sympathique, mais tout seul... ça pue de la gueule" (roughly: tech is nice, but on its own it stinks).

ion7-llm is the inference loop. Stop conditions, sampling, sliding-context handling, KV cache. Deliberately minimal — no chains, no agents, no framework thinking on your behalf. A clean loop that does its job, and lets you do yours.

ion7-grammar constrains output at the token level. GBNF on the llama.cpp side, exposed cleanly on the Lua side. If you want 100% valid JSON, well-formed SQL, or any formal language respected to the letter, you describe the grammar and the sampler mechanically blocks illegal tokens.
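
For a feel of what that looks like, here is a tiny GBNF grammar (llama.cpp's grammar format) that only accepts a flat JSON object with string keys and string values. The grammar itself is just an illustration I wrote for this post, and I'm leaving out the ion7-grammar call that loads it.

```lua
-- Illustrative GBNF grammar: a flat JSON object with string values only.
local json_object_gbnf = [[
root   ::= "{" ws pair (ws "," ws pair)* ws "}"
pair   ::= string ws ":" ws string
string ::= "\"" [^"]* "\""
ws     ::= [ \t\n]*
]]
-- A sampler compiled from this grammar masks every token that would break
-- the syntax, so the model cannot emit anything but a well-formed object.
```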

ion7-rag stores and retrieves. Dense vector store, FTS5 BM25 on top of SQLite, fusion via RRF (Reciprocal Rank Fusion).
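
RRF itself is only a few lines. Here is a minimal Lua sketch of the fusion step, assuming each retriever hands back a ranked list of document ids; the constant k = 60 is the value commonly used in the literature, not necessarily what ion7-rag ships with.

```lua
-- Reciprocal Rank Fusion: each ranked list contributes 1 / (k + rank) to a
-- document's score, and documents are re-sorted by the summed score.
local function rrf(ranked_lists, k)
  k = k or 60
  local scores = {}
  for _, list in ipairs(ranked_lists) do
    for rank, doc_id in ipairs(list) do
      scores[doc_id] = (scores[doc_id] or 0) + 1 / (k + rank)
    end
  end
  local fused = {}
  for doc_id, score in pairs(scores) do
    fused[#fused + 1] = { id = doc_id, score = score }
  end
  table.sort(fused, function(a, b) return a.score > b.score end)
  return fused
end

-- Example: fuse a dense-vector ranking with a BM25 ranking.
local fused = rrf({
  { "doc_a", "doc_b", "doc_c" },  -- dense ranking
  { "doc_a", "doc_c", "doc_b" },  -- BM25 ranking
})
print(fused[1].id)  --> doc_a (near the top of both lists)
```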

The four modules together form a complete pipeline: you load a model, you talk to it, you constrain it, you give it memory. In Lua. Without Python.

And then what?

The pipeline was running. Now I needed something to plug it into.

I love Cyberpunk 2077. Genuinely, this game is great. And it has an interesting quirk: CET (Cyber Engine Tweaks) exposes a Lua runtime inside the game's process. And ion7 is in Lua. So.

I packaged the pipeline as a mod, plugged in Ministral 8B Q8 with GPU-CPU offload (I'm not insane), and fed it the entire body of Judy Alvarez's text pulled from the game archives — dialogue, messages, journals.

A few hours of friction on the usual integration points (blocking loops, threading, async call handling on the CET side).

And it works. Pretty well, actually. Latency is fine, the model stays in character, and the whole stack runs inside the game process. No inference server on the side, no WebSocket to a local endpoint, nothing. The LLM is inside Cyberpunk.

Who's it for?

I don't really have a conclusion, never been a fan of those.

If you're looking for an LLM pipeline that weighs as little as possible on the machine — game modding, edge deployment, inference on a Raspberry Pi, embedded in a desktop binary, editor plugins, that kind of thing — give ion7 a shot.

It's in Lua, it's small, it runs, and it doesn't require a Python environment that takes three coffees to start up.

Code is here: ion7-labs.github.io

Feel free to send feedback, on the code or on the project itself. I'll take all of it.


Originally written in French. English version translated with Claude.
