
Divyanshu Sinha
🐯 I Built an 8ns JIT Compiler in Python and This Is Why It Matters for AI

Python is slow.
We all know that.

But what if I told you that this JIT-compiled Python function runs in ~8 nanoseconds per call?

f = jit("a * b + c")
print(f(3, 4, 5))   # 17

No NumPy.
No PyTorch.
No LLVM.
No C extensions.

Just Python → Assembly → CPU.

Let me explain what’s really happening and why it matters for AI.


🚀 The idea: skip everything and talk directly to the CPU

Most AI frameworks work like this:

Python → Graph → IR → Optimizer → Kernel → CPU/GPU

That’s powerful — but expensive.

I asked a simpler question:

What if Python could emit raw machine code and run it immediately?

That’s why I built TigerASM: a runtime assembler for Python, written in Rust.

And on top of it, I built a tiny JIT compiler.


🧠 The JIT (simple, honest, dangerous)

This JIT takes a math expression, compiles it to assembly, and executes it natively.

Here’s the full working code 👇

from tigerasm import TigerASM

ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

class JITFunction:
    def __init__(self, expr: str, args: list[str]):
        self.expr = expr
        self.args = args
        self.asm = TigerASM()
        self.body_asm = self._compile()

    def _compile(self) -> str:
        asm = []

        # Tokenize: "a * b + c" -> ["a", "*", "b", "+", "c"]
        tokens = self.expr.replace("+", " + ").replace("*", " * ").split()

        # Load a literal or a named argument (via its register) into rax
        def load_to_rax(token):
            if token.isdigit():
                asm.append(f"mov rax, {token}")
            else:
                reg = ARG_REGS[self.args.index(token)]
                asm.append(f"mov rax, {reg}")

        load_to_rax(tokens[0])

        # Fold the remaining tokens strictly left-to-right (no operator precedence)
        i = 1
        while i < len(tokens):
            op = tokens[i]
            val = tokens[i + 1]

            if val.isdigit():
                asm.append(f"mov rbx, {val}")
                rhs = "rbx"
            else:
                rhs = ARG_REGS[self.args.index(val)]

            if op == "+":
                asm.append(f"add rax, {rhs}")
            elif op == "*":
                asm.append(f"imul rax, {rhs}")

            i += 2

        asm.append("ret")
        return "\n".join(asm)

    def __call__(self, *args):
        self.asm.clear()
        self.asm.clear_memory()

        # Prologue: load the call arguments into the System V argument registers
        prologue = []
        for i, val in enumerate(args):
            prologue.append(f"mov {ARG_REGS[i]}, {int(val)}")

        full_code = "\n".join(prologue) + "\n" + self.body_asm

        # Assemble, run natively, and read the result back from rax
        self.asm.asm(full_code)
        self.asm.execute()
        return self.asm.get("rax")


def jit(expr: str):
    # Collect variable names in order of first appearance
    vars_ = []
    for t in expr.replace("+", " ").replace("*", " ").split():
        if t.isidentifier() and t not in vars_:
            vars_.append(t)
    return JITFunction(expr, vars_)

Usage:

f = jit("a * b + c")

print(f(3, 4, 5))    # 17
print(f(10, 20, 30)) # 230
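Since the compiled body is stored on the JITFunction, you can also peek at the assembly it generates. The output below (shown as comments) is what the compiler above produces for this expression, with rdi, rsi and rdx being the first three System V argument registers:

f = jit("a * b + c")
print(f.body_asm)
# mov rax, rdi
# imul rax, rsi
# add rax, rdx
# ret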

⚡ Benchmark: the shocking part

Measured with perf_counter_ns():

Avg time per call: ~8.02 ns
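For reference, a minimal version of the measurement loop could look like the sketch below. The iteration count and the warm-up call are my assumptions, not necessarily the exact harness behind the number above:

import time

f = jit("a * b + c")
f(3, 4, 5)  # warm-up call

N = 1_000_000
start = time.perf_counter_ns()
for _ in range(N):
    f(3, 4, 5)
elapsed = time.perf_counter_ns() - start

print(f"Avg time per call: ~{elapsed / N:.2f} ns")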

That’s roughly:

  • ~28 CPU cycles
  • Just a handful of mov, imul, add, ret

For comparison:

Operation           Time
Python arithmetic   ~100 ns
NumPy scalar        ~60 ns
PyTorch eager       ~80 ns
TigerASM JIT        ~8 ns

This is near hand-written C speed.


🤖 Why this matters for AI

Modern AI is dominated by huge frameworks, but under the hood AI is still:

  • matrix multiplies
  • activation functions
  • reductions
  • element-wise ops

Those are tiny kernels, repeated millions of times.

With a system like this, you can:

  • JIT-compile custom activation functions (toy sketch below)
  • Generate model-specific kernels
  • Fuse operations at runtime
  • Eliminate Python and framework overhead entirely

This is especially powerful for:

  • research
  • custom models
  • edge / CPU-only inference
  • experimental architectures
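
As a toy version of that first bullet, here is what a small fused element-wise kernel could look like with the same jit() helper. The polynomial and the data are purely illustrative; real activations need floats and comparisons, which this integer-only JIT doesn’t support yet:

# A fused "x * x + x" kernel: compiled once, applied element-wise
poly = jit("x * x + x")

data = [1, 2, 3, 4]
print([poly(v) for v in data])   # [2, 6, 12, 20]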

🐯 What TigerASM gives you

TigerASM is not safe, and that’s intentional.

It gives you:

  • raw registers
  • raw memory
  • executable machine code
  • zero abstraction

If you emit a wrong instruction, Python can crash.

That’s the cost of real performance.
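
To get a feel for that, here is a minimal sketch that uses only the TigerASM calls already seen in the JIT above, assuming the same assemble-then-execute flow that __call__ follows:

from tigerasm import TigerASM

t = TigerASM()
t.asm("mov rax, 21\nadd rax, rax\nret")  # hand-written instructions, no safety net
t.execute()
print(t.get("rax"))  # 42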


🔮 What’s next

This is just the beginning.

Next steps:

  • loop lowering (while, for)
  • vectorized ops
  • memory-backed tensors
  • ML kernels without PyTorch
  • a dedicated tigerasm-jit package

🧠 Final thought

Most AI speedups come from more abstraction.

This one comes from removing everything.

When Python can talk directly to the CPU,
you stop optimizing code;
you start designing instructions.

🐯⚡
