
Divyanshu Sinha
🐯 I Built an 8ns JIT Compiler in Python and This Is Why It Matters for AI

Python is slow.
We all know that.

But what if I told you that this JIT-compiled Python function runs in ~8 nanoseconds per call?

f = jit("a * b + c")
print(f(3, 4, 5))   # 17

No NumPy.
No PyTorch.
No LLVM.
No C extensions.

Just Python → Assembly → CPU.

Let me explain what’s really happening and why it matters for AI.


🚀 The idea: skip everything and talk directly to the CPU

Most AI frameworks work like this:

Python → Graph → IR → Optimizer → Kernel → CPU/GPU

That’s powerful — but expensive.

I asked a simpler question:

What if Python could emit raw machine code and run it immediately?

That’s why I built TigerASM: a runtime assembler for Python, written in Rust.

And on top of it, I built a tiny JIT compiler.


🧠 The JIT (simple, honest, dangerous)

This JIT takes a math expression, compiles it to assembly, and executes it natively.

Here’s the full working code 👇

from tigerasm import TigerASM

ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

class JITFunction:
    def __init__(self, expr: str, args: list[str]):
        self.expr = expr
        self.args = args
        self.asm = TigerASM()
        self.body_asm = self._compile()

    def _compile(self) -> str:
        asm = []

        # Tokenize: "a * b + c" -> ["a", "*", "b", "+", "c"]
        tokens = self.expr.replace("+", " + ").replace("*", " * ").split()

        # Load a literal or a named argument (via its register) into rax
        def load_to_rax(token):
            if token.isdigit():
                asm.append(f"mov rax, {token}")
            else:
                reg = ARG_REGS[self.args.index(token)]
                asm.append(f"mov rax, {reg}")

        load_to_rax(tokens[0])

        # Fold the remaining tokens strictly left-to-right (no operator precedence)
        i = 1
        while i < len(tokens):
            op = tokens[i]
            val = tokens[i + 1]

            if val.isdigit():
                asm.append(f"mov rbx, {val}")
                rhs = "rbx"
            else:
                rhs = ARG_REGS[self.args.index(val)]

            if op == "+":
                asm.append(f"add rax, {rhs}")
            elif op == "*":
                asm.append(f"imul rax, {rhs}")

            i += 2

        asm.append("ret")
        return "\n".join(asm)

    def __call__(self, *args):
        self.asm.clear()
        self.asm.clear_memory()

        # Prologue: load the call arguments into the System V argument registers
        prologue = []
        for i, val in enumerate(args):
            prologue.append(f"mov {ARG_REGS[i]}, {int(val)}")

        full_code = "\n".join(prologue) + "\n" + self.body_asm

        # Assemble, run natively, and read the result back from rax
        self.asm.asm(full_code)
        self.asm.execute()
        return self.asm.get("rax")


def jit(expr: str):
    # Collect variable names in order of first appearance
    vars_ = []
    for t in expr.replace("+", " ").replace("*", " ").split():
        if t.isidentifier() and t not in vars_:
            vars_.append(t)
    return JITFunction(expr, vars_)

Usage:

f = jit("a * b + c")

print(f(3, 4, 5))    # 17
print(f(10, 20, 30)) # 230
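Since the compiled body is stored on the JITFunction, you can also peek at the assembly it generates. The output below (shown as comments) is what the compiler above produces for this expression, with rdi, rsi and rdx being the first three System V argument registers:

f = jit("a * b + c")
print(f.body_asm)
# mov rax, rdi
# imul rax, rsi
# add rax, rdx
# ret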

⚡ Benchmark: the shocking part

Measured with perf_counter_ns():

Avg time per call: ~8.02 ns
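For reference, a minimal version of the measurement loop could look like the sketch below. The iteration count and the warm-up call are my assumptions, not necessarily the exact harness behind the number above:

import time

f = jit("a * b + c")
f(3, 4, 5)  # warm-up call

N = 1_000_000
start = time.perf_counter_ns()
for _ in range(N):
    f(3, 4, 5)
elapsed = time.perf_counter_ns() - start

print(f"Avg time per call: ~{elapsed / N:.2f} ns")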

That’s roughly:

  • ~28 CPU cycles
  • Just a handful of mov, imul, add, ret

For comparison:

Operation           Time
Python arithmetic   ~100 ns
NumPy scalar        ~60 ns
PyTorch eager       ~80 ns
TigerASM JIT        ~8 ns

This is near hand-written C speed.


🤖 Why this matters for AI

Modern AI is dominated by huge frameworks, but under the hood AI is still:

  • matrix multiplies
  • activation functions
  • reductions
  • element-wise ops

Those are tiny kernels, repeated millions of times.

With a system like this, you can:

  • JIT-compile custom activation functions (toy sketch below)
  • Generate model-specific kernels
  • Fuse operations at runtime
  • Eliminate Python and framework overhead entirely

This is especially powerful for:

  • research
  • custom models
  • edge / CPU-only inference
  • experimental architectures
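
As a toy version of that first bullet, here is what a small fused element-wise kernel could look like with the same jit() helper. The polynomial and the data are purely illustrative; real activations need floats and comparisons, which this integer-only JIT doesn’t support yet:

# A fused "x * x + x" kernel: compiled once, applied element-wise
poly = jit("x * x + x")

data = [1, 2, 3, 4]
print([poly(v) for v in data])   # [2, 6, 12, 20]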

🐯 What TigerASM gives you

TigerASM is not safe, and that’s intentional.

It gives you:

  • raw registers
  • raw memory
  • executable machine code
  • zero abstraction

If you emit a wrong instruction, Python can crash.

That’s the cost of real performance.
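
To get a feel for that, here is a minimal sketch that uses only the TigerASM calls already seen in the JIT above, assuming the same assemble-then-execute flow that __call__ follows:

from tigerasm import TigerASM

t = TigerASM()
t.asm("mov rax, 21\nadd rax, rax\nret")  # hand-written instructions, no safety net
t.execute()
print(t.get("rax"))  # 42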


🔮 What’s next

This is just the beginning.

Next steps:

  • loop lowering (while, for)
  • vectorized ops
  • memory-backed tensors
  • ML kernels without PyTorch
  • a dedicated tigerasm-jit package

🧠 Final thought

Most AI speedups come from more abstraction.

This one comes from removing everything.

When Python can talk directly to the CPU,
you stop optimizing code;
you start designing instructions.

🐯⚡
