Python is slow.
We all know that.
But what if I told you I just ran this JIT-compiled Python function in ~8 nanoseconds per call?
```python
f = jit("a * b + c")
print(f(3, 4, 5))  # 17
```
No NumPy.
No PyTorch.
No LLVM.
No C extensions.
Just Python → Assembly → CPU.
Let me explain what’s really happening and why this is very relevant to AI.
🚀 The idea: skip everything and talk directly to the CPU
Most AI frameworks work like this:
Python → Graph → IR → Optimizer → Kernel → CPU/GPU
That’s powerful — but expensive.
I asked a simpler question:
What if Python could emit raw machine code and run it immediately?
That’s why I built TigerASM: a runtime assembler for Python, written in Rust.
And on top of it, I built a tiny JIT compiler.
🧠 The JIT (simple, honest, dangerous)
This JIT takes a math expression, compiles it to assembly, and executes it natively.
Here’s the full working code 👇
```python
from tigerasm import TigerASM

# System V AMD64 integer argument registers, in calling-convention order
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]


class JITFunction:
    def __init__(self, expr: str, args: list[str]):
        self.expr = expr
        self.args = args
        self.asm = TigerASM()
        self.body_asm = self._compile()

    def _compile(self) -> str:
        asm = []
        tokens = self.expr.replace("+", " + ").replace("*", " * ").split()

        def load_to_rax(token):
            if token.isdigit():
                asm.append(f"mov rax, {token}")
            else:
                reg = ARG_REGS[self.args.index(token)]
                asm.append(f"mov rax, {reg}")

        # rax holds the running result; evaluation is strictly left to right
        load_to_rax(tokens[0])
        i = 1
        while i < len(tokens):
            op = tokens[i]
            val = tokens[i + 1]
            if val.isdigit():
                asm.append(f"mov rbx, {val}")
                rhs = "rbx"
            else:
                rhs = ARG_REGS[self.args.index(val)]
            if op == "+":
                asm.append(f"add rax, {rhs}")
            elif op == "*":
                asm.append(f"imul rax, {rhs}")
            i += 2

        asm.append("ret")
        return "\n".join(asm)

    def __call__(self, *args):
        self.asm.clear()
        self.asm.clear_memory()
        # Load each call argument into its argument register, then run the body
        prologue = []
        for i, val in enumerate(args):
            prologue.append(f"mov {ARG_REGS[i]}, {int(val)}")
        full_code = "\n".join(prologue) + "\n" + self.body_asm
        self.asm.asm(full_code)
        self.asm.execute()
        return self.asm.get("rax")


def jit(expr: str):
    # Collect variable names in order of first appearance
    vars_ = []
    for t in expr.replace("+", " ").replace("*", " ").split():
        if t.isidentifier() and t not in vars_:
            vars_.append(t)
    return JITFunction(expr, vars_)
```
Usage:

```python
f = jit("a * b + c")
print(f(3, 4, 5))     # 17
print(f(10, 20, 30))  # 230
```
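Out of curiosity, you can also peek at the assembly body the compiler emits for `a * b + c` by printing `body_asm`:

```python
f = jit("a * b + c")
print(f.body_asm)
```

which prints:

```
mov rax, rdi
imul rax, rsi
add rax, rdx
ret
```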
⚡ Benchmark: the shocking part
Measured with perf_counter_ns():
Avg time per call: ~8.02 ns
That’s roughly:
- ~28 CPU cycles
- Just a handful of `mov`, `imul`, `add`, `ret`
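A timing harness along these lines reproduces the idea; the iteration count, warm-up, and timed region here are illustrative, not the exact script behind the number above:

```python
import time

f = jit("a * b + c")
f(3, 4, 5)  # warm-up call so first-run overhead isn't timed

N = 1_000_000
start = time.perf_counter_ns()
for _ in range(N):
    f(3, 4, 5)
elapsed = time.perf_counter_ns() - start

print(f"avg per call: {elapsed / N:.2f} ns")
```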
For comparison:
| Operation | Approx. time per call |
|---|---|
| Python arithmetic | ~100 ns |
| NumPy scalar | ~60 ns |
| PyTorch eager | ~80 ns |
| TigerASM JIT | ~8 ns |
This is near hand-written C speed.
🤖 Why this matters for AI
Modern AI is dominated by huge frameworks, but under the hood AI is still:
- matrix multiplies
- activation functions
- reductions
- element-wise ops
Those are tiny kernels, repeated millions of times.
With a system like this, you can:
- JIT-compile custom activation functions
- Generate model-specific kernels
- Fuse operations at runtime (see the sketch after this list)
- Eliminate Python and framework overhead entirely
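For instance, staying inside the tiny expression language the JIT above already understands (integer `+` and `*` only), you can fuse a scale-and-shift into a single compiled kernel. The names here are purely illustrative:

```python
# Hypothetical fused scale-and-shift kernel built with the jit() defined above.
# Only + and * are supported so far, so this stays within that subset.
scale_shift = jit("x * w + b")

print(scale_shift(7, 3, 2))    # 23
print(scale_shift(128, 2, 1))  # 257
```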
This is especially powerful for:
- research
- custom models
- edge / CPU-only inference
- experimental architectures
🐯 What TigerASM gives you
TigerASM is not safe, and that’s intentional.
It gives you:
- raw registers
- raw memory
- executable machine code
- zero abstraction
If you emit a wrong instruction, Python can crash.
That’s the cost of real performance.
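To make that concrete, here’s what driving TigerASM directly looks like, using only the calls already exercised by the JIT above (`asm()`, `execute()`, `get()`); treat it as a sketch of that usage, not full API documentation:

```python
from tigerasm import TigerASM

# Mirrors the calls made inside JITFunction.__call__ above.
asm = TigerASM()
code = "\n".join([
    "mov rax, 6",
    "mov rbx, 7",
    "imul rax, rbx",
    "ret",
])
asm.asm(code)
asm.execute()
print(asm.get("rax"))  # 42
```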
🔮 What’s next
This is just the beginning.
Next steps:
- loop lowering (`while`, `for`)
- vectorized ops
- memory-backed tensors
- ML kernels without PyTorch
- a dedicated `tigerasm-jit` package
🧠 Final thought
Most AI speedups come from more abstraction.
This one comes from removing everything.
When Python can talk directly to the CPU,
you stop optimizing code
and start designing instructions.
🐯⚡