JIT vs Interpreters: Benchmarking LLM-Generated Code Execution

Your AI Agent Writes Python. What If It Compiled to Native?


Who this is for: if you're building agentic workflows where LLMs generate and execute code, the execution speed of that code directly affects your agent's throughput. This article measures it.


Token efficiency is half the story. The other half: how fast does the generated code actually run? We benchmarked Synoema's Cranelift JIT against Python, Node.js, TypeScript (tsx), and C++ (-O2) across 12 algorithmic tasks.

Part of the Token Economics of Code series.

Methodology

Hardware: Apple Silicon (macOS Darwin 25.3.0)

Runtimes: Synoema JIT (Cranelift, --release), CPython 3.12, Node.js (V8), TypeScript via tsx, C++ (g++ -O2)

Measurement: 3 warm-up runs discarded, 5 measured runs, median reported with p5/p95 percentiles.
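
The measurement protocol above can be sketched as a small harness. This is illustrative Python, not the real runner (which is the Rust crate under `benchmarks/runner`), and the names here are made up for the sketch:

```python
import statistics
import time

def benchmark(task, warmup=3, runs=5):
    """Time `task`: discard warm-up runs, then report median with tail samples."""
    for _ in range(warmup):
        task()  # warm-up: fill caches, trigger lazy init, JIT, etc.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append((time.perf_counter() - start) * 1000)  # ms
    samples.sort()
    return {
        "median_ms": statistics.median(samples),
        "p5_ms": samples[0],    # with only 5 runs, min/max stand in
        "p95_ms": samples[-1],  # for the p5/p95 percentiles
    }
```

Note that with 5 measured runs the min and max effectively serve as the p5/p95 bounds; a larger run count would make those percentiles meaningful on their own.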

Fairness: Identical algorithms across all languages. No language-specific optimizations.

```shell
cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases runtime -v
```

Results: Overview

| Language | Avg median (ms) | vs Synoema |
| --- | --- | --- |
| C++ (-O2) | 2.0 | 2.5x faster |
| Synoema JIT | 5.2 | baseline |
| Python 3.12 | 27.6 | 5.3x slower |

Results: Per-Task (12 tasks)

| Task | C++ (ms) | Synoema (ms) | Python (ms) | Synoema vs Python |
| --- | --- | --- | --- | --- |
| binary_search | 2.1 | 7.4 | 16.7 | 2.3x faster |
| collatz | 2.3 | 5.7 | 16.4 | 2.9x faster |
| factorial | 1.4 | JIT fail | 17.2 | -- |
| fibonacci | 3.7 | JIT fail | 145.6 | -- |
| filter_map | 2.3 | 5.2 | 16.6 | 3.2x faster |
| fizzbuzz | 1.7 | 5.7 | 16.8 | 3.0x faster |
| gcd | 2.4 | 5.6 | 16.8 | 3.0x faster |
| matrix_mult | 1.5 | 8.4 | 17.6 | 2.1x faster |
| mergesort | 2.1 | 6.6 | 17.4 | 2.6x faster |
| quicksort | 1.4 | 6.0 | 16.7 | 2.8x faster |
| string_ops | 2.0 | 5.1 | 16.3 | 3.2x faster |
| tree_traverse | 1.5 | 6.5 | 17.0 | 2.6x faster |

factorial and fibonacci currently fail in JIT mode (a known limitation that is being addressed).

Analysis

JIT Compilation Overhead

Synoema's times include Cranelift JIT compilation (10-50ms one-time cost). For short tasks, this overhead is visible. For longer computations, it's negligible.

Key insight: JIT overhead is a constant one-time cost, while interpreter overhead grows in proportion to the work performed.
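
That trade-off can be made concrete with a toy cost model. The 50 ms compile cost and 4x execution speedup below are assumptions loosely drawn from the numbers in this article, not measured constants:

```python
def jit_total_ms(work_ms, compile_ms=50, speedup=4.0):
    """JIT pays a fixed compile cost, then runs the work at native speed."""
    return compile_ms + work_ms / speedup

def interp_total_ms(work_ms):
    """An interpreter pays per unit of work; there is no fixed setup cost."""
    return work_ms

# Break-even: compile_ms + w/speedup = w  =>  w = compile_ms * speedup / (speedup - 1)
break_even_ms = 50 * 4.0 / (4.0 - 1)  # about 67 ms of interpreted work
```

Below the break-even point the interpreter wins; above it the JIT wins, and the gap widens linearly with the amount of work.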

Where Synoema Wins

  • Recursive algorithms: no interpreter loop overhead
  • Tight numeric loops (collatz, gcd): native integer operations
  • Pattern matching: compiled to jump tables
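
For reference, this is the shape of workload the collatz task represents: a branchy integer loop that an interpreter must dispatch instruction-by-instruction on every iteration, but a JIT compiles once. The exact benchmark inputs are not published here, so the input range below is an assumption:

```python
def collatz_steps(n: int) -> int:
    """Count Collatz steps until n reaches 1."""
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

# A tight numeric loop over many inputs -- the case where native
# integer operations dominate interpreter dispatch overhead.
total = sum(collatz_steps(n) for n in range(1, 10_000))
```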

Where Synoema Loses

  • String-heavy operations: Python's C-implemented string library is highly optimized
  • Very short programs: JIT overhead dominates when computation < 10ms
  • Against C++, always: Cranelift generates code at roughly 86% of the quality of LLVM/GCC output

Honest Comparison

The comparison that matters for AI agents:

```
Synoema (JIT, type-safe, fewer tokens on functional code)
    vs
Python  (interpreted, duck-typed, dominant in LLM generation)
```

Implications for AI Agents

```
Python:   generate (1.5s) -> interpret (N ms)
Synoema:  generate (0.8s, fewer tokens) -> JIT (50ms) -> native (N/4 ms)
```

The real question: what's the total cost of the generate -> execute -> analyze cycle? Token efficiency + compilation speed + type guarantees create compound savings.
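
A back-of-the-envelope version of that cycle cost, using the illustrative figures from this article (0.8 s vs 1.5 s generation, 50 ms JIT compile, 4x execution speedup; these are assumptions, and the per-task work time is made up for the example):

```python
def cycle_ms(generate_ms, compile_ms, execute_ms):
    """Total cost of one generate -> execute iteration of an agent loop."""
    return generate_ms + compile_ms + execute_ms

work_ms = 200  # assumed interpreted execution time for one task

python_cycle = cycle_ms(generate_ms=1500, compile_ms=0, execute_ms=work_ms)
synoema_cycle = cycle_ms(generate_ms=800, compile_ms=50, execute_ms=work_ms / 4)
```

Under these assumptions the generation-time savings and the execution speedup compound: the fixed JIT cost is paid once per cycle, while both of the larger terms shrink.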

Try It

```shell
git clone https://github.com/synoema/synoema
cd synoema
cargo run --manifest-path benchmarks/runner/Cargo.toml -- run --phases runtime -v
```

What's Next

Next: we send the same prompts to 10 LLM models and measure which ones generate correct Synoema code.


*Part of the Token Economics of Code series by @andbubnov.*

#llm #programming #rust
