ESPB: WASM-like bytecode interpreter for ESP32 with seamless FreeRTOS integration. Part 2: The JIT Compiler

#jit #esp32 #interpreter

Hi.

Exactly 3 months have passed since the first publication.
During this time, I’ve reworked the project significantly: I added a full-fledged JIT for Xtensa and RISC-V and implemented a heap of optimizations in the translator. I tested it on ESP32, ESP32-C3, and ESP32-C6 chips (the C6 got the least attention: I only ran the main test on it, and the primary debugging was done on the first two).

Here are the main innovations.

1. Fast Symbols: Killing strcmp in Linking

Alongside the standard symbol tables, I added Fast Symbols. These are two tables: one for system functions (ESP-IDF) and a custom one for your own functions.
The core idea is to remove the functions' string names from the binary and keep only the pointers. This requires strict coordination between the table and the translator, so that the translator knows exactly which index to emit for each function called via libffi. It reduces flash usage and eliminates the slow strcmp calls during module loading.
At runtime, linking turns into an address lookup in a flat array.
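A minimal sketch of such a lookup, built for a PC rather than the device. The table name demo_fast_syms, the helper demo_fast_resolve, and the two stand-in functions are hypothetical and are not taken from the ESPB sources:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for firmware functions exported to scripts. */
static int fw_add(int a, int b) { return a + b; }
static int fw_sub(int a, int b) { return a - b; }

/* Flat table: the translator and the firmware must agree on the order. */
static const void *const demo_fast_syms[] = {
    (const void *)&fw_add,   /* index 0 */
    (const void *)&fw_sub,   /* index 1 */
};

#define DEMO_FAST_SYM_COUNT \
    (sizeof(demo_fast_syms) / sizeof(demo_fast_syms[0]))

/* Resolve an import by index: one bounds check and one load, no strcmp. */
static const void *demo_fast_resolve(uint32_t index)
{
    if (index >= DEMO_FAST_SYM_COUNT) {
        return NULL;   /* index outside the table the firmware was built with */
    }
    return demo_fast_syms[index];
}
```

The caller casts the returned pointer back to the expected function type, for example (int (*)(int, int))demo_fast_resolve(1), and an out-of-range index yields NULL instead of a wild call.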
But what about Kconfig?
In ESP-IDF, components can be disabled in menuconfig (for example, GPIO support can be cut out). If we simply removed a function from the array, all subsequent indices would shift, and the wrong function would be called. This issue is solved via a macro:

// idf_fast.sym
ESPB_SYM("printf", (const void*)&printf)
ESPB_SYM("vTaskDelay", (const void*)&vTaskDelay)
// If GPIO is disabled in menuconfig, the macro substitutes NULL but keeps the index!
ESPB_SYM_OPT(CONFIG_ESPB_IDF_GPIO, "gpio_set_level", (const void*)&gpio_set_level)

The array size and the order of indices remain absolutely stable regardless of the firmware configuration.
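One possible way to build such macros, shown as a sketch only; the real ESPB definitions may differ. It uses the preprocessor trick known from the Linux kernel's IS_ENABLED to turn "defined as 1" or "undefined" into 1 or 0. All DEMO_ and fw_ names, and both CONFIG_DEMO_ options, are hypothetical:

```c
#include <stddef.h>

/* Expands to 1 if cfg is defined as 1, and to 0 if it is undefined. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define DEMO_TAKE_SECOND(ignored, val, ...) val
#define DEMO_IS_ENABLED_3(arg1_or_junk) DEMO_TAKE_SECOND(arg1_or_junk 1, 0, 0)
#define DEMO_IS_ENABLED_2(val) DEMO_IS_ENABLED_3(DEMO_ARG_PLACEHOLDER_##val)
#define DEMO_IS_ENABLED(cfg) DEMO_IS_ENABLED_2(cfg)

/* The name is dropped: only the pointer ends up in the binary. */
#define ESPB_SYM(name, ptr) (ptr),

/* Enabled: emit the pointer. Disabled: emit NULL and discard ptr entirely,
 * so the missing function is never referenced, but the slot still exists. */
#define ESPB_SYM_OPT_1(name, ptr) (ptr),
#define ESPB_SYM_OPT_0(name, ptr) NULL,
#define ESPB_SYM_OPT_CAT(en, name, ptr) ESPB_SYM_OPT_##en(name, ptr)
#define ESPB_SYM_OPT_SEL(en, name, ptr) ESPB_SYM_OPT_CAT(en, name, ptr)
#define ESPB_SYM_OPT(cfg, name, ptr) \
    ESPB_SYM_OPT_SEL(DEMO_IS_ENABLED(cfg), name, ptr)

#define CONFIG_DEMO_UART 1
/* CONFIG_DEMO_GPIO is intentionally left undefined (feature disabled),
 * and fw_gpio_set is not even declared. */

static int fw_print(int x) { return x; }
static int fw_uart_write(int x) { return x + 1; }
static int fw_delay(int x) { return x + 2; }

static const void *const demo_opt_syms[] = {
    ESPB_SYM("print", (const void *)&fw_print)                                  /* 0 */
    ESPB_SYM_OPT(CONFIG_DEMO_GPIO, "gpio_set", (const void *)&fw_gpio_set)      /* 1 */
    ESPB_SYM_OPT(CONFIG_DEMO_UART, "uart_write", (const void *)&fw_uart_write)  /* 2 */
    ESPB_SYM("delay", (const void *)&fw_delay)                                  /* 3 */
};
```

With this configuration the table still has four entries, slot 1 holds NULL, and "delay" keeps index 3, so a script translated against the full table can at least detect the missing function instead of calling the wrong one.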

2. JIT Compiler

The second feature is the JIT. I decided that the best approach is to let the developer manually mark the specific functions that should be translated into machine code.
ESPB was designed from the start as a register machine (up to 256 virtual registers). All the heavy lifting (register allocation via graph coloring) is handled by the C# translator on the PC. The ESPB runtime on the microcontroller is left with the simplest task: emitting Xtensa or RISC-V instructions.
How it works:

In the C/C++ script code, the developer marks a heavy function with the JIT_HOT macro.
The translator sees this and sets the ESPB_FUNC_FLAG_HOT flag in the function header within the .espb file.
When instantiating the module, the runtime allocates memory via heap_caps_malloc(size, MALLOC_CAP_EXEC) (memory where execution is permitted).
The JIT engine generates the binary code and places the pointer in the table.
Cold code (e.g., one-time initialization) remains in Flash memory as bytecode, saving expensive IRAM.
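
The steps above can be sketched roughly as follows. This is a PC-buildable illustration, not the ESPB implementation: the demo_ names and the struct layout are hypothetical, the flag value is assumed, plain malloc stands in for heap_caps_malloc(size, MALLOC_CAP_EXEC), no machine code is actually emitted, and falling back to the interpreter when allocation fails is my assumption:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Flag name from the article; the numeric value is an assumption. */
#define ESPB_FUNC_FLAG_HOT 0x01u

/* Hypothetical per-function record built while loading a module. */
typedef struct {
    uint8_t        flags;          /* written by the translator */
    const uint8_t *bytecode;       /* stays in flash */
    uint32_t       bytecode_size;
    void          *native;         /* JIT output, NULL means "interpret" */
} demo_func_t;

/* Stand-in for the code generator: only reserves a buffer.
 * On the device this would be executable memory. */
static void *demo_jit_compile(const demo_func_t *f)
{
    size_t estimate = (size_t)f->bytecode_size * 4u;  /* assumed size guess */
    if (estimate == 0) {
        return NULL;
    }
    return malloc(estimate);
}

/* Compile only the functions marked hot; everything else stays bytecode. */
static void demo_instantiate(demo_func_t *funcs, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        funcs[i].native = NULL;
        if (funcs[i].flags & ESPB_FUNC_FLAG_HOT) {
            /* A NULL result simply leaves the function interpreted. */
            funcs[i].native = demo_jit_compile(&funcs[i]);
        }
    }
}
```

At call time the runtime would then branch on whether native is NULL: jump to the generated code if it is set, otherwise run the bytecode through the interpreter.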

P.S. Surprisingly, implementing JIT for the Xtensa architecture was the hardest part due to its register window ABI and literal pools.

3. Moment of Truth: ESPB vs WAMR

I went to the trouble of preparing a project for wasm-micro-runtime (WAMR, a Bytecode Alliance project, here in the ESP-IDF packaging from Espressif) with an implementation of the Fibonacci(85) test identical to the one I use for ESPB.

Tests were conducted on an ESP32-C3 chip (160 MHz):

The pure ESPB interpreter is currently slower than the WAMR interpreter. My efforts here weren't enough, and there is room for growth: ESPB currently lacks super-opcodes, where multiple actions are baked into one instruction, and both the .espb translator and the interpreter still have plenty of optimization headroom.
But the good news is that hot code works, and there ESPB is more than 2 times faster than WAMR's best mode, at least judging by this single test. By the way, WAMR on ESP32 generally has no JIT compiler, only the Classic and Fast interpreter modes.
It is evident that a team of programmers has meticulously optimized the WAMR interpreters, which commands respect. For now, the interpreter comparison isn't in favor of the creation of a suffering indie coder.
I am not considering WAMR's AOT mode, since the main idea is for a single bytecode file to work on all systems.
By the way, another direction for development (besides optimization) is emerging here; I think of it as "AOT on Device". That is, compiling all the code with the JIT, storing the result in a flash partition, and then executing it via XIP (Execute In Place). Calls into the firmware would have to go through a GOT (Global Offset Table) so that this AOT version keeps working after the main firmware is updated via OTA. I need to run experiments first, but I think this direction should be viable. I'm actually considering it as the main mode if it works out.

4. Battle for Memory

I compiled five firmware variants for ESP32-C3: from an empty "Hello World" to "full option."

Figures from the build report (idf.py size):

What we see:

Smallest Engine: The pure ESPB interpreter (No JIT) takes up less space in Flash memory than even the most basic WAMR Classic (~2.5 KB less).
Note: WAMR was built with the Lib pthread, Libc builtin, Libc WASI, and Loader mode-normal options.
Cost of JIT: Enabling the JIT compiler in ESPB increases the firmware size (Flash) by approximately 56 KB. Static RAM consumption (DRAM) does not change.
DRAM: All runtimes add about 11–12 KB to RAM consumption relative to an empty project.

Script Sizes:

Size of the .wasm file: 1277 bytes (integers use the variable-length LEB128 encoding).
Size of the .espb file: 1511 bytes (fixed-width types for speed).
The JIT code generated for the two test functions occupied 2494 bytes in IRAM.
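
To show where the size difference comes from, here is a small unsigned LEB128 encoder. LEB128 stores 7 payload bits per byte and uses the top bit as a "more bytes follow" marker, so small values take one byte where a fixed 32-bit field always takes four. The function name is mine, and the comparison with .espb assumes its fixed-width fields are 4 bytes for the same value:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode value as unsigned LEB128.
 * Returns the number of bytes written to out (between 1 and 5). */
static size_t uleb128_encode(uint32_t value, uint8_t out[5])
{
    size_t n = 0;
    do {
        uint8_t byte = (uint8_t)(value & 0x7Fu);  /* low 7 bits */
        value >>= 7;
        if (value != 0) {
            byte |= 0x80u;                        /* continuation bit */
        }
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```

For example, the value 5 encodes to the single byte 0x05, while 624485 encodes to the three bytes 0xE5 0x8E 0x26. The price is a decode loop on every read, which is the cost a fixed-width format avoids.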

5. FFI: Death to "Glue Code"

Simple functions are easy to call everywhere. But the real pain begins when you need to use callbacks. Imagine a task: create a software FreeRTOS timer (xTimerCreate) that calls a function inside your script when triggered.
Let's see how this is solved in WAMR and ESPB.
WAMR: Architectural Pain
WASM is isolated from the microcontroller's memory. You cannot simply pass a pointer to a function into FreeRTOS because the native code doesn't know where to look for this function inside the virtual machine.

Step 1. Write the script (Guest side).
We cannot pass the function directly. We have to pass its index in the table.

// WASM (Guest)
typedef void (*timer_cb_t)(uint32_t, uint32_t);

// Get function index (in wasm32 this is not an address, but an index!)
timer_cb_t cb_ptr = test_timer_cb;
uint32_t cb_func_idx = (uint32_t)(uintptr_t)cb_ptr;

// Call custom wrapper, passing index instead of pointer
xTimerCreate_native("tmr", 2000, 1, 0, cb_func_idx);

Step 2. Write the "Bridge" in firmware (Host side).
This is the scary part. We need to create a context structure, write a native wrapper for timer creation, and a native callback adapter.

// Host (Firmware)

// 1. Context to pass arguments through
typedef struct {
    wasm_exec_env_t cb_exec_env;
    uint32_t        cb_func_idx;
    // ... more fields for instance and handle
    // (including the wasm_handle and timer_id used in the adapter below)
} wasm_timer_ctx_t;

// 2. Native callback adapter
static void native_timer_callback(TimerHandle_t xTimer) {
    wasm_timer_ctx_t *ctx = (wasm_timer_ctx_t *)pvTimerGetTimerID(xTimer);
    uint32_t argv[2] = { ctx->wasm_handle, ctx->timer_id };

    // Manual interpreter call
    wasm_runtime_call_indirect(ctx->cb_exec_env, ctx->cb_func_idx, 2, argv);
}

// 3. Wrapper over xTimerCreate
static uint32_t native_xTimerCreate(wasm_exec_env_t exec_env, 
                                    const char *name, uint32_t period, 
                                    uint32_t reload, uint32_t id, 
                                    uint32_t cb_idx) {
    // ... need to allocate context, save env, create timer ...
    // ... pass native_timer_callback instead of real callback ...
    return (uint32_t)handle;
}

// 4. Registration with scary signatures
static NativeSymbol native_symbols[] = {
    { "xTimerCreate_native", native_xTimerCreate, "($iiii)i", NULL }
};

Result: ~100 lines of code just to start one timer.

ESPB: Zero Glue Code
In ESPB, I solved this problem at the system level. The translator knows that xTimerCreate accepts a callback. The runtime uses libffi to generate a trampoline in IRAM on the fly, and FreeRTOS sees that trampoline as a standard C function.

Step 1. Write the script.
It's just standard C code. We pass the test_timer_cb function as is.

// ESPB (Script)
static void test_timer_cb(TimerHandle_t xTimer) {
    printf("Timer tick!\n");
}

void app_main(void) {
    // Call standard FreeRTOS API
    TimerHandle_t t = xTimerCreate("tcb", pdMS_TO_TICKS(2000), 
                                   pdTRUE, NULL, test_timer_cb);
    if (t) xTimerStart(t, 0);
}

Step 2. Add to firmware.
We don't need wrappers. We simply export three functions: timer creation, timer ID retrieval (for context), and the generic command function (since xTimerStart is a macro over xTimerGenericCommand).

// Host (Firmware) - Symbol Table
ESPB_SYM("xTimerCreate", (const void*)&xTimerCreate)
ESPB_SYM("pvTimerGetTimerID", (const void*)&pvTimerGetTimerID)
ESPB_SYM("xTimerGenericCommand", (const void*)&xTimerGenericCommand)

Result: 0 lines of glue code (only symbol registration). You write the script as if it were part of the firmware. The runtime itself understands that a function pointer was passed and creates a native closure for it.

I uploaded the WAMR project to GitHub:
https://github.com/smersh1307n2/wamr

6. Developer Experience: No "Header Hell"

Usually, development for custom VMs is painful. The IDE doesn't see system headers (FreeRTOS), autocompletion doesn't work, and to compile a script, you have to manually specify hundreds of paths to include directories.
I solved this problem radically, by using the standard ESP-IDF build system.
You write script code in a normal C/C++ project inside VS Code. IntelliSense, code navigation, and error highlighting all work because the project is configured as a valid ESP32 application.
For bytecode compilation, I wrote a PowerShell script get-ir-cmake.ps1 that performs magic:

Pulls actual build flags directly from your project's CMake.
Compiles script files using Clang into LLVM Bitcode (.bc).
Links the result (llvm-link) into a single .bc file, ready to be sent to the translator.

You only have to write code and mark critical sections with JIT_HOT.

7. Translation (Desktop Client)

I created the ESPB Desktop Client, a lightweight utility that works in conjunction with a cloud translator.
You simply feed the client the required files.
The client sends them to the cloud, where the server performs register optimization, calculates the FFI metadata, and replaces string function names with indices (Fast Symbols). The finished .espb file is then saved to the location you specified in the ESPB Desktop Client.

On that note, allow me to take my leave.

Online Translator:
http://espb.runasp.net/
Interpreter Repository:
https://github.com/smersh1307n2/ESPB
Project for preparing LLVM IR: https://github.com/smersh1307n2/ESP32_PRJ_TO_LLVM
ESPB_Desktop_Client:
https://github.com/smersh1307n2/ESPB_Desktop_Client
I also recorded a video supplement:
https://www.youtube.com/watch?v=UbcuU-mabLs