shenjinti

Posted on May 20

I Built a Register-VM JavaScript Engine in Rust with opencode.ai — Beating QuickJS

#programming #ai #javascript #opensource

Three weeks ago I signed up for opencode.ai's $5 plan. I had this crazy idea: build an ES2023-compliant JavaScript engine in Rust. The result? pipa (枇杷) — a register-based VM that scores 1256 on the V8 benchmark, edging out QuickJS (1219).

The Stack

AI: opencode.ai ($5/mo first month) × DeepSeek-v4-Flash
Language: Rust (edition 2024)
Binary size: ~5.2 MB (including REPL, HTTP client, WebSocket)
Zero external deps for: Regex, JSON, Base64, BigInt, Fetch, WebSocket, SSE

Built Over 3 Weeks

Looking back at the opencode session logs, the earliest work on pipa started on April 28. Over the next 22 days, I worked through ~85 sessions. opencode tracked every diff — from the initial skeleton (lexer, parser, NaN-boxed value system, register VM, opcodes) through incremental refinements.

The first commit created the entire foundation: ~75,000 lines across 205 source files. Then came iterative improvements — peephole optimizer passes, inline cache tuning, GC pinning logic, WebSocket handshake fixes, benchmark harnesses, and optimization levels O0 through O3.

Why a Register VM?

Stack VMs spend many instructions shuffling and duplicating operands on the stack. Register VMs name their operands explicitly, which means fewer instructions and more room for optimization.

pipa's instructions are u8 opcode + u16 register operands, variable-length (1/3/5/7/9/11 bytes). Fused EqJumpIf combines compare + branch. LoadInt8 r, imm8 loads small integers in 3 bytes. Compact Jump8 uses 2 bytes for near jumps. The peephole optimizer runs 3 passes, eliminating Move rX, rX and folding zero/one patterns.

NaN-Boxing: Everything in 64 Bits

bit 63        51 50  47 46                     0
    [ NaN prefix ][ tag ][       payload        ]
       0x7FF8...   4 bits        47 bits

Every JS value fits in one u64. A canonical NaN prefix + 4-bit tag + 47-bit payload encodes all 10 types. Type dispatch is a single shift+mask. Small integers avoid boxing entirely. Hot-path both_int() checks both operands with one bitmask — straight to i64 ops.
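To make the layout concrete, here is a minimal sketch of a NaN-boxed value with a 4-bit tag and a 47-bit payload. The tag numbering, the `PREFIX_MASK` constant, and all method names are assumptions for illustration, not pipa's actual source.

```rust
// Assumed layout: bits 63..51 quiet-NaN prefix, bits 50..47 tag, bits 46..0 payload.
const QNAN_BASE: u64 = 0x7FF8_0000_0000_0000;
const PREFIX_MASK: u64 = 0xFFF8_0000_0000_0000; // sign + exponent + quiet bit
const TAG_SHIFT: u32 = 47;
const PAYLOAD_MASK: u64 = (1u64 << TAG_SHIFT) - 1;

// Hypothetical tag numbering; tag 0 is reserved for the canonical float NaN.
pub const TAG_INT: u64 = 1;
pub const TAG_BOOL: u64 = 2;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Value(pub u64);

impl Value {
    fn boxed(tag: u64, payload: u64) -> Value {
        Value(QNAN_BASE | (tag << TAG_SHIFT) | (payload & PAYLOAD_MASK))
    }

    /// Floats are stored as raw bits; every NaN collapses to the canonical one
    /// so that no float can be mistaken for a boxed value.
    pub fn from_f64(f: f64) -> Value {
        if f.is_nan() { Value(QNAN_BASE) } else { Value(f.to_bits()) }
    }

    /// Returns None when the integer does not fit in 47 signed bits.
    pub fn from_int(i: i64) -> Option<Value> {
        let (min, max) = (-(1i64 << 46), (1i64 << 46) - 1);
        if i < min || i > max { None } else { Some(Value::boxed(TAG_INT, i as u64)) }
    }

    pub fn from_bool(b: bool) -> Value {
        Value::boxed(TAG_BOOL, b as u64)
    }

    /// Type dispatch: one prefix compare plus one shift and mask.
    pub fn tag(self) -> Option<u64> {
        let tag = (self.0 >> TAG_SHIFT) & 0xF;
        if (self.0 & PREFIX_MASK) == QNAN_BASE && tag != 0 { Some(tag) } else { None }
    }

    /// Sign-extends the 47-bit payload back to i64.
    pub fn as_int(self) -> Option<i64> {
        let shift = 64 - TAG_SHIFT;
        if self.tag() == Some(TAG_INT) {
            Some(((self.0 << shift) as i64) >> shift)
        } else {
            None
        }
    }

    pub fn as_f64(self) -> Option<f64> {
        if self.tag().is_none() { Some(f64::from_bits(self.0)) } else { None }
    }
}
```

For example, `Value::from_int(-5)` round-trips through `as_int()`, while `Value::from_int(1 << 46)` is rejected and would have to be stored as a float instead.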

Zero-Dependency Builtins

Regex (src/regexp/): Layered execution — LiteralFast → FastClass → DFA → OptimizedNFA → FullNFA. The analyzer picks the fastest engine per pattern. No regex crate.
JSON: Hand-written recursive descent parser and serializer.
Fetch/WebSocket/SSE (src/http/): Rustls-based TLS, chunked transfer encoding, compression, WebSocket frame parsing — all from scratch.
BigInt, Base64, Unicode: Hand-rolled.

Generational GC

The collector uses a 16 MB nursery with bump allocation, and a minor GC clears dead objects. If 10 consecutive minor collections find no garbage, the collector escalates to a full GC with incremental marking and write barriers. Collection is triggered every 16K allocations.

Performance Optimizations: What Actually Worked

Beyond the baseline compiler, pipa has a stack of optimizations that together put it ahead of QuickJS on the V8 benchmark (1256 vs. 1219):

1. Peephole Optimizer (3 Passes)

src/compiler/peephole.rs runs up to 3 passes over bytecode. Each pass locally transforms patterns:

Self-move: Move rX, rX → NOP
Zero/one folding: LoadInt rX, 0; Add rX, rY, rX → Move rX, rY
Divide-by-one: LoadInt rX, 1; Div rX, rY, rX → Move rX, rY
Jump threading: Jump → Jump chains resolved to the final target
Double negation: Not; Not → Move, same for Neg; Neg, BitNot; BitNot
Constant folding: LoadInt 3; LoadInt 4; Add → single LoadInt 7 (supports add/sub/mul/and/or/xor)
Branch on constant: LoadTrue; JumpIf rX → Jump (always taken), LoadFalse; JumpIf rX → NOP
LoadInt → LoadInt8 shrink: Integers in [-128, 127] that feed a Move get shortened to 3-byte LoadInt8
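A few of the rules above can be sketched on a toy instruction set. This is a simplified illustration under my own assumptions (the `Op` enum, absolute jump targets, and the function names are hypothetical), not the contents of src/compiler/peephole.rs.

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    LoadInt(u16, i32),  // rD = imm
    Move(u16, u16),     // rD = rS
    Add(u16, u16, u16), // rD = rA + rB
    Jump(usize),        // absolute instruction index
    Nop,
}

/// Runs one local pass and reports whether anything changed.
/// Rewritten instructions become Nop so that jump targets stay valid.
fn peephole_pass(ops: &mut [Op]) -> bool {
    let mut changed = false;
    // A two-instruction pattern must not be rewritten if a jump lands on its second half.
    let targets: HashSet<usize> = ops
        .iter()
        .filter_map(|op| match op {
            Op::Jump(t) => Some(*t),
            _ => None,
        })
        .collect();

    for i in 0..ops.len() {
        let op = ops[i];
        match op {
            // Self-move: Move rX, rX -> Nop
            Op::Move(d, s) if d == s => {
                ops[i] = Op::Nop;
                changed = true;
            }
            // Jump threading: follow Jump -> Jump chains (bounded, in case of cycles).
            Op::Jump(t) => {
                let mut target = t;
                let mut hops = 0;
                while let Some(Op::Jump(next)) = ops.get(target).copied() {
                    if next == target || hops >= ops.len() {
                        break;
                    }
                    target = next;
                    hops += 1;
                }
                if target != t {
                    ops[i] = Op::Jump(target);
                    changed = true;
                }
            }
            // Zero folding: LoadInt rX, 0; Add rX, rY, rX -> Move rX, rY
            Op::LoadInt(x, 0) if i + 1 < ops.len() && !targets.contains(&(i + 1)) => {
                if let Op::Add(d, y, z) = ops[i + 1] {
                    if d == x && z == x && y != x {
                        ops[i] = Op::Nop;
                        ops[i + 1] = Op::Move(x, y);
                        changed = true;
                    }
                }
            }
            _ => {}
        }
    }
    changed
}

/// Runs up to 3 passes, stopping early once a pass changes nothing.
pub fn optimize(ops: &mut [Op]) {
    for _ in 0..3 {
        if !peephole_pass(ops) {
            break;
        }
    }
}
```

Replacing instructions with `Nop` rather than deleting them keeps every jump target stable, at the cost of needing a later compaction step; that choice is mine for brevity and may differ from what pipa does.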

2. Fused Compare-Jump Instructions

Instead of computing a boolean to a register then branching, EqJumpIf, LtJumpIfNot, StrictEqJumpIf etc. (16 fused pairs) directly compare two values and conditionally jump in a single opcode. This eliminates intermediate register pressure and reduces instruction count on every comparison.
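The effect is easiest to see in a toy interpreter that counts dispatched instructions. The instruction set, register numbering, and helper names below are assumptions made for this sketch; only the `LtJumpIfNot` name comes from the engine.

```rust
#[derive(Clone, Copy)]
pub enum Instr {
    LoadInt(usize, i64),              // rD = imm
    Add(usize, usize, usize),         // rD = rA + rB
    Lt(usize, usize, usize),          // rD = (rA < rB) as 0 or 1
    JumpIfNot(usize, usize),          // if rC == 0 goto target
    LtJumpIfNot(usize, usize, usize), // fused: if !(rA < rB) goto target
    Jump(usize),
    Halt,
}

/// Executes `code` and returns (registers, number of instructions dispatched).
pub fn run(code: &[Instr]) -> ([i64; 8], usize) {
    let mut r = [0i64; 8];
    let (mut pc, mut steps) = (0usize, 0usize);
    loop {
        steps += 1;
        match code[pc] {
            Instr::LoadInt(d, v) => { r[d] = v; pc += 1; }
            Instr::Add(d, a, b) => { r[d] = r[a] + r[b]; pc += 1; }
            Instr::Lt(d, a, b) => { r[d] = (r[a] < r[b]) as i64; pc += 1; }
            Instr::JumpIfNot(c, t) => { pc = if r[c] == 0 { t } else { pc + 1 }; }
            Instr::LtJumpIfNot(a, b, t) => { pc = if r[a] < r[b] { pc + 1 } else { t }; }
            Instr::Jump(t) => { pc = t; }
            Instr::Halt => return (r, steps),
        }
    }
}

// r0 = i, r1 = n, r2 = sum, r3 = constant 1
fn prologue(n: i64) -> Vec<Instr> {
    vec![
        Instr::LoadInt(0, 0),
        Instr::LoadInt(1, n),
        Instr::LoadInt(2, 0),
        Instr::LoadInt(3, 1),
    ]
}

/// for (i = 0; i < n; i++) sum += i, with a separate compare and branch.
pub fn unfused_sum(n: i64) -> Vec<Instr> {
    let mut code = prologue(n);
    code.extend([
        Instr::Lt(4, 0, 1),      // 4: r4 = i < n (needs a scratch register)
        Instr::JumpIfNot(4, 9),  // 5
        Instr::Add(2, 2, 0),     // 6
        Instr::Add(0, 0, 3),     // 7
        Instr::Jump(4),          // 8
        Instr::Halt,             // 9
    ]);
    code
}

/// The same loop with the compare and branch fused into one opcode.
pub fn fused_sum(n: i64) -> Vec<Instr> {
    let mut code = prologue(n);
    code.extend([
        Instr::LtJumpIfNot(0, 1, 8), // 4: no scratch register needed
        Instr::Add(2, 2, 0),         // 5
        Instr::Add(0, 0, 3),         // 6
        Instr::Jump(4),              // 7
        Instr::Halt,                 // 8
    ]);
    code
}
```

Both programs leave the same sum in `r2`, but the fused version dispatches one instruction fewer per loop iteration and never touches the scratch register `r4`.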

3. NaN-Boxing Fast Paths

both_int() in src/value.rs:194 checks whether both operands are tagged integers with one mask-and-compare per operand:

const TAG_MASK: u64 = QNAN_BASE | (0xF << TAG_SHIFT);
(a.0 & TAG_MASK) == INT_TAG_BITS && (b.0 & TAG_MASK) == INT_TAG_BITS

This is used 29 times in vm.rs across all arithmetic and comparison opcodes. When both operands are integers, the VM directly performs i64 arithmetic — no type dispatch, no ToNumber coercion. Similarly, both_raw_float() skips tagging for raw f64 pairs.

is_fast_int() in new_float() automatically converts integer-valued f64 results that fit in 47 signed bits back to tagged ints, maximizing the chance that future operations hit the fast path.
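Put together, the fast path for addition might look like the sketch below. The constant names mirror the snippet above, but their values (a 47-bit tag shift, tag 1 for integers), the `add`, `new_int`, `as_int`, and `to_f64` helpers, and the handling of -0 are assumptions of mine rather than pipa's code.

```rust
const QNAN_BASE: u64 = 0x7FF8_0000_0000_0000;
const TAG_SHIFT: u32 = 47; // assumed
const TAG_MASK: u64 = QNAN_BASE | (0xFu64 << TAG_SHIFT);
const INT_TAG_BITS: u64 = QNAN_BASE | (1u64 << TAG_SHIFT); // assumed: tag 1 = int
const PAYLOAD_MASK: u64 = (1u64 << TAG_SHIFT) - 1;
pub const INT_MIN: i64 = -(1i64 << 46);
pub const INT_MAX: i64 = (1i64 << 46) - 1;

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Value(pub u64);

/// Caller guarantees INT_MIN <= i <= INT_MAX.
pub fn new_int(i: i64) -> Value {
    Value(INT_TAG_BITS | ((i as u64) & PAYLOAD_MASK))
}

/// Sign-extends the 47-bit payload.
pub fn as_int(v: Value) -> i64 {
    let shift = 64 - TAG_SHIFT;
    ((v.0 << shift) as i64) >> shift
}

/// True for integer-valued floats in the 47-bit range. -0.0 stays a float
/// because JavaScript distinguishes it from +0.
pub fn is_fast_int(f: f64) -> bool {
    f.fract() == 0.0
        && f >= INT_MIN as f64
        && f <= INT_MAX as f64
        && !(f == 0.0 && f.is_sign_negative())
}

pub fn new_float(f: f64) -> Value {
    if f.is_nan() {
        Value(QNAN_BASE) // canonical NaN, so no float collides with a tag
    } else if is_fast_int(f) {
        new_int(f as i64)
    } else {
        Value(f.to_bits())
    }
}

pub fn both_int(a: Value, b: Value) -> bool {
    (a.0 & TAG_MASK) == INT_TAG_BITS && (b.0 & TAG_MASK) == INT_TAG_BITS
}

pub fn to_f64(v: Value) -> f64 {
    if (v.0 & TAG_MASK) == INT_TAG_BITS { as_int(v) as f64 } else { f64::from_bits(v.0) }
}

/// Addition for numbers only: integer fast path first, float fallback second.
pub fn add(a: Value, b: Value) -> Value {
    if both_int(a, b) {
        // Two 47-bit operands cannot overflow an i64.
        let sum = as_int(a) + as_int(b);
        if (INT_MIN..=INT_MAX).contains(&sum) {
            return new_int(sum);
        }
        return new_float(sum as f64);
    }
    new_float(to_f64(a) + to_f64(b))
}
```

Note that `new_float(2.0)` comes back as a tagged integer, so a later `add` on that result takes the integer branch again.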

4. Shape + Inline Cache

src/object/shape.rs implements a V8-style hidden class tree. Each object points to a Shape that records property-name-to-offset mappings via transition chains (add property → new child shape).

src/compiler/ic.rs implements two-way polymorphic inline caches (IC_POLY=2). Each cache slot stores (shape_id, property_offset). On property access, the VM compares the object's shape ID against cached slots — on hit, it reads directly from the cached offset, bypassing hash lookup entirely. For monomorphic hotspots (one shape), this is a single shape compare + pointer offset load.
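Here is a rough sketch of how a shape tree and a two-entry inline cache fit together. Only `IC_POLY` and the (shape_id, property_offset) pair come from the description above; the struct layouts, the round-robin replacement, and every function name are assumptions for illustration.

```rust
use std::collections::HashMap;

pub struct Shape {
    offsets: HashMap<String, usize>,   // property name -> slot offset
    transitions: HashMap<String, u32>, // property name -> child shape id
}

/// Shape ids are indices into `all`; id 0 is the empty root shape.
pub struct Shapes {
    all: Vec<Shape>,
}

impl Shapes {
    pub fn new() -> Self {
        let root = Shape { offsets: HashMap::new(), transitions: HashMap::new() };
        Shapes { all: vec![root] }
    }

    /// Follows an existing transition or creates a new child shape.
    pub fn add_property(&mut self, from: u32, name: &str) -> u32 {
        if let Some(&child) = self.all[from as usize].transitions.get(name) {
            return child;
        }
        let id = self.all.len() as u32;
        let mut offsets = self.all[from as usize].offsets.clone();
        let next_offset = offsets.len();
        offsets.insert(name.to_string(), next_offset);
        self.all.push(Shape { offsets, transitions: HashMap::new() });
        self.all[from as usize].transitions.insert(name.to_string(), id);
        id
    }
}

pub struct Object {
    pub shape: u32,
    pub slots: Vec<i64>,
}

impl Object {
    pub fn with_props(shapes: &mut Shapes, props: &[(&str, i64)]) -> Object {
        let mut obj = Object { shape: 0, slots: Vec::new() };
        for &(name, value) in props {
            obj.shape = shapes.add_property(obj.shape, name);
            obj.slots.push(value);
        }
        obj
    }
}

const IC_POLY: usize = 2;

/// One cache per property-access site, so the property name is fixed per cache.
#[derive(Default)]
pub struct InlineCache {
    entries: [Option<(u32, usize)>; IC_POLY], // (shape_id, property_offset)
    next: usize,
    pub hits: u32,
    pub misses: u32,
}

pub fn get_prop(shapes: &Shapes, ic: &mut InlineCache, obj: &Object, name: &str) -> Option<i64> {
    // Fast path: shape compare, then a direct slot load.
    for entry in ic.entries.iter().flatten() {
        if entry.0 == obj.shape {
            ic.hits += 1;
            return Some(obj.slots[entry.1]);
        }
    }
    // Slow path: hash lookup in the shape, then fill a cache entry.
    ic.misses += 1;
    let offset = *shapes.all[obj.shape as usize].offsets.get(name)?;
    let slot = ic.next;
    ic.entries[slot] = Some((obj.shape, offset));
    ic.next = (slot + 1) % IC_POLY;
    Some(obj.slots[offset])
}
```

Two objects built with the same properties in the same order end up on the same shape, so the second lookup at a given site is a cache hit even though it is a different object.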

5. Cached Raw Pointers in VM

The VM struct caches 7 raw fields (six pointers and one length), updated on each call frame push:
cached_code_ptr, cached_const_ptr, cached_registers_base, cached_registers_ptr, cached_ic_table_ptr, cached_upvalue_slot_ptr, cached_upvalues_len

Hot-path instruction handlers access bytecode, constants, and registers through these cached pointers rather than chasing through frames[frame_index] on every instruction. This eliminates pointer indirection in the inner loop (~9000 lines of execute_inner).

6. Opcode-Level Shortcuts

LoadInt8 r, imm8: Loads [-128, 127] in 3 bytes instead of 7 for full LoadInt
AddImm8 / SubImm8 / LteImm8: Arithmetic with small integer constants encoded directly in the instruction
Call0 / Call1 / Call2 / Call3: Specialized call opcodes that skip argument counting
Jump8 / JumpIf8: 2-byte jumps for near targets (saves 3 bytes per jump)
Micro function inlining (O2+): x => x.prop or x => x.a.b detected in codegen and inlined as GetNamedProp sequences, eliminating frame allocation

7. Generational GC with Incremental Marking

The GC (src/runtime/gc.rs, ~1550 lines) uses a 16 MB nursery with bump allocation. Minor GC is nearly free: scan roots, sweep white. If 10 consecutive minor GCs find nothing to collect, the collector escalates to a full GC. Full GC uses incremental marking with a configurable budget per step, avoiding long stop-the-world pauses. Write barriers protect black→white references during incremental marking.
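The trigger logic can be sketched as a small policy object. This assumes the 16K-allocation trigger applies to minor collections, and the `GcPolicy` type and its method names are hypothetical; the real collector in src/runtime/gc.rs is certainly more involved.

```rust
const ALLOCS_PER_MINOR: usize = 16 * 1024;  // "every 16K allocations"
const EMPTY_MINORS_BEFORE_FULL: u32 = 10;   // "10 consecutive minor GCs"

#[derive(Default)]
pub struct GcPolicy {
    allocs_since_minor: usize,
    empty_minors: u32,
}

impl GcPolicy {
    /// Call on every allocation; returns true when a minor GC is due.
    pub fn on_alloc(&mut self) -> bool {
        self.allocs_since_minor += 1;
        if self.allocs_since_minor >= ALLOCS_PER_MINOR {
            self.allocs_since_minor = 0;
            true
        } else {
            false
        }
    }

    /// Call after each minor GC with the number of objects it freed;
    /// returns true when the collector should escalate to a full GC.
    pub fn after_minor(&mut self, objects_freed: usize) -> bool {
        if objects_freed > 0 {
            // Any reclaimed garbage breaks the streak of empty collections.
            self.empty_minors = 0;
            return false;
        }
        self.empty_minors += 1;
        if self.empty_minors >= EMPTY_MINORS_BEFORE_FULL {
            self.empty_minors = 0;
            true
        } else {
            false
        }
    }
}
```

A VM allocation routine would call `on_alloc()` on each allocation, run a minor collection when it returns true, and then pass the freed count to `after_minor()` to decide whether to start incremental marking.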

8. Layered Regex Engine

The regex engine (src/regexp/) is the biggest single win — 3× faster than QuickJS on the V8 RegExp benchmark. It uses a five-tier execution strategy:

Mode	When	How
`LiteralFast`	Pure literal string (no metacharacters)	`memcmp`
`FastClass`	Simple character class only	Byte-by-byte `FastClassMatcher`
`LiteralDFA/ClassDFA`	Simple patterns compilable to DFA	Linear-time DFA execution
`OptimizedNFA`	Medium complexity with optimizations	Optimized NFA bytecode
`FullNFA`	Complex patterns with backreferences	Full backtracking NFA with pooled contexts

The pattern analyzer (analyze_pattern) classifies each regex at compile time and selects the cheapest engine that can handle it. Most real-world regexes hit the DFA or faster tiers.
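As a rough illustration of tier selection, the sketch below classifies a pattern string with a handful of textual heuristics. The `Mode` enum mirrors the table, but the rules are deliberately crude assumptions of mine (the real analyze_pattern presumably works on a parsed regex, and this version collapses LiteralDFA/ClassDFA into one `Dfa` variant).

```rust
#[derive(Debug, PartialEq)]
pub enum Mode {
    LiteralFast,
    FastClass,
    Dfa,
    OptimizedNfa,
    FullNfa,
}

/// Detects numeric backreferences such as \1, skipping other escapes.
fn has_backreference(pattern: &str) -> bool {
    let bytes = pattern.as_bytes();
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'\\' {
            if (b'1'..=b'9').contains(&bytes[i + 1]) {
                return true;
            }
            i += 2; // skip the escaped character
        } else {
            i += 1;
        }
    }
    false
}

/// Picks the cheapest tier that this simplified classifier believes can run the pattern.
pub fn analyze_pattern(pattern: &str) -> Mode {
    const META: &str = r"\.^$*+?()[]{}|";

    // No metacharacters at all: a plain substring search is enough.
    if !pattern.chars().any(|c| META.contains(c)) {
        return Mode::LiteralFast;
    }
    // Exactly one bracket expression with no escapes or nesting.
    if let Some(inner) = pattern.strip_prefix('[').and_then(|s| s.strip_suffix(']')) {
        if !inner.is_empty() && !inner.chars().any(|c| matches!(c, '[' | ']' | '\\')) {
            return Mode::FastClass;
        }
    }
    // Backreferences are not regular, so they need the backtracking engine.
    if has_backreference(pattern) {
        return Mode::FullNfa;
    }
    // Groups or lookaround (conservatively: any parenthesis) go to the NFA tier.
    if pattern.contains('(') {
        return Mode::OptimizedNfa;
    }
    Mode::Dfa
}
```

With these rules, `hello` lands in `LiteralFast`, `[a-z]` in `FastClass`, `ab+c|d` in `Dfa`, `(a|b)+c` in `OptimizedNfa`, and `(a)\1` in `FullNfa`.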

test262 Coverage: What We Tried

Running the official ECMAScript test suite (test262) has been an ongoing effort. Here's what's in place:

test262 Runner

examples/test262_runner.rs (~500 lines) is a standalone test harness that:

Parses YAML frontmatter from test files (/*--- ... ---*/)
Handles test flags (onlyStrict, noStrict, raw, async, generated)
Resolves test includes (harness helper files like assert.js, sta.js)
Detects negative tests (expected failures with specific error types)
Tracks per-file pass/fail statistics with console output
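The frontmatter step can be sketched without a YAML library. The version below only handles inline lists (`flags: [a, b]`) and the `negative:` block's `type:` field; `TestMeta` and `parse_frontmatter` are hypothetical names, not the runner's actual API.

```rust
#[derive(Debug, Default, PartialEq)]
pub struct TestMeta {
    pub flags: Vec<String>,
    pub includes: Vec<String>,
    pub negative_type: Option<String>,
}

/// Parses "[a, b, c]" into its items.
fn parse_inline_list(text: &str) -> Vec<String> {
    text.trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(|item| item.trim().to_string())
        .filter(|item| !item.is_empty())
        .collect()
}

/// Extracts metadata from the /*--- ... ---*/ block; None if the block is missing.
pub fn parse_frontmatter(source: &str) -> Option<TestMeta> {
    let start = source.find("/*---")? + "/*---".len();
    let end = start + source[start..].find("---*/")?;

    let mut meta = TestMeta::default();
    let mut in_negative = false;
    for line in source[start..end].lines() {
        let trimmed = line.trim();
        if let Some(rest) = trimmed.strip_prefix("flags:") {
            meta.flags = parse_inline_list(rest);
            in_negative = false;
        } else if let Some(rest) = trimmed.strip_prefix("includes:") {
            meta.includes = parse_inline_list(rest);
            in_negative = false;
        } else if trimmed.starts_with("negative:") {
            in_negative = true;
        } else if in_negative && line.starts_with(' ') {
            // Indented keys belong to the negative block.
            if let Some(rest) = trimmed.strip_prefix("type:") {
                meta.negative_type = Some(rest.trim().to_string());
            }
        } else {
            in_negative = false;
        }
    }
    Some(meta)
}
```

A runner can then skip or wrap a test based on `flags`, prepend the files named in `includes`, and for negative tests compare the thrown error's name against `negative_type`.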

test262.sh scripts the full flow: clone test262 from tc39, then invoke the runner.

Current Coverage: ~45%

The engine passes roughly 45% of the test262 suite. Major working areas:

Lexical grammar: Identifiers, keywords, Unicode escape sequences, line terminators
Expressions: Primary, left-hand-side, unary, binary, conditional, assignment
Statements: Block, variable/function declarations, if/switch/while/for/for-in/for-of, try/catch/finally, return/throw/break/continue
Built-in objects: Object, Array, String, Number, Boolean, Symbol, BigInt, Math, Date, RegExp, JSON, Map, Set, Promise, Proxy, Reflect, TypedArrays, Intl
Control flow: Aberrant behavior detection, early error handling
Modules: import/export, namespace objects

What's Missing

The remaining 55% mostly falls into:

Annex B (browser extensions): __proto__, Object.prototype.__defineGetter__, RegExp.$1 etc.
Edge cases in temporal dead zone and class field initialization
Async generator finalization corner cases
Atomics/SharedArrayBuffer: The single-threaded model doesn't support these yet
Intl.NumberFormat v3: Recent additions to the Intl spec

The MIR layer (src/compiler/mir.rs, ~460 lines) is designed but not yet connected to the main pipeline — once wired, it will enable better dead code elimination, type inference, and potentially SSA-based optimizations.

How opencode Helped

I'd never have built this without AI pair programming. The workflow — describe what I want, get working code, test, refine — collapsed months of work into 3 weeks.

The session timeline tells the story:

Period	Focus
Week 1 (Apr 28–May 4)	Core engine: lexer, parser, register VM, NaN-boxing, opcode encoding, basic builtins
Week 2 (May 5–11)	Optimizations: peephole passes, shape system, inline caches, regex engine, bytecode serialization
Week 3 (May 13–20)	HTTP stack: fetch, WebSocket, SSE, EventSource; process API; benchmarks; REPL polish

DeepSeek-v4-Flash understood Rust borrow semantics, generated correct unsafe NaN-boxing code, wired up TLS handshakes, and debugged GC edge cases. Without it, this project would have stayed a weekend thought experiment.

Try It

cargo install pipa-js
pipa script.js
pipa -compile input.js output.jsc   # pre-compile to bytecode
pipa -O3 script.js                   # max optimization

MIT licensed on GitHub. PRs and stars welcome.
