Three weeks ago I signed up for opencode.ai's $5 plan. I had this crazy idea: build an ES2023-compliant JavaScript engine in Rust. The result? pipa (枇杷) — a register-based VM that scores 1256 on the V8 benchmark, edging out QuickJS (1219).
The Stack
- AI: opencode.ai ($5/mo first month) × DeepSeek-v4-Flash
- Language: Rust (edition 2024)
- Binary size: ~5.2 MB (including REPL, HTTP client, WebSocket)
- Zero external deps for: Regex, JSON, Base64, BigInt, Fetch, WebSocket, SSE
Built Over 3 Weeks
Looking back at the opencode session logs, the earliest work on pipa started on April 28. Over the next 22 days, I worked through ~85 sessions. opencode tracked every diff — from the initial skeleton (lexer, parser, NaN-boxed value system, register VM, opcodes) through incremental refinements.
The first commit created the entire foundation: ~75,000 lines across 205 source files. Then came iterative improvements — peephole optimizer passes, inline cache tuning, GC pinning logic, WebSocket handshake fixes, benchmark harnesses, and optimization levels O0 through O3.
Why a Register VM?
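Stack VMs spend instructions shuffling and duplicating operands (dup, swap) just to get values into position. Register VMs name operands explicitly, so the same code compiles to fewer instructions and exposes more to the optimizer.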
pipa's instructions are u8 opcode + u16 register operands, variable-length (1/3/5/7/9/11 bytes). Fused EqJumpIf combines compare + branch. LoadInt8 r, imm8 loads small integers in 3 bytes. Compact Jump8 uses 2 bytes for near jumps. The peephole optimizer runs 3 passes, eliminating Move rX, rX and folding zero/one patterns.
NaN-Boxing: Everything in 64 Bits
0x7FF8_?TTT_?PPP_PPPP_PPPP_PPPP
^^^^ tag ^^^^^^^^^^^^^ payload
Every JS value fits in one u64. A canonical NaN prefix + 4-bit tag + 47-bit payload encodes all 10 types. Type dispatch is a single shift+mask. Small integers avoid boxing entirely. Hot-path both_int() checks both operands with one bitmask — straight to i64 ops.
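To make the layout concrete, here is a minimal sketch of the tagging scheme described above. It is not pipa's actual source: QNAN_BASE, TAG_SHIFT, TAG_MASK and INT_TAG_BITS echo names that appear later in this post, while the numeric value of TAG_INT, the Value wrapper and all helper methods are assumptions made for illustration.

```rust
// Illustrative NaN-boxing layout: canonical NaN prefix, 4-bit tag, 47-bit payload.
// TAG_INT's value and every helper below are assumptions made for this sketch.
const QNAN_BASE: u64 = 0x7FF8_0000_0000_0000;
const TAG_SHIFT: u32 = 47;
const TAG_MASK: u64 = QNAN_BASE | (0xF_u64 << TAG_SHIFT);
const PAYLOAD_MASK: u64 = (1_u64 << TAG_SHIFT) - 1;
const TAG_INT: u64 = 1; // hypothetical tag number
const INT_TAG_BITS: u64 = QNAN_BASE | (TAG_INT << TAG_SHIFT);

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Value(pub u64);

impl Value {
    /// Box an integer that fits in 47 signed bits.
    pub fn from_int(i: i64) -> Value {
        debug_assert!((-(1i64 << 46)..(1i64 << 46)).contains(&i));
        Value(INT_TAG_BITS | ((i as u64) & PAYLOAD_MASK))
    }

    /// Store a double as its raw bits. NaNs collapse to one canonical
    /// pattern so they can never be mistaken for a tagged value.
    pub fn from_f64(f: f64) -> Value {
        if f.is_nan() {
            Value(QNAN_BASE)
        } else {
            Value(f.to_bits())
        }
    }

    /// Type check: one mask plus one compare.
    pub fn is_int(self) -> bool {
        (self.0 & TAG_MASK) == INT_TAG_BITS
    }

    /// Sign-extend the 47-bit payload back to an i64: shift the payload's
    /// top bit into bit 63, then arithmetic-shift it back down.
    pub fn as_int(self) -> i64 {
        ((self.0 << 17) as i64) >> 17
    }
}

/// True when both operands carry the integer tag.
pub fn both_int(a: Value, b: Value) -> bool {
    (a.0 & TAG_MASK) == INT_TAG_BITS && (b.0 & TAG_MASK) == INT_TAG_BITS
}
```

With this layout, Value::from_int(-5).as_int() round-trips to -5, and ordinary doubles such as 1.5 never match the integer tag because their exponent and quiet-NaN bits differ from the prefix.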
Zero-Dependency Builtins
- Regex (src/regexp/): Layered execution: LiteralFast → FastClass → DFA → OptimizedNFA → FullNFA. The analyzer picks the fastest engine per pattern. No regex crate.
- JSON: Hand-written recursive descent parser and serializer.
- Fetch/WebSocket/SSE (src/http/): Rustls-based TLS, with chunked transfer encoding, compression, and WebSocket frame parsing written from scratch.
- BigInt, Base64, Unicode: Hand-rolled.
Generational GC
A 16 MB nursery with bump allocation; minor GC reclaims dead objects. If 10 consecutive minor collections find no garbage, the collector escalates to a full GC with incremental marking and write barriers. Collection is triggered every 16K allocations.
Performance Optimizations: What Actually Worked
Beyond the baseline compiler, pipa has a stack of optimizations that together make it faster than QuickJS:
1. Peephole Optimizer (3 Passes)
src/compiler/peephole.rs runs up to 3 passes over bytecode. Each pass locally transforms patterns:
- Self-move: Move rX, rX → NOP
- Zero/one folding: LoadInt rX, 0; Add rX, rY, rX → Move rX, rY
- Divide-by-one: LoadInt rX, 1; Div rX, rY, rX → Move rX, rY
- Jump threading: Jump → Jump chains resolved to the final target
- Double negation: Not; Not → Move, same for Neg; Neg, BitNot; BitNot
- Constant folding: LoadInt 3; LoadInt 4; Add → single LoadInt 7 (supports add/sub/mul/and/or/xor)
- Branch on constant: LoadTrue; JumpIf rX → Jump (always taken), LoadFalse; JumpIf rX → NOP
- LoadInt → LoadInt8 shrink: Integers in [-128, 127] that feed a Move get shortened to 3-byte LoadInt8
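The first two rewrites in the list can be sketched on a toy instruction set. This is an illustration only, not the contents of peephole.rs: the Insn enum, its field names, and the choice to pad with Nop instead of compacting the bytecode are assumptions. The sketch also treats Add as plain integer addition; a JavaScript engine would have to know rY is numeric before folding, since 0 + "a" is string concatenation.

```rust
/// Toy bytecode for illustration; the variants and field names are
/// assumptions, not pipa's real instruction set.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Insn {
    Nop,
    Move { dst: u16, src: u16 },
    LoadInt { dst: u16, imm: i32 },
    /// Integer addition: dst = lhs + rhs.
    Add { dst: u16, lhs: u16, rhs: u16 },
}

/// One local pass. Rewrites happen in place (padding with Nop) so that
/// instruction indices, and therefore jump targets, stay valid.
/// Assumes no jump lands on the second instruction of a rewritten pair.
pub fn peephole_pass(code: &mut [Insn]) -> bool {
    let mut changed = false;
    for i in 0..code.len() {
        // Self-move: Move rX, rX -> Nop
        if let Insn::Move { dst, src } = code[i] {
            if dst == src {
                code[i] = Insn::Nop;
                changed = true;
                continue;
            }
        }
        if i + 1 >= code.len() {
            continue;
        }
        // Zero folding: LoadInt rX, 0; Add rX, rY, rX -> Move rX, rY
        if let (Insn::LoadInt { dst: x, imm: 0 }, Insn::Add { dst, lhs, rhs }) =
            (code[i], code[i + 1])
        {
            // Exactly one operand must be the freshly loaded zero.
            if dst == x && (lhs == x) != (rhs == x) {
                let other = if lhs == x { rhs } else { lhs };
                code[i] = Insn::Nop;
                code[i + 1] = Insn::Move { dst: x, src: other };
                changed = true;
            }
        }
    }
    changed
}

/// Run at most three passes, stopping early once a pass changes nothing.
/// Returns the number of passes executed.
pub fn optimize(code: &mut [Insn]) -> usize {
    let mut passes = 0;
    while passes < 3 {
        passes += 1;
        if !peephole_pass(code) {
            break;
        }
    }
    passes
}
```

Running several passes matters because one rewrite can expose another: a fold that produces a Move may create a pattern that a later pass can simplify again.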
2. Fused Compare-Jump Instructions
Instead of computing a boolean to a register then branching, EqJumpIf, LtJumpIfNot, StrictEqJumpIf etc. (16 fused pairs) directly compare two values and conditionally jump in a single opcode. This eliminates intermediate register pressure and reduces instruction count on every comparison.
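A small interpreter makes the saving visible. The sketch below is a toy register VM over i64 registers, not pipa's dispatch loop: the Op enum, the register numbering, and the two sample programs are invented for illustration. Both programs sum the integers below n (preloaded in r1), once with a separate Lt plus JumpIfNot and once with a fused LtJumpIfNot, and run returns how many instructions were dispatched.

```rust
/// Toy register bytecode; opcode names follow the article, while the operand
/// layout is an assumption for this sketch.
#[derive(Debug, Clone, Copy)]
pub enum Op {
    LoadInt { dst: usize, imm: i64 },
    Add { dst: usize, lhs: usize, rhs: usize },
    Lt { dst: usize, lhs: usize, rhs: usize },
    JumpIfNot { cond: usize, target: usize },
    /// Fused form: jump to `target` unless regs[lhs] < regs[rhs].
    LtJumpIfNot { lhs: usize, rhs: usize, target: usize },
    Jump { target: usize },
    Halt,
}

/// Execute until Halt and return the number of dispatched instructions.
pub fn run(code: &[Op], regs: &mut [i64]) -> usize {
    let mut pc = 0;
    let mut executed = 0;
    loop {
        executed += 1;
        match code[pc] {
            Op::LoadInt { dst, imm } => { regs[dst] = imm; pc += 1; }
            Op::Add { dst, lhs, rhs } => { regs[dst] = regs[lhs] + regs[rhs]; pc += 1; }
            Op::Lt { dst, lhs, rhs } => { regs[dst] = (regs[lhs] < regs[rhs]) as i64; pc += 1; }
            Op::JumpIfNot { cond, target } => {
                pc = if regs[cond] == 0 { target } else { pc + 1 };
            }
            Op::LtJumpIfNot { lhs, rhs, target } => {
                pc = if regs[lhs] < regs[rhs] { pc + 1 } else { target };
            }
            Op::Jump { target } => pc = target,
            Op::Halt => return executed,
        }
    }
}

/// r0 = i, r1 = n (set by the caller), r2 = sum, r3 = 1, r4 = scratch boolean.
pub fn sum_loop_unfused() -> Vec<Op> {
    vec![
        Op::LoadInt { dst: 0, imm: 0 },
        Op::LoadInt { dst: 2, imm: 0 },
        Op::LoadInt { dst: 3, imm: 1 },
        Op::Lt { dst: 4, lhs: 0, rhs: 1 },
        Op::JumpIfNot { cond: 4, target: 8 },
        Op::Add { dst: 2, lhs: 2, rhs: 0 },
        Op::Add { dst: 0, lhs: 0, rhs: 3 },
        Op::Jump { target: 3 },
        Op::Halt,
    ]
}

/// Same loop with the compare and branch fused; r4 is no longer needed.
pub fn sum_loop_fused() -> Vec<Op> {
    vec![
        Op::LoadInt { dst: 0, imm: 0 },
        Op::LoadInt { dst: 2, imm: 0 },
        Op::LoadInt { dst: 3, imm: 1 },
        Op::LtJumpIfNot { lhs: 0, rhs: 1, target: 7 },
        Op::Add { dst: 2, lhs: 2, rhs: 0 },
        Op::Add { dst: 0, lhs: 0, rhs: 3 },
        Op::Jump { target: 3 },
        Op::Halt,
    ]
}
```

The fused program saves one dispatch per loop iteration and frees the scratch register that only existed to carry the boolean from the compare to the branch.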
3. NaN-Boxing Fast Paths
both_int() in src/value.rs:194 uses a single bitmask operation to check whether both operands are tagged integers:
const TAG_MASK: u64 = QNAN_BASE | (0xF << TAG_SHIFT);
(a.0 & TAG_MASK) == INT_TAG_BITS && (b.0 & TAG_MASK) == INT_TAG_BITS
This is used 29 times in vm.rs across all arithmetic and comparison opcodes. When both operands are integers, the VM directly performs i64 arithmetic — no type dispatch, no ToNumber coercion. Similarly, both_raw_float() skips tagging for raw f64 pairs.
is_fast_int() in new_float() automatically converts f64 values that fit in 47 signed bits back to tagged ints, maximizing the chance future operations hit the fast path.
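The demotion step can be sketched independently of the bit layout. In the snippet below, the Num enum stands in for the boxed value, try_fast_int plays the role the post assigns to is_fast_int(), and the handling of -0.0 and of overflow past 47 bits are my assumptions about what a correct implementation needs, not a description of pipa's code.

```rust
const INT47_MIN: i64 = -(1 << 46);
const INT47_MAX: i64 = (1 << 46) - 1;

/// Stand-in for a boxed value: either a tagged small integer or a raw double.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Num {
    Int(i64),
    Float(f64),
}

/// Return the integer a double can be demoted to, if any.
/// Rejects NaN, infinities, fractions, out-of-range values and -0.0
/// (which must stay a double to keep its sign).
pub fn try_fast_int(f: f64) -> Option<i64> {
    if !f.is_finite() || f.fract() != 0.0 {
        return None;
    }
    if f == 0.0 && f.is_sign_negative() {
        return None;
    }
    if f < INT47_MIN as f64 || f > INT47_MAX as f64 {
        return None;
    }
    Some(f as i64)
}

/// Construct a number from a double, demoting to Int whenever possible.
pub fn new_float(f: f64) -> Num {
    match try_fast_int(f) {
        Some(i) => Num::Int(i),
        None => Num::Float(f),
    }
}

fn to_f64(n: Num) -> f64 {
    match n {
        Num::Int(i) => i as f64,
        Num::Float(f) => f,
    }
}

/// Addition with an integer fast path and a double fallback.
pub fn add(a: Num, b: Num) -> Num {
    match (a, b) {
        (Num::Int(x), Num::Int(y)) => {
            // Both inputs fit in 47 bits, so the i64 sum cannot overflow,
            // but it may no longer fit in the payload.
            let sum = x + y;
            if (INT47_MIN..=INT47_MAX).contains(&sum) {
                Num::Int(sum)
            } else {
                Num::Float(sum as f64)
            }
        }
        _ => new_float(to_f64(a) + to_f64(b)),
    }
}
```

Here add(Num::Float(1.5), Num::Float(1.5)) comes back as Num::Int(3), so a later operation on that result takes the integer path again.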
4. Shape + Inline Cache
src/object/shape.rs implements a V8-style hidden class tree. Each object points to a Shape that records property-name-to-offset mappings via transition chains (add property → new child shape).
src/compiler/ic.rs implements two-way polymorphic inline caches (IC_POLY=2). Each cache slot stores (shape_id, property_offset). On property access, the VM compares the object's shape ID against cached slots — on hit, it reads directly from the cached offset, bypassing hash lookup entirely. For monomorphic hotspots (one shape), this is a single shape compare + pointer offset load.
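The interaction between shapes and caches is easier to see in code. What follows is a simplified sketch, not the contents of shape.rs or ic.rs: the Shapes table, the Object struct with f64 slots, the round-robin replacement policy and the hit/miss counters are all assumptions. Only IC_POLY = 2 and the (shape_id, property_offset) cache entry come from the description above.

```rust
use std::collections::HashMap;

pub type ShapeId = usize;

struct Shape {
    offsets: HashMap<String, usize>,       // property name -> slot index
    transitions: HashMap<String, ShapeId>, // "add property" edges to child shapes
}

/// Shape tree stored as a flat table; shape 0 is the empty root.
pub struct Shapes {
    table: Vec<Shape>,
}

impl Shapes {
    pub fn new() -> Self {
        let root = Shape { offsets: HashMap::new(), transitions: HashMap::new() };
        Shapes { table: vec![root] }
    }

    /// Follow (or create) the transition for adding `name` to shape `from`.
    pub fn add_property(&mut self, from: ShapeId, name: &str) -> ShapeId {
        if let Some(&child) = self.table[from].transitions.get(name) {
            return child;
        }
        let mut offsets = self.table[from].offsets.clone();
        let slot = offsets.len();
        offsets.insert(name.to_string(), slot);
        let child = self.table.len();
        self.table.push(Shape { offsets, transitions: HashMap::new() });
        self.table[from].transitions.insert(name.to_string(), child);
        child
    }

    /// Slow path: hash lookup of a property offset.
    pub fn lookup(&self, shape: ShapeId, name: &str) -> Option<usize> {
        self.table[shape].offsets.get(name).copied()
    }
}

pub struct Object {
    pub shape: ShapeId,
    pub slots: Vec<f64>,
}

const IC_POLY: usize = 2;

/// One cache per property-access site, so the property name is fixed per cache.
#[derive(Default)]
pub struct InlineCache {
    entries: [Option<(ShapeId, usize)>; IC_POLY],
    next: usize,
    pub hits: u32,
    pub misses: u32,
}

pub fn get_prop(shapes: &Shapes, ic: &mut InlineCache, obj: &Object, name: &str) -> Option<f64> {
    // Fast path: compare the shape id, then read the cached offset directly.
    let entries = ic.entries;
    for &(shape, offset) in entries.iter().flatten() {
        if shape == obj.shape {
            ic.hits += 1;
            return Some(obj.slots[offset]);
        }
    }
    // Slow path: hash lookup, then remember the result (round-robin eviction).
    let offset = shapes.lookup(obj.shape, name)?;
    ic.misses += 1;
    let slot = ic.next;
    ic.entries[slot] = Some((obj.shape, offset));
    ic.next = (slot + 1) % IC_POLY;
    Some(obj.slots[offset])
}
```

Two objects built by adding the same properties in the same order end up with the same shape id, which is what lets one cache entry serve every such object at a given access site.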
5. Cached Raw Pointers in VM
The VM struct caches 7 raw pointers updated on each call frame push:
cached_code_ptr, cached_const_ptr, cached_registers_base, cached_registers_ptr, cached_ic_table_ptr, cached_upvalue_slot_ptr, cached_upvalues_len
Hot-path instruction handlers access bytecode, constants, and registers through these cached pointers rather than chasing through frames[frame_index] on every instruction. This eliminates pointer indirection in the inner loop (~9000 lines of execute_inner).
6. Opcode-Level Shortcuts
- LoadInt8 r, imm8: Loads [-128, 127] in 3 bytes instead of 7 for full LoadInt
- AddImm8 / SubImm8 / LteImm8: Arithmetic with small integer constants encoded directly in the instruction
- Call0 / Call1 / Call2 / Call3: Specialized call opcodes that skip argument counting
- Jump8 / JumpIf8: 2-byte jumps for near targets (saves 3 bytes per jump)
- Micro function inlining (O2+): x => x.prop or x => x.a.b detected in codegen and inlined as GetNamedProp sequences, eliminating frame allocation
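The size savings from the short forms come down to choosing the narrowest encoding that fits. The sketch below shows one way an emitter could do that. The opcode numbers and operand layouts are assumptions, picked only so that the byte counts match the figures quoted above (3 vs 7 bytes for loads, 2 vs 5 for jumps); pipa's real encoding may differ.

```rust
use std::convert::TryFrom;

/// Hypothetical instruction forms; layouts chosen to match the quoted sizes.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Insn {
    LoadInt { dst: u16, imm: i32 }, // opcode + u16 register + i32 = 7 bytes
    LoadInt8 { dst: u8, imm: i8 },  // opcode + u8 register + i8 = 3 bytes
    Jump { offset: i32 },           // opcode + i32 = 5 bytes
    Jump8 { offset: i8 },           // opcode + i8 = 2 bytes
}

/// Append the little-endian encoding of one instruction.
/// Opcode byte values are placeholders.
pub fn encode(insn: Insn, out: &mut Vec<u8>) {
    match insn {
        Insn::LoadInt { dst, imm } => {
            out.push(0x01);
            out.extend_from_slice(&dst.to_le_bytes());
            out.extend_from_slice(&imm.to_le_bytes());
        }
        Insn::LoadInt8 { dst, imm } => {
            out.push(0x02);
            out.push(dst);
            out.push(imm as u8);
        }
        Insn::Jump { offset } => {
            out.push(0x03);
            out.extend_from_slice(&offset.to_le_bytes());
        }
        Insn::Jump8 { offset } => {
            out.push(0x04);
            out.push(offset as u8);
        }
    }
}

/// Pick the short load when both the register and the immediate fit in a byte.
pub fn load_int(dst: u16, imm: i32) -> Insn {
    match (u8::try_from(dst), i8::try_from(imm)) {
        (Ok(d), Ok(i)) => Insn::LoadInt8 { dst: d, imm: i },
        _ => Insn::LoadInt { dst, imm },
    }
}

/// Pick the short jump when the relative offset fits in a signed byte.
pub fn jump(offset: i32) -> Insn {
    match i8::try_from(offset) {
        Ok(short) => Insn::Jump8 { offset: short },
        Err(_) => Insn::Jump { offset },
    }
}
```

A real emitter also has to deal with the fact that shrinking one jump can bring other targets into short range, which is one reason to do this kind of selection in repeated passes.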
7. Generational GC with Incremental Marking
The GC (src/runtime/gc.rs, ~1550 lines) uses a 16 MB nursery with bump allocation. Minor GC is nearly free — scan roots, sweep white. If 10 consecutive minor GCs find nothing to collect, it promotes to full GC. Full GC uses incremental marking with configurable budget per step, avoiding long stop-the-world pauses. Write barriers protect black→white references during incremental marking.
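The allocation and scheduling side of this can be sketched without any marking logic. The code below is a rough model under assumptions, not gc.rs: the Nursery and GcPolicy types, the 8-byte alignment, and the choice to report freed bytes to the policy are invented for illustration. The three constants reflect the numbers quoted in this post.

```rust
pub const NURSERY_BYTES: usize = 16 * 1024 * 1024;
pub const ALLOCS_PER_GC: usize = 16 * 1024;
pub const FRUITLESS_MINORS_BEFORE_FULL: u32 = 10;

/// Bump allocator: allocation is one add and one bounds check.
pub struct Nursery {
    buf: Vec<u8>,
    top: usize,
}

impl Nursery {
    pub fn with_capacity(bytes: usize) -> Self {
        Nursery { buf: vec![0; bytes], top: 0 }
    }

    /// Reserve `size` bytes (rounded up to 8) and return the start offset,
    /// or None when the nursery is full and a collection is needed.
    pub fn alloc(&mut self, size: usize) -> Option<usize> {
        let size = size.checked_add(7)? & !7;
        let start = self.top;
        let end = start.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.top = end;
        Some(start)
    }

    /// After a minor GC has dealt with survivors, the whole nursery is reused.
    pub fn reset(&mut self) {
        self.top = 0;
    }
}

/// Decides when to collect and when to escalate to a full GC.
pub struct GcPolicy {
    allocs_since_gc: usize,
    fruitless_minors: u32,
}

impl GcPolicy {
    pub fn new() -> Self {
        GcPolicy { allocs_since_gc: 0, fruitless_minors: 0 }
    }

    /// Count one allocation; returns true when a minor collection is due.
    pub fn on_alloc(&mut self) -> bool {
        self.allocs_since_gc += 1;
        if self.allocs_since_gc >= ALLOCS_PER_GC {
            self.allocs_since_gc = 0;
            true
        } else {
            false
        }
    }

    /// Report a finished minor GC; returns true when a full GC should follow.
    pub fn after_minor(&mut self, bytes_freed: usize) -> bool {
        if bytes_freed > 0 {
            self.fruitless_minors = 0;
            return false;
        }
        self.fruitless_minors += 1;
        if self.fruitless_minors >= FRUITLESS_MINORS_BEFORE_FULL {
            self.fruitless_minors = 0;
            true
        } else {
            false
        }
    }
}
```

A runtime built on this would create the nursery with Nursery::with_capacity(NURSERY_BYTES) and call on_alloc on every allocation; the streak counter resets as soon as any minor collection frees memory.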
8. Layered Regex Engine
The regex engine (src/regexp/) is the biggest single win — 3× faster than QuickJS on the V8 RegExp benchmark. It uses a five-tier execution strategy:
| Mode | When | How |
|---|---|---|
| LiteralFast | Pure literal string (no metacharacters) | memcmp |
| FastClass | Simple character class only | Byte-by-byte FastClassMatcher |
| LiteralDFA/ClassDFA | Simple patterns compilable to DFA | Linear-time DFA execution |
| OptimizedNFA | Medium complexity with optimizations | Optimized NFA bytecode |
| FullNFA | Complex patterns with backreferences | Full backtracking NFA with pooled contexts |
The pattern analyzer (analyze_pattern) classifies each regex at compile time and selects the cheapest engine that can handle it. Most real-world regexes hit the DFA or faster tiers.
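To give a feel for tier selection, here is a deliberately crude classifier that works on the pattern text. It is a hypothetical stand-in for analyze_pattern, not its real logic: the Engine enum, the function name classify_pattern, and every heuristic (for example, sending any pattern with a group to OptimizedNFA, or treating any "(?<" as needing the full engine) are assumptions. A real analyzer would work on the parsed pattern instead of raw characters.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Engine {
    LiteralFast,
    FastClass,
    Dfa,
    OptimizedNfa,
    FullNfa,
}

fn is_meta(c: char) -> bool {
    matches!(
        c,
        '\\' | '.' | '^' | '$' | '|' | '?' | '*' | '+' | '(' | ')' | '[' | ']' | '{' | '}'
    )
}

/// True if the pattern contains \1 through \9.
fn has_backreference(p: &str) -> bool {
    let mut chars = p.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            // Always consume the escaped character, so an escaped backslash
            // followed by a digit is not misread as a backreference.
            if let Some('1'..='9') = chars.next() {
                return true;
            }
        }
    }
    false
}

/// True for a lone character class with an optional trailing quantifier,
/// such as "[a-z]" or "[0-9]+".
fn is_single_class(p: &str) -> bool {
    let body = p
        .strip_suffix(|c: char| matches!(c, '+' | '*' | '?'))
        .unwrap_or(p);
    match body.strip_prefix('[').and_then(|s| s.strip_suffix(']')) {
        Some(inner) => {
            !inner.is_empty() && !inner.contains(|c: char| matches!(c, '[' | ']' | '\\'))
        }
        None => false,
    }
}

/// Pick the cheapest tier that these rough heuristics consider safe.
pub fn classify_pattern(p: &str) -> Engine {
    if !p.contains(is_meta) {
        return Engine::LiteralFast;
    }
    // Conservative: "(?<" also matches named groups, which end up in FullNfa.
    if has_backreference(p) || p.contains("(?=") || p.contains("(?!") || p.contains("(?<") {
        return Engine::FullNfa;
    }
    if is_single_class(p) {
        return Engine::FastClass;
    }
    if p.contains('(') {
        return Engine::OptimizedNfa;
    }
    Engine::Dfa
}
```

The ordering is the important part: cheap checks run first, and anything the heuristics cannot prove simple falls through to a more general engine, so a misclassification costs speed, not correctness.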
test262 Coverage: What We Tried
Running the official ECMAScript test suite (test262) has been an ongoing effort. Here's what's in place:
test262 Runner
examples/test262_runner.rs (~500 lines) is a standalone test harness that:
- Parses YAML frontmatter from test files (/*--- ... ---*/)
- Handles test flags (onlyStrict, noStrict, raw, async, generated)
- Resolves test includes (harness helper files like assert.js, sta.js)
- Detects negative tests (expected failures with specific error types)
- Tracks per-file pass/fail statistics with console output
test262.sh scripts the full flow: clone test262 from tc39, then invoke the runner.
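The frontmatter step is the part of the runner that is easiest to show in isolation. The sketch below is not taken from test262_runner.rs: the Metadata struct and both function names are hypothetical, and instead of a YAML parser it uses a hand-rolled scan that only understands inline lists such as flags: [onlyStrict] and a nested type: under negative:. Metadata written as multi-line YAML lists would need more work.

```rust
/// Subset of test262 metadata that a runner needs to schedule a test.
#[derive(Debug, Default, PartialEq)]
pub struct Metadata {
    pub flags: Vec<String>,
    pub includes: Vec<String>,
    pub negative_type: Option<String>,
}

/// Parse an inline list such as " [assert.js, sta.js]".
fn parse_flow_list(rest: &str) -> Vec<String> {
    rest.trim()
        .trim_start_matches('[')
        .trim_end_matches(']')
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .map(String::from)
        .collect()
}

/// Extract the block between "/*---" and "---*/" and read the keys we care
/// about. Returns None when the file has no frontmatter.
pub fn parse_frontmatter(source: &str) -> Option<Metadata> {
    let start = source.find("/*---")? + "/*---".len();
    let end = start + source[start..].find("---*/")?;
    let mut meta = Metadata::default();
    let mut in_negative = false;
    for line in source[start..end].lines() {
        let indented = line.starts_with(' ') || line.starts_with('\t');
        let trimmed = line.trim();
        if trimmed.is_empty() {
            continue;
        }
        // Any top-level key ends the nested "negative:" block.
        if !indented {
            in_negative = false;
        }
        if let Some(rest) = trimmed.strip_prefix("flags:") {
            meta.flags = parse_flow_list(rest);
        } else if let Some(rest) = trimmed.strip_prefix("includes:") {
            meta.includes = parse_flow_list(rest);
        } else if trimmed == "negative:" {
            in_negative = true;
        } else if in_negative {
            if let Some(rest) = trimmed.strip_prefix("type:") {
                meta.negative_type = Some(rest.trim().to_string());
            }
        }
    }
    Some(meta)
}
```

With the metadata in hand, the runner can decide whether to run the file in strict mode, which harness files to prepend, and whether a thrown SyntaxError counts as a pass.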
Current Coverage: ~45%
The engine passes roughly 45% of the test262 suite. Major working areas:
- Lexical grammar: Identifiers, keywords, Unicode escape sequences, line terminators
- Expressions: Primary, left-hand-side, unary, binary, conditional, assignment
- Statements: Block, variable/function declarations, if/switch/while/for/for-in/for-of, try/catch/finally, return/throw/break/continue
- Built-in objects: Object, Array, String, Number, Boolean, Symbol, BigInt, Math, Date, RegExp, JSON, Map, Set, Promise, Proxy, Reflect, TypedArrays, Intl
- Control flow: Aberrant behavior detection, early error handling
- Modules: import/export, namespace objects
What's Missing
The remaining 55% mostly falls into:
- Annex B (browser extensions): __proto__, Object.prototype.__defineGetter__, RegExp.$1 etc.
- Edge cases in temporal dead zone and class field initialization
- Async generator finalization corner cases
- Atomics/SharedArrayBuffer: The single-threaded model doesn't support these yet
- Intl.NumberFormat v3: Recent additions to the Intl spec
The MIR layer (src/compiler/mir.rs, ~460 lines) is designed but not yet connected to the main pipeline — once wired, it will enable better dead code elimination, type inference, and potentially SSA-based optimizations.
How opencode Helped
I'd never have built this without AI pair programming. The workflow — describe what I want, get working code, test, refine — collapsed months of work into 3 weeks.
The session timeline tells the story:
| Period | Focus |
|---|---|
| Week 1 (Apr 28–May 4) | Core engine: lexer, parser, register VM, NaN-boxing, opcode encoding, basic builtins |
| Week 2 (May 5–11) | Optimizations: peephole passes, shape system, inline caches, regex engine, bytecode serialization |
| Week 3 (May 13–20) | HTTP stack: fetch, WebSocket, SSE, EventSource; process API; benchmarks; REPL polish |
DeepSeek-v4-Flash understood Rust borrow semantics, generated correct unsafe NaN-boxing code, wired up TLS handshakes, and debugged GC edge cases. Without it, this project would have stayed a weekend thought experiment.
Try It
cargo install pipa-js
pipa script.js
pipa -compile input.js output.jsc # pre-compile to bytecode
pipa -O3 script.js # max optimization
MIT licensed on GitHub. PRs and stars welcome.