Vitalii Cherepanov

I Scaled PHP Until It Broke. Three llama.cpp Patterns Saved It.

I read the llama.cpp source code.

Sixty thousand lines of C++ that single-handedly made local LLM inference possible on a laptop. This isn't "best practices from a textbook" — it's code where every line is responsible for keeping matrix multiplication inside the L2 cache and off the RAM bandwidth budget.

I write PHP. A language where every value is wrapped in a zval, every object carries a 30+ byte header, and any foreach allocates a hash iterator. The comparison is unfair by definition. But I got curious: which of llama.cpp's tricks would even survive the transplant? And what would happen when I pushed the dataset to a billion records?

I built a benchmark suite. Six optimizations from llama.cpp, translated to PHP 8.4 with JIT. Real numbers, statistical methodology, p99 latencies. Then I scaled the input from 1 million to 1 billion records, to see where the tricks stop being nice-to-haves and become the only path on which the code can finish.

Half of my hypotheses were wrong. That's the actual story.


TL;DR

| Pattern | At 10M records | At 100M+ | Verdict |
|---|---|---|---|
| B01: Memory-mapped lookup | Per-call 7× slower | Load 226× faster, 0 PHP heap | Process-level win, not call-level |
| B02: SplFixedArray vs array | Slower on speed, 1.68× memory savings | Both run to 1B; 9 GB gap | Memory only, never speed |
| B03: Object pool in hot loop | 4.43× faster | Scales linearly | ✅ Use in long-running workers |
| B04: Lookup table vs match | Lookup 5.8× faster, match = switch | Scales linearly | ✅ Data-driven dispatch → lookup |
| B05: Generator vs full array | 1.24× faster, O(1) memory | Naive OOMs, generator finishes | 🔥 Survival tool |
| B06: Column vs row layout | 8.66× faster single-col scan | Naive OOM at 100M, column 959 ms | 🔥 Survival tool |

Half of the patterns transition from "optimization" to "the only path the code can finish on" once you scale up. Half don't. And one pattern (SplFixedArray) turned out to be the opposite of what's been written about it for the last ten years.

Let's go through them one by one.


B01: mmap reads gigabytes fast, but NOT per call

Hypothesis: memory-mapping large read-only tables is faster than loading them via json_decode. The llama.cpp parallel — models are loaded via ggml_mmap (see src/llama-mmap.cpp), not through fread into a malloced buffer.

PHP translation: open libc.dylib via FFI, call mmap(), take the pointer, FFI::cast('uint32_t*', $ptr) for typed access:

$ffi = FFI::cdef("
    void *mmap(void *addr, size_t length, int prot, int flags, int fd, long offset);
    int open(const char *pathname, int flags);
", "libc.dylib"); // macOS libc; on Linux, "libc.so.6"

$size = filesize('data/lookup.bin');
$fd   = $ffi->open('data/lookup.bin', 0);      // 0 = O_RDONLY
$ptr  = $ffi->mmap(null, $size, 1, 2, $fd, 0); // 1 = PROT_READ, 2 = MAP_PRIVATE
$table = FFI::cast('uint32_t*', $ptr);

// Access: $table[$id * 2 + 1] returns the value for key $id

Result on 10 million records:

  • Load time: JSON 454 ms vs mmap 1.1 ms → mmap is 226× faster on load
  • PHP heap after load: JSON 256 MB vs mmap 0 bytes
  • Per-lookup p99: JSON 708 ns vs mmap 5.4 µs → mmap is 7× SLOWER per call

Wait. mmap loses 7× per call. The JIT optimizes $arr[$id] so well that an FFI dereference, plus the cast overhead, can't compete in a tight read loop.

At 1 billion records, mmap loads a 16 GB binary in 228 milliseconds at zero PHP heap. The JSON path doesn't even exist — the fixture would be 100+ GB of JSON text, physically unrealistic to generate.

Verdict: mmap isn't "faster per call." It's a different category of optimization. It buys you load time, flat PHP heap, and table sharing across N PHP-FPM workers via the kernel page cache. Inside a single process in a tight read loop, it loses to the JIT. Across processes, it wins by orders of magnitude — cross-process cold start for a second worker is 2641× faster, because the pages are already in the kernel page cache.

Use mmap when a fleet of workers shares a fat read-only table. Don't use it for tight read loops inside one process.
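For completeness, the binary behind the mmap snippet is just packed uint32 pairs. Here's a sketch of how such a file could be written and spot-checked with pack()/unpack(); the file name and exact layout are my illustrative assumptions, not necessarily the repo's format:

```php
// Sketch: a binary lookup file of little-endian (uint32 key, uint32 value)
// pairs, the layout the mmap snippet indexes as $table[$id * 2 + 1].
$path = tempnam(sys_get_temp_dir(), 'lut');
$f = fopen($path, 'wb');
for ($id = 0; $id < 100; $id++) {
    fwrite($f, pack('V2', $id, $id * 10)); // 'V' = little-endian uint32
}
fclose($f);

// Spot-check record 42 without FFI: read its 8-byte slot and unpack
$raw = file_get_contents($path, false, null, 42 * 8, 8);
[$key, $value] = array_values(unpack('V2n', $raw));
// $key === 42, $value === 420
unlink($path);
```

The same file, mmapped, is read by computing the byte offset instead of unpacking: that's all the FFI snippet above does.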


B02: SplFixedArray saves memory, but never speed

Hypothesis: on dense numeric data, SplFixedArray should be both faster (no hash overhead) and more memory-efficient. The llama.cpp parallel — ggml_tensor works with packed arenas, not arrays of pointers to boxed objects.

Result on 10 million integers:

  • Memory: array 256 MB vs SFA 152 MB → SFA saves 1.68×
  • Iterate: array 12.2 ms vs SFA 93.8 ms → SFA is 7.7× SLOWER
  • Populate: array 56.5 ms vs SFA 108.8 ms → 1.9× slower
  • Random reads (1M): array 23.9 ms vs SFA 98.5 ms → 4× slower
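The memory half of this result is easy to reproduce in a few lines. A minimal sketch, at 1M integers rather than 10M (absolute numbers vary by PHP build; the ratio is the point):

```php
$n = 1_000_000;

// Packed, integer-keyed zend array
$before = memory_get_usage();
$arr = [];
for ($i = 0; $i < $n; $i++) {
    $arr[$i] = $i;
}
$arrBytes = memory_get_usage() - $before;
unset($arr); // free it so the next measurement starts from a clean baseline

// Fixed-size structure, no hash machinery
$before = memory_get_usage();
$sfa = new SplFixedArray($n);
for ($i = 0; $i < $n; $i++) {
    $sfa[$i] = $i;
}
$sfaBytes = memory_get_usage() - $before;

printf("array: %.1f MB, SplFixedArray: %.1f MB\n",
    $arrBytes / 1048576, $sfaBytes / 1048576);
```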

I expected an OOM crossover, so I pushed the sweep up to a billion integers hoping the regular array would hit the RAM ceiling. It didn't. At 1B elements: array 24 GB peak vs SFA 14.9 GB. SFA's speed disadvantage held at every tier.

Verdict: SplFixedArray on modern PHP is memory-only, never speed. The folklore "use SplFixedArray for large numeric data because it's faster" is advice from 2014. JIT in PHP 8.4 optimizes packed integer-keyed arrays so aggressively that the specialized structure loses to the general one. Reach for SFA when you're memory-constrained inside a long-running worker. Don't expect a speedup.

This is the most counter-intuitive finding in this article. I didn't believe it at first and re-ran the whole sweep twice. The numbers held.


B03: Object pool — the only classic optimization that still earns its keep

Hypothesis: in a hot loop, reusing a small pool of pre-allocated objects is faster than new on every iteration. The llama.cpp parallel — the tensor allocator never calls malloc inside an inner loop. It works against a pre-allocated arena via ggml_new_tensor_impl.

Translation: a pool of 5 Point3D instances, reused via direct property assignment:

final class Point3D {
    public function __construct(
        public float $x = 0.0,
        public float $y = 0.0,
        public float $z = 0.0,
    ) {}
}

$pool = array_map(fn () => new Point3D(), range(0, 4));
$idx  = 0;

for ($i = 0; $i < 5_000_000; $i++) {
    $p = $pool[$idx++ % 5];             // reuse a pooled instance, no `new`
    $p->x = $x; $p->y = $y; $p->z = $z; // $x/$y/$z: this iteration's data
    // ... work with $p
}

Result on 5 million allocations: naive 813 ms vs pool 179 ms → 4.43× faster.

GC cycles: zero in both. Point3D isn't cyclical, so PHP's cycle collector never trips. All the savings come from the allocator path: new in the Zend Engine is a light but non-zero code path (zend_object_new → emalloc → property init × N). Five million iterations add up.

Verdict: works as expected. In CLI scripts the win is real but not critical. In long-running workers (queues, websockets, daemons), tail latency from allocator pressure compounds over time and becomes a headache — that's where pooling earns its keep.


B04: Lookup table beats match and switch (and those two are equivalent)

Hypothesis: for dispatch logic with 16+ cases in a hot loop, an array lookup beats match and switch. The llama.cpp parallel — token dispatch in llama_token_to_piece uses tables, not switches.

Translation: a 32-case classifier implemented three ways — switch, match, and a pre-built $lookup = [0 => 'A', 1 => 'B', ...].
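Sketched with a hypothetical 4-case classifier (the benchmark used 32 cases), the three forms look like this:

```php
// Dispatch form 1: switch
function viaSwitch(int $t): string {
    switch ($t) {
        case 0: return 'A';
        case 1: return 'B';
        case 2: return 'C';
        default: return 'D';
    }
}

// Dispatch form 2: match
function viaMatch(int $t): string {
    return match ($t) {
        0 => 'A', 1 => 'B', 2 => 'C', default => 'D',
    };
}

// Dispatch form 3: pre-built lookup table
const LOOKUP = [0 => 'A', 1 => 'B', 2 => 'C', 3 => 'D'];

function viaLookup(int $t): string {
    return LOOKUP[$t] ?? 'D';
}
```

All three return identical results; only the lookup skips the comparison chain entirely.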

Result on 10 million dispatches:

  • switch: 358 ms (27.9M ops/sec)
  • match: 365 ms (27.4M ops/sec)
  • lookup: 61.7 ms (162M ops/sec) — 5.8× faster

match and switch are tied. Both compile to the same jump table for integer cases. PHP 8.4 JIT polishes both forms to the same result. If you rewrote switch to match for "modernization" — you got readability, not speed.

Where the lookup win evaporates: if dispatch produces a string for downstream === comparisons, the gain is eaten by the string compares further down the pipeline.

Verdict: match-shaped problems (closed compile-time set, exhaustiveness wanted) stay in match. Data-driven dispatch (table loaded from config, generated at runtime) goes in a lookup. The "match vs switch for perf" debate is closed — they're equivalent.


B05: Generator — the main survival tool on large streams

Hypothesis: a generator reduces peak memory from O(N) to O(1) with a minor throughput penalty. The llama.cpp parallel — tokens stream through a callback rather than accumulating in a buffer (llama_decode → llama_get_logits_ith).

PHP translation: replace function process(): array with function process(): Generator:

function records(): Generator {
    // read_csv() stands in for any row-by-row reader; the point is
    // that rows are yielded one at a time, never collected in an array.
    foreach (read_csv('data.csv') as $row) {
        yield ['id' => $row[0], 'value' => $row[1]];
    }
}

Result on 5 million records:

  • Wall time: naive 525 ms vs gen 449 ms → gen 1.24× faster
  • Peak memory: naive 1.88 GB vs gen 0 bytes of PHP heap

The generator isn't just lower-memory — it's also faster on wall time, because the array never needs to be fully materialized before processing starts.
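A dependency-free toy version of the same shape: values are produced one at a time, so nothing is ever materialized, and peak memory stays flat regardless of $n.

```php
// Yields 0..$n-1 lazily; only one value exists at any moment.
function numbers(int $n): Generator {
    for ($i = 0; $i < $n; $i++) {
        yield $i;
    }
}

$sum = 0;
foreach (numbers(1_000_000) as $v) {
    $sum += $v; // consume each value as it's yielded
}
// $sum === 499999500000 (sum of 0..999999)
```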

Now scale. At 100 million records, naive OOMs — the kernel kills the process with SIGKILL after 28.6 seconds. The generator finishes the same 100M in 10.4 seconds at zero PHP heap. At 500M, the generator still works (45.7 seconds). Naive doesn't even attempt.

If I had to pull one sentence out of this entire article and put it on a banner, it would be this:

At 100,000 records, a generator is a 1.24× nice-to-have. At 100 million records, it's the only path on which the code can finish.

Verdict: default for any single-pass stream you don't need to revisit. Materialize the array only when you need random access, multiple passes, or count() before processing.


B06: Column-oriented layout — not cache locality, but escape from boxing

Hypothesis: on analytical single-column scans, column-oriented layout is faster than row-oriented due to cache locality. The llama.cpp parallel — tensors are stored per-channel (SoA), not per-element (AoS).

PHP translation: instead of SplFixedArray of stdClass with 5 fields — 5 parallel SplFixedArray instances, one per field:

// Row-oriented (naive)
$rows = new SplFixedArray($n);
for ($i = 0; $i < $n; $i++) {
    $obj = new stdClass();
    $obj->f1 = ...; $obj->f2 = ...; /* ... */ $obj->f5 = ...;
    $rows[$i] = $obj;
}
$sum = 0;
foreach ($rows as $r) $sum += $r->f3;

// Column-oriented (optimized)
$f3 = new SplFixedArray($n); // and so on for each field
for ($i = 0; $i < $n; $i++) $f3[$i] = ...;
$sum = 0;
for ($i = 0; $i < $n; $i++) $sum += $f3[$i];

Result on 5 million records: column is 8.66× faster on single-column scan. On full-row scan (sum f1..f5) — 1.92× faster.

And here's where it gets interesting. I expected ladder steps on the ns/record chart — where the working set stops fitting in L1, then L2, then L3 cache. I didn't see them. The curves are flat across the whole 100K → 100M range: column holds at ~9.5–11.5 ns/record. Row holds at ~80–93 ns/record. No steps.

This is a stronger insight than "look, the ladder." Cache effects inside either layout don't differentiate them. What differentiates them is the layout itself. Row-oriented spends ~30+ bytes per stdClass (zval header + property table + GC info) for 8 bytes of actual payload. At 100M records that's 28 GB on boxing alone. Column-oriented at the same 100M = 7.45 GB, because each column is a packed SplFixedArray with no boxing.
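The per-record overhead is easy to observe directly. A small sketch at 100K records (absolute bytes depend on the PHP build; the gap doesn't):

```php
$n = 100_000;

// Row-oriented: one stdClass per record
$before = memory_get_usage();
$rows = [];
for ($i = 0; $i < $n; $i++) {
    $o = new stdClass();
    $o->f3 = $i;   // 8 bytes of payload wrapped in a full object header
    $rows[] = $o;
}
$rowBytes = memory_get_usage() - $before;
unset($rows);

// Column-oriented: one packed column, no boxing
$before = memory_get_usage();
$col = new SplFixedArray($n);
for ($i = 0; $i < $n; $i++) {
    $col[$i] = $i;
}
$colBytes = memory_get_usage() - $before;

printf("row: %d B/record, column: %d B/record\n",
    intdiv($rowBytes, $n), intdiv($colBytes, $n));
```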

At 100M records, row OOMs — 28+ GB of stdClass objects don't fit. Column finishes the scan in 959 milliseconds at 7.45 GB.

Verdict: column layout isn't cache optimization (which is what I assumed). It's escape from PHP-object overhead at scale. On any analytical workload over large datasets — column. Row stays appropriate when DTOs are passed between layers, or when the working set is small.


What happens at scale

Micro-benchmarks on 1–10 million elements give you one picture. Scaling to billions gives a different one.

Three of the six patterns transition from "optimization" to "necessity" on large data:

  • B05 generator — at 100M records, naive OOMs. Generator finishes.
  • B06 column layout — at 100M records, row OOMs. Column completes the scan in 959 ms.
  • B01 mmap — at 1B records, the JSON fixture physically doesn't exist (100+ GB). mmap loads a 16 GB binary in 228 ms.

Two patterns stay "just optimizations" regardless of scale:

  • B03 object pool: ~4× at any size.
  • B04 lookup table: ~5× at any size.

One pattern turned out narrow — saves memory, but never speed:

  • B02 SplFixedArray: 38% less memory, always slower on speed. Both paths work all the way to 1B.

This is probably the most important reframing in the article. When someone says "X is faster than Y," that's a claim about a specific data size. On small data, half the claims break. On large data, half of them become "X works, Y doesn't exist."

And one more thing worth its own line: JIT in PHP 8.4 keeps eating optimizations every release. Between runs on PHP 8.3.31 and 8.4.21, B03 sped up from 2.78× to 4.43×, B04 from 3.75× to 5.81×. Not a bug — JIT just keeps improving. A year from now, these numbers will shift again.


Three rules of PHP performance in 2026

Out of these six experiments, a working framework emerged.

1. Trust the JIT.

Don't try to outsmart it at the syntax level. match vs switch — JIT compiles both forms to the same jump table. SplFixedArray vs packed array — JIT optimizes the regular array so aggressively that the specialized structure loses on speed. FFI dereference vs $arr[$id] — JIT-compiled array access beats FFI casts inside a hot loop.

If your optimization is about "which language construct to pick" — the JIT already made that choice for you.

2. Optimize what JIT can't see.

  • Cache locality (B06: column layout) — the JIT doesn't manage memory layout. That's your architecture.
  • Allocation pressure (B03: object pool) — the JIT doesn't eliminate allocations, it speeds them up.
  • I/O batching (batched INSERT of 1000 rows vs single-row) — the JIT doesn't optimize round trips to Postgres.
  • Cross-process resource sharing (B01: mmap + page cache) — the JIT works per process.
  • Streaming vs materialization (B05: generator) — the JIT isn't going to remove 30 GB of peak memory for you.

3. At a large enough scale, optimizations stop being optimizations.

They become a survival threshold. A generator on 100K records is 1.24× faster. On 100M it's the only code that finishes. A column layout on 5M is 8.66× faster. On 100M it's the only code that doesn't eat 28 GB on stdClass overhead. mmap on 10M is slower per call. On 1B it's the only way to load the table inside a second.

That's structural thinking, not syntactic. And that's what turns llama.cpp from "a heavily optimized C++ library" into a learning artifact for a PHP developer. Not "here are tricks, steal them." But "here are language limits you only see when you crash into them."


Closing

All benchmark code and a reproducible Docker setup live on GitHub: vbcherepanov/php-llamacpp-benchmarks. A full sweep takes ~15 minutes (make all), including a case study that imports 100K rows into a real PostgreSQL.

A note on the repo: the data/ directory is gitignored — fixtures (up to a 16 GB binary lookup file at the 1B tier) aren't in the repo and are generated locally by make fixtures, so the clone itself stays small.

If you find a methodology bug or want to add a tier, send a PR. I work on this kind of thing through Braincore — a Go-based meta-agent with cost-aware routing and a memory layer for AI coding agents. If these benchmarks were useful and you'd like to support more of this, there's a Ko-fi.


Vitalii Cherepanov — Senior Full-Stack Developer. Building memory, tools, and bridges for AI coding agents. Open source: total-agent-memory (R@5 = 97.45% on LongMemEval), Braincore. Personal site: vbcherepanov.com.
