fares haroun

Posted on Mar 15

How RPCS3 Emulates the PS3, and Why It Took a Decade of Unreasonable Effort

#programming #learning #opensource

The PlayStation 3 launched in 2006 with a processor that IBM, Sony, and Toshiba spent four years and hundreds of millions of dollars designing. It was supposed to be the future of computing. Game developers spent the better part of a console generation learning to hate it. And a group of volunteer engineers — working evenings, weekends, and vacations — spent the better part of a decade reverse-engineering it well enough to run its games on a standard PC.

I’ve been following RPCS3 since around 2016, when the project was still mostly a curiosity that could boot a handful of commercial titles into menus and then crash. Watching it go from that to running Red Dead Redemption at 60fps — on hardware Sony never intended — is one of the more quietly remarkable software engineering stories of the last twenty years.

To understand why this was so hard, you need to understand what the Cell processor actually was.

What the Cell Processor Actually Was

Cell processor die shot — IBM Cell BE chip.
Source: Wikimedia Commons

The Cell had one main core, the PPU (a dual-threaded, in-order 64-bit PowerPC core at 3.2 GHz), plus eight “Synergistic Processing Units” hanging off a 512-bit ring bus called the Element Interconnect Bus. On the PS3, one SPU was disabled at the hardware level, partly for yield reasons, and one was reserved for the system software, leaving six for games. Each SPU was a 128-bit SIMD processor with its own dedicated 256 KB of “local store” memory.

Here’s the thing that breaks your brain: the SPU’s local store is not a cache. It does not have cache semantics. There is no automatic coherency with main memory. If an SPU wants data from XDR RAM or from another SPU, it has to explicitly request a DMA transfer through the Memory Flow Controller, wait for it to complete, and then operate on data that now lives entirely within its own 256 KB island. Write the result back? Another DMA. Signal that you’re done? There’s a mailbox channel for that — one of roughly 30 active I/O channels the SPU exposes (the hardware reserves a 128-slot address space, but only a subset are actually defined and used).
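The explicit-transfer model is easier to feel in code than in prose. Here’s a minimal sketch, not RPCS3 code: `LocalStore`, `dma_get`, and `dma_put` are hypothetical names, and main memory is just a bytearray standing in for XDR RAM.

```python
LS_SIZE = 256 * 1024  # each SPU sees exactly 256 KB of local store


class LocalStore:
    """Hypothetical model of one SPU's private memory (not an RPCS3 class)."""

    def __init__(self):
        self.data = bytearray(LS_SIZE)

    def _check(self, main_memory, ea, lsa, size):
        if lsa < 0 or ea < 0 or lsa + size > LS_SIZE or ea + size > len(main_memory):
            raise ValueError("transfer out of range")

    def dma_get(self, main_memory, ea, lsa, size):
        """Copy `size` bytes from main memory at `ea` into local store at `lsa`."""
        self._check(main_memory, ea, lsa, size)
        self.data[lsa:lsa + size] = main_memory[ea:ea + size]

    def dma_put(self, main_memory, ea, lsa, size):
        """Copy `size` bytes from local store at `lsa` back to main memory at `ea`."""
        self._check(main_memory, ea, lsa, size)
        main_memory[ea:ea + size] = self.data[lsa:lsa + size]


main_memory = bytearray(1024 * 1024)            # stand-in for XDR RAM
main_memory[0x1000:0x1004] = b"\x01\x02\x03\x04"

spu = LocalStore()
spu.dma_get(main_memory, ea=0x1000, lsa=0x0, size=4)   # pull the data in
spu.data[0] = 0xFF                                      # work on the local copy
stale = bytes(main_memory[0x1000:0x1004])               # main memory has not changed
spu.dma_put(main_memory, ea=0x1000, lsa=0x0, size=4)   # push the result back
fresh = bytes(main_memory[0x1000:0x1004])
```

Between the get and the put, the SPU’s write is invisible to everyone else. Nothing propagates until the SPU asks for it.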

This was deliberate, and in a narrow sense it was brilliant. By eliminating cache coherency hardware, the SPUs could be small, power-efficient, and fast at vectorized math. IBM later built the Roadrunner supercomputer around a Cell variant, the PowerXCell 8i. Theoretical single-precision peak: 25.6 GFLOPS per SPU (four lanes, one fused multiply-add per lane per cycle, at 3.2 GHz), or about 205 GFLOPS across all eight. For 2006 consumer hardware, that was genuinely insane.

For the programmers shipping games on it? Less great. For engineers who’d eventually try to emulate it? A special kind of hell.

Why Every Emulation Assumption Breaks Here

Most CPUs that get emulated share certain properties. Unified memory space. Cache coherency — if you write a value, anyone who reads that address gets the new value. Threads share an address space. These assumptions are so baked into x86 that most emulation frameworks just assume them.

The SPU ignores all of this.

Each SPU’s local store addresses start at 0x0, completely separate from main memory. A load instruction on an SPU loads from local store, not from anywhere in system RAM. The 256 KB is all it sees. Communication with the outside world goes through named channels (SPU_RdInMbox, SPU_WrOutMbox, MFC_EAH, MFC_EAL and so on), as defined in RPCS3’s SPUThread.h. These aren’t memory-mapped registers in the traditional sense. They’re a distinct I/O mechanism.
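To make the channel idea concrete, here’s a rough sketch of a mailbox as a small bounded queue. The `Mailbox` class and its method names are hypothetical, not RPCS3’s. The capacities reflect my understanding of the hardware: the inbound mailbox holds four entries and the outbound mailbox one.

```python
from collections import deque


class Mailbox:
    """Hypothetical bounded 32-bit message queue modeling one SPU channel."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()

    def try_write(self, value):
        if len(self.entries) >= self.capacity:
            return False                      # the writer would stall here
        self.entries.append(value & 0xFFFFFFFF)
        return True

    def try_read(self):
        if not self.entries:
            return None                       # the reader would stall here
        return self.entries.popleft()


in_mbox = Mailbox(capacity=4)    # PPU -> SPU
out_mbox = Mailbox(capacity=1)   # SPU -> PPU

# The PPU posts five messages; the fifth does not fit.
writes = [in_mbox.try_write(v) for v in (10, 20, 30, 40, 50)]

# The SPU reads one message and replies.
first = in_mbox.try_read()
out_ok = out_mbox.try_write(first + 1)
out_full = out_mbox.try_write(99)   # outbound is one deep, so this is rejected
```

The point is that a channel access is an operation with its own state and blocking behavior, not a load or store to some address.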

So when you’re writing an SPU emulator, you can’t just point it at a host address space and let it rip. You have to translate the SPU’s instruction set (a distinct ISA, not PowerPC), manage the local store as a genuine separate allocation, and service DMA requests by copying data between that local store and the emulated XDR RAM at the right times. Miss a timing constraint and the SPU reads stale data or writes to the wrong place, and the game produces garbage or crashes.

The PCSX2 team — who built one of the best PS2 emulators ever written — looked at this seriously around 2007–2008 and concluded it was too different from anything they’d built for. That’s not a knock on them. It was a rational engineering judgment. The PS3 wasn’t an extension of the PS2. It was a different class of problem.

How RPCS3 Actually Does It

RPCS3 was started in 2011 by DH and Hykem. For years it ran in pure interpreter mode — fetch each instruction, decode it in software, execute it. Slow, but correct enough to make progress. The project could boot some homebrew by 2013. Commercial games stayed mostly out of reach.

The turning point was proper JIT recompilation.

For the PPU (the main PowerPC core), RPCS3 uses LLVM. PPUTranslator.h takes PPU instructions and emits LLVM IR, which LLVM’s backend compiles to native x86-64 — and, as of December 2024, ARM64. The main thread object PPUThread.h supports up to 100 simultaneous PPU threads with IDs starting at 0x01000000. LLVM handles register allocation, instruction selection, and a pile of optimizations that would be tedious to implement by hand. The tradeoff is compilation latency — when RPCS3 encounters a new code block, there’s a pause while LLVM processes it. You’ll see stutters in the first few minutes of any game until the code cache warms up.
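The warm-up stutter falls straight out of how a block cache works. A toy sketch, assuming nothing about RPCS3’s real data structures: `BlockCache` and `toy_compile` are made-up names, and the “compiler” just returns a Python closure where the real thing would emit LLVM IR.

```python
class BlockCache:
    """Hypothetical compile-on-first-use cache keyed by guest address."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn
        self.blocks = {}      # guest address -> compiled callable
        self.compiles = 0

    def lookup(self, address):
        block = self.blocks.get(address)
        if block is None:
            block = self.compile_fn(address)   # slow path: the stutter lives here
            self.blocks[address] = block
            self.compiles += 1
        return block


def toy_compile(address):
    # Stand-in for translation. The returned function "runs" the block
    # and reports the address of the next one.
    def run(state):
        state["executed"].append(address)
        return address + 0x10
    return run


cache = BlockCache(toy_compile)
state = {"executed": []}

# A hot loop hits the same block three times; only the first visit compiles.
for _ in range(3):
    next_pc = cache.lookup(0x10000)(state)
```

Every block pays the compile cost once, so the pauses cluster at the start of a session and fade as more of the game’s code has been seen.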

For the SPUs, RPCS3 uses asmjit — a lower-level assembler library that emits x86 instructions directly, without going through LLVM’s IR. This is intentional. SPU code tends to be tight, homogeneous SIMD loops. You don’t need LLVM’s whole optimization pipeline for that; you need low-latency recompilation and precise control over output. SPURecompiler and ASMJITRecompiler translate SPU’s SIMD instructions into SSE/AVX equivalents. A 128-bit SPU vector register maps reasonably well to an SSE register — one of the few places where the architecture gives you a break.
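Here’s what that register mapping means in practice, using the SPU’s add-word instruction (`a`) as the example. This is an illustrative sketch in plain Python, with `spu_add_word` as a hypothetical helper. On x86 the same operation is a single SSE2 PADDD.

```python
import struct


def spu_add_word(ra, rb):
    """Add two 16-byte registers as four independent 32-bit lanes,
    wrapping on overflow. Lanes are unpacked big-endian, as on the SPU."""
    a = struct.unpack(">4I", ra)
    b = struct.unpack(">4I", rb)
    return struct.pack(">4I", *((x + y) & 0xFFFFFFFF for x, y in zip(a, b)))


ra = struct.pack(">4I", 1, 2, 3, 0xFFFFFFFF)
rb = struct.pack(">4I", 10, 20, 30, 1)
rt = spu_add_word(ra, rb)
lanes = struct.unpack(">4I", rt)   # the last lane wraps around to 0
```

One guest instruction becomes one host instruction with no carry between lanes. That kind of one-to-one mapping is rare elsewhere in this emulator.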

The Memory Reservation Problem

Here’s a problem that sounds easy and is not: implementing load-linked / store-conditional (LL/SC) semantics on x86.

The Cell’s PPU supports lwarx/stwcx. — PowerPC’s reservation instructions. You load a word and set a hardware reservation on that cache line. If anyone else writes to that cache line before your store-conditional executes, the store fails and you retry. This is how you do lock-free synchronization on PowerPC.

x86 has LOCK CMPXCHG, which is superficially similar but semantically different. The Cell’s reservation is per-cache-line (128 bytes), can be held across many instructions, and can be invalidated by DMA traffic from an SPU — not just by another CPU core writing to memory.

vm_reservation.h in RPCS3 implements a software reservation table that tracks which guest memory ranges have active reservations. The emulator checks and potentially invalidates those on every DMA completion. Get this wrong and games deadlock, or worse, appear to run while silently corrupting state. A lot of RPCS3’s early commercial game failures traced back here.
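A stripped-down way to picture a software reservation table is one write counter per 128-byte line. This is a single-threaded sketch under my own assumptions, not the scheme in vm_reservation.h; `ReservationTable` and `GuestThread` are hypothetical names, and a real implementation has to make every step atomic across host threads.

```python
LINE = 128  # reservation granularity in bytes


class ReservationTable:
    """Hypothetical table: one write counter per 128-byte line."""

    def __init__(self):
        self.versions = {}

    def version(self, addr):
        return self.versions.get(addr // LINE, 0)

    def notify_write(self, addr, size):
        # Called for any write, including DMA. The data copy itself is omitted.
        for line in range(addr // LINE, (addr + size - 1) // LINE + 1):
            self.versions[line] = self.versions.get(line, 0) + 1


class GuestThread:
    def __init__(self, table, memory):
        self.table, self.memory = table, memory
        self.reserved = None  # (line index, version seen at load time)

    def load_reserved(self, addr):            # models lwarx
        self.reserved = (addr // LINE, self.table.version(addr))
        return int.from_bytes(self.memory[addr:addr + 4], "big")

    def store_conditional(self, addr, value):  # models stwcx.
        res, self.reserved = self.reserved, None
        if res is None or res != (addr // LINE, self.table.version(addr)):
            return False                       # reservation lost: caller retries
        self.memory[addr:addr + 4] = value.to_bytes(4, "big")
        self.table.notify_write(addr, 4)
        return True


memory = bytearray(4096)
table = ReservationTable()
ppu = GuestThread(table, memory)

value = ppu.load_reserved(0x100)
ok_uncontended = ppu.store_conditional(0x100, value + 1)

value = ppu.load_reserved(0x100)
table.notify_write(0x140, 16)   # DMA lands elsewhere in the same 128-byte line
ok_after_dma = ppu.store_conditional(0x100, value + 1)
```

The second store fails even though the DMA never touched the word being updated. That line-granular behavior is exactly what a naive compare-and-swap translation misses.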

SPURS: Sony’s Cooperative Threading Nightmare

Sony shipped a middleware library called SPURS — the SPU Runtime System. Almost every serious PS3 game used it. SPURS is a cooperative fiber scheduler that runs on SPU groups, letting games treat their SPU allocation as a general-purpose thread pool rather than manually scheduling each SPU.

The problem for emulation: SPURS isn’t just a library. It’s deeply stateful, uses SPU mailboxes as semaphores, relies on precise timing between the PPU and SPU scheduling loops, and uses Cell-specific atomic primitives with no direct x86 equivalent.

RPCS3 implements SPURS through a combination of cellSpurs.h HLE — where you intercept the library call and substitute your own implementation — and low-level SPU accuracy. (Fibers are a separate PS3 API implemented in cellFiber.h; SPURS has its own dedicated module.) Getting SPURS working correctly was a prerequisite for a huge category of commercial games becoming playable. Before it worked, any game leaning heavily on SPURS would hang at startup or make it to gameplay and then crash in ways that were essentially impossible to debug from the outside.
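If HLE is new to you, the core mechanism is just a lookup table of replacement functions. A generic sketch, with every name invented for illustration (this is not how cellSpurs.h is organized, and `scheduler_init` is not a real PS3 function):

```python
class HleRegistry:
    """Hypothetical table mapping guest library function names to host code."""

    def __init__(self):
        self.handlers = {}

    def register(self, name):
        def decorator(fn):
            self.handlers[name] = fn
            return fn
        return decorator

    def call(self, name, *args):
        handler = self.handlers.get(name)
        if handler is None:
            return ("lle", name)          # no replacement: run the guest's own code
        return ("hle", handler(*args))    # intercepted: run the host implementation


hle = HleRegistry()


@hle.register("scheduler_init")
def scheduler_init(num_spus):
    # Host-side stand-in for what the guest library would have set up.
    return {"spus": num_spus, "tasks": []}


intercepted = hle.call("scheduler_init", 5)
fallthrough = hle.call("unknown_function")
```

The hard part is not the dispatch. It’s that the replacement has to leave guest-visible state exactly as the original library would have, because the SPU-side code still runs for real and reads that state.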

The RSX and the FIFO

The RSX is RPCS3’s other major challenge. It’s an Nvidia NV47 derivative with PS3-specific modifications. The PPU communicates with it through a ring buffer: the FIFO.

RSXFIFO.h tracks six states: running, empty, spinning, NOP, lock wait, and paused. RSXThread.h continuously reads from the ring buffer and dispatches GPU commands. But those commands reference GPU microcode — PS3 vertex and fragment shaders written in a custom Nvidia format that modern GPUs don’t understand at all.
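The ring buffer itself is the simple part. Here’s a sketch with a producer-side put pointer and a consumer-side get pointer. `CommandFifo` is a hypothetical class, the command words are arbitrary numbers, and only two of the six states are modeled.

```python
class CommandFifo:
    """Hypothetical ring buffer with put/get pointers (one slot kept free)."""

    def __init__(self, size):
        self.buffer = [0] * size
        self.put = 0   # next slot the producer (PPU side) writes
        self.get = 0   # next slot the consumer (RSX thread) reads

    def state(self):
        return "empty" if self.get == self.put else "running"

    def push(self, word):
        nxt = (self.put + 1) % len(self.buffer)
        if nxt == self.get:
            raise BufferError("FIFO full")
        self.buffer[self.put] = word
        self.put = nxt

    def pop(self):
        if self.get == self.put:
            return None
        word = self.buffer[self.get]
        self.get = (self.get + 1) % len(self.buffer)
        return word


fifo = CommandFifo(4)
for word in (0xA, 0xB, 0xC):
    fifo.push(word)
state_before = fifo.state()

drained = []
while fifo.state() == "running":
    drained.append(fifo.pop())
state_after = fifo.state()

fifo.push(0xD)          # the put pointer wraps around to slot 0
wrapped = fifo.pop()
```

The consumer catches up when get equals put. Everything interesting in the real FIFO is in what the words mean and what the thread does while it waits.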

RPCS3 has to decompile that microcode, reconstruct the original intent, and recompile it to SPIR-V (via GLSLANG) for the Vulkan backend in VKGSRender.h, or to GLSL for the OpenGL backend in GLGSRender.h. The texture cache in VKTextureCache.h adds another layer: PS3 textures use swizzling patterns and compression formats that need conversion on the fly, and the cache has to track when the PPU or an SPU writes new texture data into XDR RAM so it can invalidate GPU-side copies.
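The invalidation half of that texture cache can be sketched as an interval-overlap check. This is an illustration under my own assumptions, not the logic in VKTextureCache.h: `TextureCache` is a hypothetical class, and the byte reversal is a placeholder for real deswizzling or decompression.

```python
class TextureCache:
    """Hypothetical cache of converted textures keyed by guest memory range."""

    def __init__(self):
        self.entries = {}   # (start, size) -> converted texture data

    def get(self, memory, start, size, convert):
        key = (start, size)
        if key not in self.entries:
            self.entries[key] = convert(bytes(memory[start:start + size]))
        return self.entries[key]

    def notify_write(self, addr, size):
        """Drop every cached texture whose source range overlaps the write."""
        end = addr + size
        stale = [k for k in self.entries if k[0] < end and addr < k[0] + k[1]]
        for key in stale:
            del self.entries[key]
        return len(stale)


def convert(raw):
    return raw[::-1]   # placeholder for deswizzle / format conversion


memory = bytearray(b"\x01\x02\x03\x04" + bytes(12))
cache = TextureCache()

first = cache.get(memory, 0, 4, convert)
memory[1] = 0x99                                # guest overwrites texture data
still_old = cache.get(memory, 0, 4, convert)    # cache has not been told yet
dropped = cache.notify_write(1, 1)
refreshed = cache.get(memory, 0, 4, convert)
```

The `still_old` read is the bug you get when a write goes unnoticed: the GPU keeps drawing a texture that no longer matches guest memory.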

RPCS3 running Demon’s Souls at 4K.
Source: rpcs3.net compatibility database

Shader recompilation is why RPCS3 stutters noticeably the first time you enter a new area or trigger a new effect. There’s no way around it entirely. You can precompile a cache from a previous run, but the first session cold-starts from nothing.
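The cold-start versus warm-start difference looks roughly like this. It’s a sketch only: `ShaderCache` and `toy_translate` are hypothetical, keying on a SHA-256 of the microcode bytes is my assumption rather than RPCS3’s documented scheme, and the “second session” is simulated by handing one cache’s contents to another.

```python
import hashlib


class ShaderCache:
    """Hypothetical cache keyed by a hash of the guest microcode bytes."""

    def __init__(self, translate, preload=None):
        self.translate = translate
        self.compiled = dict(preload or {})
        self.misses = 0

    def get(self, microcode):
        key = hashlib.sha256(microcode).hexdigest()
        if key not in self.compiled:
            self.misses += 1                       # this is where the stutter happens
            self.compiled[key] = self.translate(microcode)
        return self.compiled[key]


def toy_translate(microcode):
    # Stand-in for decompile + recompile; a real backend would emit SPIR-V or GLSL.
    return "shader_" + microcode.hex()


programs = [b"\x01\x02", b"\x03\x04", b"\x01\x02"]

cold = ShaderCache(toy_translate)                  # first session: nothing cached
for program in programs:
    cold.get(program)

warm = ShaderCache(toy_translate, preload=cold.compiled)   # later session
for program in programs:
    warm.get(program)
```

The cold run pays for each distinct shader once. The warm run pays for none of them, which is why a second playthrough of the same area is smooth.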

SPU Execution Loop: What Running an SPU Actually Looks Like

Each emulated SPU thread runs the same basic loop: look up the code at the current program counter, execute it, service whatever channel operation it raised, and repeat. JIT-compiled blocks replace the fetch-decode-execute cycle with native code for straight-line SPU basic blocks. When the SPU issues a DMA command via the SPU_MFC_Cmd channel, the MFC handler copies data at the right time and potentially invalidates PPU-side reservations. Mailbox reads can block: if the mailbox is empty, the SPU spins or yields depending on channel semantics. Getting this right, without burning CPU cycles on pointless spinning, required careful implementation of the full SPU channel I/O model.
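In miniature, that loop might look like the following. Treat it as a toy: `run_spu`, the action names, and the single register are all invented for illustration, and each “block” is a Python function standing in for JIT-compiled native code.

```python
from collections import deque


def run_spu(blocks, in_mbox, max_steps=100):
    """Hypothetical SPU thread loop. Each block returns (action, next_pc)."""
    pc, regs = 0, {"r0": 0}
    for _ in range(max_steps):
        action, next_pc = blocks[pc](regs)     # run one compiled basic block
        if action == "next":
            pc = next_pc
        elif action == "read_mbox":
            if not in_mbox:
                # Nothing to read: give the host thread back instead of spinning.
                return "blocked", pc, regs
            regs["r0"] = in_mbox.popleft()
            pc = next_pc
        elif action == "stop":
            return "stopped", pc, regs
    return "timeout", pc, regs


def block_wait_for_input(regs):
    return "read_mbox", 1          # ask the loop to service a mailbox read


def block_double_and_stop(regs):
    regs["r0"] *= 2
    return "stop", None


blocks = {0: block_wait_for_input, 1: block_double_and_stop}

blocked = run_spu(blocks, deque())         # empty mailbox: the thread parks at pc 0
finished = run_spu(blocks, deque([21]))    # message available: runs to completion
```

A blocked thread returns with its program counter unchanged, so it can resume at the same channel read once the PPU posts a message.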

The Compatibility Arc

In 2015, RPCS3 could boot a handful of commercial games. Most got to a black screen or a crash. The PPU recompiler push during 2016–2017 unlocked a lot of games that were purely CPU-bound.

The SPURS improvements around 2017–2018 were probably the single biggest unlock. Games that had been hanging at boot — waiting on a SPURS task that never completed — could suddenly start. The monthly progress reports from that period read like a dam slowly giving way.

The SPU LLVM recompiler introduced in early 2019 improved accuracy on a class of SPU timing bugs that asmjit’s more mechanical translation couldn’t catch. Some games only work correctly at high SPU accuracy because they rely on behaviors that fall out of the hardware’s precise execution order. That’s the kind of bug that takes months to find.

As of the project’s GitHub history — 334 contributors, 18,948 commits — RPCS3 covers a significant chunk of the PS3 library. Demon’s Souls, Persona 5, Killzone 2, God of War III all run at playable or better framerates. Some run better than on original hardware, because the emulator can exceed the PS3’s original CPU ceiling.

The Open-Source Engineering Reality

334 contributors sounds like a lot until you realize most major features were driven by a tiny core. Nekotekina wrote enormous parts of the CPU emulation. kd-11 essentially owns the RSX backend. A contributor named elad335 has submitted hundreds of accuracy fixes over the years — the kind of one-game, one-bug patches that keep the compatibility list climbing.

There’s a Patreon that helps pay for infrastructure. The people writing this code are doing it because they find the problem interesting, because they want to preserve games that would otherwise become unplayable as original hardware ages out, or because — let’s be honest — it’s one of the hardest emulation problems left and that’s its own reward.

The ARM64 port from December 2024 is a good example of the engineering culture. Rather than rewriting the JIT backends from scratch, the developer built an IR transformer layer that takes the existing x86-64 LLVM IR output and adjusts it for ARM64 — different calling conventions, different stack frame layout, different return address handling. Clever, and the kind of thinking you get from people who’ve been staring at this codebase for years.

What RPCS3 pulled off: they made a totally alien processor, one that broke x86’s assumptions about memory, coherency, threading, and I/O, run its programs on hardware that shares almost none of its properties. The Cell was supposed to outlive the PS3. It didn’t. But the software that ran on it will.

RPCS3 source is at github.com/RPCS3/rpcs3, licensed under GPL-2.0. Compatibility database and monthly progress reports at rpcs3.net. If you use the emulator and a game works, file a compatibility report — that data drives development priorities.
