13.4 J/token — Do You Know What That Means?
Running an LLM on an RTX 4060: 145W power draw, 10.8 tokens/sec generation. Simple division gives 13.4 joules per token.
13.4J means a single AA battery's entire energy produces only 700 tokens. Two batteries for ~1000 characters of text.
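The arithmetic behind these numbers is easy to check. A minimal sketch, assuming an alkaline AA holds roughly 2.6 Wh of usable energy (the 145 W and 10.8 tok/s figures are the measurements above):

```python
# Energy per token from measured power and throughput
power_w = 145.0          # RTX 4060 board power while generating
tokens_per_sec = 10.8    # measured generation speed
j_per_token = power_w / tokens_per_sec
print(f"{j_per_token:.1f} J/token")   # → 13.4 J/token

# Tokens per AA battery (assumption: ~2.6 Wh usable alkaline capacity)
aa_joules = 2.6 * 3600   # Wh → J
print(f"{aa_joules / j_per_token:.0f} tokens per AA")  # → ~697
```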
Of that 13.4J, how much actually goes to matrix multiplication? Under 5%. The rest is copper wire resistance heating inside the chip, bus driving energy for DRAM round-trips, and power supply losses.
"Computation is almost free. Moving data is expensive." — That's the semiconductor reality of 2026.
The root-cause fix everyone's betting on: optical interconnects, specifically CPO (Co-Packaged Optics). Place optical devices right next to the compute chip, minimize electrical signal travel distance. NVIDIA and Intel are both investing heavily.
But here's my position: CPO is palliative care, not a cure.
This article covers three physical walls CPO cannot break, plus what technologies might actually matter — with equations and concrete numbers.
Chapter 1: Why CPO Is Called the "Savior"
Fair to the other side first. You have to understand your opponent before you dismantle them.
Electrical vs Optical — Energy Efficiency Comparison
Chip-to-chip data transfer power consumption, roughly:
| Transmission Method | Energy Efficiency | Distance |
|---|---|---|
| On-chip copper interconnect | 0.1–1 pJ/bit | < 10mm |
| PCB copper trace | 5–15 pJ/bit | 10–300mm |
| Pluggable optical module | 15–25 pJ/bit | 1–100m |
| CPO (target) | 1–5 pJ/bit | 10mm–several meters |
PCB copper costs 5–15 pJ/bit. CPO targets 1–5 pJ/bit. For AI workloads where GPU-to-GPU communication dominates (like H100 clusters), estimates suggest 20–30% total power reduction from CPO alone.
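To see what those pJ/bit figures mean at link level, multiply energy per bit by bandwidth. A sketch for an illustrative 1.6 Tbps link, using the table's pJ/bit ranges:

```python
def link_power_w(pj_per_bit: float, tbps: float) -> float:
    """Transmission power for one link: energy/bit × bits/second."""
    return pj_per_bit * 1e-12 * tbps * 1e12

for name, pj in [("PCB copper (worst)", 15), ("PCB copper (best)", 5),
                 ("CPO target (worst)", 5), ("CPO target (best)", 1)]:
    print(f"{name}: {link_power_w(pj, 1.6):.1f} W at 1.6 Tbps")
```

Per 1.6 Tbps link, PCB copper burns 8–24 W on transmission alone; CPO's target range is 1.6–8 W. Multiply by thousands of links per rack and the 20–30% cluster-level estimate looks plausible.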
Break-Even Point
The energy crossover between electrical and optical sits at roughly 100mm. Below that, electrical wins (no conversion overhead). Above it, optical wins (near-zero transmission loss).
GPU-to-GPU, GPU-to-HBM, switch-to-GPU — all exceed 100mm. So CPO should be the answer.
The logic is sound. The logic.
Chapter 2: Three Physical Walls CPO Cannot Break
Wall 1: Thermal Layout Deadlock
Think seriously about CPO's physical layout and you immediately hit contradictions:
- DSP (Digital Signal Processing chip): Biggest heat source. 100W class.
- Optical modulators/photodetectors: Need to be near the DSP (minimize electrical wiring).
- Laser source: Needs to be near the optical devices (minimize fiber coupling loss).
The problem: lasers die from heat.
InP semiconductor laser efficiency degrades exponentially with temperature. With a characteristic temperature T₀ of 50–70K, a 10°C temperature rise increases threshold current by 15–20%:
I_th(T) = I_th(T_ref) × exp((T − T_ref) / T₀)
where T_ref is a reference temperature (e.g., 25°C) — not to be confused with the characteristic temperature T₀.
DSP pumping 100W of heat. Optical devices next to it. Laser next to them. Lasers degrade rapidly above 55°C.
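The exponential sensitivity is easy to quantify with I_th(T) = I_th(T_ref)·exp((T − T_ref)/T₀). A sketch — the T₀ value is from the range quoted above; the 20 mA threshold at 25°C is an illustrative assumption:

```python
import math

def i_th(i_ref_ma: float, t_c: float, t_ref_c: float, t0_k: float) -> float:
    """Threshold current vs temperature: I_th(T) = I_th(T_ref)·exp((T−T_ref)/T0)."""
    return i_ref_ma * math.exp((t_c - t_ref_c) / t0_k)

# Assumed: 20 mA threshold at 25°C, characteristic temperature T0 = 60 K
for t in (25, 35, 55, 75):
    print(f"{t}°C: {i_th(20.0, t, 25.0, 60.0):.1f} mA")
```

At T₀ = 60K, going from 25°C to 35°C raises the threshold ~18%; by 75°C it has more than doubled. That's why the laser can't sit next to a 100W DSP.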
Everything must be close together for CPO to work. But put everything close together and the laser cooks. No escape.
Current workaround: External Laser Source (ELS) — move the laser outside the package, pipe light in via fiber. But this introduces 1–3 dB coupling loss at 2–4 connection points, potentially blowing the power budget. And you end up with "Co-Packaged" Optics where the light source isn't co-packaged. The name contradicts itself.
Wall 2: The Photodiode Dilemma — RC Delay vs Space-Charge Effect
The photodiode (PD) that converts light back to electricity faces a fundamental tradeoff.
What determines PD bandwidth? Two factors:
① RC Delay
A PD is essentially a capacitor. With active area A, depletion width d, and permittivity ε:
C = ε × A / d
Bandwidth:
f_c = 1 / (2π × R × C)
Larger area A → larger capacitance C → lower bandwidth.
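Plugging representative numbers into these two formulas shows how tight the window is. A sketch, assuming an InGaAs PD (ε_r ≈ 13.9), a 20 µm diameter active area, 1 µm depletion width, and a 50 Ω load — all illustrative values:

```python
import math

EPS0 = 8.854e-12          # vacuum permittivity, F/m
eps_r = 13.9              # relative permittivity of InGaAs (assumption)
diameter = 20e-6          # active area diameter, m (assumption)
d = 1e-6                  # depletion width, m (assumption)
r_load = 50.0             # load resistance, ohms

area = math.pi * (diameter / 2) ** 2
c = eps_r * EPS0 * area / d           # C = ε × A / d
f_c = 1 / (2 * math.pi * r_load * c)  # f_c = 1/(2π × R × C)
print(f"C = {c*1e15:.0f} fF, f_c = {f_c/1e9:.0f} GHz")
```

With these numbers the RC-limited bandwidth lands around 82 GHz — already marginal for 112 Gbaud, and shrinking the diameter to buy more bandwidth runs straight into the second factor.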
② Space-Charge Effect
So make the area smaller? Not that simple.
Smaller area means the same optical power concentrates on less surface. Photogenerated carrier density gets so high that the internal electric field is screened. Carrier transit velocity drops. Bandwidth collapses.
This is the Space-Charge Effect.
Summary:
| Active Area | RC Delay | Space-Charge | Bandwidth |
|---|---|---|---|
| Large | Capacitance↑ → bandwidth↓ | No issue | ❌ |
| Small | No issue | Carrier congestion → bandwidth↓ | ❌ |
| Medium | Moderate | Moderate | △ |
Go either direction, bandwidth collapses. The only option is balancing in the middle, but at 112 Gbaud and above, that "middle" window keeps shrinking.
UTC-PDs (Uni-Traveling-Carrier PDs) mitigate the space-charge effect by using only electrons in the transit layer. But manufacturing complexity skyrockets, and they're incompatible with silicon photonics CMOS processes.
Wall 3: The DSP Power Wall — Moore's Law Doesn't Save "Movement"
The last and largest wall.
Breaking down 1.6 Tbps DSP chip power consumption: logic gate switching accounts for only 20–30% of total. The dominant factor is internal wire charge/discharge energy:
P_wire = C_wire × V² × f × α
Where C_wire is wire capacitance, V is supply voltage, f is operating frequency, α is switching probability.
As transistors shrink, C_wire doesn't scale proportionally. Thinner wires mean higher resistance R, increasing RC delay, requiring buffer (repeater) insertion — which adds its own power. This is the interconnect scaling problem.
"Joint DSP" — integrating multi-wavelength DSP processing on a single chip — is architecturally elegant but explodes on-chip data movement. Processing four 800GbE wavelengths on one chip requires 3.2+ Tbps internal bandwidth, with wire power alone reaching tens of watts.
Chiplet decomposition is an escape route, but die-to-die bandwidth is 1/3 to 1/4 of on-chip wiring, meaning bandwidth inflates 3–4× for the same data volume.
Moore's Law makes transistors smaller. It doesn't change wire physics. CPO can fix chip-to-chip transmission but can't touch the chip-internal wiring problem.
Chapter 3: HCF (Hollow-Core Fiber) — Rising Star with Triple Weakness
A separate approach from CPO: HCF (Hollow-Core Fiber). Literally, optical fiber with a hollow core.
Why Hollow Is Desirable
Standard optical fiber confines light in a glass (SiO₂) core. Glass has a refractive index of ~1.45, so signals travel at ~69% of the vacuum speed of light (c/n = c/1.45).
HCF core is air (n ≈ 1.0), so light travels at nearly vacuum speed. 31% latency reduction. In HFT (high-frequency trading), microsecond differences translate to millions in profit — HCF is already deployed in production.
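The latency claim follows directly from the refractive index. A sketch (n ≈ 1.45 for a glass core, ≈ 1.0 for HCF):

```python
C_VAC = 299_792_458  # speed of light in vacuum, m/s

def latency_us_per_km(n: float) -> float:
    """One-way propagation delay per km of fiber with refractive index n."""
    return n * 1000 / C_VAC * 1e6

glass = latency_us_per_km(1.45)   # standard SMF
hollow = latency_us_per_km(1.0)   # hollow core ≈ vacuum
print(f"glass: {glass:.2f} µs/km, HCF: {hollow:.2f} µs/km, "
      f"saving {(1 - hollow/glass)*100:.0f}%")
```

Roughly 4.84 µs/km vs 3.34 µs/km — 1.5 µs saved per kilometer, which compounds fast over an HFT route.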
Another advantage: near-zero nonlinearity. In glass core, high-power light triggers self-phase modulation (SPM) and four-wave mixing (FWM), degrading signal quality. In HCF, light travels through air — tens of watts cause no nonlinear distortion.
Triple Weakness: Won't Bend, Won't Split, Won't Splice
But HCF has three fatal flaws:
① Bend Loss
HCF's confinement mechanism (photonic crystal cladding or NANF structure) is fragile under bending. Minimum bend radius: 50–100mm vs standard fiber's ~15mm. Won't fit in data center cable trays.
② Splicing Difficulty
Fusion-splicing standard fibers costs ~0.02 dB loss. HCF splicing must maintain the hollow structure — splice loss is 0.5–1 dB. 1 dB means 20% optical power gone. Each splice point erodes the power budget.
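The dB-to-percent conversion is worth making explicit, because splice losses compound multiplicatively along a link. A sketch:

```python
def db_to_fraction_lost(db: float) -> float:
    """Fraction of optical power lost for a given dB of loss."""
    return 1 - 10 ** (-db / 10)

print(f"0.02 dB → {db_to_fraction_lost(0.02)*100:.1f}% lost")  # standard splice
print(f"1.0 dB  → {db_to_fraction_lost(1.0)*100:.1f}% lost")   # HCF splice

# Three 1 dB HCF splices in series: dB values add, power multiplies
remaining = 10 ** (-3 * 1.0 / 10)
print(f"3 × 1 dB splices → {(1 - remaining)*100:.0f}% of power gone")
```

A standard splice loses ~0.5% of the power; one HCF splice loses ~21%; three of them eat roughly half the budget.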
③ No Splitters
PON (Passive Optical Network) topologies split one fiber into 32–128 branches. Physically impossible with HCF (can't branch a hollow structure). This means scrapping all existing PON infrastructure.
These three flaws prevent HCF integration into existing infrastructure, regardless of its transmission superiority. It's limited to greenfield, high-value applications: long-haul core network links and ultra-low-latency HFT links.
Chapter 4: Kerr Microcombs — Repurposing a Nobel Prize Tool for Communication
Another optical interconnect bottleneck: light source scalability.
Standard WDM (Wavelength Division Multiplexing) with 100 wavelengths requires 100 lasers. Each one held on its 0.8nm (100 GHz) grid slot, temperature-compensated, with hot-standby failover — all eating operational cost and rack space.
Enter Kerr Microcombs.
How It Works
Inject a single CW (continuous-wave) laser into a Si₃N₄ (silicon nitride) ring resonator. As light circulates thousands of times, the Kerr effect (refractive index changing proportionally to light intensity) generates new frequencies. Result: dozens to hundreds of equally-spaced wavelengths from a single laser.
This is an "optical frequency comb" — an extension of the technology behind the 2005 Nobel Prize in Physics. Originally for atomic clocks and precision spectroscopy, now being repurposed for communication.
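The comb's appeal is that channel spacing comes for free from resonator geometry: line n sits at f_pump + n·FSR. A sketch, assuming a 100 GHz free spectral range and a 1550 nm pump (both illustrative values):

```python
C_VAC = 299_792_458  # speed of light in vacuum, m/s

pump_nm = 1550.0     # pump wavelength (assumption)
fsr_hz = 100e9       # ring free spectral range (assumption)
f_pump = C_VAC / (pump_nm * 1e-9)

# Wavelengths of the first few comb lines on either side of the pump
for n in (-2, -1, 0, 1, 2):
    wl_nm = C_VAC / (f_pump + n * fsr_hz) * 1e9
    print(f"line {n:+d}: {wl_nm:.2f} nm")
```

Adjacent lines land ~0.80 nm apart — exactly a 100 GHz WDM grid. One laser replaces a hundred, provided the soliton state holds.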
Why Si₃N₄?
- High Q-factor: Ring resonator Q reaches 10⁶–10⁷. Light circulates thousands of times, accumulating nonlinear effects
- CMOS-compatible: Manufacturable in SiPh foundries. Ligentec, IMEC have volume production
- Thermo-optic stability: Refractive index temperature sensitivity 1/10th of InP. Easy wavelength control with external heaters
The Problem: Still Stuck in the Lab
Kerr microcombs' biggest issue: startup reproducibility. Generating the comb requires locking to a soliton state (Dissipative Kerr Soliton: DKS), which demands slowly tuning the laser frequency to match the ring resonance — an inherently unstable process.
Lab papers report >90% success rates, but no one has demonstrated 24/7/365 stable operation in data center production environments. "90% success rate" means "fails 1 in 10 attempts." In a 100,000-port data center, that's 10,000 ports in abnormal state at any given time.
There's also uneven power across comb lines. Power drops off from the center wavelength, creating SNR disparities between WDM channels. Compensating this in DSP brings us right back to — you guessed it — Chapter 2's DSP power wall.
Chapter 5: The Realistic 2030 Landing Zone
Synthesizing everything above, here's where optical-electrical convergence technology likely lands around 2030.
Computation Stays Electrical
Optical computing (matrix multiplication via light) exists as research but comes nowhere near digital electronics in flexibility and integration density. 3D stacking + nanosheet FETs (GAA) keep computation firmly in the electrical domain.
Only Chip-to-Chip Communication Goes Optical
CPO's real value isn't "make everything optical" but "replace electrical wiring above 100mm with optical links." That alone buys 15–25% total power reduction. Thermal layout compromises via external lasers. PD dilemma mitigated through UTC-PD and advanced structures. Not perfect, but production-viable.
Efficiency Gains Are Modest
"Optical = 100× efficiency" is physics fiction. Realistic projection: 3–5× improvement over current, meaning 13.4 J/token drops to 3–4 J/token. Getting to 1/100 requires post-2035 convergence of process node advances (1.4nm, A14), mature 3D stacking, HBM4+ memory bandwidth expansion — optical is just one piece of that puzzle.
What a Real Breakthrough Requires
Reducing 13.4J to 0.1J through optical-electrical convergence requires solving the chip-internal wiring problem — completely outside CPO's scope. What's needed:
- 3D Monolithic Integration: Physically shorten wire lengths. But thermal density becomes hellish
- Near-Memory / In-Memory Computing: Don't move data — compute where it lives
- Analog Computing: Execute matrix operations in analog circuits (or optical analog circuits), tolerating AD/DA conversion overhead
All still research-stage by 2030. Hope for them, but don't bet on them.
Semiconductors and photonics have been promising "revolution in the next decade" for twenty years. The revolution never comes. What comes is unglamorous, incremental improvement. Just getting from 13.4 J/token to 4 J/token requires overcoming the PD dilemma, the DSP power wall, and the thermal layout deadlock — all three.
Next I want to benchmark Speculative Decoding with llama.cpp's --draft-model option on the RTX 4060, or do a simulation-based quantitative evaluation of HCF bend loss. Want to do both.