Gemma-4-31B on v6e-4 TPU Benchmarks

Gemma 4 Challenge: Build With Gemma 4 Submission

model: Gemma-4-31B

🚀 Gemma 4 TPU v6e-4 Performance Report

Throughput Saturation: Throughput peaked at concurrency 256, reaching 463,345 tok/s.
Trillium Scalability: The TPU v6e-4 architecture handled 1024 concurrent requests without memory exhaustion, keeping throughput stable even under heavy queueing.
Responsive Context: Even at 16K tokens, TTFT (time to first token) stayed under 1 second at low concurrency levels (C1-C8).
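
As a rough illustration of how figures like TTFT and aggregate tok/s are commonly derived from per-request timestamps, here is a minimal sketch. The `RequestRecord` schema and the `summarize` helper are hypothetical names, not part of the benchmark harness behind this report, and the sample records are synthetic. The report does not say whether input tokens are counted in tok/s, so the sketch leaves the meaning of `tokens` to the caller.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """Timing for one request (hypothetical schema, not the report's)."""
    start_s: float        # when the request was sent
    first_token_s: float  # when the first output token arrived
    end_s: float          # when the last output token arrived
    tokens: int           # tokens attributed to this request


def summarize(records):
    """Return mean TTFT and aggregate tokens/s over the wall-clock window."""
    ttfts = [r.first_token_s - r.start_s for r in records]
    # Aggregate throughput uses the whole window, so overlapping
    # (concurrent) requests are counted against shared wall-clock time.
    window_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    total_tokens = sum(r.tokens for r in records)
    return {
        "mean_ttft_s": mean(ttfts),
        "throughput_tok_s": total_tokens / window_s,
    }


# Synthetic example: two concurrent requests with made-up numbers.
records = [
    RequestRecord(start_s=0.0, first_token_s=0.25, end_s=2.0, tokens=300),
    RequestRecord(start_s=0.0, first_token_s=0.75, end_s=4.0, tokens=500),
]
summary = summarize(records)
print(summary)
```

Dividing by the shared wall-clock window rather than summing per-request rates is what lets throughput flatten out as concurrency rises, which is the saturation behavior described above.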

Report generated by Gemini CLI on 2026-05-08.

Metric	Gemma 4 31B (Dense)	Gemma 4 26B (MoE)	Winner
Model Architecture	Dense (31B parameters)	Sparse (26B Total / 3.8B Active)	MoE (Efficiency)
Peak Throughput (TPU v6e-4)	463,345 tok/s	~457,000 tok/s	Dense (Slightly)
Interactive Latency (TTFT)	0.314s (measured at C1/128t)	< 1.200s (interactive target)	Dense (Low Load)
Active Compute Cost	31B params / token	3.8B params / token	MoE (~8x lower)
Max Context Window	64K (Tested to 16K)	256K (Shared KV Cache)	MoE

Throughput Parity: In our benchmarks, the 31B Dense model matches or slightly exceeds the peak throughput of the 26B MoE model on the same TPU v6e-4 hardware (463,345 vs. ~457,000 tok/s, a gap of about 1.4%). A difference this small is best read as parity, and it suggests that the Trillium stack is well optimized for dense matrix operations.
Compute Efficiency: Although throughput is similar, the MoE model activates only 3.8B parameters per token against the Dense model's 31B, roughly 8x fewer. In a multi-tenant environment, the MoE model would likely sustain higher concurrent user counts before hitting power or thermal limits.
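Latency Advantage: The Dense model is more responsive for low-load interactive tasks, with a measured TTFT of 0.314s at C1/128t. That is well below the 1.2s interactive target for the MoE model, although this compares a measurement against a target rather than two measurements.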
Context Scaling: The MoE model's Shared KV Cache allows it to scale to 256K tokens, whereas our Dense stack is currently optimized for high throughput in the 16K-64K range and has been tested only up to 16K.
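
The ratios discussed above can be re-derived from the table's own figures. The sketch below assumes that active parameter count is a fair proxy for per-token compute (it ignores attention cost and memory traffic), and the constant names are illustrative only.

```python
# Figures quoted in the comparison table.
DENSE_ACTIVE_PARAMS_B = 31.0  # dense model: all parameters active per token
MOE_ACTIVE_PARAMS_B = 3.8     # MoE model: active parameters per token
DENSE_PEAK_TOK_S = 463_345
MOE_PEAK_TOK_S = 457_000      # approximate ("~457,000" in the table)

# How many times more parameters the dense model touches per token.
active_param_ratio = DENSE_ACTIVE_PARAMS_B / MOE_ACTIVE_PARAMS_B

# Dense model's peak-throughput lead over the MoE model, in percent.
throughput_gap_pct = (DENSE_PEAK_TOK_S / MOE_PEAK_TOK_S - 1.0) * 100.0

print(f"active-parameter ratio: {active_param_ratio:.2f}x")
print(f"dense throughput lead:  {throughput_gap_pct:.2f}%")
```

With the rounded parameter counts from the table, the ratio comes out a little above 8. A figure computed from unrounded counts may differ somewhat.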