xbill for Google Developer Experts

Gemma-4-26B on v6e-4 TPU Benchmarks

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

model: google/gemma-4-26B-A4B-it

gemini "run the context benchmark with concurrency from 1, 2, 4, 8, 16, 32,64, 128, 256, 512, 1024 up to 16384 tokens"

Extreme Stress Test at 1024 Concurrency: Execute an extreme-load benchmark sweep up to concurrency 1024 to find the breaking point.

╭─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✓ Shell Execute an extreme-load multi-dimensional benchmark sweep up to concurrency 1024. Steps reduced to 2 (0, 8192, 16384) to manage t… │
│ │
│ ... first 89 lines hidden (Ctrl+O to show) ... │
│ | 128 | 256 | 1.14308 | 18554.9 | │
│ | 8192 | 256 | 3.15647 | 419771 | │
│ | 16384 | 256 | 5.58021 | 457101 | │
│ | 128 | 512 | 3.80498 | 11249.3 | │
│ | 8192 | 512 | 6.69551 | 374758 | │
│ | 16384 | 512 | 10.7299 | 443335 | │
│ | 128 | 1024 | 14.7126 | 5878.47 | │
│ | 8192 | 1024 | 19.6215 | 267038 | │
│ | 16384 | 1024 | 24.6752 | 388684 | │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

✦ The extreme-load benchmark sweep up to 1024 concurrency is complete. We have successfully identified the breaking point of the current TPU v6e vLLM stack.

📊 Extreme Load Performance Summary

┌─────────────┬──────────────────┬──────────────┬───────────────────┐
│ concurrency │ context (tokens) │ avg_ttft (s) │ batch prefill_tps │
├─────────────┼──────────────────┼──────────────┼───────────────────┤
│ 128         │ 16,384           │ 3.30         │ 439,833           │
│ 256         │ 16,384           │ 5.58         │ 457,101           │
│ 512         │ 16,384           │ 10.73        │ 443,335           │
│ 1024        │ 16,384           │ 24.68        │ 388,684           │
└─────────────┴──────────────────┴──────────────┴───────────────────┘
📉 Breaking Point Analysis

  • Saturation Point: The aggregate throughput peaked at 457,101 tokens/second at 256 concurrency. Beyond this point throughput declines (443,335 at 512 and 388,684 at 1024), most likely because request management and scheduling overhead become the bottleneck.
  • The "Knee" of the Curve:
    • Throughput Decline: Moving from 256 to 1024 concurrency resulted in a 15% drop in aggregate throughput (457k to 389k).
    • Latency Growth: The Time to First Token (TTFT) grew slightly faster than concurrency, from ~5.6s at concurrency 256 to nearly 25 seconds at concurrency 1024 (about 4.4× the latency for 4× the load).
  • Small Context Collapse: At 1024 concurrency, even short 128-token requests took 14.7 seconds to start generating, compared to just 0.3s at low concurrency. This indicates the server's internal queues are completely saturated.
  • Robustness: Remarkably, the server did not crash or OOM even under 1024 concurrent 16,384-token requests (about 16.8 million prompt tokens in flight). This speaks to the robustness of the vLLM-TPU implementation and the underlying Trillium hardware.
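The headline figures in this analysis can be re-derived from the summary table. The following is a minimal sketch that only re-uses the four 16,384-token rows shown above; the variable names are illustrative and are not taken from the benchmark scripts.

```python
# Values copied from the 16,384-token rows of the summary table.
CONTEXT_TOKENS = 16_384
results = {  # concurrency -> (avg_ttft_s, batch_prefill_tps)
    128: (3.30, 439_833),
    256: (5.58, 457_101),
    512: (10.73, 443_335),
    1024: (24.68, 388_684),
}

# Concurrency level with the highest aggregate prefill throughput.
peak_conc = max(results, key=lambda c: results[c][1])
peak_tps = results[peak_conc][1]

# Relative throughput loss and TTFT growth between the peak and 1024.
drop_pct = 100 * (peak_tps - results[1024][1]) / peak_tps
ttft_ratio = results[1024][0] / results[peak_conc][0]

# Prompt tokens submitted at once in the largest test cell.
tokens_in_flight = 1024 * CONTEXT_TOKENS

print(f"peak: {peak_tps:,} tok/s at concurrency {peak_conc}")
print(f"throughput drop {peak_conc} -> 1024: {drop_pct:.1f}%")
print(f"TTFT growth {peak_conc} -> 1024: {ttft_ratio:.2f}x")
print(f"prompt tokens in flight: {tokens_in_flight:,}")
```

Quadrupling concurrency from 256 to 1024 multiplies average TTFT by a little more than four, so the latency growth tracks load closely rather than compounding with each step.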

🏁 Final Recommendation
For this google/gemma-4-26B-A4B-it deployment on a 4-chip TPU v6e pod:

  • Optimal High-Throughput: Target 128-256 concurrency. This yields the highest efficiency (~440k-457k tps) with acceptable latency (3s-6s).
  • Optimal Interactive: Target 1-16 concurrency. This keeps TTFT under 1.2s while still processing up to 200k tokens per second.
  • Avoid: Concurrency above 512. TTFT already exceeds 10s at 512, and by 1024 aggregate throughput is about 15% below its peak.
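One way to turn these recommendations into a repeatable rule is to pick the concurrency with the best throughput among those that meet a TTFT budget. The sketch below assumes only the four 16,384-token rows from the summary table (lower concurrency levels are hidden in the truncated output, so they are not included); `pick_concurrency` is a hypothetical helper, not part of the benchmark suite.

```python
# Values copied from the 16,384-token rows of the summary table.
results = {  # concurrency -> (avg_ttft_s, batch_prefill_tps)
    128: (3.30, 439_833),
    256: (5.58, 457_101),
    512: (10.73, 443_335),
    1024: (24.68, 388_684),
}

def pick_concurrency(results, ttft_budget_s):
    """Return the highest-throughput concurrency whose avg TTFT fits the budget.

    Returns None when no measured level meets the budget.
    """
    eligible = {c: v for c, v in results.items() if v[0] <= ttft_budget_s}
    if not eligible:
        return None
    return max(eligible, key=lambda c: eligible[c][1])

for budget in (4.0, 6.0, 30.0):
    print(f"TTFT budget {budget:>4}s -> concurrency {pick_concurrency(results, budget)}")
```

Selecting on throughput rather than on the largest concurrency that fits matters here: with a generous 30s budget the rule still returns 256, because 512 and 1024 deliver fewer tokens per second.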

✦ The visualization of our extreme stress test (up to 1024 concurrency) provides a clear picture of the TPU v6e's performance boundaries:

📈 Visual Summary

  1. Avg TTFT (s) vs. Context Length
    The plot shows several distinct curves corresponding to the different concurrency levels.

    • Low-Middle Curves: For concurrencies 1–128, the lines remain flat and clustered near the bottom (sub-4s), indicating the system is well within its operational limits.
    • Top Curves: The lines for 512 and 1024 concurrency sit well above the rest, reaching about 10.7s and 24.7s respectively at 16,384 tokens. This illustrates the "queue wait" effect, where requests spend significantly more time waiting for compute resources than being processed.
  2. Prefill Throughput (tok/s) vs. Context Length
    This plot displays the aggregate system efficiency:

    • Saturation Peak: You can see the density of the curves increasing as aggregate throughput approaches roughly 450,000 tokens/second (the measured peak is 457,101).
    • Performance Inversion: At the highest concurrency (1024), the throughput curve actually dips below the 128/256 curves. This visual "crossover" confirms that the system has moved past its efficiency peak and is now incurring a penalty for managing too many simultaneous requests.
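The "queue wait" effect can be illustrated with a deliberately crude model. This sketch assumes that all requests arrive at once, that the server prefills them one after another (FIFO) at the measured aggregate rate, and that `batch prefill_tps` means total prompt tokens per second; none of these assumptions is confirmed by the benchmark output, and real vLLM scheduling is more involved. Under those assumptions, request k sees its first token after k service times, so the average TTFT is (N + 1) / 2 service times.

```python
# Values copied from the 16,384-token rows of the summary table.
CONTEXT_TOKENS = 16_384
measured = {  # concurrency -> (avg_ttft_s, batch_prefill_tps)
    128: (3.30, 439_833),
    256: (5.58, 457_101),
    512: (10.73, 443_335),
    1024: (24.68, 388_684),
}

def fifo_avg_ttft(concurrency, context_tokens, aggregate_tps):
    """Return (service time per request, modelled average TTFT) in seconds.

    Assumes strict FIFO prefill at a constant aggregate token rate.
    """
    service_s = context_tokens / aggregate_tps
    return service_s, (concurrency + 1) / 2 * service_s

for conc, (ttft_s, tps) in measured.items():
    service_s, model_s = fifo_avg_ttft(conc, CONTEXT_TOKENS, tps)
    print(f"c={conc:>4}: service {service_s * 1000:.0f} ms, "
          f"model avg TTFT {model_s:.2f}s, measured {ttft_s:.2f}s")
```

In this model a single 16,384-token prefill takes only a few tens of milliseconds, while the modelled average TTFT accounts for roughly 70 to 90 percent of each measured value. That is consistent with the reading that most of the latency at high concurrency is time spent queued rather than time spent computing.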

🏁 Final Project State
The benchmark suite is now a robust, multi-dimensional performance analysis engine:

  1. Enhanced context_benchmark.py: Supports deep sweeps of prompt length × concurrency.
  2. Integrated MCP Tool: The run_context_benchmark tool allows triggering these complex stress tests via a single agent command.
  3. Advanced Visualization: plot_context_benchmark.py handles high-density data and provides ASCII visualization for immediate analysis.
  4. Optimized Reporting: Automatically generates Markdown, CSV, and JSON outputs for every run.
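The source of context_benchmark.py is not shown in this post, so the following is only a sketch of what a prompt length × concurrency sweep can look like. The request function is injected, and `fake_first_token` is a stand-in that sleeps instead of calling a server; in a real run it would be replaced by a streaming request that returns when the first token arrives. The throughput formula (total prompt tokens divided by the time until the last first token) is an assumption about how `batch prefill_tps` is defined.

```python
import asyncio
import time
from statistics import mean

async def fake_first_token(prompt_tokens: int) -> None:
    """Hypothetical stand-in for a streaming request to the model server."""
    await asyncio.sleep(prompt_tokens * 1e-6)

async def run_cell(send, prompt_tokens, concurrency):
    """Fire `concurrency` requests at once and time each one's first token."""
    start = time.perf_counter()

    async def timed():
        await send(prompt_tokens)
        return time.perf_counter() - start

    ttfts = await asyncio.gather(*(timed() for _ in range(concurrency)))
    batch_s = max(ttfts)  # time until every request has its first token
    return {
        "context": prompt_tokens,
        "concurrency": concurrency,
        "avg_ttft_s": mean(ttfts),
        # Assumed definition: all prompt tokens over the batch window.
        "batch_prefill_tps": prompt_tokens * concurrency / batch_s,
    }

async def sweep(send, contexts, concurrencies):
    """Run every (concurrency, context) cell sequentially."""
    return [await run_cell(send, ctx, c) for c in concurrencies for ctx in contexts]

rows = asyncio.run(sweep(fake_first_token, [128, 8192, 16384], [1, 2, 4]))
for row in rows:
    print(row)
```

Running the cells one at a time keeps each measurement isolated, so a saturated cell cannot inflate the TTFT of the next one.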

The project is fully prepared for production capacity planning and further infrastructure tuning.

xbill@penguin:~/aisprintapr2026/gemma4-benchmark$ python plot_context_benchmark.py

--- Context Length vs. Avg TTFT (s) ---
[ASCII plot: avg TTFT (s), 0 to 25, against context length (tokens), 0 to 18000; one curve per concurrency level]

--- Context Length vs. Prefill Throughput (tok/s) ---
[ASCII plot: prefill throughput (tok/s), 0 to 500000, against context length (tokens), 0 to 18000; one curve per concurrency level]
