NVIDIA cuQuantum has a strong reputation as the natural high-performance baseline for GPU quantum simulation. That reputation is understandable: cuQuantum contains serious low-level GPU libraries such as cuStateVec and cuTensorNet, and NVIDIA is the company that builds both the GPUs and CUDA.
But in an end-to-end differentiable VQE workload, the result is more nuanced. On our H200 GPU benchmark, TensorCircuit-NG was substantially faster after compilation, while also offering a much higher-level and user-friendly programming model.
The short version:
- cuQuantum is a powerful low-level library.
- It is not automatically the fastest route for practical quantum simulation tasks.
- Direct cuQuantum code is significantly more verbose and engineering-heavy.
- TensorCircuit-NG pays a JAX compilation cost, but repeated value-and-gradient evaluations quickly amortize that cost.
- After compilation, TensorCircuit-NG's per-step runtime is much shorter than that of direct cuQuantum code.
Benchmark setup
We used the workload defined in the benchmark script for a 1D TFIM (transverse-field Ising model) VQE task.
Hardware and software:
- GPU: NVIDIA H200
- TensorCircuit-NG: 1.6.0
- JAX: 0.7.2
- cuQuantum Python: 26.3.2
- CuPy: 14.1.1
- PyTorch: 2.11.0+cu128
We measured one warmup/compile call and then the mean of five later value-and-gradient calls.
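The measurement protocol above can be sketched as a small timing helper. This is a minimal illustration, not the benchmark script itself: the names `benchmark`, `repeats`, and `sync` are hypothetical, and the `sync` hook is an assumption about how one would wait for asynchronous GPU work to finish before stopping the clock.

```python
import time
from statistics import mean


def benchmark(fn, *args, repeats=5, sync=None):
    """Return (first_call_seconds, mean_later_seconds) for fn(*args).

    The first call includes any JIT compilation or warmup cost.
    `sync` is an optional callable that receives the output and blocks
    until the device has actually finished computing it.
    """

    def timed_call():
        start = time.perf_counter()
        out = fn(*args)
        if sync is not None:
            sync(out)  # wait for asynchronous device work before stopping the clock
        return time.perf_counter() - start

    first = timed_call()                            # warmup/compile call
    later = [timed_call() for _ in range(repeats)]  # post-warmup calls
    return first, mean(later)
```

With JAX, one would probably pass `sync=jax.block_until_ready`; with PyTorch, something like `sync=lambda _: torch.cuda.synchronize()`. Without such a hook, asynchronous dispatch can make GPU calls look faster than they are.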
Implementations compared
We tested two TensorCircuit-NG modes:
- TC-JAX scan: uses scan over VQE layers to reduce JAX compilation/staging time.
- TC-JAX unrolled: builds all layers directly. This produces a larger traced program, but can be faster after compilation.
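The difference between the two modes can be illustrated in plain JAX. This is a toy sketch, not the TensorCircuit-NG implementation: the `layer` function is a hypothetical stand-in for one block of gates, and the shapes are arbitrary. The unrolled version traces the layer once per iteration of the Python loop, while the scan version traces it once and lets `jax.lax.scan` repeat it.

```python
import jax
import jax.numpy as jnp


def layer(x, w):
    # Toy stand-in for one VQE layer (hypothetical, not a real gate block).
    return jnp.tanh(x * w + 0.1)


def loss_unrolled(weights, x0):
    x = x0
    for i in range(weights.shape[0]):  # Python loop: every layer is traced separately
        x = layer(x, weights[i])
    return jnp.sum(x ** 2)


def loss_scan(weights, x0):
    def step(x, w):
        return layer(x, w), None       # (new carry, per-step output)

    x, _ = jax.lax.scan(step, x0, weights)  # layer body is traced once
    return jnp.sum(x ** 2)


depth, width = 8, 4
weights = jnp.linspace(0.5, 1.5, depth * width).reshape(depth, width)
x0 = jnp.ones(width)

vg_unrolled = jax.jit(jax.value_and_grad(loss_unrolled))
vg_scan = jax.jit(jax.value_and_grad(loss_scan))

value_u, grad_u = vg_unrolled(weights, x0)
value_s, grad_s = vg_scan(weights, x0)
```

Both functions compute the same value and gradient. What changes is the size of the traced program, and therefore how compile time is traded against post-compilation speed.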
We also tested two direct cuQuantum routes:
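- cuStateVec adjoint: applies gates with cuStateVec and computes the full gradient with adjoint differentiation. Adjoint differentiation obtains all gradient components from one forward and one backward sweep, whereas parameter shift needs two extra circuit evaluations per parameter, so this is a fair comparison against automatic differentiation.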
- cuTensorNet full-state autograd: contracts the full state with cuTensorNet, then computes the TFIM state-vector expectation on GPU with PyTorch autograd.
The cuTensorNet path is intentionally not the obviously bad version where every Pauli term gets a separate tensor-network path search. We first tried that more "TN-native" observable-contraction style, but for this workload it spent too much time in repeated graph/path overhead. The final version is closer to the state-vector expectation workflow used by the TensorCircuit-NG and MindQuantum benchmark.
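To make the distinction between adjoint differentiation and parameter shift concrete, here is a toy NumPy sketch on a single qubit. It is not the cuStateVec code used in the benchmark: the circuit (RX followed by RY, measuring Z) and all function names are assumptions chosen for illustration. Parameter shift re-runs the circuit twice per parameter, while the adjoint method walks backward through the gates once.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Toy circuit (assumed for illustration): RX(thetas[0]) then RY(thetas[1]).
GENERATORS = [X, Y]


def rot(gen, theta):
    # exp(-i * theta / 2 * gen) for a Pauli generator.
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * gen


def final_state(thetas):
    psi = np.array([1, 0], dtype=complex)
    for gen, theta in zip(GENERATORS, thetas):
        psi = rot(gen, theta) @ psi
    return psi


def energy(thetas):
    psi = final_state(thetas)
    return float(np.real(np.vdot(psi, Z @ psi)))


def parameter_shift_grad(thetas):
    # Two full circuit evaluations per parameter.
    grad = np.zeros(len(thetas))
    for k in range(len(thetas)):
        shift = np.zeros(len(thetas))
        shift[k] = np.pi / 2
        grad[k] = 0.5 * (energy(thetas + shift) - energy(thetas - shift))
    return grad


def adjoint_grad(thetas):
    # One forward pass, then one backward sweep over the gates.
    psi = final_state(thetas)
    lam = Z @ psi
    grad = np.zeros(len(thetas))
    for k in reversed(range(len(thetas))):
        u = rot(GENERATORS[k], thetas[k])
        psi = u.conj().T @ psi                      # state just before gate k
        d_psi = (-0.5j * GENERATORS[k] @ u) @ psi   # d(gate k)/d(theta_k) applied to it
        grad[k] = 2 * np.real(np.vdot(lam, d_psi))
        lam = u.conj().T @ lam                      # pull the observable side back too
    return grad


thetas = np.array([0.3, 1.1])
```

For this circuit the energy is cos(θ0)·cos(θ1), so both gradients can be checked against the analytic result. The two methods agree; the difference lies in how the cost scales with the number of parameters.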
Repeated value-and-gradient runtime
The table below reports the post-warmup runtime. This is the relevant metric for VQE-style optimization, where the same circuit structure is evaluated many times.
| backend | 14 qubits | 20 qubits | 24 qubits |
|---|---|---|---|
| TC-JAX scan | 0.01201s | 0.01616s | 0.06374s |
| TC-JAX unrolled | 0.00995s | 0.01381s | 0.02547s |
| cuStateVec adjoint | 0.08036s | 0.12061s | 0.30142s |
| cuTensorNet full-state autograd | 1.35677s | 2.04291s | 2.30414s |
In repeated value-and-gradient calls, TensorCircuit-NG is faster than cuStateVec:
| qubits | TC-JAX scan vs cuStateVec | TC-JAX unrolled vs cuStateVec |
|---|---|---|
| 14 | 6.69x | 8.08x |
| 20 | 7.46x | 8.73x |
| 24 | 4.73x | 11.83x |
The gap is much larger against the cuTensorNet route for this particular state-vector expectation plus autograd workflow:
| qubits | TC-JAX scan vs cuTensorNet | TC-JAX unrolled vs cuTensorNet |
|---|---|---|
| 14 | 112.97x | 136.36x |
| 20 | 126.42x | 147.93x |
| 24 | 36.15x | 90.46x |
These numbers are the main point: cuQuantum is not a magic speed button. A library being close to CUDA, or being written by a GPU vendor, does not automatically make it the fastest end-to-end implementation for a differentiable quantum algorithm.
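The speedup factors in the two comparison tables are plain ratios of the post-warmup runtimes. The short script below recomputes them from the runtime table; the dictionary layout and the `speedup` helper are just one hypothetical way to organize the numbers.

```python
# Post-warmup runtimes in seconds, copied from the runtime table above.
runtimes = {
    "TC-JAX scan": {14: 0.01201, 20: 0.01616, 24: 0.06374},
    "TC-JAX unrolled": {14: 0.00995, 20: 0.01381, 24: 0.02547},
    "cuStateVec adjoint": {14: 0.08036, 20: 0.12061, 24: 0.30142},
    "cuTensorNet full-state autograd": {14: 1.35677, 20: 2.04291, 24: 2.30414},
}


def speedup(baseline, candidate, qubits):
    """How many times faster `candidate` is than `baseline` at a given size."""
    return runtimes[baseline][qubits] / runtimes[candidate][qubits]


for baseline in ("cuStateVec adjoint", "cuTensorNet full-state autograd"):
    for qubits in (14, 20, 24):
        scan = speedup(baseline, "TC-JAX scan", qubits)
        unrolled = speedup(baseline, "TC-JAX unrolled", qubits)
        print(f"{baseline:32s} {qubits:2d} qubits: "
              f"scan {scan:.2f}x, unrolled {unrolled:.2f}x")
```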
First-call cost and amortization
cuQuantum has much lower first-call overhead. This is expected: TensorCircuit-NG uses JAX JIT compilation, and that first call can be expensive.
So if the task is a single one-off circuit evaluation, cuQuantum's low startup cost is attractive. But VQE is usually not a one-off workload. It repeatedly evaluates the same circuit structure for many optimizer steps and often across multiple random initializations. In that regime, TensorCircuit-NG's first-call cost is easily amortized, and the much faster post-compilation runtime becomes the dominant factor.
There is also a useful TensorCircuit-NG tradeoff:
- Use scan mode when compilation time matters.
- Use unrolled mode when the same circuit will be evaluated many times and peak post-compilation throughput matters.
At 24 qubits, unrolled TensorCircuit-NG is about 2.50x faster than scan mode after compilation, but the first call is about 9x heavier.
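The amortization argument comes down to a break-even count: how many post-warmup calls it takes before a slower first call is paid back by a faster per-call time. The sketch below works this out. The per-call times are the 24-qubit values from the runtime table, but the first-call times are HYPOTHETICAL placeholders (chosen only to respect the roughly 9x ratio mentioned above), and `break_even_calls` is an illustrative helper, not part of any library.

```python
import math


def break_even_calls(first_a, step_a, first_b, step_b):
    """Smallest N such that first_b + N * step_b < first_a + N * step_a.

    Returns 0 if option B is already ahead before any repeated call,
    and None if B never catches up.
    """
    extra_first = first_b - first_a   # extra one-time cost paid by B
    saved_per_call = step_a - step_b  # time B saves on every later call
    if extra_first < 0:
        return 0
    if saved_per_call <= 0:
        return None
    return math.floor(extra_first / saved_per_call) + 1


# Per-call times at 24 qubits (from the runtime table).
scan_step, unrolled_step = 0.06374, 0.02547

# HYPOTHETICAL first-call times in seconds; substitute measured values.
scan_first, unrolled_first = 10.0, 90.0

n_calls = break_even_calls(scan_first, scan_step, unrolled_first, unrolled_step)
print(f"unrolled overtakes scan after {n_calls} value-and-gradient calls")
```

The same helper applies to the comparison against cuQuantum: plug in the measured first-call and per-call times of both routes to see how many optimizer steps a run needs before the JIT cost is recovered.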
Programming model
Performance is only half of the story. The programming model matters.
In TensorCircuit-NG, the benchmark is expressed as circuit code:
    c = tc.Circuit(n)
    c.h(range(n))
    for layer in range(depth):
        for i in range(n - 1):
            c.rzz(i, i + 1, theta=params[layer, 0, i])
        for i in range(n):
            c.rx(i, theta=params[layer, 1, i])
    value_and_grad = tc.backend.jit(tc.backend.value_and_grad(energy_fn))
With direct cuQuantum, the user has to manually manage much lower-level details:
- gate matrices and their dtype conventions
- state-vector memory
- cuStateVec binding signatures
- tensor-network modes
- PyTorch operands for autograd
- GPU synchronization
- version-specific API behavior
cuQuantum is valuable, but it is closer to a low-level engine than a high-level quantum algorithm framework. For a researcher, that difference translates directly into more code to write, debug, and maintain.
Takeaway
This benchmark does not prove that cuQuantum is slow for every task. What this benchmark does show is narrower and more practical:
For this VQE workload, direct cuQuantum was not the fastest end-to-end route. TensorCircuit-NG provided a much simpler programming interface and substantially faster repeated value-and-gradient evaluations after JAX compilation.
The common assumption that "NVIDIA controls CUDA, therefore cuQuantum must be the fastest implementation" is too simplistic. Raw GPU kernels matter, but so do JIT compilation, autodiff integration, graph-level optimization, and the abstraction level exposed to users.
TensorCircuit-NG's advantage is that it lets users write concise quantum-program code while still compiling to high-performance backend-native tensor programs. For repeated VQE-style workloads, that combination can beat direct cuQuantum both in usability and in runtime.