TensorCircuit-NG vs cuQuantum on H200: JIT compilation beats the "magic GPU library" assumption

#python #gpu #cuda

NVIDIA cuQuantum has a strong reputation as the natural high-performance baseline for GPU quantum simulation. That reputation is understandable: cuQuantum contains serious low-level GPU libraries such as cuStateVec and cuTensorNet, and NVIDIA is the company that builds both the GPUs and CUDA.

But in an end-to-end differentiable VQE workload, the result is more nuanced. On our H200 GPU benchmark, TensorCircuit-NG was roughly 5x to 12x faster than cuStateVec in repeated value-and-gradient calls after compilation, while also offering a much higher-level and more user-friendly programming model.

The short version:

cuQuantum is a powerful low-level library.
It is not automatically the fastest route for practical quantum simulation tasks.
Direct cuQuantum code is significantly more verbose and engineering-heavy.
TensorCircuit-NG pays a JAX compilation cost, but repeated value-and-gradient evaluations quickly amortize that cost.
The post-compilation running time of TensorCircuit-NG is much shorter than that of NVIDIA cuQuantum.

Benchmark setup

We used the 1D TFIM VQE workload from the benchmark script.

Hardware and software:

GPU: NVIDIA H200
TensorCircuit-NG: 1.6.0
JAX: 0.7.2
cuQuantum Python: 26.3.2
CuPy: 14.1.1
PyTorch: 2.11.0+cu128

We measured one warmup/compile call and then the mean of five later value-and-gradient calls.
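The measurement protocol above can be sketched as a small timing harness. This is a minimal illustration, not the benchmark script itself: the names `time_value_and_grad`, `sync`, and `dummy` are hypothetical. The `sync` hook stands for whatever blocks until GPU work has finished (with JAX, something like `jax.block_until_ready` would be a natural choice), since asynchronous dispatch otherwise makes GPU timings look too good.

```python
import time
from statistics import mean


def time_value_and_grad(fn, args, repeats=5, sync=lambda out: out):
    """Return (first_call_seconds, mean_later_seconds) for fn(*args).

    `sync` is a hypothetical hook that should block until the result is
    actually computed on the device before the clock is stopped.
    """
    # First call: includes warmup and, for JIT backends, compilation.
    start = time.perf_counter()
    sync(fn(*args))
    first_call = time.perf_counter() - start

    # Later calls: the steady-state cost that an optimizer loop pays.
    later = []
    for _ in range(repeats):
        start = time.perf_counter()
        sync(fn(*args))
        later.append(time.perf_counter() - start)
    return first_call, mean(later)


# Stand-in workload so the sketch runs without a GPU.
calls = []


def dummy(x):
    calls.append(x)
    return x * x


first, steady = time_value_and_grad(dummy, (3.0,))
print(f"first call: {first:.6f}s, mean of later calls: {steady:.6f}s")
```

Keeping the first call separate from the later ones is what lets the compile cost and the steady-state cost be reported independently.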

Implementations compared

We tested two TensorCircuit-NG modes:

TC-JAX scan: uses scan over VQE layers to reduce JAX compilation/staging time.
TC-JAX unrolled: builds all layers directly. This produces a larger traced program, but can be faster after compilation.
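The difference between the two modes can be illustrated in plain JAX. The sketch below is not TensorCircuit-NG's implementation: `layer` is a toy stand-in for one VQE layer, and `loss_unrolled` and `loss_scan` are hypothetical names. It only shows that a Python loop is traced once per layer, while `jax.lax.scan` traces the layer body once, so the traced program is smaller.

```python
import jax
import jax.numpy as jnp


def layer(state, theta):
    # Toy stand-in for one parameterized layer; not a real quantum gate.
    return jnp.cos(theta) * state + jnp.sin(theta) * jnp.roll(state, 1)


def loss_unrolled(thetas, state):
    # Python loop: each layer is traced separately, so the traced
    # program grows with the number of layers.
    for i in range(thetas.shape[0]):
        state = layer(state, thetas[i])
    return jnp.sum(state ** 2)


def loss_scan(thetas, state):
    # lax.scan traces the body once and iterates over the leading
    # axis of thetas inside a single scan primitive.
    def body(carry, theta):
        return layer(carry, theta), None

    state, _ = jax.lax.scan(body, state, thetas)
    return jnp.sum(state ** 2)


depth = 8
thetas = jnp.linspace(0.1, 0.8, depth)
state = jnp.arange(4.0)

# Count top-level equations in each traced program.
n_eqns_unrolled = len(jax.make_jaxpr(loss_unrolled)(thetas, state).jaxpr.eqns)
n_eqns_scan = len(jax.make_jaxpr(loss_scan)(thetas, state).jaxpr.eqns)
print("top-level equations:", n_eqns_unrolled, "unrolled vs", n_eqns_scan, "scan")

# Both variants compute the same value and gradient.
val_u, grad_u = jax.jit(jax.value_and_grad(loss_unrolled))(thetas, state)
val_s, grad_s = jax.jit(jax.value_and_grad(loss_scan))(thetas, state)
```

A smaller traced program generally compiles faster, while the unrolled program gives the compiler more room to optimize across layers, which matches the tradeoff described for the two modes.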

We also tested two direct cuQuantum routes:

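cuStateVec adjoint: applies gates with cuStateVec and computes the full gradient with adjoint differentiation. It does not use the parameter-shift rule, which would need two extra circuit evaluations per parameter, so the comparison with TensorCircuit-NG's reverse-mode autodiff is fair.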
cuTensorNet full-state autograd: contracts the full state with cuTensorNet, then computes the TFIM state-vector expectation on GPU with PyTorch autograd.

The cuTensorNet path is intentionally not the obviously bad version where every Pauli term gets a separate tensor-network path search. We first tried that more "TN-native" observable-contraction style, but for this workload it spent too much time in repeated graph/path overhead. The final version is closer to the state-vector expectation workflow used by the TensorCircuit-NG and MindQuantum benchmark.

Repeated value-and-gradient runtime

The table below reports the post-warmup runtime. This is the relevant metric for VQE-style optimization, where the same circuit structure is evaluated many times.

backend	14 qubits	20 qubits	24 qubits
TC-JAX scan	0.01201s	0.01616s	0.06374s
TC-JAX unrolled	0.00995s	0.01381s	0.02547s
cuStateVec adjoint	0.08036s	0.12061s	0.30142s
cuTensorNet full-state autograd	1.35677s	2.04291s	2.30414s

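In repeated value-and-gradient calls, TensorCircuit-NG is faster than cuStateVec at every size tested. Each speedup factor below is the cuStateVec runtime divided by the TensorCircuit-NG runtime: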

qubits	TC-JAX scan vs cuStateVec	TC-JAX unrolled vs cuStateVec
14	6.69x	8.08x
20	7.46x	8.73x
24	4.73x	11.83x

The gap is much larger against the cuTensorNet route for this particular state-vector expectation plus autograd workflow:

qubits	TC-JAX scan vs cuTensorNet	TC-JAX unrolled vs cuTensorNet
14	112.97x	136.36x
20	126.42x	147.93x
24	36.15x	90.46x
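The speedup factors in the two tables above follow directly from the post-warmup runtime table: each one is the cuQuantum runtime divided by the TensorCircuit-NG runtime at the same qubit count. A short sketch that recomputes them, with `runtimes` and `speedup` as illustrative names:

```python
# Post-warmup value-and-gradient runtimes in seconds, copied from the table.
runtimes = {
    "TC-JAX scan": {14: 0.01201, 20: 0.01616, 24: 0.06374},
    "TC-JAX unrolled": {14: 0.00995, 20: 0.01381, 24: 0.02547},
    "cuStateVec adjoint": {14: 0.08036, 20: 0.12061, 24: 0.30142},
    "cuTensorNet full-state autograd": {14: 1.35677, 20: 2.04291, 24: 2.30414},
}


def speedup(baseline, candidate, qubits):
    """How many times faster `candidate` is than `baseline` at this size."""
    return runtimes[baseline][qubits] / runtimes[candidate][qubits]


for baseline in ("cuStateVec adjoint", "cuTensorNet full-state autograd"):
    for qubits in (14, 20, 24):
        scan = speedup(baseline, "TC-JAX scan", qubits)
        unrolled = speedup(baseline, "TC-JAX unrolled", qubits)
        print(f"{baseline}, {qubits} qubits: scan {scan:.2f}x, unrolled {unrolled:.2f}x")
```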

These numbers are the main point: cuQuantum is not a magic speed button. A library being close to CUDA, or being written by a GPU vendor, does not automatically make it the fastest end-to-end implementation for a differentiable quantum algorithm.

First-call cost and amortization

cuQuantum has much lower first-call overhead. This is expected: TensorCircuit-NG uses JAX JIT compilation, and that first call can be expensive.

So if the task is a single one-off circuit evaluation, cuQuantum's low startup cost is attractive. But VQE is usually not a one-off workload. It repeatedly evaluates the same circuit structure for many optimizer steps and often across multiple random initializations. In that regime, TensorCircuit-NG's first-call cost is easily amortized, and the much faster post-compilation runtime becomes the dominant factor.
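The amortization argument can be made concrete with a break-even calculation. The sketch below is an illustration only. The per-call times are the measured 24-qubit values for TC-JAX unrolled and cuStateVec adjoint, but the two first-call times are hypothetical placeholders, not measurements, and `break_even_calls` is a made-up helper name.

```python
import math


def break_even_calls(first_a, step_a, first_b, step_b):
    """Smallest number of post-warmup calls n at which backend A is ahead.

    Total time is modeled as first + n * step for each backend.
    Returns None if A never catches up (its steps are not faster).
    """
    if step_a >= step_b:
        return None
    extra_startup = first_a - first_b
    if extra_startup <= 0:
        return 0
    # Need n * (step_b - step_a) > extra_startup, with n an integer.
    return math.floor(extra_startup / (step_b - step_a)) + 1


# HYPOTHETICAL first-call times in seconds (placeholders, not measured).
hypothetical_first_call_jax = 30.0
hypothetical_first_call_custatevec = 1.0

# Measured post-warmup times at 24 qubits, from the runtime table.
step_jax_unrolled = 0.02547
step_custatevec = 0.30142

n = break_even_calls(
    hypothetical_first_call_jax,
    step_jax_unrolled,
    hypothetical_first_call_custatevec,
    step_custatevec,
)
print(f"JIT backend is ahead after {n} value-and-gradient calls")
```

Under these placeholder startup costs, the break-even point is on the order of a hundred calls, which a single VQE optimization run typically exceeds.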

There is also a useful TensorCircuit-NG tradeoff:

Use scan mode when compilation time matters.
Use unrolled mode when the same circuit will be evaluated many times and peak post-compilation throughput matters.

At 24 qubits, unrolled TensorCircuit-NG is about 2.50x faster than scan mode after compilation, but its first call is about 9x slower.

Programming model

Performance is only half of the story. The programming model matters.

In TensorCircuit-NG, the benchmark is expressed as circuit code:

c = tc.Circuit(n)
c.h(range(n))
for layer in range(depth):
    for i in range(n - 1):
        c.rzz(i, i + 1, theta=params[layer, 0, i])
    for i in range(n):
        c.rx(i, theta=params[layer, 1, i])

value_and_grad = tc.backend.jit(tc.backend.value_and_grad(energy_fn))

With direct cuQuantum, the user has to manually manage much lower-level details:

gate matrices and their dtype conventions
state-vector memory
cuStateVec binding signatures
tensor-network modes
PyTorch operands for autograd
GPU synchronization
version-specific API behavior

cuQuantum is valuable, but it is closer to a low-level engine than a high-level quantum algorithm framework. For a researcher, that difference is very real.

Takeaway

This benchmark does not prove that cuQuantum is slow for every task. What this benchmark does show is narrower and more practical:

For this VQE workload, direct cuQuantum was not the fastest end-to-end route. TensorCircuit-NG provided a much simpler programming interface and substantially faster repeated value-and-gradient evaluations after JAX compilation.

The common assumption that "NVIDIA controls CUDA, therefore cuQuantum must be the fastest implementation" is too simplistic. Raw GPU kernels matter, but so do JIT compilation, autodiff integration, graph-level optimization, and the abstraction level exposed to users.

TensorCircuit-NG's advantage is that it lets users write concise quantum-program code while still compiling to high-performance backend-native tensor programs. For repeated VQE-style workloads, that combination can beat direct cuQuantum both in usability and in runtime.