Fully Homomorphic Encryption (FHE) lets you compute on encrypted data. The arithmetic is well specified in theory. In practice, every CKKS implementation ships a slightly different rescaling strategy, and those differences don't show up until you look hard at worst-case inputs.
We built an adversarial precision tester — FHE Oracle — and pointed it at the three main CKKS libraries. The results surprised us.
## The setup
Every library runs the exact same circuit: (w·x + b)² over real inputs in [-3, 3]^8, multiplicative depth 2. Same CKKS ring dimension (16384), same coefficient-modulus chain ([60, 40, 40, 40, 40, 60]), same inputs, same threshold.
The tester is AutoOracle with a 500-evaluation budget and a noise-aware CMA-ES fitness function. It doesn't sample uniformly — it searches for the inputs most likely to break precision. 5 random seeds, threshold 1e-2.
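Concretely, the plaintext side of that comparison is tiny. Here is a minimal sketch of the reference circuit and the pass/fail check, with illustrative names (`reference_circuit`, `max_abs_error`), not the fhe-oracle API:

```python
# Plaintext reference for the benchmark circuit: (w·x + b)^2 over [-3, 3]^8.
# The FHE side computes the same thing under CKKS; the tester compares the
# decrypted result against this reference. Names are illustrative.

def reference_circuit(w, x, b):
    """Depth-2 circuit: one dot product, one addition, one square."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) ** 2

def max_abs_error(plain_out, fhe_out):
    """Worst-case absolute error between plaintext and decrypted outputs."""
    return abs(plain_out - fhe_out)

THRESHOLD = 1e-2  # a seed FAILs if the oracle drives max error above this

w = [0.5] * 8
b = 0.25
x = [3.0] * 8           # a corner of the [-3, 3]^8 input box
plain = reference_circuit(w, x, b)
print(plain)            # (0.5*3*8 + 0.25)^2 = 12.25^2 = 150.0625
```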
## The leaderboard — CKKS (real-valued)
| Rank | Library | Scheme | Fail rate | Median max-err | Wall time |
|---|---|---|---|---|---|
| 🥇 1 | OpenFHE 1.5+ | CKKS | 0% (0/5) ✅ | 1.57e-08 | 21.2 s |
| 🥈 2 | Pyfhel 3.5.0 | CKKS | 100% (5/5) | 1.30e-03 | 43.9 s |
| 🥉 3 | TenSEAL 0.3.16 (SEAL CKKS wrapper) | CKKS | 100% (5/5) | 2.20e-03 | 17.7 s |
OpenFHE is the only library that passes. Across five seeds, the oracle couldn't drive its max error above 2.7e-08. TenSEAL and Pyfhel both FAIL every seed at ~1–2e-3.
On identical math: OpenFHE is ~140,000× more precise than TenSEAL and ~80,000× more precise than Pyfhel.
The hypothesized cause: OpenFHE rescales after every ciphertext multiplication by default; TenSEAL defers rescales. Scalar smoke tests agree to six decimal places across all three libraries — the divergence only shows up at worst-case inputs that uniform random sampling almost never hits.
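One way to see that rescale placement is not a neutral choice is to track the scales by hand. The following is a toy arithmetic sketch under an idealized chain (not either library's internals):

```python
# Toy scale bookkeeping for CKKS. A ciphertext-by-ciphertext multiplication
# multiplies the scales; a rescale divides by the next ~40-bit prime in the
# modulus chain. Idealized numbers, not any library's implementation.

DELTA = 2.0 ** 40   # encoding scale
Q = 2.0 ** 40       # idealized rescaling prime (real chains use primes near 2^40)

# Eager schedule: rescale right after every multiplication.
eager = DELTA
eager = eager * eager / Q   # mult -> 2^80, rescale -> 2^40
eager = eager * eager / Q   # mult -> 2^80, rescale -> 2^40

# Deferred schedule: multiply twice, then rescale twice.
deferred = DELTA
deferred = deferred * deferred   # 2^80
deferred = deferred * deferred   # 2^160 intermediate
deferred = deferred / Q / Q      # back down to 2^80, not 2^40

print(eager == 2.0 ** 40, deferred == 2.0 ** 80)
```

The deferred schedule ends two multiplications at scale 2^80, so it needs an extra rescale (one more chain prime) to return to the encoding scale. How a wrapper manages that gap is plausibly where the implementations diverge; verifying that is exactly what the adversarial search is for.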
## Why random testing misses this
The inputs that trigger FHE precision failures sit in narrow regions of the input space, and those regions shift and shrink with multiplicative depth and the magnitude of intermediate ciphertexts. Uniform sampling from your training distribution will almost never find them.
On our patent-reference benchmark (a CKKS logistic regression with a polynomial sigmoid approximation):
- Random sampling, 500 evaluations: max error 3.5e-4
- Adversarial CMA-ES, 500 evaluations: max error 1.50
Ratio: 4,259×. Same budget, same circuit, same seed. Random testing found zero diverging inputs. The oracle found them in under a second.
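To make that concrete without any FHE library, here is a synthetic stand-in: a divergence profile whose failure spike occupies roughly 2e-9 of the input box, attacked with the same budget by uniform sampling and by a deliberately crude greedy ascent (a stand-in for CMA-ES, not the oracle's actual search):

```python
# Uniform sampling vs. directed search on a synthetic worst-case profile.
# Illustrative only: the real oracle uses CMA-ES with a noise-aware fitness.

import random

def divergence(x):
    # Baseline error grows mildly with input magnitude; a failure spike
    # appears only when every coordinate sits in a corner of [-3, 3]^8
    # (volume roughly (0.5/6)^8 ~ 2e-9 of the box).
    base = 1e-9 * sum(xi + 3.0 for xi in x)
    return 1.5 if min(x) > 2.5 else base

random.seed(0)
BUDGET = 500

# Uniform random sampling: 500 draws from the box.
rand_best = max(
    divergence([random.uniform(-3, 3) for _ in range(8)]) for _ in range(BUDGET)
)

# Greedy coordinate ascent: follow the baseline error into the corner.
x = [0.0] * 8
cur = divergence(x)
adv_best, evals = cur, 0
while evals < BUDGET:
    improved = False
    for i in range(8):
        cand = list(x)
        cand[i] = min(3.0, cand[i] + 0.5)   # push one coordinate outward
        d = divergence(cand)
        evals += 1
        if d > cur:
            x, cur, improved = cand, d, True
            adv_best = max(adv_best, d)
    if not improved:
        break

print(f"random best: {rand_best:.2e}, adversarial best: {adv_best:.2e}")
```

Within the same 500-evaluation budget, the random baseline never leaves the 1e-8 noise floor, while the directed search lands on the 1.5 spike; the real circuits behave the same way, just with a less obvious geometry.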
## The methodology note that matters
CKKS libraries get compared to BGV/BFV/TFHE libraries all the time. That comparison is broken, and we want to be clear about why.
CKKS is real-valued with bounded approximation error. Its natural circuit is (w·x + b)² on real inputs. BGV/BFV/TFHE are exact integer arithmetic. Their natural circuit is (w_int·x_int + b_int)² mod p on quantized integers. Forcing integer schemes to approximate real-valued math injects quantization error (order 1/s²) that dominates any library-level signal. Conversely, running CKKS on a modular-integer circuit wastes its precision.
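A plain-Python sketch of that quantization effect (our own encoding choice for illustration, not any library's quantizer):

```python
# The integer circuit below is exact, but rounding reals into scale-s fixed
# point injects error before any encryption happens. That rounding error is
# a property of the encoding, not of the FHE library under test.

def real_circuit(w, x, b):
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 2

def quantized_circuit(w, x, b, s):
    w_q = [round(wi * s) for wi in w]    # fixed point at scale s
    x_q = [round(xi * s) for xi in x]
    b_q = round(b * s * s)               # b enters at scale s^2
    t = sum(p * q for p, q in zip(w_q, x_q)) + b_q   # exact integer arithmetic
    return (t * t) / s ** 4              # dequantize the squared result

w, x, b = [0.3] * 8, [2.9] * 8, 0.7
exact = real_circuit(w, x, b)
errs = [abs(quantized_circuit(w, x, b, s) - exact) for s in (16, 256, 4096)]
print([f"{e:.1e}" for e in errs])   # error falls as the scale s grows
```

Growing `s` shrinks the quantization error, but `t * t` must stay below the plaintext modulus, so the scale cannot grow without bound; that tension is why the integer schemes get their own circuit and their own table.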
So we ran two leaderboards. The CKKS table above. And a separate integer/quantized table on its own circuit:
| Rank | Library | Scheme | Fail rate | Median max-err | Wall time |
|---|---|---|---|---|---|
| 🥇 1 | OpenFHE BFV 1.5+ | BFV | 0% (0/5) | 0 | 12.7 s |
| 🥈 2 | OpenFHE BGV 1.5+ | BGV | 0% (0/5) | 0 | 15.6 s |
| 🥉 3 | Concrete ML 1.9.0 | TFHE | 33% (1/3) | 0 | 0.6 s |
BGV/BFV are bit-exact by design — zero divergence on every seed. Concrete ML's 33% FAIL is a quantization-boundary crossing effect, not an algorithmic error. TFHE wins on wall-clock by 20×; the tradeoff is that it operates on small fixed-width integers while BGV/BFV handle larger ranges.
## Picking a library
- Integer / PIR / encrypted SQL → OpenFHE BFV
- Quantized ML inference → Concrete ML (TFHE)
- Real-valued CKKS (LR, neural net, SHAP) → OpenFHE if you need precision, TenSEAL if you need 1.2× speed and can tolerate ~140,000× more worst-case error
The TenSEAL-vs-OpenFHE tradeoff is real, not a bug. If you're doing CKKS smoke tests and shipping to production at scale, that precision gap will bite you eventually. If you're prototyping on benign data, TenSEAL's ergonomics and speed are fine. Pick deliberately.
## Reproduce
```bash
# CKKS unified (w·x+b)^2 benchmark (any Python venv with TenSEAL)
pip install fhe-oracle tenseal
python benchmarks/library_comparison.py --circuit unified-squared-dot --libs tenseal

# Full leaderboard (OpenFHE BGV/BFV/CKKS) via Linux/amd64 Docker
docker build --platform linux/amd64 -t fhe-oracle-bench .
docker run --rm --platform linux/amd64 fhe-oracle-bench \
  python benchmarks/library_comparison.py --circuit unified-squared-dot
```
Full code, adapters, and raw results: github.com/BAder82t/fhe-oracle-oss (AGPL-3.0), or `pip install fhe-oracle`.
## What FHE Oracle does, in one paragraph
It tests black-box semantic divergence between a plaintext function and its FHE-compiled counterpart using CMA-ES search with a noise-aware fitness (divergence + ciphertext noise-budget consumption + multiplicative-depth utilization). Adapters for OpenFHE, Microsoft SEAL, Concrete ML, and TenSEAL turn on noise-guided mode; a pure-divergence fallback runs without any native FHE library for CI. It does not verify the cryptographic security of the FHE scheme, test for side-channel leaks, or replace formal verification. A PASS means the adversarial search didn't find a divergence in its budget, not that none exists.
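The adapter surface can be pictured as a small protocol. This is a hypothetical sketch of that shape, with invented names (`FHEAdapter`, `PlaintextFallback`), not the actual fhe-oracle API:

```python
# Hypothetical adapter protocol: one library per adapter, behind a uniform
# evaluate surface plus an optional noise probe. The search loop only ever
# sees this interface.

from typing import Optional, Protocol, Sequence

class FHEAdapter(Protocol):
    def evaluate(self, inputs: Sequence[float]) -> Sequence[float]:
        """Encrypt inputs, run the compiled circuit, decrypt the result."""
        ...

    def noise_budget(self) -> Optional[float]:
        """Remaining noise budget after the last evaluate; None if the
        library exposes no estimator (pure-divergence fallback)."""
        ...

class PlaintextFallback:
    """Degenerate adapter: runs the plaintext function itself, so the
    search loop can be exercised in CI with no FHE library installed."""

    def __init__(self, fn):
        self.fn = fn

    def evaluate(self, inputs):
        return self.fn(inputs)

    def noise_budget(self):
        return None

oracle_side = PlaintextFallback(lambda v: [(sum(v) + 0.5) ** 2])
print(oracle_side.evaluate([1.0, 2.0]))   # [12.25]
```

Adding a library means implementing this kind of surface once; when the adapter also reports a noise budget, the fitness function folds it in and the search becomes noise-guided.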
## Caveats worth naming
- Pyfhel's middle-of-the-pack result likely reflects its API wrapper, not the underlying SEAL. Pyfhel requires manual `rescale_to_next`/`mod_switch_to_next` management per multiplication. On this depth-2 circuit it works; on Taylor-3 polynomials it fails with `scale out of bounds`.
- Lattigo (Go) would be interesting to add — it ships a built-in noise estimator — but it needs a subprocess wrapper to call from Python. Tracked as future work.
- Concrete ML's 33% rate on the integer board is quantization-boundary behavior, not an algorithmic bug. We note this in the table to avoid misreading the number.
## Why we built this
CKKS circuits that pass on 99,999 random inputs can return garbage on the 100,000th. If you're deploying FHE ML to production (credit scoring, medical inference, privacy-preserving recommendation), random precision testing isn't enough — you need an adversary searching for the inputs that break you. That's what the oracle does.
PRs welcome. Add a library by writing one adapter; leaderboard updates on re-run.
Bader Alissaei — VaultBytes. A patent application on the method (PCT/IB2026/053378) has been filed; the code is AGPL-3.0. Commercial licences via b@vaultbytes.com.