Fully Homomorphic Encryption (FHE) lets you compute on encrypted data. The arithmetic is well specified in theory. In practice, every CKKS implementation ships a slightly different rescaling strategy, and those differences don't show up until you look hard at worst-case inputs.
We built an adversarial precision tester — FHE Oracle — and pointed it at the three main CKKS libraries. The results surprised us.
## The setup
Every library runs the exact same circuit: (w·x + b)² over real inputs in [-3, 3]^8, multiplicative depth 2. Same CKKS ring dimension (16384), same coefficient-modulus chain ([60, 40, 40, 40, 40, 60]), same inputs, same threshold.
The tester is AutoOracle with a 500-evaluation budget and a noise-aware CMA-ES fitness function. It doesn't sample uniformly — it searches for the inputs most likely to break precision. 5 random seeds, threshold 1e-2.
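Concretely, the plaintext side of that comparison is tiny. Here is a minimal sketch of the reference circuit and the pass/fail check, with illustrative names (`reference_circuit`, `max_abs_error`), not the fhe-oracle API:

```python
# Plaintext reference for the benchmark circuit: (w·x + b)^2 over [-3, 3]^8.
# The FHE side computes the same thing under CKKS; the tester compares the
# decrypted result against this reference. Names are illustrative.

def reference_circuit(w, x, b):
    """Depth-2 circuit: one dot product, one addition, one square."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    return (dot + b) ** 2

def max_abs_error(plain_out, fhe_out):
    """Worst-case absolute error between plaintext and decrypted outputs."""
    return abs(plain_out - fhe_out)

THRESHOLD = 1e-2  # a seed FAILs if the oracle drives max error above this

w = [0.5] * 8
b = 0.25
x = [3.0] * 8           # a corner of the [-3, 3]^8 input box
plain = reference_circuit(w, x, b)
print(plain)            # (0.5*3*8 + 0.25)^2 = 12.25^2 = 150.0625
```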
## The leaderboard — CKKS (real-valued)
| Rank | Library | Scheme | Fail rate | Median max-err | Wall time |
|---|---|---|---|---|---|
| 🥇 1 | OpenFHE 1.5+ | CKKS | 0% (0/5) ✅ | 1.57e-08 | 21.2 s |
| 🥈 2 | Pyfhel 3.5.0 | CKKS | 100% (5/5) | 1.30e-03 | 43.9 s |
| 🥉 3 | TenSEAL 0.3.16 (SEAL CKKS wrapper) | CKKS | 100% (5/5) | 2.20e-03 | 17.7 s |
OpenFHE is the only library that passes. Across five seeds, the oracle couldn't drive its max error above 2.7e-08. TenSEAL and Pyfhel both FAIL every seed at ~1–2e-3.
On identical math: OpenFHE is ~140,000× more precise than TenSEAL and ~80,000× more precise than Pyfhel.
The hypothesized cause: OpenFHE rescales after every ciphertext multiplication by default; TenSEAL defers rescales. Scalar smoke tests agree to six decimal places across all three libraries — the divergence only shows up at worst-case inputs that uniform random sampling almost never hits.
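One way to see that rescale placement is not a neutral choice is to track the scales by hand. The following is a toy arithmetic sketch under an idealized chain (not either library's internals):

```python
# Toy scale bookkeeping for CKKS. A ciphertext-by-ciphertext multiplication
# multiplies the scales; a rescale divides by the next ~40-bit prime in the
# modulus chain. Idealized numbers, not any library's implementation.

DELTA = 2.0 ** 40   # encoding scale
Q = 2.0 ** 40       # idealized rescaling prime (real chains use primes near 2^40)

# Eager schedule: rescale right after every multiplication.
eager = DELTA
eager = eager * eager / Q   # mult -> 2^80, rescale -> 2^40
eager = eager * eager / Q   # mult -> 2^80, rescale -> 2^40

# Deferred schedule: multiply twice, then rescale twice.
deferred = DELTA
deferred = deferred * deferred   # 2^80
deferred = deferred * deferred   # 2^160 intermediate
deferred = deferred / Q / Q      # back down to 2^80, not 2^40

print(eager == 2.0 ** 40, deferred == 2.0 ** 80)
```

The deferred schedule ends two multiplications at scale 2^80, so it needs an extra rescale (one more chain prime) to return to the encoding scale. How a wrapper manages that gap is plausibly where the implementations diverge; verifying that is exactly what the adversarial search is for.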
## Why random testing misses this
The inputs that trigger FHE precision failures sit in narrow regions of the input space, and those regions shift and shrink with multiplicative depth and the magnitude of intermediate ciphertexts. Uniform sampling from your training distribution will almost never find them.
On our patent-reference benchmark (a CKKS logistic regression with a polynomial sigmoid approximation):
- Random sampling, 500 evaluations: max error 3.5e-4
- Adversarial CMA-ES, 500 evaluations: max error 1.50
Ratio: 4,259×. Same budget, same circuit, same seed. Random testing found zero diverging inputs. The oracle found them in under a second.
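To make that concrete without any FHE library, here is a synthetic stand-in: a divergence profile whose failure spike occupies roughly 2e-9 of the input box, attacked with the same budget by uniform sampling and by a deliberately crude greedy ascent (a stand-in for CMA-ES, not the oracle's actual search):

```python
# Uniform sampling vs. directed search on a synthetic worst-case profile.
# Illustrative only: the real oracle uses CMA-ES with a noise-aware fitness.

import random

def divergence(x):
    # Baseline error grows mildly with input magnitude; a failure spike
    # appears only when every coordinate sits in a corner of [-3, 3]^8
    # (volume roughly (0.5/6)^8 ~ 2e-9 of the box).
    base = 1e-9 * sum(xi + 3.0 for xi in x)
    return 1.5 if min(x) > 2.5 else base

random.seed(0)
BUDGET = 500

# Uniform random sampling: 500 draws from the box.
rand_best = max(
    divergence([random.uniform(-3, 3) for _ in range(8)]) for _ in range(BUDGET)
)

# Greedy coordinate ascent: follow the baseline error into the corner.
x = [0.0] * 8
cur = divergence(x)
adv_best, evals = cur, 0
while evals < BUDGET:
    improved = False
    for i in range(8):
        cand = list(x)
        cand[i] = min(3.0, cand[i] + 0.5)   # push one coordinate outward
        d = divergence(cand)
        evals += 1
        if d > cur:
            x, cur, improved = cand, d, True
            adv_best = max(adv_best, d)
    if not improved:
        break

print(f"random best: {rand_best:.2e}, adversarial best: {adv_best:.2e}")
```

Within the same 500-evaluation budget, the random baseline never leaves the 1e-8 noise floor, while the directed search lands on the 1.5 spike; the real circuits behave the same way, just with a less obvious geometry.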
## The methodology note that matters
CKKS libraries get compared to BGV/BFV/TFHE libraries all the time. That comparison is broken, and we want to be clear about why.
CKKS is real-valued with bounded approximation error. Its natural circuit is (w·x + b)² on real inputs. BGV/BFV/TFHE are exact integer arithmetic. Their natural circuit is (w_int·x_int + b_int)² mod p on quantized integers. Forcing integer schemes to approximate real-valued math injects quantization error (order 1/s²) that dominates any library-level signal. Conversely, running CKKS on a modular-integer circuit wastes its precision.
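A plain-Python sketch of that quantization effect (our own encoding choice for illustration, not any library's quantizer):

```python
# The integer circuit below is exact, but rounding reals into scale-s fixed
# point injects error before any encryption happens. That rounding error is
# a property of the encoding, not of the FHE library under test.

def real_circuit(w, x, b):
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 2

def quantized_circuit(w, x, b, s):
    w_q = [round(wi * s) for wi in w]    # fixed point at scale s
    x_q = [round(xi * s) for xi in x]
    b_q = round(b * s * s)               # b enters at scale s^2
    t = sum(p * q for p, q in zip(w_q, x_q)) + b_q   # exact integer arithmetic
    return (t * t) / s ** 4              # dequantize the squared result

w, x, b = [0.3] * 8, [2.9] * 8, 0.7
exact = real_circuit(w, x, b)
errs = [abs(quantized_circuit(w, x, b, s) - exact) for s in (16, 256, 4096)]
print([f"{e:.1e}" for e in errs])   # error falls as the scale s grows
```

Growing `s` shrinks the quantization error, but `t * t` must stay below the plaintext modulus, so the scale cannot grow without bound; that tension is why the integer schemes get their own circuit and their own table.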
So we ran two leaderboards. The CKKS table above. And a separate integer/quantized table on its own circuit:
| Rank | Library | Scheme | Fail rate | Median max-err | Wall time |
|---|---|---|---|---|---|
| 🥇 1 | OpenFHE BFV 1.5+ | BFV | 0% (0/5) | 0 | 12.7 s |
| 🥈 2 | OpenFHE BGV 1.5+ | BGV | 0% (0/5) | 0 | 15.6 s |
| 🥉 3 | Concrete ML 1.9.0 | TFHE | 33% (1/3) | 0 | 0.6 s |
BGV/BFV are bit-exact by design — zero divergence on every seed. Concrete ML's 33% FAIL is a quantization-boundary crossing effect, not an algorithmic error. TFHE wins on wall-clock by 20×; the tradeoff is that it operates on small fixed-width integers while BGV/BFV handle larger ranges.
## Picking a library
- Integer / PIR / encrypted SQL → OpenFHE BFV
- Quantized ML inference → Concrete ML (TFHE)
- Real-valued CKKS (LR, neural net, SHAP) → OpenFHE if you need precision, TenSEAL if you need 1.2× speed and can tolerate ~140,000× more worst-case error
The TenSEAL-vs-OpenFHE tradeoff is real, not a bug. If you're doing CKKS smoke tests and shipping to production at scale, that precision gap will bite you eventually. If you're prototyping on benign data, TenSEAL's ergonomics and speed are fine. Pick deliberately.
## Reproduce
```bash
# CKKS unified (w·x+b)^2 benchmark (any Python venv with TenSEAL)
pip install fhe-oracle tenseal
python benchmarks/library_comparison.py --circuit unified-squared-dot --libs tenseal

# Full leaderboard (OpenFHE BGV/BFV/CKKS) via Linux/amd64 Docker
docker build --platform linux/amd64 -t fhe-oracle-bench .
docker run --rm --platform linux/amd64 fhe-oracle-bench \
  python benchmarks/library_comparison.py --circuit unified-squared-dot
```
Full code, adapters, and raw results: github.com/BAder82t/fhe-oracle-oss (AGPL-3.0), or `pip install fhe-oracle`.
## What FHE Oracle does, in one paragraph
It tests black-box semantic divergence between a plaintext function and its FHE-compiled counterpart using CMA-ES search with a noise-aware fitness (divergence + ciphertext noise-budget consumption + multiplicative-depth utilization). Adapters for OpenFHE, Microsoft SEAL, Concrete ML, and TenSEAL turn on noise-guided mode; a pure-divergence fallback runs without any native FHE library for CI. It does not verify the cryptographic security of the FHE scheme, test for side-channel leaks, or replace formal verification. A PASS means the adversarial search didn't find a divergence in its budget, not that none exists.
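The adapter surface can be pictured as a small protocol. This is a hypothetical sketch of that shape, with invented names (`FHEAdapter`, `PlaintextFallback`), not the actual fhe-oracle API:

```python
# Hypothetical adapter protocol: one library per adapter, behind a uniform
# evaluate surface plus an optional noise probe. The search loop only ever
# sees this interface.

from typing import Optional, Protocol, Sequence

class FHEAdapter(Protocol):
    def evaluate(self, inputs: Sequence[float]) -> Sequence[float]:
        """Encrypt inputs, run the compiled circuit, decrypt the result."""
        ...

    def noise_budget(self) -> Optional[float]:
        """Remaining noise budget after the last evaluate; None if the
        library exposes no estimator (pure-divergence fallback)."""
        ...

class PlaintextFallback:
    """Degenerate adapter: runs the plaintext function itself, so the
    search loop can be exercised in CI with no FHE library installed."""

    def __init__(self, fn):
        self.fn = fn

    def evaluate(self, inputs):
        return self.fn(inputs)

    def noise_budget(self):
        return None

oracle_side = PlaintextFallback(lambda v: [(sum(v) + 0.5) ** 2])
print(oracle_side.evaluate([1.0, 2.0]))   # [12.25]
```

Adding a library means implementing this kind of surface once; when the adapter also reports a noise budget, the fitness function folds it in and the search becomes noise-guided.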
## Caveats worth naming
- Pyfhel's middle-of-the-pack result likely reflects its API wrapper, not the underlying SEAL. Pyfhel requires manual `rescale_to_next`/`mod_switch_to_next` management per multiplication. On this depth-2 circuit it works; on Taylor-3 polynomials it fails with `scale out of bounds`.
- Lattigo (Go) would be interesting to add — it ships a built-in noise estimator — but it needs a subprocess wrapper to call from Python. Tracked as future work.
- Concrete ML's 33% rate on the integer board is quantization-boundary behavior, not an algorithmic bug. We note this in the table to avoid misreading the number.
## Why we built this
CKKS circuits that pass on 99,999 random inputs can return garbage on the 100,000th. If you're deploying FHE ML to production (credit scoring, medical inference, privacy-preserving recommendation), random precision testing isn't enough — you need an adversary searching for the inputs that break you. That's what the oracle does.
PRs welcome. Add a library by writing one adapter; leaderboard updates on re-run.
Bader Alissaei — VaultBytes. A patent application on the method (PCT/IB2026/053378) has been filed; the code is AGPL-3.0. Commercial licences via b@vaultbytes.com.