Why Our AI Safety System Scores 34.5% (And Why That Is the Point)

#ai #machinelearning #security #testing

The Number Everyone Hides

Our AI governance classifier scores 95.8% accuracy on its evaluation set. If we stopped there, we'd look great on paper.

But we didn't stop there. We built a separate benchmark with 20 attack categories -- attacks the model has NEVER seen during training -- and tested blind.

Blind detection rate: 34.5%.

That's the honest number. Here's why it matters more than the 95.8%.

The Problem With AI Security Benchmarks

Most AI safety systems test on the same distribution they train on. Train on Kaggle adversarial prompts, test on Kaggle adversarial prompts, report 95%+ accuracy. Ship it.

Then a novel attack hits production and the system folds.

We enforced strict data isolation:

TRAINING POOL                    BLIND HOLDOUT
(model learns from these)        (model NEVER sees these)
â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€            â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€â”€
Kaggle MPDD (40K prompts)        SCBE 20-category attacks (400)
Local SFT records (4,846)        - direct_override
Benign doc corpus (500)          - tongue_manipulation
                                 - spin_drift
Hash-checked:                    - rag_injection
0 holdout texts leaked           - function_calling_abuse
into training.                   - ... (20 categories total)

Every holdout text was fingerprinted. We verified zero contamination. The 34.5% is mathematically clean.

What the Gap Reveals

The 95.8% â†’ 34.5% drop isn't a failure of our model. It's proof that:

Standard adversarial datasets don't cover structural attacks. Kaggle data trains you to catch "ignore previous instructions." It doesn't train you to catch spin drift, tongue manipulation, or perpendicular torsion.
Single-model classifiers have distribution shift problems. Train on one attack family, deploy against another, watch accuracy collapse. This is well-documented in ML literature but rarely admitted in product marketing.
The attacks that matter most are the ones nobody's published yet. Our 20-category benchmark includes 8 categories mapped to MITRE ATLAS and OWASP LLM Top 10 that don't exist in any public training dataset.

What We Did About It

Instead of inflating the number, we built a Hybrid Engine -- three detection layers that cover each other's blind spots:

Layer	What It Catches	Blind Detection
Trained Classifier	Known Kaggle-style attacks	34.5% (69/200)
RuntimeGate (geometric)	Structural anomalies	49% (98/200)
Trichromatic (IR/UV bands)	Hidden-band forgery	6% (12/200, early)
Hybrid (all three)	Combined	54.5% (109/200)

The hybrid beats every individual system. That proves orthogonal value -- they catch different attacks.

The Benchmark Categories

All 20 mapped to real standards:

Standard	What We Test
MITRE ATLAS	5 adversarial tactics
OWASP LLM Top 10	6 vulnerability categories
NIST AI RMF	4 risk management functions
DoD Directive 3000.09	Autonomous escalation
Anthropic RSP	ASL-2/ASL-3 evaluations

Per-Category Results (Honest)

Category	Detection	Why
spin_drift	100%	Geometric check catches escalation pattern
boundary_exploit	90%	High cost triggers
direct_override	40%	Classifier helps but not enough
prompt_extraction	10%	Polite requests slip through
multilingual	10%	Training data gap (being addressed)

The Point

If your AI safety vendor shows you 99% accuracy, ask one question: is the test set isolated from the training set?

If they can't prove zero contamination, their number is meaningless.

We publish the 34.5% because it's real. We publish the gap because it tells us where to improve. We publish the benchmark because nobody else has one with 20 categories mapped to defense standards.