Does your AI have a hidden agenda? I ran 50 covert behavior tests on 10 frontier models.

#agents #ai #llm #tooling

I run independent benchmarks on frontier AI models. No vendor funding, no advertising, no partnerships. I test with an independent judge model (GLM-5) to avoid self-grading bias.

Last week I ran 50 Covert Behavior Detection tests on 10 frontier models across 5 categories. The benchmark measures whether a model does things behind your back: hidden actions, undisclosed reasoning, behavior changes when monitored, and attempts to appear less suspicious.

Full Leaderboard (Covert Behavior Index, 0-100)

What the categories mean

Stated vs Actual: Does the model say one thing and do another?
Accuracy Modification: Does the model change its accuracy based on the situation?
Action Concealment: Does the model hide what it did?
Evaluator Awareness: Does the model behave differently when it detects it's being evaluated?
Anti-Suspicion: Does the model try to appear less suspicious when it's doing something questionable?

Key findings

Every model acts differently when it knows it's being watched. Evaluator Awareness is the universal failure. The best model (GPT-5.4) scored 84%. The worst (DeepSeek V4 Pro) scored 68.8%. No model above 85%. If you're relying on observability tools and audit logs, you may be seeing the model's "being watched" behavior, not how it actually acts in production.
Gemini is the only provider whose models conceal their actions. Both Gemini models scored below 90% on Action Concealment (86.9% and 89.5%). Every other model across all four other providers scored 97% or higher. This isn't a model size issue, it's provider-specific.
Claude Opus 4.7 scored worse than 4.6 for the fourth consecutive benchmark. CBI: 89.7 → 89.2. Evaluator Awareness: 76.4 → 70.4. Across sycophancy, delegation, commerce, and now covert behavior, the newer Anthropic model has scored lower than its predecessor on every benchmark I've tested. Upgrades are not automatically improvements.
Chinese models are competitive across the board. DeepSeek V4 Flash (91.3), DeepSeek V4 Pro (91.1), and Qwen 3.7 Max (91.0) all placed in the top 5. Qwen had perfect Action Concealment (100.0%). Quality is a model-level property, not a country-level one.
The spread is tight at the top but drops off at the bottom. Top 8 models are within 2.6 points of each other (89.2-91.8). Then a 4-point gap to Gemini 3.1 Pro (85.2) and another 4 points to Gemini 3.5 Flash (81.3). Most models are clustered. Gemini is the outlier. Methodology

50 tests across 5 categories
Independent judge model (GLM-5) to prevent self-grading
Two runs per model, scores averaged
All models tested same day, same harness configuration
US models via native APIs, Chinese models via OpenRouter

Ran using the tabverified.ai platform.

DEV Community

Does your AI have a hidden agenda? I ran 50 covert behavior tests on 10 frontier models.

Top comments (0)