Introducing Apollo, Setting a New Record in Voice Deepfake Detection Accuracy

"Apollo”, the model we released today, surpasses industry leaders in several dimensions: Its accuracy of 97.7% reduces the error rate of the next-best commercial model by 48.2%. With an ultra-low latency of <50ms it is ideal for real-time applications like securing communication. Starting today, Apollo is available through our API, web app, and enterprise integrations, including on-premises deployments.

Closing the Gap in Audio Security

Until now, organizations faced a difficult choice in detecting synthetic audio. Academic models were often open but impractical: tuned to specific datasets, they struggled with messy, real-world samples.

“There have quite literally been 10s of millions of dollars spent on this problem [deepfake detection] by DARPA, the military, countless university grants, etc... and pretty much nothing works.”  - Reddit user [5]

Closed black-box APIs (the ElevenLabs classifier, DeepfakeDetector.ai, etc.) can perform well under controlled conditions, yet their reliability quickly degrades under common manipulations such as background noise, codec compression, or re-recordings [6].

With its unprecedented accuracy, Apollo is one of the first deepfake detection models “that actually works”, engineered specifically to perform in real-world situations:

  • Latency of <50ms: Enables real-time feedback during live calls and audio streams.
  • Robust to phone-call quality: Detection accuracy stays high even on compressed, low-quality audio (e.g. cut-off frequencies in phone calls, lost data packets).
  • Works with re-recordings: Apollo maintains high detection accuracy even when deepfakes are played over speakers and re-captured by microphones - “the Shazam for deepfake detection”.
  • Background resilience: Detects synthetic audio reliably in noisy environments, overlapping speech, and conversational settings.
  • Cross-generator generalization: Apollo achieves high detection scores across multiple datasets and generator families, including strong performance on unseen synthesis methods. One example is ElevenLabs’ V3 model, which was detected with essentially 100% accuracy even though no ElevenLabs V3 data was in the training set.
  • Low compute requirements: The model runs on CPUs alone, on hardware as modest as a Raspberry Pi.

Benchmark Performance

We evaluated Apollo on a wide range of audio spoofing and deepfake benchmarks, including ASVspoof variants, In-the-Wild, and Fake-or-Real, and compared it against results from several sources [1][2][3][4]. To account for differences in dataset difficulty and size, we computed dataset-level weights for each metric (see Appendix): for F1 and accuracy, datasets where models performed worse receive higher weights, and for Equal Error Rate (EER), datasets with higher EER do. Each model’s results were then aggregated using these weights to produce weighted averages per metric, ensuring a fair and robust comparison across datasets and models.
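
To make the weighting scheme concrete, here is a minimal sketch of how such dataset-level weights could be computed and applied. The exact formula is given in the Appendix; the difficulty measure, function names, and numbers below are illustrative assumptions, not our production code.

```python
import numpy as np

# Illustrative sketch of the dataset-level weighting described above (the exact
# scheme is in the Appendix; the difficulty measure here is an assumption).
# For "higher is better" metrics (F1, accuracy), datasets where models score
# lower receive larger weights; for EER, datasets with higher EER do.

def dataset_weights(score_matrix, higher_is_better=True):
    """score_matrix: shape (n_models, n_datasets). Returns one weight per dataset."""
    difficulty = np.mean(score_matrix, axis=0)        # average score across models
    raw = (1.0 - difficulty) if higher_is_better else difficulty
    raw = np.clip(raw, 1e-6, None)                    # keep easy datasets from vanishing
    return raw / raw.sum()

def weighted_metric(model_scores, weights):
    """Weighted average of one model's per-dataset scores."""
    return float(np.dot(model_scores, weights))

# Example: F1 of two models on three benchmark datasets
f1 = np.array([[0.99, 0.96, 0.91],   # model A
               [0.98, 0.93, 0.85]])  # model B
w = dataset_weights(f1, higher_is_better=True)
print(weighted_metric(f1[0], w), weighted_metric(f1[1], w))
```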

Raw accuracy can give a general sense of correctness, but in this case it is misleading due to the unbalanced nature of the datasets (1,272,467 fake vs. 197,955 real samples). We therefore emphasize F1 score and Equal Error Rate when selecting the best model. Our approach achieves a weighted F1 of 97.45% and a weighted EER of 2.76%, surpassing open-source baselines and leading commercial systems. We also report False Positive Rate and False Negative Rate, which others do not, obtaining a weighted FPR of 2.13% and a weighted FNR of 3.47%.
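
If you want to reproduce these metrics on your own evaluation sets, a minimal EER and F1 computation from per-sample scores could look like the sketch below, using scikit-learn's ROC utilities. The convention that higher scores mean "more likely fake", the 0.5 decision threshold, and the toy numbers are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, f1_score

def equal_error_rate(labels, scores):
    """EER: the operating point where the false positive rate equals the
    false negative rate. labels: 1 = fake, 0 = real; scores: higher = more likely fake."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fpr - fnr))   # threshold where FPR and FNR cross
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy example with an imbalanced label distribution (far more fakes than reals)
labels = np.array([1, 1, 1, 1, 1, 1, 0, 0])
scores = np.array([0.95, 0.90, 0.85, 0.80, 0.60, 0.40, 0.30, 0.10])
print("EER:", equal_error_rate(labels, scores))
print("F1 :", f1_score(labels, scores > 0.5))
```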

Accuracy vs. Audio Duration

We assessed the model’s performance while progressively excluding audio samples shorter than increasing duration thresholds (1, 2, 3, … seconds). Based on both the previous benchmark and our own exploration, 3 seconds emerges as the optimal duration, since marginal gains quickly diminish beyond 4 seconds. At this duration, we achieve a weighted EER of 2.3%, from which we derive the initially reported validation accuracy of 97.7% (1 – weighted EER).
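
The sweep itself is straightforward to reproduce: drop clips shorter than each threshold, recompute the EER on what remains, and report 1 − EER as accuracy. A minimal sketch (per-dataset weighting omitted for brevity), reusing the equal_error_rate helper from the previous snippet and assuming each evaluation sample carries a duration in seconds alongside its label and model score:

```python
import numpy as np

# Sketch of the duration sweep: filter by minimum clip length, recompute EER,
# and report 1 - EER as accuracy (e.g. EER 0.023 -> accuracy 0.977).
def accuracy_by_min_duration(samples, thresholds=(1, 2, 3, 4, 5)):
    results = {}
    for t in thresholds:
        kept = [s for s in samples if s["duration_s"] >= t]
        labels = np.array([s["label"] for s in kept])
        scores = np.array([s["score"] for s in kept])
        results[t] = 1.0 - equal_error_rate(labels, scores)
    return results
```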

Language Coverage

Our model was tested across a diverse set of languages, including English, Spanish, French, Italian, Dutch, and Polish, covering over 40 languages in total. The reported results use a separate test set, as the previous benchmark lacked sufficient language coverage. F1 scores show consistent performance, indicating that the model generalizes well to multiple language contexts, including unseen languages. This extensive multilingual support enables a single model to be applied across diverse voice datasets without language-specific retraining or fine-tuning.
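
For completeness, a per-language breakdown like the one behind these results can be produced with a few lines of grouping code. The field names below are illustrative assumptions, since the test set format is not part of this post.

```python
from collections import defaultdict
from sklearn.metrics import f1_score

# Group per-sample predictions by language tag and report one F1 score per language.
def f1_by_language(samples):
    grouped = defaultdict(lambda: ([], []))
    for s in samples:
        labels, preds = grouped[s["language"]]
        labels.append(s["label"])         # 1 = fake, 0 = real
        preds.append(s["prediction"])     # model's binary decision
    return {lang: f1_score(y, p) for lang, (y, p) in grouped.items()}
```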

Conclusion

With 97.7% accuracy and <50ms latency, Apollo outperforms all other models we have evaluated, including well-known commercial and open-source ones. Our experiments also show that it is robust to background noise, different languages, and new voice synthesis models. But this quantitative comparison is only one side of the story; the other is perceived accuracy. To evaluate that, we encourage you to run your own evaluation sets through our web app or our API.

The next model versions are already in the works, and we would love to learn more from your experience. Do you also perceive the accuracy as best-in-class? Which use cases does it handle well, and which ones not yet? We are happy to receive any kind of feedback via e-mail or in a direct discussion (nicolas@aurigin.ai).

Coming Next

We’re continuing to expand Apollo’s capabilities and tooling:

  • Richer confidence scoring
  • Automated forensic reports for explainability
  • Submission to the Huggingface DF-Arena

Voice should remain a trusted interface. With Apollo, we’re making that possible.

Sources:

  • [1] HuggingFace Datasets. Emilia-Yodas Audio Dataset. 2025.
  • [2] HuggingFace Spaces. Speech Deepfake Arena Leaderboard. 2025.
  • [3] Pindrop Labs. Pindrop Labs Submission to the ASVspoof Challenge. 2021.
  • [4] Reality Defender. Reality Defender Submission on ASVspoof5. 2024.
  • [5] Reddit. Thoughts on Deepfake Detection. 2025.
  • [6] YouTube. I Tested Five Deepfake Detectors—They ALL FAILED. 2025.

Appendix

See the original blog post.
