[Tanwydd]
Detecting Deepfake Audio in Python: Why the Threshold Matters More Than the Model

Cloning a voice used to require a recording studio and a professional impersonator. Today it takes a few seconds of audio and a free API call.

That changes the threat model for any system that verifies identity by voice.


The problem with voice verification in 2026

Voice biometrics have been used in contact centers and banking for years. The assumption was that a voice is hard to fake — you either sound like someone or you don't, and training a model to tell the difference was expensive enough to deter casual fraud.

That assumption is gone. Modern voice cloning tools can reproduce a speaker's voice with enough fidelity to fool both humans and many biometric systems, using as little as three to five seconds of target audio. The barrier is now effectively zero for anyone motivated enough to try.

The response can't just be "better biometrics." It has to include detection of synthetic audio alongside speaker verification.


Two problems, two models

VoiceID Compare solves both problems in a single API call:

  1. Speaker verification — do these two audio samples belong to the same person?
  2. Deepfake detection — was either sample generated by AI?

These are separate tasks that require separate models. Confusing them is a common mistake — a deepfake detector doesn't tell you if two voices match, and a speaker verification model doesn't tell you if the audio is synthetic.


Speaker verification: embeddings and cosine similarity

The speaker verification component uses SpeechBrain's ResNet model, trained on VoxCeleb — a large-scale dataset of celebrity speech collected from YouTube.

The model doesn't compare audio files directly. It converts each audio sample into an embedding — a vector of floating point numbers that represents the speaker's vocal characteristics in a high-dimensional space.

from speechbrain.inference.speaker import SpeakerRecognition

# Download the pretrained model (cached locally after the first call)
model = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-resnet-voxceleb",
    savedir="pretrained_models/spkrec-resnet-voxceleb",
)

# score: raw similarity tensor; prediction: boolean same-speaker decision
score, prediction = model.verify_files(audio_path_1, audio_path_2)

The similarity between two embeddings is calculated using cosine similarity — the angle between the two vectors in that high-dimensional space. Vectors pointing in the same direction (same speaker) have high cosine similarity. Vectors pointing in different directions (different speakers) have low similarity.

The raw score is normalized to a 0-100 percentage scale for human readability.
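That mapping can be sketched in plain Python. This is an illustrative implementation, not the exact normalization VoiceID Compare uses — it assumes the raw cosine score lies in [-1, 1] and clamps negative scores to zero before scaling:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def to_percentage(score: float) -> float:
    """Map a cosine score in [-1, 1] to a 0-100 scale, clamping negatives.

    Negative cosine similarity means the vectors point in opposing
    directions -- for this use case, simply "not the same speaker".
    """
    return round(max(score, 0.0) * 100, 1)
```

In production you'd compute this on the embedding tensors directly (e.g. `torch.nn.functional.cosine_similarity`), but the arithmetic is the same.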


Why audio preprocessing matters

Raw audio files from real-world sources are messy. Different sample rates, different channel counts, different durations, background noise. Feeding inconsistent audio to the model produces unreliable results.

Before embedding extraction, every audio file goes through normalization:

import torchaudio

def preprocess_audio(path: str):
    waveform, sample_rate = torchaudio.load(path)

    # Convert to mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)

    # Resample to 16kHz (model requirement)
    if sample_rate != 16000:
        resampler = torchaudio.transforms.Resample(sample_rate, 16000)
        waveform = resampler(waveform)

    return waveform

16kHz mono is the format the ResNet model was trained on. Deviating from it degrades accuracy in ways that aren't always obvious — the model still produces an output, it's just less reliable.


Deepfake detection: Wav2Vec2

The deepfake detection component uses a fine-tuned Wav2Vec2 model from HuggingFace, trained to classify audio as real or synthetic.

Wav2Vec2 is a self-supervised model originally designed for speech recognition. The fine-tuned version used here has been trained on a dataset of real and AI-generated speech samples, learning to identify the subtle artifacts that synthetic audio introduces — phase discontinuities, unnatural prosody, residue from vocoder processing.

from transformers import pipeline

deepfake_detector = pipeline(
    "audio-classification",
    model="garystafford/wav2vec2-deepfake-voice-detector"
)

result = deepfake_detector(audio_path)
# Returns: [{'label': 'fake', 'score': 0.73}, {'label': 'real', 'score': 0.27}]

The output is a probability score per class. A score of 0.73 for 'fake' means the model assigns a 73% probability that the audio was synthetically generated.
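The result list is not guaranteed to arrive in a fixed label order, so it's worth pulling the 'fake' probability out defensively rather than indexing `result[0]`. A small helper, written against the output shape shown above:

```python
def fake_probability(results: list[dict]) -> float:
    """Return the 'fake' class score from an audio-classification result list.

    Searches by label instead of assuming list order, since pipelines
    typically sort results by score, not by a fixed class order.
    """
    for entry in results:
        if entry["label"] == "fake":
            return entry["score"]
    raise ValueError("no 'fake' label in classifier output")
```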


The threshold problem

Here's where most implementations go wrong: they treat the model's output as ground truth.

It isn't. It's a probability estimate whose reliability varies with audio quality, recording conditions, the specific voice cloning tool used, and how recently the model was updated relative to the latest generation of synthesis tools.

The threshold — the score above which you classify audio as a deepfake — is a design decision, not a model parameter. And it has asymmetric consequences:

  • Too low (e.g. 40%): high false positive rate. Legitimate users get flagged. Trust collapses.
  • Too high (e.g. 80%): high false negative rate. Actual deepfakes get through. False confidence.

For forensic use, the threshold needs to be calibrated against your specific threat model and your tolerance for each type of error. The system should surface the raw score, not just a binary verdict.

DEEPFAKE_ALERT_THRESHOLD = 0.60  # tunable per deployment

def interpret_deepfake_score(score: float) -> dict:
    return {
        "score_pct": round(score * 100, 1),
        "alert": score >= DEEPFAKE_ALERT_THRESHOLD,
        "verdict": "possible deepfake" if score >= DEEPFAKE_ALERT_THRESHOLD else "appears genuine",
    }

The 60% default is a starting point, not a recommendation. In a banking compliance context you might want 50%. In a forensic investigation you might want to surface everything above 30% for manual review.
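One way to make that calibration concrete is to sweep the threshold over a labeled sample and watch the false positive and false negative rates trade off. The scores below are made up for the sketch — in practice you'd use a held-out set of genuine and known-synthetic audio from your own deployment:

```python
def error_rates(scores: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Compute (false_positive_rate, false_negative_rate) for a threshold.

    `scores` pairs each model score with the ground truth (True = actually fake).
    """
    genuine = [s for s, is_fake in scores if not is_fake]
    fakes = [s for s, is_fake in scores if is_fake]
    fpr = sum(s >= threshold for s in genuine) / len(genuine)  # real flagged as fake
    fnr = sum(s < threshold for s in fakes) / len(fakes)       # fake passed as real
    return fpr, fnr

# Illustrative labeled scores (hypothetical, for demonstration only)
sample = [(0.2, False), (0.35, False), (0.55, False), (0.45, True), (0.7, True), (0.9, True)]
for t in (0.3, 0.6, 0.8):
    fpr, fnr = error_rates(sample, t)
    print(f"threshold {t}: FPR={fpr:.2f}  FNR={fnr:.2f}")
```

Even on six samples the asymmetry is visible: lowering the threshold drives false positives up, raising it lets fakes through. The right operating point depends entirely on which error costs you more.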


Running both models in parallel

Speaker verification and deepfake detection are independent. Running them in parallel cuts wall-clock time roughly in half when the two inference passes take comparable time.

Since both are CPU/GPU bound, they need to run in a thread pool to avoid blocking an async event loop:

import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=2)

async def compare(audio_path_1: str, audio_path_2: str) -> dict:
    loop = asyncio.get_running_loop()  # get_event_loop() is deprecated inside coroutines

    speaker_task = loop.run_in_executor(
        executor, verify_speaker, audio_path_1, audio_path_2
    )
    deepfake_task = loop.run_in_executor(
        executor, detect_deepfake, audio_path_1
    )

    similarity, deepfake = await asyncio.gather(speaker_task, deepfake_task)

    return {
        "similarity_pct": similarity,
        "audio_1_deepfake": deepfake,
    }

Wrap with asyncio.wait_for — model inference on long audio files can take tens of seconds and you don't want hung requests blocking the server.
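A minimal sketch of that wrapper, assuming a coroutine like the `compare` function above. The `_slow` stand-in is hypothetical — it just simulates inference that overruns the deadline:

```python
import asyncio

async def compare_with_timeout(coro, timeout_s: float = 30.0):
    """Await model inference, cancelling it if it exceeds the deadline."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the underlying task before raising
        return {"error": f"inference exceeded {timeout_s}s deadline"}

# Usage sketch with a stand-in coroutine (replace with compare(path_1, path_2))
async def _slow():
    await asyncio.sleep(5)
    return {"ok": True}

result = asyncio.run(compare_with_timeout(_slow(), timeout_s=0.1))
print(result)
```

Note that `wait_for` cancels the awaited task, but work already submitted to a `ThreadPoolExecutor` keeps running in its thread — the timeout frees the request handler, not the CPU. For hard limits on inference itself you'd need a subprocess or chunked audio.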


The combination that should raise flags

The most dangerous scenario isn't a low similarity score or a high deepfake score in isolation. It's high similarity combined with a high deepfake score.

That means: the voice sounds like the target person, but the audio was probably synthesized. That's a cloning attack.

def interpret_result(similarity_pct: float, deepfake_score: float) -> str:
    if similarity_pct >= 75 and deepfake_score >= 0.60:
        return "HIGH RISK: voice matches but audio may be synthetic — possible cloning attack"
    if similarity_pct >= 75:
        return "same person"
    if similarity_pct >= 55:
        return "inconclusive — manual review recommended"
    return "different people"

This case needs explicit handling in the interpretation logic, not just surfacing both scores independently.


What this doesn't solve

No system catches everything. The current generation of voice cloning tools produces audio that fools both humans and models at rates that should make anyone uncomfortable relying on voice verification as a sole authentication factor.

This is a layer in a defense stack, not a complete solution. Combined with behavioral signals, session context, and human review for edge cases, it raises the cost of a successful attack significantly. Used alone as a binary gate, it will eventually be bypassed.

The honest answer to "is this voice real?" is always a probability, not a fact. The system's job is to surface that probability clearly enough that the humans making decisions can act on it.

