
Abhishek Kumar

I Built a Real-Time Deepfake Detector in Python — 9 Signal Layers, Full Architecture, Free to Use

Deepfake fraud crossed $860 million in losses in 2024. Every major solution on the market costs thousands of dollars per month. So I built one that's free — and open about exactly how it works.

This is a complete technical breakdown of Vigil AI — the architecture, the signal selection decisions, the code, the tradeoffs, and everything I learned building a real-time audio and video deepfake detector as a solo developer.

Live demo: www.vigilai.online


Why These 9 Signals? The Decision-Making Process

Most deepfake detectors are black boxes. They train a neural network, it outputs a probability, done. That is fine for accuracy but terrible for:

  1. Explainability — banks and enterprises need to know why a call was flagged
  2. False positive debugging — when someone legitimate gets flagged, you need to know which signal triggered
  3. Incremental improvement — you can improve one signal at a time without retraining everything
  4. Running on CPU — a neural network needs GPU inference; weighted rule signals run on any ₹400/month VPS

Here is how I selected each signal.


Audio Detection Architecture

Signal 1 & 2: MFCC Variance and MFCC Delta Variance

Why MFCC? Mel-Frequency Cepstral Coefficients are the standard representation of audio for speech processing. They capture the shape of the vocal tract's frequency response — essentially a fingerprint of how sound is being produced.

The key insight: AI voice synthesis models are trained to reproduce the mean spectral characteristics of a voice. They are very good at this. What they consistently fail to reproduce is the variance — the natural randomness and micro-variation in how a real human produces speech moment to moment.

import librosa
import numpy as np

def extract_mfcc_signals(y, sr):
    # Extract 40 MFCC coefficients
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
    mfcc_var = float(np.var(mfcc))

    # Delta = rate of change of MFCCs
    # AI voices change too smoothly — delta variance is too low
    mfcc_delta = librosa.feature.delta(mfcc)
    delta_var = float(np.var(mfcc_delta))

    return mfcc_var, delta_var

# Thresholds determined empirically
# mfcc_var < 2800 → suspicious
# delta_var < 80 → suspicious

I give MFCC Variance a weight of 3 because it is the single most reliable signal across the test data I have run. An AI voice almost always fails this check.

Signal 3: Pitch Jitter — The Most Powerful Signal

This took me the longest to get right and has the highest detection accuracy of anything I have built.

The physics: Human pitch (fundamental frequency, F0) is controlled by the tension and mass of the vocal folds, which are in turn controlled by tiny muscles with their own mechanical variability. This creates micro-variations in pitch — called jitter — that are always present in real speech.

AI voice synthesis models smooth out these variations. They produce a pitch contour that follows the learned prosody patterns but without the micro-level noise that organic vocal production creates.

def extract_pitch_jitter(y, sr):
    # pyin is more accurate than yin for pitch tracking
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y,
        fmin=librosa.note_to_hz('C2'),   # ~65 Hz — lowest human voice
        fmax=librosa.note_to_hz('C7'),   # ~2093 Hz — highest human voice
    )

    # Only measure jitter on voiced frames; pyin marks unvoiced frames as NaN
    voiced_f0 = f0[voiced_flag] if voiced_flag is not None else np.array([])
    voiced_f0 = voiced_f0[~np.isnan(voiced_f0)]

    if len(voiced_f0) > 10:
        # np.diff gives the frame-to-frame pitch change
        # real voices: std of these changes is > 0.003
        # AI voices: std is near-zero — too smooth
        pitch_jitter = float(np.std(np.diff(voiced_f0)))
    else:
        pitch_jitter = 0.0  # No voiced frames = suspicious

    return pitch_jitter

# Threshold: pitch_jitter < 0.003 → suspicious
# Weight: 3x (highest weight in the system)

I discovered this signal by accident. I was listening to a flagged audio sample that passed all other checks and noticed it sounded "robotic" in a way I could not articulate. I plotted the F0 contour and it looked like a sine wave — perfectly smooth. Real speech looks like a noisy mountain range.
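
If you want to reproduce that plot yourself, here is a minimal sketch (the input filename is a placeholder):

import librosa
import matplotlib.pyplot as plt

# Load any sample you want to inspect (placeholder filename)
y, sr = librosa.load("sample.wav", sr=16000)

f0, voiced_flag, voiced_probs = librosa.pyin(
    y,
    fmin=librosa.note_to_hz('C2'),
    fmax=librosa.note_to_hz('C7'),
)

# Real speech: a jagged "mountain range". Synthetic speech: a smooth curve.
times = librosa.times_like(f0, sr=sr)
plt.plot(times, f0, linewidth=1)
plt.xlabel("Time (s)")
plt.ylabel("F0 (Hz)")
plt.show()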

Signal 4: Harmonic Ratio

def extract_harmonic_ratio(y):
    # Separate harmonic and percussive components
    harmonic, percussive = librosa.effects.hpss(y)

    # AI voices are over-harmonic — they are too "clean"
    # Real speech has significant percussive content from
    # consonants, breath noise, lip sounds, mouth clicks
    harmonic_ratio = float(
        np.mean(np.abs(harmonic)) / (np.mean(np.abs(percussive)) + 1e-8)
    )

    return harmonic_ratio

# Threshold: harmonic_ratio > 6.0 → suspicious (too clean)
# Weight: 2x

Signals 5–9: Supporting Signals

def extract_supporting_signals(y, sr):
    # Zero Crossing Rate — how often the waveform crosses zero
    # AI voices have unnatural ZCR patterns
    zcr = float(np.mean(librosa.feature.zero_crossing_rate(y)) * 1000)

    # Spectral Centroid — "center of mass" of the spectrum
    # AI voices have unnaturally stable centroid
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    centroid_std = float(np.std(centroid))

    # Chroma Variation — organic pitch change patterns
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    chroma_var = float(np.var(chroma))

    # RMS Energy — AI voices have unnaturally uniform volume
    rms = librosa.feature.rms(y=y)
    rms_var = float(np.var(rms) * 1e6)

    # Spectral Flatness — AI voices are overly tonal
    flatness = float(np.mean(librosa.feature.spectral_flatness(y=y)))

    return zcr, centroid_std, chroma_var, rms_var, flatness

The Weighted Scoring System

This is the core design decision that separates Vigil AI from naive threshold-based detection:

WEIGHTS = {
    "MFCC Variance":        3,
    "Pitch Jitter":         3,
    "MFCC Delta Var":       2,
    "RMS Consistency":      2,
    "Harmonic Ratio":       2,
    "Zero Crossing Rate":   1,
    "Spectral Centroid":    1,
    "Chroma Variation":     1,
    "Spectral Flatness":    1,
}

MAX_WEIGHT = sum(WEIGHTS.values())  # = 16

def compute_verdict(signals_flagged):
    weighted_score = sum(
        WEIGHTS[name] for name, _, _, flagged in signals_flagged if flagged
    )
    confidence = weighted_score / MAX_WEIGHT

    # Trigger if weighted suspicious score exceeds 35% of maximum
    is_fake = weighted_score >= int(MAX_WEIGHT * 0.35)

    return is_fake, confidence

# Example: Only MFCC Variance and Pitch Jitter flagged
# weighted_score = 3 + 3 = 6
# confidence = 6/16 = 0.375
# is_fake = True (6 >= int(16 * 0.35) = 5)
# This correctly catches a case where only the two strongest signals fire

The threshold of 35% was determined empirically. At 30% there are too many false positives. At 40% some real deepfakes slip through.
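
For anyone who wants to re-run that tuning on their own data, the sweep itself is simple. A sketch, assuming you have hand-labeled clips already scored by the pipeline above:

def sweep_thresholds(samples, max_weight=16):
    # samples: list of (weighted_score, is_actually_fake) pairs
    # produced by running labeled real/fake clips through the pipeline
    for pct in (0.25, 0.30, 0.35, 0.40, 0.45):
        cutoff = int(max_weight * pct)
        fp = sum(1 for score, fake in samples if score >= cutoff and not fake)
        fn = sum(1 for score, fake in samples if score < cutoff and fake)
        print(f"{pct:.0%} cutoff (score >= {cutoff}): "
              f"{fp} false positives, {fn} missed fakes")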


Video Detection Architecture

Blink Detection with MediaPipe

import mediapipe as mp
import cv2
import numpy as np

def analyze_blink_pattern(frames):
    mp_face = mp.solutions.face_mesh
    face_mesh = mp_face.FaceMesh(
        max_num_faces=1,
        refine_landmarks=True,
        min_detection_confidence=0.5,
        min_tracking_confidence=0.5
    )

    blink_scores = []

    for frame in frames:
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        result = face_mesh.process(rgb)

        if result.multi_face_landmarks:
            lm = result.multi_face_landmarks[0].landmark

            # Simplified eye-openness ratio via MediaPipe landmark indices
            # (vertical opening only; classic EAR also normalizes by eye width)
            # Upper eyelid: landmark 159, Lower eyelid: landmark 145
            upper = lm[159].y
            lower = lm[145].y

            # Normalize by face height to handle different distances
            face_height = abs(lm[10].y - lm[152].y) + 1e-6
            ear = abs(upper - lower) / face_height

            blink_scores.append(ear)

    face_mesh.close()

    if not blink_scores:
        return True, 0.15  # No face detected — treat as suspicious

    max_ear = max(blink_scores)

    # A blink occurs when EAR drops below ~78% of the maximum open-eye EAR
    # Real humans blink every 2–10 seconds, so any clip of reasonable
    # length should contain at least one blink
    blinked = any(score < max_ear * 0.78 for score in blink_scores)
    avg_ear = float(np.mean(blink_scores))

    return blinked, avg_ear

Why this works: Deepfake face replacement models are trained primarily on open-eye frames (blinking frames are rare and often discarded during training data curation). The result is that deepfake faces either do not blink at all, or blink at unnaturally uniform intervals.
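
The code above only checks that a blink happened at all. Catching the second failure mode, metronome-like blinking, would take one more step. A sketch of the idea, not something in the current detector:

import numpy as np

def blink_interval_std(blink_scores, ratio=0.78):
    # Frame indices where the eye is closed enough to count as blinking
    max_ear = max(blink_scores)
    blink_frames = [i for i, s in enumerate(blink_scores) if s < max_ear * ratio]

    if len(blink_frames) < 3:
        return None  # Too few blinks to judge rhythm

    # Real blinking is irregular; a near-zero spread between blinks
    # suggests a synthesized, uniform blink pattern
    intervals = np.diff(blink_frames)
    return float(np.std(intervals))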

Skin Hue Variance — The Signal I Am Most Proud Of

This signal came from a paper I found about GAN-generated image detection. The core insight: GANs produce skin with unnaturally uniform color distribution. Real human skin has subsurface scattering, shadow gradients, slight warmth variations, and imperfections that create high variance in the hue and saturation channels.

def analyze_skin_hue_variance(frame, face_landmarks):
    h, w = frame.shape[:2]
    lm = face_landmarks.landmark

    # Extract face bounding box from landmarks
    x1 = int(lm[234].x * w)   # Left cheek
    y1 = int(lm[10].y * h)    # Top of face
    x2 = int(lm[454].x * w)   # Right cheek
    y2 = int(lm[152].y * h)   # Chin

    # Clamp to frame bounds
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)

    if x2 <= x1 or y2 <= y1:
        return 200.0  # Cannot crop — return neutral value

    face_crop = frame[y1:y2, x1:x2]

    # Convert to HSV — hue and saturation variance are key
    hsv = cv2.cvtColor(face_crop, cv2.COLOR_BGR2HSV)

    hue_var = float(np.var(hsv[:, :, 0]))         # Hue channel variance
    saturation_var = float(np.var(hsv[:, :, 1]))  # Saturation channel variance

    # Combined skin variance score
    skin_var = hue_var + saturation_var

    return skin_var

# Threshold: skin_var < 80 → suspicious (AI skin = too uniform)
# Weight: 2x

In my testing, this correctly flags deepfakes that pass the blink test — particularly high-quality GAN-generated faces that have learned to blink.
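
The function above scores a single frame. In practice you would run it across the clip and aggregate. A sketch using the median, so a few bad frames cannot swing the verdict:

import cv2
import mediapipe as mp
import numpy as np

def skin_variance_for_clip(frames):
    mp_face = mp.solutions.face_mesh
    scores = []
    with mp_face.FaceMesh(max_num_faces=1, refine_landmarks=True) as face_mesh:
        for frame in frames:
            result = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.multi_face_landmarks:
                # analyze_skin_hue_variance is the function defined above
                scores.append(
                    analyze_skin_hue_variance(frame, result.multi_face_landmarks[0])
                )
    return float(np.median(scores)) if scores else None

# Same threshold as the per-frame check: median < 80 → suspicious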

Optical Flow for Motion Consistency

def analyze_optical_flow(frames):
    flow_variances = []
    prev_gray = None

    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        if prev_gray is not None:
            # Farneback optical flow — dense, captures all motion
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None,
                pyr_scale=0.5,    # Pyramid scale
                levels=3,         # Pyramid levels
                winsize=15,       # Window size
                iterations=3,
                poly_n=5,
                poly_sigma=1.2,
                flags=0
            )

            # Magnitude of motion at each pixel
            magnitude, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])

            # Real video has consistent motion variance
            # Deepfake video often has frozen background or erratic motion
            flow_variances.append(float(np.var(magnitude)))

        prev_gray = gray.copy()

    if not flow_variances:
        return 0.0, False

    flow_mean = float(np.mean(flow_variances))
    flow_std = float(np.std(flow_variances))

    # Flag if variance is extremely high (erratic) or extremely low (frozen)
    is_suspicious = (
        flow_std > flow_mean * 2.5 or   # Erratic motion
        flow_mean < 0.001                # Completely frozen background
    )

    return flow_mean, is_suspicious

2D Fourier Transform for Pixel Artifacts

def analyze_fft_energy(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # 2D FFT
    dft = np.fft.fft2(gray)

    # Shift zero-frequency component to center
    dft_shifted = np.fft.fftshift(dft)

    # Log magnitude spectrum (natural log; the empirical thresholds
    # below were tuned to this scale, not to dB/log10)
    magnitude = 20 * np.log(np.abs(dft_shifted) + 1)

    energy = float(np.mean(magnitude))

    # Real camera images have high-frequency components from sensor noise
    # AI-generated images lack these — FFT energy is too low or too high
    is_suspicious = energy < 135 or energy > 182

    return energy, is_suspicious
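
The audio side has an explicit weighted scorer; the four video signals can be combined the same way. A sketch with illustrative weights (these are assumptions, not the production values):

VIDEO_WEIGHTS = {
    "Blink Pattern": 3,   # illustrative weights, not the shipped values
    "Skin Variance": 2,
    "Optical Flow":  2,
    "FFT Energy":    1,
}

def compute_video_verdict(flags):
    # flags: dict mapping signal name -> True if that signal fired
    max_weight = sum(VIDEO_WEIGHTS.values())
    score = sum(w for name, w in VIDEO_WEIGHTS.items() if flags.get(name))
    return score >= int(max_weight * 0.35), score / max_weight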

The Real-Time Architecture I Am Building

The current Streamlit prototype is great for testing but unsuitable for production. Here is the FastAPI architecture I am building:

# main.py — FastAPI REST API

from fastapi import FastAPI, File, UploadFile, HTTPException, Depends
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import APIKeyHeader
import uvicorn

app = FastAPI(
    title="Vigil AI Detection API",
    description="Real-time deepfake and AI voice fraud detection",
    version="1.0.0"
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["POST"],
    allow_headers=["*"],
)

api_key_header = APIKeyHeader(name="X-API-Key")

async def verify_api_key(api_key: str = Depends(api_key_header)):
    # Verify against Supabase api_keys table
    # Track usage per key
    if not await check_key_valid(api_key):
        raise HTTPException(status_code=403, detail="Invalid API key")
    return api_key

@app.post("/v1/detect/audio")
async def detect_audio(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key)
):
    """
    Analyze audio for AI-generated voice or cloned speech.

    Accepts: WAV, MP3, OGG, FLAC, M4A
    Returns: verdict, confidence, per-signal breakdown
    """
    audio_bytes = await file.read()
    result, error = analyze_audio(audio_bytes, file.filename)

    if error:
        raise HTTPException(status_code=422, detail=error)

    return {
        "verdict": "synthetic" if result["is_fake"] else "human",
        "confidence": round(result["confidence"], 4),
        "risk_level": "high" if result["confidence"] > 0.75 else (
            "medium" if result["confidence"] > 0.45 else "low"
        ),
        "signals": [
            {
                "name": name,
                "value": round(float(val), 4),
                "flagged": flagged,
                "weight": WEIGHTS.get(name, 1)
            }
            for name, val, _, flagged in result["flags"]
        ],
        "duration_seconds": round(result["duration"], 2),
        "weighted_score": result["weighted_score"],
        "max_score": result["max_weight"],
    }

@app.post("/v1/detect/video")
async def detect_video(
    file: UploadFile = File(...),
    api_key: str = Depends(verify_api_key)
):
    """
    Analyze video for deepfake face manipulation.

    Accepts: MP4, MOV, AVI, WebM
    Returns: verdict, confidence, per-signal breakdown
    """
    # ... video analysis implementation
    pass

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
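
The check_key_valid call is left abstract above. A minimal sketch against a Supabase api_keys table, where the table and column names are assumptions:

import asyncio
from supabase import create_client

supabase = create_client("https://your-project.supabase.co", "your-service-role-key")

def _key_exists(api_key: str) -> bool:
    # Assumed schema: api_keys(key text, active bool)
    result = (
        supabase.table("api_keys")
        .select("key")
        .eq("key", api_key)
        .eq("active", True)
        .limit(1)
        .execute()
    )
    return len(result.data) > 0

async def check_key_valid(api_key: str) -> bool:
    # The default supabase-py client is synchronous, so keep it off the event loop
    return await asyncio.to_thread(_key_exists, api_key)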

Example API call:

curl -X POST "https://api.vigilai.online/v1/detect/audio" \
  -H "X-API-Key: va_your_key_here" \
  -F "file=@suspicious_call.wav"
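
Or the same call from Python:

import requests

response = requests.post(
    "https://api.vigilai.online/v1/detect/audio",
    headers={"X-API-Key": "va_your_key_here"},
    files={"file": open("suspicious_call.wav", "rb")},
)
print(response.json())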

Response:

{
  "verdict": "synthetic",
  "confidence": 0.7647,
  "risk_level": "high",
  "signals": [
    { "name": "MFCC Variance", "value": 1842.3, "flagged": true, "weight": 3 },
    { "name": "Pitch Jitter", "value": 0.0012, "flagged": true, "weight": 3 },
    { "name": "Harmonic Ratio", "value": 7.84, "flagged": true, "weight": 2 },
    { "name": "RMS Consistency", "value": 2.1, "flagged": true, "weight": 2 },
    { "name": "Zero Crossing Rate", "value": 28.4, "flagged": true, "weight": 1 },
    { "name": "Spectral Centroid", "value": 289.2, "flagged": false, "weight": 1 },
    { "name": "Chroma Variation", "value": 0.018, "flagged": false, "weight": 1 },
    { "name": "MFCC Delta Var", "value": 62.1, "flagged": true, "weight": 2 },
    { "name": "Spectral Flatness", "value": 0.0015, "flagged": true, "weight": 1 }
  ],
  "duration_seconds": 8.4,
  "weighted_score": 13,
  "max_score": 17
}

The Real-Time Call Detection SDK — Android Architecture

// VigilAICallMonitor.kt

class VigilAICallMonitor(
    private val context: Context,
    private val apiKey: String
) {

    private val SAMPLE_RATE = 16000
    private val CHUNK_DURATION_MS = 3000
    private val BUFFER_SIZE = AudioRecord.getMinBufferSize(
        SAMPLE_RATE,
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT
    )

    private var audioRecord: AudioRecord? = null
    private var isMonitoring = false
    private val webSocketClient = buildWebSocketClient()

    fun startMonitoring(onResult: (verdict: String, confidence: Float) -> Unit) {
        isMonitoring = true

        audioRecord = AudioRecord(
            MediaRecorder.AudioSource.VOICE_COMMUNICATION,  // Call audio
            SAMPLE_RATE,
            AudioFormat.CHANNEL_IN_MONO,
            AudioFormat.ENCODING_PCM_16BIT,
            BUFFER_SIZE
        )

        audioRecord?.startRecording()

        CoroutineScope(Dispatchers.IO).launch {
            val chunk = ShortArray(SAMPLE_RATE * CHUNK_DURATION_MS / 1000)

            while (isMonitoring) {
                audioRecord?.read(chunk, 0, chunk.size)

                // Convert to WAV bytes and send to API
                val wavBytes = shortArrayToWav(chunk, SAMPLE_RATE)

                // Send via WebSocket for lowest latency
                webSocketClient.send(wavBytes)

                // Receive result
                val result = webSocketClient.receiveResult()

                withContext(Dispatchers.Main) {
                    onResult(result.verdict, result.confidence)
                }
            }
        }
    }

    fun stopMonitoring() {
        isMonitoring = false
        audioRecord?.stop()
        audioRecord?.release()
        audioRecord = null
    }
}
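
On the server side, the matching endpoint would be a FastAPI WebSocket route. A sketch: the route path is an assumption, and collect_flags is a hypothetical helper standing in for wiring all nine signals into compute_verdict:

import io
import soundfile as sf
from fastapi import WebSocket

@app.websocket("/v1/stream/audio")
async def stream_audio(websocket: WebSocket):
    await websocket.accept()
    while True:
        # Each message is one 3-second WAV chunk from the SDK
        wav_bytes = await websocket.receive_bytes()
        y, sr = sf.read(io.BytesIO(wav_bytes), dtype="float32")

        mfcc_var, delta_var = extract_mfcc_signals(y, sr)
        jitter = extract_pitch_jitter(y, sr)
        # ... extract the remaining signals, then score them
        is_fake, confidence = compute_verdict(collect_flags(mfcc_var, delta_var, jitter))

        await websocket.send_json({
            "verdict": "synthetic" if is_fake else "human",
            "confidence": round(float(confidence), 4),
        })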

Performance and Limitations

Current performance (CPU inference, no ML model):

  • Audio analysis: 800ms–2.5s depending on file length
  • Video analysis: 3–8s for a 30-second clip
  • Real-time call chunk (3s audio): ~200ms per chunk

Known limitations:

  1. High-quality neural voice synthesis (VALL-E X quality) may pass the pitch jitter check if trained on enough data
  2. Heavy video compression (low bitrate) degrades the FFT and texture analysis
  3. MediaPipe struggles with non-frontal face angles beyond ~45°
  4. Threshold values need refinement against a proper benchmark dataset

What would make it significantly better:

  • Training a scikit-learn or XGBoost classifier on ASVspoof 2019, replacing the threshold rules (see the sketch after this list)
  • Adding speaker verification — comparing the voice against a known reference sample
  • GAN artifact detection using a pretrained ResNet (would require GPU inference)
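
For the first item, the training loop itself is short once the features exist. A sketch, assuming you have already extracted the nine signal values per clip and saved them alongside ASVspoof labels (the filenames are placeholders):

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical pre-extracted arrays: one row of 9 signal values per clip,
# labels 1 = spoofed, 0 = bona fide (from ASVspoof 2019 metadata)
X = np.load("signal_features.npy")
y = np.load("labels.npy")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = GradientBoostingClassifier()
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))

# clf.predict_proba(...)[:, 1] would then replace the hand-tuned
# 35% weighted-score cutoff as the confidence score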

What I Need Help With

I am a solo developer. There are things I can build alone and things that would benefit from collaboration:

Research collaborators: If you work in audio signal processing or GAN artifact detection research and want to contribute signal ideas or help with the ASVspoof benchmarking, reach out.

B2B pilots: If you work at a fintech, NBFC, or any company that handles voice or video KYC, I want to give you a free API pilot. No strings attached.

Open source contributions: The core detection library will be open-sourced. If you want to contribute additional signals, dataset evaluations, or language SDKs, watch the GitHub repo (launching soon).


Getting Started

Try the live demo: vigilai.online

Run locally:

git clone https://github.com/abhishekkumar/vigilai  # coming soon
cd vigilai
pip install -r requirements.txt
# Install ffmpeg (required for audio decoding):
# Mac: brew install ffmpeg
# Ubuntu: sudo apt install ffmpeg
# Windows: winget install ffmpeg
streamlit run app.py

Requirements:

streamlit>=1.32.0
librosa>=0.10.1
soundfile>=0.12.1
mediapipe>=0.10.9
opencv-python>=4.9.0
scipy>=1.12.0
matplotlib>=3.8.0
plotly>=5.20.0
supabase>=2.4.0

The Bigger Picture

Deepfake technology is not going away. The models will get better. The voices will get more convincing. The videos will become indistinguishable from reality.

The only viable response is detection infrastructure that scales as fast as the generation technology. Detection that is free, accessible, running in the background of every device, integrated into every KYC flow and every call system.

That is what I am building with Vigil AI. One signal at a time.

Follow the progress:

If you found this technical breakdown useful, drop a reaction and share it with anyone building in the AI safety, fraud detection, or identity verification space.


Built with Python, librosa, MediaPipe, OpenCV, FastAPI, Supabase, and a lot of late nights. Founded by Abhishek Kumar, India.
