Face Swap a Video with Python — Step by Step Tutorial

#machinelearning #api #python #tutorial

Swapping a face in a photo takes a single API call. Video needs a pipeline: extract the frames, swap the face in each one, and reassemble the result. This tutorial walks through the full workflow in Python using ffmpeg and a Face Swap API.

How It Works

A video is a sequence of frames. The strategy:

Extract — Split video into JPEG frames with ffmpeg
Filter — Detect which frames contain a face (skip the rest)
Swap — Send each face frame to the API with the source face
Reassemble — Stitch processed frames back into video with original audio

Step 1: Extract Frames

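import json
import os
import subprocess

def extract_frames(video_path: str, frames_dir: str) -> tuple[int, float]:
    """Dump every frame of video_path as a JPEG and return (frame_count, fps)."""
    os.makedirs(frames_dir, exist_ok=True)

    # Read stream metadata as JSON; check=True raises if ffprobe fails.
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", video_path],
        capture_output=True, text=True, check=True,
    )
    vs = next(s for s in json.loads(probe.stdout)["streams"] if s["codec_type"] == "video")
    # r_frame_rate is a fraction string such as "30000/1001".
    num, den = vs["r_frame_rate"].split("/")
    fps = int(num) / int(den)

    # -qscale:v 2 writes high-quality JPEGs; check=True surfaces ffmpeg errors.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-qscale:v", "2", f"{frames_dir}/frame_%05d.jpg"],
        capture_output=True, check=True,
    )
    # Note: stale .jpg files from an earlier run in frames_dir are counted too.
    count = len([f for f in os.listdir(frames_dir) if f.endswith(".jpg")])
    print(f"Extracted {count} frames at {fps:.1f} FPS")
    return count, fps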
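
ffprobe reports the frame rate as a fraction string, and NTSC-style sources give values such as 30000/1001 rather than a whole number. The following is a minimal sketch of a more defensive parser. The helper name parse_frame_rate is hypothetical and not part of the original script; it also rejects a zero denominator, which would otherwise raise a ZeroDivisionError in the division above.

```python
from fractions import Fraction

def parse_frame_rate(rate: str) -> float:
    """Convert an ffprobe rate string such as "30000/1001" or "25" to a float."""
    num, _, den = rate.partition("/")
    denominator = int(den) if den else 1
    if denominator == 0:
        raise ValueError(f"frame rate {rate!r} has a zero denominator")
    fps = Fraction(int(num), denominator)
    if fps <= 0:
        raise ValueError(f"frame rate {rate!r} is not positive")
    return float(fps)

if __name__ == "__main__":
    print(round(parse_frame_rate("30000/1001"), 3))  # 29.97
    print(parse_frame_rate("25/1"))                  # 25.0
```

Inside extract_frames, the two lines that split and divide r_frame_rate could be replaced by a single call to this helper.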

Step 2: Detect Faces

Not every frame has a visible face. Skip frames without one to save API calls.

import requests

HOST = "deepfake-face-swap-ai.p.rapidapi.com"
HEADERS = {
    "x-rapidapi-host": HOST,
    "x-rapidapi-key": "YOUR_API_KEY",
}

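def detect_face(frame_path: str) -> bool:
    """Return True if the API reports at least one face in the frame."""
    with open(frame_path, "rb") as f:
        resp = requests.post(
            f"https://{HOST}/detect-faces",
            headers=HEADERS,
            files={"image": ("frame.jpg", f, "image/jpeg")},
            timeout=30,  # without a timeout, a stalled request blocks its worker forever
        )
    return resp.status_code == 200 and resp.json().get("total_faces", 0) > 0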

Step 3: Swap Faces

from pathlib import Path

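def swap_face(source: str, frame: str, output: str) -> bool:
    """Swap the source face onto one frame; return True if output was written."""
    with open(source, "rb") as s, open(frame, "rb") as t:
        resp = requests.post(
            f"https://{HOST}/swap-face",
            headers=HEADERS,
            files={
                "source_image": ("src.jpg", s, "image/jpeg"),
                "target_image": ("tgt.jpg", t, "image/jpeg"),
            },
            timeout=60,
        )
    if resp.status_code != 200:
        return False
    # The response carries a URL to the result image, which is downloaded separately.
    img = requests.get(resp.json()["image_url"], timeout=60)
    if img.status_code != 200:
        return False  # do not write an error page to disk as if it were a frame
    Path(output).write_bytes(img.content)
    return True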
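
When many frames are sent in a row, an occasional request may fail for transient reasons. The following is a minimal sketch of a retry wrapper with exponential backoff. The name with_retries, its parameters, and the policy of treating every failure as retryable are assumptions made for illustration, not documented behaviour of this API.

```python
import time
from typing import Callable

def with_retries(call: Callable[[], bool], attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run call() until it returns True or the attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries.
    An exception raised by call() counts as a failed attempt.
    """
    for attempt in range(attempts):
        try:
            if call():
                return True
        except Exception as exc:  # assumed policy: any error is worth one more try
            print(f"attempt {attempt + 1} failed: {exc}")
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return False
```

A call such as with_retries(lambda: swap_face(source, frame, output)) would then make up to three attempts before giving up on a frame. The sleep parameter exists only so the delay can be replaced in tests.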

Step 4: Process All Frames in Parallel

Each frame is an independent network call, so a thread pool can process several frames concurrently:

import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed

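def process_frame(source: str, frame: str, output: str) -> bool:
    """Swap one frame if it has a face; otherwise copy it through unchanged."""
    try:
        if detect_face(frame) and swap_face(source, frame, output):
            return True
    except (requests.RequestException, ValueError, KeyError) as exc:
        # A network or response-parsing error on one frame should not abort the run.
        print(f"{frame}: {exc}")
    # No face, failed swap, or error: keep the original frame so the sequence has no gaps.
    shutil.copy2(frame, output)
    return False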

def process_all(source_face: str, frames_dir: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    frames = sorted(f for f in os.listdir(frames_dir) if f.endswith(".jpg"))
    swapped = 0

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            pool.submit(process_frame, source_face, f"{frames_dir}/{f}", f"{output_dir}/{f}"): f
            for f in frames
        }
        for fut in as_completed(futures):
            if fut.result():
                swapped += 1

    print(f"{swapped}/{len(frames)} frames swapped")
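
In the loop above, an exception raised inside any worker resurfaces at fut.result() and stops the run. The following sketch uses the same futures-dictionary pattern but records which frames failed instead. The names run_frames and stub_worker are hypothetical, and the stub stands in for process_frame so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable

def run_frames(frames: Iterable[str], worker: Callable[[str], bool],
               max_workers: int = 4) -> tuple[int, list[str]]:
    """Run worker(frame) for every frame; return (swapped_count, failed_frames)."""
    swapped = 0
    failed: list[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its frame name so failures can be reported.
        futures = {pool.submit(worker, frame): frame for frame in frames}
        for fut in as_completed(futures):
            try:
                if fut.result():
                    swapped += 1
            except Exception:
                failed.append(futures[fut])
    return swapped, sorted(failed)

if __name__ == "__main__":
    def stub_worker(frame: str) -> bool:
        # Stand-in for process_frame: "swaps" even frames, fails on frame 3.
        n = int(frame.split("_")[1].split(".")[0])
        if n == 3:
            raise RuntimeError("simulated network error")
        return n % 2 == 0

    names = [f"frame_{i:05d}.jpg" for i in range(1, 7)]
    print(run_frames(names, stub_worker))  # (3, ['frame_00003.jpg'])
```

Completion order is not frame order, which is harmless here because each worker writes to an output path derived from its own frame name.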

Step 5: Reassemble Video

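def reassemble(frames_dir: str, original: str, output: str, fps: float):
    """Encode the processed frames to H.264 and mux in the original audio."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),                  # must match the extraction rate
        "-i", f"{frames_dir}/frame_%05d.jpg",    # input 0: processed frames
        "-i", original,                          # input 1: source of the audio track
        "-map", "0:v", "-map", "1:a?",
        "-c:v", "libx264", "-preset", "medium", "-crf", "18",
        "-pix_fmt", "yuv420p",                   # widely supported pixel format
        "-c:a", "aac", "-shortest", output,
    ], capture_output=True, check=True)          # check=True raises if ffmpeg fails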

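The -map 1:a? option takes the audio track from the original video, and -c:a aac re-encodes it as AAC. The trailing ? makes the audio mapping optional, so the command also works on videos that have no audio track.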

Optimization Tips

Processing every frame of a long video gets expensive. Here's how to cut costs:

Sample face detection — Check every 10th frame instead of every frame. If frames 100 and 110 have faces, 101–109 likely do too. Reduces detection calls by 90%.

Key frames only — Swap every 3rd frame, copy the result to adjacent frames. 66% fewer API calls with minimal visual difference in slow scenes.

Skip no-face sections — Landscape shots, close-ups of hands, back of head — skip them all.
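
The first two tips can be sketched as a planning step that decides, before any swap call is made, which frames to detect on and which swapped output each frame reuses. The function names and the "carry each sampled result forward to the next sample" policy are assumptions for illustration; has_face stands in for a wrapper around detect_face that takes a frame index.

```python
from typing import Callable, Optional

def sampled_face_flags(n_frames: int, has_face: Callable[[int], bool],
                       step: int = 10) -> list[bool]:
    """Call has_face on every step-th frame only and carry each result
    forward to the frames that follow, up to the next sampled frame."""
    flags: list[bool] = []
    for start in range(0, n_frames, step):
        found = has_face(start)  # one detection call per block of `step` frames
        flags.extend([found] * min(step, n_frames - start))
    return flags

def keyframe_sources(flags: list[bool], every: int = 3) -> list[Optional[int]]:
    """For each frame, return the index of the frame whose swapped output it
    reuses, or None if the frame has no face and is copied unchanged."""
    sources: list[Optional[int]] = []
    for i, face in enumerate(flags):
        key = i - (i % every)  # nearest key frame at or before i
        if not face:
            sources.append(None)
        elif flags[key]:
            sources.append(key)
        else:
            sources.append(i)  # key frame has no face, so swap this frame itself
    return sources

if __name__ == "__main__":
    flags = sampled_face_flags(8, lambda i: i >= 3, step=3)
    print(flags)  # [False, False, False, True, True, True, True, True]
    print(keyframe_sources(flags, every=2))  # [None, None, None, 3, 4, 4, 6, 6]
```

The number of swap calls is the number of distinct non-None indices in the plan, here three for eight frames, and the number of detection calls is one per sampled frame.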

Cost Estimation (30s clip at 30 FPS)

Strategy	API calls	Share of Pro quota (10,000 requests)
Every frame	~1,800	18%
Sampled detection	~990	10%
Key frames + sampled	~390	4%
50% face coverage + key frames	~240	2%

With optimizations, the Pro plan ($12.99/mo, 10,000 requests) handles multiple videos per month.
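
The call counts in the table above can be reproduced with simple arithmetic, assuming every checked frame costs one detection call and every swapped frame costs one swap call. The helper estimate_calls and its parameter names are hypothetical, introduced only to show the calculation.

```python
import math

def estimate_calls(duration_s: float, fps: float, detect_every: int = 1,
                   swap_every: int = 1, face_coverage: float = 1.0) -> int:
    """Estimate API calls for one clip: detection calls plus swap calls."""
    frames = round(duration_s * fps)
    detect_calls = math.ceil(frames / detect_every)
    face_frames = round(frames * face_coverage)  # frames assumed to contain a face
    swap_calls = math.ceil(face_frames / swap_every)
    return detect_calls + swap_calls

if __name__ == "__main__":
    print(estimate_calls(30, 30))                                  # 1800
    print(estimate_calls(30, 30, detect_every=10))                 # 990
    print(estimate_calls(30, 30, detect_every=10, swap_every=3))   # 390
    print(estimate_calls(30, 30, 10, 3, face_coverage=0.5))        # 240
```

Dividing the result by a plan's request quota gives the share of that quota a single clip consumes.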

Limitations

Not real-time — This is an offline pipeline. A 30s clip takes several minutes.
Temporal consistency — Each frame is independent. Use /enhance-face on swapped frames to smooth inconsistencies.
Multiple faces — Use /detect-faces to find the target face index, then /target-face instead of /swap-face.

The Face Swap API offers a free tier to test the full pipeline on a short clip.
