Face swapping in photos is one API call. But face swapping in video requires a pipeline: extract frames, swap faces on each one, reassemble. This tutorial walks you through the full workflow in Python using ffmpeg and a Face Swap API.
How It Works
A video is a sequence of frames. The strategy:
- Extract — Split video into JPEG frames with ffmpeg
- Filter — Detect which frames contain a face (skip the rest)
- Swap — Send each face frame to the API with the source face
- Reassemble — Stitch processed frames back into video with original audio
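The four stages above can be sketched as a small driver that receives each stage as a callable. This is only an illustration: `run_pipeline` is a hypothetical name, the stage signatures are assumed to match the functions defined in the steps that follow, and the filter and swap stages are assumed to be handled together by a single `process` callable.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_pipeline(
    video: str,
    source_face: str,
    output: str,
    extract: Callable[[str, str], tuple[int, float]],
    process: Callable[[str, str, str], None],
    reassemble: Callable[[str, str, str, float], None],
) -> int:
    """Chain the stages and return the number of extracted frames.

    run_pipeline is a hypothetical driver, not part of the original script.
    """
    # Intermediate frames live in a temporary directory that is removed
    # automatically once the output video has been written.
    with tempfile.TemporaryDirectory() as work:
        frames_dir = str(Path(work) / "frames")
        swapped_dir = str(Path(work) / "swapped")
        count, fps = extract(video, frames_dir)        # Extract
        process(source_face, frames_dir, swapped_dir)  # Filter + Swap
        reassemble(swapped_dir, video, output, fps)    # Reassemble
    return count
```

Passing the stages in as arguments keeps the driver testable without ffmpeg or network access; in the real pipeline you would pass `extract_frames`, `process_all`, and `reassemble`.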
Step 1: Extract Frames
import subprocess, os, json

def extract_frames(video_path: str, frames_dir: str) -> tuple[int, float]:
    os.makedirs(frames_dir, exist_ok=True)
    # Read the stream metadata to recover the source frame rate.
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_streams", video_path],
        capture_output=True, text=True, check=True,
    )
    vs = next(s for s in json.loads(probe.stdout)["streams"] if s["codec_type"] == "video")
    # r_frame_rate is a rational string such as "30/1".
    num, den = vs["r_frame_rate"].split("/")
    fps = int(num) / int(den)
    # Write every frame as a numbered JPEG; check=True raises if ffmpeg fails.
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-qscale:v", "2", f"{frames_dir}/frame_%05d.jpg"],
        capture_output=True, check=True,
    )
    count = len([f for f in os.listdir(frames_dir) if f.endswith(".jpg")])
    print(f"Extracted {count} frames at {fps:.1f} FPS")
    return count, fps
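ffprobe reports `r_frame_rate` as a rational string, and rates such as 30000/1001 do not survive a round trip through a float exactly. As a minimal sketch, assuming a hypothetical helper named `parse_frame_rate`, the value can be kept exact with `fractions.Fraction`, and an unusable zero denominator can be rejected explicitly:

```python
from fractions import Fraction


def parse_frame_rate(rate: str) -> Fraction:
    """Parse a rational string such as "30000/1001" into an exact Fraction.

    parse_frame_rate is a hypothetical helper, not part of the original script.
    """
    num, _, den = rate.partition("/")
    denominator = int(den) if den else 1  # a bare "30" means 30/1
    # A zero denominator (for example "0/0") cannot be turned into a rate.
    if denominator == 0:
        raise ValueError(f"unusable frame rate: {rate!r}")
    return Fraction(int(num), denominator)


ntsc = parse_frame_rate("30000/1001")
print(ntsc, f"{float(ntsc):.3f}")
```

ffmpeg's frame rate options accept the same `num/den` notation, so `str(ntsc)` can be passed to `-framerate` in place of a rounded decimal.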
Step 2: Detect Faces
Not every frame has a visible face. Skip frames without one to save API calls.
import requests

HOST = "deepfake-face-swap-ai.p.rapidapi.com"
HEADERS = {
    "x-rapidapi-host": HOST,
    "x-rapidapi-key": "YOUR_API_KEY",
}

def detect_face(frame_path: str) -> bool:
    with open(frame_path, "rb") as f:
        resp = requests.post(
            f"https://{HOST}/detect-faces",
            headers=HEADERS,
            files={"image": ("frame.jpg", f, "image/jpeg")},
        )
    return resp.status_code == 200 and resp.json().get("total_faces", 0) > 0
Step 3: Swap Faces
from pathlib import Path

def swap_face(source: str, frame: str, output: str) -> bool:
    with open(source, "rb") as s, open(frame, "rb") as t:
        resp = requests.post(
            f"https://{HOST}/swap-face",
            headers=HEADERS,
            files={
                "source_image": ("src.jpg", s, "image/jpeg"),
                "target_image": ("tgt.jpg", t, "image/jpeg"),
            },
        )
    if resp.status_code == 200:
        img = requests.get(resp.json()["image_url"])
        Path(output).write_bytes(img.content)
        return True
    return False
Step 4: Process All Frames in Parallel
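Use a thread pool to process multiple frames concurrently. Frames without a face, and frames where the swap fails, are copied through unchanged so the output sequence stays complete: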
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_frame(source: str, frame: str, output: str) -> bool:
    if detect_face(frame):
        if swap_face(source, frame, output):
            return True
    shutil.copy2(frame, output)
    return False

def process_all(source_face: str, frames_dir: str, output_dir: str):
    os.makedirs(output_dir, exist_ok=True)
    frames = sorted(f for f in os.listdir(frames_dir) if f.endswith(".jpg"))
    swapped = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {
            pool.submit(process_frame, source_face, f"{frames_dir}/{f}", f"{output_dir}/{f}"): f
            for f in frames
        }
        for fut in as_completed(futures):
            if fut.result():
                swapped += 1
    print(f"{swapped}/{len(frames)} frames swapped")
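One thing to watch in the loop above: `fut.result()` re-raises any exception thrown inside a worker, so a single network error would abort the whole run. The following is a rough sketch of a more forgiving variant, assuming a hypothetical function named `process_safely` that takes the per-frame worker as an argument. A frame whose worker raises is copied through unchanged and reported back to the caller.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def process_safely(
    jobs: list[tuple[str, str]],
    worker: Callable[[str, str], bool],
    max_workers: int = 4,
) -> tuple[int, list[str]]:
    """Run worker(frame, output) for each job; return (swapped, failed frames).

    process_safely is a hypothetical variant of process_all.
    """
    swapped, failed = 0, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, frame, out): (frame, out) for frame, out in jobs}
        for fut in as_completed(futures):
            frame, out = futures[fut]
            try:
                swapped += bool(fut.result())
            except Exception:
                # The worker raised: keep the original frame so the
                # numbered sequence has no gap, and remember the failure.
                shutil.copy2(frame, out)
                failed.append(frame)
    return swapped, sorted(failed)
```

With the functions from this step, the worker could be `functools.partial(process_frame, source_face)`, and the returned list of failed frames can be retried later.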
Step 5: Reassemble Video
def reassemble(frames_dir: str, original: str, output: str, fps: float):
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", str(fps),
        "-i", f"{frames_dir}/frame_%05d.jpg",
        "-i", original, "-map", "0:v", "-map", "1:a?",
        "-c:v", "libx264", "-preset", "medium", "-crf", "18",
        "-c:a", "aac", "-shortest", output,
    ], capture_output=True)
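The -map 1:a? option takes the audio track from the original video (the second input), and -c:a aac re-encodes it as AAC rather than copying it bit for bit. The trailing ? makes the audio mapping optional, so the command works whether or not the original has audio.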
Optimization Tips
Processing every frame of a long video gets expensive. Here's how to cut costs:
Sample face detection — Check every 10th frame instead of every frame. If frames 100 and 110 have faces, 101–109 likely do too. Reduces detection calls by 90%.
Key frames only — Swap every 3rd frame, copy the result to adjacent frames. 66% fewer API calls with minimal visual difference in slow scenes.
Skip no-face sections — Landscape shots, close-ups of hands, back of head — skip them all.
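The first two tips can be sketched as index arithmetic, independent of the API. The names `sampled_face_flags` and `key_frame_for` are hypothetical, indices are 0-based positions in the sorted frame list, and the inference rule is an assumption: a frame between two sampled frames is flagged only when both samples contain a face.

```python
from typing import Callable


def sampled_face_flags(
    n_frames: int, has_face: Callable[[int], bool], step: int = 10
) -> list[bool]:
    """Call has_face on every `step`-th frame only and infer the rest."""
    samples = {i: has_face(i) for i in range(0, n_frames, step)}
    flags = []
    for i in range(n_frames):
        if i % step == 0:
            flags.append(samples[i])  # measured directly
        else:
            lo = i - i % step
            # Frames after the last sample reuse that sample's result.
            flags.append(samples[lo] and samples.get(lo + step, samples[lo]))
    return flags


def key_frame_for(i: int, every: int = 3) -> int:
    """Index of the swapped key frame whose result frame i reuses."""
    return i - i % every


# Toy detector: a face is visible in the first 15 of 25 frames.
flags = sampled_face_flags(25, lambda i: i < 15)
print(sum(flags), "frames flagged using 3 detection calls")
```

The stricter "both neighbours" rule trades a few missed face frames at scene boundaries for fewer wasted swap calls; flagging a frame when either neighbour has a face would be the opposite trade-off.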
Cost Estimation (30s clip at 30 FPS)
| Strategy | API calls | Monthly quota used (Pro) |
|---|---|---|
| Every frame | ~1,800 | 22% |
| Sampled detection | ~990 | 12% |
| Key frames + sampled | ~390 | 5% |
| 50% face coverage + key frames | ~240 | 3% |
With optimizations, the Pro plan ($12.99/mo, 10,000 requests) handles multiple videos per month.
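The call counts in the table can be reproduced with a short calculation. This sketch assumes that each detection and each swap counts as one request and that downloading the result image does not; `estimate_calls` and its parameters are hypothetical names.

```python
import math


def estimate_calls(
    seconds: float,
    fps: float,
    detect_step: int = 1,
    swap_step: int = 1,
    face_coverage: float = 1.0,
) -> int:
    """One detection per sampled frame plus one swap per key frame with a face."""
    frames = round(seconds * fps)
    detections = math.ceil(frames / detect_step)
    swaps = math.ceil(frames * face_coverage / swap_step)
    return detections + swaps


strategies = {
    "Every frame": {},
    "Sampled detection": {"detect_step": 10},
    "Key frames + sampled": {"detect_step": 10, "swap_step": 3},
    "50% face coverage + key frames": {
        "detect_step": 10, "swap_step": 3, "face_coverage": 0.5,
    },
}
for name, options in strategies.items():
    # Prints 1800, 990, 390 and 240 calls for a 30 s clip at 30 FPS.
    print(f"{name}: ~{estimate_calls(30, 30, **options)} calls")
```

Changing `seconds` and `fps` gives a quick estimate for your own clip before you spend any quota.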
Limitations
- Not real-time: This is an offline pipeline. A 30s clip takes several minutes.
- Temporal consistency: Each frame is independent. Use /enhance-face on swapped frames to smooth inconsistencies.
- Multiple faces: Use /detect-faces to find the target face index, then /target-face instead of /swap-face.
The Face Swap API offers a free tier to test the full pipeline on a short clip.