Face Tracking for Vertical Video: Why It's Harder Than It Looks (And How It Works)

#ai #video #machinelearning #facetracking

Reformatting landscape video to vertical sounds like a simple crop operation. It isn't. The hard part is keeping a human face consistently framed in a 9:16 window while the subject moves, gestures, walks, and shifts position. That is what separates professional-quality vertical video from the static center crop that makes content feel wrong.

Here's what actually goes into solving this problem.

The Naive Approach and Why It Fails

The easiest implementation is a fixed center crop: take a full-height slice from the middle of the 16:9 frame, about a third of its width (roughly 608 of 1920 pixels at 1080p), and export it as 9:16. For static, tripod-mounted interviews, this works. For anything else, it fails in ways that compound quickly.
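To make the geometry concrete, here is a minimal sketch of the fixed center crop arithmetic. The function name center_crop_box is made up for illustration, and rounding the width to an even number is an assumption based on the even-dimension requirement of common H.264 pixel formats such as yuv420p.

```python
def center_crop_box(src_w: int, src_h: int, aspect_w: int = 9, aspect_h: int = 16):
    """Return (crop_w, crop_h, x, y) for a full-height center crop.

    Hypothetical helper: the name and signature are illustrative only.
    """
    crop_h = src_h
    # Ideal width for the target aspect ratio, rounded to the nearest even
    # number (assumption: the encoder wants even dimensions).
    crop_w = round(src_h * aspect_w / aspect_h / 2) * 2
    # Center the window horizontally; it already spans the full height.
    x = (src_w - crop_w) // 2
    return crop_w, crop_h, x, 0


print(center_crop_box(1920, 1080))  # (608, 1080, 656, 0)
```

At 1080p the window is 608 pixels wide, so a face only has to drift about 300 pixels from center before it starts leaving the frame.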

A host who leans left to emphasize a point. A guest who turns to face their interviewer. A speaker who walks across a stage. A YouTuber who gestures widely. In each case, the face exits the crop window, and you're left with a vertical video showing someone's shoulder or an empty background while the person is clearly talking.

Viewers register this immediately, even if they don't articulate it. The clip feels amateurish. Watch time drops in the first 5 seconds.

What Good Face Tracking Actually Requires

The production-quality approach runs face detection on every frame of the video, tracks the position of the primary speaker's face, and dynamically repositions a crop window that follows the subject while staying within the bounds of the original frame.

The components:

Face detection: Identifying where faces are in each frame. Modern solutions like MediaPipe Face Detection run in real time and handle varying distances, angles, and partial occlusion well. For single-speaker content, you're typically tracking one face: the primary presenter.

Subject persistence: If multiple faces appear, the system needs to maintain a consistent "primary" target rather than jumping between subjects. For interview content, this usually means tracking whoever is speaking, which requires some audio-to-speaker correlation or a simpler "largest face in frame" heuristic.
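The "largest face in frame" heuristic can be sketched as follows. This is an illustration under assumptions, not the pipeline's actual logic: the Box layout, the pick_primary name, and the switch_ratio threshold are all hypothetical. The idea is to keep following the face closest to the previous target and only switch when another face is clearly larger, which avoids flip-flopping between two similarly sized faces.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in pixels


def box_area(box: Box) -> float:
    return box[2] * box[3]


def box_center(box: Box) -> Tuple[float, float]:
    return (box[0] + box[2] / 2, box[1] + box[3] / 2)


def pick_primary(faces: Sequence[Box], previous: Optional[Box],
                 switch_ratio: float = 1.5) -> Optional[Box]:
    """Choose which detected face to follow in the current frame."""
    if not faces:
        return previous  # no detection this frame: hold the last target
    largest = max(faces, key=box_area)
    if previous is None:
        return largest  # first frame: plain largest-face heuristic
    px, py = box_center(previous)
    # The face that most plausibly continues the current track is the one
    # whose center is closest to the previous target's center.
    nearest = min(
        faces,
        key=lambda b: (box_center(b)[0] - px) ** 2 + (box_center(b)[1] - py) ** 2,
    )
    # Only switch subjects when another face is clearly larger.
    if box_area(largest) > switch_ratio * box_area(nearest):
        return largest
    return nearest
```

Speaker-aware switching would replace the size test with a signal from the audio track, which is outside the scope of this sketch.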

Smooth camera movement: Raw face-position data is noisy. If the crop window snaps to each new face position exactly, the "camera" shakes constantly. The solution is a smoothing pass: an exponential moving average (or another low-pass filter) applied to the stream of (x, y) positions before converting them to crop coordinates.

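import numpy as np

def smooth_positions(positions, alpha=0.15):
    """Exponential moving average smoothing for face tracking data.

    Accepts scalars or (x, y) pairs; returns a list of NumPy values.
    """
    if len(positions) == 0:
        return []
    smoothed = []
    # Seed with the first sample so the output starts at the first position.
    prev = np.asarray(positions[0], dtype=float)
    for pos in positions:
        # Converting to an array lets plain (x, y) tuples be scaled by alpha.
        prev = alpha * np.asarray(pos, dtype=float) + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed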

The alpha value controls responsiveness vs. smoothness. Lower alpha = smoother movement but slower to follow fast subject movement. 0.1-0.2 works well for most talking-head content.
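One way to reason about an alpha value is to ask how long the crop window takes to catch up after the subject moves suddenly. After n frames, an exponential moving average still carries (1 - alpha)^n of the original offset. The helper below is a small sketch of that calculation; its name and the 95% threshold are illustrative choices, not part of any library.

```python
import math


def ema_settle_frames(alpha: float, fraction: float = 0.95) -> int:
    """Frames needed for the EMA to cover `fraction` of a sudden step."""
    if not 0 < alpha < 1 or not 0 < fraction < 1:
        raise ValueError("alpha and fraction must be strictly between 0 and 1")
    # Remaining error after n frames is (1 - alpha) ** n of the step size,
    # so solve (1 - alpha) ** n <= 1 - fraction for n.
    return math.ceil(math.log(1 - fraction) / math.log(1 - alpha))


for a in (0.1, 0.15, 0.2):
    frames = ema_settle_frames(a)
    print(f"alpha={a}: {frames} frames, {frames / 30:.2f} s at 30 fps")
```

With alpha=0.15 the window needs 19 frames, a bit over half a second at 30 fps, to close 95% of the gap. Because alpha is applied per frame, the same value responds twice as fast in wall-clock time on 60 fps footage, so it is worth retuning when the frame rate or the detection sampling rate changes.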

Boundary handling: The crop window can't go outside the original frame. When the subject approaches the edge, the window stops at the boundary instead of following further, so the subject drifts off-center while the window stays put. This reads as a camera that has reached the limit of its pan rather than as a broken crop.
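Boundary handling comes down to a clamp when the face center is converted into the crop window's left edge. A minimal sketch, assuming a 1920-pixel-wide source and a 608-pixel-wide crop window (the function name and defaults are illustrative):

```python
def crop_x_for_face(face_cx: float, frame_w: int = 1920, crop_w: int = 608) -> int:
    """Left edge of a crop window centered on face_cx, kept inside the frame."""
    # Ideal left edge if the frame had no boundaries.
    x = face_cx - crop_w / 2
    # Largest left edge that still keeps the window fully inside the frame.
    max_x = frame_w - crop_w
    return int(round(min(max(x, 0), max_x)))


print(crop_x_for_face(960))   # 656: face in the middle, window centered
print(crop_x_for_face(100))   # 0: pinned to the left edge
print(crop_x_for_face(1900))  # 1312: pinned to the right edge
```

Clamping after smoothing, rather than before, keeps the window from overshooting the boundary when the smoothed position is still catching up.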

The FFmpeg Implementation

Once you have a stream of (time, crop_x, crop_y) values, FFmpeg's crop filter handles the actual reframing. Its x and y options accept expressions that are re-evaluated for every frame and can reference the frame timestamp t:

ffmpeg -i input.mp4 \
  -vf "crop=w=608:h=1080:x='if(lt(t,0.5),320,if(lt(t,1.0),280,360))':y=0" \
  -c:v libx264 -crf 23 output.mp4

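For production use, you'd generate the crop expression programmatically from your face-position data rather than hardcoding values. Note that FFmpeg expressions do not interpolate on their own: the example above jumps between three fixed positions at t=0.5 and t=1.0. To get smooth motion, write the interpolation into the expression itself, or supply positions densely enough that the steps are not visible. For long videos a single nested expression becomes unwieldy, so it may be easier to process the video in chunks.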
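As a sketch of what generating the expression could look like, the helper below turns a short list of (time, x) keyframes into a nested if() expression with the linear interpolation written out as plain arithmetic. The name build_crop_x_expr is hypothetical, the keyframe times are assumed to be strictly increasing, and the approach is only practical for a modest number of keyframes.

```python
from typing import List, Tuple


def build_crop_x_expr(keyframes: List[Tuple[float, int]]) -> str:
    """Build an FFmpeg expression for crop x from (time, x) keyframes.

    Assumes strictly increasing times. Holds the first x before the first
    keyframe and the last x after the last keyframe.
    """
    if not keyframes:
        raise ValueError("need at least one keyframe")
    keyframes = sorted(keyframes)
    # Fallback once t is past the final keyframe: hold the last position.
    expr = str(keyframes[-1][1])
    # Wrap segments from last to first so the earliest test ends up outermost.
    for (t0, x0), (t1, x1) in reversed(list(zip(keyframes, keyframes[1:]))):
        # Linear interpolation between (t0, x0) and (t1, x1).
        segment = f"{x0}+({x1 - x0})*(t-{t0})/{t1 - t0}"
        expr = f"if(lt(t,{t1}),{segment},{expr})"
    t_first, x_first = keyframes[0]
    return f"if(lt(t,{t_first}),{x_first},{expr})"


expr = build_crop_x_expr([(0.0, 320), (0.5, 280), (1.0, 360)])
# Single quotes protect the commas inside the expression from the
# filtergraph parser, as in the command shown earlier.
video_filter = f"crop=w=608:h=1080:x='{expr}':y=0"
print(video_filter)
```

Passing video_filter as a single argument in a subprocess argument list (for example after "-vf") avoids a second layer of shell quoting.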

Combining the crop with the caption overlay in a single FFmpeg pass avoids a second decode and encode, which saves significant processing time compared with piping the video through multiple stages.

Where This Gets Deployed

At ClipSpeedAI, this pipeline runs on every video processed. The face tracking stage takes 20-40% of total job processing time depending on video resolution and length. For a 60-minute 1080p video at 30fps, the face detection pass alone covers 108,000 frames (60 x 60 x 30).

Several optimizations reduce this to practical timeframes:

Run detection on a downscaled copy (480p) and apply the results, scaled back up, to the full-resolution crop
Sample in time: detect on every 3rd frame and interpolate between detections
Run detection on chunks concurrently rather than processing the whole video sequentially
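The first two optimizations can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, detections are given as a frame-index-to-x mapping, and skipped frames are filled by linear interpolation (other interpolation schemes would also work).

```python
from typing import Dict, List


def scale_to_full(x_detect: float, detect_w: int, full_w: int) -> float:
    """Map an x coordinate from the downscaled detection frame to full size."""
    return x_detect * full_w / detect_w


def interpolate_detections(detected: Dict[int, float], total_frames: int) -> List[float]:
    """Fill in a position for every frame from sparse detections."""
    if not detected:
        raise ValueError("need at least one detection")
    frames = sorted(detected)
    out: List[float] = []
    j = 0  # index of the latest detection at or before the current frame
    for f in range(total_frames):
        while j + 1 < len(frames) and frames[j + 1] <= f:
            j += 1
        f0 = frames[j]
        if f <= f0 or j + 1 == len(frames):
            # On a detected frame, before the first detection, or after the
            # last one: hold the known value.
            out.append(detected[f0])
        else:
            # Between two detections: blend linearly by frame distance.
            f1 = frames[j + 1]
            w = (f - f0) / (f1 - f0)
            out.append(detected[f0] + w * (detected[f1] - detected[f0]))
    return out


# Detections on every 3rd frame, in full-resolution pixels.
sparse = {0: 100.0, 3: 130.0, 6: 100.0}
print(interpolate_detections(sparse, total_frames=8))
```

The interpolated stream can then go through the same smoothing and clamping steps as per-frame detections.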

The result is vertical video that feels like a camera operator was actually following the subject — the standard that viewers now expect from professional short-form content.

The Practical Impact

Creators who switch from center-crop to tracked vertical video consistently see higher average watch time on their Shorts and Reels. The improvement isn't marginal. The subjective quality difference is immediately apparent to any viewer, even without understanding the technical reason.

For interview content, podcast clips, and lecture footage — the most common sources of short-form content — proper face tracking is the difference between clips that feel produced and clips that feel like something went wrong.

ClipSpeedAI applies this face tracking pipeline automatically to every video processed, no configuration required.