DEV Community

Om Prakash

We Integrated Netflix's VOID Model Into Our API — Here's What Nobody Tells You About Video Object Removal

Standard video inpainting tools have a dirty secret. Remove a person from a shot with Runway or Pika and you get ghost artifacts, lingering shadows, and objects the person was holding left floating in mid-air. The model patched the pixels; it never understood the physics.

Netflix released VOID (Video Object and Interaction Deletion) to solve exactly this. After a few weeks of integration work, we now have it running behind our /v1/video/void_removal endpoint. Here's the honest technical rundown of what changed, what improved, and what still trips us up.

The Core Problem VOID Solves

Traditional video inpainting works frame-by-frame. Each frame gets its own 2D inpaint. The result: a coffee cup that was in a character's hand just floats there when the hand is removed. Shadows persist. Reflections don't update. The footage looks "corrected" in a way that's subtly wrong.

VOID approaches this as a 4D problem — it jointly reasons about space and time, using a 5B parameter CogVideoX transformer. When you remove an object, VOID automatically handles:

  • Associated shadow removal (static and moving shadows)
  • Object drop physics (things that were being held now fall, properly)
  • Reflected surface restoration (mirrors, floors, wet surfaces update correctly)
  • Seamless background completion across frames

The quality difference on real footage is significant. We've been running internal tests on product videos, interview backdrops, and B-roll clips. The ghost artifact rate dropped substantially compared to sequential frame inpainting.

Integration Notes for Anyone Doing This

Our pipeline looks like this:

Video → KeyFrame Detection → VOID Processing → Frame Interpolation → Output

The key engineering decisions we made:

1. Keyframe-based processing — Running VOID on every frame is expensive and slow. We run it on keyframes and propagate the mask using temporal consistency, then apply VOID only where the mask changes significantly.

2. Prompt engineering matters — The VOID model responds to natural language descriptions of what to remove and what the environment should look like. "Remove the man in the blue shirt standing in front of the whiteboard" works better than "remove person."

3. Resolution ceiling — VOID was trained on specific resolution ranges. Upscaling to 4K and then running VOID produces artifacts. We've settled on 720p as the sweet spot for quality vs. compute cost.

4. Shadow detection is a separate pass — VOID handles shadows well in controlled environments but can miss complex multi-shadow scenarios. We run a secondary shadow detection pass and merge results.
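To make decision (1) concrete, here's a minimal sketch of the keyframe selection logic, assuming you already have a binary object mask per frame. The `select_keyframes` helper and its thresholds are illustrative, not our production code:

```python
import numpy as np

def select_keyframes(masks, iou_threshold=0.85, stride=8):
    """Pick frames where the object mask drifts enough to need a fresh
    VOID pass. All other frames reuse the nearest keyframe's result.

    masks: list of binary numpy arrays (H, W), one per frame.
    Returns the indices of frames to send through the heavy model.
    """
    def iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 1.0

    keyframes = [0]
    last = masks[0]
    for i in range(1, len(masks)):
        # Re-run VOID when the mask has moved significantly,
        # or on a fixed stride as a safety net against slow drift.
        if iou(masks[i], last) < iou_threshold or i - keyframes[-1] >= stride:
            keyframes.append(i)
            last = masks[i]
    return keyframes
```

The IoU check catches sudden mask changes; the stride fallback guards against slow drift that never trips the threshold in any single step.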
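For (3), enforcing the resolution ceiling before processing is a small calculation worth getting right. A sketch, where the `cap_resolution` helper is hypothetical but the 720p cap is the one described above:

```python
def cap_resolution(width, height, max_short_side=720):
    """Scale dimensions down so the shorter side is at most
    max_short_side pixels, preserving aspect ratio."""
    short = min(width, height)
    if short <= max_short_side:
        return width, height  # already within the ceiling
    scale = max_short_side / short
    # Round to even dimensions so common video codecs accept the output.
    return (int(round(width * scale / 2)) * 2,
            int(round(height * scale / 2)) * 2)
```

Downscaling first, running VOID, then upscaling the untouched regions back is cheaper and cleaner than feeding 4K frames to a model trained well below that resolution.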
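And for (4), merging the secondary shadow pass into the removal mask can be as simple as a union plus a small dilation so soft shadow edges get covered. A dependency-free NumPy sketch; `merge_removal_masks` is illustrative, not our exact pipeline:

```python
import numpy as np

def merge_removal_masks(object_mask, shadow_mask, dilate_px=1):
    """Union the object mask with the secondary shadow-detector mask,
    then dilate slightly so the inpaint region covers soft shadow edges.
    Note: np.roll wraps at borders, which is fine for masks that
    don't touch the frame edge."""
    merged = np.logical_or(object_mask, shadow_mask)
    for _ in range(dilate_px):
        # 4-neighbour dilation via shifted copies (no SciPy dependency).
        merged = (merged
                  | np.roll(merged, -1, axis=0) | np.roll(merged, 1, axis=0)
                  | np.roll(merged, -1, axis=1) | np.roll(merged, 1, axis=1))
    return merged
```

In a real pipeline you'd likely want a proper morphological dilation (e.g. OpenCV's `cv2.dilate`), but the union-then-grow shape is the same.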

What Still Doesn't Work Great

Honest answer:

  • Very fast motion — VOID struggles with rapid camera pans and fast-moving objects. Motion blur confuses the interaction detection.
  • Multiple overlapping interactions — Removing two people who are interacting (e.g., shaking hands) creates ambiguous physics predictions.
  • Transparent/semi-transparent objects — Glass, water, smoke are still hard. The model sometimes removes too much or too little.
  • 4K+ output — We cap at 720p for now. 1080p is on the roadmap but needs more VRAM than our current setup.

Pricing

We priced VOID at 200 credits per generation (~$0.20), roughly half the cost of equivalent compute on commercial alternatives. For context: a 10-second clip at 24 fps is 240 frames feeding the keyframe computation. At our pricing it's still heavily subsidized by our GPU infrastructure; we're pricing for adoption, not margin.
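That arithmetic, as a tiny estimator. The helper is hypothetical; the numbers (200 credits per generation, ~$0.001 per credit, 24 fps) come from this post:

```python
def estimate_job(duration_s, fps=24, credits=200, usd_per_credit=0.001):
    """Back-of-envelope cost for one VOID removal job:
    frame count from clip length, flat per-generation credit price."""
    frames = int(duration_s * fps)
    return {"frames": frames, "credits": credits,
            "usd": credits * usd_per_credit}
```

So a 10-second clip at 24 fps works out to 240 frames and about $0.20.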

Try It

import requests

response = requests.post(
    "https://api.pixelapi.dev/v1/video/void_removal",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "video_url": "https://your-site.com/input.mp4",
        "prompt": "remove the person standing in front of the bookshelf",
        "fps": 24
    },
    timeout=300,  # video jobs can take a while
)
response.raise_for_status()  # surface 4xx/5xx errors instead of parsing an error body
print(response.json())

Free credits on signup: pixelapi.dev — no card required.

Documentation: api.pixelapi.dev/docs


Built by a solo founder running this on a cluster of RTX 4070s and an RTX 6000 Ada in a home office rack. Questions welcomed — I read everything.
