DEV Community

Ben Utting
Cinematic Product Videos with fal.ai and Kling 3.0 for $1 a Scene

A client needed social media videos of their product in six different lifestyle scenes. Professional shoots would have cost thousands per location. We did all six for about $6 total, in under an hour.

The pipeline is two API calls: one to place the real product into a generated scene, one to animate it into a 5-second video with sound. Both run through fal.ai.

The brief

The client had a small physical product and a solid brand page with plenty of existing content. He sent me an AI-generated video he'd seen of someone walking through New York, with a product woven seamlessly into the footage. He wanted something similar for his own brand: cinematic scenes showing the product in restaurant and bar settings, generated entirely from a single product photo.

The goal was to build a repeatable skill that could produce these scenes on demand, not just a one-off video.

Step 1: place the product into a scene

The first script uses Google's Nano Banana 2 edit model via fal.ai. You give it a reference photo of the real product and a text prompt describing the scene you want. It generates a new image with the product placed naturally into that environment, preserving the product's appearance, label, and proportions.

```shell
python generate_kontext.py product_photo.jpg \
  "Product on white linen table, candlelit restaurant, beside wine glass, warm golden light, cinematic" \
  --variations 5
```

The --variations 5 flag is important. AI image generation is inconsistent. Out of five attempts, usually two or three look good. One will be excellent. The rest get discarded. At $0.04 per image, generating five costs $0.20. Cheap enough to always overshoot.

One thing I learned: prompts need a scale anchor. If the product is small, the model will sometimes scale it up to fill the scene. Always include a size reference in the prompt: a wine glass, a hand, a plate. Something that tells the model how big the product actually is relative to its surroundings.
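To make the request shape concrete, here's a rough sketch of what a script like this might send. The endpoint id, argument names, and helper function are my assumptions modeled on typical fal.ai image-edit endpoints, not the actual internals of generate_kontext.py:

```python
# Hypothetical payload builder for Step 1; key names are assumptions,
# not the real script's code.
def build_image_request(product_photo_url: str, scene_prompt: str,
                        scale_anchor: str = "beside a wine glass",
                        variations: int = 5) -> dict:
    """One request per scene, with a scale anchor baked into the prompt."""
    return {
        "image_url": product_photo_url,          # reference photo of the real product
        "prompt": f"{scene_prompt}, {scale_anchor}",
        "num_images": variations,                # overshoot at ~$0.04/image
    }

req = build_image_request(
    "https://example.com/product_photo.jpg",
    "Product on white linen table, candlelit restaurant, warm golden light, cinematic",
)
# Submitting would look roughly like this (requires the fal-client package
# and a FAL_KEY; the endpoint id below is a guess):
# import fal_client
# result = fal_client.subscribe("fal-ai/nano-banana/edit", arguments=req)
```

Baking the scale anchor into the builder's default means you can't forget it on a new scene.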

Step 2: animate the winner

The second script takes the best image from Step 1 and turns it into a 5-second video using Kling 3.0 Pro, also via fal.ai. It generates native audio too: sizzling sounds for a kitchen scene, ambient restaurant noise, clinking glasses.

```shell
python generate_video.py \
  "Hand reaches for product, picks it up, tilts gently, slow motion" \
  --image_url "https://fal.media/files/..." \
  --duration 5 \
  --cfg_scale 1.0
```

The cfg_scale setting matters. The default (0.5) gives the model creative freedom, which is fine for abstract content but bad for product shots. Setting it to 1.0 forces the model to follow the prompt closely. For product content, you want maximum adherence: the product should stay in frame, the motion should be what you described, nothing should morph or distort.

One video takes 60 to 180 seconds to generate and costs about $0.80. Combined with the image step, a full scene (5 image variations + 1 video) runs to about $1.
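The video call mirrors the image call. Here's a hedged sketch of the arguments, with key names assumed from the flags above rather than taken from the real generate_video.py:

```python
# Hypothetical arguments for the Kling image-to-video step; the URL and
# key names here are illustrative assumptions.
def build_video_request(image_url: str, motion_prompt: str,
                        duration: int = 5, cfg_scale: float = 1.0) -> dict:
    return {
        "image_url": image_url,   # the winning image from Step 1
        "prompt": motion_prompt,
        "duration": duration,     # seconds
        "cfg_scale": cfg_scale,   # 1.0 = strict adherence; the 0.5 default drifts
    }

video_req = build_video_request(
    "https://example.com/winning_scene.png",
    "Hand reaches for product, picks it up, tilts gently, slow motion",
)
```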

The scenes we built

We created a prompt library with six scenes, each with an image prompt and a matching motion prompt. Restaurant lifestyle, in-hand close-ups, kitchen action shots, moody food pairings, textured product beauty shots, and bar settings.

Each scene follows the same workflow: two commands, one decision (pick the best of five images), one output (a 5-second video with audio). Total cost for all six scenes: about $6. Total time: under an hour, including prompt iteration.

The prompt library is the reusable part. Once you've dialled in the style and scale for one product, adapting it for another is just swapping the product description and the reference photo.
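One plausible shape for that library: each scene pairs an image prompt (Step 1) with a motion prompt (Step 2). The prompt text below is placeholder written to illustrate the structure, not the client's actual library:

```python
# Illustrative prompt library: six scenes, each with an image prompt and
# a matching motion prompt. All prompt text is placeholder.
PROMPT_LIBRARY = {
    "restaurant_lifestyle": {
        "image": "Product on linen table, candlelit restaurant, cinematic",
        "motion": "Hand reaches for product, tilts gently, slow motion",
    },
    "in_hand_closeup": {
        "image": "Product held in hand, shallow depth of field, soft light",
        "motion": "Fingers rotate product slowly toward camera",
    },
    "kitchen_action": {
        "image": "Product on counter beside sizzling pan, steam rising",
        "motion": "Steam drifts past product, gentle camera push-in",
    },
    "food_pairing": {
        "image": "Product beside plated dish, moody low-key lighting",
        "motion": "Slow dolly across the table toward product",
    },
    "beauty_shot": {
        "image": "Macro of product label, textured backdrop, rim light",
        "motion": "Light sweeps across label, subtle parallax",
    },
    "bar_setting": {
        "image": "Product on bar top beside cocktail, neon reflections",
        "motion": "Glasses clink in background, slow pan to product",
    },
}
```

Keeping the two prompts side by side per scene means the motion always matches the composition it animates.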

What I'd do differently

Batch the image generation. Right now each scene is a separate script invocation. A wrapper that runs all six scenes, generates all 30 images, and presents them for review in one pass would save time.
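A minimal batching sketch: submit all six scenes in one pass instead of six script runs. Here submit_scene is a stand-in for the real fal.ai call so the structure runs without an API key:

```python
# Fan out all scene submissions with a thread pool; fal.ai jobs are
# network-bound, so threads are enough. submit_scene is a placeholder.
from concurrent.futures import ThreadPoolExecutor

def submit_scene(scene_name: str) -> str:
    # The real version would call fal.ai and return the five image URLs.
    return f"submitted:{scene_name}"

scenes = ["restaurant", "in-hand", "kitchen", "food-pairing", "beauty", "bar"]
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(submit_scene, scenes))
```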

Test 9:16 for Stories and Reels. All our content was 16:9. Kling supports 9:16 for vertical video, but only in text-to-video mode (not image-to-video). For Instagram Reels, you'd need to either crop or generate the initial image at 9:16.

Build a prompt template system. The prompt library works, but it's manual. A template where you swap in the product name, size description, and setting would make this reusable across clients without rewriting prompts from scratch.
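One way such a template system could look, with field names that are my own invention:

```python
# Hypothetical prompt template: swap product, surface, setting, and scale
# anchor per client instead of rewriting the whole prompt.
IMAGE_TEMPLATE = ("{product} on {surface}, {setting}, "
                  "beside {scale_anchor}, warm golden light, cinematic")

def render_prompt(product: str, surface: str, setting: str,
                  scale_anchor: str) -> str:
    """Fill the template for a new client or product."""
    return IMAGE_TEMPLATE.format(product=product, surface=surface,
                                 setting=setting, scale_anchor=scale_anchor)

prompt = render_prompt("hot-sauce bottle", "white linen table",
                       "candlelit restaurant", "a wine glass")
```

The scale anchor stays a first-class field, so the size-reference lesson from Step 1 carries over to every new product automatically.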

Why this works for small brands

This client is a bootstrapped D2C brand. There's no budget for location shoots across six restaurants. But the social content needs to look premium because the product is premium.

This pipeline delivers that. Five minutes per scene, a dollar per video, and the output looks like it came from a production studio. The client picks from five image options, approves one, and gets a ready-to-post video with sound. No photographer, no stylist, no venue booking.

If you're selling a physical product and need lifestyle content at scale, this exact pipeline works. Two scripts, one API key, and a good product photo to start from.

ctrlaltautomate.com
