If you’ve been playing around with AI video generation lately, you already know the struggle: the tech is insanely cool, but sometimes getting it to output exactly the format you want feels like trying to center a <div> in 2014.
Recently, I needed to generate a perfectly looping, high-quality square (1:1) video with audio using Google's new video models. The problem? Native aspect ratio support can be finicky depending on the model tier, and cropping a generated 16:9 or 9:16 video often ruins the framing or exposes weird artifacts the model hallucinated at the edges.
So, I had to let it cook. I came up with a slightly hacky but reliable workaround using NanoBanana 2, Veo 3.1 Lite, and our old reliable friend, FFmpeg.
Here is the ultimate pipeline to get flawless square AI videos:
TL;DR
- Start with a square image concept.
- Ask NanoBanana 2 to convert it to a 9:16 aspect ratio by literally just padding the top and bottom with black bars.
- Feed that phone-format 9:16 image into Veo 3.1 Lite as your start and end frames to force a loop.
- Run a quick Python script using ffmpeg to slice off the black bars.
Boom. Perfect square video. Perfect audio sync. And no weird edge hallucinations. Here’s how to automate this flow using Python. 🐍
Step 1: Generating the "phone format" 9:16 frames with NanoBanana 2
First, we need to generate our 9:16 image with the black bars baked in. Using the new Gemini API SDK, we can prompt NanoBanana 2 to do the heavy lifting for us.
from google import genai
from google.genai import types

# Initialize your client
client = genai.Client(api_key="YOUR_API_KEY")

def generate_padded_frame(prompt, output_filename):
    print("🎨 Generating padded 9:16 image with NanoBanana 2...")
    # We explicitly tell NanoBanana 2 to give us a 9:16 image
    # where the subject is a square in the middle, padded by black bars.
    hacked_prompt = f"{prompt}. Keep the main subject perfectly square in the center, and pad the top and bottom with solid black bars to make the overall aspect ratio 9:16."
    result = client.models.generate_images(
        model='nanobanana-2',  # Our trusty image model
        prompt=hacked_prompt,
        config=types.GenerateImagesConfig(
            number_of_images=1,
            aspect_ratio="9:16",
            output_mime_type="image/jpeg"
        )
    )
    # Save the output
    for generated_image in result.generated_images:
        image = generated_image.image
        image.save(output_filename)
        print(f"✅ Saved to {output_filename}")

# Generate our start/end frame
generate_padded_frame("A majestic pink flamingo standing in a serene pond", "flamingo_padded.jpg")
Step 2: Generating the video with Veo 3.1 Lite
Now that we have our 9:16 image with black bars (flamingo_padded.jpg), we pass it to Veo 3.1 Lite. By using the same image as the visual prompt, we ensure the video maintains those exact black bars throughout the generation process.
(Note: In the Veo web UI, you can set this as the start and end frame for a perfect loop. Here is the API equivalent for generating the video from your image).
import time

def generate_video(image_path, video_prompt, output_filename):
    print("🎬 Sending frame and prompt to Veo 3.1 Lite...")
    # Assumption: Veo is served through the SDK's long-running generate_videos
    # call rather than generate_content, so the image is passed in directly
    # instead of being uploaded first. Check the model ID and this call
    # against the current Gemini API docs before relying on it.
    # We ask it to animate the subject but keep the black bars untouched
    operation = client.models.generate_videos(
        model='veo-3.1-lite',
        prompt=f"{video_prompt}. The flamingo moves slightly, but the black bars at the top and bottom must remain exactly the same.",
        image=types.Image.from_file(location=image_path),
    )
    # Video generation is asynchronous: poll until the operation finishes
    while not operation.done:
        print(".", end="", flush=True)
        time.sleep(10)
        operation = client.operations.get(operation)
    # Download the result and write the actual MP4 bytes to disk
    # (writing response.text to a file would not produce a playable video)
    video = operation.response.generated_videos[0]
    client.files.download(file=video.video)
    video.video.save(output_filename)
    print(f"\n✅ Video generated and saved as {output_filename}")

generate_video("flamingo_padded.jpg", "Cinematic shot of a flamingo looking around", "raw_veo_output.mp4")
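One thing worth guarding against: a polling loop like the one above has no timeout, so a stuck job will spin forever. Here is a minimal, SDK-agnostic sketch of a polling helper with a deadline. The name `poll_until` and its parameters are my own for illustration, not part of the Gemini SDK.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def poll_until(
    fetch: Callable[[], T],
    is_done: Callable[[T], bool],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() repeatedly until is_done(result) is true or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch()
        if is_done(state):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"gave up after {timeout_s} seconds")
        sleep(interval_s)
```

You would pass in whatever "fetch the latest status" call your SDK version exposes as `fetch`, and a small lambda that checks its state as `is_done`.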
Step 3: The ffmpeg post-processing
Now we have a beautiful video of a flamingo, but it's a 9:16 file with annoying black bars at the top and bottom.
We could crop this frame-by-frame using Python libraries like MoviePy, but honestly? Calling ffmpeg via the subprocess module is much faster, uses way less memory, and most importantly: with -c:a copy it passes the audio stream through untouched instead of degrading it through re-encoding. (The video stream does get re-encoded, since cropping changes the pixels.)
Since the video is 9:16, cropping it to iw:iw (input width : input width) creates a perfect 1:1 square. When you don't pass x/y offsets, FFmpeg's crop filter centers the crop, so it slices equal amounts off the top and bottom. That removes the black bars cleanly as long as both bars are the same height.
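To make the arithmetic concrete, here is a small sketch of the centered crop that crop=iw:iw resolves to. The helper names are hypothetical, and the 720x1280 size is only an example of a 9:16 frame, not something Veo is guaranteed to output.

```python
def centered_square_crop(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (out_w, out_h, x, y) for the largest centered square crop."""
    side = min(width, height)
    x = (width - side) // 2   # horizontal offset of the crop window
    y = (height - side) // 2  # vertical offset of the crop window
    return side, side, x, y

def crop_filter(width: int, height: int) -> str:
    """Spell out the crop as an explicit FFmpeg filter string."""
    out_w, out_h, x, y = centered_square_crop(width, height)
    return f"crop={out_w}:{out_h}:{x}:{y}"

# For a 720x1280 frame: 1280 - 720 = 560 rows to remove, 280 from each end.
print(crop_filter(720, 1280))  # crop=720:720:0:280
```

So each bar needs to be 280 px tall on a 720x1280 frame (about 22% of the height) for the centered crop to land exactly on the square.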
import subprocess

def crop_to_square(input_video, output_video):
    print("✂️ Cropping out the black bars with FFmpeg...")
    command = [
        'ffmpeg',
        '-y',                 # Overwrite output if it exists
        '-i', input_video,    # Input file
        '-vf', 'crop=iw:iw',  # Video Filter: Crop to width x width (centered by default)
        '-c:a', 'copy',       # Copy the audio as-is (chef's kiss for performance)
        output_video
    ]
    try:
        # Capture stderr so a failure shows FFmpeg's actual error message
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE, text=True)
        print(f"🔥 Success! Perfectly square video saved to {output_video}")
    except subprocess.CalledProcessError as e:
        print(f"💀 FFmpeg failed: {e}\n{e.stderr}")

# Run the final crop
crop_to_square("raw_veo_output.mp4", "final_square_flamingo.mp4")
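If you want to double-check the result, a sketch like this asks ffprobe (it ships with FFmpeg) for the output dimensions and confirms they are square. The function names are mine, and it assumes ffprobe is on your PATH.

```python
import json
import subprocess

def parse_video_size(ffprobe_json: str) -> tuple[int, int]:
    """Extract (width, height) of the first video stream from ffprobe's JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        raise ValueError("no video stream found")
    return int(streams[0]["width"]), int(streams[0]["height"])

def probe_video_size(path: str) -> tuple[int, int]:
    """Run ffprobe on a file and return its (width, height)."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "json", path],
        check=True, capture_output=True, text=True,
    )
    return parse_video_size(result.stdout)

def is_square(size: tuple[int, int]) -> bool:
    width, height = size
    return width == height
```

Calling `is_square(probe_video_size("final_square_flamingo.mp4"))` after the crop should return True if everything went to plan.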
Why this workaround actually... works
- Framing control: When you force the AI to outpaint black bars first, you control the framing of the main subject. You aren't relying on the video model to guess what to keep in the center.
- Audio preservation: The '-c:a', 'copy' flag in FFmpeg ensures you don't lose any audio fidelity when manipulating the video file.
- Zero hallucinations: Because the video model is explicitly told to keep the black bars, it doesn't waste compute trying to generate weird background details at the extreme top and bottom edges.
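All of this leans on the model actually keeping the bars where you asked. One way to sanity-check that before cropping is FFmpeg's cropdetect filter, which logs the non-black region it finds on each frame. Below is a rough sketch that parses those log lines. The helper names and the 16 px tolerance are my own assumptions (cropdetect rounds its output dimensions, to a multiple of 16 by default, so a little slack seems reasonable).

```python
import re
import subprocess
from collections import Counter

CROP_PATTERN = re.compile(r"crop=(\d+):(\d+):(\d+):(\d+)")

def most_common_crop(ffmpeg_log: str):
    """Return the (w, h, x, y) that cropdetect reported most often, or None."""
    matches = CROP_PATTERN.findall(ffmpeg_log)
    if not matches:
        return None
    best, _count = Counter(matches).most_common(1)[0]
    return tuple(int(value) for value in best)

def bars_match_square(crop, width: int, height: int, tolerance_px: int = 16) -> bool:
    """Check that the detected region is roughly the centered square."""
    expected = (width, width, 0, (height - width) // 2)
    return all(abs(a - b) <= tolerance_px for a, b in zip(crop, expected))

def detect_crop(path: str):
    """Run cropdetect over the whole file; FFmpeg writes its findings to stderr."""
    result = subprocess.run(
        ["ffmpeg", "-hide_banner", "-i", path, "-vf", "cropdetect", "-f", "null", "-"],
        check=True, capture_output=True, text=True,
    )
    return most_common_crop(result.stderr)
```

If `bars_match_square` comes back False, the bars drifted or the model painted into them, and a blind crop=iw:iw will either leave a sliver of black or cut into the subject.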
Sometimes the best engineering solutions are just stacking simple tools together in a trench coat. 🧥
Have you guys found any other weird/genius hacks for wrangling AI video generation APIs? Drop them in the comments, I’d love to test them out!
(P.S. Make sure you have ffmpeg already installed on your machine before running the Python script, or it will yell at you).
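If you'd rather get a friendly error than a stack trace, a tiny pre-flight check like this one works. It's a sketch using the standard library's shutil.which, and `require_tool` is a name I made up.

```python
import shutil

def require_tool(name: str) -> str:
    """Return the full path to an executable, or raise if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH; install it before running the pipeline")
    return path

# Example: call this once at the top of the script
# require_tool("ffmpeg")
```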

Top comments (5)
The "pad with black bars then crop" trick is so simple but smart.
I've been fighting with aspect ratios on video gen APIs for weeks and never thought to just bake the padding into the image first.
The -c:a copy ffmpeg flag to skip re-encoding audio is a nice touch too. I always forget that's an option and end up with degraded audio.
Definitely stealing this pipeline. 🔥
👍️👍 👍
The workaround using crop=iw:iw in FFmpeg for square videos is clever, but I'm curious about the impact on video quality when removing the black bars. Does it affect the resolution or metadata? I've been using prachub.com for technical screens, and their follow-up questions about video processing were really relevant in a recent round I had. It's been more reliable than just browsing random forums.
the format wrestling is such a consistent tax with new AI APIs. spend the first week fighting aspect ratios and cropping artifacts, then finally get 10 minutes on the actual creative work.
Unconventional thought: what about a post‑processing shader stage that re‑frames already‑generated 16:9 footage into a 1:1 region of interest, guided by saliency detection? That could make any generator “square” without retraining.
Getting a model to output perfectly square video without cropping or stretching is one of those “small” details that actually breaks entire pipelines. I loved how you focused on this aspect-ratio challenge, because I’ve seen generated content rejected purely for not fitting the required format.
One real pain: aspect ratio can silently distort motion and affect temporal consistency. Did you observe any weird edge artefacts when forcing a square output, especially with fast camera panning? That would be interesting to document.
The aspect-ratio workaround is clever — I hit something similar going the other direction, trying to coax Veo into 9:16 for shorts. Ended up generating 16:9 then doing a center-crop with motion tracking on the subject, which worked but burned a lot of compute on pixels I was about to throw away. Square-then-crop is a much cleaner mental model.
One question: in your NanoBanana → Veo handoff, are you locking the seed on the first-frame image to keep character consistency across multiple shots, or letting Veo re-interpret each call? We've found that even with a fixed reference frame, Veo will subtly drift faces/clothing over a 6–8 second clip. Curious if you've found prompt scaffolding that mitigates that, or whether you just accept the drift and edit around it.