If you’ve played around with current AI video generators, you already know the frustration: It’s basically a slot machine.
You write a massive prompt, hit "generate," wait 3 minutes, and pray. If the lighting is wrong, or the character's jacket changed color? You have to rewrite the prompt and re-roll the dice. You lose all your previous progress.
As a developer, this lack of "state" drove me crazy. Why can't we have version control or iterative diffs for video generation? Why can't I just tell the AI, "Keep everything exactly the same, but make it rain in the background"?
I decided to fix this by ditching the traditional NLE (Non-Linear Editor) timeline entirely and building a conversational video generator powered by Google's Gemini Omni model.
Here is how I built it, the technical hurdles of maintaining "video state," and why I think conversational UI is the future of video editing.
The Architecture: Conversational UI as the NLE
When designing the frontend (I used Next.js for this), I realized that traditional video editing tools rely on spatial organization (dragging clips on a track). But AI understands intent.
Instead of a timeline, the core UI is a chat interface. But under the hood, it's not a simple chatbot. It's a state machine managing a JSON object that represents the "Creative Brief."
Every time a user types a command (e.g., "pan the camera to the left"), the application doesn't just send that raw text to the video model. Instead:
1. It sends the current JSON state and the user's text to a lightweight LLM.
2. The LLM updates the specific parameters in the JSON (e.g., updating "camera_movement": "static" to "camera_movement": "pan_left").
3. This updated, highly structured payload is what actually triggers the video generation.
This architectural choice is what allows for Multi-Turn Video Editing. You are iterating on a stateful object, not starting from scratch.
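A minimal sketch of that stateful loop, under stated assumptions: only the "camera_movement" field comes from the example above, while the other field names, the `applyPatch` and `commit` helpers, and the idea of keeping every version in a history array are hypothetical illustrations. The LLM call itself is left out; the sketch only shows how a patch proposed by the LLM could be validated and merged.

```typescript
// Hypothetical shape of the "Creative Brief". Only camera_movement appears in
// the post; weather and aspect_ratio are illustrative assumptions.
type CameraMovement = "static" | "pan_left" | "pan_right";

interface CreativeBrief {
  camera_movement: CameraMovement;
  weather: string;
  aspect_ratio: string;
}

interface BriefHistory {
  versions: CreativeBrief[]; // every accepted state, oldest first
}

const CAMERA_MOVEMENTS: readonly string[] = ["static", "pan_left", "pan_right"];

// Merge an LLM-proposed patch into the current brief.
// Unknown keys and invalid values are ignored, so the state stays well-formed.
function applyPatch(current: CreativeBrief, patch: Record<string, unknown>): CreativeBrief {
  const next: CreativeBrief = { ...current };
  const movement = patch["camera_movement"];
  if (typeof movement === "string" && CAMERA_MOVEMENTS.indexOf(movement) !== -1) {
    next.camera_movement = movement as CameraMovement;
  }
  const weather = patch["weather"];
  if (typeof weather === "string") next.weather = weather;
  const ratio = patch["aspect_ratio"];
  if (typeof ratio === "string") next.aspect_ratio = ratio;
  return next;
}

// Append the patched brief as a new version instead of mutating the old one,
// which makes undo and diffs between turns straightforward.
function commit(history: BriefHistory, patch: Record<string, unknown>): BriefHistory {
  if (history.versions.length === 0) throw new Error("history needs an initial brief");
  const current = history.versions[history.versions.length - 1];
  return { versions: history.versions.concat([applyPatch(current, patch)]) };
}
```

With this shape, "keep everything the same, but make it rain" becomes a one-field patch such as `{ weather: "rain" }`, and every other field carries over untouched from the previous version.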
Exploiting Gemini Omni's Multi-Modal Capabilities
The real magic happened when I integrated Gemini Omni. The goal wasn't just text-to-video; I wanted a unified workflow.
Because Gemini Omni is natively multimodal, the backend can accept completely unstructured inputs simultaneously. You can drop in:
A rough text script.
A product photo (.webp or .png).
A voice memo describing the vibe (.mp3).
I built an ingestion pipeline that feeds all these raw buffers into Gemini simultaneously. The model acts as the "Director," parsing the audio sentiment, analyzing the reference image's color palette, and merging it with the text prompt to generate a cohesive scene. No manual compositing, no separate audio-syncing steps. It handles cinematography and sound design in one pass.
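As a rough illustration of the ingestion step, the sketch below turns a script plus uploaded files into one ordered list of request parts. Everything here is an assumption for illustration: the `Part` type, the `UploadedFile` shape, and the `buildParts` helper are hypothetical and not the actual SDK request types, and the call to the model is omitted.

```typescript
// Hypothetical request part: either plain text or a typed binary blob.
type Part =
  | { kind: "text"; text: string }
  | { kind: "media"; mimeType: string; bytes: Uint8Array };

// Hypothetical upload shape coming from the frontend.
interface UploadedFile {
  name: string;
  bytes: Uint8Array;
}

// Standard MIME types for the file extensions mentioned above.
const MIME_BY_EXTENSION: Record<string, string> = {
  webp: "image/webp",
  png: "image/png",
  mp3: "audio/mpeg",
};

// Map a file name to its MIME type, rejecting anything not in the table.
function mimeTypeFor(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  if (!Object.prototype.hasOwnProperty.call(MIME_BY_EXTENSION, ext)) {
    throw new Error("Unsupported file type: " + fileName);
  }
  return MIME_BY_EXTENSION[ext];
}

// Put the script first, then every file as a typed media part, so the whole
// bundle can be sent to the model in a single request.
function buildParts(script: string, files: UploadedFile[]): Part[] {
  const parts: Part[] = [{ kind: "text", text: script }];
  for (const file of files) {
    parts.push({ kind: "media", mimeType: mimeTypeFor(file.name), bytes: file.bytes });
  }
  return parts;
}
```

Validating the MIME type before the request goes out means an unsupported upload fails fast in the backend rather than after a multi-minute render.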
Dynamic Resolution Scaling
One of the most annoying parts of modern content creation is formatting for different platforms (16:9 for YouTube, 9:16 for TikTok).
Instead of building manual cropping tools in the browser, I offloaded the re-composition to the AI. The state manager passes the requested aspect ratio as a flag with each render request, and the model redraws the scene natively for that ratio, so subjects are re-framed for the new shape rather than cropped out of it.
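To make the aspect ratio flag concrete, here is a small sketch of how a state manager might attach it to a render request. The `RenderRequest` shape, the `dimensionsFor` and `buildRenderRequest` helpers, and the "1:1" option are assumptions for illustration; the post only names 16:9 and 9:16.

```typescript
// Ratios named in the post, plus a square format as an illustrative extra.
type AspectRatio = "16:9" | "9:16" | "1:1";

const RATIO_PARTS: Record<AspectRatio, [number, number]> = {
  "16:9": [16, 9],
  "9:16": [9, 16],
  "1:1": [1, 1],
};

// Hypothetical payload sent to the renderer.
interface RenderRequest {
  brief: Record<string, unknown>;
  aspectRatio: AspectRatio;
  width: number;
  height: number;
}

// Derive pixel dimensions from a ratio and the length of the longer side.
// The shorter side is rounded to an even number, which many encoders expect.
function dimensionsFor(ratio: AspectRatio, longSide: number): { width: number; height: number } {
  const [w, h] = RATIO_PARTS[ratio];
  const toEven = (n: number): number => Math.round(n / 2) * 2;
  if (w >= h) {
    return { width: longSide, height: toEven((longSide * h) / w) };
  }
  return { width: toEven((longSide * w) / h), height: longSide };
}

// Same brief, different frame: only the ratio and dimensions change per target.
function buildRenderRequest(
  brief: Record<string, unknown>,
  ratio: AspectRatio,
  longSide: number
): RenderRequest {
  const { width, height } = dimensionsFor(ratio, longSide);
  return { brief, aspectRatio: ratio, width, height };
}
```

Because the brief is passed through unchanged, a YouTube cut and a TikTok cut of the same scene differ only in this flag.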
The Result: Gemini Omni Video
After weeks of tweaking API calls, managing long-polling rendering states, and refining the Next.js UI, I packaged this into a tool called Gemini Omni Video.
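For readers curious what "managing rendering states" can look like on the client, here is a hedged sketch of tracking a polled render job. The status names, the `PollResponse` shape, and the exponential backoff schedule are all assumptions, not the tool's actual protocol, and the network request itself is not shown.

```typescript
// Hypothetical job lifecycle.
type RenderStatus = "queued" | "rendering" | "done" | "failed";

// State the UI keeps for one render job.
interface RenderJob {
  status: RenderStatus;
  progress: number; // 0 to 1
  videoUrl?: string;
  error?: string;
}

// Hypothetical shape of one status response from the backend.
interface PollResponse {
  status: RenderStatus;
  progress?: number;
  videoUrl?: string;
  error?: string;
}

function isTerminal(status: RenderStatus): boolean {
  return status === "done" || status === "failed";
}

// Fold one poll response into the UI state. Finished jobs ignore late
// responses, and progress never moves backwards if responses arrive out of order.
function reduceJob(job: RenderJob, res: PollResponse): RenderJob {
  if (isTerminal(job.status)) return job;
  const reported = res.progress === undefined ? job.progress : Math.min(1, res.progress);
  const progress = res.status === "done" ? 1 : Math.max(job.progress, reported);
  return { status: res.status, progress, videoUrl: res.videoUrl, error: res.error };
}

// Exponential backoff with a cap: 1s, 2s, 4s, 8s, then at most 15s between polls.
function nextPollDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 15000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

Keeping the reducer pure makes it easy to unit test and to plug into whatever request loop or streaming transport the app ends up using.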
It completely removes the "production overhead." You can go from a blank canvas to a publish-ready 4K video (with auto-matched audio) in minutes, just by talking to it.
Some core features I managed to implement:
Consistent Characters: Maintaining facial and style continuity across multiple generated clips.
Photo-to-Motion: Animating static product shots with context-aware camera movements.
Auto-Matched Audio: Synchronizing ambient sound and effects without a separate audio track.
What's Next?
Building AI video tools right now feels like building for the web in the late 90s—everything is changing weekly. My next technical challenge is reducing the latency between iterative edits and improving the streaming feedback loop so the UI feels more instantaneous.
If you are a developer, creator, or just someone tired of complex video editors, I'd love for you to try out Gemini Omni Video and let me know what you think.
How are you guys handling state management in AI-heavy applications? Have you tried building anything with the Gemini Omni API yet? Let's discuss in the comments!