VoxEdit AI is a conversational video editing agent that lets users edit videos with natural language commands instead of complex editing tools. The goal of this project is to simplify video editing: creators describe what they want to an AI assistant, which interprets their intent and performs the editing operations automatically.
This project was built using Google’s Gemini multimodal AI models combined with Google Cloud infrastructure to create a scalable AI-powered editing pipeline.
How VoxEdit AI Works
The system allows users to upload a video clip and give editing commands such as trimming, adding sound effects, or generating audio responses. Instead of manually editing timelines, the user simply tells the AI what they want to change.
The workflow of VoxEdit AI is:
- The user uploads a video through the frontend interface.
- The backend processes the video and stores it temporarily for analysis.
- Frames and contextual information from the video are analyzed using Gemini AI.
- Gemini interprets the user’s natural language instruction and generates an editing plan.
- The backend executes the plan using FFmpeg video processing tools.
- The processed video is returned to the user.
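To make the "execute the plan" step more concrete, here is a minimal sketch of how an editing plan could be turned into an FFmpeg command line. The plan format (a list of steps with an `op` key), the function name `build_ffmpeg_command`, and the two supported operations are assumptions made for illustration; they are not the project's actual schema. The FFmpeg options used (`-y`, `-i`, `-ss`, `-to`, `-an`) are standard ones.

```python
from typing import Any


def build_ffmpeg_command(src: str, dst: str, plan: list[dict[str, Any]]) -> list[str]:
    """Translate a (hypothetical) editing plan into an ffmpeg argument list."""
    # -y overwrites the output file without prompting; -i names the input.
    args = ["ffmpeg", "-y", "-i", src]
    for step in plan:
        op = step.get("op")
        if op == "trim":
            start, end = float(step["start"]), float(step["end"])
            if not 0 <= start < end:
                raise ValueError(f"invalid trim range: {start}..{end}")
            # As output options, -ss and -to keep only the span between
            # the two timestamps (in seconds).
            args += ["-ss", f"{start:.3f}", "-to", f"{end:.3f}"]
        elif op == "mute":
            # -an drops the audio stream from the output.
            args.append("-an")
        else:
            # Refuse anything outside the known set of operations.
            raise ValueError(f"unsupported operation: {op!r}")
    args.append(dst)
    return args


if __name__ == "__main__":
    plan = [{"op": "trim", "start": 5, "end": 12.5}]
    print(build_ffmpeg_command("input.mp4", "output.mp4", plan))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a shell string means file names and model-generated values are never interpreted by a shell.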
Technology Stack
The system was built using the following technologies:
- Gemini AI for multimodal reasoning and command interpretation
- FastAPI for the backend API
- FFmpeg for video editing operations
- React for the frontend interface
- Google Cloud Run for scalable backend deployment
Google Cloud Run allows the backend service to scale automatically and handle AI requests efficiently.
Architecture Overview
The architecture of VoxEdit AI includes:
User Interface → FastAPI Backend → Gemini AI Agent → Video Processing Engine → Google Cloud Run Deployment
This architecture enables the AI agent to understand user instructions and convert them into executable editing operations.
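One way to picture the hand-off between the AI agent and the processing engine is a small validation layer that checks the model's reply before anything is executed. The sketch below assumes the model is asked to answer with a JSON array of steps. The `EditStep` type, the `parse_plan` function, and the allowed operation names are hypothetical and only illustrate the idea.

```python
import json
from dataclasses import dataclass, field
from typing import Any

# Hypothetical whitelist of operations the backend is willing to run.
ALLOWED_OPS = {"trim", "mute"}


@dataclass(frozen=True)
class EditStep:
    op: str
    params: dict[str, Any] = field(default_factory=dict)


def parse_plan(model_text: str) -> list[EditStep]:
    """Parse and validate a JSON editing plan produced by the model."""
    try:
        data = json.loads(model_text)
    except json.JSONDecodeError as exc:
        raise ValueError("model reply is not valid JSON") from exc
    if not isinstance(data, list):
        raise ValueError("editing plan must be a JSON array of steps")

    steps: list[EditStep] = []
    for item in data:
        op = item.get("op") if isinstance(item, dict) else None
        if not isinstance(op, str) or op not in ALLOWED_OPS:
            raise ValueError(f"unsupported or malformed step: {item!r}")
        # Everything except "op" is treated as a parameter of the step.
        params = {key: value for key, value in item.items() if key != "op"}
        steps.append(EditStep(op=op, params=params))
    return steps


if __name__ == "__main__":
    reply = '[{"op": "trim", "start": 5, "end": 12}, {"op": "mute"}]'
    for step in parse_plan(reply):
        print(step.op, step.params)
```

Rejecting unknown operations at this boundary keeps the video processing engine from acting on instructions it was never designed to handle.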
Conclusion
VoxEdit AI demonstrates how multimodal AI agents can transform traditional creative workflows. By combining natural language interaction with video processing and cloud infrastructure, the project shows how AI can simplify complex tasks like video editing.
This project was created for the #GeminiLiveAgentChallenge hackathon to explore the capabilities of Google’s Gemini models and Google Cloud in building next-generation AI agents.

