I recently shipped the MVP of a side project called veo3.im. It’s a web-based tool that automatically summarizes long-form video content (e.g., YouTube, TikTok) by extracting and stitching together the most relevant segments and generating a natural-language summary.
The core idea came from a personal pain point: too many long videos, not enough time. So I built something to surface the most valuable content faster.
🧱 Tech Stack & System Design (MVP)
Here’s a high-level breakdown of how veo3.im currently works under the hood:
Speech Recognition: Using OpenAI’s Whisper for accurate automatic speech transcription;
Semantic Relevance Detection: Scoring subtitle segments based on keyword density, semantic similarity, and visual pacing to identify “highlight-worthy” moments;
Programmatic Video Editing: Leveraging FFmpeg with custom segment selection logic for non-linear editing and fast media composition (a minimal sketch follows this list);
Natural Language Summarization: Utilizing LLMs (tested with GPT-4 and Gemini) to generate concise summaries from transcript data;
Frontend: Built with React + Tailwind CSS, responsive and minimal;
Backend: Node.js + PostgreSQL, with Redis-based task queue (BullMQ) for processing orchestration;
Media Hosting & Delivery: Using Bunny.net for low-latency video storage, CDN delivery, and HLS playback support.
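Since the editing step is the part people usually ask about, here’s a minimal sketch of the cut-and-concat approach. To be clear, this isn’t the production code: the function names, file paths, and segment shape are illustrative, and it assumes `ffmpeg` is on the PATH.

```typescript
import { execFile } from "node:child_process";
import { writeFile } from "node:fs/promises";
import { promisify } from "node:util";

const run = promisify(execFile);

interface Segment {
  start: number; // seconds
  end: number;   // seconds
}

// Cut each selected segment without re-encoding (-c copy), then join the
// pieces with FFmpeg's concat demuxer. Input-side -ss seeks by keyframe,
// which is fast but can be off by a fraction of a second; move -ss after
// -i (or re-encode) if frame-accurate cuts matter.
async function stitchHighlights(input: string, segments: Segment[], output: string): Promise<void> {
  const parts: string[] = [];
  for (const [i, seg] of segments.entries()) {
    const part = `part_${i}.mp4`;
    await run("ffmpeg", [
      "-ss", String(seg.start),
      "-i", input,
      "-t", String(seg.end - seg.start),
      "-c", "copy",
      "-y", part,
    ]);
    parts.push(part);
  }
  // The concat demuxer reads a small text manifest listing the inputs.
  await writeFile("concat.txt", parts.map((p) => `file '${p}'`).join("\n"));
  await run("ffmpeg", ["-f", "concat", "-safe", "0", "-i", "concat.txt", "-c", "copy", "-y", output]);
}
```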
Right now, the entire processing pipeline is asynchronous. Users simply paste a video link, and within a few minutes, they receive a short-form highlight video + readable summary via a clean UI.
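To make “asynchronous” concrete, the orchestration is roughly a BullMQ queue plus a worker, as sketched below. The stage functions (`transcribe`, `scoreSegments`, and friends) are placeholder names standing in for the real pipeline steps, and the Redis connection details are assumptions.

```typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // assumed local Redis

// Hypothetical pipeline stages; each wraps Whisper, scoring, FFmpeg, or an LLM.
declare function transcribe(url: string): Promise<string>;
declare function scoreSegments(transcript: string): Promise<{ start: number; end: number }[]>;
declare function stitchHighlights(url: string, segs: { start: number; end: number }[]): Promise<string>;
declare function summarize(transcript: string): Promise<string>;

const videoQueue = new Queue("video-jobs", { connection });

// The API handler enqueues the pasted URL and returns immediately;
// the user is notified when the job finishes.
export function enqueue(url: string) {
  return videoQueue.add("summarize", { url });
}

// The worker drains the queue and runs the stages in order.
new Worker(
  "video-jobs",
  async (job) => {
    const { url } = job.data as { url: string };
    const transcript = await transcribe(url);
    const segments = await scoreSegments(transcript);
    const clipUrl = await stitchHighlights(url, segments);
    const summary = await summarize(transcript);
    return { clipUrl, summary };
  },
  { connection, concurrency: 4 }
);
```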
🤔 Still in MVP – Here’s What I’m Wondering:
How can I improve highlight detection beyond basic heuristics?
Right now, segment scoring is based on surface-level heuristics like keyword frequency and scene changes. I’m considering integrating embedding-based relevance or even a lightweight classifier trained on video-summary pairs. Has anyone tried this?
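In case it helps the discussion, this is roughly what I mean by embedding-based relevance, sketched against OpenAI’s embeddings endpoint with plain cosine similarity. Scoring segments against a video-level reference text (a draft summary or a user query) is an assumption on my part, not something I’ve validated yet:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score each transcript segment by similarity to a reference text
// (e.g., a draft summary); the top-scoring segments become highlight candidates.
async function rankSegments(segments: string[], reference: string) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [reference, ...segments], // results come back in input order
  });
  const [ref, ...rest] = data.map((d) => d.embedding);
  return rest
    .map((emb, i) => ({ index: i, score: cosine(ref, emb) }))
    .sort((a, b) => b.score - a.score);
}
```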
Optimizing processing performance at scale
A 10-minute video currently takes ~2–4 minutes to process end-to-end. I’ve optimized with parallel jobs, queue management, and caching, but I suspect the FFmpeg phase is the main bottleneck. Curious if anyone’s done distributed FFmpeg processing or GPU acceleration in production?
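To show what I’m considering: since each cut is independent, the cuts can at least run concurrently on one box (or be fanned out to multiple workers through the queue), and NVENC can take over encoding whenever a re-encode is unavoidable. This is a sketch under the assumption of an NVENC-enabled FFmpeg build and an NVIDIA GPU; flags and paths are illustrative:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Each segment cut is independent, so they can run concurrently
// (or be routed to separate worker machines via the job queue).
async function cutSegmentsInParallel(input: string, segments: { start: number; end: number }[]) {
  await Promise.all(
    segments.map((seg, i) =>
      run("ffmpeg", [
        "-hwaccel", "cuda",     // GPU decode (requires a CUDA-enabled FFmpeg build)
        "-ss", String(seg.start),
        "-i", input,
        "-t", String(seg.end - seg.start),
        "-c:v", "h264_nvenc",   // GPU encode instead of libx264
        "-c:a", "copy",
        "-y", `part_${i}.mp4`,
      ])
    )
  );
}
```

For true distribution, the same queue could route each segment cut to a different machine, with the final concat as a cheap fan-in step.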
Should I support local file uploads (or stick with URL-only input)?
Right now, only public URLs are supported to keep things lean. Adding file upload means dealing with larger storage and validation overhead. From a developer/user standpoint, do you see this as a necessary feature or unnecessary complexity?
🚀 What’s Next?
This is a solo-built full-stack project (design, infra, code) developed over the past 3 months. I plan to experiment with UI enhancements, multilingual support, fine-tuned summarization models, and more robust processing pipelines.
Would love feedback from fellow devs:
What would you optimize in this workflow?
Are there better open-source models or FFmpeg configs you’d recommend?
How would you approach feature prioritization at this stage?
You can try it here → https://veo3.im/
I’d truly appreciate your thoughts, ideas, or critique 🙌
Thanks for reading!
Top comments (2)
Just tried out veo3.im — and wow, it works really well! It pulled out the key parts of a long video and gave me a clean, short summary. Super useful if you don’t have time to watch the whole thing. Definitely worth checking out! 🙌
Thanks for your feedback! Is there anything you felt was missing or could be better? We’re all ears and would love to hear any downsides 😊.