I recently shipped the MVP of a side project called veo3.im. It’s a web-based tool that automatically summarizes long-form video content (e.g., YouTube, TikTok) by extracting and stitching together the most relevant segments and generating a natural-language summary.
The core idea came from a personal pain point: too many long videos, not enough time. So I built something to surface the most valuable content faster.
🧱 Tech Stack & System Design (MVP)
Here’s a high-level breakdown of how veo3.im currently works under the hood:
Speech Recognition: Using OpenAI’s Whisper for accurate automatic speech transcription;
Semantic Relevance Detection: Scoring subtitle segments based on keyword density, semantic similarity, and visual pacing to identify “highlight-worthy” moments;
Programmatic Video Editing: Leveraging FFmpeg with custom segment selection logic for non-linear editing and fast media composition (a minimal sketch follows this list);
Natural Language Summarization: Utilizing LLMs (tested with GPT-4 and Gemini) to generate concise summaries from transcript data;
Frontend: Built with React + Tailwind CSS, responsive and minimal;
Backend: Node.js + PostgreSQL, with Redis-based task queue (BullMQ) for processing orchestration;
Media Hosting & Delivery: Using Bunny.net for low-latency video storage, CDN delivery, and HLS playback support.
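Since the editing step is the part people usually ask about, here’s a minimal sketch of the cut-and-concat approach. To be clear, this isn’t the production code: the function names, file paths, and segment shape are illustrative, and it assumes `ffmpeg` is on the PATH.

```typescript
import { execFile } from "node:child_process";
import { writeFile } from "node:fs/promises";
import { promisify } from "node:util";

const run = promisify(execFile);

interface Segment {
  start: number; // seconds
  end: number;   // seconds
}

// Cut each selected segment without re-encoding (-c copy), then join the
// pieces with FFmpeg's concat demuxer. Input-side -ss seeks by keyframe,
// which is fast but can be off by a fraction of a second; move -ss after
// -i (or re-encode) if frame-accurate cuts matter.
async function stitchHighlights(input: string, segments: Segment[], output: string): Promise<void> {
  const parts: string[] = [];
  for (const [i, seg] of segments.entries()) {
    const part = `part_${i}.mp4`;
    await run("ffmpeg", [
      "-ss", String(seg.start),
      "-i", input,
      "-t", String(seg.end - seg.start),
      "-c", "copy",
      "-y", part,
    ]);
    parts.push(part);
  }
  // The concat demuxer reads a small text manifest listing the inputs.
  await writeFile("concat.txt", parts.map((p) => `file '${p}'`).join("\n"));
  await run("ffmpeg", ["-f", "concat", "-safe", "0", "-i", "concat.txt", "-c", "copy", "-y", output]);
}
```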
Right now, the entire processing pipeline is asynchronous. Users simply paste a video link, and within a few minutes, they receive a short-form highlight video + readable summary via a clean UI.
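To make “asynchronous” concrete, the orchestration is roughly a BullMQ queue plus a worker, as sketched below. The stage functions (`transcribe`, `scoreSegments`, and friends) are placeholder names standing in for the real pipeline steps, and the Redis connection details are assumptions.

```typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 }; // assumed local Redis

// Hypothetical pipeline stages; each wraps Whisper, scoring, FFmpeg, or an LLM.
declare function transcribe(url: string): Promise<string>;
declare function scoreSegments(transcript: string): Promise<{ start: number; end: number }[]>;
declare function stitchHighlights(url: string, segs: { start: number; end: number }[]): Promise<string>;
declare function summarize(transcript: string): Promise<string>;

const videoQueue = new Queue("video-jobs", { connection });

// The API handler enqueues the pasted URL and returns immediately;
// the user is notified when the job finishes.
export function enqueue(url: string) {
  return videoQueue.add("summarize", { url });
}

// The worker drains the queue and runs the stages in order.
new Worker(
  "video-jobs",
  async (job) => {
    const { url } = job.data as { url: string };
    const transcript = await transcribe(url);
    const segments = await scoreSegments(transcript);
    const clipUrl = await stitchHighlights(url, segments);
    const summary = await summarize(transcript);
    return { clipUrl, summary };
  },
  { connection, concurrency: 4 }
);
```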
🤔 Still in MVP – Here’s What I’m Wondering:
How can I improve highlight detection beyond basic heuristics?
Right now, segment scoring is based on surface-level heuristics like keyword frequency and scene changes. I’m considering integrating embedding-based relevance or even a lightweight classifier trained on video-summary pairs. Has anyone tried this?
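In case it helps the discussion, this is roughly what I mean by embedding-based relevance, sketched against OpenAI’s embeddings endpoint with plain cosine similarity. Scoring segments against a video-level reference text (a draft summary or a user query) is an assumption on my part, not something I’ve validated yet:

```typescript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Score each transcript segment by similarity to a reference text
// (e.g., a draft summary); the top-scoring segments become highlight candidates.
async function rankSegments(segments: string[], reference: string) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: [reference, ...segments], // results come back in input order
  });
  const [ref, ...rest] = data.map((d) => d.embedding);
  return rest
    .map((emb, i) => ({ index: i, score: cosine(ref, emb) }))
    .sort((a, b) => b.score - a.score);
}
```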
Optimizing processing performance at scale
A 10-minute video currently takes ~2–4 minutes to process end-to-end. I’ve optimized with parallel jobs, queue management, and caching, but I suspect the FFmpeg phase is the main bottleneck. Curious if anyone’s done distributed FFmpeg processing or GPU acceleration in production?
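To show what I’m considering: since each cut is independent, the cuts can at least run concurrently on one box (or be fanned out to multiple workers through the queue), and NVENC can take over encoding whenever a re-encode is unavoidable. This is a sketch under the assumption of an NVENC-enabled FFmpeg build and an NVIDIA GPU; flags and paths are illustrative:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Each segment cut is independent, so they can run concurrently
// (or be routed to separate worker machines via the job queue).
async function cutSegmentsInParallel(input: string, segments: { start: number; end: number }[]) {
  await Promise.all(
    segments.map((seg, i) =>
      run("ffmpeg", [
        "-hwaccel", "cuda",     // GPU decode (requires a CUDA-enabled FFmpeg build)
        "-ss", String(seg.start),
        "-i", input,
        "-t", String(seg.end - seg.start),
        "-c:v", "h264_nvenc",   // GPU encode instead of libx264
        "-c:a", "copy",
        "-y", `part_${i}.mp4`,
      ])
    )
  );
}
```

For true distribution, the same queue could route each segment cut to a different machine, with the final concat as a cheap fan-in step.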
Should I support local file uploads (or stick with URL-only input)?
Right now, only public URLs are supported to keep things lean. Adding file upload means dealing with larger storage and validation overhead. From a developer/user standpoint, do you see this as a necessary feature or unnecessary complexity?
🚀 What’s Next?
This is a solo-built full-stack project (design, infra, code) developed over the past 3 months. I plan to experiment with UI enhancements, multilingual support, fine-tuned summarization models, and more robust processing pipelines.
Would love feedback from fellow devs:
What would you optimize in this workflow?
Are there better open-source models or FFmpeg configs you’d recommend?
How would you approach feature prioritization at this stage?
You can try it here → https://veo3.im/
I’d truly appreciate your thoughts, ideas, or critique 🙌
Thanks for reading!
Top comments (2)
Just tried out veo3.im — and wow, it works really well! It pulled out the key parts of a long video and gave me a clean, short summary. Super useful if you don’t have time to watch the whole thing. Definitely worth checking out! 🙌
Thanks for your feedback! Is there anything you felt was missing or could be better? We’re all ears and would love to hear any downsides 😊.