A Show HN post is trending at 260+ points: someone built sub-second video search using Gemini's new native video embedding capability.
This is a big deal for developers building AI-powered video apps.
## What Changed
Until now, working with video in AI meant:
- Extract frames (FFmpeg)
- Send frames as images to vision model
- Lose temporal information
- Pay for many image API calls
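The first step of that legacy pipeline is usually an FFmpeg frame-sampling pass. A minimal sketch of building that command (the 1 fps sample rate and output pattern are illustrative choices, not anything Gemini-specific):

```python
# Sketch of step one of the legacy pipeline: sampling frames with FFmpeg.
# The sample rate and output filename pattern are illustrative.
def ffmpeg_frame_command(video_path: str, fps: int = 1,
                         out_pattern: str = "frame_%04d.jpg") -> list[str]:
    """Build an ffmpeg command that samples `fps` frames per second."""
    return [
        "ffmpeg",
        "-i", video_path,       # input video
        "-vf", f"fps={fps}",    # sample at the given frame rate
        out_pattern,            # numbered JPEG outputs
    ]

cmd = ffmpeg_frame_command("demo.mp4", fps=1)
print(" ".join(cmd))
```

Every one of those frames then has to be embedded and indexed separately, which is exactly the cost and complexity the native video path removes.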
Now Gemini can accept video natively — understanding motion, context, and temporal relationships in a single API call.
## Why Developers Should Care

### 1. Video search becomes trivial
Instead of building a complex pipeline (extract frames → embed each → vector search), you can now:
```python
# Conceptual sketch: Gemini native video understanding
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-2.0-flash")

# Upload the video once; the model ingests it natively
video = genai.upload_file("demo.mp4")

# Ask a temporal question against the whole video in one call
response = model.generate_content([
    "Find the moment where the speaker shows the chart",
    video,
])
print(response.text)  # "At 2:34, the speaker displays a bar chart showing..."
```
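If you need machine-seekable positions rather than free text, the model's answer can be parsed for timestamps. A small sketch, assuming the response uses an `MM:SS` style like the example above:

```python
import re

def extract_timestamps(text: str) -> list[int]:
    """Find MM:SS timestamps in model output and convert to seconds."""
    matches = re.findall(r"\b(\d{1,2}):([0-5]\d)\b", text)
    return [int(m) * 60 + int(s) for m, s in matches]

answer = "At 2:34, the speaker displays a bar chart showing revenue."
print(extract_timestamps(answer))  # [154]
```

The seconds values can then drive a video player's seek bar directly, turning a natural-language query into a jump-to-moment feature.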
### 2. New application categories
| Application | Before | Now |
|---|---|---|
| Video search | Frame extraction + CLIP + vector DB | Single API call |
| Content moderation | Manual review or basic ML | Natural language queries |
| Meeting summaries | Transcription only | Visual + audio context |
| Sports analysis | Custom CV models | Describe what you want |
| Security footage | Motion detection rules | "Find person in red jacket" |
### 3. Cost implications
Gemini 2.0 Flash pricing:
- Video input: priced per minute of video
- Much cheaper than sending hundreds of frames as images
- Sub-second inference makes real-time applications feasible
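To see why per-minute video pricing tends to win, here is a back-of-the-envelope comparison. The prices below are placeholders I made up for illustration, not Gemini's actual rates; only the structure of the arithmetic matters:

```python
# Back-of-the-envelope comparison with PLACEHOLDER prices (not actual
# Gemini or vision-model rates): per-minute video vs. per-frame images.
PRICE_PER_VIDEO_MINUTE = 0.002   # hypothetical $/minute of video input
PRICE_PER_IMAGE_CALL = 0.0005    # hypothetical $/frame sent as an image
FRAMES_PER_SECOND_SAMPLED = 1    # legacy pipelines often sample ~1 fps

minutes = 10
video_cost = minutes * PRICE_PER_VIDEO_MINUTE
frame_cost = minutes * 60 * FRAMES_PER_SECOND_SAMPLED * PRICE_PER_IMAGE_CALL

print(f"native video: ${video_cost:.3f}  vs  frames: ${frame_cost:.3f}")
```

Whatever the real rates are, the frame-based path scales with frames sampled (60 per minute even at 1 fps), while the native path scales with minutes, so the gap widens with video length and sample rate.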
## The Competition
| Model | Native Video | Cost | Speed |
|---|---|---|---|
| Gemini 2.0 Flash | ✅ Yes | Low | Sub-second |
| GPT-4o | ❌ Frames only | Medium | 2-5 sec |
| Claude Sonnet | ❌ Images only | Medium | 2-5 sec |
| Open source (LLaVA-Video) | ✅ Yes | Self-hosted | Varies |
Google is ahead here. OpenAI and Anthropic will likely follow, but right now Gemini is the only major API with native video understanding.
## Discussion
- Are you building anything with video AI? What's your current stack?
- Does native video embedding change your architecture? Or were you already using frame extraction?
- Is sub-second video search useful for your use case?
- Google vs. OpenAI vs. Anthropic — who's winning the multimodal race?
The multimodal AI space is moving fast. I'm tracking all the free and paid AI APIs — Gemini's video capability just jumped to the top of the list.
What's the most interesting AI capability you've seen this month?