In the world of search, we've long been confined to keywords. Even with the advent of image search, the bridge between sound and video has remained a complex engineering challenge. Today, we're diving into the technical architecture of WhisperGrid, an app that lets you "speak the vibe" to find the perfect video.
The Vision
The goal was simple but ambitious: Create a 3x3 grid of videos that responds to semantic audio cues. Not just voice commands like "show me a cat," but the feeling of the audio. If you whistle a lonely tune, it should find a solitary landscape. If you make a splashing sound, it should find the ocean.
The Engine: Gemini Embedding 2
The core of WhisperGrid is the gemini-embedding-2-preview model. Unlike traditional models that only handle text, Gemini's latest embedding model is natively multimodal. It can map text, images, audio, and video into the same high-dimensional vector space.
This means a video of a "stormy beach" and the sound of "crashing waves" will end up as vectors that are mathematically close to each other.
Technical Architecture
1. Pre-calculating the Video Latent Space
To ensure the app feels "instant," we don't embed videos on the fly during a search. Instead, we perform a "warm-up" phase:
const result = await ai.models.embedContent({
model: "gemini-embedding-2-preview",
contents: [{
inlineData: {
data: videoBase64,
mimeType: "video/mp4"
}
}]
});
const videoVector = result.embeddings[0].values;
2. Capturing the "Vibe"
When the user hits the mic, we use the MediaRecorder API to capture a high-quality audio blob. This blob is then sent to Gemini to be transformed into its own vector.
3. The Math of Similarity
Once we have the audio vector ($A$) and our library of video vectors ($V_1, V_2, ... V_9$), we use Cosine Similarity to find the best match.
Cosine similarity measures the cosine of the angle between two vectors. A value of 1 means they are identical in direction (perfect semantic match), while 0 means they are orthogonal (no relation).
function cosineSimilarity(a, b) {
return dotProduct(a, b) / (magnitude(a) * magnitude(b));
}
UI/UX: The "Whisper" Aesthetic
For the interface, I chose a "Cinematic Brutalist" style:
-
Real-time Audio Visualizer: Using the Web Audio API's
AnalyserNode, I render a live frequency waveform to a Canvas element. This gives users immediate visual confirmation that their "vibe" is being heard. - Granular Loading States: Instead of a global loading spinner, each video tile manages its own state. You can watch the grid "warm up" as each video is embedded individually.
- Quantitative Transparency: By displaying the Match Percentage (derived from the cosine similarity score), I provide users with a clear understanding of how the AI interpreted their input.
- Framer Motion: For smooth transitions and layout animations when a "winner" is selected.
Challenges Overcome
- Real-time Visualization: Syncing the Canvas renderer with the MediaRecorder stream required careful management of the AudioContext lifecycle to prevent memory leaks.
- Payload Limits: Embedding large videos can hit API limits. Implemented a 20MB cap and client-side checks to ensure stability.
- CORS & Assets: Moving from external URLs to local assets solved latency and cross-origin issues, making the app production-ready for local deployment.
WhisperGrid isn't just a search tool; it's a demonstration of how multimodal AI is blurring the lines between different types of data. By mapping sound and vision into a shared mathematical space, I've created a search experience that feels less like a computer and more like an intuition.
Github Repo: https://github.com/harishkotra/WhisperGrid




Top comments (0)