Abhiraj Adhikary

Posted on Oct 9

Building a YouTube Video Search App with Flask, Whisper, and RAG

#whisper #flask #mariadb #hackathon

Building a YouTube Video Search App with Flask, Whisper, and RAG

Ever wanted to search for specific moments in a YouTube video by just typing a keyword? Imagine pinpointing that exact timestamp where someone explains "machine learning" in a 5-minute tutorial—without scrubbing through the whole thing. I built a Flask-based web app called video-rag-search that does exactly this, using Retrieval-Augmented Generation (RAG), OpenAI's Whisper, and a sprinkle of AI magic. In this post, I'll walk you through what it does, how it works, and why it's a fun project for developers to explore.

What Does It Do?

The video-rag-search app lets you:

Paste a YouTube video link (up to 5 minutes long).
Automatically download and transcribe the audio using OpenAI's Whisper.
Generate 5 key topics from the transcript using Groq's LLM.
Search for moments in the video by selecting a topic, with results linked to exact timestamps.
Cache results for speed and store data in a MariaDB database for persistence.

Think of it as a smart search engine for YouTube videos, powered by semantic search and AI transcription. Whether you're a student skimming lectures or a developer digging through tech talks, this tool saves time.

Why Build This?

I wanted to combine my love for Flask, AI, and video content into a practical tool. YouTube is a treasure trove of knowledge, but finding specific moments can be a pain. By leveraging RAG (Retrieval-Augmented Generation), we can make video content searchable in a way that's intuitive and developer-friendly. Plus, it's a great excuse to play with cutting-edge AI libraries like Whisper and SentenceTransformers!

Tech Stack

Here's the lineup of tools and libraries powering the app:

Flask: Lightweight Python web framework for the backend and UI.
OpenAI Whisper: Transcribes YouTube audio to text with timestamps.
Groq LLM: Generates meaningful keywords from transcripts.
SentenceTransformers: Creates semantic embeddings for search.
MariaDB: Stores transcripts and embeddings for persistence.
yt-dlp: Downloads YouTube audio efficiently.
Flask-Caching: Speeds up repeated searches.
pydub: Handles audio file processing.
NumPy: Computes similarity scores for search.

You'll also need a Groq API key (free tier available) and a MariaDB instance (local or cloud).

How It Works

Let's break down the app's workflow, from YouTube link to search results.

1. Input a YouTube Link

The user submits a YouTube URL via a simple form (index.html). The app validates it using a regex to ensure it's a proper YouTube link (e.g., youtube.com/watch?v=... or youtu.be/...). Whitespace and quotes are stripped for cleanliness.

2. Download Audio

Using yt-dlp, the app downloads the audio as an MP3 file. It checks the video's duration (via pydub) and enforces a 5-minute limit to keep processing manageable. If the video's too long, you get a friendly error message.

3. Transcribe with Whisper

OpenAI's Whisper (medium model) transcribes the audio, producing segments with text and timestamps (e.g., [10.2 - 12.5] "Welcome to AI basics"). Empty or invalid segments are filtered out to ensure quality.

4. Store in MariaDB

Each segment is saved in a MariaDB table (video_data) with:

Video ID (from the YouTube URL).
Segment text, start/end times, and a timestamped YouTube link.
Semantic embeddings (as JSON, generated later).

The table is created dynamically if it doesn't exist, with defensive migrations to handle schema changes.

5. Generate Keywords with Groq

The transcript is sent to Groq's LLM (model: openai/gpt-oss-20b) with a prompt to extract 5 relevant keywords. For example, a machine learning tutorial might yield:

Neural Networks
Backpropagation
Overfitting
Gradient Descent
Activation Functions

The app parses the LLM's response, prioritizing bold (**...**), numbered, or bulleted lists, and cleans up markdown artifacts.

6. Semantic Search with Embeddings

To enable smart searching, the app uses SentenceTransformers (all-MiniLM-L6-v2) to create embeddings for each transcript segment. These are stored as JSON in MariaDB. When a user selects a keyword (e.g., "Neural Networks"), the app:

Encodes the keyword into an embedding.
Computes cosine similarity against stored segment embeddings.
Returns the best-matching segment (if similarity ≥ 0.5) with its timestamp and a clickable link.

7. Caching for Speed

Results are cached using Flask-Caching with the video ID as the key. If the same video is searched again within an hour, the app skips processing and loads from cache.

8. User Interface

The UI (built with Jinja2 templates) guides users through three steps:

Input Link: Enter a YouTube URL.
Select Keyword: Choose from 5 auto-generated keywords.
View Results: See the matching timestamp, transcript snippet, and a link to jump to that moment in the video.

Errors (e.g., invalid URL, failed transcription) are logged and displayed as user-friendly messages.

Code Highlights

Here's a peek at some key functions (simplified for brevity):

def download_audio(youtube_link):
    args = ["yt-dlp", "-x", "--audio-format", "mp3", "-o", "video.mp3", youtube_link]
    result = subprocess.run(args, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(f"Download failed: {result.stderr}")
    if not os.path.exists("video.mp3"):
        raise FileNotFoundError("Audio file not found.")

def parse_keywords(text: str) -> list:
    bold = re.findall(r"\*\*(.+?)\*\*", text)
    candidates = bold if bold else re.findall(r"^\s*\d+\.\s*(.+)", text, re.M)
    return [re.sub(r"[\s\.,;:!]+$", "", item.strip()) for item in candidates][:5]

@app.route('/select_keyword/<int:index>', methods=['GET'])
def select_keyword(index):
    keywords = session.get('keywords', [])
    query = keywords[index - 1].lower()
    embedder = SentenceTransformer('all-MiniLM-L6-v2')
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    # ... (fetch segments, compute cosine similarity, return best match)

Getting Started

Want to try it yourself? Here's how to set it up:

Clone the Repo:

   git clone <your-repo>
   cd video-rag-search

Install Dependencies:

   pip install flask whisper sentence-transformers groq pydub flask-caching mariadb numpy yt-dlp

Set Up Environment: Create a .env file:

   GROQ_API_KEY=your_groq_key
   DB_USER=root
   DB_PASSWORD=RootPass123!
   DB_HOST=localhost
   DB_PORT=3306
   DB_NAME=youtube_search

Set Up MariaDB:
Install MariaDB locally or use a cloud provider. Create a youtube_search database.
Run the App:

   python app.py

Visit http://localhost:5000 and paste a YouTube link (try a short tech tutorial!).

Challenges and Lessons

Whisper Load Time: The medium model is heavy. Preloading or using a smaller model (tiny) could speed things up, but I prioritized accuracy.
Embedding Storage: Storing embeddings as JSON in MariaDB works but isn't ideal for scale. A vector database like FAISS or Pinecone would be better (planned for v2!).
LLM Parsing: Groq's output varies, so robust parsing (e.g., handling markdown) was key to consistent keywords.
Caching: Flask-Caching with a simple in-memory store is great for dev but needs Redis for production.

What's Next?

I'm excited to extend this project with:

Quiz Generation: Turn transcripts into MCQ quizzes for learning.
User Accounts: Add login/register to track search history.
Cloud DB: Move to Neon Postgres or Render for scalability.
Audio Readout: Use text-to-speech for accessibility.
Leaderboard: Rank users by search activity or quiz scores.

Try It Out!

The video-rag-search app is a fun blend of AI, web dev, and data science. It’s open-source, so fork it, tweak it, or add your own spin! Got ideas for features or hit a snag? Drop a comment on Dev.to or open an issue on the repo.

Happy coding, and let’s make YouTube videos searchable! This was build during MariaDB hackathon by @anikchand461 and me.

DEV Community

Building a YouTube Video Search App with Flask, Whisper, and RAG

Building a YouTube Video Search App with Flask, Whisper, and RAG

What Does It Do?

Why Build This?

Tech Stack

How It Works

1. Input a YouTube Link

2. Download Audio

3. Transcribe with Whisper

4. Store in MariaDB

5. Generate Keywords with Groq

6. Semantic Search with Embeddings

7. Caching for Speed

8. User Interface

Code Highlights

Getting Started

Challenges and Lessons

What's Next?

Try It Out!

Top comments (0)