Wilbert Misingo


How to create a perfect YouTube video transcriber using AI

INTRODUCTION

YouTube has become an unparalleled resource for information, entertainment, and educational content. However, extracting the spoken words from videos programmatically can be a challenge.

IMPLEMENTATION

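In this article, we'll use Python to read YouTube video transcripts with the YouTube Transcript API library (youtube_transcript_api) and LlamaIndex (llama_index), a framework for feeding data into large language model applications. Note that the transcript library retrieves the captions YouTube already has for a video (manual or auto-generated); it does not run speech recognition itself.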

Step 01: Installing libraries and modules

Before diving into the code, make sure the required packages are installed by running the following command in a terminal.

pip install youtube_transcript_api llama_index


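Step 02: Importing libraries and modules

Now, let's import the modules and components needed for the implementation. The import paths below match older llama_index releases; from version 0.10 onward these classes live under llama_index.core, so adjust the paths if the imports fail.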

import re
from typing import Any, List, Optional
from llama_index.readers.base import BaseReader
from llama_index.readers.schema.base import Document
from importlib.util import find_spec

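Step 03: Defining the expected YouTube video URLs

The YOUTUBE_URL_PATTERNS list contains regular expressions that match the supported YouTube URL formats: watch links, embed links, and youtu.be short links. Each pattern captures the video ID in its first group, which is how the ID is extracted later.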

YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",  # youtu.be does not use www
]
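
To make the role of the capture group concrete, here is a minimal sketch that runs the three patterns against one sample URL per format. The sample URLs and the helper name captured_ids are made up for illustration and are not part of the reader.

```python
import re

# Same patterns as in the article.
YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",
]

# Made-up sample URLs, one per supported format.
SAMPLE_URLS = [
    "https://www.youtube.com/watch?v=abc123XYZ_-",
    "http://youtube.com/embed/abc123XYZ_-",
    "https://youtu.be/abc123XYZ_-",
]


def captured_ids(urls):
    """Return group(1) of the first matching pattern for each URL, or None."""
    ids = []
    for url in urls:
        for pattern in YOUTUBE_URL_PATTERNS:
            match = re.search(pattern, url)
            if match:
                ids.append(match.group(1))
                break
        else:
            # No pattern matched this URL.
            ids.append(None)
    return ids


print(captured_ids(SAMPLE_URLS))
```

All three formats should yield the same ID string, because the character class [\w-] covers letters, digits, underscores, and hyphens.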

Step 04: Verifying a YouTube video URL

The is_youtube_video function checks whether a given URL is a supported YouTube video link by matching it against the patterns defined above. This is useful for filtering a list of mixed links before transcribing.

def is_youtube_video(url: str) -> bool:
    """
    Returns whether the passed in `url` matches the various YouTube URL formats
    """
    for pattern in YOUTUBE_URL_PATTERNS:
        if re.search(pattern, url):
            return True
    return False
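
As a rough illustration of how this check might be used, the sketch below splits a mixed list of links into supported and unsupported ones. The split_links helper and the sample links are assumptions for the example; only is_youtube_video and the patterns come from the article.

```python
import re
from typing import List, Tuple

YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",
]


def is_youtube_video(url: str) -> bool:
    """Same check as in the article, written with any()."""
    return any(re.search(pattern, url) for pattern in YOUTUBE_URL_PATTERNS)


def split_links(urls: List[str]) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: separate supported links from everything else."""
    supported = [url for url in urls if is_youtube_video(url)]
    unsupported = [url for url in urls if not is_youtube_video(url)]
    return supported, unsupported


# Made-up links for the demonstration.
links = [
    "https://www.youtube.com/watch?v=abc123",
    "https://example.com/watch?v=abc123",           # wrong host
    "https://youtu.be/abc123",
    "https://www.youtube.com/playlist?list=PL123",  # a playlist, not a video
]
supported, unsupported = split_links(links)
print(supported)
print(unsupported)
```

Filtering up front like this avoids hitting the ValueError raised later by load_data for links that do not match any pattern.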

Step 05: Initializing the transcriber

The constructor of the YoutubeTranscriptReader class checks that the youtube_transcript_api package is installed and raises an ImportError with installation instructions if it is not.

class YoutubeTranscriptReader(BaseReader):
    """Youtube Transcript reader."""

    def __init__(self) -> None:
        if find_spec("youtube_transcript_api") is None:
            raise ImportError(
                "Missing package: youtube_transcript_api.\n"
                "Please `pip install youtube_transcript_api` to use this Reader"
            )
        super().__init__()
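
The same guard can be tried in isolation. The following is a small sketch of the find_spec check as a standalone function; require_package and the fake package name are hypothetical and exist only for this example.

```python
from importlib.util import find_spec


def require_package(name: str) -> None:
    """Raise ImportError with an install hint if a top-level package is missing.

    find_spec only locates the package; it does not import it.
    """
    if find_spec(name) is None:
        raise ImportError(
            f"Missing package: {name}.\n"
            f"Please `pip install {name}` to use this Reader"
        )


# A standard-library module is always found, so this passes silently.
require_package("re")

# A made-up name that should not be installed anywhere.
try:
    require_package("no_such_package_for_demo_xyz")
except ImportError as error:
    print(error)
```

Checking at construction time means a missing dependency is reported as soon as the reader is created, rather than in the middle of a batch of downloads.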

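Step 06: Loading the video transcripts

The load_data method belongs to the YoutubeTranscriptReader class. It takes a list of YouTube links (ytlinks) and an optional list of preferred languages, fetches the transcript of each video with YouTubeTranscriptApi, and returns one Document per video. The get_transcript call below matches the library interface at the time of writing; if it fails, check the documentation for your installed version of youtube_transcript_api.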

# Method of YoutubeTranscriptReader: indent it one level inside the class body.
def load_data(
    self,
    ytlinks: List[str],
    languages: Optional[List[str]] = None,
    **load_kwargs: Any,
) -> List[Document]:
    """Load transcripts for the given YouTube links.

    Args:
        ytlinks (List[str]): List of YouTube links
            for which transcripts are to be read.
        languages (Optional[List[str]]): Preferred transcript languages,
            in priority order. Defaults to ["en"].
    """
    from youtube_transcript_api import YouTubeTranscriptApi

    # Avoid a mutable default argument; fall back to English.
    if languages is None:
        languages = ["en"]

    results = []
    for link in ytlinks:
        video_id = self._extract_video_id(link)
        if not video_id:
            # Only the first string is an f-string, so {video_id} below
            # is printed literally as a placeholder.
            raise ValueError(
                f"Supplied url {link} is not a supported youtube URL.\n"
                "Supported formats include:\n"
                "  youtube.com/watch?v={video_id} "
                "(with or without 'www.')\n"
                "  youtube.com/embed/{video_id} "
                "(with or without 'www.')\n"
                "  youtu.be/{video_id} (never includes www subdomain)"
            )
        # Each chunk is a dict; its "text" key holds one caption line.
        transcript_chunks = YouTubeTranscriptApi.get_transcript(
            video_id, languages=languages
        )
        chunk_text = [chunk["text"] for chunk in transcript_chunks]
        transcript = "\n".join(chunk_text)
        results.append(Document(text=transcript, extra_info={"video_id": video_id}))
    return results
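
The joining step can be tried offline. The sketch below uses hand-written chunks in place of a real API response; the "text" key is what the method above relies on, while the "start" and "duration" keys and the format_with_timestamps helper are assumptions added for illustration.

```python
from typing import Dict, List, Union

Chunk = Dict[str, Union[str, float]]

# Hand-written stand-in for a fetched transcript (no network access needed).
SAMPLE_CHUNKS: List[Chunk] = [
    {"text": "Hello and welcome.", "start": 0.0, "duration": 2.5},
    {"text": "Today we look at transcripts.", "start": 62.4, "duration": 3.1},
]


def join_chunks(chunks: List[Chunk]) -> str:
    """Join caption lines with newlines, as load_data does."""
    return "\n".join(str(chunk["text"]) for chunk in chunks)


def format_with_timestamps(chunks: List[Chunk]) -> str:
    """Hypothetical variant: prefix each line with its [MM:SS] start time."""
    lines = []
    for chunk in chunks:
        # Drop the fractional part, then split into minutes and seconds.
        minutes, seconds = divmod(int(float(chunk["start"])), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {chunk['text']}")
    return "\n".join(lines)


print(join_chunks(SAMPLE_CHUNKS))
print(format_with_timestamps(SAMPLE_CHUNKS))
```

Keeping the timestamps is optional; the reader in this article discards them and stores only the plain text in each Document.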

Step 07: Extracting the video ID from a link

The _extract_video_id method extracts the video ID from a given YouTube link using the predefined URL patterns.

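# Static method of YoutubeTranscriptReader: indent it inside the class body.
@staticmethod
def _extract_video_id(yt_link: str) -> Optional[str]:
    """Return the video ID captured by the first matching pattern, or None."""
    for pattern in YOUTUBE_URL_PATTERNS:
        match = re.search(pattern, yt_link)
        if match:
            return match.group(1)

    # return None if no match is found
    return None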
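
Two edge cases are worth knowing about. Extra query parameters after the ID are ignored, but a watch link where v is not the first parameter does not match the patterns at all. The sketch below shows both, together with a hypothetical fallback (extract_video_id_from_query, not part of the original reader) that swaps the regular expression for the standard library's urllib.parse.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_URL_PATTERNS = [
    r"^https?://(?:www\.)?youtube\.com/watch\?v=([\w-]+)",
    r"^https?://(?:www\.)?youtube\.com/embed/([\w-]+)",
    r"^https?://youtu\.be/([\w-]+)",
]


def extract_video_id(yt_link: str) -> Optional[str]:
    """Standalone copy of the article's regex-based extraction."""
    for pattern in YOUTUBE_URL_PATTERNS:
        match = re.search(pattern, yt_link)
        if match:
            return match.group(1)
    return None


def extract_video_id_from_query(yt_link: str) -> Optional[str]:
    """Hypothetical fallback: read `v` from the query string of a watch URL."""
    parsed = urlparse(yt_link)
    if parsed.hostname not in {"youtube.com", "www.youtube.com"}:
        return None
    if parsed.path != "/watch":
        return None
    values = parse_qs(parsed.query).get("v")
    return values[0] if values else None


# Made-up links for the demonstration.
with_timestamp = "https://www.youtube.com/watch?v=abc123&t=42s"
v_not_first = "https://www.youtube.com/watch?list=PL1&v=abc123"

print(extract_video_id(with_timestamp))          # the capture stops at "&"
print(extract_video_id(v_not_first))             # no pattern matches
print(extract_video_id_from_query(v_not_first))  # the fallback finds v
```

Whether such a fallback is worth adding depends on where your links come from; for links copied from the address bar of a single video, the original patterns are usually enough.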

CONCLUSION

By following these steps, you can implement a powerful YouTube transcript reader in Python. This opens the door to a wide range of applications, from content analysis to language processing. Experiment with different videos and languages to unlock the full potential of this simple yet effective tool.

Happy coding!

Do you have a project that you want me to assist with? Email me: wilbertmisingo@gmail.com
Have a question or want to be the first to know about my posts?
Follow me on Twitter/X
Follow me on LinkedIn
