Roomal Seferaj

Transform Your Video Transcripts: From Raw to Readable Text

I cannot help but think of all the YouTube videos I have had to watch, simply because I didn't know how to save their transcripts. When I finally learned how, the resulting text was a mess. For instance, here’s a snippet from a required video for my biological psychology course:

Input: I've come here to California on the trail of one of the most infamous doctors of the with century Or Alter Freemen the last Alter Freemen began practicing as a doctor in the 1920s going on to work in one of the last institutions set up to house growing numbers of mentally ill people the Shell Shocked victims of the first world war and inmates with Dreadful psychiatric problems lived out their lives in what were known as snake pits psychiatric hospitals in the 1930s were terrible places to be as a patient and they were terrible places because they were they were places of hopelessness there were really no effective treatments for most mental disorders for the most part these hospitals warehouse patients for long periods of times decades even entire lives Freemen was horrified at the sheer waste of human potential now he started out with with good intentions and here was a terribly serious problem and it wasn't getting any better it was getting worse it was a public health problem Freemen was convinced that the root cause of many of the patients problems lay in the physical structure of their brains so he decided to change them a pulled by what he was seeing in these snake pits Freemen now spent increasing amounts of time in the laboratory coat brains dissecting brains examining brains looking for differences this is the brain received through the courtesy of Washington sanitary Freemen thought that surgery could help patients far more than the current treatments are cut off at the level of the middle of the pond spurred on by the growing understanding of what different regions of the brain do he finally decided that the problem lay in a set of connections between the thalamus and the frontal love the pointer demonstrates the thalamus and the anterior thalamic radiation going to all parts of the frontal L Freemen believed if he could never the connections between the thalamus and the frontal love and this would damper down all those awful emotions and it would 
if you like cure the patients he saw this as surgery of the Soul a way of bringing the Damned back to life but there was a problem Freemen was not himself a surgeon so he got together with a man who was James wants and together they started performing the operation action they called labotomy
[ Music ]

Not bad...
However, the lack of punctuation makes for a less-than-desirable reading experience. To remedy this, I wove together a script that relies on YouTubeTranscriptApi, TextBlob, and the punctuators module.

from punctuators.models import PunctCapSegModelONNX
from textblob import TextBlob
from tqdm import tqdm
from typing import List
from youtube_transcript_api import YouTubeTranscriptApi
import nltk
import spacy

# Initialize models and download necessary resources
m = PunctCapSegModelONNX.from_pretrained("1-800-BAD-CODE/xlm-roberta_punctuation_fullstop_truecase")
nltk.download('punkt')  # sentence tokenizer data; newer NLTK releases may ask for 'punkt_tab' instead
nlp = spacy.load("en_core_web_sm")

def nest_sentences(document: str, max_length: int = 1024) -> List[str]:
    """
    Group sentences into chunks of fewer than max_length characters.
    A single sentence longer than max_length becomes its own chunk.
    """
    nested, sent, length = [], [], 0
    for sentence in nltk.sent_tokenize(document):
        length += len(sentence)
        if length < max_length:
            sent.append(sentence)
        else:
            # Flush the current group, but never emit an empty chunk
            # (this happens when the very first sentence is too long)
            if sent:
                nested.append(" ".join(sent))
            sent = [sentence]
            length = len(sentence)
    if sent:
        nested.append(" ".join(sent))
    return nested

def process_transcript(video_id: str) -> str:
    """
    Retrieve the English transcript of a YouTube video and join its entries.
    """
    transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
    # Pick one English transcript; looping over every listed transcript would
    # concatenate all available languages and repeat the content.
    transcript = transcript_list.find_transcript(["en"])
    transcript_data = transcript.fetch()
    # Each entry is read as a dict with a "text" key
    return " ".join(entry["text"] for entry in transcript_data)

def filter_tokens(text: str) -> str:
    """
    Drop whitespace-only tokens and rejoin the remaining spaCy tokens
    with single spaces.
    """
    doc = nlp(text)
    return " ".join(token.text for token in doc if not token.is_space)

def correct_text(text: str) -> str:
    """
    Correct spelling with TextBlob. The correction is word-level, so it
    may also alter names and technical terms.
    """
    blob = TextBlob(text)
    return str(blob.correct())

def punctuate_text(texts: List[str]) -> List[List[str]]:
    """
    Punctuate and segment the texts using the pre-trained model.
    With apply_sbd=True, each input text yields a list of sentences.
    """
    return m.infer(texts=texts, apply_sbd=True)

def main(video_id: str):
    """
    Main processing function.
    """
    transcript_text = process_transcript(video_id)
    filtered_text = filter_tokens(transcript_text)
    corrected_text_str = correct_text(filtered_text)
    nested_sentences = nest_sentences(corrected_text_str)

    results = punctuate_text(nested_sentences)

    for input_text, output_texts in tqdm(zip(nested_sentences, results), desc="Processing", total=len(nested_sentences)):
        print(f"Input: {input_text}")
        print("Outputs:")
        for text in output_texts:
            print(f"\t{text}")
        print()

if __name__ == "__main__":
    video_id = "CUgtGjA6VvA"
    main(video_id)
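
To see what the transcript-joining step does without calling YouTube, here is a minimal offline sketch. The `join_entries` helper and the sample entries are made up for illustration; only the "text" key is taken from the script above, and the "start" and "duration" fields are assumed.

```python
from typing import Dict, List, Union

# Hypothetical entry shape; the script above only reads the "text" key.
Entry = Dict[str, Union[str, float]]


def join_entries(entries: List[Entry]) -> str:
    """Join caption fragments into one string separated by single spaces."""
    joined = " ".join(str(entry["text"]) for entry in entries)
    # split() with no arguments also collapses newlines and repeated spaces
    return " ".join(joined.split())


# Made-up sample data for illustration
sample = [
    {"text": "psychiatric hospitals in the 1930s", "start": 0.0, "duration": 2.5},
    {"text": "were terrible places\nto be as a patient", "start": 2.5, "duration": 3.0},
]

print(join_entries(sample))
```

Normalizing whitespace at this point keeps stray line breaks inside caption fragments from reaching the later steps.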

Script Summary and Utility

This script processes transcripts from YouTube videos and makes them more readable by adding punctuation, correcting spelling, and splitting the text into sentences. Here's a breakdown of its components and why it's useful:

Key Components:

  1. Import Libraries:

    • PunctCapSegModelONNX from punctuators.models: Adds punctuation and capitalization to the text.
    • TextBlob: Corrects spelling errors in the text.
    • YouTubeTranscriptApi: Fetches transcripts from YouTube videos.
    • nltk and spacy: Tokenize the text so it can be cleaned up and split into sentences.
  2. Initialization:

    • Load the pre-trained punctuation and capitalization model.
    • Download necessary NLTK resources and load spaCy's English language model.
  3. Functions:

    • nest_sentences(document: str, max_length: int = 1024) -> List[str]: Groups sentences into chunks of fewer than max_length characters so that each chunk keeps its surrounding context.
    • process_transcript(video_id: str) -> str: Retrieves the transcript for a given YouTube video ID and joins its entries into one string.
    • filter_tokens(text: str) -> str: Drops whitespace-only tokens with spaCy and rejoins the rest with single spaces.
    • correct_text(text: str) -> str: Uses TextBlob to correct spelling errors in the text.
    • punctuate_text(texts: List[str]) -> List[List[str]]: Applies punctuation, capitalization, and sentence segmentation with the pre-trained model, returning a list of sentences for each input text.
  4. Main Function (main(video_id: str)):

    • Retrieves the YouTube video transcript.
    • Processes the transcript by filtering tokens, correcting text, nesting sentences, and applying punctuation.
    • Prints the input and processed text for each nested segment.
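
The chunking idea behind nest_sentences can be tried on its own with the sketch below. It is an illustration under assumptions, not the script's exact code: `split_sentences` is a naive regex stand-in for nltk.sent_tokenize so that nothing needs to be downloaded, and `chunk_sentences` is a hypothetical name.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_sentences(text: str, max_length: int = 1024) -> List[str]:
    """Group sentences so each chunk stays under max_length characters where possible."""
    chunks: List[str] = []
    current: List[str] = []
    length = 0
    for sentence in split_sentences(text):
        # Start a new chunk when adding this sentence would reach the limit
        if current and length + len(sentence) >= max_length:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += len(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


text = (
    "It was a public health problem. But there was a problem. "
    "Freemen was not himself a surgeon."
)
for chunk in chunk_sentences(text, max_length=60):
    print(chunk)
```

One thing this makes visible: text with no sentence-ending punctuation, such as a raw transcript, comes back as a single chunk no matter how long it is, because there is nothing to split on.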

Why This Script is Useful:

  1. Improves Readability:

    • Adds punctuation and capitalization, transforming raw transcripts into more readable text.
  2. Corrects Errors:

    • Uses TextBlob to automatically correct spelling errors, though names and specialist terms can be changed by mistake.
  3. Ensures Proper Segmentation:

    • Splits text into manageable segments to maintain context and readability, especially useful for long transcripts.
  4. Automates Transcript Processing:

    • Simplifies the process of retrieving and enhancing YouTube video transcripts, saving time and effort for users.
  5. Educational Tool:

    • Can be included in a student package toolset to aid in processing and analyzing video transcripts for study purposes, making lecture notes or online video content more accessible and easier to study from.
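
For study notes, printing to the console is only half the job. The sketch below saves punctuated output to a text file. It assumes the results have the shape that main loops over (one list of sentences per chunk); `save_notes` and the file name are hypothetical and not part of the original script.

```python
import tempfile
from pathlib import Path
from typing import List, Union


def save_notes(results: List[List[str]], path: Union[str, Path]) -> Path:
    """Write one sentence per line, with a blank line between chunks."""
    blocks = ["\n".join(sentences) for sentences in results if sentences]
    out = Path(path)
    out.write_text("\n\n".join(blocks) + "\n", encoding="utf-8")
    return out


# Sample sentences copied from the processed output shown in this post
sample_results = [
    ["It was a public health problem.", "But there was a problem."],
    ["There were really no effective treatments for most mental disorders."],
]

# Written to the system temp directory here; pick any path you like
notes_path = save_notes(sample_results, Path(tempfile.gettempdir()) / "transcript_notes.txt")
print(notes_path.read_text(encoding="utf-8"))
```

Keeping a blank line between chunks makes it easy to see where one block of the transcript ends and the next begins.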

Below is the output after the video's original closed captions were processed:

Outputs:
I've come here to California on the trail of one of the most infamous doctors of the with century.
Or Alter Freemen, the last Alter Freemen began practicing as a doctor in the 1920s, going on to work in one of the last institutions set up to house growing numbers of mentally ill people.
The Shell Shocked victims of the First World War and inmates with Dreadful psychiatric problems lived out their lives in what were known as snake pits.
Psychiatric hospitals in the 1930s were terrible places to be as a patient, and they were terrible places because they were, they were places of hopelessness.
There were really no effective treatments for most mental disorders.
For the most part, these hospitals warehouse patients for long periods of times, decades, even entire lives.
Freemen was horrified at the sheer waste of human potential.
Now he started out with with good intentions.
And here was a terribly serious problem, and it wasn't getting any better, it was getting worse.
It was a public health problem.
Freemen was convinced that the root cause of many of the patients problems lay in the physical structure of their brains, so he decided to change them, a pulled by what he was seeing in these snake pits.
Freemen now spent increasing amounts of time in the laboratory, coat brains, dissecting brains, examining brains, looking for differences.
This is the brain received through the courtesy of Washington Sanitary.
Freemen thought that surgery could help patients far more than the current treatments are cut off at the level of the middle of the pond.
Spurred on by the growing understanding of what different regions of the brain do, he finally decided that the problem lay in a set of connections.
Between the thalamus and the frontal love The pointer demonstrates the thalamus and the anterior thalamic radiation going to all parts of the frontal.
Freemen believed if he could, never the connections between the thalamus and the frontal love.
And this would damper down all those awful emotions, and it would, if you like, cure the patients.
He saw this as Surgery of the Soul, a way of bringing the Damned back to life.
But there was a problem.
Freemen was not himself a surgeon, so he got together with a man who was James Wants, and together, they started performing the operation action they called lobotomy
[ Music ].

Overall, the quality is significantly better! All things considered, this script is a valuable tool for anyone looking to enhance the quality and readability of YouTube video transcripts, making it especially useful for students, researchers, and content creators. The script could also be extended with a transformer-based translation model to translate the transcripts, too!

Till next time,

Roomal
