Automating Content Research with Python and YouTube Transcripts

I spend a lot of time researching topics before writing about them. YouTube is one of my best sources — experts share detailed knowledge in videos that often isn't available in written form.

The problem: watching videos is slow. I read 3x faster than people speak. So I built a Python script that automates my entire research workflow.

The Goal

Given a topic, I want to:

  1. Find the top YouTube videos about it
  2. Extract their transcripts
  3. Generate a research summary combining all sources
  4. Output key points, quotes, and an article outline

Prerequisites

pip install youtube-transcript-api google-api-python-client openai

You'll need:

  • A YouTube Data API key (free from Google Cloud Console)
  • An OpenAI API key
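
The code in the following steps assumes both keys are available at runtime. A minimal sketch, assuming you export them as environment variables (the name YOUTUBE_API_KEY is my own choice; OPENAI_API_KEY is the name the OpenAI client reads by default):

import os

# Assumed variable names; set these in your shell before running the script.
API_KEY = os.environ['YOUTUBE_API_KEY']   # used by the YouTube Data API client in Step 1
# The OpenAI client in Step 3 picks up OPENAI_API_KEY automatically when you call OpenAI().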

Step 1: Find Relevant Videos

from googleapiclient.discovery import build

def search_youtube(query, max_results=5):
    # API_KEY is the YouTube Data API key loaded from the environment in Prerequisites
    youtube = build('youtube', 'v3', developerKey=API_KEY)

    request = youtube.search().list(
        part='snippet',
        q=query,
        type='video',
        maxResults=max_results,
        order='relevance',
        videoDuration='medium',  # 4-20 minutes
        relevanceLanguage='en'
    )

    response = request.execute()

    videos = []
    for item in response['items']:
        videos.append({
            'id': item['id']['videoId'],
            'title': item['snippet']['title'],
            'channel': item['snippet']['channelTitle'],
        })

    return videos
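
A quick way to check what comes back (the query is just the example topic I use later):

videos = search_youtube("content repurposing strategies")
for v in videos:
    print(f"{v['title']} | {v['channel']} | https://www.youtube.com/watch?v={v['id']}")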

Step 2: Extract Transcripts

from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript(video_id):
    try:
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        return ' '.join([entry['text'] for entry in transcript])
    except Exception as e:
        print(f"Could not get transcript for {video_id}: {e}")
        return None

For production use, I'd recommend ScripTube (scriptube.me), which handles edge cases and formatting automatically. But for a personal script, the library works.
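
One edge case worth knowing about: not every video has an English transcript. The library can list the transcripts that do exist and translate one if needed. A sketch of a fallback, assuming the same static youtube-transcript-api interface used above:

def get_english_transcript(video_id):
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        try:
            # Prefer an English transcript (manual or auto-generated)
            transcript = transcript_list.find_transcript(['en'])
        except Exception:
            # Otherwise take whatever is available and translate it to English
            transcript = next(iter(transcript_list)).translate('en')
        return ' '.join(entry['text'] for entry in transcript.fetch())
    except Exception as e:
        print(f"Could not get transcript for {video_id}: {e}")
        return None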

Step 3: AI-Powered Research Summary

from openai import OpenAI

client = OpenAI()

def generate_research_summary(topic, transcripts):
    sources = ""
    for t in transcripts:
        sources += f"\n\n--- Source: {t['title']} by {t['channel']} ---\n"
        sources += t['text'][:3000]  # Truncate for token limits

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "system",
            "content": "You are a research assistant. Synthesize information from multiple sources."
        }, {
            "role": "user",
            "content": f"""Topic: {topic}

Here are transcripts from {len(transcripts)} YouTube videos on this topic:
{sources}

Please provide:
1. A comprehensive summary of the key points across all sources
2. Where the sources agree
3. Where they disagree or offer different perspectives
4. 5-8 notable quotes (with attribution)
5. A suggested article outline for a blog post on this topic
6. Questions that weren't answered by any source"""
        }]
    )

    return response.choices[0].message.content
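
The flat 3,000-character cut per source is blunt: with many sources the prompt can still overflow, and with few sources you throw away context you could have kept. A small refinement is to share one overall character budget across sources (pure Python; the 12,000-character default is my own rough stand-in for the model's context limit, not a measured value):

def build_sources(transcripts, total_chars=12000):
    # Give every source an equal slice of a shared character budget,
    # so the prompt size stays roughly constant regardless of how many
    # transcripts came back.
    per_source = total_chars // max(1, len(transcripts))
    sources = ""
    for t in transcripts:
        sources += f"\n\n--- Source: {t['title']} by {t['channel']} ---\n"
        sources += t['text'][:per_source]
    return sources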

Step 4: Putting It Together

def research_topic(topic):
    print(f"Researching: {topic}")
    print("=" * 50)

    # Find videos
    print("Searching YouTube...")
    videos = search_youtube(topic, max_results=5)
    print(f"Found {len(videos)} videos")

    # Extract transcripts
    print("Extracting transcripts...")
    transcripts = []
    for video in videos:
        text = get_transcript(video['id'])
        if text:
            transcripts.append({
                'title': video['title'],
                'channel': video['channel'],
                'text': text
            })
            print(f"{video['title']}")
        else:
            print(f"{video['title']} (no transcript)")

    # Generate summary
    print("\nGenerating research summary...")
    summary = generate_research_summary(topic, transcripts)

    # Save output
    filename = f"research_{topic.replace(' ', '_')}.md"
    with open(filename, 'w') as f:
        f.write(f"# Research: {topic}\n\n")
        f.write(f"## Sources\n")
        for t in transcripts:
            f.write(f"- {t['title']} by {t['channel']}\n")
        f.write(f"\n## Summary\n\n{summary}")

    print(f"\nSaved to {filename}")
    return summary

# Run it
research_topic("content repurposing strategies")
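
If you run this regularly, a tiny command-line wrapper beats editing the topic string at the bottom of the file. A sketch using argparse from the standard library (research.py is just whatever filename you saved the script under):

import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Automated YouTube transcript research")
    parser.add_argument("topic", help="the topic to research")
    args = parser.parse_args()
    research_topic(args.topic)

# Usage: python research.py "content repurposing strategies"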

Results

Running this on "content repurposing strategies" gave me:

  • 5 video transcripts totaling ~25,000 words of expert knowledge
  • A synthesized summary highlighting where experts agree and disagree
  • 7 quotable insights with attribution
  • A complete article outline

Total time: about 2 minutes (mostly API calls).

Writing the blog post from this research: about 45 minutes.

Extending the Script

Ideas for improvement:

  • Add caching to avoid re-extracting transcripts you've already processed (see the sketch after this list)
  • Filter videos by minimum view count for quality
  • Export to Notion or Google Docs instead of markdown
  • Build a simple web UI with Streamlit
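
The caching idea is the easiest win, since a video's transcript doesn't change after it's published. A minimal sketch, assuming a local cache/ directory and reusing get_transcript from Step 2:

from pathlib import Path

CACHE_DIR = Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def get_transcript_cached(video_id):
    # Reuse a transcript saved on a previous run instead of re-fetching it
    cache_file = CACHE_DIR / f"{video_id}.txt"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = get_transcript(video_id)
    if text:
        cache_file.write_text(text, encoding="utf-8")
    return text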

The core insight: YouTube contains more expert knowledge than any library, and it's all accessible via transcripts. Automating the extraction turns YouTube from a time sink into a research superpower.

