YouTube holds an enormous amount of publicly available data — video metadata, channel statistics, comments, and transcripts. Whether you're building a content research tool, training an NLP model, or doing competitive analysis, extracting this data programmatically is a valuable skill.
In this guide, I'll walk through every major approach to scraping YouTube in 2026, including the official API, popular Python libraries, and practical workarounds for the limitations you'll hit.
The YouTube Data API v3: Start Here
Google's YouTube Data API v3 is the most reliable way to access video metadata, channel info, and comments. It's free to use with a Google Cloud project, but comes with strict quota limits.
Setting Up
from googleapiclient.discovery import build
API_KEY = "YOUR_API_KEY"
youtube = build("youtube", "v3", developerKey=API_KEY)
You'll need to create a project in the Google Cloud Console, enable the YouTube Data API v3, and generate an API key.
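Hardcoding the key is fine for a quick test, but a key committed to a repository is easy to leak. Here is a minimal sketch that reads it from the environment instead. The variable name YOUTUBE_API_KEY and the helper load_api_key are my own choices for illustration, not something the API requires.

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Usage (commented out so pasting the snippet has no side effects):
# youtube = build("youtube", "v3", developerKey=load_api_key())
```

Failing loudly when the variable is missing tends to be easier to debug than a rejected request later on.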
Searching for Videos
def search_videos(query, max_results=50):
    videos = []
    request = youtube.search().list(
        q=query,
        part="snippet",
        type="video",
        maxResults=min(max_results, 50),
        order="relevance"
    )
    response = request.execute()
    for item in response["items"]:
        videos.append({
            "video_id": item["id"]["videoId"],
            "title": item["snippet"]["title"],
            "channel": item["snippet"]["channelTitle"],
            "published_at": item["snippet"]["publishedAt"],
            "description": item["snippet"]["description"]
        })
    return videos

results = search_videos("python web scraping tutorial")
Getting Detailed Video Statistics
The search endpoint doesn't return view counts or likes. You need a separate videos.list call:
def get_video_details(video_ids):
    request = youtube.videos().list(
        part="statistics,contentDetails,snippet",
        id=",".join(video_ids)
    )
    response = request.execute()
    details = []
    for item in response["items"]:
        details.append({
            "video_id": item["id"],
            "title": item["snippet"]["title"],
            "views": int(item["statistics"].get("viewCount", 0)),
            "likes": int(item["statistics"].get("likeCount", 0)),
            "comments": int(item["statistics"].get("commentCount", 0)),
            "duration": item["contentDetails"]["duration"]
        })
    return details
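The duration field comes back as an ISO 8601 duration string such as PT4M13S, which is awkward to sort or filter on. Below is a small sketch that converts it to seconds. The helper name duration_to_seconds is mine, and I'm assuming only day, hour, minute, and second components appear.

```python
import re

# Matches strings like "PT4M13S", "PT1H2M3S", or "P1DT2H"
_DURATION_RE = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(iso_duration):
    """Convert an ISO 8601 duration string to a whole number of seconds."""
    match = _DURATION_RE.match(iso_duration)
    if not match:
        raise ValueError(f"Unrecognized duration: {iso_duration!r}")
    # Components that are absent from the string count as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return (
        parts["days"] * 86400
        + parts["hours"] * 3600
        + parts["minutes"] * 60
        + parts["seconds"]
    )


print(duration_to_seconds("PT4M13S"))  # 253
```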
The Quota Problem
Here's where it gets painful. The YouTube API gives you 10,000 quota units per day by default. Different operations cost different amounts:
| Operation | Cost |
|---|---|
| search.list | 100 units |
| videos.list | 1 unit |
| channels.list | 1 unit |
| commentThreads.list | 1 unit |
A single search.list call burns 100 units, so the default quota covers only 100 searches per day (10,000 ÷ 100), and each additional page of results counts as another call. That's a hard ceiling for any serious project.
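To make the arithmetic concrete, here is a tiny budget calculator built from the costs in the table above. The function names are mine, and the numbers are just the defaults quoted in this section, so adjust them if your project has a different quota.

```python
DAILY_QUOTA = 10_000

# Unit costs copied from the table above
COSTS = {
    "search.list": 100,
    "videos.list": 1,
    "channels.list": 1,
    "commentThreads.list": 1,
}


def quota_used(calls):
    """Total units consumed by a mapping of operation name -> call count."""
    return sum(COSTS[operation] * count for operation, count in calls.items())


def remaining_calls(operation, calls_so_far=None, daily_quota=DAILY_QUOTA):
    """How many more calls of `operation` fit into today's budget."""
    used = quota_used(calls_so_far or {})
    return max(daily_quota - used, 0) // COSTS[operation]


print(remaining_calls("search.list"))                       # 100
print(remaining_calls("videos.list", {"search.list": 50}))  # 5000
```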
Workarounds:
- Use videos.list with known video IDs instead of search.list when possible (1 unit vs 100)
- Cache results aggressively
- Request a quota increase (Google sometimes grants 50K-100K for legitimate projects)
- Use multiple API keys across different Google Cloud projects
- Supplement the API with other extraction methods (see below)
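The first workaround stretches further when IDs are batched, since get_video_details already joins IDs into a single comma-separated request. As far as I know, videos.list accepts up to 50 IDs per call, so treat the batch size below as an assumption to verify. The helper names chunked and fetch_in_batches are mine.

```python
def chunked(items, size=50):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(video_ids, fetch_batch, size=50):
    """Call `fetch_batch` once per batch and concatenate the results.

    `fetch_batch` is any callable that takes a list of IDs and returns a
    list of results, for example get_video_details from earlier.
    """
    # Drop duplicate IDs while keeping the original order
    unique_ids = list(dict.fromkeys(video_ids))
    results = []
    for batch in chunked(unique_ids, size):
        results.extend(fetch_batch(batch))
    return results


# Usage: details = fetch_in_batches(all_ids, get_video_details)
```

With this shape, 120 unique IDs cost three videos.list calls rather than 120.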
yt-dlp: The Swiss Army Knife
yt-dlp is the community-maintained fork of youtube-dl, and it's incredibly powerful for metadata extraction — not just downloading.
import yt_dlp
def get_video_info(url):
    ydl_opts = {
        "quiet": True,
        "skip_download": True,  # documented YoutubeDL option name for metadata-only runs
        "extract_flat": False
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(url, download=False)
        return {
            "title": info.get("title"),
            "views": info.get("view_count"),
            "likes": info.get("like_count"),
            "duration": info.get("duration"),
            "upload_date": info.get("upload_date"),
            "channel": info.get("channel"),
            "subscriber_count": info.get("channel_follower_count"),
            "description": info.get("description"),
            "tags": info.get("tags"),
            "categories": info.get("categories")
        }

video = get_video_info("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
The key advantage: no API quota limits. yt-dlp works by parsing YouTube's web pages and internal APIs directly.
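Because the values are scraped rather than served by a typed API, some cleanup usually helps. For example, yt-dlp reports upload_date as a YYYYMMDD string, and fields like tags can be missing. Here is a rough sketch of a normalizer for the dict that get_video_info returns. Both helper names are hypothetical.

```python
from datetime import datetime


def parse_upload_date(upload_date):
    """Convert a 'YYYYMMDD' string to a datetime.date; return None if missing."""
    if not upload_date:
        return None
    return datetime.strptime(upload_date, "%Y%m%d").date()


def normalize_video(video):
    """Return a copy of a get_video_info()-style dict with friendlier types."""
    cleaned = dict(video)
    cleaned["upload_date"] = parse_upload_date(video.get("upload_date"))
    # Use an empty list so callers can iterate without a None check
    cleaned["tags"] = video.get("tags") or []
    return cleaned
```

Counts such as likes are left untouched on purpose, so a hidden value (None) stays distinguishable from a genuine zero.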
Scraping Entire Channels
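def scrape_channel_videos(channel_url):
    ydl_opts = {
        "quiet": True,
        "skip_download": True,  # documented YoutubeDL option name for metadata-only runs
        "extract_flat": True,   # list entries without resolving each video
        "playlistend": 100
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(channel_url, download=False)
        videos = []
        for entry in info.get("entries", []):
            videos.append({
                "id": entry.get("id"),
                "title": entry.get("title"),
                "url": entry.get("url")
            })
        return videos

# Pointing at the /videos tab is the safer choice: in my experience the channel
# root can return the channel's tabs as entries instead of individual videos.
channel_videos = scrape_channel_videos("https://www.youtube.com/@mkbhd/videos")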
Important note: yt-dlp gets updated frequently to keep up with YouTube's changes. Always install the latest version:
pip install -U yt-dlp
Extracting Transcripts
Transcripts are gold for NLP work, content analysis, and building search indexes. The youtube-transcript-api library makes this straightforward:
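from youtube_transcript_api import YouTubeTranscriptApi, NoTranscriptFound

# Note: list_transcripts() is the static interface of youtube-transcript-api
# releases before 1.0; newer releases use instance methods instead.
def get_transcript(video_id, language="en"):
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        try:
            # Prefer a human-written transcript when one exists
            transcript = transcript_list.find_manually_created_transcript([language])
        except NoTranscriptFound:
            # Fall back to the auto-generated track
            transcript = transcript_list.find_generated_transcript([language])
        data = transcript.fetch()
        full_text = " ".join(entry["text"] for entry in data)
        return {
            "text": full_text,
            "segments": data,
            "language": language,
            "is_generated": transcript.is_generated
        }
    except Exception as e:
        return {"error": str(e)}

result = get_transcript("dQw4w9WgXcQ")
if "error" in result:
    print(result["error"])
else:
    print(result["text"][:500])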
Batch Transcript Extraction
For large-scale transcript collection:
import time
import json
def batch_transcripts(video_ids, output_file="transcripts.jsonl"):
    results = []
    for i, vid in enumerate(video_ids):
        print(f"Processing {i+1}/{len(video_ids)}: {vid}")
        transcript = get_transcript(vid)
        transcript["video_id"] = vid
        results.append(transcript)
        with open(output_file, "a") as f:
            f.write(json.dumps(transcript) + "\n")
        time.sleep(1)
    success = sum(1 for r in results if "error" not in r)
    print(f"Done: {success}/{len(video_ids)} transcripts extracted")
    return results
Not every video has transcripts available. Auto-generated transcripts cover most English-language content, but you'll see gaps with music, very short clips, and some older videos.
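For search indexes and NLP pipelines, one long string per video is often less useful than time-aligned chunks. Here is a sketch that groups segments into fixed windows. It assumes each segment is a dict with "text" and "start" keys (the shape the segments list above has when fetch() returns plain dicts), and window_transcript is a name I made up.

```python
def window_transcript(segments, window_seconds=60):
    """Group transcript segments into consecutive time windows.

    Each segment is assumed to be a dict with "text" and "start" (seconds).
    Windows with no speech are simply omitted from the result.
    """
    windows = {}
    for segment in segments:
        index = int(segment["start"] // window_seconds)
        windows.setdefault(index, []).append(segment["text"].strip())
    return [
        {"start": index * window_seconds, "text": " ".join(texts)}
        for index, texts in sorted(windows.items())
    ]


# Usage: chunks = window_transcript(result["segments"], window_seconds=30)
```

Each chunk keeps its start time, so a search hit can link straight to the right moment in the video.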
Scraping Comments
Comments are useful for sentiment analysis, finding common questions, and understanding audience engagement.
Using the Official API
def get_comments(video_id, max_results=100):
    comments = []
    request = youtube.commentThreads().list(
        part="snippet",
        videoId=video_id,
        maxResults=min(max_results, 100),
        order="relevance",
        textFormat="plainText"
    )
    while request and len(comments) < max_results:
        response = request.execute()
        for item in response["items"]:
            comment = item["snippet"]["topLevelComment"]["snippet"]
            comments.append({
                "author": comment["authorDisplayName"],
                "text": comment["textDisplay"],
                "likes": comment["likeCount"],
                "published_at": comment["publishedAt"]
            })
        request = youtube.commentThreads().list_next(request, response)
    return comments[:max_results]
Using yt-dlp for Comments (No Quota)
yt-dlp can also extract comments without API quota costs:
def get_comments_ytdlp(video_url, max_comments=200):
    ydl_opts = {
        "quiet": True,
        "skip_download": True,  # documented YoutubeDL option name for metadata-only runs
        "getcomments": True,
        "extractor_args": {
            "youtube": {
                "max_comments": [str(max_comments)]
            }
        }
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(video_url, download=False)
        comments = []
        # "comments" may be missing or None when comments are disabled
        for c in info.get("comments") or []:
            comments.append({
                "author": c.get("author"),
                "text": c.get("text"),
                "likes": c.get("like_count"),
                "timestamp": c.get("timestamp")
            })
        return comments
This is slower than the API but has no daily limits.
Channel Analytics Scraping
For competitive analysis, you often need channel-level data:
def get_channel_stats(channel_id):
    request = youtube.channels().list(
        part="statistics,snippet,contentDetails,brandingSettings",
        id=channel_id
    )
    response = request.execute()
    # "items" can be missing or empty when no channel matches the ID
    items = response.get("items")
    if not items:
        return None
    channel = items[0]
    stats = channel["statistics"]
    return {
        "name": channel["snippet"]["title"],
        "description": channel["snippet"]["description"],
        # Counts can be absent, e.g. when a channel hides its subscriber count
        "subscribers": int(stats.get("subscriberCount", 0)),
        "total_views": int(stats.get("viewCount", 0)),
        "video_count": int(stats.get("videoCount", 0)),
        "created_at": channel["snippet"]["publishedAt"],
        "uploads_playlist": channel["contentDetails"]["relatedPlaylists"]["uploads"]
    }
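The uploads_playlist value is worth more than it looks: paging through that playlist with playlistItems.list is, by my reading of the quota table, a 1-unit call per page, versus 100 units for a search. Here is a sketch that collects video IDs from it. The function name list_upload_ids and the limit parameter are mine, and youtube is the client created with build() earlier.

```python
def list_upload_ids(youtube, uploads_playlist_id, limit=200):
    """Collect up to `limit` video IDs from a channel's uploads playlist.

    `youtube` is the API client created with build(); it is passed in
    explicitly so the function is easy to test with a stand-in object.
    """
    video_ids = []
    request = youtube.playlistItems().list(
        part="contentDetails",
        playlistId=uploads_playlist_id,
        maxResults=50,
    )
    # list_next() returns None once there are no more pages
    while request is not None and len(video_ids) < limit:
        response = request.execute()
        for item in response.get("items", []):
            video_ids.append(item["contentDetails"]["videoId"])
        request = youtube.playlistItems().list_next(request, response)
    return video_ids[:limit]


# Usage:
# stats = get_channel_stats(channel_id)
# ids = list_upload_ids(youtube, stats["uploads_playlist"])
```

The resulting IDs can then go to get_video_details for statistics, keeping the whole channel crawl off the search endpoint.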
Scaling Up: When Scripts Aren't Enough
The approaches above work well for small to medium projects. But if you need to scrape thousands of videos regularly, you'll run into challenges:
- Rate limiting: YouTube will temporarily block IPs that send too many requests
- Maintenance: YouTube changes its internal APIs frequently, breaking scrapers
- Infrastructure: Running scrapers 24/7 requires proxy management and monitoring
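At script scale, the usual first response to rate limiting is to retry with exponential backoff rather than hammering the endpoint. Here is a generic sketch that is not tied to any particular library. The function name, the default delays, and the 10% jitter are all my own choices.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure.

    `sleep` is injectable so the behaviour can be tested without waiting.
    The last failure is re-raised once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            delay = min(base_delay * 2 ** attempt, max_delay)
            # A little random jitter keeps parallel workers from retrying in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage:
# info = retry_with_backoff(lambda: get_video_info(url))
```

In real use you would probably catch only the errors that signal throttling, so genuine failures such as a deleted video are not retried.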
For production-scale YouTube scraping, managed solutions save significant engineering time. I built a YouTube Scraper on Apify that handles all of this — proxies, retries, anti-bot detection — so you can focus on what you do with the data instead of maintaining scraper infrastructure.
Best Practices
Respect rate limits. Whether using the API or scraping directly, add delays between requests. YouTube will temporarily block aggressive scrapers.
Cache everything. Video metadata doesn't change frequently. Store results in a database and only re-fetch what you need.
Use the right tool for the job. Official API for structured queries with low volume. yt-dlp for metadata extraction at scale. youtube-transcript-api for transcripts.
Handle errors gracefully. Videos get deleted, channels go private, transcripts aren't available. Your code should handle all of these.
Stay current. YouTube regularly updates its anti-bot measures. Keep yt-dlp updated and monitor your scraper's success rate.
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("yt_scraper")
def scrape_with_monitoring(video_ids):
    success, failed = 0, 0
    for vid in video_ids:
        try:
            get_video_info(f"https://youtube.com/watch?v={vid}")
            success += 1
        except Exception as e:
            logger.warning(f"Failed {vid}: {e}")
            failed += 1
    total = success + failed
    if total == 0:
        # Avoid dividing by zero when the input list is empty
        logger.info("No videos to process")
        return
    rate = success / total * 100
    logger.info(f"Success rate: {rate:.1f}% ({success}/{total})")
    if rate < 90:
        logger.warning("Success rate below 90%; check for blocks or API changes")
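The "cache everything" advice above can be sketched with nothing more than SQLite from the standard library. This is one possible shape, not the only one: the class name, table layout, and one-day default TTL are assumptions for illustration.

```python
import json
import sqlite3
import time


class MetadataCache:
    """Tiny SQLite-backed cache that re-fetches entries older than a TTL."""

    def __init__(self, path=":memory:", ttl_seconds=86400, clock=time.time):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(key TEXT PRIMARY KEY, fetched_at REAL, payload TEXT)"
        )
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested

    def get_or_fetch(self, key, fetch):
        """Return the cached value for `key`, calling fetch(key) if stale."""
        row = self.conn.execute(
            "SELECT fetched_at, payload FROM cache WHERE key = ?", (key,)
        ).fetchone()
        now = self.clock()
        if row and now - row[0] < self.ttl:
            return json.loads(row[1])
        value = fetch(key)  # must return something JSON-serializable
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, now, json.dumps(value)),
        )
        self.conn.commit()
        return value


# Usage:
# cache = MetadataCache("youtube_cache.db")
# info = cache.get_or_fetch(url, get_video_info)
```

Passing a file path instead of ":memory:" makes the cache survive between runs, which is where the quota and bandwidth savings come from.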
Conclusion
YouTube scraping in 2026 comes down to combining the right tools: the official API for structured, quota-limited access; yt-dlp for flexible metadata extraction; and specialized libraries for transcripts and comments. Start with the simplest approach that meets your needs, and scale up from there.
For production workloads where you don't want to deal with proxy rotation and maintenance, check out the YouTube Scraper on Apify — it handles the infrastructure so you can focus on the data.
Happy scraping, and remember to be respectful of rate limits and terms of service.
Have questions or want to share your YouTube scraping setup? Drop a comment below.