🧩 The problem
As promised in the previous post, let's add support for reading RSS feeds from YouTube, so we can download only the most recent videos that are not yet present locally. This way we avoid touching the ones already downloaded.
It's pretty straightforward if we follow what was done in the first post of this series. In fact, all the feed-reading code stays the same. What changes is that we now compare video IDs instead of Markdown file contents.
⚠️ Warning
⚠️⚠️ Before continuing, please only mirror content you have permission to... ⚠️⚠️
✅ The solution
Before starting, let's not forget the Python requirements.txt file:
yt-dlp
feedparser>=6,<7
We only need these two dependencies for the moment.
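If you want to follow along, you can install them inside a virtual environment, which is also the same .venv directory the systemd service further down activates:
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt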
🆔 YouTube video ID extraction
This first function can be imported verbatim from the original script. What it does is very simple: given a YouTube video URL, it returns the video ID. For example:
get_youtube_video_id('https://www.youtube.com/watch?v=PwEMWnuMdac')
returns
PwEMWnuMdac
Here it is:
# Import this from the generate_posts.py script.
def get_youtube_video_id(video_url: str) -> str | None:
    parsed_url = urlparse(video_url)

    # Check the path and query parameters.
    if 'youtube.com' in parsed_url.netloc:
        # Standard YouTube URLs.
        if parsed_url.path == '/watch':
            # parse_qs returns a dict.
            query_params = parse_qs(parsed_url.query)
            return query_params.get('v', [None])[0]
        elif parsed_url.path.startswith('/shorts'):
            return pathlib.Path(parsed_url.path).stem

    return None
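As a quick sanity check, here is what the function returns for the two URL styles it understands, plus a non-YouTube URL. The Shorts URL is made up just to exercise the second branch:
# Standard watch URL.
print(get_youtube_video_id('https://www.youtube.com/watch?v=PwEMWnuMdac'))  # PwEMWnuMdac
# Hypothetical Shorts URL, same ID.
print(get_youtube_video_id('https://www.youtube.com/shorts/PwEMWnuMdac'))   # PwEMWnuMdac
# Anything else falls through to None.
print(get_youtube_video_id('https://example.com/video'))                    # None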
📶 YouTube URL feeds
The feed-reading function has been simplified a lot because we only need the video URLs; none of the other fields are used.
def extract_feed_youtube_urls(feed_source: str) -> list[str]:
    data = feedparser.parse(feed_source)
    return [d['link'] for d in data.entries if 'link' in d]
Remember the power of list comprehensions!
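For example, pointing it at the channel's feed URL, the same FEED_URL constant used later in the script, prints one URL per video still listed in the feed:
FEED_URL = 'https://www.youtube.com/feeds/videos.xml?channel_id=UC2rr0LbIuy34JHEoCndmKiA'

# Each entry's 'link' field is a regular watch URL.
for url in extract_feed_youtube_urls(FEED_URL):
    print(url)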
📽️ Filter Missing Videos
As described in the introduction, we need to know exactly which videos still have to be downloaded, while skipping the ones already available locally. The following filter function does just that. Remember the directory structure:
videos
├── -9Tp3rB-5n0
│ ├── -9Tp3rB-5n0.en.vtt
│ ├── -9Tp3rB-5n0.it.vtt
│ ├── -9Tp3rB-5n0.png
│ ├── -9Tp3rB-5n0.webm
│ ├── description.txt
│ └── title.txt
├── 02RSfvOvb3g
│ ├── 02RSfvOvb3g.en.vtt
│ ├── 02RSfvOvb3g.it.vtt
│ ├── 02RSfvOvb3g.png
│ ├── 02RSfvOvb3g.webm
│ ├── description.txt
│ └── title.txt
[...]
This means that each video is inside a directory having the video ID as its name. The set data structure comes in very handy here:
def filter_missing_videos(dst_dir: str, urls: list[str]) -> list[str]:
    # 1. Get all directories at level 0 (root) that have
    #    compatible YouTube video IDs (length 11).
    directory: pathlib.Path = pathlib.Path(dst_dir)
    directory_video_ids: list[str] = ([
        d.stem for d in directory.rglob('*')
        if (d.is_dir()
            and str(d.parent) == dst_dir
            and len(d.stem) == YOUTUBE_VIDEO_ID_LENGTH)])

    # 2. Get YouTube URL video IDs.
    url_video_ids: list[str] = [get_youtube_video_id(url) for url in urls]

    # 3. Get all video IDs present on YouTube and not present locally.
    ids_to_download: list[str] = (
        list(set(url_video_ids) - set(directory_video_ids)))

    # 4. Rebuild the original YouTube URLs.
    return ['https://www.youtube.com/watch?v=' + u for u in ids_to_download]
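If you want to see the filter in action without touching the real destination directory, here is a small sketch against a throwaway temp dir. The IDs are made up, and it assumes the functions and the YOUTUBE_VIDEO_ID_LENGTH constant from this post are already in scope:
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Pretend one video (fake 11-character ID) has already been downloaded.
    pathlib.Path(tmp, 'AAAAAAAAAAA').mkdir()
    urls = [
        'https://www.youtube.com/watch?v=AAAAAAAAAAA',
        'https://www.youtube.com/watch?v=BBBBBBBBBBB',
    ]
    # Only the second URL should survive the filter.
    print(filter_missing_videos(tmp, urls))
    # ['https://www.youtube.com/watch?v=BBBBBBBBBBB']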
🧮 Main function
Finally, we have to glue all the bits and pieces together.
To optionally re-sync the whole channel, just in case, or to download every video listed in the RSS feed while skipping the previous filtering step, I used argparse and some simple logic:
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser(
        description='Mirror YouTube channel locally.'
    )
    subparsers = parser.add_subparsers(
        dest='command',
        required=True
    )
    download_parser = subparsers.add_parser(
        'download',
        help='Download all videos, rsync style'
    )
    download_group = download_parser.add_mutually_exclusive_group()
    download_group.add_argument(
        '--ignore-existing',
        action='store_true',
        help='Download all videos listed in the RSS feed'
    )
    download_group.add_argument(
        '--whole-channel',
        action='store_true',
        help='Synchronize the whole YouTube channel. Use this to update titles and descriptions'
    )
    args = parser.parse_args()

    urls: list[str]
    if args.ignore_existing:
        urls = extract_feed_youtube_urls(FEED_URL)
    elif args.whole_channel:
        urls = [CHANNEL_URL]
    else:
        urls = extract_feed_youtube_urls(FEED_URL)
        urls = filter_missing_videos(DST_DIR, urls)

    logging.info(f'URLs to download: {urls}')
    if len(urls) > 0:
        with YoutubeDL(ydl_opts) as ydl:
            ydl.download(urls)
    else:
        logging.info('no videos to download')
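With that in place, the script can be invoked in the three modes described above. These commands assume the module is called mirror_yt, as in the systemd unit below:
# Download only the feed videos that are missing locally (default behaviour).
python -m mirror_yt download

# Download every video listed in the RSS feed, ignoring what is already on disk.
python -m mirror_yt download --ignore-existing

# Re-sync the whole channel, e.g. to refresh titles and descriptions.
python -m mirror_yt download --whole-channel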
🤖 Systemd unit files
In the service unit file we simply need to call the script:
[Unit]
Description=mirror-yt SolveComputerScience
Requires=network-online.target
After=network-online.target
[Service]
Type=simple
WorkingDirectory=/home/myuser/.scripts
ExecStart=/bin/sh -c '. .venv/bin/activate && python -m mirror_yt download; deactivate'
User=myuser
Group=myuser
And of course you can run it daily, for example:
[Unit]
Description=Once every day mirror-yt SolveComputerScience
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
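Assuming the two files are saved as mirror-yt.service and mirror-yt.timer under /etc/systemd/system/ (the names are up to you), enabling the timer looks something like this:
systemctl daemon-reload
systemctl enable --now mirror-yt.timer
systemctl list-timers mirror-yt.timer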
🎉 Conclusion
Click to open the full script
import argparse
import logging
import pathlib
import sys
from urllib.parse import urlparse, parse_qs
from yt_dlp import YoutubeDL
import feedparser
DST_DIR: str = '/srv/http/videos'
FEED_URL: str = 'https://www.youtube.com/feeds/videos.xml?channel_id=UC2rr0LbIuy34JHEoCndmKiA'
CHANNEL_URL: str = 'https://www.youtube.com/channel/UC2rr0LbIuy34JHEoCndmKiA'
YOUTUBE_VIDEO_ID_LENGTH: int = 11
ydl_opts: dict = {
    'verbose': True,
    'no_overwrites': True,
    'call_home': False,
    'add_metadata': True,
    'fixup': 'detect_or_warn',
    'prefer_ffmpeg': True,
    'subtitleslangs': ['en', 'it'],
    'writesubtitles': True,
    'writeautomaticsub': True,
    'prefer_free_formats': True,
    'writethumbnail': True,
    'final_ext': 'webm',
    'outtmpl': {
        'default': str(pathlib.Path(DST_DIR, '%(id)s', '%(id)s.%(ext)s'))
    },
    'postprocessors': [
        {
            # --convert-thumbnails png
            'format': 'png',
            'key': 'FFmpegThumbnailsConvertor',
            'when': 'before_dl'
        },
        {
            # --recode webm
            'key': 'FFmpegVideoConvertor',
            'preferedformat': 'webm'
        },
        {
            'exec_cmd': ["cat > " + str(pathlib.Path(DST_DIR, '%(id)s', 'title.txt')) + " << 'EOF'\n"
                         '%(title)s\n'
                         'EOF'],
            'key': 'Exec',
            'when': 'after_move'
        },
        {
            'exec_cmd': ["cat > " + str(pathlib.Path(DST_DIR, '%(id)s', 'description.txt')) + " << 'EOF'\n"
                         '%(description)s\n'
                         'EOF'],
            'key': 'Exec',
            'when': 'after_move'
        },
    ],
}
def extract_feed_youtube_urls(feed_source: str) -> list[str]:
    data = feedparser.parse(feed_source)
    return [d['link'] for d in data.entries if 'link' in d]


# Import this from the generate_posts.py script.
def get_youtube_video_id(video_url: str) -> str | None:
    parsed_url = urlparse(video_url)

    # Check the path and query parameters.
    if 'youtube.com' in parsed_url.netloc:
        # Standard YouTube URLs.
        if parsed_url.path == '/watch':
            # parse_qs returns a dict.
            query_params = parse_qs(parsed_url.query)
            return query_params.get('v', [None])[0]
        elif parsed_url.path.startswith('/shorts'):
            return pathlib.Path(parsed_url.path).stem

    return None
def filter_missing_videos(dst_dir: str, urls: list[str]) -> list[str]:
    # 1. Get all directories at level 0 (root) that have
    #    compatible YouTube video IDs (length 11).
    directory: pathlib.Path = pathlib.Path(dst_dir)
    directory_video_ids: list[str] = ([
        d.stem for d in directory.rglob('*')
        if (d.is_dir()
            and str(d.parent) == dst_dir
            and len(d.stem) == YOUTUBE_VIDEO_ID_LENGTH)])

    # 2. Get YouTube URL video IDs.
    url_video_ids: list[str] = [get_youtube_video_id(url) for url in urls]

    # 3. Get all video IDs present on YouTube and not present locally.
    ids_to_download: list[str] = list(set(url_video_ids) - set(directory_video_ids))

    # 4. Rebuild the original YouTube URLs.
    return ['https://www.youtube.com/watch?v=' + u for u in ids_to_download]
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser(description='Mirror YouTube channel locally.')
    subparsers = parser.add_subparsers(dest='command', required=True)
    download_parser = subparsers.add_parser('download', help='Download all videos, rsync style')
    download_group = download_parser.add_mutually_exclusive_group()
    download_group.add_argument('--ignore-existing', action='store_true', help='Download all videos listed in the RSS feed')
    download_group.add_argument('--whole-channel', action='store_true', help='Synchronize the whole YouTube channel. Use this to update titles and descriptions')
    args = parser.parse_args()

    urls: list[str]
    if args.ignore_existing:
        urls = extract_feed_youtube_urls(FEED_URL)
    elif args.whole_channel:
        urls = [CHANNEL_URL]
    else:
        urls = extract_feed_youtube_urls(FEED_URL)
        urls = filter_missing_videos(DST_DIR, urls)

    logging.info(f'URLs to download: {urls}')
    if len(urls) > 0:
        with YoutubeDL(ydl_opts) as ydl:
            ydl.download(urls)
    else:
        logging.info('no videos to download')
In the next post we'll start to automate the Jekyll side of things.
You can comment here and check my YouTube channel.