🧩 The problem
As promised in the previous post, let's add support for reading RSS feeds from YouTube, so we can download only the most recent videos that are not yet present locally. This way we avoid touching the ones already downloaded.
It's pretty straightforward if we follow what was done in the first post of this series. In fact, all the feed-reading code stays the same. What changes is that we now compare video IDs instead of Markdown file contents.
⚠️ Warning
⚠️⚠️ Before continuing, please only mirror content you have permission to... ⚠️⚠️
✅ The solution
Before starting, let's not forget the Python requirements.txt file:
yt-dlp
feedparser>=6,<7
We only need these two dependencies for the moment.
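If you want to follow along, you can install them inside a virtual environment, which is also the same .venv directory the systemd service further down activates:
python -m venv .venv
. .venv/bin/activate
pip install -r requirements.txt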
🆔 YouTube video ID extraction
This first function can be imported verbatim from the original script. What it does is very simple: given a YouTube video URL, it returns the video ID. For example:
get_youtube_video_id('https://www.youtube.com/watch?v=PwEMWnuMdac')
returns
PwEMWnuMdac
Here it is:
# Import this from the generate_posts.py script.
def get_youtube_video_id(video_url: str) -> str | None:
    parsed_url = urlparse(video_url)

    # Check the path and query parameters.
    if 'youtube.com' in parsed_url.netloc:
        # Standard YouTube URLs.
        if parsed_url.path == '/watch':
            # parse_qs returns a dict.
            query_params = parse_qs(parsed_url.query)
            return query_params.get('v', [None])[0]
        elif parsed_url.path.startswith('/shorts'):
            return pathlib.Path(parsed_url.path).stem

    return None
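As a quick sanity check, here is what the function returns for the two URL styles it understands, plus a non-YouTube URL. The Shorts URL is made up just to exercise the second branch:
# Standard watch URL.
print(get_youtube_video_id('https://www.youtube.com/watch?v=PwEMWnuMdac'))  # PwEMWnuMdac
# Hypothetical Shorts URL, same ID.
print(get_youtube_video_id('https://www.youtube.com/shorts/PwEMWnuMdac'))   # PwEMWnuMdac
# Anything else falls through to None.
print(get_youtube_video_id('https://example.com/video'))                    # None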
📶 YouTube URL feeds
The feed-reading function has been simplified a lot because we only need the video URLs; none of the other fields are used.
def extract_feed_youtube_urls(feed_source: str) -> list[str]:
    data = feedparser.parse(feed_source)
    return [d['link'] for d in data.entries if 'link' in d]
Remember the power of list comprehensions!
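For example, pointing it at the channel's feed URL, the same FEED_URL constant used later in the script, prints one URL per video still listed in the feed:
FEED_URL = 'https://www.youtube.com/feeds/videos.xml?channel_id=UC2rr0LbIuy34JHEoCndmKiA'

# Each entry's 'link' field is a regular watch URL.
for url in extract_feed_youtube_urls(FEED_URL):
    print(url)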
📽️ Filter Missing Videos
As described in the introduction, we need to know exactly which videos still have to be downloaded, while skipping the ones already available locally. The following filter function does just that. Remember the directory structure:
videos
├── -9Tp3rB-5n0
│ ├── -9Tp3rB-5n0.en.vtt
│ ├── -9Tp3rB-5n0.it.vtt
│ ├── -9Tp3rB-5n0.png
│ ├── -9Tp3rB-5n0.webm
│ ├── description.txt
│ └── title.txt
├── 02RSfvOvb3g
│ ├── 02RSfvOvb3g.en.vtt
│ ├── 02RSfvOvb3g.it.vtt
│ ├── 02RSfvOvb3g.png
│ ├── 02RSfvOvb3g.webm
│ ├── description.txt
│ └── title.txt
[...]
This means that each video is inside a directory having the video ID as its name. The set data structure comes in very handy here:
def filter_missing_videos(dst_dir: str, urls: list[str]) -> list[str]:
    # 1. Get all directories at level 0 (root) that have
    #    compatible YouTube video IDs (length 11).
    directory: pathlib.Path = pathlib.Path(dst_dir)
    directory_video_ids: list[str] = ([
        d.stem for d in directory.rglob('*')
        if (d.is_dir()
            and str(d.parent) == dst_dir
            and len(d.stem) == YOUTUBE_VIDEO_ID_LENGTH)])

    # 2. Get YouTube URL video IDs.
    url_video_ids: list[str] = [get_youtube_video_id(url) for url in urls]

    # 3. Get all video IDs present on YouTube and not present locally.
    ids_to_download: list[str] = (
        list(set(url_video_ids) - set(directory_video_ids)))

    # 4. Rebuild the original YouTube URLs.
    return ['https://www.youtube.com/watch?v=' + u for u in ids_to_download]
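If you want to see the filter in action without touching the real destination directory, here is a small sketch against a throwaway temp dir. The IDs are made up, and it assumes the functions and the YOUTUBE_VIDEO_ID_LENGTH constant from this post are already in scope:
import pathlib
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Pretend one video (fake 11-character ID) has already been downloaded.
    pathlib.Path(tmp, 'AAAAAAAAAAA').mkdir()
    urls = [
        'https://www.youtube.com/watch?v=AAAAAAAAAAA',
        'https://www.youtube.com/watch?v=BBBBBBBBBBB',
    ]
    # Only the second URL should survive the filter.
    print(filter_missing_videos(tmp, urls))
    # ['https://www.youtube.com/watch?v=BBBBBBBBBBB']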
🧮 Main function
Finally, we have to glue all the bits and pieces together.
To optionally re-sync the whole channel, just in case, or to download every video listed in the RSS feed while skipping the previous filtering step, I used argparse and some simple logic:
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser(
        description='Mirror YouTube channel locally.'
    )
    subparsers = parser.add_subparsers(
        dest='command',
        required=True
    )
    download_parser = subparsers.add_parser(
        'download',
        help='Download all videos, rsync style'
    )
    download_group = download_parser.add_mutually_exclusive_group()
    download_group.add_argument(
        '--ignore-existing',
        action='store_true',
        help='Download all videos listed in the RSS feed'
    )
    download_group.add_argument(
        '--whole-channel',
        action='store_true',
        help='Synchronize the whole YouTube channel. Use this to update titles and descriptions'
    )
    args = parser.parse_args()

    urls: list[str]
    if args.ignore_existing:
        urls = extract_feed_youtube_urls(FEED_URL)
    elif args.whole_channel:
        urls = [CHANNEL_URL]
    else:
        urls = extract_feed_youtube_urls(FEED_URL)
        urls = filter_missing_videos(DST_DIR, urls)

    logging.info(f'URLs to download: {urls}')
    if len(urls) > 0:
        with YoutubeDL(ydl_opts) as ydl:
            ydl.download(urls)
    else:
        logging.info('no videos to download')
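With that in place, the script can be invoked in the three modes described above. These commands assume the module is called mirror_yt, as in the systemd unit below:
# Download only the feed videos that are missing locally (default behaviour).
python -m mirror_yt download

# Download every video listed in the RSS feed, ignoring what is already on disk.
python -m mirror_yt download --ignore-existing

# Re-sync the whole channel, e.g. to refresh titles and descriptions.
python -m mirror_yt download --whole-channel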
🤖 Systemd unit files
In the service unit file we simply need to call the script:
[Unit]
Description=mirror-yt SolveComputerScience
Requires=network-online.target
After=network-online.target
[Service]
Type=simple
WorkingDirectory=/home/myuser/.scripts
ExecStart=/bin/sh -c '. .venv/bin/activate && python -m mirror_yt download; deactivate'
User=myuser
Group=myuser
And of course you can run it daily, for example:
[Unit]
Description=Once every day mirror-yt SolveComputerScience
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
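Assuming the two files are saved as mirror-yt.service and mirror-yt.timer under /etc/systemd/system/ (the names are up to you), enabling the timer looks something like this:
systemctl daemon-reload
systemctl enable --now mirror-yt.timer
systemctl list-timers mirror-yt.timer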
🎉 Conclusion
Click to open the full script
import argparse
import logging
import pathlib
import sys
from urllib.parse import urlparse, parse_qs
from yt_dlp import YoutubeDL
import feedparser
DST_DIR: str = '/srv/http/videos'
FEED_URL: str = 'https://www.youtube.com/feeds/videos.xml?channel_id=UC2rr0LbIuy34JHEoCndmKiA'
CHANNEL_URL: str = 'https://www.youtube.com/channel/UC2rr0LbIuy34JHEoCndmKiA'
YOUTUBE_VIDEO_ID_LENGTH: int = 11
ydl_opts: dict = {
    'verbose': True,
    'no_overwrites': True,
    'call_home': False,
    'add_metadata': True,
    'fixup': 'detect_or_warn',
    'prefer_ffmpeg': True,
    'subtitleslangs': ['en', 'it'],
    'writesubtitles': True,
    'writeautomaticsub': True,
    'prefer_free_formats': True,
    'writethumbnail': True,
    'final_ext': 'webm',
    'outtmpl': {
        'default': str(pathlib.Path(DST_DIR, '%(id)s', '%(id)s.%(ext)s'))
    },
    'postprocessors': [
        {
            # --convert-thumbnails png
            'format': 'png',
            'key': 'FFmpegThumbnailsConvertor',
            'when': 'before_dl'
        },
        {
            # --recode webm
            'key': 'FFmpegVideoConvertor',
            'preferedformat': 'webm'
        },
        {
            'exec_cmd': ["cat > " + str(pathlib.Path(DST_DIR, '%(id)s', 'title.txt')) + " << 'EOF'\n"
                         '%(title)s\n'
                         'EOF'],
            'key': 'Exec',
            'when': 'after_move'
        },
        {
            'exec_cmd': ["cat > " + str(pathlib.Path(DST_DIR, '%(id)s', 'description.txt')) + " << 'EOF'\n"
                         '%(description)s\n'
                         'EOF'],
            'key': 'Exec',
            'when': 'after_move'
        },
    ],
}
def extract_feed_youtube_urls(feed_source: str) -> list[str]:
    data = feedparser.parse(feed_source)
    return [d['link'] for d in data.entries if 'link' in d]


# Import this from the generate_posts.py script.
def get_youtube_video_id(video_url: str) -> str | None:
    parsed_url = urlparse(video_url)

    # Check the path and query parameters.
    if 'youtube.com' in parsed_url.netloc:
        # Standard YouTube URLs.
        if parsed_url.path == '/watch':
            # parse_qs returns a dict.
            query_params = parse_qs(parsed_url.query)
            return query_params.get('v', [None])[0]
        elif parsed_url.path.startswith('/shorts'):
            return pathlib.Path(parsed_url.path).stem

    return None
def filter_missing_videos(dst_dir: str, urls: list[str]) -> list[str]:
    # 1. Get all directories at level 0 (root) that have
    #    compatible YouTube video IDs (length 11).
    directory: pathlib.Path = pathlib.Path(dst_dir)
    directory_video_ids: list[str] = ([
        d.stem for d in directory.rglob('*')
        if (d.is_dir()
            and str(d.parent) == dst_dir
            and len(d.stem) == YOUTUBE_VIDEO_ID_LENGTH)])

    # 2. Get YouTube URL video IDs.
    url_video_ids: list[str] = [get_youtube_video_id(url) for url in urls]

    # 3. Get all video IDs present on YouTube and not present locally.
    ids_to_download: list[str] = list(set(url_video_ids) - set(directory_video_ids))

    # 4. Rebuild the original YouTube URLs.
    return ['https://www.youtube.com/watch?v=' + u for u in ids_to_download]
if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    parser = argparse.ArgumentParser(description='Mirror YouTube channel locally.')
    subparsers = parser.add_subparsers(dest='command', required=True)
    download_parser = subparsers.add_parser('download', help='Download all videos, rsync style')
    download_group = download_parser.add_mutually_exclusive_group()
    download_group.add_argument('--ignore-existing', action='store_true', help='Download all videos listed in the RSS feed')
    download_group.add_argument('--whole-channel', action='store_true', help='Synchronize the whole YouTube channel. Use this to update titles and descriptions')
    args = parser.parse_args()

    urls: list[str]
    if args.ignore_existing:
        urls = extract_feed_youtube_urls(FEED_URL)
    elif args.whole_channel:
        urls = [CHANNEL_URL]
    else:
        urls = extract_feed_youtube_urls(FEED_URL)
        urls = filter_missing_videos(DST_DIR, urls)

    logging.info(f'URLs to download: {urls}')
    if len(urls) > 0:
        with YoutubeDL(ydl_opts) as ydl:
            ydl.download(urls)
    else:
        logging.info('no videos to download')
In the next post we'll start to automate the Jekyll side of things.
You can comment here and check my YouTube channel.