Solve Computer Science

Originally published at solvecomputerscience.substack.com

Jekyll auto posts from YouTube feeds

🧩 The problem

I wanted to automate a boring and repetitive workflow: every time a video is published on my YouTube channel, an associated post should be created on my personal Jekyll blog.

Until very recently I handled all of this by hand. The process was tedious because, except for the video summary (more on this later), it was merely a copy-paste of fields into the YAML front matter.

✅ The solution

📶 Feeds

The solution is to leverage the RSS feeds provided by YouTube itself, plus a few Python libraries:

  • 📶 feedparser: the core dependency
  • 💻 PyYAML: parses the existing front matter
  • 🐌 slugify: not crucial for this use case but useful

Every YouTube channel, in fact, has a feed at this URL format:

https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}
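
As a quick sanity check, you can parse such a feed directly with feedparser (the channel id below is just a placeholder):

import feedparser

# Placeholder channel id, used only for illustration.
CHANNEL_ID = 'UCxxxxxxxxxxxxxxxxxxxxxx'
FEED_URL = f'https://www.youtube.com/feeds/videos.xml?channel_id={CHANNEL_ID}'

d = feedparser.parse(FEED_URL)
for entry in d.entries:
    print(entry['link'], entry['title'])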

⚙️ Algorithm

Essentially, the algorithm works like this:

  1. get and parse the feed file
  2. for each news item (entry) in the feed file, extract:
    1. URL
    2. title
    3. published date
    4. tags (via the video description, using a regex)
  3. create a new markdown file in the _posts directory using the variables from step 2, leaving existing auto-posts untouched

Steps 1 and 2 are quite simple thanks to a list comprehension:

import datetime

import feedparser

# get_video_tags and STANDARD_TAGS are defined elsewhere in the script.
def extract_feed_youtube_data(feed_source: str) -> list[dict]:
    d = feedparser.parse(feed_source)
    data = [
        {
            'url': e['link'],
            'title': e['title'],

            # feedparser generates a Python 9-tuple in UTC.
            'published': datetime.datetime(*e['published_parsed'][:6]),

            # Use a regex for this.
            # If there are no hash tags available just use default ones.
            'tags': get_video_tags(e['summary'])
                if ('summary' in e and e['summary'] != '')
                else STANDARD_TAGS,

            # The video description is not used at the moment.
            'sm': e['summary'],
        }
        for e in d.entries if 'link' in e
    ]

Before returning the data you can also perform a cleanup:

    for e in data:                                                              
        # Always add the default tags. This also works when a video has no
        # description. Concatenate instead of '+=' so that STANDARD_TAGS
        # itself is never mutated in place.
        e['tags'] = list(e['tags']) + list(STANDARD_TAGS)

        # Unique.
        e['tags'] = sorted(list(set(e['tags'])))

        e['summary'] = ' '.join(e['tags'])

    return data
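
The get_video_tags helper is not shown here. A minimal sketch, assuming the hashtags appear verbatim in the video description (e.g. #python), could look like this:

import re

def get_video_tags(description: str) -> list[str]:
    # Hypothetical sketch: collect '#hashtag' tokens from the description,
    # lower-cased and without the leading '#'.
    return [tag.lower() for tag in re.findall(r'#(\w+)', description)]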

Step 3 involves an f-string. We need to take care of specific fields, namely title and description, to avoid Jekyll throwing YAML parsing errors when they contain quotes.

def create_markdown_blog_post(posts_base_directory: str, data: dict):
    return f"""---
title: "|-"
    {data["title"]}


tags: [{youtube_tags_to_jekyll_front_matter_tags(data["tags"])}]
related_links: ['{data["url"]}']
updated: {datetime_object_to_jekyll_front_matter_utc(data["published"])}
description: "|-"
    {data["title"]}


lang: 'en'
---
{generate_youtube_embed_code(get_youtube_video_id(data["url"]))}

<div markdown="0">
{data["summary"]}
</div>

*Note: post auto-generated from YouTube feeds.*
"""
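
create_markdown_blog_post only builds the post body; the actual write, and the check that keeps existing auto-posts untouched (step 3), are not shown in this post. A minimal sketch, assuming a hypothetical write_post_if_missing helper and the standard Jekyll YYYY-MM-DD-title.md filename convention, could be:

import os

from slugify import slugify

def write_post_if_missing(posts_base_directory: str, data: dict):
    # Jekyll expects post filenames like 2024-01-31-my-title.md.
    filename = (data['published'].strftime('%Y-%m-%d')
                + '-' + slugify(data['title']) + '.md')
    path = os.path.join(posts_base_directory, filename)

    # Never overwrite an existing auto-post.
    if os.path.exists(path):
        return

    with open(path, 'w', encoding='utf-8') as f:
        f.write(create_markdown_blog_post(posts_base_directory, data))

# Hypothetical driver tying the steps together (FEED_URL as sketched earlier).
for entry in extract_feed_youtube_data(FEED_URL):
    write_post_if_missing('_posts', entry)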

🎯 Result

(Screenshot: a rendered auto-generated Jekyll post)

As you can see, the content of each auto-post is very basic: besides the standard fields, the body contains an HTML YouTube embed code and a list of hashtags extracted from the video description with a regex. The description (summary) was left out on purpose. The idea for the future is to get the video transcription and let an LLM (via Ollama) generate a summary, which I would then proofread manually. I also cannot copy the video description verbatim because of SEO.
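
The two embed helpers used in the front matter function are also not shown; a minimal sketch, assuming watch URLs of the form https://www.youtube.com/watch?v=<id> and the standard iframe embed markup, might look like this:

import re

def get_youtube_video_id(url: str) -> str:
    # Hypothetical sketch: watch URLs carry the video id in the 'v' parameter.
    match = re.search(r'[?&]v=([\w-]{11})', url)
    return match.group(1) if match else ''

def generate_youtube_embed_code(video_id: str) -> str:
    # Standard YouTube iframe embed markup.
    return ('<iframe width="560" height="315" '
            f'src="https://www.youtube.com/embed/{video_id}" '
            'frameborder="0" allowfullscreen></iframe>')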

Another improvement could involve replicating a subset of the fields of the auto-post to send toots to Mastodon using its API.

▶️ Running

At the moment the script is triggered by a local pre-commit hook which also installs the Python dependencies in a separate environment:

- repo: local
  hooks:
    - id: generate_posts
      name: generate_posts
      entry: python ./.scripts/youtube/generate_posts.py
      verbose: true
      always_run: true
      pass_filenames: false
      language: python
      types: [python]
      additional_dependencies: ['feedparser>=6,<7', 'python-slugify[unidecode]>=8,<9', 'PyYAML>=6,<7']

🎉 Conclusion

And that's it, really. This script saves time, and it's the kind of automation I like. Thankfully YouTube still provides RSS feeds, although I already had to fix the script once to adapt to structural changes in the feed.

If you are interested in the source code, you can find it in its repository.

You can comment here and check my YouTube channel for related content!
