# Building a Podcast Guest Tracker Across Shows and Platforms
Tracking who appears on which podcasts reveals influence networks, trending experts, and booking patterns. Let's build a Python scraper that monitors podcast guests across platforms.
## Data Sources
- Podcast RSS feeds: direct access to episode metadata
- Apple Podcasts: episode descriptions that often name guests
- ListenNotes API: a searchable podcast index with an API
## Setting Up

```bash
pip install requests beautifulsoup4 feedparser pandas spacy
python3 -m spacy download en_core_web_sm
```
## Parsing RSS Feeds

Every podcast exposes an RSS feed, which makes it the most reliable data source:
```python
import feedparser

def parse_podcast_feed(feed_url):
    feed = feedparser.parse(feed_url)
    episodes = []
    for entry in feed.entries:
        episodes.append({
            "title": entry.get("title", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "duration": entry.get("itunes_duration", ""),
            "link": entry.get("link", ""),
        })
    return {
        "show_title": feed.feed.get("title", ""),
        "episodes": episodes,
    }

podcast = parse_podcast_feed("https://feeds.simplecast.com/54nAGcIl")
print(f"Show: {podcast['show_title']}, Episodes: {len(podcast['episodes'])}")
```
## Extracting Guest Names with NLP
```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

GUEST_PATTERNS = [
    r"(?:with|featuring|guest|interview(?:ing)?|joined by)\s+([A-Z][a-z]+ [A-Z][a-z]+)",
    r"([A-Z][a-z]+ [A-Z][a-z]+)\s+(?:joins|talks|discusses|shares|explains)",
]

def extract_guests(title, summary):
    guests = set()
    text = f"{title} {summary}"
    for pattern in GUEST_PATTERNS:
        guests.update(re.findall(pattern, text))
    # Back up the regexes with named-entity recognition on a bounded slice
    doc = nlp(title + ". " + summary[:500])
    for ent in doc.ents:
        if ent.label_ == "PERSON" and len(ent.text.split()) >= 2:
            guests.add(ent.text)
    return list(guests)

print(extract_guests("Building AI Products with Sarah Chen",
                     "Sarah Chen, VP of Engineering at TechCorp, joins us to discuss..."))
```
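One wrinkle: the extractor can return near-duplicates of the same person ("Dr. Sarah Chen" vs. "Sarah Chen"), which would split one guest's appearance count. A minimal normalization sketch, assuming honorifics and stray whitespace are the main sources of variants (the helper names are my own):

```python
import re
from collections import defaultdict

# Strip a leading honorific, if present (an assumed, non-exhaustive list)
TITLES = re.compile(r"^(?:Dr|Prof|Mr|Ms|Mrs)\.?\s+", re.IGNORECASE)

def normalize_name(name):
    """Drop honorifics and collapse whitespace so variants share one key."""
    name = TITLES.sub("", name.strip())
    return " ".join(name.split())

def dedupe_guests(names):
    """Group raw extracted names under their normalized canonical form."""
    canonical = defaultdict(set)
    for raw in names:
        canonical[normalize_name(raw)].add(raw)
    return dict(canonical)

merged = dedupe_guests(["Dr. Sarah Chen", "Sarah Chen", "  Sarah  Chen "])
print(merged)  # one canonical key: "Sarah Chen"
```

For serious use you would want fuzzy matching as well (initials, nicknames), but this catches the most common variants cheaply.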
## Tracking Across Multiple Shows
```python
import time

import pandas as pd

PODCAST_FEEDS = {
    "Lex Fridman": "https://lexfridman.com/feed/podcast/",
    "The Changelog": "https://changelog.com/podcast/feed",
    "Talk Python": "https://talkpython.fm/episodes/rss",
}

def build_guest_database(feeds):
    all_appearances = []
    for show_name, feed_url in feeds.items():
        podcast = parse_podcast_feed(feed_url)
        for ep in podcast["episodes"]:
            guests = extract_guests(ep["title"], ep.get("summary", ""))
            for guest in guests:
                all_appearances.append({
                    "guest": guest,
                    "show": show_name,
                    "episode_title": ep["title"],
                    "date": ep.get("published", ""),
                })
        time.sleep(1)  # be polite between feed requests
    return pd.DataFrame(all_appearances)

df = build_guest_database(PODCAST_FEEDS)
print(f"Total appearances: {len(df)}")
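If the tracker runs on a schedule, each run should merge into the existing database without double-counting episodes already seen. A sketch using pandas `drop_duplicates`, assuming guest + show + episode title uniquely identifies an appearance (the function name is my own):

```python
import pandas as pd

def merge_appearances(existing, new):
    """Append new appearances, dropping rows already present.
    Keyed on guest + show + episode title (assumes titles are stable)."""
    combined = pd.concat([existing, new], ignore_index=True)
    return combined.drop_duplicates(
        subset=["guest", "show", "episode_title"], keep="first"
    ).reset_index(drop=True)

old = pd.DataFrame([{"guest": "Jane Doe", "show": "Talk Python",
                     "episode_title": "Ep 1", "date": "2024-01-01"}])
new = pd.DataFrame([
    {"guest": "Jane Doe", "show": "Talk Python",
     "episode_title": "Ep 1", "date": "2024-01-01"},   # already tracked
    {"guest": "John Roe", "show": "Talk Python",
     "episode_title": "Ep 2", "date": "2024-02-01"},   # genuinely new
])
merged = merge_appearances(old, new)
print(len(merged))  # 2
```

Persist the merged frame with `merged.to_csv(...)` between runs and the tracker becomes incremental.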
## Analysis

```python
# Guests who appear on more than one tracked show
cross_show = df.groupby("guest")["show"].nunique()
frequent = cross_show[cross_show > 1].sort_values(ascending=False)

print("Guests on multiple shows:")
print(frequent.head(20))
```
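Beyond per-guest counts, the same DataFrame supports a rough co-appearance view: which experts get booked on the same shows. A sketch of counting guest pairs per show (the helper name is hypothetical):

```python
from collections import Counter
from itertools import combinations

import pandas as pd

def shared_show_pairs(df):
    """Count guest pairs that appear on the same show -- a crude
    proxy for booking overlap between experts."""
    pairs = Counter()
    for _, guests in df.groupby("show")["guest"]:
        for a, b in combinations(sorted(set(guests)), 2):
            pairs[(a, b)] += 1
    return pairs

df = pd.DataFrame([
    {"guest": "Jane Doe", "show": "Talk Python"},
    {"guest": "John Roe", "show": "Talk Python"},
    {"guest": "Jane Doe", "show": "The Changelog"},
])
pairs = shared_show_pairs(df)
print(pairs)  # Jane Doe and John Roe share one show
```

Feeding these pairs into a graph library such as networkx would turn the counts into an explorable influence network.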
For sources without RSS feeds, a rendering service such as ScraperAPI can fetch JavaScript-heavy pages; at larger scale, rotating proxies (e.g. ThorData) and job monitoring (e.g. ScrapeOps) help keep the pipeline healthy.
## Key Takeaways
- RSS feeds provide the most reliable podcast episode data
- NLP + regex patterns extract guest names from titles and descriptions
- Cross-show analysis reveals influence networks and trending experts
- Automated tracking catches booking patterns early
RSS feeds are designed for public consumption. Respect rate limits when scraping additional platforms.