DEV Community

agenthustler

Building a Podcast Guest Tracker Across Shows and Platforms

Tracking who appears on which podcasts reveals influence networks, trending experts, and booking patterns. Let's build a Python scraper that monitors podcast guests across platforms.

Data Sources

  • Podcast RSS feeds: direct access to episode metadata
  • Apple Podcasts: episode descriptions that often include guest names
  • ListenNotes API: a podcast search engine you can query programmatically
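
Apple Podcasts listings usually point back to a show's RSS feed, so a directory lookup can feed the RSS parser used in the rest of this post. Here is a minimal sketch, assuming Apple's public iTunes Search API and its `collectionName` and `feedUrl` response fields. The helper names are hypothetical and not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ITUNES_SEARCH = "https://itunes.apple.com/search"  # assumed public endpoint


def build_search_url(term, limit=5):
    """Build a search URL restricted to podcasts."""
    query = urlencode({"term": term, "media": "podcast", "limit": limit})
    return f"{ITUNES_SEARCH}?{query}"


def feed_urls_from_payload(payload):
    """Map show name -> RSS feed URL, skipping results that lack either field."""
    return {
        item["collectionName"]: item["feedUrl"]
        for item in payload.get("results", [])
        if item.get("collectionName") and item.get("feedUrl")
    }


def find_feeds(term, limit=5):
    """Query the API (network call) and return {show name: feed URL}."""
    with urlopen(build_search_url(term, limit), timeout=10) as resp:
        return feed_urls_from_payload(json.load(resp))
```

Keeping the response parsing separate from the network call makes it easy to check against a saved JSON response before hitting the live API.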

Setting Up

pip install requests beautifulsoup4 feedparser pandas spacy
python3 -m spacy download en_core_web_sm

Parsing RSS Feeds

Nearly every podcast publishes an RSS feed, which makes it the most reliable data source:

import feedparser

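def parse_podcast_feed(feed_url):
    """Fetch an RSS feed and return the show title plus basic episode metadata."""
    feed = feedparser.parse(feed_url)
    if feed.bozo:
        # feedparser sets bozo when the feed is malformed or could not be fetched
        print(f"Warning: problem parsing {feed_url}: {feed.get('bozo_exception')}")
    episodes = []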
    for entry in feed.entries:
        episodes.append({
            "title": entry.get("title", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "duration": entry.get("itunes_duration", ""),
            "link": entry.get("link", "")
        })
    return {
        "show_title": feed.feed.get("title", ""),
        "episodes": episodes
    }

podcast = parse_podcast_feed("https://feeds.simplecast.com/54nAGcIl")
print(f"Show: {podcast['show_title']}, Episodes: {len(podcast['episodes'])}")
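
The `published` field comes back as a raw string, which sorts poorly and mixes time zones. Here is a minimal sketch, assuming the feed uses the RFC 822 date format that RSS specifies. `normalize_published` is a hypothetical helper, and treating a missing offset as UTC is an assumption.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def normalize_published(date_str):
    """Parse an RFC 822 date string into a UTC datetime, or None if unparseable."""
    try:
        dt = parsedate_to_datetime(date_str)
    except (TypeError, ValueError):
        return None
    if dt.tzinfo is None:
        # Assumption: a date without a usable offset is treated as UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


print(normalize_published("Mon, 01 Jan 2024 10:00:00 -0500"))
```

Storing the normalized value next to the original string keeps the raw data available when a feed uses a nonstandard format.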

Extracting Guest Names with NLP

import spacy
import re

nlp = spacy.load("en_core_web_sm")

GUEST_PATTERNS = [
    r"(?:with|featuring|guest|interview(?:ing)?|joined by)\s+([A-Z][a-z]+ [A-Z][a-z]+)",
    r"([A-Z][a-z]+ [A-Z][a-z]+)\s+(?:joins|talks|discusses|shares|explains)",
]

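def extract_guests(title, summary):
    """Return candidate guest names found by regex patterns and spaCy NER."""
    summary = summary or ""
    guests = set()
    text = f"{title} {summary}"
    # Pass 1: phrase patterns such as "with Sarah Chen" or "Sarah Chen joins"
    for pattern in GUEST_PATTERNS:
        guests.update(re.findall(pattern, text))

    # Pass 2: named-entity recognition on the title plus the first 500 characters
    doc = nlp(title + ". " + summary[:500])
    for ent in doc.ents:
        # Require at least two tokens to skip first-name-only mentions
        if ent.label_ == "PERSON" and len(ent.text.split()) >= 2:
            guests.add(ent.text)
    return sorted(guests)  # sorted for deterministic output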

print(extract_guests("Building AI Products with Sarah Chen",
    "Sarah Chen, VP of Engineering at TechCorp, joins us to discuss..."))
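
Feed summaries often arrive as HTML, and leftover tags can confuse both the regex patterns and the NER model. Here is a minimal sketch of a cleanup step using only the standard library. BeautifulSoup, installed earlier, would work as well. `strip_html` and the `BLOCK_TAGS` set are hypothetical choices for illustration.

```python
from html.parser import HTMLParser

# Tags that should act as word boundaries once removed (illustrative set)
BLOCK_TAGS = {"p", "br", "div", "li", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    """Collect text nodes, inserting a space where block-level tags were."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return plain text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(markup or "")
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.parts).split())


print(strip_html("<p>Sarah Chen, VP at <b>TechCorp</b>, joins us.</p>"))
```

Running `strip_html` on each summary before passing it to `extract_guests` keeps markup out of the matched names.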

Tracking Across Multiple Shows

import pandas as pd
import time

PODCAST_FEEDS = {
    "Lex Fridman": "https://lexfridman.com/feed/podcast/",
    "The Changelog": "https://changelog.com/podcast/feed",
    "Talk Python": "https://talkpython.fm/episodes/rss",
}

def build_guest_database(feeds):
    all_appearances = []
    for show_name, feed_url in feeds.items():
        podcast = parse_podcast_feed(feed_url)
        for ep in podcast["episodes"]:
            guests = extract_guests(ep["title"], ep.get("summary", ""))
            for guest in guests:
                all_appearances.append({
                    "guest": guest, "show": show_name,
                    "episode_title": ep["title"],
                    "date": ep.get("published", "")
                })
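        time.sleep(1)  # pause between feeds to avoid hammering hosts
    # Name the columns explicitly so an empty result still supports groupby("guest")
    return pd.DataFrame(all_appearances, columns=["guest", "show", "episode_title", "date"])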

df = build_guest_database(PODCAST_FEEDS)
print(f"Total appearances: {len(df)}")
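
Cross-show matching only works if the same person gets the same key everywhere, and extracted names vary ("Dr. Sarah Chen", "Sarah Chen's"). Here is a minimal sketch of a grouping key. `normalize_guest_name` and the honorific list are hypothetical and deliberately small.

```python
import re
import unicodedata

# Illustrative list of leading titles to drop; extend as needed
HONORIFICS = re.compile(r"^(?:dr|prof|mr|mrs|ms)\.?\s+", re.IGNORECASE)


def normalize_guest_name(name):
    """Return a lowercase, accent-free key for grouping name variants."""
    name = HONORIFICS.sub("", name.strip())
    name = re.sub(r"['’]s$", "", name)  # drop a trailing possessive
    name = unicodedata.normalize("NFKD", name)
    name = "".join(ch for ch in name if not unicodedata.combining(ch))
    return " ".join(name.lower().split())


print(normalize_guest_name("Dr. Sarah Chen"))
```

With the DataFrame above, `df["guest_key"] = df["guest"].map(normalize_guest_name)` would add a column to group on while keeping the original spelling for display.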

Analysis

cross_show = df.groupby("guest")["show"].nunique()
frequent = cross_show[cross_show > 1].sort_values(ascending=False)
print("Guests on multiple shows:")
print(frequent.head(20))
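
The same appearance records can also show which shows draw from the same pool of guests. Here is a minimal sketch, assuming appearances are available as `(guest, show)` pairs. `shared_guest_counts` is a hypothetical helper written with the standard library so it runs without pandas.

```python
from collections import Counter, defaultdict
from itertools import combinations


def shared_guest_counts(appearances):
    """Count the guests shared by each pair of shows.

    `appearances` is an iterable of (guest, show) tuples.
    """
    shows_by_guest = defaultdict(set)
    for guest, show in appearances:
        shows_by_guest[guest].add(show)

    overlap = Counter()
    for shows in shows_by_guest.values():
        # Sort so each pair is always keyed in the same order
        for pair in combinations(sorted(shows), 2):
            overlap[pair] += 1
    return overlap


records = [
    ("Sarah Chen", "Show A"), ("Sarah Chen", "Show B"),
    ("Sam Lee", "Show A"), ("Sam Lee", "Show B"), ("Sam Lee", "Show C"),
]
print(shared_guest_counts(records).most_common(3))
```

With the DataFrame from the previous section, the input could be `zip(df["guest"], df["show"])`.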

For sources without an RSS feed, a rendering service such as ScraperAPI can fetch JavaScript-heavy pages. At larger scale, ThorData proxies and ScrapeOps monitoring help keep the scraper running.

Key Takeaways

  • RSS feeds provide the most reliable podcast episode data
  • NLP + regex patterns extract guest names from titles and descriptions
  • Cross-show analysis reveals influence networks and trending experts
  • Automated tracking catches booking patterns early

RSS feeds are designed for public consumption. Respect rate limits when scraping additional platforms.
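
One way to respect rate limits is to enforce a minimum gap between requests to the same host. Here is a minimal sketch. `HostThrottle` is a hypothetical helper, and the one-second default is an assumption, not a limit published by any platform.

```python
import time
from urllib.parse import urlparse


class HostThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable for testing
        self._sleep = sleep
        self._last = {}        # host -> time of the most recent request

    def wait(self, url):
        """Sleep if the host was hit too recently; return the delay applied."""
        host = urlparse(url).netloc
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self.min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay
```

Calling `throttle.wait(url)` right before each fetch spaces out requests per host, while different hosts are not delayed by each other.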
