# Building a Podcast Guest Tracker Across Shows and Platforms
Tracking who appears on which podcasts reveals influence networks, trending experts, and booking patterns. Let's build a Python scraper that monitors podcast guests across platforms.
## Data Sources
- Podcast RSS feeds: direct access to episode metadata
- Apple Podcasts: episode descriptions that often name guests
- ListenNotes API: a searchable podcast index with an API
## Setting Up

```bash
pip install requests beautifulsoup4 feedparser pandas spacy
python3 -m spacy download en_core_web_sm
```
## Parsing RSS Feeds

Every podcast exposes an RSS feed, which makes it the most reliable data source:
```python
import feedparser

def parse_podcast_feed(feed_url):
    feed = feedparser.parse(feed_url)
    episodes = []
    for entry in feed.entries:
        episodes.append({
            "title": entry.get("title", ""),
            "published": entry.get("published", ""),
            "summary": entry.get("summary", ""),
            "duration": entry.get("itunes_duration", ""),
            "link": entry.get("link", ""),
        })
    return {
        "show_title": feed.feed.get("title", ""),
        "episodes": episodes,
    }

podcast = parse_podcast_feed("https://feeds.simplecast.com/54nAGcIl")
print(f"Show: {podcast['show_title']}, Episodes: {len(podcast['episodes'])}")
```
## Extracting Guest Names with NLP
```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

GUEST_PATTERNS = [
    r"(?:with|featuring|guest|interview(?:ing)?|joined by)\s+([A-Z][a-z]+ [A-Z][a-z]+)",
    r"([A-Z][a-z]+ [A-Z][a-z]+)\s+(?:joins|talks|discusses|shares|explains)",
]

def extract_guests(title, summary):
    guests = set()
    text = f"{title} {summary}"
    for pattern in GUEST_PATTERNS:
        guests.update(re.findall(pattern, text))
    # Back up the regexes with named-entity recognition on a bounded slice
    doc = nlp(title + ". " + summary[:500])
    for ent in doc.ents:
        if ent.label_ == "PERSON" and len(ent.text.split()) >= 2:
            guests.add(ent.text)
    return list(guests)

print(extract_guests("Building AI Products with Sarah Chen",
                     "Sarah Chen, VP of Engineering at TechCorp, joins us to discuss..."))
```
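One wrinkle: the extractor can return near-duplicates of the same person ("Dr. Sarah Chen" vs. "Sarah Chen"), which would split one guest's appearance count. A minimal normalization sketch, assuming honorifics and stray whitespace are the main sources of variants (the helper names are my own):

```python
import re
from collections import defaultdict

# Strip a leading honorific, if present (an assumed, non-exhaustive list)
TITLES = re.compile(r"^(?:Dr|Prof|Mr|Ms|Mrs)\.?\s+", re.IGNORECASE)

def normalize_name(name):
    """Drop honorifics and collapse whitespace so variants share one key."""
    name = TITLES.sub("", name.strip())
    return " ".join(name.split())

def dedupe_guests(names):
    """Group raw extracted names under their normalized canonical form."""
    canonical = defaultdict(set)
    for raw in names:
        canonical[normalize_name(raw)].add(raw)
    return dict(canonical)

merged = dedupe_guests(["Dr. Sarah Chen", "Sarah Chen", "  Sarah  Chen "])
print(merged)  # one canonical key: "Sarah Chen"
```

For serious use you would want fuzzy matching as well (initials, nicknames), but this catches the most common variants cheaply.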
## Tracking Across Multiple Shows
```python
import time

import pandas as pd

PODCAST_FEEDS = {
    "Lex Fridman": "https://lexfridman.com/feed/podcast/",
    "The Changelog": "https://changelog.com/podcast/feed",
    "Talk Python": "https://talkpython.fm/episodes/rss",
}

def build_guest_database(feeds):
    all_appearances = []
    for show_name, feed_url in feeds.items():
        podcast = parse_podcast_feed(feed_url)
        for ep in podcast["episodes"]:
            guests = extract_guests(ep["title"], ep.get("summary", ""))
            for guest in guests:
                all_appearances.append({
                    "guest": guest,
                    "show": show_name,
                    "episode_title": ep["title"],
                    "date": ep.get("published", ""),
                })
        time.sleep(1)  # be polite between feed requests
    return pd.DataFrame(all_appearances)

df = build_guest_database(PODCAST_FEEDS)
print(f"Total appearances: {len(df)}")
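If the tracker runs on a schedule, each run should merge into the existing database without double-counting episodes already seen. A sketch using pandas `drop_duplicates`, assuming guest + show + episode title uniquely identifies an appearance (the function name is my own):

```python
import pandas as pd

def merge_appearances(existing, new):
    """Append new appearances, dropping rows already present.
    Keyed on guest + show + episode title (assumes titles are stable)."""
    combined = pd.concat([existing, new], ignore_index=True)
    return combined.drop_duplicates(
        subset=["guest", "show", "episode_title"], keep="first"
    ).reset_index(drop=True)

old = pd.DataFrame([{"guest": "Jane Doe", "show": "Talk Python",
                     "episode_title": "Ep 1", "date": "2024-01-01"}])
new = pd.DataFrame([
    {"guest": "Jane Doe", "show": "Talk Python",
     "episode_title": "Ep 1", "date": "2024-01-01"},   # already tracked
    {"guest": "John Roe", "show": "Talk Python",
     "episode_title": "Ep 2", "date": "2024-02-01"},   # genuinely new
])
merged = merge_appearances(old, new)
print(len(merged))  # 2
```

Persist the merged frame with `merged.to_csv(...)` between runs and the tracker becomes incremental.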
## Analysis

```python
# Guests who appear on more than one tracked show
cross_show = df.groupby("guest")["show"].nunique()
frequent = cross_show[cross_show > 1].sort_values(ascending=False)

print("Guests on multiple shows:")
print(frequent.head(20))
```
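Beyond per-guest counts, the same DataFrame supports a rough co-appearance view: which experts get booked on the same shows. A sketch of counting guest pairs per show (the helper name is hypothetical):

```python
from collections import Counter
from itertools import combinations

import pandas as pd

def shared_show_pairs(df):
    """Count guest pairs that appear on the same show -- a crude
    proxy for booking overlap between experts."""
    pairs = Counter()
    for _, guests in df.groupby("show")["guest"]:
        for a, b in combinations(sorted(set(guests)), 2):
            pairs[(a, b)] += 1
    return pairs

df = pd.DataFrame([
    {"guest": "Jane Doe", "show": "Talk Python"},
    {"guest": "John Roe", "show": "Talk Python"},
    {"guest": "Jane Doe", "show": "The Changelog"},
])
pairs = shared_show_pairs(df)
print(pairs)  # Jane Doe and John Roe share one show
```

Feeding these pairs into a graph library such as networkx would turn the counts into an explorable influence network.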
For sources without RSS feeds, a rendering service such as ScraperAPI can fetch JavaScript-heavy pages; at larger scale, rotating proxies (e.g. ThorData) and job monitoring (e.g. ScrapeOps) help keep the pipeline healthy.
## Key Takeaways
- RSS feeds provide the most reliable podcast episode data
- NLP + regex patterns extract guest names from titles and descriptions
- Cross-show analysis reveals influence networks and trending experts
- Automated tracking catches booking patterns early
RSS feeds are designed for public consumption. Respect rate limits when scraping additional platforms.