Devil Scrapes

Posted on Jun 3

Podcast Guest Database: extract Spotify show guest history with Python

#webscraping #python #apify #data

Quick answer: Spotify's Web API returns episode metadata but has no guest field, no guest search, and no bulk export. To build a podcast guest database you page through a show's episodes, run NLP over each description, and emit one row per (episode × guest) pair. The Spotify Podcast Guest Extractor does this for $0.005 per row (~$5.05 per 1,000 results) using your own Spotify Developer credentials. No audio is downloaded — metadata and NLP only.

Podcast guesting is a multi-billion-dollar PR services market. Agencies charge $2-8k per month per client to secure guest spots on relevant shows, and their most time-consuming task is a question Spotify cannot answer: "who has been a guest on this show in the last two years?"

The episode list is there. Guest names are almost always in the title or description. But there is no guest field, no structured attribution, and no bulk export. You either read 200 episode pages by hand, or you build a scraper. Here is what that takes — and the one-call shortcut I packaged as an Apify Actor.

What is Spotify for Podcasts? 🎙️

Spotify's podcast catalog is the largest in the world by available show count — over 5 million shows indexed as of 2026. Every show's episode list is reachable via the Spotify Web API using a free Developer account and the client_credentials OAuth flow — no user login, no Spotify Premium, just a free app registration.
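For orientation, here is a minimal sketch of the client_credentials exchange using only the standard library. The helper names `build_token_request` and `fetch_token` are illustrative and not taken from the Actor's code; the token endpoint and the Basic-auth header follow Spotify's documented flow.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://accounts.spotify.com/api/token"


def build_token_request(client_id: str, client_secret: str) -> urllib.request.Request:
    """Build the POST for the client_credentials grant (no network I/O here)."""
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    body = urllib.parse.urlencode({"grant_type": "client_credentials"}).encode()
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def fetch_token(client_id: str, client_secret: str) -> dict:
    """Send the request; the JSON reply carries access_token and expires_in."""
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)
```

The returned access_token then goes into an `Authorization: Bearer ...` header on catalog requests such as /v1/shows/{id}/episodes.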

What the API gives you per episode:

Episode ID, title, and plain-text description
Release date (YYYY-MM-DD)
Duration in milliseconds
Public URL on open.spotify.com

What it does not give you: a guest field, a speaker label, or any structured attribution of who appeared. That inference work is left entirely to the caller.
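As a rough sketch, the fields listed above map onto a flat record like this. The key names on the left mirror the dataset rows shown later in this post; `episode_to_record` is a hypothetical helper, and the input is one element of the `items` array that the episodes endpoint returns.

```python
from typing import Any


def episode_to_record(show_id: str, show_name: str, episode: dict[str, Any]) -> dict[str, Any]:
    """Flatten one Spotify episode object into the metadata half of a row."""
    return {
        "show_id": show_id,
        "show_name": show_name,
        "episode_id": episode["id"],
        "episode_name": episode["name"],
        "episode_release_date": episode["release_date"],
        "episode_duration_ms": episode["duration_ms"],
        "episode_url": episode["external_urls"]["spotify"],
        # Kept for the extraction step; it is not part of the final row.
        "description": episode.get("description", ""),
    }
```

Nothing in the record says who appeared; that has to come from the name and description strings.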

Does Spotify have a podcast guest API?

No. Spotify's Web API exposes episode metadata but zero attribution data. The closest thing is the episode description — a free-text field producers write themselves, with no enforced schema. "Featuring Demis Hassabis" appears in one show; "#412 — Demis Hassabis" in another; "Today Lex sits down with the CEO of Google DeepMind" in a third. Extracting a clean, structured guest name from that variety requires pattern matching and NLP, not a single API field.

There is no guest-history search, no speaker-graph endpoint, and no bulk export. If you want this data as a database, you build the extraction layer yourself.

What the data looks like

Each row is one (episode × guest) pair. Here is a real example from the Lex Fridman Podcast:

{
  "show_id": "2MAi0BvDc6GTFvKFPXnkCL",
  "show_name": "Lex Fridman Podcast",
  "episode_id": "5kF8w2Q9pNeLBxXxNH1mxJ",
  "episode_name": "#412 — Demis Hassabis: AGI and the Future of AI",
  "episode_release_date": "2026-04-30",
  "episode_duration_ms": 9384210,
  "guest_name": "Demis Hassabis",
  "guest_role": "guest",
  "confidence": 0.92,
  "episode_url": "https://open.spotify.com/episode/5kF8w2Q9pNeLBxXxNH1mxJ",
  "scraped_at": "2026-05-16T12:00:00Z"
}

Eleven fields, Pydantic-validated, the same shape every time. The confidence field tells you whether the name came from a tight regex pattern (0.85-1.00) or a raw spaCy NER entity (0.55-0.75). Filter on confidence >= 0.8 for high precision; keep everything for broader coverage.

If an episode has no extractable guest — a solo episode, or a sparse description — the Actor still emits one row with guest_name: null, guest_role: "host", and confidence: 0.0, keeping the episode metadata in your dataset for downstream joins instead of silently dropping the record.
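A minimal sketch of that row-emission rule, assuming the extractor hands back a list of (name, role, confidence) tuples. `rows_for_episode` and `high_precision` are illustrative names, not the Actor's internals.

```python
from datetime import datetime, timezone
from typing import Any, Optional

Guest = tuple[str, str, float]  # (name, role, confidence)


def rows_for_episode(
    episode: dict[str, Any],
    guests: list[Guest],
    scraped_at: Optional[str] = None,
) -> list[dict[str, Any]]:
    """Emit one row per guest, or one placeholder row when no guest was found."""
    stamp = scraped_at or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    if not guests:
        # Keep the episode in the dataset so downstream joins still see it.
        guests = [(None, "host", 0.0)]
    return [
        {**episode, "guest_name": name, "guest_role": role,
         "confidence": confidence, "scraped_at": stamp}
        for name, role, confidence in guests
    ]


def high_precision(rows: list[dict[str, Any]], threshold: float = 0.8) -> list[dict[str, Any]]:
    """Keep only named guests at or above the confidence threshold."""
    return [r for r in rows if r["guest_name"] is not None and r["confidence"] >= threshold]
```

Filtering after the fact, rather than at extraction time, means the placeholder rows and the lower-confidence NER rows stay available for anyone who wants coverage over precision.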

The naive approach (and why it falls apart) ⚙️

The obvious path looks simple:

Register a Spotify Developer app, get a token
GET /v1/shows/{id}/episodes?limit=50
Loop through description fields, apply a regex, done

It works for one show with a consistent title convention. At scale it breaks for three reasons:

1. Pagination with an undocumented rate limiter. Spotify publishes no threshold for its catalog rate limiter. An aggressive loop without Retry-After handling crashes mid-pagination and loses the cursor. We retry with exponential backoff (base 2 s, capped at 30 s, max 5 attempts) and honour Retry-After. We also refresh the OAuth token before its 3600-second expiry rather than waiting for a 401 mid-scrape.

2. Description format variance. A 200-episode show may use three different guest-attribution conventions across its history. We run two extraction stages: a curated regex sweep first (patterns like "with guest {Name}", "#N — {Name}: {Title}") at 0.85-1.00, then spaCy en_core_web_sm PERSON entity recognition at 0.55-0.75 for everything the regex missed. Names found by both keep the regex score; deduplication is case-insensitive.

3. False positives from raw NER. spaCy tags production companies, product names, and other proper nouns as PERSON. We apply a confidence floor of 0.30, require extracted names to be at least two tokens long unless they appear in the episode title, and drop NER-only matches that fit known false-positive patterns. Every row carries its score so you can tune precision and recall downstream.
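To make points 2 and 3 concrete, here is a small sketch of a regex sweep plus the NER candidate filter. The three patterns, the exact scores inside the 0.85-1.00 band, and the function names are assumptions for illustration; the Actor's curated pattern set is larger and is not reproduced here.

```python
import re

# One to four capitalised tokens, e.g. "Demis Hassabis".
NAME = r"[A-Z][A-Za-z'-]+(?:\s+[A-Z][A-Za-z'-]+){0,3}"

# (pattern, confidence, field). Scores are illustrative picks in the regex band.
PATTERNS = [
    (re.compile(r"^#\d+\s*[\u2014\u2013-]\s*(" + NAME + r")\s*:"), 0.95, "title"),
    (re.compile(r"(?i:\bwith guest)\s+(" + NAME + r")"), 0.90, "description"),
    (re.compile(r"(?i:\bfeaturing)\s+(" + NAME + r")"), 0.85, "description"),
]


def regex_sweep(title: str, description: str) -> list[tuple[str, float]]:
    """Stage 1: return (name, confidence), deduplicated case-insensitively."""
    fields = {"title": title, "description": description}
    best: dict[str, tuple[str, float]] = {}
    for pattern, confidence, field in PATTERNS:
        for match in pattern.finditer(fields[field]):
            name = match.group(1).strip()
            key = name.casefold()
            # Keep the highest-scoring pattern that produced this name.
            if key not in best or confidence > best[key][1]:
                best[key] = (name, confidence)
    return list(best.values())


def keep_ner_candidate(name: str, confidence: float, title: str, floor: float = 0.30) -> bool:
    """Stage 2 filter: confidence floor, then two tokens unless the name is in the title."""
    if confidence < floor:
        return False
    if len(name.split()) >= 2:
        return True
    return name.casefold() in title.casefold()
```

The known-false-positive blocklist is left out of this sketch because its contents are not described in this post.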

We rotate browser fingerprints via curl-cffi impersonation so the API sees a real browser TLS handshake. When a run produces partial results — a show 404s, or the rate limiter cuts a pagination run short — we surface the shortfall with a clear set_status_message and never hand you a silently empty dataset.
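The retry policy from point 1 and the partial-result behaviour can be sketched as below. The numbers (base 2 s, cap 30 s, 5 attempts, 200-episode cap) come from this post; `RateLimited`, `retry_delay`, `collect_episodes`, and the injected `fetch_page(offset, limit)` callable are hypothetical names, and using the Retry-After value as-is is an assumption about what "honour" means here.

```python
import time
from typing import Any, Callable, Optional

BASE_DELAY_S = 2.0
MAX_DELAY_S = 30.0
MAX_ATTEMPTS = 5


class RateLimited(Exception):
    """Raised by fetch_page on HTTP 429; retry_after is the header value, if any."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def retry_delay(attempt: int, retry_after: Optional[float] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        return float(retry_after)  # the server's hint wins over our own schedule
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)


def collect_episodes(
    fetch_page: Callable[[int, int], dict[str, Any]],
    limit: int = 50,
    max_episodes: int = 200,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[list[Any], bool]:
    """Page through a show; return (episodes, complete) so a shortfall is visible."""
    episodes: list[Any] = []
    offset = 0
    while len(episodes) < max_episodes:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                page = fetch_page(offset, limit)
                break
            except RateLimited as exc:
                if attempt == MAX_ATTEMPTS:
                    return episodes, False  # partial result, flagged as such
                sleep(retry_delay(attempt, exc.retry_after))
        episodes.extend(page["items"])
        if not page["items"] or not page.get("next"):
            break
        offset += limit
    return episodes[:max_episodes], True
```

Returning the `complete` flag alongside the data is what lets the caller report a shortfall instead of passing along a short or empty list as if it were the whole show.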

The Actor 🔥

I packaged the result as an Apify Actor: Spotify Podcast Guest Extractor.

You need a free Spotify Developer app (60 seconds at developer.spotify.com/dashboard). Paste your client_id and client_secret into the Actor input as Apify Secrets (never logged), provide show IDs, and click Start. Or run it from code:

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# call() starts the Actor and waits for the run to finish.
run = client.actor("DevilScrapes/spotify-podcast-guest-graph").call(
    run_input={
        "show_ids": ["2MAi0BvDc6GTFvKFPXnkCL"],  # Lex Fridman Podcast
        "maxEpisodesPerShow": 50,
        "clientId": "YOUR_SPOTIFY_CLIENT_ID",
        "clientSecret": "YOUR_SPOTIFY_CLIENT_SECRET",
        "market": "US",
    }
)

# One item per (episode, guest) pair; guest_name is None on placeholder rows.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["guest_name"], item["guest_role"], item["confidence"])

The show ID is the 22-character string after /show/ in any open.spotify.com/show/... URL. You can also pass showSearchQuery — the Actor resolves the top match via /v1/search.
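If you are collecting show URLs in bulk, a small helper along these lines pulls the ID out. `parse_show_id` is an illustrative name; it assumes IDs are 22 alphanumeric characters, as described above.

```python
import re
from typing import Optional

SHOW_URL_RE = re.compile(r"/show/([A-Za-z0-9]{22})(?![A-Za-z0-9])")
BARE_ID_RE = re.compile(r"[A-Za-z0-9]{22}")


def parse_show_id(value: str) -> Optional[str]:
    """Accept a bare show ID or an open.spotify.com/show/... URL."""
    value = value.strip()
    if BARE_ID_RE.fullmatch(value):
        return value
    match = SHOW_URL_RE.search(value)
    return match.group(1) if match else None
```

Query strings such as `?si=...` are ignored because the match stops at the first non-alphanumeric character after the ID.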

What you would actually use this for

Four concrete scenarios:

PR firm prospect research. Pull guest histories for 50 target shows before sending a single outreach email, then cross-reference against your client's competitor list. At $0.005/row and ~100 rows per show that is $0.50 per show, or $25 for the full list — versus 4-8 hours of manual reading.
Founder-led PR outreach. A cold email that references a show's past guests ("I noticed you've had three enterprise SaaS founders on in Q1") converts better than a generic pitch. Pull the guest history, pick the most relevant past guests, and personalize.
Entity-graph AI training. Feed (person, show, episode date) triples into a knowledge graph. Each row is already normalized — show_id, episode_id, guest_name, guest_role, ISO date, stable URL — and the confidence score tells the pipeline how much to trust the node.
Journalist source mapping. Track which executives or public figures have been making the podcast circuit in a given period. A guest who appeared on 8 business podcasts in 90 days is either promoting something or about to be newsworthy.
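For the last two scenarios, a sketch of turning dataset rows into (person, show, date) triples and counting how many distinct shows each guest appeared on since a cutoff date. Function names and the 0.8 default threshold are illustrative choices, and name matching here is a plain case-insensitive comparison with no entity resolution.

```python
from collections import defaultdict
from datetime import date
from typing import Any


def guest_triples(rows: list[dict[str, Any]], min_confidence: float = 0.8) -> list[tuple[str, str, str]]:
    """(guest_name, show_name, ISO release date) for rows that clear the threshold."""
    return [
        (r["guest_name"], r["show_name"], r["episode_release_date"])
        for r in rows
        if r["guest_name"] and r["confidence"] >= min_confidence
    ]


def shows_per_guest(triples: list[tuple[str, str, str]], since: date) -> dict[str, int]:
    """Count distinct shows per guest for episodes released on or after `since`."""
    shows: dict[str, set[str]] = defaultdict(set)
    for name, show, released in triples:
        if date.fromisoformat(released) >= since:
            shows[name.casefold()].add(show)
    return {name: len(seen) for name, seen in shows.items()}
```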

Pricing — exact numbers 💰

Pay-per-event. Each run costs a flat actor-start fee plus a per-row fee, and rows are billed only when they land in your dataset.

Event	Cost
`actor-start` (one per run)	$0.05
`result` (per row)	$0.005

A small cost table to calibrate:

Run size	Rows	Cost
1 show, 20 episodes, ~1.5 guests/ep	~30 rows	~$0.20
5 shows, 20 episodes each	~150 rows	~$0.80
10 shows, 50 episodes each	~750 rows	~$3.80
1,000 rows	1,000	$5.05

Apify's $5 free trial credit covers your first ~990 rows with no credit card. A full guest-history pull of one podcast's last 50 episodes costs roughly $0.30 at one guest per episode (about $0.43 at the ~1.5 guests per episode assumed in the table) and takes under two minutes.
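The arithmetic behind those figures, as a small sketch you can adapt for your own run sizes. The two prices come from the event table above; `run_cost` and `rows_within_budget` are illustrative helpers, and Decimal is used so the cents come out exact.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.05")  # charged once per run
PER_ROW = Decimal("0.005")     # charged per dataset row


def run_cost(rows: int, runs: int = 1) -> Decimal:
    """Total cost in USD for `rows` results spread over `runs` Actor starts."""
    return ACTOR_START * runs + PER_ROW * rows


def rows_within_budget(budget: Decimal) -> int:
    """How many rows a single run can emit before exceeding `budget`."""
    return max(0, int((budget - ACTOR_START) // PER_ROW))
```

For example, `run_cost(750)` gives 3.80 and `rows_within_budget(Decimal("5"))` gives 990, matching the table and the trial-credit figure.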

The technically interesting bit

The extraction pipeline runs two stages in series. A regex match always wins over an NER match on the same name, and scores from the two stages never mix. This matters because NER on podcast descriptions produces false positives: production company names, product names, and other proper nouns that spaCy tags as PERSON. Because precedence depends on the stage and not on the score, a regex hit at 0.85 is never overridden by an NER entity that isn't actually a person, and NER-only names stay inside their own 0.55-0.75 band.

The extractor is a pure function (a string in, [(name, role, confidence)] out, zero I/O) — independently testable against description fixtures without spinning up the Spotify auth stack.
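A minimal sketch of that precedence rule, written as a pure function so it can be exercised with plain fixtures. `merge_stages` is an illustrative name, and both inputs are assumed to be lists of (name, confidence) pairs produced by the respective stage.

```python
Candidate = tuple[str, float]  # (name, confidence)


def merge_stages(regex_hits: list[Candidate], ner_hits: list[Candidate]) -> list[Candidate]:
    """Combine both stages; a regex hit always wins over NER for the same name."""
    merged: dict[str, Candidate] = {}
    for name, confidence in regex_hits:
        key = name.casefold()
        if key not in merged or confidence > merged[key][1]:
            merged[key] = (name, confidence)
    for name, confidence in ner_hits:
        # setdefault never replaces an existing regex entry, whatever the NER score.
        merged.setdefault(name.casefold(), (name, confidence))
    return list(merged.values())
```

Because nothing here touches the network or the spaCy model, a test needs only two lists and an expected result.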

Limitations

Metadata-only. No audio, no transcripts. Spotify's ToS prohibits downloading audio; this Actor never touches it. Guest extraction is from episode titles and plain-text descriptions.
English regex patterns only in v1. The spaCy NER fallback still surfaces PERSON entities from other languages, but role assignment is unreliable. Multi-language NLP is on the v2 roadmap.
NER accuracy varies by show format. Rich, structured descriptions yield high recall; sparse or marketing-copy descriptions yield lower recall. When no guest can be extracted, the Actor emits a guest_name: null row to preserve the episode metadata.
Hard cap of 200 episodes per show. Sufficient for most research workflows.
Regionally restricted episodes that return 404 are logged at WARNING and skipped.

FAQ

Is using the Spotify Web API like this legal?
This Actor uses Spotify's official, documented client_credentials grant — the same one Spotify provides to third-party developers. It accesses only public catalog data, requires your own developer credentials, and makes no more than a few hundred requests per run. No audio is downloaded. Review Spotify's Developer Terms of Service against your use case.

Do I really need to create a Spotify Developer app?
Yes. Creating a free app at developer.spotify.com/dashboard takes 60 seconds: click "Create app", name it, set any redirect URI, accept the ToS. The client_credentials grant gives read-only access to the public catalog and never touches a user account.

Can I export results to a spreadsheet or warehouse?
Yes. Export CSV, JSON, or Excel from the Apify Console dataset view, set a webhook on ACTOR.RUN.SUCCEEDED to push results into Make, Zapier, or n8n, or pull the data via the Apify Dataset API.

How accurate is the guest extraction?
Regex Stage 1 matches common podcast conventions at confidence 0.85-1.00. spaCy NER Stage 2 adds coverage at 0.55-0.75. Filter on confidence >= 0.8 for high precision; keep all rows for broader coverage with some false positives. Each row carries its score — the tradeoff is yours to tune.

Try it

The Actor is live on the Apify Store: apify.com/DevilScrapes/spotify-podcast-guest-graph.

Free $5 trial credit, no credit card. Pull the last 50 episodes of any show you care about — Lex Fridman, The Tim Ferriss Show, whatever is relevant — and you will have a clean, queryable guest history in your dataset inside two minutes.

Found a show format the extractor misses? A guest role pattern we should add to the regex set? Drop a comment — we ship based on real use cases.

Built by Devil Scrapes — Apify Actors with attitude. Pay-per-event, transparent pricing, no junk fields. 😈

DEV Community