Greg

Posted on Sep 11, 2020 • Edited on Nov 18, 2020 • Originally published at gregondata.com

Finding popular data science podcasts via web scraping

#datascience #python #tutorial

The article will go over the process I used to create the list of podcasts you see below. If you're just here for the podcasts, then have at it...

the most popular data science podcasts

title	author	avg_rtg	rtg_ct	episodes
Lex Fridman Podcast	Lex Fridman	4.9	2400	126
Machine Learning Guide	OCDevel	4.9	626	30
Data Skeptic	Kyle Polich	4.4	431	300
Data Stories	Enrico Bertini and Moritz Stefaner	4.5	405	162
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)	Sam Charrington	4.7	300	300
DataFramed	DataCamp	4.9	188	59
The AI Podcast	NVIDIA	4.5	162	125
SuperDataScience	Kirill Eremenko	4.6	161	300
Partially Derivative	Partially Derivative	4.8	141	101
Machine Learning	Stanford	3.9	138	20
Talking Machines	Tote Bag Productions	4.6	133	106
AI in Business	Daniel Faggella	4.4	102	100
Learning Machines 101	Richard M. Golden, Ph.D., M.S.E.E., B.S.E.E.	4.4	87	82
storytelling with data podcast	Cole Nussbaumer Kna	4.9	80	33
Data Crunch	Data Crunch Corporation	4.9	70	64
Data Viz Today	Alli Torban	5.0	64	62
Artificial Intelligence	MIT	4.1	61	31
O'Reilly Data Show Podcast	O'Reilly Media	4.2	59	60
Machine Learning – Software Engineering Daily	Machine Learning – Software Engineering Daily	4.5	59	115
Data Science at Home	Francesco Gadaleta	4.2	58	100
Data Engineering Podcast	Tobias Macey	4.7	58	150
Big Data	Ryan Estrada	4.6	58	13
Follow the Data Podcast	Bloomberg Philanthropies	4.3	57	82
Making Data Simple	IBM	4.3	56	104
Analytics on Fire	Mico Yuk	4.4	51	48
Learn to Code in One Month	Learn to Code	4.9	50	26
Becoming A Data Scientist Podcast	Renee Teate	4.5	49	21
Practical AI: Machine Learning & Data Science	Changelog Media	4.5	48	105
The Present Beyond Measure Show: Data Visualization, Storytelling & Presentation for Digital Marketers	Lea Pica	4.9	44	58
The Data Chief	Mission	4.9	43	16
AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion	Cognilytica	4.2	42	161
Data Driven	Data Driven	4.9	41	257
HumAIn Podcast - Artificial Intelligence, Data Science, and Developer Education	David Yakobovitch	4.8	39	78
Data Gurus	Sima Vasa	5.0	39	106
Masters of Data Podcast	Sumo Logic hosted by Ben Newton	5.0	38	74
The PolicyViz Podcast	The PolicyViz Podcast	4.7	36	180
The Radical AI Podcast	Radical AI	4.9	34	35
Women in Data Science	Professor Margot Gerritsen	4.9	28	24
Towards Data Science	The TDS team	4.6	26	50
Data in Depth	Mountain Point	5.0	22	24
Data Science Imposters Podcast	Antonio Borges and Jordy Estevez	4.4	22	88
The Artists of Data Science	Harpreet Sahota	4.9	19	41
#DataFemme	Dikayo Data	5.0	17	30
The Banana Data Podcast	Dataiku	4.9	15	33
Experiencing Data with Brian T. O'Neill	Brian T. O'Neill from Designing for Analytics	4.9	14	13
Secrets of Data Analytics Leaders	Eckerson Group	4.8	13	82
Data Journeys	AJ Goldstein	5.0	13	26
Data Driven Discussions	Outlier.ai	5.0	12	8
Data Futurology - Leadership And Strategy in Artificial Intelligence, Machine Learning, Data Science	Felipe Flores	4.4	11	135
Artificially Intelligent	Christian Hubbs and Stephen Donnelly	4.9	11	100

why i want to find data science podcasts

This would normally be at the top of an article on finding data science podcasts. Well, it would be at the top of any article. But realistically, most people are finding this from Google, and they're just looking for the answer that's at the top of the page. If you type in 'the most popular data science podcasts', you really don't want to have to scroll down endlessly to find the answer you're looking for. So to make their experience better, we're just leaving the answer up there. And giving them sass. Lots of sass.

Anyways, I really like listening to things. While newsletters are great for keeping up with current events and blogs are great for learning specific things, podcasts have a special place in my heart for allowing me to aimlessly learn something new every day. The format really lends itself to delivering information efficiently, but in a way where you can multitask. Pre-COVID, my morning commute was typically full of podcasts. While COVID has rendered my commute a nonexistent affair, I still try to listen to at least a podcast a day if I can manage it. My view is that 30 minutes of learning a day will really add up in the long run, and I feel that podcasts are a great way to get there.

Now that we've been through my love affair with podcasts, you can imagine my surprise when I started looking for a few data science ones to subscribe to and I didn't find a tutorial on how to use web scraping to find the most popular data science podcasts to listen to. I know, crazy. There's a web scraping tutorial on everything under the sun except for - seemingly - podcasts. I mean there's probably not one on newsletters either, but we'll leave that alone for now...

So if no one else is crazy enough to write about finding data science podcasts with web scraping, then...

gameplanning the process

By now we're almost certainly rid of those savages who are only here for the answer (gasp, how could they), so we'll go into the little process I went through to gather the data. It's not particularly long, and took me probably an hour to put it together, so it should be a good length for an article.

I'm using Python here with an installation of Anaconda (a common package management and deployment system for Python). I'll be running this in a Jupyter notebook, since it's a one-off task that I don't need to use ever again... hopefully.

In terms of what I'm going to do, I'll run a few Google keyword searches which are limited to URLs under 'https://podcasts.apple.com/us/podcast/' and scrape the results for the first few pages. From there I'll just be scraping each Apple Podcasts page to get the total number of ratings and the average rating. Yeah, the data will be biased, but it's a quick and dirty way to get the answer I'm looking for.

code to find top data science podcasts - version 1

# import standard-library python packages
import time
import urllib.parse

The packages above ship with Python; the ones below don't. If you don't have them installed, you can install them with pip or conda.

# import third-party python packages
# if you don't have these installed, install them with pip or conda
import requests
import pandas as pd
from bs4 import BeautifulSoup

Now that the packages have been imported, you should define your user agent. First off, because it's polite if you're scraping anything. Secondly, Google gives different results for mobile and desktop searches. This isn't actually my user agent; I took it from another tutorial since I'm a bit lazy. I actually use Linux...

# define your desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"

Alright, now we're going to define the queries we want to run, and then create a function that spits out the Google URL we want to scrape. I'm putting the queries in a kwargs format, since I want to put them through a function. That means I can just loop through the list of kwargs and get the results that the function returns.

# Queries
list_kwargs = [
    {"string": 'data podcast'},
    {"string": 'data podcast', "pg": 2},
    {"string": 'data podcast', "pg": 3},
    {"string": 'data science podcast'},
    {"string": 'data engineering podcast'},
    {"string": 'data visualization podcast'},
]

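def string_to_podcast_query(string, pg=None):
    # restrict the search to apple podcast pages and url-encode the whole query
    query = urllib.parse.quote_plus(f'site:https://podcasts.apple.com/us/podcast/ {string}')
    # results are requested in steps of 10, so page 2 starts at result 10
    if pg is not None:
        query = query + "&start=" + str(10*(pg-1))
    return f"https://google.com/search?hl=en&lr=en&q={query}", string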

# define the headers we will add to all of our requests
headers = {"user-agent" : USER_AGENT}

# set up an empty list to push results to
results = []

# cycle through the list of queries 
for x in list_kwargs:
    # return the query url and the search term that was used to create it (for classification later)
    url, search_term = string_to_podcast_query(**x)

    # make a get request to the url, include the headers with our user-agent
    resp = requests.get(url, headers=headers)

    # only proceed if you get a 200 code that the request was processed correctly
    if resp.status_code == 200:
        # feed the request into beautiful soup
        soup = BeautifulSoup(resp.content, "html.parser")

        # find all divs (html elements that wrap page areas) that hold google results
        # this stays inside the if, so a failed request never reuses an old soup
        for g in soup.find_all('div', class_='r'):
            # within the results, find all the links
            anchors = g.find_all('a')
            if anchors:
                # get the link and title, add them to a dict, and append that to the results list
                link = anchors[0]['href']
                title = g.find('h3').text
                item = {
                    "title": title,
                    "link": link,
                    "search_term": search_term
                }
                results.append(item)

    # sleep for 2.5s between requests.  we don't want to annoy google and deal with recaptchas
    time.sleep(2.5)
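
To sanity-check the search URLs before sending anything to Google, the query builder can be exercised on its own. The sketch below is a variant that leans on `urllib.parse.urlencode` to assemble the whole query string. `build_search_url` is a name made up for this example, not something from the code above, and it assumes the same `hl`, `lr`, `q` and `start` parameters.

```python
from urllib.parse import urlencode

# hypothetical helper for illustration; mirrors string_to_podcast_query from above
def build_search_url(string, pg=None):
    params = {
        "hl": "en",
        "lr": "en",
        "q": f"site:https://podcasts.apple.com/us/podcast/ {string}",
    }
    # page 1 needs no offset; page 2 starts at result 10, page 3 at 20, and so on
    if pg is not None:
        params["start"] = 10 * (pg - 1)
    # urlencode escapes each value (spaces become '+', ':' and '/' are percent-encoded)
    return "https://google.com/search?" + urlencode(params)

url = build_search_url("data podcast", pg=3)
print(url)
```

Building the parameters as a dict keeps the page offset out of the encoded search text, which makes it a little harder to mangle the URL by accident.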

Alright, now we have the Google results back - nice. From here, let's put that in a pandas dataframe and filter it a bit.

google_results_df = pd.DataFrame(results)

# create a filter for anything that is an episode.  They should contain a ' | '.
# drop any duplicate results as well.
google_results_df['is_episode'] = google_results_df['title'].str.contains(' | ',regex=False)
google_results_df = google_results_df.drop_duplicates(subset='title')

google_results_podcasts = google_results_df.loc[~google_results_df['is_episode']].copy()
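
The `regex=False` flag in that filter matters more than it looks. In a regular expression `|` means "or", so the pattern `' | '` would match any title that contains a space. A small sketch with plain Python strings (the titles are invented for the example) shows the difference between a literal match and a regex match:

```python
import re

# made-up titles for illustration: one podcast page, one episode page
titles = [
    "Data Skeptic on Apple Podcasts",
    "Data Skeptic | Some Episode Name on Apple Podcasts",
]

# literal substring check, equivalent in spirit to str.contains(' | ', regex=False)
literal = [" | " in t for t in titles]

# regex check: ' | ' reads as "a space OR a space", so any title with a space matches
as_regex = [re.search(" | ", t) is not None for t in titles]

print(literal)   # [False, True]
print(as_regex)  # [True, True]
```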

Ok cool, we have a list of podcasts. Let's define our Apple Podcasts scraper.

def podcast_scrape(link):
    # get the link, use the same headers as had previously been defined.
    resp = requests.get(link, headers=headers)
    # raise an error on 4xx/5xx responses, so we never try to parse a page we didn't get
    resp.raise_for_status()
    soup = BeautifulSoup(resp.content, "html.parser")

    # find the figcaption element on the page
    rtg_soup = soup.find("figcaption", {"class": "we-rating-count star-rating__count"})
    # the text will return an avg rating and a number of ratings, split by a •
    # we'll split that up, so '4.3 • 57 Ratings' becomes '4.3', '57 Ratings'
    avg_rtg, rtg_ct = rtg_soup.get_text().split(' • ')
    # then we'll take numbers from the rtg_ct variable by splitting it on the space
    rtg_ct = rtg_ct.split(' ')[0]

    # find the title in the document, get the text and strip out whitespace
    title_soup = soup.find('span', {"class":"product-header__title"})
    title = title_soup.get_text().strip()
    # find the author in the document, get the text and strip out whitespace
    author_soup = soup.find('span', {"class":"product-header__identity podcast-header__identity"})
    author = author_soup.get_text().strip()

    # find the episode count div, then the paragraph under that, then just extract the # of episodes
    episode_soup = soup.find('div', {"class":"product-artwork__caption small-hide medium-show"})
    episode_soup_p = episode_soup.find('p')
    episode_ct = episode_soup_p.get_text().strip().split(' ')[0]

    # format the response as a dict, return that response as the result of the function
    response = {
        "title": title,
        "author": author,
        "link": link,
        "avg_rtg": avg_rtg,
        "rtg_ct": rtg_ct,
        "episodes": episode_ct
    }
    return response

Cool, we now have a podcast scraper. You can try it with the below code.

podcast_scrape('https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750')


{'title': 'Follow the Data Podcast',
'author': 'Bloomberg Philanthropies',
'link': 'https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750',
'avg_rtg': '4.3',
'rtg_ct': '57',
'episodes': '82'}
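
The string handling inside the scraper can be tried without hitting Apple at all. The sketch below pulls it into a small function, `parse_rating_text`, which is a hypothetical helper and not part of the scraper above. It also handles abbreviated counts such as '2.4K Ratings'. That abbreviation is an assumption about how the page shows large numbers, so check it against what you actually get back.

```python
# hypothetical helper for illustration; not part of the original scraper
def parse_rating_text(text):
    # '4.3 • 57 Ratings' -> '4.3', '57 Ratings'
    avg_rtg, rtg_ct = text.strip().split(" • ")
    # keep only the number part of the count: '57 Ratings' -> '57'
    rtg_ct = rtg_ct.split(" ")[0]
    # assumption: large counts may be abbreviated with a K or M suffix
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = rtg_ct[-1].upper()
    if suffix in multipliers:
        count = round(float(rtg_ct[:-1]) * multipliers[suffix])
    else:
        count = int(rtg_ct.replace(",", ""))
    return float(avg_rtg), count

print(parse_rating_text("4.3 • 57 Ratings"))    # (4.3, 57)
print(parse_rating_text("4.9 • 2.4K Ratings"))  # (4.9, 2400)
```

Returning numbers instead of strings at this point would also mean no type conversion is needed once everything is in a dataframe.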

Back to the code. Let's now loop through all the podcast links we have.

# define the result array we'll fill during the loop
podcast_summ = []
for link in google_results_podcasts['link']:
    # use a try/except, since there are a few episodes still in the list that will cause errors.
    # this way, if there is an error we just won't add anything to the array.
    try:
        # get the response from our scraper and append it to our results
        pod_resp = podcast_scrape(link)
        podcast_summ.append(pod_resp)
    except Exception:
        # skip anything that fails to download or parse
        pass
    # wait for 5 seconds to be nice to apple
    time.sleep(5)
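
Silently passing on every error makes it hard to tell a leftover episode page from a scraper that has stopped working. One option, sketched below, is to keep a second list of the links that failed along with the reason. `fake_scrape` stands in for `podcast_scrape` so the example runs offline, and the links are placeholders.

```python
# stand-in for podcast_scrape so the example runs without network access
def fake_scrape(link):
    if "episode" in link:
        # mimics what happens when an expected element is missing from the page
        raise AttributeError("'NoneType' object has no attribute 'get_text'")
    return {"link": link, "avg_rtg": "4.3", "rtg_ct": "57"}

# placeholder links for illustration only
links = ["https://example.com/podcast-a", "https://example.com/episode-b"]

podcast_summ = []
failed = []
for link in links:
    try:
        podcast_summ.append(fake_scrape(link))
    except Exception as err:
        # remember what failed and why instead of dropping it silently
        failed.append({"link": link, "error": repr(err)})

print(len(podcast_summ), "scraped,", len(failed), "failed")
print(failed[0]["link"])
```

If the failed list suddenly contains every link, the page layout has probably changed and the class names in the scraper need another look.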

Now to put everything into a dataframe and do a little bit of sorting and filtering.

pod_df = pd.DataFrame(podcast_summ)

# Remove non-english podcasts, sorry guys...
pod_df = pod_df.loc[~pod_df['link'].str.contains('l=')]
pod_df = pod_df.drop_duplicates(subset='link')

# merge with the original dataframe (in case you want to see which queries were responsible for which podcasts)
merge_df = google_results_podcasts.merge(pod_df,on='link',suffixes=('_g',''))
merge_df.drop_duplicates(subset='title', inplace=True)

# change the average rating and rating count columns from strings to numbers
merge_df['avg_rtg'] = merge_df['avg_rtg'].astype('float64')
merge_df['rtg_ct'] = merge_df['rtg_ct'].astype('int64')

# sort by total ratings and then send them to a csv
merge_df.sort_values('rtg_ct',ascending=False).to_csv('podcasts.csv')

From here I exported the file to CSV and did a bit of cheating where I combined the title and link to create an <a href="link">title</a> element, but that's mainly because I got a bit lazy...
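
That title-and-link combination doesn't have to be done by hand. A minimal sketch using the standard library's `html.escape`, with `to_anchor` as a made-up helper name, could look like this:

```python
from html import escape

# hypothetical helper: turn a title and link into an html anchor tag
def to_anchor(title, link):
    # escape both parts so characters like & or " don't break the markup
    return f'<a href="{escape(link, quote=True)}">{escape(title)}</a>'

row = {
    "title": "Follow the Data Podcast",
    "link": "https://podcasts.apple.com/us/podcast/follow-the-data-podcast/id1104371750",
}
print(to_anchor(row["title"], row["link"]))
```

With pandas, the same function could be applied per row with something like `merge_df.apply(lambda r: to_anchor(r['title'], r['link']), axis=1)`.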

Anyways, that was the full process in creating the above list of data science podcasts. You now have the top podcasts, sorted by total ratings. I considered also using Castbox as a source of scraping (since they have an approximation of subscribers / downloads), but I couldn't find any good way to search for generally popular podcasts, or for podcasts that contained a certain word.

The first version of this article stopped here and showed the results from this code.

code to find top data science podcasts - version 2

Well, that was fine, but I think it's actually lacking a bit. There seem to be a few podcasts that I've stumbled across that are missing which I was hoping this would capture. So we're going to switch some stuff up. First, I'm going to use a mobile user agent to tell Google I'm searching from my phone.

Why? Well, Google shows different results for desktop searches vs mobile searches, so if we're looking to find the best podcasts, we want to be where most of the searches are actually happening. And since you basically always listen to podcasts on your phone, it probably makes sense to search from your phone... The code for that is below; the main changes are in the two lines flagged with comments: the class of the result div and the way the title is pulled out.

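# Mobile Search Version
# define your mobile user-agent (example value only; any mobile browser's user-agent string will do)
MOBILE_USER_AGENT = "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Mobile/15E148 Safari/604.1"
headers = {"user-agent" : MOBILE_USER_AGENT}

results = []
for x in list_kwargs:
    url, search_term = string_to_podcast_query(**x)
    resp = requests.get(url, headers=headers)
    if resp.status_code == 200:
        soup = BeautifulSoup(resp.content, "html.parser")

        # only parse when the request succeeded, so an old soup is never reused
        for g in soup.find_all('div', class_='mnr-c'): # updated target class
            anchors = g.find_all('a')
            if anchors:
                link = anchors[0]['href']
                title = anchors[0].find_all('div')[1].get_text().strip() # updated title crawler
                item = {
                    "title": title,
                    "link": link,
                    "search_term": search_term
                }
                results.append(item)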

    time.sleep(2.5)

What else did I switch up? I switched the Google queries up a bit and added a few more. I figure if I'm actually trying to find the best podcasts, it makes sense to search for them. That way, you get the ones that typically show up on these types of blog lists.

# Queries
list_kwargs = [
    {"string": 'best data podcast'},
    {"string": 'best data podcast', "pg": 2},
    {"string": 'best data podcast', "pg": 3},
    {"string": 'best data podcast', "pg": 4},
    {"string": 'best data science podcast'},
    {"string": 'best data science podcast', "pg": 2},
    {"string": 'best data science podcast', "pg": 3},
    {"string": 'best artificial intelligence podcast'},
    {"string": 'best machine learning podcast'},
    {"string": 'best data engineering podcast'},
    {"string": 'best data visualization podcast'},
]

And that's it - all of the changes I made for the second version. The results are updated up top, and this version gets a more complete list.

code to find top data science podcasts - version 3

And I'm an idiot. 'Fixing' my queries to only find the 'best data science podcasts' ended up making me miss a few of the good ones I found earlier. So I'm going to do as any good data scientist does and just combine the results of both sets of queries...

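# Queries
list_kwargs = [
    # the 'best' queries from version 2
    {"string": 'best data podcast'},
    {"string": 'best data podcast', "pg": 2},
    {"string": 'best data podcast', "pg": 3},
    {"string": 'best data podcast', "pg": 4},
    {"string": 'best data science podcast'},
    {"string": 'best data science podcast', "pg": 2},
    {"string": 'best data science podcast', "pg": 3},
    {"string": 'best artificial intelligence podcast'},
    {"string": 'best machine learning podcast'},
    {"string": 'best data engineering podcast'},
    {"string": 'best data visualization podcast'},
    # the original queries from version 1
    {"string": 'data podcast'},
    {"string": 'data podcast', "pg": 2},
    {"string": 'data podcast', "pg": 3},
    {"string": 'data science podcast'},
    {"string": 'data engineering podcast'},
    {"string": 'data visualization podcast'},
]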
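
Since the scraping loop appends everything to one `results` list, combining query sets mostly comes down to removing the overlap afterwards. The sketch below does that with plain dictionaries. The sample rows are invented, and keeping the first search term that found each link is just one possible choice.

```python
# invented sample rows in the same shape as the items appended to `results`
results = [
    {"title": "Podcast A", "link": "https://example.com/a", "search_term": "data podcast"},
    {"title": "Podcast B", "link": "https://example.com/b", "search_term": "data podcast"},
    {"title": "Podcast A", "link": "https://example.com/a", "search_term": "best data podcast"},
]

# keep the first result seen for each link (dicts preserve insertion order)
unique = {}
for item in results:
    unique.setdefault(item["link"], item)

combined = list(unique.values())
print([r["title"] for r in combined])  # ['Podcast A', 'Podcast B']
```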

closing note

This is a cross-post from my blog. My current readership is a solid 0 views per month, so I thought it might be worth actually sharing it here...
