
The Most Popular Data Science Newsletters

Or at least, the most often cited when looking through articles about data science newsletters.

** = paid newsletter

Note: This list was collected by hand because my web scraping game plan didn't account for newsletters being linked in multiple different ways... Could I have built some super cool algorithm to handle that? ...Maybe? But I didn't. Instead, I just collected links by hand for an hour or two and here we are. Although the process was rather tedious, it probably provides a lot of value to you, so I'd say it was worth it.

honorable mention

These newsletters were mentioned once each in the lists I was referencing. FYI to the two people combing through the source links and wondering why newsletter X or newsletter Y wasn't included - I took out newsletters whose links were no longer functional.

AI Times, AI Trends, Big Data News Weekly, Bootstrap Labs, CB Insights, ChinAI, Creative AI, dair.ai NLP Newsletter, Data Coalition, Data Community DC, Data Eng Weekly, DataQuest Newsletter, Exponential View, Eye on AI, Flowing Data, Gary’s Guide, Hacker Noon, Hilary Mason, Humane AI, Lionbridge AI, Machine Learning Blueprint, ML in Production, NLP News, Open AI, R-Bloggers, Skynet Today, Stratechery, Talking Machines, The Art of Data Science, The Batch, The Belamy, The Gradient, The Pudding, TLDR, TWIML

sources for the list

newsletter time!

You've made it to the body of the article! Congratulations!

In today's post, I'm continuing my completely absurd crusade to find publications to subscribe to - next up on the list is newsletters. My previous articles were about Twitter accounts and podcasts, if you're inclined to check them out.

Much like podcasts, I love newsletters. I find them most useful for keeping up with all the cool new developments in the fields I'm interested in. Unlike other platforms - cough Twitter cough - where you get bogged down in a deluge of information, newsletters typically boil everything down to 5-10 bullet points per week (or month). And brevity is nice.

And I recently found a cool app* that lets me subscribe to them outside of my normal email, so I can have a separate 'Data Science/Tech Newsletter' app, effectively. Which is fantastic. Like truly, honestly, fantastic. Highly recommend.

*For reference, the newsletter app is called slick inbox and it's still in beta. There's also apparently one called stoop inbox, which seems pretty similar. I don't endorse either one of these, nor do I have any affiliation - I just like having newsletters out of my inbox so that I can actually read them when I have free time.

Anyways, you probably don't care about that - let's get on to the process.

gameplanning the process

So, off the back of some silly articles where I found popular twitter accounts and podcasts, I thought... what next?

The answer? Sleep. Also, work. But then, newsletters. Newsletters are definitely next.

But how to go about doing this?

My first thought was to find a site that aggregates newsletter listings and estimates subscriber counts, but I struck out hard on that front. So I moved to plan B - googling what the best data science newsletters were. That search brought me to a lot of blog posts in list format, with links to - you guessed it - newsletters. My bright data-science-y mind thought, "Hey, you can scrape these links and count how often each one shows up across the articles - that'll be an easy way to get your answer." So that's exactly what I did.
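
(The core idea, in miniature: grab every outbound link from those list articles and tally how often each one appears. Here's a bare-bones sketch of just the counting step, with placeholder links standing in for the scraped ones - the full, duct-taped version is further down.)

# bare-bones sketch of the counting step, using placeholder links
from collections import Counter

scraped_links = [
    "https://example.com/newsletter-a",
    "https://example.com/newsletter-b",
    "https://example.com/newsletter-a",
]

# whichever links show up most often across the articles should be the most popular newsletters
print(Counter(scraped_links).most_common())
# [('https://example.com/newsletter-a', 2), ('https://example.com/newsletter-b', 1)]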

Well, that's what I scoped out and built... unfortunately, I had to scrap the code because links to newsletters are inconsistent - some people link to the main site, some to a subscribe-specific page, some to a newsletter-hosting service, and so on. But since I already built the code, I think it's worth sharing... even if it looks like it's held together with duct tape.
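
(If I ever revisit it, one way to paper over some of that inconsistency might be to collapse every link down to a bare domain before counting. Here's a minimal sketch of that idea with made-up example links - it's not part of the script below, and it still wouldn't catch newsletters hosted on a third-party service under a completely different domain.)

# rough sketch: collapse different link styles down to a bare domain before counting
# (made-up example links, not from the real scrape)
from urllib.parse import urlparse

raw_links = [
    "https://example-newsletter.com/",
    "https://example-newsletter.com/subscribe",
    "http://www.example-newsletter.com/latest-issue",
]

def to_domain(link):
    # drop the scheme, path, and a leading 'www.' so the variants match up
    netloc = urlparse(link).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

print({to_domain(l) for l in raw_links})
# {'example-newsletter.com'} - all three variants now count as the same newsletter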

code to find top data science newsletters

I'm going to throw the code below with a bit less than the standard amount of commentary. Since it doesn't actually work, no one's really going to end up using it... probably?

# import default python packages
import urllib
import requests
import time

# import non-standard python packages
# if you don't have these installed: pip install beautifulsoup4 readability-lxml pandas (or use conda)
from bs4 import BeautifulSoup
from readability import Document
import pandas as pd

# define the desktop user-agent
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1"

# function to create google query string based on a query text
def string_to_google_query(string, pg=None):
    query = urllib.parse.quote_plus(f'{string}')
    if pg is not None:
        # add a start parameter to pull results beyond the first page
        query = query + "&start=" + str(10*(pg-1))
    return f"https://google.com/search?hl=en&lr=en&q={query}", string

# Queries to feed scraper
list_kwargs = [
    {"string": 'best data science newsletters'},
    {"string": 'best data science newsletters', "pg": 2},
    {"string": 'best data science newsletters', "pg": 3},
    {"string": 'best data engineering newsletters'},
    {"string": 'best data visualization newsletters'},
    {"string": 'best artificial intelligence newsletters'},
    {"string": 'best machine learning newsletters'},

    {"string": 'data science newsletters'},
    {"string": 'data science newsletters', "pg": 2},
    {"string": 'data science newsletters', "pg": 3},
    {"string": 'data engineering newsletters'},
    {"string": 'data visualization newsletters'},
    {"string": 'artificial intelligence newsletters'},
    {"string": 'machine learning newsletters'},
]

# scrape the organic results off a mobile google results page
def google_scraper_mobile(url, search_term):
    results = []
    headers = {"user-agent": MOBILE_USER_AGENT}
    resp = requests.get(url, headers=headers)
    if resp.status_code != 200:
        # no results page, no results
        return results
    soup = BeautifulSoup(resp.content, "html.parser")

    for g in soup.find_all('div', class_='mnr-c'):
        anchors = g.find_all('a')
        if len(anchors) > 0:
            try:
                # this will fail on featured snippets
                link = anchors[0]['href']
            except KeyError:
                continue

            try:
                title = anchors[0].find_all('div')[1].get_text().strip()
            except IndexError:
                title = anchors[0].get_text().strip()

            item = {
                "title": title,
                "link": link,
                "search_term": search_term
            }
            results.append(item)
    return results


# Crawling Google as a mobile user
headers = {"user-agent" : MOBILE_USER_AGENT}

results = []
for x in list_kwargs:
    url, search_term = string_to_google_query(**x)
    scrape_res = google_scraper_mobile(url, search_term)
    results = results + scrape_res

    time.sleep(2.5)


# put results into a dataframe
newsletter_df = pd.DataFrame(results)

# Check there is a number in the title (such as 'top 10 newsletters')
# or that the title contains newsletter
newsletter_df = newsletter_df.loc[(newsletter_df['title'].str.contains('[0-9]') | newsletter_df['title'].str.lower().str.contains('newsletter'))]
newsletter_df.drop_duplicates(subset='link',inplace=True)

# switch the user agent to desktop - articles shouldn't differ on desktop vs mobile and will likely have fewer issues on desktop
headers = {"user-agent" : USER_AGENT}

# define the crawler for each article
def article_link_crawl(link):
    """
    Returns links and either a 1 or 0 for success / failure.
    Only crawls articles, so there should be a few failures.
    """
    try:
        domain = link.split('://')[1].split('/')[0] # the article's domain, used to filter out internal links
        article_links = []
        resp = requests.get(link, headers=headers) # get request for the article (desktop headers)
        if resp.status_code != 200:
            return None, 0

        # pass the page through readability to get the article content rather than the full webpage
        rd_doc = Document(resp.text)
        soup = BeautifulSoup(rd_doc.content(), "html.parser")
        for anchor in soup.find_all('a'): # loop over every link in the article body
            # if the link has an href, create an item to add to the aggregate article_links list
            if anchor.has_attr('href'):
                item = {
                    "text": anchor.get_text().strip(),
                    "link": anchor['href']
                    }
                # skip blank text, blank links, and internal links (same domain, or starting with '/' or '#')
                if item['text'] != '' and item['link'] != '' and item['link'].find(domain) == -1 and item['link'][0] != '/' and item['link'][0] != '#':
                    article_links.append(item)
        return article_links, 1
    except Exception:
        return None, 0


# loop through results
agg_links = []
total_success = 0
total_fail = 0
fail_urls = []
for link in newsletter_df['link']:
    res_links, is_success = article_link_crawl(link)
    if is_success == 1:
        total_success = total_success+1
    else:
        total_fail = total_fail+1
        fail_urls.append(link)

    if res_links is not None:
        for lnk in res_links:
            agg_links.append(lnk)
    time.sleep(2.5)

# function to get frequency of a link
def list_freq(tgt_list): 
    freq = {} 
    for item in tgt_list: 
        if (item in freq): 
            freq[item] += 1
        else: 
            freq[item] = 1

    result = []
    for key, value in freq.items(): 
        result.append({
            "link": key,
            "count": value
        })
    return result

# count occurrences of each link
clean_link_list = [x['link'].replace('http://','https://') for x in agg_links]
link_freq = list_freq(clean_link_list)

# count occurrences of anchor text
clean_text_list = [x['text'].replace('http://','https://') for x in agg_links]
text_freq = list_freq(clean_text_list)

# move variables to data frames
link_freq_df = pd.DataFrame(link_freq)
text_freq_df = pd.DataFrame(text_freq)

link_freq_df = link_freq_df.sort_values('count',ascending=False)
link_freq_df = link_freq_df.loc[link_freq_df['count'] >= 3]

This was the result of the code - not great, but not horrible. In the end, I just decided the 'old-fashioned way' was the proper way to find newsletters...
