DishyDev

Posted on Feb 23, 2020 • Originally published at dishy.dev

Scraping Images from Reddit Threads in Python

#python #azure #webdev #serverless

Introduction

This is a little side project I did to try to scrape images out of Reddit threads. A few subreddits discuss shows, /r/anime in particular, and their users post a lot of screenshots of the episodes. I thought it'd be cool to see how much effort it would take to automatically collate those screenshots from a thread and display them in a simple gallery. The result looked like this:

PRAW

PRAW is the Python Reddit API Wrapper, which provides a nice set of bindings for talking to Reddit.

To scrape Reddit you need credentials. The way to generate them is tucked away at https://www.reddit.com/prefs/apps, where you have to register a new "app" with Reddit. Connecting is as simple as:

import praw

# Line continuations are implicit inside parentheses, so no backslashes are needed
reddit = praw.Reddit(client_id='id',
                     client_secret='secret',
                     user_agent='useragent',
                     username='username',
                     password='DevToIsCool')

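Traversing Reddit is made simple by the API. For example, this prints all of the comments in a thread:

submission = reddit.submission(url="https://reddit.com/r/abcde")
# Drop the "load more comments" placeholders so only real comments remain
submission.comments.replace_more(limit=0)
for comment in submission.comments.list():
    # comment.body holds the text of the comment
    print(comment.body)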

Finding links

Nearly all of the images I was looking for were posted to Imgur, so I only matched on those links. I used a regular expression to extract them. I always recommend using a tool like RegEx101, which makes it really easy to debug your regular expressions, as they can be pretty brain bending.

import re

# The dots are escaped so that "." only matches a literal dot
REGEX_TEST = r"((http|https)://i\.imgur\.com/.+?\.(jpg|png))"
p = re.compile(REGEX_TEST, re.IGNORECASE)
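To see what a pattern like this picks up, here is a small sketch that runs it over a made-up comment. The sample text and image IDs are invented for illustration, and this version of the pattern escapes the dots so they only match a literal dot.

```python
import re

# Same idea as the pattern above, with the dots escaped
IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/.+?\.(jpg|png))", re.IGNORECASE
)

# Hypothetical comment text; the image IDs are invented
sample_comment = (
    "That shot was great: https://i.imgur.com/abc123.jpg "
    "and this one too http://i.imgur.com/XyZ789.PNG "
    "(not a direct image: https://imgur.com/gallery/foo)"
)

# findall returns one tuple per match because the pattern has three groups;
# the first element of each tuple is the full URL
links = [match[0] for match in IMGUR_PATTERN.findall(sample_comment)]
print(links)
```

Only the two direct i.imgur.com links are returned; the gallery link is ignored because it does not point at an image file.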

Check if an image still exists

One of the problems I found was dead image links, so I created a simple helper that checks the status_code for that link.

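import requests

# Check whether a link still exists
def checkLinkActive(url):
    try:
        # HEAD fetches only the headers, so the image itself is not downloaded
        response = requests.head(url, timeout=10)
    except requests.RequestException:
        # Treat network errors and timeouts as a dead link
        return False
    return response.status_code == 200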

Getting Thumbnails

To save bandwidth and your mobile data, I wanted to return a smaller version of the image. On Imgur you can append a size letter to the image ID in the URL, just before the file extension, to get the image at a different size, for example 'l' for large and 's' for small.

# Add a letter to an imgur url to make a small thumbnail
def getImgurThumbnail(url, size):
    # The last four characters are the extension (".jpg" or ".png"),
    # so the size letter goes just before them
    return url[:-4] + size + url[-4:]
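As a quick check of the slicing approach, the sketch below repeats the helper and applies it to a made-up URL. It assumes every URL ends in a four-character extension such as .jpg or .png; a longer extension like .jpeg would put the size letter in the wrong place.

```python
def getImgurThumbnail(url, size):
    # Split off the last four characters (".jpg" or ".png") and
    # insert the size letter just before them
    return url[:-4] + size + url[-4:]

# Hypothetical image URL, used only for illustration
original = "https://i.imgur.com/abc123.jpg"
small = getImgurThumbnail(original, "s")
large = getImgurThumbnail(original, "l")
print(small)  # https://i.imgur.com/abc123s.jpg
print(large)  # https://i.imgur.com/abc123l.jpg
```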

Putting it all together

Putting all of these bits together, you get:

def getImages(url):
    submission = reddit.submission(url=url)
    # Tell the API to return every comment in the thread; by default
    # the results are paginated
    submission.comments.replace_more(limit=None)

    # Create RegEx object for matching images (dots escaped so they
    # only match a literal ".")
    REGEX_TEST = r"((http|https)://i\.imgur\.com/.+?\.(jpg|png))"
    p = re.compile(REGEX_TEST, re.IGNORECASE)

    imageMatches = []
    for comment in submission.comments.list():
        matches = p.findall(comment.body)
        for match in matches:
            if checkLinkActive(match[0]):
                imageMatches.append(
                    {"image": match[0], "thumbnail": getImgurThumbnail(match[0], "m")}
                )

    return imageMatches
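The matching and filtering logic can also be tried without Reddit credentials or network access. The sketch below is a variant of getImages built on a few assumptions: it takes a list of plain comment strings instead of a submission, and the link checker is passed in as a parameter so a stand-in can be used. The names collectImages and fakeCheck, along with the sample comments, are hypothetical and not part of the original project.

```python
import re

IMGUR_PATTERN = re.compile(
    r"((http|https)://i\.imgur\.com/.+?\.(jpg|png))", re.IGNORECASE
)

def getImgurThumbnail(url, size):
    # Insert the size letter before the four-character extension
    return url[:-4] + size + url[-4:]

def collectImages(commentBodies, isActive):
    """Return image/thumbnail pairs for every live imgur link in the comments."""
    imageMatches = []
    for body in commentBodies:
        for match in IMGUR_PATTERN.findall(body):
            link = match[0]  # first group is the full URL
            if isActive(link):
                imageMatches.append(
                    {"image": link, "thumbnail": getImgurThumbnail(link, "m")}
                )
    return imageMatches

# Stand-in for checkLinkActive: pretends any URL containing "dead" is gone
def fakeCheck(url):
    return "dead" not in url

# Invented comment bodies for illustration
comments = [
    "Best scene: https://i.imgur.com/aaa111.png",
    "No links here.",
    "Two more https://i.imgur.com/dead222.jpg https://i.imgur.com/ccc333.jpg",
]

images = collectImages(comments, fakeCheck)
for image in images:
    print(image)
```

Passing the checker in as an argument keeps the example offline; in the real function, checkLinkActive would take the place of fakeCheck.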

Trying it out

I decided to stand up a quick demo of this, using an Azure Function to host my new function and a simple web form to allow people to try it out. Just copy and paste a Reddit URL and the function will return any images.

The demo app uses Bulma for the look and feel, and a little bit of jQuery to load the results into the page.

If you want to give it a go, you can have a play on my site here.

In a future article I'll look at providing a search by show name, instead of having to paste individual episode URLs. Happy Reddit scraping!
