
The Easy Way to Scrape Instagram Using Python Scrapy & GraphQL

Ian Kerins ・9 min read

After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scraping.

However, anyone who’s tried to build a web scraping spider for Instagram, Facebook, Twitter or TikTok knows that it can be a bit tricky.

These sites use sophisticated anti-bot technologies to block your requests and regularly change their site schemas, which can break your spider's parsing logic.

So in this article, I’m going to show you the easiest way to build a Python Scrapy spider that scrapes all Instagram posts for every user account that you send to it, without the worry of getting blocked or having to design XPath selectors to scrape the data from the raw HTML.

The code for the project is available on GitHub here, and is set up to scrape:

  • Post URL
  • Image URL or Video URL
  • Post Captions
  • Date Posted
  • Number of Likes
  • Number of Comments

The spider collects these fields for every post on the user's account. As you will see, there is more data we could easily extract; however, to keep this guide simple I have limited it to the most important data types.

This code can also be quickly modified to scrape all the posts related to a specific tag or geographical location with only minor changes, so it is a great base to build future spiders with.

This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Instagram at scale without getting blocked.

Setting Up Our Scrapy Spider

Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:

pip install scrapy

Then navigate to the folder where you want the project to live and run the “startproject” command along with the project name (“instascraper” in this case). Scrapy will build a web scraping project folder for you, with everything already set up. The “genspider” command then creates the spider template:

scrapy startproject instascraper

cd instascraper

scrapy genspider instagram instagram.com

Here is what you should see:

├── scrapy.cfg                # deploy configuration file
└── instascraper              # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── instagram.py      # spider we just created

Okay, that’s the Scrapy spider templates set up. Now let’s start building our Instagram spiders.

From here we’re going to create five functions:

  1. start_requests - will construct the Instagram URL for the user's account and send the request to Instagram.
  2. parse - will extract all the post data from the user's news feed.
  3. parse_pages - if there is more than one page, this function will parse all the post data from those pages.
  4. get_video - if the post includes a video, this function will be called to extract the video's URL.
  5. get_url - will build the Scraper API URL for a request, so that Scraper API can retrieve the HTML response.

Let’s get to work…

Requesting Instagram Accounts

To retrieve a user's data from Instagram we first need to create a list of the users we want to monitor, then incorporate their usernames into a URL. Luckily for us, Instagram uses a pretty straightforward URL structure.

Every user has a unique name and/or user id, that we can use to create the user URL:

https://www.instagram.com/<user_name>/

You can also retrieve the posts associated with a specific tag or from a specific location by using the following url format:

## Tags URL
https://www.instagram.com/explore/tags/<example_tag>/

## Location URL
https://www.instagram.com/explore/locations/<location_id>/

# Note: the location ID is a numeric value, so you need to identify the location ID number for
# the locations you want to scrape.

So for this example spider, I’m going to use Nike and Adidas as the two Instagram accounts I want to scrape.

Using the above framework the Nike url is https://www.instagram.com/nike/, and we also want to have the ability to specify the page language using the “hl” parameter. For example:

https://www.instagram.com/nike/?hl=en  #English
https://www.instagram.com/nike/?hl=de  #German
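
The URL patterns above can be wrapped in a small helper. This is a minimal sketch rather than part of the project code: the function name build_instagram_url and its kind argument are hypothetical, and it assumes the “hl” parameter is accepted on tag and location pages as well as on user pages.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.instagram.com"

# Hypothetical mapping from a target type to the URL patterns described above.
PATHS = {
    "user": "/{value}/",
    "tag": "/explore/tags/{value}/",
    "location": "/explore/locations/{value}/",
}


def build_instagram_url(value, kind="user", language="en"):
    """Return the page URL for a username, a tag, or a numeric location ID."""
    if kind not in PATHS:
        raise ValueError(f"unknown kind: {kind!r}")
    path = PATHS[kind].format(value=value)
    # The page language is passed through the "hl" query parameter.
    return f"{BASE_URL}{path}?{urlencode({'hl': language})}"


print(build_instagram_url("nike"))
print(build_instagram_url("nike", language="de"))
```

Keeping the patterns in one dictionary means a tag or location spider only has to pass a different kind value.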

Spider #1: Retrieving Instagram Accounts

Now that we have created a Scrapy project and are familiar with how Instagram displays its data, we can begin coding the spiders.

Our start_requests function is going to be pretty simple: we just need to send requests to Instagram with the username URL to get the user's account page:

def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)

The start_requests function iterates through a list of user_accounts and sends each request to Instagram using yield scrapy.Request(get_url(url), callback=self.parse). The response is then passed to the parse function named in the callback.

Spider #2: Scraping Post Data

Okay, now that we are getting a response back from Instagram we can extract the data we want.

At first glance, the post data we want (image URLs, likes, comments, etc.) doesn’t seem to be in the HTML. However, on a closer look we will see that the data is in the form of a JSON dictionary in the script tag that starts with “window._sharedData”.

This is because Instagram first loads the layout and all the data it needs from its internal GraphQL API and then puts the data in the correct layout.

Image of window._sharedData

We could get this data directly by querying Instagram's GraphQL endpoint, adding "/?__a=1" onto the end of the URL. For example:

https://www.instagram.com/nike/?__a=1

But we wouldn’t be able to iterate through all the pages, so instead we’re going to get the HTML response and then extract the data from the window._sharedData JSON dictionary.

Because the data is already formatted as JSON it will be very easy to extract the data we want. We can just use a simple XPath selector to extract the JSON string and then convert it into a JSON dictionary.

def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        # split only on the first '= ' so that '= ' inside a caption cannot truncate the JSON
        json_string = x.strip().split('= ', 1)[1][:-1]
        data = json.loads(json_string)
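
The string handling in the snippet above can be tried in isolation, without Scrapy. The sketch below runs it against a made-up script body (a real page carries a much larger payload); the function name extract_shared_data is hypothetical.

```python
import json

# Made-up sample of the script text; the "note" value deliberately contains '= '.
SAMPLE_SCRIPT = 'window._sharedData = {"entry_data": {"ProfilePage": [{"note": "a = b"}]}};'


def extract_shared_data(script_text):
    """Parse the JSON object assigned to window._sharedData."""
    # Keep everything after the first '= ' so later '= ' sequences are untouched.
    _, _, json_string = script_text.strip().partition("= ")
    # Drop the trailing semicolon that ends the JavaScript statement.
    return json.loads(json_string.rstrip(";"))


data = extract_shared_data(SAMPLE_SCRIPT)
print(data["entry_data"]["ProfilePage"][0]["note"])
```

Cutting at the first '= ' only matters when the payload itself contains that sequence, which captions easily can.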

From here we just need to extract the data we want from the JSON dictionary.

def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        # split only on the first '= ' so that '= ' inside a caption cannot truncate the JSON
        json_string = x.strip().split('= ', 1)[1][:-1]
        data = json.loads(json_string)
        # all that we have to do here is to parse the JSON we have
        user_id = data['entry_data']['ProfilePage'][0]['graphql']['user']['id']
        next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']
        # read posts from the same timeline edge that the page_info above belongs to
        edges = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']
        for i in edges:
            url = 'https://www.instagram.com/p/' + i['node']['shortcode']
            video = i['node']['is_video']
            date_posted_timestamp = i['node']['taken_at_timestamp']
            date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
            like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
            comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i[
                'node'].keys() else ''
            captions = ""
            if i['node']['edge_media_to_caption']:
                for i2 in i['node']['edge_media_to_caption']['edges']:
                    captions += i2['node']['text'] + "\n"

            if video:
                image_url = i['node']['display_url']
            else:
                image_url = i['node']['thumbnail_resources'][-1]['src']
            item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
                    'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
                    'captions': captions[:-1]}
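
The per-post lookups in the parse function can be exercised on their own. The sketch below uses a made-up node that only mimics the field names read above, and the helper names are hypothetical. It also formats the timestamp in UTC, which is an assumption; the spider's datetime.fromtimestamp call uses the machine's local time.

```python
from datetime import datetime, timezone

# Made-up node with the same field names the parse function reads.
SAMPLE_NODE = {
    "shortcode": "ABC123",
    "is_video": False,
    "taken_at_timestamp": 0,
    "edge_liked_by": {"count": 10},
    "edge_media_to_caption": {
        "edges": [{"node": {"text": "first"}}, {"node": {"text": "second"}}]
    },
}


def count_or_blank(node, key):
    """Return node[key]['count'], or '' when the key is missing."""
    return node.get(key, {}).get("count", "")


def join_captions(node):
    """Join all caption texts with newlines, without a trailing newline."""
    edges = (node.get("edge_media_to_caption") or {}).get("edges", [])
    return "\n".join(edge["node"]["text"] for edge in edges)


def format_timestamp(timestamp):
    """Format a Unix timestamp as day/month/year in UTC (an assumption)."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%d/%m/%Y %H:%M:%S")


print(count_or_blank(SAMPLE_NODE, "edge_liked_by"))
print(repr(count_or_blank(SAMPLE_NODE, "edge_media_to_comment")))
print(format_timestamp(SAMPLE_NODE["taken_at_timestamp"]))
```

Small helpers like these keep the item dictionary readable and make a missing field easy to spot.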

Spider #3: Extracting Video URLs

To extract the video URL we need to make another request to that specific post, as that data isn’t included in the JSON response previously returned by Instagram.

If the post includes a video then the is_video flag will be set to true, which will trigger our scraper to request that post's page and send the response to the get_video function.

if video:
     yield scrapy.Request(get_url(url), callback=self.get_video, meta={'item': item})
else:
     item['videoURL'] = ''
     yield item

The get_video function will then extract the videoURL from the response.

def get_video(self, response):
        # only from the first page
        item = response.meta['item']
        video_url = response.xpath('//meta[@property="og:video"]/@content').extract_first()
        item['videoURL'] = video_url
        yield item
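
To see what that selector looks for, here is a rough stand-in that uses Python's built-in html.parser instead of Scrapy's XPath support. The sample HTML and the names OgVideoParser and extract_og_video are made up for illustration.

```python
from html.parser import HTMLParser

# Made-up page fragment carrying the Open Graph video tag.
SAMPLE_HTML = (
    '<html><head>'
    '<meta property="og:video" content="https://example.com/video.mp4">'
    '</head></html>'
)


class OgVideoParser(HTMLParser):
    """Records the content of the first <meta property="og:video"> tag."""

    def __init__(self):
        super().__init__()
        self.video_url = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if (tag == "meta" and attributes.get("property") == "og:video"
                and self.video_url is None):
            self.video_url = attributes.get("content")


def extract_og_video(html):
    """Return the og:video URL, or None when the tag is absent."""
    parser = OgVideoParser()
    parser.feed(html)
    return parser.video_url


print(extract_og_video(SAMPLE_HTML))
```

As with extract_first() in the spider, the result is None when the page has no og:video tag, so downstream code should expect an empty videoURL.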

Spider #4: Iterating Through Available Pages

The last piece of extraction logic we need to implement is the ability for our crawler to iterate through all the available pages on that user account and scrape all the data.

As with the get_video function, there is a condition to check first: we only call the parse_pages function if there are more pages available. We do that by checking whether the has_next_page field in the JSON dictionary is true or false.

next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']

If it is true, then we extract the end_cursor value from the JSON dictionary and create a new request to Instagram's GraphQL API endpoint, along with the user_id, query_hash, etc.

        if next_page_bool:
            cursor = \
                data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                    'end_cursor']
            di = {'id': user_id, 'first': 12, 'after': cursor}
            print(di)
            params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
            url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
            yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
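
The URL construction in this step can be checked without sending a request. The sketch below builds the same query string and decodes it again; the function name build_next_page_url and the sample ID and cursor values are made up, and the query_hash is simply the one used above, which Instagram may change.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Value taken from the spider code above; it is not guaranteed to stay valid.
QUERY_HASH = "e769aa130647d2354c40ea6a439bfc08"


def build_next_page_url(user_id, cursor, page_size=12):
    """Build the GraphQL URL that requests the page after the given cursor."""
    variables = {"id": user_id, "first": page_size, "after": cursor}
    params = {"query_hash": QUERY_HASH, "variables": json.dumps(variables)}
    return "https://www.instagram.com/graphql/query/?" + urlencode(params)


url = build_next_page_url("123", "CURSOR==")
decoded = parse_qs(urlsplit(url).query)
print(url)
print(json.loads(decoded["variables"][0]))
```

The variables dictionary travels as a JSON string inside a single query parameter, which is why it is passed through json.dumps before urlencode.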

This will then call the parse_pages function which will repeat the process of extracting all the post data and checking to see if there are any more pages.

The difference between this function and the original parse function is that it doesn’t make a separate request for each video post. The GraphQL response includes a video_url field on video posts, so the function reads the video URL straight from the JSON.

def parse_pages(self, response):
   di = response.meta['pages_di']
   data = json.loads(response.text)
   for i in data['data']['user']['edge_owner_to_timeline_media']['edges']:
       video = i['node']['is_video']
       url = 'https://www.instagram.com/p/' + i['node']['shortcode']
       if video:
           image_url = i['node']['display_url']
           video_url = i['node']['video_url']
       else:
           video_url = ''
           image_url = i['node']['thumbnail_resources'][-1]['src']
       date_posted_timestamp = i['node']['taken_at_timestamp']
       captions = ""
       if i['node']['edge_media_to_caption']:
           for i2 in i['node']['edge_media_to_caption']['edges']:
               captions += i2['node']['text'] + "\n"
       comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'].keys() else ''
       date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
       like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
       item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
               'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
               'videoURL': video_url,'captions': captions[:-1]
               }
       yield item
   next_page_bool = data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
   if next_page_bool:
       cursor = data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
       di['after'] = cursor
       params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
       url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
       yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})

Going Live!

Finally, we are pretty much ready to go live. Last thing we need to do is to set our spiders up to use a proxy to enable us to scrape at scale without getting blocked.

For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Instagram.

Scraper API is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.

To use Scraper API you need to sign up for a free account here and get an API key, which will allow you to make 1,000 free requests per month and use all the extra features like JavaScript rendering, geotargeting, residential proxies, etc.

Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.

For this project, I integrated the API by configuring my spiders to send all our requests to their API endpoint.

API = '<YOUR_API_KEY>'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
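
It is worth seeing why get_url passes the target through urlencode. The Instagram URL carries its own query string, and it has to be escaped so that it arrives as a single url parameter. The sketch below repeats the function with its import and checks the round trip; the API key stays a placeholder.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API = "<YOUR_API_KEY>"  # placeholder, replace with your own key


def get_url(url):
    """Wrap a target URL in a request to the Scraper API endpoint."""
    payload = {"api_key": API, "url": url}
    return "http://api.scraperapi.com/?" + urlencode(payload)


target = "https://www.instagram.com/nike/?hl=en"
proxy_url = get_url(target)
query = parse_qs(urlsplit(proxy_url).query)

print(proxy_url)
# The target URL, including its own "?hl=en", comes back intact.
print(query["url"][0] == target)
```

Without the escaping, the target's "hl" parameter would be read as a parameter of the proxy request rather than as part of the Instagram URL.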

We then modify our spider functions to use the Scraper API proxy by setting the url parameter in scrapy.Request to get_url(url). For example:

def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)

We also have to change the spider's settings: set allowed_domains to api.scraperapi.com, and set the max concurrency per domain to the concurrency limit of our Scraper API plan, which in the case of the Scraper API Free plan is 5 concurrent threads:

class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 5}

Also, we should set RETRY_TIMES to tell Scrapy to retry any failed requests (to 5 for example) and make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled as these will lower your concurrency and are not needed with Scraper API.
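
Put together, those settings might look like the dictionary below, which could be assigned to the spider's custom_settings. The setting names are standard Scrapy settings; the values are the ones suggested in this article, and spelling out the two delay settings explicitly is an assumption about how you want to guarantee they stay off.

```python
# A sketch of the settings discussed above, suitable for custom_settings.
CUSTOM_SETTINGS = {
    # Match the concurrency limit of the Scraper API plan (5 on the Free plan).
    "CONCURRENT_REQUESTS_PER_DOMAIN": 5,
    # Retry failed requests up to 5 times.
    "RETRY_TIMES": 5,
    # Leave download delays off so they do not lower the concurrency.
    "DOWNLOAD_DELAY": 0,
    "RANDOMIZE_DOWNLOAD_DELAY": False,
}

for name, value in CUSTOM_SETTINGS.items():
    print(f"{name} = {value}")
```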

One bonus of extracting the data directly from the JSON response from the GraphQL API is that we don’t need to write any pipelines to clean the data as it is already good to consume.

Now we are good to go. You can test the spider again by running the spider with the crawl command.

scrapy crawl instagram -o test.csv

Once complete, the spider will store the account's data in a CSV file.

If you would like to run the spider for yourself or modify it for your particular Instagram project then feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API key by signing up here.


Ian Kerins

@iankerins

Hi I'm Ian! I'm currently working on web development & growth marketing projects. Using NodeJS, React, Python.

Discussion


Hi there! Great post...it answers a lot of questions.

Small thing, though: the "likes" count & comment count isn't working properly. I'm assuming it's due to the near-constant moving target of Instagram changing their page. Any hints on how to resolve this?

Thanks very much for your time!