Ian Kerins

Posted on Aug 6, 2020 • Edited on Feb 2, 2022

The Easy Way to Scrape Instagram Using Python Scrapy & GraphQL

#webscraping #scraping #scrapy #python

After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scraping.

However, anyone who has tried to build a spider to scrape Instagram, Facebook, Twitter or TikTok knows that it can be a bit tricky.

These sites use sophisticated anti-bot technologies to block your requests and regularly change their site schemas, which can break your spider's parsing logic.

So in this article, I’m going to show you the easiest way to build a Python Scrapy spider that scrapes all Instagram posts for every user account that you send to it, without the worry of getting blocked or having to design XPath selectors to scrape the data from the raw HTML.

The code for the project is available on GitHub here, and is set up to scrape:

Post URL
Image URL or Video URL
Post Captions
Date Posted
Number of Likes
Number of Comments

These fields are scraped for every post on that user's account. As you will see, there is more data we could easily extract; however, to keep this guide simple I have limited it to the most important data types.

This code can also be quickly modified to scrape all the posts related to a specific tag or geographical location with only minor changes, so it is a great base to build future spiders with.

This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Instagram at scale without getting blocked.

The full code is on GitHub here.

For this example, we're going to use:

Scraper API as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a free account here which will give you 5,000 free requests.
ScrapeOps to monitor our scrapers for free and alert us if they run into trouble. Live demo here: ScrapeOps Demo

Setting Up Our Scrapy Spider

Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:

pip install scrapy

Then navigate to the folder where you want the project to live and run the “startproject” command along with the project name (“instascraper” in this case). Scrapy will build a web scraping project folder for you, with everything already set up. After that, move into the new folder and create the spider with the “genspider” command:

scrapy startproject instascraper

cd instascraper

scrapy genspider instagram instagram.com

Here is what you should see:

├── scrapy.cfg                # deploy configuration file
└── instascraper              # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py              # project items definition file
    ├── middlewares.py        # project middlewares file
    ├── pipelines.py          # project pipeline file
    ├── settings.py           # project settings file
    └── spiders               # a directory where spiders are located
        ├── __init__.py
        └── instagram.py      # spider we just created

Okay, that’s the Scrapy spider templates set up. Now let’s start building our Instagram spiders.

From here we’re going to create five functions:

start_requests - will construct the Instagram URL for the user's account and send the request to Instagram.
parse - will extract all the post data from the user's feed.
parse_pages - if there is more than one page, this function will parse all the post data from those pages.
get_video - if the post includes a video, this function will be called to extract the video's URL.
get_url - will wrap the target URL in a Scraper API request URL, so that Scraper API can retrieve the HTML response.

Let’s get to work…

Requesting Instagram Accounts

To retrieve a user's data from Instagram we first need to create a list of the users we want to monitor, then incorporate their usernames into a URL. Luckily for us, Instagram uses a pretty straightforward URL structure.

Every user has a unique username that we can use to create the user URL:

https://www.instagram.com/<user_name>/

You can also retrieve the posts associated with a specific tag or from a specific location by using the following url format:

## Tags URL
https://www.instagram.com/explore/tags/<example_tag>/

## Location URL
https://www.instagram.com/explore/locations/<location_id>/

# Note: the location ID in the URL is a numeric value, so you need to look up the ID number for
# the locations you want to scrape.

So for this example spider, I’m going to use Nike and Adidas as the two Instagram accounts I want to scrape.

Using the above structure, the Nike URL is https://www.instagram.com/nike/. We also want the ability to specify the page language using the “hl” parameter. For example:

https://www.instagram.com/nike/?hl=en  #English
https://www.instagram.com/nike/?hl=de  #German
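
The URL patterns above can be wrapped in a few small helpers. This is only a minimal sketch: the function names (profile_url, tag_url, location_url) are hypothetical and are not part of the project code.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.instagram.com"


def profile_url(username, lang="en"):
    """Build a profile URL such as https://www.instagram.com/nike/?hl=en."""
    return f"{BASE_URL}/{quote(username)}/?" + urlencode({"hl": lang})


def tag_url(tag):
    """Build the explore URL for a hashtag (without the leading '#')."""
    return f"{BASE_URL}/explore/tags/{quote(tag)}/"


def location_url(location_id):
    """Build the explore URL for a numeric location ID."""
    return f"{BASE_URL}/explore/locations/{int(location_id)}/"


print(profile_url("nike"))
print(profile_url("adidas", lang="de"))
```

Calling int() on the location ID makes the helper fail early if a place name is passed by mistake instead of the numeric ID.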

Spider #1: Retrieving Instagram Accounts

Now that we have created a Scrapy project and are familiar with how Instagram displays its data, we can begin coding the spiders.

Our start_requests function is going to be pretty simple. We just need to send requests to Instagram with the username URL to get the user's account:

def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)

The start_requests function will iterate through a list of user_accounts and send each request to Instagram using yield scrapy.Request(get_url(url), callback=self.parse). The response is then passed to the parse function named in the callback.

Spider #2: Scraping Post Data

Okay, now that we are getting a response back from Instagram we can extract the data we want.

At first glance, the post data we want, like image URLs, likes and comments, doesn't seem to be in the HTML. However, on a closer look we will see that the data is stored as a JSON dictionary in the script tag that starts with “window._sharedData”.

This is because Instagram first loads the layout and all the data it needs from its internal GraphQL API and then puts the data in the correct layout.

We could scrape this data directly by querying Instagram's GraphQL endpoint, adding "/?__a=1" onto the end of the URL. For example:

https://www.instagram.com/nike/?__a=1

But we wouldn’t be able to iterate through all the pages, so instead we’re going to get the HTML response and then extract the data from the window._sharedData JSON dictionary.

Because the data is already formatted as JSON it will be very easy to extract the data we want. We can just use a simple XPath selector to extract the JSON string and then convert it into a JSON dictionary.

def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        # split only on the first '= ' and drop the trailing ';'
        json_string = x.strip().split('= ', 1)[1].rstrip(';')
        data = json.loads(json_string)
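
To try the string handling without sending a request, the same steps can be run against a hand-written sample. This is a sketch only: SAMPLE_SCRIPT is a made-up, heavily shortened stand-in for the real script content, and shared_data_to_dict is a hypothetical helper name.

```python
import json

# Hypothetical, heavily shortened example of the script tag's text content.
SAMPLE_SCRIPT = (
    'window._sharedData = {"entry_data": {"ProfilePage": '
    '[{"graphql": {"user": {"id": "1"}}}]}};'
)


def shared_data_to_dict(script_text):
    """Turn 'window._sharedData = {...};' into a Python dict."""
    # Split only on the first '= ' so a '= ' inside the JSON is left alone.
    _, json_string = script_text.strip().split("= ", 1)
    return json.loads(json_string.rstrip(";"))


data = shared_data_to_dict(SAMPLE_SCRIPT)
print(data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["id"])
```

Splitting only once matters because a caption can itself contain "= ", which would otherwise cut the JSON string short.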

From here we just need to extract the data we want from the JSON dictionary.

# needs at the top of the spider file: import json / from datetime import datetime
def parse(self, response):
        x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
        # split only on the first '= ' so the same characters inside a caption cannot truncate the JSON
        json_string = x.strip().split('= ', 1)[1].rstrip(';')
        data = json.loads(json_string)
        # all that we have to do here is to parse the JSON we have
        user = data['entry_data']['ProfilePage'][0]['graphql']['user']
        user_id = user['id']
        # edge_owner_to_timeline_media holds the account's posts; it is the same edge
        # that the pagination requests read
        timeline = user['edge_owner_to_timeline_media']
        next_page_bool = timeline['page_info']['has_next_page']
        edges = timeline['edges']
        for i in edges:
            url = 'https://www.instagram.com/p/' + i['node']['shortcode']
            video = i['node']['is_video']
            date_posted_timestamp = i['node']['taken_at_timestamp']
            # fromtimestamp() converts to the local timezone of the machine running the spider
            date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
            # counts fall back to '' when Instagram leaves the field out
            like_count = i['node']['edge_liked_by']['count'] if 'edge_liked_by' in i['node'] else ''
            comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'] else ''
            captions = ""
            if i['node']['edge_media_to_caption']:
                for i2 in i['node']['edge_media_to_caption']['edges']:
                    captions += i2['node']['text'] + "\n"

            if video:
                image_url = i['node']['display_url']
            else:
                image_url = i['node']['thumbnail_resources'][-1]['src']
            # captions[:-1] drops the trailing newline added in the loop above
            item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
                    'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
                    'captions': captions[:-1]}
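
The per-post extraction can also be tested on its own with a hand-made node. The sketch below is not the project's code: node_to_item and SAMPLE_NODE are hypothetical, only a subset of the fields is shown, and it formats the date in UTC rather than local time so the result does not depend on the machine it runs on.

```python
from datetime import datetime, timezone

# Hypothetical node, trimmed to the fields used below.
SAMPLE_NODE = {
    "shortcode": "ABC123",
    "is_video": False,
    "taken_at_timestamp": 0,
    "edge_liked_by": {"count": 10},
    "edge_media_to_caption": {"edges": [{"node": {"text": "Hello"}}]},
}


def node_to_item(node):
    """Flatten one post node; missing counts become '' as in the spider."""
    caption_edges = (node.get("edge_media_to_caption") or {}).get("edges", [])
    # Assumption: UTC is used here instead of the spider's local time.
    posted = datetime.fromtimestamp(node["taken_at_timestamp"], tz=timezone.utc)
    return {
        "postURL": "https://www.instagram.com/p/" + node["shortcode"],
        "isVideo": node["is_video"],
        "date_posted": posted.strftime("%d/%m/%Y %H:%M:%S"),
        "likeCount": node.get("edge_liked_by", {}).get("count", ""),
        "commentCount": node.get("edge_media_to_comment", {}).get("count", ""),
        "captions": "\n".join(edge["node"]["text"] for edge in caption_edges),
    }


item = node_to_item(SAMPLE_NODE)
print(item)
```

Using dict.get with a default keeps the function from raising a KeyError when a count field is absent from the response.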

Spider #3: Extracting Video URLs

To extract the video URL we need to make another request to that specific post, as that data isn’t included in the JSON response previously returned by Instagram.

If the post includes a video then the is_video flag will be set to true, which will trigger our scraper to request that post's page and send the response to the get_video function.

if video:
     yield scrapy.Request(get_url(url), callback=self.get_video, meta={'item': item})
else:
     item['videoURL'] = ''
     yield item

The get_video function will then extract the videoURL from the response.

def get_video(self, response):
        # only from the first page
        item = response.meta['item']
        video_url = response.xpath('//meta[@property="og:video"]/@content').extract_first()
        item['videoURL'] = video_url
        yield item

Spider #4: Iterating Through Available Pages

The last piece of extraction logic we need to implement is the ability for our crawler to iterate through all the available pages on that user account and scrape all the data.

As with the get_video function, we need to check whether there are any more pages available before calling the parse_pages function. We do that by checking whether the has_next_page field in the JSON dictionary is true or false.

next_page_bool = \
            data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                'has_next_page']

If it is true, then we will extract the end_cursor value from the JSON dictionary and create a new request for Instagram's GraphQL API endpoint, along with the user_id, query_hash, etc.

        if next_page_bool:
            cursor = \
                data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
                    'end_cursor']
            di = {'id': user_id, 'first': 12, 'after': cursor}
            print(di)
            params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
            url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
            yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
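
To check what the GraphQL request looks like before sending it, the URL construction can be pulled out into a standalone function. This is a sketch under assumptions: next_page_url is a hypothetical helper, and the user ID and cursor passed to it are placeholders rather than real values.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# query_hash value used in this article; Instagram may change it.
QUERY_HASH = "e769aa130647d2354c40ea6a439bfc08"


def next_page_url(user_id, end_cursor, page_size=12):
    """Build the GraphQL URL for the page that follows end_cursor."""
    variables = {"id": user_id, "first": page_size, "after": end_cursor}
    params = {"query_hash": QUERY_HASH, "variables": json.dumps(variables)}
    return "https://www.instagram.com/graphql/query/?" + urlencode(params)


url = next_page_url("123", "EXAMPLE_CURSOR")  # placeholder values
# Decode the query string again to check what will be sent.
query = parse_qs(urlsplit(url).query)
print(json.loads(query["variables"][0]))
```

The variables dictionary is serialized to JSON first and then URL-encoded as a single query parameter, which is why json.dumps and urlencode are both needed.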

This will then call the parse_pages function which will repeat the process of extracting all the post data and checking to see if there are any more pages.

The difference between this function and the original parse function is that it doesn't send a separate request to each video post's page. Instead, the code below reads the video_url field directly from the GraphQL response.

def parse_pages(self, response):
   di = response.meta['pages_di']
   data = json.loads(response.text)
   for i in data['data']['user']['edge_owner_to_timeline_media']['edges']:
       video = i['node']['is_video']
       url = 'https://www.instagram.com/p/' + i['node']['shortcode']
       if video:
           image_url = i['node']['display_url']
           video_url = i['node']['video_url']
       else:
           video_url = ''
           image_url = i['node']['thumbnail_resources'][-1]['src']
       date_posted_timestamp = i['node']['taken_at_timestamp']
       captions = ""
       if i['node']['edge_media_to_caption']:
           for i2 in i['node']['edge_media_to_caption']['edges']:
               captions += i2['node']['text'] + "\n"
       comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'].keys() else ''
       date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
       like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
       item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
               'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count, 'image_url': image_url,
               'videoURL': video_url,'captions': captions[:-1]
               }
       yield item
   next_page_bool = data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
   if next_page_bool:
       cursor = data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
       di['after'] = cursor
       params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
       url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
       yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
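
Stripped of Scrapy, the control flow of parse_pages is a cursor loop: read a page, emit its posts, and continue from end_cursor while has_next_page is true. The sketch below shows that loop synchronously with two fake pages. iter_posts, fake_fetch and FAKE_PAGES are hypothetical names, and in the spider the same steps happen through request callbacks rather than a while loop.

```python
def iter_posts(fetch_page, first_cursor=None):
    """Yield post nodes page by page until has_next_page is False.

    fetch_page is a hypothetical callable: cursor -> decoded JSON response.
    """
    cursor = first_cursor
    while True:
        media = fetch_page(cursor)["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in media["edges"]:
            yield edge["node"]
        if not media["page_info"]["has_next_page"]:
            break
        cursor = media["page_info"]["end_cursor"]


# Two fake pages standing in for real GraphQL responses.
FAKE_PAGES = {
    None: {"edges": [{"node": {"shortcode": "a"}}],
           "page_info": {"has_next_page": True, "end_cursor": "c1"}},
    "c1": {"edges": [{"node": {"shortcode": "b"}}],
           "page_info": {"has_next_page": False, "end_cursor": None}},
}


def fake_fetch(cursor):
    """Return the fake page stored under the given cursor."""
    return {"data": {"user": {"edge_owner_to_timeline_media": FAKE_PAGES[cursor]}}}


shortcodes = [node["shortcode"] for node in iter_posts(fake_fetch)]
print(shortcodes)
```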

Setting Up Proxies

Finally, we are pretty much ready to go live. The last thing we need to do is set our spiders up to use a proxy so that we can scrape at scale without getting blocked.

For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Instagram.

Scraper API is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.

To use Scraper API you need to sign up for a free account here and get an API key, which will allow you to make 1,000 free requests per month and use all the extra features like JavaScript rendering, geotargeting, residential proxies, etc.

Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.

For this project, I integrated the API by configuring my spiders to send all our requests to their API endpoint.

from urllib.parse import urlencode

API = '<YOUR_API_KEY>'

def get_url(url):
    # wrap the target URL in a Scraper API request URL
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url

And then modify our spider functions so as to use the Scraper API proxy by setting the url parameter in scrapy.Request to get_url(url). For example:

def start_requests(self):
        for username in user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)
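
It can help to look at what get_url actually returns. The sketch below repeats the function with a placeholder API key and decodes the result again to confirm that the Instagram URL, including its own query string, survives as a single parameter.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API = "YOUR_API_KEY"  # placeholder, not a real key


def get_url(url):
    """Wrap the target URL in a Scraper API request URL."""
    payload = {"api_key": API, "url": url}
    return "http://api.scraperapi.com/?" + urlencode(payload)


proxy_url = get_url("https://www.instagram.com/nike/?hl=en")
print(proxy_url)

# The original URL survives the round trip, including its own query string.
decoded = parse_qs(urlsplit(proxy_url).query)
print(decoded["url"][0])
```

Because urlencode percent-encodes the target URL, its "?hl=en" part is not mistaken for a parameter of the proxy request.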

We also have to change the spider's settings: set allowed_domains to api.scraperapi.com, and set the max concurrency per domain to the concurrency limit of our Scraper API plan, which in the case of the Scraper API Free plan is 5 concurrent threads:

class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 5}

Also, we should set RETRY_TIMES (to 5, for example) to tell Scrapy to retry any failed requests, and make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren’t enabled, as these will lower your concurrency and are not needed with Scraper API.
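
Put together, the spider-level settings described above might look like the dictionary below. The values are examples taken from this section, not recommendations, and the name CUSTOM_SETTINGS is only for illustration; in the spider the dictionary would be assigned to the custom_settings class attribute.

```python
# Spider-level settings; the values are examples from this article.
CUSTOM_SETTINGS = {
    # Match the concurrency limit of your proxy plan.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 5,
    # Retry a failed request up to 5 times before giving up.
    "RETRY_TIMES": 5,
    # 0 is Scrapy's default; with no delay set, RANDOMIZE_DOWNLOAD_DELAY
    # has nothing to randomize.
    "DOWNLOAD_DELAY": 0,
}

print(CUSTOM_SETTINGS)
```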

Setting Up Monitoring

To monitor our scraper we're going to use ScrapeOps, a free monitoring and alerting tool dedicated to web scraping.

With a simple 30-second install, ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.

Live demo here: ScrapeOps Demo

Getting set up with ScrapeOps is simple. Just install the Python package:

pip install scrapeops-scrapy

And add 3 lines to your settings.py file:

## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}

From there, our scraping stats will be automatically logged and shipped to our dashboard.

Going Live!

Now we are good to go. You can test the spider by running it with the crawl command:

scrapy crawl instagram -o test.csv

Once complete, the spider will store the account data in a CSV file.

If you would like to run the spider for yourself or modify it for your particular Instagram project then feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API key by signing up here.
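
A quick way to sanity-check the export is to read it back with the standard csv module. The sketch below runs on an in-memory sample instead of the real file. The summarize function and SAMPLE_CSV are hypothetical, only three of the item columns are shown, and it assumes that booleans are written to the CSV as the text True or False.

```python
import csv
import io

# Hypothetical two-row sample using three of the spider's column names.
SAMPLE_CSV = (
    "postURL,isVideo,likeCount\n"
    "https://www.instagram.com/p/a,True,10\n"
    "https://www.instagram.com/p/b,False,\n"
)


def summarize(csv_file):
    """Count posts and videos, and sum the like counts that are present."""
    posts = videos = likes = 0
    for row in csv.DictReader(csv_file):
        posts += 1
        # Assumption: booleans were exported as the text 'True' / 'False'.
        videos += row["isVideo"] == "True"
        likes += int(row["likeCount"] or 0)
    return {"posts": posts, "videos": videos, "likes": likes}


print(summarize(io.StringIO(SAMPLE_CSV)))
# For the real file: with open("test.csv", newline="") as f: summarize(f)
```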

Top comments (25)

Vlad • Jan 17 '21

Looks like Instagram doesn't work via Scraper API anymore. But it still works on webscraping.ai

djk50 • May 27 '21

instagram.com/explore/tags//

Do you know how to get the posts for a tagged username. Simply replacing the former URL with this new one doesn't seem to work.

Vlad • Jun 15 '21

instagram.com/explore/tags/sport/?... works for hashtags, and instagram.com/nike/tagged/?__a=1 works for username, but this one requires login

karisjochen • Feb 1 '21

Do you mind sharing how you adjusted the code to use webscraping.ai instead? Thanks!

Vlad • Feb 1 '21 • Edited

Sure, here it is gist.github.com/Drakula2k/035cc5bd...
I also fixed a couple of bugs there

karisjochen • Feb 1 '21

Thanks so much for sharing! After making the changes I am unfortunately still getting blocked by the robots.txt file. Is this code still working for you?

Vlad • Feb 2 '21

Yes, it's working. You can disable the robots.txt check by setting ROBOTSTXT_OBEY = False on your settings.py. It works via an API so there is no need for the robots.txt check.

karisjochen • Feb 4 '21

incredible, thank you! It worked! So is it always a good idea to set the ROBOTSTXT_OBEY = False considering we dont want to be stopped?

Vlad • Feb 4 '21

Yes, ROBOTSTXT_OBEY is good when you're building something like a search engine and it may request all sorts of random URLs posted on the Internet. In that case, using robots.txt is good to skip non-public pages.

But if you're requesting particularly defined URLs or using an API, robots.txt is not so useful and may block access to the API.

kaiwangyu • Dec 24 '21

thanks a lot, I learned a ton from your code... but I'm still confused by the query_hash. May I ask how you get this constant for this type of query, please?

Vlad • Dec 24 '21

Open Inspector in Chrome, visit Instagram and scroll through the posts, you'll see the same GraphQL queries with query_hash.
I'm not sure what query_hash value means exactly, but they're static for each type of query it seems.

kaiwangyu • Dec 25 '21 • Edited

Ohhh, I see, it's a constant number (every time I scroll down the profile), but for me it's a different number, not 'e769aa130647d2354c40ea6a439bfc08'. By the way, thank you so much. I am a beginner with Scrapy. Do you suggest any book or tutorial to learn advanced projects based on Scrapy? I already bought this book.

Kai
Merry Christmas
Regards

Vlad • Dec 25 '21

They may have changed something, but the old value still works too, it seems.
I'm not a specialist in Scrapy, but generally, I'd read official docs (docs.scrapy.org/en/latest/) and then start doing some projects using it and learn from them.

abbas53333 • Dec 30 '20 • Edited

It works like a charm. Thank you sooooooo much. But I have 2 questions:
1) How do we include the username, to identify which username each post belongs to?
2) How can we get the basic information such as Name, Bio, Handle, Number of followers, Number of following and Media Count?

If this works for all that information I might need to subscribe to Scraper API 1+ Million xD

Thank you in Advance

Comment deleted

abbas53333 • Jan 17 '21

Hey there, you have to download python and install something called Scrapy its an application for Python i would recommend to look some Videos on youtube to learn and i suggest to start by following this tutorial 25 episodes
youtube.com/watch?v=ve_0h4Y8nuI&li... this channel is very good follow it and you shall start!
Have a good day

Mayank Bali • Jun 9 '21

Hey this code is giving me Error

Ignoring response <403 https://api.webscraping.ai/html?api_key=45299f85b2302dd84a9f53e5a799114e&proxy=residential&timeout=20000&url=https%3A%2F%2Fwww.instagram.com%2Fnike%2F%3Fhl%3Den>: HTTP status code is not handled or not allowed

Can Anyone help me out here?

Ian Kerins • Jun 14 '21

The code in the article is designed to use scraperapi.com as the proxy, you are using webscraping.ai. You need to adapt the code to use this proxy as the error suggests that they use a different authentication method for their API.

GhostGardens • Sep 3 '20

Hi there! Great post...it answers a lot of questions.

Small thing, though: the "likes" count & comment count isn't working properly. I'm assuming it's due to the near-constant moving target of Instagram changing their page. Any hints on how to resolve this?

Thanks very much for your time!

jacksonbull87 • Nov 16 '20

the likes count isn't working for me either. its just giving me NaN values. Any idea on how to fix this?

vasana12 • Dec 17 '20

Hi. This is a very helpful article.
What does the variable "first" in the dictionary mean? I am making a hashtag-based crawler. There is a problem setting the value of the "first" variable. Can you answer the criteria for setting?

karisjochen • Feb 1 '21

It appears I am getting stopped by Instagram's robots.txt file. Any ideas on how to adjust the code to circumvent this?

thedukeofnada • May 5 '21

Saved my life with this script. Is there a way to extract the actual user comments and not just the count? >username /text/date/time
