After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scraping.
However, if you've ever tried to build a web scraping spider for Instagram, Facebook, Twitter or TikTok, you know it can be a bit tricky.
These sites use sophisticated anti-bot technologies to block your requests, and they regularly change their site schemas, which can break your spider's parsing logic.
So in this article, I'm going to show you the easiest way to build a Python Scrapy spider that scrapes all Instagram posts for every user account you send to it, without the worry of getting blocked or having to design XPath selectors to scrape the data from raw HTML.
The code for the project is available on GitHub here, and is set up to scrape:
- Post URL
- Image URL or Video URL
- Post Captions
- Date Posted
- Number of Likes
- Number of Comments
for every post on that user's account. As you will see, there is more data we could easily extract; however, to keep this guide simple I've limited it to the most important data types.
This code can also be quickly modified to scrape all the posts related to a specific tag or geographical location with only minor changes, so it is a great base to build future spiders with.
This article assumes you know the basics of Scrapy, so we’re going to focus on how to scrape Instagram at scale without getting blocked.
The full code is on GitHub here.
For this example, we're going to use:
- Scraper API as our proxy solution, as Instagram has pretty aggressive anti-scraping in place. You can sign up to a free account here which will give you 5,000 free requests.
- ScrapeOps to monitor our scrapers for free and alert us if they run into trouble. Live demo here: ScrapeOps Demo
Setting Up Our Scrapy Spider
Getting up and running with Scrapy is very easy. To install Scrapy simply enter this command in the command line:
pip install scrapy
Then run the "startproject" command along with the project name ("instascraper" in this case) and Scrapy will build a web scraping project folder for you, with everything already set up. Navigate into it and generate the spider:
scrapy startproject instascraper
cd instascraper
scrapy genspider instagram instagram.com
Here is what you should see:
├── scrapy.cfg            # deploy configuration file
└── instascraper          # project's Python module, you'll import your code from here
    ├── __init__.py
    ├── items.py          # project items definition file
    ├── middlewares.py    # project middlewares file
    ├── pipelines.py      # project pipeline file
    ├── settings.py       # project settings file
    └── spiders           # a directory where spiders are located
        ├── __init__.py
        └── instagram.py  # spider we just created
Okay, that's the Scrapy spider template set up. Now let's start building our Instagram spiders.
From here we're going to create five functions (a bare-bones sketch of how they fit together follows the list):
- start_requests - constructs the Instagram URL for each user account and sends the request to Instagram.
- parse - extracts all the post data from the user's profile feed.
- parse_pages - if there is more than one page, this function parses the post data from those additional pages.
- get_video - if a post includes a video, this function is called to extract the video's URL.
- get_url - sends the request to Scraper API so it can retrieve the HTML response.
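Before we fill in the details, here is a bare-bones sketch of how those five functions hang together in the spider class. The imports and the user_accounts list are illustrative assumptions on my part; the real function bodies are built up section by section below.
import json
from datetime import datetime
from urllib.parse import urlencode

import scrapy

API = '<YOUR_API_KEY>'               # Scraper API key, covered in "Setting Up Proxies"
user_accounts = ['nike', 'adidas']   # example accounts to monitor

def get_url(url):
    # route every request through Scraper API (explained later)
    payload = {'api_key': API, 'url': url}
    return 'http://api.scraperapi.com/?' + urlencode(payload)

class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']

    def start_requests(self):
        pass  # build the profile URLs and request them

    def parse(self, response):
        pass  # extract post data from the window._sharedData JSON

    def parse_pages(self, response):
        pass  # extract post data from subsequent GraphQL pages

    def get_video(self, response):
        pass  # pull the video URL from an individual post page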
Let’s get to work…
Requesting Instagram Accounts
To retrieve a user's data from Instagram we first need to create a list of users we want to monitor and then incorporate their usernames into a URL. Luckily for us, Instagram uses a pretty straightforward URL structure.
Every user has a unique username and/or user ID that we can use to create the user URL:
https://www.instagram.com/<user_name>/
You can also retrieve the posts associated with a specific tag, or from a specific location, by using the following URL formats:
## Tags URL
https://www.instagram.com/explore/tags/<example_tag>/
## Location URL
https://www.instagram.com/explore/locations/<location_id>/
# Note: the location ID is a numeric value, so you need to identify the ID number for
# the locations you want to scrape.
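As a quick illustration of how little would need to change to target a tag feed instead of a user account, here is a hypothetical start_requests variant (the tag list is made up, and get_url is the Scraper API helper introduced later; the parse selectors would also need adjusting, because tag pages nest their JSON under a different key than profile pages):
def start_requests(self):
    # example tags only; swap in whatever tags you want to monitor
    for tag in ['sport', 'running']:
        url = f'https://www.instagram.com/explore/tags/{tag}/?hl=en'
        yield scrapy.Request(get_url(url), callback=self.parse)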
So for this example spider, I'm going to use Nike and Adidas as the two Instagram accounts I want to scrape.
Using the above format, the Nike URL is https://www.instagram.com/nike/. We also want the ability to specify the page language using the "hl" parameter. For example:
https://www.instagram.com/nike/?hl=en #English
https://www.instagram.com/nike/?hl=de #German
Spider #1: Retrieving Instagram Accounts
Now that we have created a Scrapy project and are familiar with how Instagram displays its data, we can begin coding the spiders.
Our start_requests function is going to be pretty simple: we just need to send a request to Instagram with each username's URL to get that user's account page:
def start_requests(self):
    for username in user_accounts:
        url = f'https://www.instagram.com/{username}/?hl=en'
        yield scrapy.Request(get_url(url), callback=self.parse)
The start_requests function iterates through the user_accounts list and sends a request to Instagram for each user via yield scrapy.Request(get_url(url), callback=self.parse), with the response handed to the parse function as the callback.
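The snippet above assumes a user_accounts list already exists; the article doesn't show where it comes from. One option, sketched below, is to hard-code a default and let Scrapy's -a spider arguments override it at runtime (the accounts argument name is my own choice, not something from the original project):
class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']

    def __init__(self, accounts='nike,adidas', *args, **kwargs):
        super().__init__(*args, **kwargs)
        # e.g. scrapy crawl instagram -a accounts=nike,adidas
        self.user_accounts = accounts.split(',')

    def start_requests(self):
        for username in self.user_accounts:
            url = f'https://www.instagram.com/{username}/?hl=en'
            yield scrapy.Request(get_url(url), callback=self.parse)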
Spider #2: Scraping Post Data
Okay, now that we are getting a response back from Instagram we can extract the data we want.
At first glance, the post data we want (image URLs, likes, comments, etc.) doesn't seem to be in the HTML. On closer inspection, however, we can see the data in the form of a JSON dictionary inside the script tag that starts with "window._sharedData".
This is because Instagram first loads the layout and all the data it needs from its internal GraphQL API and then slots the data into the correct layout.
We could get this data directly by querying Instagram's GraphQL endpoint, adding "?__a=1" onto the end of the URL. For example:
https://www.instagram.com/nike/?__a=1
But we wouldn't be able to iterate through all the pages that way, so instead we're going to take the HTML response and extract the data from the window._sharedData JSON dictionary.
Because the data is already formatted as JSON, it is very easy to extract. We use a simple XPath selector to grab the script's text, which looks like window._sharedData = {...};, split off everything after the '= ' and drop the trailing semicolon, then convert the remaining string into a JSON dictionary.
def parse(self, response):
    x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
    json_string = x.strip().split('= ')[1][:-1]
    data = json.loads(json_string)
From here we just need to extract the data we want from the JSON dictionary.
def parse(self, response):
    x = response.xpath("//script[starts-with(.,'window._sharedData')]/text()").extract_first()
    json_string = x.strip().split('= ')[1][:-1]
    data = json.loads(json_string)
    # all that we have to do here is to parse the JSON we have
    user_id = data['entry_data']['ProfilePage'][0]['graphql']['user']['id']
    next_page_bool = \
        data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
            'has_next_page']
    edges = data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_felix_video_timeline']['edges']
    for i in edges:
        url = 'https://www.instagram.com/p/' + i['node']['shortcode']
        video = i['node']['is_video']
        date_posted_timestamp = i['node']['taken_at_timestamp']
        date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
        like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
        comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'].keys() else ''
        captions = ""
        if i['node']['edge_media_to_caption']:
            for i2 in i['node']['edge_media_to_caption']['edges']:
                captions += i2['node']['text'] + "\n"
        if video:
            image_url = i['node']['display_url']
        else:
            image_url = i['node']['thumbnail_resources'][-1]['src']
        item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
                'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count,
                'image_url': image_url, 'captions': captions[:-1]}
Spider #3: Extracting Video URLs
To extract the video URL we need to make another request to the specific post's page, as that data isn't included in the JSON response previously returned by Instagram.
If a post includes a video, the is_video flag will be set to true, which triggers our scraper to request that post's page and send the response to the get_video function.
if video:
    yield scrapy.Request(get_url(url), callback=self.get_video, meta={'item': item})
else:
    item['videoURL'] = ''
    yield item
The get_video function will then extract the videoURL from the response.
def get_video(self, response):
    # only from the first page
    item = response.meta['item']
    video_url = response.xpath('//meta[@property="og:video"]/@content').extract_first()
    item['videoURL'] = video_url
    yield item
Spider #4: Iterating Through Available Pages
The last piece of extraction logic we need to implement is the ability for our crawler to iterate through all the available pages on a user's account and scrape all of the data.
As with the get_video check, we need to check whether there are any more pages available before calling the parse_pages function. We do that by checking whether the has_next_page field in the JSON dictionary is true or false.
next_page_bool = \
    data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
        'has_next_page']
If it is true, then we extract the end_cursor value from the JSON dictionary and create a new request for Instagram's GraphQL API endpoint, along with the user_id, the query_hash (a static identifier for this type of timeline query, which you can find by watching the GraphQL requests in your browser's network inspector as you scroll through a profile), and the 'first' value, which sets how many posts to fetch per page.
if next_page_bool:
    cursor = \
        data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['page_info'][
            'end_cursor']
    di = {'id': user_id, 'first': 12, 'after': cursor}
    print(di)
    params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
    url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
    yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
This will then call the parse_pages function, which repeats the process of extracting all the post data and checking whether there are any more pages.
The difference between this function and the original parse function is that it reads each video URL straight from the GraphQL JSON (the video_url field) instead of making a separate request to the post page the way get_video does.
def parse_pages(self, response):
    di = response.meta['pages_di']
    data = json.loads(response.text)
    for i in data['data']['user']['edge_owner_to_timeline_media']['edges']:
        video = i['node']['is_video']
        url = 'https://www.instagram.com/p/' + i['node']['shortcode']
        if video:
            image_url = i['node']['display_url']
            video_url = i['node']['video_url']
        else:
            video_url = ''
            image_url = i['node']['thumbnail_resources'][-1]['src']
        date_posted_timestamp = i['node']['taken_at_timestamp']
        captions = ""
        if i['node']['edge_media_to_caption']:
            for i2 in i['node']['edge_media_to_caption']['edges']:
                captions += i2['node']['text'] + "\n"
        comment_count = i['node']['edge_media_to_comment']['count'] if 'edge_media_to_comment' in i['node'].keys() else ''
        date_posted_human = datetime.fromtimestamp(date_posted_timestamp).strftime("%d/%m/%Y %H:%M:%S")
        like_count = i['node']['edge_liked_by']['count'] if "edge_liked_by" in i['node'].keys() else ''
        item = {'postURL': url, 'isVideo': video, 'date_posted': date_posted_human,
                'timestamp': date_posted_timestamp, 'likeCount': like_count, 'commentCount': comment_count,
                'image_url': image_url, 'videoURL': video_url, 'captions': captions[:-1]}
        yield item
    next_page_bool = data['data']['user']['edge_owner_to_timeline_media']['page_info']['has_next_page']
    if next_page_bool:
        cursor = data['data']['user']['edge_owner_to_timeline_media']['page_info']['end_cursor']
        di['after'] = cursor
        params = {'query_hash': 'e769aa130647d2354c40ea6a439bfc08', 'variables': json.dumps(di)}
        url = 'https://www.instagram.com/graphql/query/?' + urlencode(params)
        yield scrapy.Request(get_url(url), callback=self.parse_pages, meta={'pages_di': di})
Setting Up Proxies
Finally, we are pretty much ready to go live. The last thing we need to do is set our spiders up to use a proxy so we can scrape at scale without getting blocked.
For this project, I’ve gone with Scraper API as it is super easy to use and because they have a great success rate with scraping Instagram.
Scraper API is a proxy API that manages everything to do with proxies for you. You simply have to send them the URL you want to scrape and their API will route your request through one of their proxy pools and give you back the HTML response.
To use Scraper API you need to sign up to a free account here and get an API key which will allow you to make 1,000 free requests per month and use all the extra features like Javascript rendering, geotargeting, residential proxies, etc.
Next, we need to integrate it with our spider. Reading their documentation, we see that there are three ways to interact with the API: via a single API endpoint, via their Python SDK, or via their proxy port.
For this project, I integrated the API by configuring my spiders to send all our requests to their API endpoint.
API = '<YOUR_API_KEY>'

def get_url(url):
    payload = {'api_key': API, 'url': url}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
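For example, get_url('https://www.instagram.com/nike/?hl=en') returns a URL along the lines of http://api.scraperapi.com/?api_key=<YOUR_API_KEY>&url=https%3A%2F%2Fwww.instagram.com%2Fnike%2F%3Fhl%3Den, so the request to Instagram is made from Scraper API's proxy pool rather than directly from our own IP.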
We then modify our spider functions to use the Scraper API proxy by wrapping each URL we pass to scrapy.Request in get_url(). For example:
def start_requests(self):
    for username in user_accounts:
        url = f'https://www.instagram.com/{username}/?hl=en'
        yield scrapy.Request(get_url(url), callback=self.parse)
We also have to change the spider's settings: set allowed_domains to api.scraperapi.com, and set the max concurrency per domain to the concurrency limit of our Scraper API plan, which in the case of the free plan is 5 concurrent threads:
class InstagramSpider(scrapy.Spider):
    name = 'instagram'
    allowed_domains = ['api.scraperapi.com']
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 5}
We should also set RETRY_TIMES to tell Scrapy to retry any failed requests (to 5, for example) and make sure DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY aren't enabled, as these will lower your concurrency and are not needed with Scraper API.
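In settings.py that might look something like this (a sketch; the exact values depend on your Scraper API plan):
## settings.py
RETRY_TIMES = 5                      # retry failed requests up to 5 times
DOWNLOAD_DELAY = 0                   # no artificial delay needed when requests go through Scraper API
RANDOMIZE_DOWNLOAD_DELAY = False
ROBOTSTXT_OBEY = False               # requests go via the proxy API, so the robots.txt check isn't useful here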
Setting Up Monitoring
To monitor our scraper we're going to use ScrapeOps, a free monitoring and alerting tool dedicated to web scraping.
With a simple 30 second install ScrapeOps gives you all the monitoring, alerting, scheduling and data validation functionality you need for web scraping straight out of the box.
Live demo here: ScrapeOps Demo
Getting setup with ScrapeOps is simple. Just install the Python package:
pip install scrapeops-scrapy
And add the following lines to your settings.py file:
## settings.py

## Add Your ScrapeOps API key
SCRAPEOPS_API_KEY = 'YOUR_API_KEY'

## Add In The ScrapeOps Extension
EXTENSIONS = {
    'scrapeops_scrapy.extension.ScrapeOpsMonitor': 500,
}

## Update The Download Middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapeops_scrapy.middleware.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
}
From there, our scraping stats will be automatically logged and shipped to our dashboard.
Going Live!
Now we are good to go. You can test the spider by running it with the crawl command:
scrapy crawl instagram -o test.csv
Once complete, the spider will store the account data in a CSV file.
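If you would prefer JSON output, Scrapy's feed exports handle that too; just change the output file extension, for example: scrapy crawl instagram -o test.json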
If you would like to run the spider for yourself or modify it for your particular Instagram project then feel free to do so. The code is on GitHub here. Just remember that you need to get your own Scraper API key by signing up here.
Comments
Looks like Instagram doesn't work via Scraper API anymore. But it still works on webscraping.ai
instagram.com/explore/tags//
Do you know how to get the posts for a tagged username? Simply replacing the former URL with this new one doesn't seem to work.
instagram.com/explore/tags/sport/?... works for hashtags, and instagram.com/nike/tagged/?__a=1 works for a username, but this one requires login.
Do you mind sharing how you adjusted the code to use webscraping.ai instead? Thanks!
Sure, here it is gist.github.com/Drakula2k/035cc5bd...
I also fixed a couple of bugs there
Thanks so much for sharing! After making the changes I am unfortunately still getting blocked by the robots.txt file. Is this code still working for you?
Yes, it's working. You can disable the robots.txt check by setting ROBOTSTXT_OBEY = False in your settings.py. It works via an API so there is no need for the robots.txt check.
Incredible, thank you! It worked! So is it always a good idea to set ROBOTSTXT_OBEY = False, considering we don't want to be stopped?
Yes, ROBOTSTXT_OBEY is good when you're building something like a search engine and it may request all sorts of random URLs posted on the Internet. In that case, using robots.txt is good to skip non-public pages.
But if you're requesting particularly defined URLs or using an API, robots.txt is not so useful and may block access to the API.
Thanks a lot, I learned a ton from your code... but I'm still confused by the query_hash. May I ask how you get this constant for this type of query, please?
Open the Inspector in Chrome, visit Instagram and scroll through the posts; you'll see the same GraphQL queries with the query_hash.
I'm not sure what query_hash value means exactly, but they're static for each type of query it seems.
Ohhh, I see, it's a constant number (the same every time I scroll down the profile), but for me it's a different number, not 'e769aa130647d2354c40ea6a439bfc08'. By the way, thank you so much. I'm a beginner with Scrapy; do you suggest any book or tutorial to learn more advanced Scrapy projects? I already bought this book.
Merry Christmas,
Regards,
Kai
They may have changed something, but the old value still works too, it seems.
I'm not a specialist in Scrapy, but generally, I'd read official docs (docs.scrapy.org/en/latest/) and then start doing some projects using it and learn from them.
It works like a charm. Thank you so much, but I have 2 questions:
1) How do we include the username, so we can tell which user each post belongs to?
2) How can we get basic information such as name, bio, handle, number of followers, number following and media count?
If this works for all that information I might need to subscribe to the Scraper API 1+ million plan xD
Thank you in advance.
Hey there, you have to download Python and install something called Scrapy, a web scraping framework for Python. I'd recommend watching some videos on YouTube to learn, and I suggest starting by following this 25-episode tutorial:
youtube.com/watch?v=ve_0h4Y8nuI&li... This channel is very good; follow it and you'll be off to a good start!
Have a good day
Hi there! Great post...it answers a lot of questions.
Small thing, though: the likes count and comment count aren't working properly. I'm assuming it's due to the near-constant moving target of Instagram changing their page. Any hints on how to resolve this?
Thanks very much for your time!
The likes count isn't working for me either. It's just giving me NaN values. Any idea how to fix this?
Hey, this code is giving me an error:
Ignoring response <403 https://api.webscraping.ai/html?api_key=45299f85b2302dd84a9f53e5a799114e&proxy=residential&timeout=20000&url=https%3A%2F%2Fwww.instagram.com%2Fnike%2F%3Fhl%3Den>: HTTP status code is not handled or not allowed
Can anyone help me out here?
The code in the article is designed to use scraperapi.com as the proxy, you are using webscraping.ai. You need to adapt the code to use this proxy as the error suggests that they use a different authentication method for their API.
Hi. This is a very helpful article.
What does the variable "first" in the dictionary mean? I'm making a hashtag-based crawler and I'm having trouble setting the value of the "first" variable. What are the criteria for setting it?
It appears I am getting stopped by Instagram's robots.txt file. Any ideas on how to adjust the code to circumvent this?
This script saved my life. Is there a way to extract the actual user comments and not just the count? e.g. username/text/date/time