DEV Community

The Easy Way to Scrape Instagram Using Python Scrapy & GraphQL

Ian Kerins on August 06, 2020

After e-commerce monitoring, building social media scrapers to monitor accounts and track new trends is the next most popular use case for web scra...


Vlad • Jan 17 '21

Looks like Instagram doesn't work via Scraper API anymore. But it still works on webscraping.ai

djk50 • May 27 '21

instagram.com/explore/tags//

Do you know how to get the posts for a tagged username? Simply replacing the former URL with this new one doesn't seem to work.

Vlad • Jun 15 '21

instagram.com/explore/tags/sport/?... works for hashtags, and instagram.com/nike/tagged/?__a=1 works for usernames, but the latter requires login.

karisjochen • Feb 1 '21

Do you mind sharing how you adjusted the code to use webscraping.ai instead? Thanks!

Vlad • Feb 1 '21 • Edited

Sure, here it is: gist.github.com/Drakula2k/035cc5bd...
I also fixed a couple of bugs there.

karisjochen • Feb 1 '21

Thanks so much for sharing! After making the changes I am unfortunately still getting blocked by the robots.txt file. Is this code still working for you?

Vlad • Feb 2 '21

Yes, it's working. You can disable the robots.txt check by setting ROBOTSTXT_OBEY = False in your settings.py. The requests go through an API, so there is no need for the robots.txt check.
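
A minimal sketch of that setting, assuming a standard Scrapy project where settings.py sits next to the spiders package (the rest of the file stays as generated):

```python
# settings.py (Scrapy project settings)

# Skip the robots.txt check. Assumption: every request goes to a scraping
# API endpoint rather than crawling arbitrary pages on the target site.
ROBOTSTXT_OBEY = False
```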

karisjochen • Feb 4 '21

Incredible, thank you! It worked! So is it always a good idea to set ROBOTSTXT_OBEY = False, considering we don't want to be stopped?

Vlad • Feb 4 '21

It depends. ROBOTSTXT_OBEY is useful when you're building something like a search engine that may request all sorts of random URLs posted on the Internet. In that case, respecting robots.txt is a good way to skip non-public pages.

But if you're requesting specific, predefined URLs or using an API, robots.txt is not so useful and may block access to the API.

kaiwangyu • Dec 24 '21

Thanks a lot, I learned a ton from your code, but I'm still confused by the query_hash. May I ask how you get this constant for this type of query, please?

Vlad • Dec 24 '21

Open the Inspector in Chrome, visit Instagram, and scroll through the posts. You'll see the same GraphQL queries with a query_hash parameter.
I'm not sure what the query_hash value means exactly, but it seems to be static for each type of query.
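
As a rough sketch of how such a request URL could be assembled once you have copied a query_hash from the network tab: the endpoint path and the variable names `id` and `after` are assumptions here, and the user ID and cursor are placeholders, so compare against the requests you actually see in the Inspector.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; confirm it against the request URL in the network tab.
GRAPHQL_ENDPOINT = "https://www.instagram.com/graphql/query/"


def build_graphql_url(query_hash, variables):
    """Return a GraphQL query URL for the given hash and variables dict."""
    params = {
        "query_hash": query_hash,
        # The variables are sent as a compact JSON string.
        "variables": json.dumps(variables, separators=(",", ":")),
    }
    return GRAPHQL_ENDPOINT + "?" + urlencode(params)


# Hash value quoted elsewhere in this thread; the user id and cursor
# below are hypothetical placeholders for illustration only.
url = build_graphql_url(
    "e769aa130647d2354c40ea6a439bfc08",
    {"id": "12345", "first": 12, "after": "CURSOR"},
)
print(url)
```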

kaiwangyu • Dec 25 '21 • Edited

Ohhh, I see, it's a constant (the same every time I scroll down the profile), but for me it's a different value, not 'e769aa130647d2354c40ea6a439bfc08'. By the way, thank you so much. I'm a beginner with Scrapy; do you suggest any book or tutorial for learning advanced Scrapy-based projects? I already bought this book.

Kai
Merry Christmas
Regards

Vlad • Dec 25 '21

They may have changed something, but the old value still works too, it seems.
I'm not a specialist in Scrapy, but generally, I'd read official docs (docs.scrapy.org/en/latest/) and then start doing some projects using it and learn from them.

abbas53333 • Dec 30 '20 • Edited

It works like a charm. Thank you so much! But I have two questions:
1) How do we include the username to identify which user each post belongs to?
2) How can we get basic information such as name, bio, handle, number of followers, number of following, and media count?

If this works for all that information, I might need to subscribe to Scraper API 1+ Million xD

Thank you in advance

Comment deleted

abbas53333 • Jan 17 '21

Hey there, you have to download Python and install Scrapy, a web scraping framework for Python. I would recommend watching some videos on YouTube to learn it, and I suggest starting with this 25-episode tutorial:
youtube.com/watch?v=ve_0h4Y8nuI&li...
This channel is very good; follow it and you'll be able to get started!
Have a good day

GhostGardens • Sep 3 '20

Hi there! Great post...it answers a lot of questions.

Small thing, though: the "likes" count and comment count aren't working properly. I'm assuming it's due to the near-constant moving target of Instagram changing their page. Any hints on how to resolve this?

Thanks very much for your time!

jacksonbull87 • Nov 16 '20

The likes count isn't working for me either. It's just giving me NaN values. Any idea how to fix this?

Mayank Bali • Jun 9 '21

Hey, this code is giving me an error:

Ignoring response <403 https://api.webscraping.ai/html?api_key=45299f85b2302dd84a9f53e5a799114e&proxy=residential&timeout=20000&url=https%3A%2F%2Fwww.instagram.com%2Fnike%2F%3Fhl%3Den>: HTTP status code is not handled or not allowed

Can anyone help me out here?

Ian Kerins • Jun 14 '21

The code in the article is designed to use scraperapi.com as the proxy, but you are using webscraping.ai. You need to adapt the code to that proxy, as the error suggests they use a different authentication method for their API.
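
One way to see why the API answered 403 is to let that response through to the spider callback and log its body. A sketch, assuming a standard Scrapy project with the default HttpErrorMiddleware enabled; it only helps with debugging and does not fix the authentication itself:

```python
# settings.py (Scrapy project settings)

# By default Scrapy drops non-2xx responses with "HTTP status code is not
# handled or not allowed". Listing 403 here passes those responses to the
# callback, where response.text can be logged to read the API's error message.
HTTPERROR_ALLOWED_CODES = [403]
```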

vasana12 • Dec 17 '20

Hi. This is a very helpful article.
What does the variable "first" in the dictionary mean? I am making a hashtag-based crawler and am having trouble setting the value of "first". Can you explain the criteria for setting it?

karisjochen • Feb 1 '21

It appears I am getting stopped by Instagram's robots.txt file. Any ideas on how to adjust the code to circumvent this?

thedukeofnada • May 5 '21

Saved my life with this script. Is there a way to extract the actual user comments and not just the count? For example: username, text, date, time.