DEV Community

Chris Greening

Scraping every post on an Instagram profile with less than 10 lines of Python

In this blog post, I'm going to give a quick tutorial on how you can scrape every post on an Instagram profile page using instascrape with less than 10 lines of Python!

Specifically, I am going to be scraping every post from Joe Biden's Instagram account (@joebiden).

chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

instascrape: powerful Instagram data scraping toolkit

DISCLAIMER:

Instagram has gotten increasingly strict with scraping and using this library can result in getting flagged for botting AND POSSIBLE DISABLING OF YOUR INSTAGRAM ACCOUNT. This is a research project and I am not responsible for how you use it. Independently, the library is designed to be responsible and respectful and it is up to you to decide what you do with it. I don't claim any responsibility if your Instagram account is affected by how you use this library.

What is it?

instascrape is a lightweight Python package that provides an expressive and flexible API for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.

Key features

Here are a few of the things that…


Chris Greening - Software Developer

Hey! My name's Chris Greening and I'm a software developer from the New York metro area with a diverse range of engineering experience - beam me a message and let's build something great!

christophergreening.com

Prerequisites for those of you following along at home

Importing from our libraries

Let's start by importing the tools we'll be using.

from selenium.webdriver import Chrome 
from instascrape import Profile, scrape_posts

Preparing the profile for scraping

As I've mentioned in previous blog posts, Instagram serves most content asynchronously using JavaScript, which allows for the seamless infinite scroll effect and decreased load times.

This is where our webdriver comes in handy: it lets us render the JavaScript. For this tutorial, I will be using chromedriver to automate Google Chrome as my browser, but feel free to use whatever webdriver you are comfortable with!

webdriver = Chrome("path/to/chromedriver.exe")

Now, a quick aside before we start this next part: you are going to have to find your Instagram sessionid. *gasp* Don't worry! Here is a super short guide. Be sure to paste it into the headers dictionary below where indicated.

headers = {
    "user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
    "cookie": "sessionid=PASTE_YOUR_SESSIONID_HERE;"
}
joe = Profile("joebiden")
joe.scrape(headers=headers)
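
Hard-coding a session ID in a script makes it easy to leak by accident. As a minimal sketch, assuming the value is exported in an environment variable, the same headers dictionary could be built as shown below. The variable name INSTAGRAM_SESSIONID and the helpers build_headers and headers_from_env are made up for illustration; they are not part of instascrape.

```python
import os

# Same mobile user agent string used in the headers dictionary above.
USER_AGENT = (
    "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57"
)


def build_headers(session_id):
    """Return the request headers dict for a given Instagram sessionid.

    Hypothetical helper for illustration; it is not an instascrape function.
    """
    if not session_id:
        raise ValueError("session_id is empty; copy your sessionid cookie first")
    return {
        "user-agent": USER_AGENT,
        "cookie": f"sessionid={session_id};",
    }


def headers_from_env(var_name="INSTAGRAM_SESSIONID"):
    """Read the session ID from an environment variable (the name is an assumption)."""
    return build_headers(os.environ.get(var_name, ""))
```

The returned dictionary can then be passed as headers=... in the same way as the hand-written one.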

Dynamically loading all posts

And now for the part you've all been waiting for! Using the Profile.get_posts instance method, there are a variety of arguments we can pass for painlessly loading all the posts on a page.

In this case, we are going to have to manually log in to our Instagram account when the browser opens, so we pass login_first=True. This will give us 60 seconds to enter our username and password (this wait time can be modified to whatever you want).

posts = joe.get_posts(webdriver=webdriver, login_first=True)

Now, to prove to you that it worked, here is a GIF of me scrolling through the scraped URLs of all 1,261 posts 😏

Scraping the data from each post

Now there is only one thing left to do, and that is scrape every individual post. The scrape_posts function takes a variety of arguments that let you configure your scrape however you want!

The most important argument in this case is posts, which is a list of unscraped instascrape.Post objects.

In this case, I'm going to set a pause of 10 seconds between each scrape so that Instagram doesn't temporarily IP block us.

scraped_posts, unscraped_posts = scrape_posts(posts, headers=headers, pause=10, silent=False)

In the event that there is a problem, scrape_posts can be configured to return all of the posts that were not scraped so we don't lose the work we did, hence the unscraped_posts return value.
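
To make the idea concrete, here is a minimal standard-library sketch of the pattern: work through a list with a pause between items and, if something fails, stop and hand back both what finished and what is left. The function scrape_with_pause and the fake_scrape stand-in are hypothetical; this is not how instascrape implements scrape_posts.

```python
import time


def scrape_with_pause(items, scrape_one, pause=10):
    """Call scrape_one on each item, sleeping `pause` seconds between calls.

    Returns (done, remaining). On the first exception the loop stops and
    every item that was not processed, including the failing one, is
    returned in `remaining` so the run can be resumed later.
    """
    done = []
    for index, item in enumerate(items):
        try:
            done.append(scrape_one(item))
        except Exception as error:
            print(f"Stopped at item {index}: {error}")
            return done, list(items[index:])
        if index < len(items) - 1:
            time.sleep(pause)  # wait before the next request
    return done, []


def fake_scrape(url):
    # Stand-in for a real request; fails on one URL to simulate a block.
    if url == "post-3":
        raise ConnectionError("connection reset")
    return {"url": url}


done, remaining = scrape_with_pause(
    ["post-1", "post-2", "post-3", "post-4"], fake_scrape, pause=0
)
print(len(done), remaining)  # prints: 2 ['post-3', 'post-4']
```

Returning the leftover items instead of raising means a long run that gets interrupted can be picked up again from the remaining list.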

In conclusion

And there we have it! In less than 10 lines of Python, we were able to scrape almost 50,000 data points from @joebiden's Instagram account!

We can now analyze his engagement, how many hashtags he uses, who he tags in photos, etc. In my next blog post, I'll be showing some ways we can analyze this data and glean useful insights!

In the meantime, here is a related article where I analyze 10,000 data points scraped from Donald Trump's Instagram account.

Click here for the full code, dataset, and file containing all the URLs used in this tutorial.

If you have any questions, feel free to drop them in the comments below, message me, or contact me at my website!


Top comments (26)

karisjochen

Thanks so much for these tutorials and sharing your work! Question: I ran this exact code for my own instagram data and it has been executing for ~30 minutes now. I have 552 instagram posts. I'm hesitant to kill it but I am unsure if it is stuck. Any ideas?

Chris Greening

Unfortunately it's an incredibly slow approach. Instagram starts blocking if you scrape too much too fast, so I try to play the long game and let it run in the background.

In the scrape_posts function, you'll see pause=10 which refers to a 10s pause between each post scrape. Considering you have 552 posts, that'll be (552*10)/60 = 92 minutes 😬
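
The same back-of-the-envelope estimate can be written as a small function. It only counts the pauses, so a real run takes somewhat longer once request time is added; estimate_minutes is a name made up for this sketch.

```python
def estimate_minutes(post_count, pause_seconds=10):
    """Lower-bound runtime in minutes: one pause per post, request time ignored."""
    return post_count * pause_seconds / 60


print(estimate_minutes(552))   # 92.0
print(estimate_minutes(1261))  # roughly 210 minutes for the 1,261 posts in the article
```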

In the future, passing silent=False as an argument will print what number the scrape is currently on, I'm actually gonna edit that in right now for anyone else reading the article in the future!

Thanks for reaching out!

Chris Greening

If it's any consolation though, that means it's working! You're just gonna have to wait an extra hour or so before you can get your data 😬

karisjochen

haha thank you! So it did eventually finish without error but then I appeared to have a list of "Post" objects of which I could not tell how I was to get the data from. From reading the GitHub documentation I tried various methods but to no avail (this isn't a knock on you more a knock on my learning curve).

So now, after a few hours of messing around, I tried to run the "joe biden code" for my own account and even though I am setting login_first=False in the get_posts function, the chrome driver brings me to a login page. I'm able to log into Instagram, but meanwhile my code says it has finished running without error and my posts and scraped_posts objects are now just empty lists.

karisjochen

oh I guess I should also mention that my end goal is to collect data similar to the data you analyzed in your donald trump post. I saw you published a notebook of the analysis code (thank you!) but didn't see a line-by-line on how you got that data.

Chris Greening

Scraped Post objects contain the scraped data as instance attributes! Try using the to_dict method on one of the Posts and it should return a dictionary with the data it scraped for that Post. The keys and values of the returned dict will correspond one-to-one with the available instance attributes.
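
As a rough sketch of what that looks like in practice, the snippet below writes a list of dictionaries to CSV with the standard library. The sample dictionaries are made-up placeholders standing in for the output of to_dict (the keys upload_date, likes and comments are borrowed from a snippet further down in these comments, but the values are invented), and posts_to_csv is a hypothetical helper, not part of instascrape.

```python
import csv
import io


def posts_to_csv(post_dicts, file_obj):
    """Write a list of dicts to file_obj as CSV, one row per post.

    The header is the union of all keys, in first-seen order, so posts
    with missing fields still line up under the right columns.
    """
    fieldnames = []
    for row in post_dicts:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(post_dicts)


# Placeholder rows; with real data this would be something like
# [post.to_dict() for post in scraped_posts].
sample = [
    {"upload_date": "2021-01-01 12:00:00", "likes": 100, "comments": 5},
    {"upload_date": "2021-01-02 12:00:00", "likes": 250},
]

buffer = io.StringIO()
posts_to_csv(sample, buffer)
print(buffer.getvalue())
```

To save to disk instead of printing, pass a file opened with open("posts.csv", "w", newline="") in place of the StringIO buffer.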

I'll take a look at the login_first bug rn and see if I can replicate it, it might be on the library's end! Instagram has been making a lot of changes the last month or so and have been making it increasingly harder to scrape

Chris Greening

ahhh okay, so when you set login_first=False, Instagram is still redirecting to the login page automatically but instascrape is trying to start scrolling immediately which results in an empty list since there are no posts rendered on the page

to access dynamically rendered content like posts you're pretty much always gonna have to be logged in so it's best to leave login_first as True unless you're chaining scrapes and your webdriver is already logged in manually

karisjochen

amazing thank you! So I was able to get my first 10 posts no problem by specifying amount=10 but then I tried to do all ~500 pictures and after 232 pictures I came across this error:

ConnectionError: ('Connection aborted.', OSError("(54, 'ECONNRESET')"))

I'm guessing this means Instagram blocked my request? Have you come across this issue?

idilkylmz

Hi Chris,

Thank you for your sharing. I tried to use your code but I am getting this error.

ImportError: cannot import name 'QUOTE_NONNUMERIC' from partially initialized module 'csv' (most likely due to a circular import) (/home/idil/Masaüstü/csv.py)

Do you know what this is about?

Yug Khatri

Firstly, thank you for making such an awesome library/module. What if I want to scrape just the first 12 posts, or the first page (containing the recent posts), of a public profile? How would I apply that? And if I want these recent posts returned as JSON, how would I do that? Here's a sample of an object I want returned in the JSON array:

"node": {
              "__typename": "GraphImage",
              "id": "2730428346900395591",
              "shortcode": "CXkbh11p7ZH",
              "dimensions": { "height": 1080, "width": 1080 },
              "display_url": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/e35/s1080x1080/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT8t9PchQPCY7kkxcgKsiXiZda_BgcgS_DP5DNRNHjnhfQ&oe=61C9609B&_nc_sid=7bff83",
              "edge_media_to_tagged_user": { "edges": [] },
              "fact_check_overall_rating": null,
              "fact_check_information": null,
              "gating_info": null,
              "sharing_friction_info": {
                "should_have_sharing_friction": false,
                "bloks_app_url": null
              },
              "media_overlay_info": null,
              "media_preview": "ACoqw/IKvtfs20/XjjP41uuIo4DGCAxAONwJyMf4VltCrbiAfvcZznHI7n8auwWjrwynBIxx2pASSR72J6gkkfjTXgAYnHUf0xWtHGFUZHT+lR7RM5oAyY7dVBwKT7OvpWvOioAB1qvigCApVxceaOAMjoOuCvX7oPPs2AagIxUoLBlc/dxgdMn8eCAPQZz3oA0SeMe39Kz4pdgJqY3I6etZ3WgCyzGU8VZFu1JAojXc3Wg3NAFU0wgjpx+NOWnN0oAg5z1pqjFSCmigCVpMjFR0DrRQB//Z",
              "owner": { "id": "31997451", "username": "getpeid" },
              "is_video": false,
              "has_upcoming_event": false,
              "accessibility_caption": "Photo by Carl Pei on December 16, 2021. May be an image of indoor.",
              "edge_media_to_caption": {
                "edges": [
                  {
                    "node": { "text": "Nothing to see here\u2026 \ud83e\udd16" }
                  }
                ]
              },
              "edge_media_to_comment": { "count": 67 },
              "comments_disabled": false,
              "taken_at_timestamp": 1639712445,
              "edge_liked_by": { "count": 3047 },
              "edge_media_preview_like": { "count": 3047 },
              "location": null,
              "thumbnail_src": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/s640x640/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT99lHhVotAjz6UZ_8YX-zkpaccYeirUMSxKXT8tWGSBWQ&oe=61CB131F&_nc_sid=7bff83",
              "thumbnail_resources": [
                {
                  "src": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/e35/s150x150/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT-3RwtltAzqqy9VNUK6i5uthTh_fuxfQzMIglI83keZXA&oe=61CAEE1C&_nc_sid=7bff83",
                  "config_width": 150,
                  "config_height": 150
                },
                {
                  "src": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/e35/s240x240/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9nE3ABD9PKSZ5UTo_NDkLjcLQySpZzaPbFzDDmxPhs_Q&oe=61CB331E&_nc_sid=7bff83",
                  "config_width": 240,
                  "config_height": 240
                },
                {
                  "src": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/e35/s320x320/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT_VJtqot_T-RKh2HxY-mdHraqWZp76iO52bHoJEVORXgw&oe=61C9E0A4&_nc_sid=7bff83",
                  "config_width": 320,
                  "config_height": 320
                },
                {
                  "src": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/e35/s480x480/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT8h0qPpCUdsJC32Zz-nXoi1kKITa_AT2SQtYeEf572AmQ&oe=61C9D9A5&_nc_sid=7bff83",
                  "config_width": 480,
                  "config_height": 480
                },
                {
                  "src": "https://instagram.famd4-1.fna.fbcdn.net/v/t51.2885-15/sh0.08/e35/s640x640/269265717_437445878085899_6310820446477644652_n.jpg?_nc_ht=instagram.famd4-1.fna.fbcdn.net&_nc_cat=111&_nc_ohc=BKGvyd56hscAX9nnnEt&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT99lHhVotAjz6UZ_8YX-zkpaccYeirUMSxKXT8tWGSBWQ&oe=61CB131F&_nc_sid=7bff83",
                  "config_width": 640,
                  "config_height": 640
                }
              ],
              "coauthor_producers": []
            }
          }
 
cutepoison

Thank you so much for these tutorials and for publishing your work 🙏🙏

When I try to save both scraped_posts and unscraped_posts, it says function has no 'to_csv' member. I do see the URLs and the upload date, but I can't tell whether the individual posts are being scraped or not.

Maybe I'm doing it wrong. I've looked through your other blog posts and documentation and couldn't find any examples of how to save the scraped data or use the to_csv/to_json line (yes, I am a beginner in programming, apologies if this question sounds stupid).

Mageshwaran

This is Cool Man

Chris Greening

Hey thanks so much, I appreciate it! 😄

gorpcorrespondent

InvalidArgumentException: invalid argument: 'url' must be a string
Do you know why I might be getting this error?
Code is as follows:
import pandas as pd
from selenium.webdriver import Chrome
from instascrape import Profile, scrape_posts
from webdriver_manager.chrome import ChromeDriverManager

# defining path for Google Chrome webdriver

driver = webdriver.Chrome(ChromeDriverManager().install())

# Scraping Joe Biden's profile

SESSIONID = 'session id' #Actual session id excluded on purpose
headers = {"user-agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Mobile Safari/537.36 Edg/87.0.664.57",
"cookie": f"sessionid={SESSIONID};"}
prof = Profile('instagram.com/username/') #username excluded as well
prof.scrape()

# Scraping the posts

posts = prof.get_posts(webdriver=driver, login_first=True)
scraped, unscraped = scrape_posts(posts, silent=False, headers=headers, pause=10)

posts_data = [post.to_dict() for post in posts]
posts_df = pd.DataFrame(posts_data)
print(posts_df[['upload_date', 'comments', 'likes']])

The issue seems to be stemming from my 'get_posts' call

Deepak Tripathi

Hi Chris,
I tried running the above code, but i keep on getting below error. I have put my valid session id.

"Instagram is redirecting you to the login page instead of the page you are trying to scrape. This could be occuring because you made too many requests too quickly or are not logged into Instagram on your machine. Try passing a valid session ID to the scrape method as a cookie to bypass the login requirement"

pramodadf

Hi Chris,

joebiden.py:8: DeprecationWarning: executable_path has been deprecated, please pass in a Service object
  webdriver = Chrome("/home/pramod/Downloads/chromedriver/chromedriver")
Traceback (most recent call last):
  File "joebiden.py", line 19, in <module>
    scraped, unscraped = scrape_posts(posts, silent=False, headers=headers, pause=10)
  File "/home/pramod/.local/lib/python3.8/site-packages/instascrape/scrapers/scrape_tools.py", line 179, in scrape_posts
    post.scrape(session=session, webdriver=webdriver, headers=headers)
  File "/home/pramod/.local/lib/python3.8/site-packages/instascrape/scrapers/post.py", line 88, in scrape
    return_instance.upload_date = datetime.datetime.fromtimestamp(return_instance.timestamp)

I am facing an error . Please help me out .
thanks

villival

Always crisp and clear... thanks for sharing ...

Anaaa

Thanks for your hard work. I'm really lucky because I found out about this project just as I wanted to scrape my business IG profile. Keep up with the good work!

Chris Greening

This is exactly why I released it, thanks so much for the feedback 😄 motivates me to keep working on it

Alessandro Sassi

Thank you very much for this precious tool!
I'm trying to run the code, but despite inserting my session id I still get 'MissingCookiesWarning' and 'InstagramRedirectLoginError'.
How to fix this?

apet

I am getting the error "ValueError: Invalid value NaN (not a number)" when the scrape_posts() method is called. Something about getting time shows up before if that is relevant. Thanks

xxLisaPeter

awesome thanks so much :) I'm pretty new to python but I could run your code!
Is there a way to save the posts (images + texts) to a certain folder?

Fatima-2309

Why am I getting this error :(
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)

ivan-vilches

Hello, amazing post! I'd love a link to the second part on analyzing the data.
To do this scraping, do you need to change IP or use multiple IG accounts?
Thanks so much