- The Content
- The Tools
- The Exploration
- The Conclusion
My two recent posts have been light introductions to
instascrape: the lightweight, open source Instagram web scraper written for Python 🐍!
In this post, I'm going to show some of the ways I've personally explored Instagram programmatically using Python and
instascrape.
Like many other modern websites, Instagram uses a combination of server-side and client-side techniques, such as AJAX, to dynamically load content as you scroll. This lets Instagram respond to an HTTP request quickly and then load more content as it's needed, presenting the user with a clean, seamless user experience (UX) with infinite scrolling and fast page refreshes.
While great for UX, this dynamically rendered content can be a bit of a pain for web scraping... but no worries 🙌! There are ways around it, and we can take it in stride. For the most part, I use
instascrape for the scraping.
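To give a feel for what that looks like, here is a minimal sketch of a profile scrape following the `Profile(...).scrape()` pattern shown in instascrape's README, which populates attributes such as `followers`. The network call is gated behind a flag so the snippet runs anywhere; the engagement-rate helper is a purely illustrative addition of mine, not part of the library.

```python
# Sketch of a basic profile scrape with instascrape, per its README.
# The scrape itself needs `pip install insta-scrape` and network access,
# so it's gated behind a flag; the helper below is plain Python.

RUN_SCRAPE = False  # flip to True to actually hit Instagram

def engagement_rate(likes, comments, followers):
    """Engagement on a post as a fraction of the follower count."""
    return (likes + comments) / followers

if RUN_SCRAPE:
    from instascrape import Profile

    profile = Profile("chris_greening")
    profile.scrape()  # populates attributes like `followers`
    print(profile.followers)
    print(engagement_rate(400, 20, profile.followers))
```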
Regarding the data analysis, I use a combination of
pandas: powerful tools and data structures for analyzing and exploring data
numpy: support for multidimensional arrays
scikit-learn: machine learning library that we will use for preprocessing data and building regression models
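As a small taste of how those three fit together, the sketch below loads post records into a pandas DataFrame and fits a scikit-learn linear regression to quantify growth in likes per day. The records are invented for illustration; real ones would come from a scraper.

```python
# Fit a linear trend (likes per day) to a list of post records,
# using the pandas + numpy + scikit-learn stack named above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def likes_trend(posts):
    """Return the fitted slope of likes over time, in likes per day."""
    df = pd.DataFrame(posts)
    df["date"] = pd.to_datetime(df["date"])
    days = (df["date"] - df["date"].min()).dt.days.to_numpy().reshape(-1, 1)
    model = LinearRegression().fit(days, df["likes"].to_numpy())
    return float(model.coef_[0])

posts = [
    {"date": "2020-01-01", "likes": 100},
    {"date": "2020-01-11", "likes": 200},
    {"date": "2020-01-21", "likes": 300},
]
print(likes_trend(posts))  # ≈10 likes/day on this synthetic data
```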
With one of the early iterations of
instascrape in early March 2020, I used it to take a look at how various politicians' Instagram games stacked up against one another, specifically Bernie Sanders and Joe Biden:
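A comparison like that one boils down to computing each account's average likes per post over time. Here's a hedged sketch of that aggregation with pandas, using made-up numbers rather than the actual scraped data:

```python
# Average likes per post, grouped by calendar year, so two accounts'
# growth curves can be compared side by side.
import pandas as pd

def yearly_avg_likes(posts):
    """Return a Series of average likes per post, keyed by year."""
    df = pd.DataFrame(posts)
    year = pd.to_datetime(df["date"]).dt.year
    return df.groupby(year)["likes"].mean()

# Fabricated example records, not real campaign data.
bernie = [
    {"date": "2016-05-01", "likes": 5000},
    {"date": "2016-09-01", "likes": 7000},
    {"date": "2019-06-01", "likes": 25000},
]
print(yearly_avg_likes(bernie))
```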
Fascinating! Let's take a look at Bernie first. It appears he's enjoyed very steady growth on his Instagram since 2016, nearly quintupling his likes per post. Additionally, we can see when he's on the campaign trail based on the frequency of posts.
Now let's take a look at Joe. He has no posts prior to mid-2018, and it's clear he was enjoying fewer likes than Bernie at the time of this data collection. This certainly makes sense considering Bernie is so popular with younger voters, who make up a larger portion of social media users!
Yes that's a David Bowie reference; yes I am Chris Greening and my insta is in fact falling 😢... but that's okay 🤷! It made for a fun exercise to analyze. Let's check out the data:
Gasp! Shock! I know, it's tragic. But let's get down to it 😏. We can see that my growth was quite stagnant between 2016 and 2020, until mid-March of this year when my page suddenly blew up 😮 (quarantine was beginning and I decided to learn Photoshop). Let's zoom in a bit to just 2020 🔎:
I went from averaging <100 likes per post to almost 400 likes in just a matter of months with some posts netting over 800! We can also see that I was pretty steady with my frequency of posting until June when I slipped up and missed an entire month! Whoops! And it's all been downhill from there 😄. This type of data can be great though for seeing how a page is performing!
Let's take a look at a popular Instagram page right now and see how they're doing:
Wow, it's honestly kind of incredible how linear @dudewithsign's growth has been since he started posting; it's nearly an exact straight line.
In the same vein, using the same data as the previous exploration, we can also create a heatmap that shows the best time of day and day of the week for @chris_greening (me) to post to net the highest average engagement 🔥:
It certainly seems that I get the most engagement when I post in the late morning or early afternoon. Additionally, some of my most engaged-with posts were posted midweek, on Wednesday and Thursday. This is great information to keep in mind the next time I go to post something 👍.
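The grid behind a heatmap like that is just a pivot of average likes by weekday and posting hour. Here's a minimal sketch with pandas, using three fabricated records in place of real scraped posts:

```python
# Pivot average likes into a weekday x hour grid, the data layout
# a heatmap plot would be drawn from.
import pandas as pd

def engagement_grid(posts):
    """Return a DataFrame of mean likes indexed by weekday, columned by hour."""
    df = pd.DataFrame(posts)
    ts = pd.to_datetime(df["timestamp"])
    df["weekday"] = ts.dt.day_name()
    df["hour"] = ts.dt.hour
    return df.pivot_table(index="weekday", columns="hour",
                          values="likes", aggfunc="mean")

# Fabricated records: two Wednesday posts and one Thursday post.
posts = [
    {"timestamp": "2020-07-01 11:00", "likes": 300},
    {"timestamp": "2020-07-01 20:00", "likes": 150},
    {"timestamp": "2020-07-02 11:00", "likes": 280},
]
grid = engagement_grid(posts)
print(grid)
```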
For the final bit of data exploration I'll show in this post, let's take a look at the output of a program I wrote that watches a post's engagement as it grows in real time. The program tracked a post by @dacre_montgomery and gathered how many likes/comments it got as a time series across a 30-minute window:
The red/left y-axis represents likes on the post while the blue/right y-axis represents comments on the post. Incredibly enough, Dacre was able to amass over 35,000 likes and almost 400 comments in that time period alone (and that was after the post had already been up for an hour or so). That's more likes/comments than I have probably ever gotten on all my posts combined 😬
A future idea could be to write a script that watches a user's page continuously and as soon as a new post is detected, the real-time engagement tracker is triggered and we could watch the post grow across a longer period, say 8 hours!
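The core of a tracker like that is a polling loop that samples likes and comments at a fixed interval. Below is a sketch of that loop with the fetcher injected as a parameter, so it can be exercised with a fake fetcher here; a real one might wrap an instascrape `Post` scrape instead (that wiring is an assumption, not shown).

```python
# Poll a post's engagement at a fixed interval and record
# (elapsed_seconds, likes, comments) samples as a time series.
import time

def track_engagement(fetch, duration_s, interval_s, sleep=time.sleep):
    """Collect (elapsed, likes, comments) samples over `duration_s` seconds."""
    samples = []
    elapsed = 0
    while elapsed <= duration_s:
        likes, comments = fetch()
        samples.append((elapsed, likes, comments))
        if elapsed + interval_s > duration_s:
            break
        sleep(interval_s)
        elapsed += interval_s
    return samples

# Simulated fetcher: likes grow by 1000 and comments by 10 per poll.
counter = {"n": 0}
def fake_fetch():
    counter["n"] += 1
    return 1000 * counter["n"], 10 * counter["n"]

samples = track_engagement(fake_fetch, duration_s=120, interval_s=60,
                           sleep=lambda s: None)
print(samples)  # [(0, 1000, 10), (60, 2000, 20), (120, 3000, 30)]
```

The longer-running watcher described above would simply call this with a bigger `duration_s` (say, 8 hours) whenever a new post is detected.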
And there you have it! Leveraging
instascrape to gather data, I was able to perform some really great data exploration I wouldn't have been able to do otherwise. Not only was I able to explore my own profile, but I was able to look at the profiles of some public figures as well. These are just drops in a great ocean of possibilities you can accomplish using
instascrape and the data is just out there waiting for you!
Keep an eye out for a future post with more data exploration that will take a look at real-time hashtag growth, real-time post growth, and some more interesting examples to mess around with.
Let me know what you think in the comments below or even better, contribute to the official repo 😃
instascrape: powerful Instagram data scraping toolkit
What is it?
instascrape is a lightweight Python package that provides expressive and flexible tools for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.
Here are a few of the things that
instascrape does well:
- Powerful, object-oriented scraping tools for profiles, posts, hashtags, reels, and IGTV
- Scrape data as HTML, BeautifulSoup objects, or JSON
- Download content to your computer as png, jpg, mp4, and mp3
- Dynamically retrieve HTML embed code for posts
- Expressive and consistent API for concise and elegant code
- Designed for seamless integration with Selenium, Pandas, and other industry standard tools for data collection and analysis
- Lightweight; no boilerplate or configurations necessary
- The only hard dependencies are Requests and Beautiful Soup