Exploratory data analysis of Instagram using instascrape and Python

Chris Greening ・6 min read

My two most recent posts have been light introductions to instascrape: the lightweight, open source Instagram web scraper written for Python 🐍!

In this post, I'm going to show some of the ways I've personally explored Instagram programmatically using Python and instascrape 😉

The Content

🚶 Working with static content...

On its own, instascrape is a purely static web scraper. That means it only scrapes the initial source HTML served back by Instagram and does not deal with dynamic content rendered by JavaScript.
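To make "static" a bit more concrete, here's a minimal sketch using only the standard library. Everything in it is made up for illustration: the sample HTML, the `initial-data` script tag, and the `extract_embedded_json` helper are assumptions, not Instagram's real markup or instascrape's internals. The point is only that a static scraper can read whatever is already in the initial HTML response and nothing more.

```python
import json
from html.parser import HTMLParser

# Hypothetical stand-in for the initial HTML a server sends back.
SAMPLE_HTML = """
<html>
  <head><title>demo profile</title></head>
  <body>
    <script type="application/json" id="initial-data">
      {"username": "demo_user", "followers": 1234, "posts": 56}
    </script>
    <div id="feed"></div>
  </body>
</html>
"""


class EmbeddedJSONParser(HTMLParser):
    """Collect the text inside <script type="application/json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._in_json_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_script = False

    def handle_data(self, data):
        if self._in_json_script:
            self.chunks.append(data)


def extract_embedded_json(html):
    """Return the JSON embedded in the page as a Python dict."""
    parser = EmbeddedJSONParser()
    parser.feed(html)
    return json.loads("".join(parser.chunks))


profile = extract_embedded_json(SAMPLE_HTML)
print(profile["username"], profile["followers"])
```

Notice that the empty `feed` div stays empty: anything JavaScript would have loaded into it later is invisible to this kind of scraper.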

...and dealing with the dynamic 🏃

Like many other modern websites, Instagram uses a combination of server-side and client-side technologies, such as AJAX, to dynamically load content as you scroll. This allows Instagram to respond to an HTTP request quickly and then load more content as it's needed. The result is a clean, seamless user experience (UX) with infinite scrolling and fast page refresh times.

While great for UX, this dynamically rendered content can become a bit of a pain for web scraping... but no worries 🙌! There are ways we can get around this and take it in stride. For the most part, I use selenium, which allows us to automate web browsers such as Google Chrome and Firefox using Python 💻! With this tool at our disposal, we can render the JavaScript and grab the HTML as it's loaded, integrating it into instascrape for scraping.
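Here's a rough sketch of the "scroll until nothing new loads" idea. The `scroll_to_bottom` helper is hypothetical (it's not part of selenium or instascrape); it only relies on two things a selenium WebDriver provides, `execute_script` and `page_source`. The pause length and scroll limit are arbitrary guesses you'd want to tune.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_scrolls=20):
    """Scroll until the page height stops growing, then return the HTML."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more content
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we've hit the bottom
        last_height = new_height
    return driver.page_source


# Usage with a real browser (needs selenium and a browser driver installed):
# from selenium import webdriver
# driver = webdriver.Chrome()
# driver.get("https://www.instagram.com/chris_greening/")
# html = scroll_to_bottom(driver)
# driver.quit()
```

The returned HTML is the fully rendered page, which can then be handed off to whatever does the parsing.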

The Tools

💾 Data processing: the before and after

For the data analysis, I use a combination of:

  • pandas: powerful tools and data structures for analyzing and exploring data
  • numpy: support for multidimensional arrays
  • scikit-learn: machine learning library that we will use for preprocessing data and building regression models

👁️ Data visualization and interaction

The library I use for data visualization is matplotlib. I use Jupyter Notebook or the IPython console for interactivity.

The Exploration

👴 Takin' a peek at politicians

With one of the early iterations of instascrape in early March 2020, I used it to take a look at how various politicians' Instagram games stacked up against one another, specifically Bernie Sanders and Joe Biden:

Fascinating! Let's take a look at Bernie first. It appears he's enjoyed very steady growth on his Instagram since 2016, nearly quintupling his likes per post. Additionally, we can see when he's on the campaign trail based on the frequency of posts.

Now let's take a look at Joe. He has no posts prior to mid-2018 and it's clear he enjoyed fewer likes than Bernie did at the time of this data collection. This certainly makes sense considering Bernie is so popular with younger voters, who make up a larger portion of social media users!

📉 The rise and fall of @chris_greening

Yes that's a David Bowie reference; yes I am Chris Greening and my insta is in fact falling 😢... but that's okay 🤷! It made for a fun exercise to analyze. Let's check out the data:

Gasp! Shock! I know, it's tragic. But let's get down to it 😏. We can see that my growth was quite stagnant between 2016 and 2020 until mid-March of this year, when my page suddenly blew up 😮 (quarantine was beginning and I decided to learn Photoshop). Let's zoom in a bit to just 2020 🔎:

I went from averaging <100 likes per post to almost 400 likes in just a matter of months with some posts netting over 800! We can also see that I was pretty steady with my frequency of posting until June when I slipped up and missed an entire month! Whoops! And it's all been downhill from there 😄. This type of data can be great though for seeing how a page is performing!

Let's take a look at a popular Instagram page right now and see how they're doing:

Wow, honestly kind of incredible how linear @dudewithsign's growth has been since he started posting, nearly an exact straight line.
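If you wanted to put a number on "nearly an exact straight line", one option is to fit a line and look at R². This is only a sketch with made-up numbers (not real @dudewithsign data), and it uses numpy's `polyfit` rather than whatever was used to eyeball the original plot.

```python
import numpy as np

# Made-up numbers for illustration only.
days_since_first_post = np.array([0, 30, 60, 90, 120, 150], dtype=float)
likes = np.array([1000, 4100, 6900, 10200, 12800, 16100], dtype=float)

# Fit likes ≈ slope * days + intercept.
slope, intercept = np.polyfit(days_since_first_post, likes, deg=1)

# R² close to 1 means the points sit close to a straight line.
predicted = slope * days_since_first_post + intercept
ss_res = np.sum((likes - predicted) ** 2)
ss_tot = np.sum((likes - likes.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"slope: {slope:.1f} likes/day, R²: {r_squared:.3f}")
```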

⌚ Determining best time of day/day of week to post

In the same vein, using the same data as the previous exploration, we can also create a heatmap that shows the best time of day/day of week for @chris_greening (me) to post to net the highest average engagement 🔥:

It certainly seems that I get the most engagement when I post in the late morning/early afternoon. Additionally, we can see that some of my most-engaged posts went up in the middle of the week, on Wednesday and Thursday. This is great information to keep in mind the next time I go to post something 👍.
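The data behind a heatmap like this can be built with a pandas pivot table. The sketch below uses a handful of made-up posts, and the `timestamp` and `likes` column names are assumptions for the example, not the exact fields from my scraped data.

```python
import pandas as pd

# Made-up posts for illustration; real data would come from the scraper.
posts = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-09-02 11:30", "2020-09-03 12:15", "2020-09-05 21:00",
        "2020-09-09 11:05", "2020-09-12 08:45", "2020-09-16 11:50",
    ]),
    "likes": [420, 390, 150, 460, 120, 440],
})

posts["day"] = posts["timestamp"].dt.day_name()
posts["hour"] = posts["timestamp"].dt.hour

# Rows are days of the week, columns are hours, cells are mean likes.
heatmap = posts.pivot_table(
    index="day", columns="hour", values="likes", aggfunc="mean"
)
print(heatmap)

# Plotting (optional, needs matplotlib):
# import matplotlib.pyplot as plt
# plt.imshow(heatmap, aspect="auto")
# plt.colorbar(label="mean likes")
# plt.show()
```

Day/hour combinations with no posts show up as NaN, which is worth remembering before reading too much into a sparse heatmap.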

⌛ Scraping a post in real time

For the final bit of data exploration I'll show in this post, let's take a look at the output of a program I wrote that watches a post's engagement as it grows in real time. The program tracked a post by @dacre_montgomery and gathered how many likes/comments it got as a time series across a 30-minute window:

The red/left y-axis represents likes on the post while the blue/right y-axis represents comments on the post. Incredibly enough, Dacre was able to amass over 35,000 likes and almost 400 comments in that time period alone (and that was after the post had already been up for an hour or so). That's more likes/comments than I have probably ever gotten on all my posts combined 😬
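At its core, a tracker like this is just a polling loop. The sketch below is not my original program: `track_engagement` and its parameters are hypothetical, and the data source is passed in as a plain function so the example runs without touching Instagram. In practice, `fetch_counts` would wrap whatever scrapes the post's current likes and comments.

```python
import time
from datetime import datetime


def track_engagement(fetch_counts, interval_s=60, duration_s=1800, sleep=time.sleep):
    """Poll fetch_counts() every interval_s seconds for duration_s seconds.

    fetch_counts is any zero-argument callable returning (likes, comments).
    """
    samples = []
    n_polls = duration_s // interval_s + 1
    for i in range(n_polls):
        likes, comments = fetch_counts()
        samples.append(
            {"time": datetime.now(), "likes": likes, "comments": comments}
        )
        if i < n_polls - 1:
            sleep(interval_s)  # wait before the next sample
    return samples


# Demo with a fake data source and no real waiting.
fake_counts = iter([(100, 5), (180, 9), (260, 12), (310, 15)])
demo = track_engagement(
    lambda: next(fake_counts), interval_s=60, duration_s=180, sleep=lambda s: None
)
print([(s["likes"], s["comments"]) for s in demo])
```

The resulting list of samples drops straight into a pandas DataFrame, and matplotlib's `Axes.twinx()` is one way to get the two separate y-axes for likes and comments.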

A future idea could be to write a script that watches a user's page continuously and as soon as a new post is detected, the real-time engagement tracker is triggered and we could watch the post grow across a longer period, say 8 hours!

The Conclusion

And there you have it! Leveraging instascrape to gather data, I was able to perform some really great data exploration I wouldn't have been able to do otherwise. Not only was I able to explore my own profile, but I was able to look at the profiles of some public figures as well. These are just drops in a great ocean of possibilities you can explore using instascrape, and the data is just out there waiting for you!

Keep an eye out for a future post with more data exploration that will take a look at real-time hashtag growth, real-time post growth, and some more interesting examples to mess around with.

Let me know what you think in the comments below or even better, contribute to the official repo 😃

chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

instascrape: powerful Instagram data scraping toolkit

What is it?

instascrape is a lightweight Python package that provides expressive and flexible tools for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.

Key features

Here are a few of the things that instascrape does well:

  • Powerful, object-oriented scraping tools for profiles, posts, hashtags, reels, and IGTV
  • Scrapes HTML, BeautifulSoup, and JSON
  • Download content to your computer as png, jpg, mp4, and mp3
  • Dynamically retrieve HTML embed code for posts
  • Expressive and consistent API for concise and elegant code
  • Designed for seamless integration with Selenium, Pandas, and other industry standard tools for data collection and analysis
  • Lightweight; no boilerplate or configurations necessary
  • The only hard dependencies are Requests and Beautiful
