- The Content
- The Tools
- The Exploration
- The Conclusion
My two recent posts have been light introductions to
instascrape: the lightweight, open source Instagram web scraper written for Python 🐍!
In this post, I'm going to show some of the ways I've personally explored Instagram programmatically using Python and
instascrape.
Like many other modern websites, Instagram uses a combination of server-side and client-side techniques, such as AJAX, to dynamically load content as you scroll. This lets Instagram respond to an HTTP request quickly and then load more content as it's needed, presenting the user with a clean, seamless user experience (UX) with infinite scrolling and fast page refreshes.
While great for UX, this dynamically rendered content can be a bit of a pain for web scraping... but no worries 🙌! There are ways around it, and we can take it in stride. For the most part, I use
instascrape for the scraping.
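To give a feel for what that looks like, here is a minimal sketch of a profile scrape following the `Profile(...).scrape()` pattern shown in instascrape's README, which populates attributes such as `followers`. The network call is gated behind a flag so the snippet runs anywhere; the engagement-rate helper is a purely illustrative addition of mine, not part of the library.

```python
# Sketch of a basic profile scrape with instascrape, per its README.
# The scrape itself needs `pip install insta-scrape` and network access,
# so it's gated behind a flag; the helper below is plain Python.

RUN_SCRAPE = False  # flip to True to actually hit Instagram

def engagement_rate(likes, comments, followers):
    """Engagement on a post as a fraction of the follower count."""
    return (likes + comments) / followers

if RUN_SCRAPE:
    from instascrape import Profile

    profile = Profile("chris_greening")
    profile.scrape()  # populates attributes like `followers`
    print(profile.followers)
    print(engagement_rate(400, 20, profile.followers))
```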
Regarding the data analysis, I use a combination of
pandas: powerful tools and data structures for analyzing and exploring data
numpy: support for multidimensional arrays
scikit-learn: machine learning library that we will use for preprocessing data and building regression models
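As a small taste of how those three fit together, the sketch below loads post records into a pandas DataFrame and fits a scikit-learn linear regression to quantify growth in likes per day. The records are invented for illustration; real ones would come from a scraper.

```python
# Fit a linear trend (likes per day) to a list of post records,
# using the pandas + numpy + scikit-learn stack named above.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def likes_trend(posts):
    """Return the fitted slope of likes over time, in likes per day."""
    df = pd.DataFrame(posts)
    df["date"] = pd.to_datetime(df["date"])
    days = (df["date"] - df["date"].min()).dt.days.to_numpy().reshape(-1, 1)
    model = LinearRegression().fit(days, df["likes"].to_numpy())
    return float(model.coef_[0])

posts = [
    {"date": "2020-01-01", "likes": 100},
    {"date": "2020-01-11", "likes": 200},
    {"date": "2020-01-21", "likes": 300},
]
print(likes_trend(posts))  # ≈10 likes/day on this synthetic data
```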
With one of the early iterations of
instascrape in early March 2020, I used it to take a look at how various politicians' Instagram games stacked up against one another, specifically Bernie Sanders and Joe Biden:
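A comparison like that one boils down to computing each account's average likes per post over time. Here's a hedged sketch of that aggregation with pandas, using made-up numbers rather than the actual scraped data:

```python
# Average likes per post, grouped by calendar year, so two accounts'
# growth curves can be compared side by side.
import pandas as pd

def yearly_avg_likes(posts):
    """Return a Series of average likes per post, keyed by year."""
    df = pd.DataFrame(posts)
    year = pd.to_datetime(df["date"]).dt.year
    return df.groupby(year)["likes"].mean()

# Fabricated example records, not real campaign data.
bernie = [
    {"date": "2016-05-01", "likes": 5000},
    {"date": "2016-09-01", "likes": 7000},
    {"date": "2019-06-01", "likes": 25000},
]
print(yearly_avg_likes(bernie))
```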
Fascinating! Let's take a look at Bernie first. It appears he's enjoyed very steady growth on his Instagram since 2016, nearly quintupling his likes per post. Additionally, we can see when he's on the campaign trail based on the frequency of posts.
Now let's take a look at Joe. He has no posts prior to mid-2018, and it's clear he was enjoying fewer likes than Bernie at the time of this data collection. This certainly makes sense considering Bernie is so popular with younger voters, who make up a larger portion of social media users!
Yes that's a David Bowie reference; yes I am Chris Greening and my insta is in fact falling 😢... but that's okay 🤷! It made for a fun exercise to analyze. Let's check out the data:
Gasp! Shock! I know, it's tragic. But let's get down to it 😏. We can see that my growth was quite stagnant between 2016 and 2020, until mid-March of this year when my page suddenly blew up 😮 (quarantine was beginning and I decided to learn Photoshop). Let's zoom in a bit to just 2020 🔎:
I went from averaging <100 likes per post to almost 400 likes in just a matter of months with some posts netting over 800! We can also see that I was pretty steady with my frequency of posting until June when I slipped up and missed an entire month! Whoops! And it's all been downhill from there 😄. This type of data can be great though for seeing how a page is performing!
Let's take a look at a popular Instagram page right now and see how they're doing:
Wow, it's honestly kind of incredible how linear @dudewithsign's growth has been since he started posting; it's nearly an exact straight line.
In the same vein, using the same data as the previous exploration, we can also create a heatmap that shows the best time of day and day of the week for @chris_greening (me) to post to net the highest average engagement 🔥:
It certainly seems that I get the most engagement when I post in the late morning or early afternoon. Additionally, some of my most engaged-with posts were posted midweek, on Wednesday and Thursday. This is great information to keep in mind the next time I go to post something 👍.
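The grid behind a heatmap like that is just a pivot of average likes by weekday and posting hour. Here's a minimal sketch with pandas, using three fabricated records in place of real scraped posts:

```python
# Pivot average likes into a weekday x hour grid, the data layout
# a heatmap plot would be drawn from.
import pandas as pd

def engagement_grid(posts):
    """Return a DataFrame of mean likes indexed by weekday, columned by hour."""
    df = pd.DataFrame(posts)
    ts = pd.to_datetime(df["timestamp"])
    df["weekday"] = ts.dt.day_name()
    df["hour"] = ts.dt.hour
    return df.pivot_table(index="weekday", columns="hour",
                          values="likes", aggfunc="mean")

# Fabricated records: two Wednesday posts and one Thursday post.
posts = [
    {"timestamp": "2020-07-01 11:00", "likes": 300},
    {"timestamp": "2020-07-01 20:00", "likes": 150},
    {"timestamp": "2020-07-02 11:00", "likes": 280},
]
grid = engagement_grid(posts)
print(grid)
```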
For the final bit of data exploration I'll show in this post, let's take a look at the output of a program I wrote that watches a post's engagement as it grows in real time. The program tracked a post by @dacre_montgomery and gathered how many likes/comments it got as a time series across a 30-minute window:
The red/left y-axis represents likes on the post while the blue/right y-axis represents comments on the post. Incredibly enough, Dacre was able to amass over 35,000 likes and almost 400 comments in that time period alone (and that was after the post had already been up for an hour or so). That's more likes/comments than I have probably ever gotten on all my posts combined 😬
A future idea could be to write a script that watches a user's page continuously and as soon as a new post is detected, the real-time engagement tracker is triggered and we could watch the post grow across a longer period, say 8 hours!
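The core of a tracker like that is a polling loop that samples likes and comments at a fixed interval. Below is a sketch of that loop with the fetcher injected as a parameter, so it can be exercised with a fake fetcher here; a real one might wrap an instascrape `Post` scrape instead (that wiring is an assumption, not shown).

```python
# Poll a post's engagement at a fixed interval and record
# (elapsed_seconds, likes, comments) samples as a time series.
import time

def track_engagement(fetch, duration_s, interval_s, sleep=time.sleep):
    """Collect (elapsed, likes, comments) samples over `duration_s` seconds."""
    samples = []
    elapsed = 0
    while elapsed <= duration_s:
        likes, comments = fetch()
        samples.append((elapsed, likes, comments))
        if elapsed + interval_s > duration_s:
            break
        sleep(interval_s)
        elapsed += interval_s
    return samples

# Simulated fetcher: likes grow by 1000 and comments by 10 per poll.
counter = {"n": 0}
def fake_fetch():
    counter["n"] += 1
    return 1000 * counter["n"], 10 * counter["n"]

samples = track_engagement(fake_fetch, duration_s=120, interval_s=60,
                           sleep=lambda s: None)
print(samples)  # [(0, 1000, 10), (60, 2000, 20), (120, 3000, 30)]
```

The longer-running watcher described above would simply call this with a bigger `duration_s` (say, 8 hours) whenever a new post is detected.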
And there you have it! Leveraging
instascrape to gather data, I was able to perform some really great data exploration I wouldn't have been able to do otherwise. Not only was I able to explore my own profile, but I was able to look at the profiles of some public figures as well. These are just drops in a great ocean of possibilities you can accomplish using
instascrape and the data is just out there waiting for you!
Keep an eye out for a future post with more data exploration that will take a look at real-time hashtag growth, real-time post growth, and some more interesting examples to mess around with.
Let me know what you think in the comments below or even better, contribute to the official repo 😃
instascrape: powerful Instagram data scraping toolkit
What is it?
instascrape is a lightweight Python package that provides expressive and flexible tools for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.
Here are a few of the things that
instascrape does well:
- Powerful, object-oriented scraping tools for profiles, posts, hashtags, reels, and IGTV
- Scrape data as HTML, BeautifulSoup objects, or JSON
- Download content to your computer as png, jpg, mp4, and mp3
- Dynamically retrieve HTML embed code for posts
- Expressive and consistent API for concise and elegant code
- Designed for seamless integration with Selenium, Pandas, and other industry standard tools for data collection and analysis
- Lightweight; no boilerplate or configurations necessary
- The only hard dependencies are Requests and Beautiful Soup