Creating a scatter matrix of Instagram data using Python

Chris Greening ・ 3 min read

In this post, I'm going to show you how you can use Python to visualize the relationship between Instagram variables using a scatter matrix. This will allow you to optimize your engagement and work towards beating the infamous algorithm!

📚 Libraries

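We'll be working in three distinct steps (gathering the posts, scraping the data, and analyzing and visualizing it) with four distinct libraries: selenium, BeautifulSoup, instascrape, and pandas.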

📂 Gathering the posts

For the sake of brevity, I will mostly summarize the code for this part but a very similar full version can be found in this repo.

Since Instagram dynamically loads content, we aren't able to make simple GET requests and call it a day. Instead, we have to be clever and use a tool that can render JavaScript, simulating a user interacting with a webpage.

This is where selenium is gonna come in handy; it's a library that allows us to automate web browsers such as Google Chrome or Firefox programmatically using Python 🐍.

The script we use will:

  1. go to an Instagram page
  2. scroll it automatically
  3. gather the HTML at each scroll
  4. compare the HTML and find differences

One of the handy features that selenium provides is the ability to inject our own JavaScript into the browser so it can interact with the webpage. In this case, we will use this script to continuously scroll the page in a loop:

//JavaScript scroll script
window.scrollTo(0, document.body.scrollHeight);
var lenOfPage=document.body.scrollHeight;
return lenOfPage;
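On the Python side, the scroll loop can be sketched roughly like this. It's a minimal sketch and not the exact code from the repo: scroll_and_collect, max_scrolls, and pause are names I made up for illustration, and driver is assumed to be a selenium WebDriver that has already loaded the profile page. The only selenium features it relies on are execute_script and page_source.

```python
import time

# The same JavaScript scroll script, passed to the browser as a string.
SCROLL_SCRIPT = """
window.scrollTo(0, document.body.scrollHeight);
var lenOfPage = document.body.scrollHeight;
return lenOfPage;
"""


def scroll_and_collect(driver, max_scrolls=50, pause=2.0):
    """Scroll until the page height stops growing.

    Returns the HTML captured after each scroll. `driver` is assumed to be
    a selenium WebDriver (hypothetical helper, not part of any library).
    """
    snapshots = []
    last_height = None
    for _ in range(max_scrolls):
        height = driver.execute_script(SCROLL_SCRIPT)
        time.sleep(pause)  # give the page time to load new posts
        snapshots.append(driver.page_source)
        if height == last_height:
            break  # height did not change, so assume we reached the bottom
        last_height = height
    return snapshots
```

With a real browser you would create the driver first (for example webdriver.Chrome() followed by driver.get(...) on the profile URL) and then call scroll_and_collect(driver). The right pause depends on how quickly the page loads new posts, so treat the default as a guess.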

After each scroll, we use BeautifulSoup to get the unique shortcode of every post on the profile that we just scrolled. instascrape can then use each shortcode to construct an instascrape.Post object for scraping with Post.from_shortcode.
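To make the shortcode step concrete, here is a dependency-free sketch that swaps BeautifulSoup for a plain regular expression. It assumes post links on a profile look like /p/<shortcode>/, and extract_shortcodes is a hypothetical helper name, so treat it as an illustration of the idea and not the code from the repo.

```python
import re

# Assumption: each post on a profile is linked as href="/p/<shortcode>/"
SHORTCODE_PATTERN = re.compile(r'href="/p/([A-Za-z0-9_-]+)/')


def extract_shortcodes(html_snapshots):
    """Return the unique shortcodes found across all snapshots, in first-seen order."""
    seen = {}  # dict keys keep insertion order and drop duplicates
    for html in html_snapshots:
        for code in SHORTCODE_PATTERN.findall(html):
            seen.setdefault(code, None)
    return list(seen)
```

Because consecutive scrolls overlap, the same post usually shows up in more than one snapshot, which is why the helper deduplicates. Each resulting shortcode is what gets handed to Post.from_shortcode.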

🔧 Scraping the data

Assuming we have created a list of Post objects called post_objects, we are now ready to scrape the data we need. Leveraging instascrape, all we have to do to scrape each post is:

for post in post_objects:
    post.scrape()

And that's it! Each scrape loads a ton of data points, ranging from the number of likes to the hashtags used, tagged users, upload datetime, and more.
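If you're scraping a long list of posts, a single scrape that raises would stop a bare loop partway through. Here's a minimal sketch of a more forgiving loop. The scrape_all helper and its pause parameter are hypothetical, and since I'm not assuming anything about which exceptions a scrape can raise, it catches Exception broadly.

```python
import time


def scrape_all(post_objects, pause=1.0):
    """Call .scrape() on every post and return (scraped, failed).

    Hypothetical helper: `failed` holds (post, error) pairs so they can
    be inspected or retried later.
    """
    scraped, failed = [], []
    for post in post_objects:
        try:
            post.scrape()
            scraped.append(post)
        except Exception as error:  # assumption: the error types are unknown
            failed.append((post, error))
        time.sleep(pause)  # assumed delay between requests, tune as needed
    return scraped, failed
```

Calling scraped, failed = scrape_all(post_objects) leaves you with the posts that are safe to load into a DataFrame, plus a list of the ones worth a second look.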

📊 Analysis and visualization

To get our data all neat and tidy, we're going to instantiate a pandas.DataFrame that will store our data:

import pandas as pd 

dataframe = pd.DataFrame([post.to_dict() for post in post_objects])

Now that we have an expressive and powerful way of handling our data, we can derive a few more useful columns:

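# Split the upload timestamp into hour of day and day of week (Monday=0)
dataframe["upload_hour"] = dataframe["upload_date"].dt.hour
dataframe["upload_weekday"] = dataframe["upload_date"].dt.weekday
# Count the items in each post's list of tagged users and hashtags
dataframe["amt_tagged_users"] = dataframe["tagged_users"].str.len()
dataframe["amt_hashtags"] = dataframe["hashtags"].str.len()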
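If you want to try these transformations without scraping anything, here's a small self-contained sketch. The two records are made up and simply stand in for the output of post.to_dict(); I'm assuming upload_date comes back as a datetime and that tagged_users and hashtags are plain Python lists.

```python
from datetime import datetime

import pandas as pd

# Made-up records standing in for [post.to_dict() for post in post_objects]
records = [
    {"upload_date": datetime(2021, 1, 4, 12, 30), "likes": 120, "comments": 8,
     "tagged_users": ["a", "b"], "hashtags": ["#nyc", "#photo", "#art"]},
    {"upload_date": datetime(2021, 1, 9, 18, 5), "likes": 95, "comments": 3,
     "tagged_users": [], "hashtags": ["#sunset"]},
]
dataframe = pd.DataFrame(records)

# .dt exposes datetime parts; weekday runs from Monday=0 to Sunday=6
dataframe["upload_hour"] = dataframe["upload_date"].dt.hour
dataframe["upload_weekday"] = dataframe["upload_date"].dt.weekday

# .str.len() also works on list values and returns each list's length
dataframe["amt_tagged_users"] = dataframe["tagged_users"].str.len()
dataframe["amt_hashtags"] = dataframe["hashtags"].str.len()

print(dataframe[["upload_hour", "upload_weekday", "amt_tagged_users", "amt_hashtags"]])
```

The .str.len() trick is the one worth remembering: even though the accessor is named for strings, it measures any sized value, so it turns a column of lists into a column of counts.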

Now, to visualize it we use pandas.plotting.scatter_matrix which will let us view a matrix of scatter plots that show the different interactions between variables! For this example, we'll compare the

  • hour of the upload
  • day of the week
  • amount of comments
  • amount of likes
  • amount of tagged users
  • amount of hashtags

Using my own personal Instagram page (@chris_greening) and plotting four of these columns, we get:

pd.plotting.scatter_matrix(dataframe[['likes', 'comments', 'amt_tagged_users', 'upload_hour']], figsize=(8,8))

[Image: scatter matrix of likes, comments, amt_tagged_users, and upload_hour]

Analyzing this scatter matrix, we can now look at how different variables interact with one another and get an idea of what we can do to better boost our engagement 🙌
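A scatter matrix is easier to read with a few numbers next to it. As a rough sketch, DataFrame.corr gives the pairwise Pearson correlation for the same columns. The values below are invented purely so the example runs on its own; swap in your real dataframe.

```python
import pandas as pd

# Invented values, only here to make the example self-contained
toy = pd.DataFrame({
    "likes": [100, 150, 90, 200, 170],
    "comments": [5, 9, 4, 14, 10],
    "amt_tagged_users": [0, 2, 1, 3, 0],
    "upload_hour": [8, 12, 20, 12, 13],
})

# Pairwise Pearson correlation between every pair of columns
correlations = toy.corr()
print(correlations.round(2))
```

Keep in mind that Pearson correlation only captures straight-line relationships. A pattern like "likes peak in the middle of the day" can have a correlation near zero and still be obvious in the scatter plot, which is a good reason to look at both.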

For example, looking at the scatter plot that compares upload_hour and likes, we see a peak sometime around noon. This suggests that, on average, my posts uploaded around noon have earned the most likes, so that's a good candidate for my best posting time.
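To check an eyeballed peak like this one, you can average likes per upload hour. This is a minimal sketch on made-up numbers, not my actual data; the column names match the ones used in this post.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    "upload_hour": [8, 12, 12, 13, 20, 20],
    "likes": [100, 210, 190, 170, 90, 110],
})

# Average likes and number of posts for each upload hour
hourly = toy.groupby("upload_hour")["likes"].agg(["mean", "count"])
best_hour = hourly["mean"].idxmax()

print(hourly)
print("Hour with the highest average likes:", best_hour)
```

The count column matters as much as the mean: an hour with only one or two posts can top the list by luck, so it's worth checking how many posts sit behind each average.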

This is just one relationship and there are plenty more to be discovered! Let me know what other relationships you found interesting in the comments below ❤️

📰 Additional resources

If you want to learn more about exploratory data analysis using instascrape, check out my other blog posts.

💻 The official repo

instascrape is always looking for more contributors; come join us at the official repo.

chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

instascrape: powerful Instagram data scraping toolkit

What is it?

instascrape is a lightweight Python package that provides expressive and flexible tools for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.

Key features

Here are a few of the things that instascrape does well:

  • Powerful, object-oriented scraping tools for profiles, posts, hashtags, reels, and IGTV
  • Scrapes HTML, BeautifulSoup, and JSON
  • Download content to your computer as png, jpg, mp4, and mp3
  • Dynamically retrieve HTML embed code for posts
  • Expressive and consistent API for concise and elegant code
  • Designed for seamless integration with Selenium, Pandas, and other industry standard tools for data collection and analysis
  • Lightweight; no boilerplate or configurations necessary
  • The only hard dependencies are Requests and Beautiful
