DEV Community

Chris Greening

Posted on Oct 25, 2020 • Edited on Dec 11, 2020

Creating a scatter matrix of Instagram data using Python

#python #datascience #webscraping #opensource

In this post, I'm going to show you how you can use Python to visualize the relationship between Instagram variables using a scatter matrix. This will allow you to optimize your engagement and work towards beating the infamous algorithm!

📚 Libraries

We'll be working in three distinct steps with four distinct libraries:

Selenium: Automate a web browser to get the HTML
BeautifulSoup: Parse and scrape the HTML
Instascrape: Scrape the posts and load the data
Pandas: Organize and visualize the data

📂 Gathering the posts

For the sake of brevity, I will mostly summarize the code for this part, but a very similar full version can be found in this repo.

Since Instagram dynamically loads content, we aren't able to make simple GET requests and call it a day. Instead, we have to be clever and use a tool that can render JavaScript, simulating a user interacting with a webpage.

This is where selenium is gonna come in handy; it's a library that allows us to automate web browsers such as Google Chrome or Firefox programmatically using Python 🐍.

The script we use will:

go to an Instagram page
scroll it automatically
gather the HTML at each scroll
compare the HTML and find differences

One of the handy features that selenium provides us is the ability to inject our own JavaScript scripts in the browser that interact with the webpage. In this case, we will use this script to continuously scroll the page in a loop:

//JavaScript scroll script
window.scrollTo(0, document.body.scrollHeight);
var lenOfPage=document.body.scrollHeight;
return lenOfPage;
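Here's a rough sketch of how that script might be driven from Python. It only assumes the driver exposes `execute_script` and `page_source`, which is how a Selenium WebDriver behaves. The function name, `max_scrolls`, and `pause` are names I made up for illustration, and the stopping rule (quit when the page height stops growing) is an assumption rather than the exact logic of the full script.

```python
import time

# The same scroll script as above, kept as a string so it can be injected.
SCROLL_SCRIPT = """
window.scrollTo(0, document.body.scrollHeight);
var lenOfPage = document.body.scrollHeight;
return lenOfPage;
"""


def scroll_and_collect(driver, max_scrolls=10, pause=1.0):
    """Scroll the page and return the HTML captured after each scroll.

    `driver` is any object with `execute_script` and `page_source`
    (hypothetical helper; not part of Selenium or instascrape).
    """
    snapshots = []
    last_height = None
    for _ in range(max_scrolls):
        height = driver.execute_script(SCROLL_SCRIPT)
        time.sleep(pause)  # give the page time to load new posts
        snapshots.append(driver.page_source)
        if height == last_height:
            break  # height unchanged: assume nothing new was loaded
        last_height = height
    return snapshots
```

With Selenium installed, you would pass in a real browser, for example a `webdriver.Chrome()` instance after calling `driver.get(...)` on the profile URL.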

After each scroll, we use BeautifulSoup to get the unique shortcode of every post on the profile that we just scrolled. This can be used by instascrape to construct an instascrape.Post object for scraping with Post.from_shortcode.
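To make the shortcode step a little more concrete, here's a minimal sketch. Two things to flag: it uses Python's built-in `html.parser` as a stand-in for BeautifulSoup so that it runs without extra installs, and it assumes post links look like `/p/<shortcode>/` in the profile HTML. The class and function names are my own for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: post links look like /p/<shortcode>/ in the profile HTML.
POST_HREF = re.compile(r"^/p/([A-Za-z0-9_-]+)/?")


class ShortcodeParser(HTMLParser):
    """Collect post shortcodes from <a href="/p/..."> tags, in page order."""

    def __init__(self):
        super().__init__()
        self.shortcodes = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = POST_HREF.match(href)
        if match and match.group(1) not in self.shortcodes:
            self.shortcodes.append(match.group(1))


def extract_shortcodes(html_snapshots):
    """Return the unique shortcodes found across several HTML snapshots."""
    seen = []
    for html in html_snapshots:
        parser = ShortcodeParser()
        parser.feed(html)
        parser.close()
        for code in parser.shortcodes:
            if code not in seen:  # skip posts already found in earlier scrolls
                seen.append(code)
    return seen
```

Because duplicates are dropped across snapshots, only the posts that are new since the previous scroll get added each time.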

🔧 Scraping the data

Assuming we have created a list of Post objects called post_objects, we are now ready to scrape the data we need. Leveraging instascrape, all we have to do to scrape each post is:

for post in post_objects:
    post.scrape()

And that's it! Each scrape loads a ton of data points, including the number of likes, the hashtags used, the tagged users, the upload datetime, and more.
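If you're scraping a lot of posts, a slightly more defensive loop may be worth having. This is only a sketch: it assumes each object has a `scrape()` method as above, and the helper name `scrape_all` and the `pause` value are my own choices, not anything instascrape provides.

```python
import time


def scrape_all(posts, pause=2.0):
    """Call .scrape() on each post, pausing between requests.

    Returns (scraped, failed), where `failed` holds (post, exception) pairs.
    Hypothetical helper for illustration only.
    """
    scraped, failed = [], []
    for post in posts:
        try:
            post.scrape()
            scraped.append(post)
        except Exception as exc:  # broad on purpose: failure types are not assumed
            failed.append((post, exc))
        time.sleep(pause)  # space out requests instead of firing them back to back
    return scraped, failed
```

Keeping the failures around lets you retry them later without rescraping everything.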

📊 Analysis and visualization

To get our data all neat and tidy, we're going to instantiate a pandas.DataFrame that will store our data:

import pandas as pd 

dataframe = pd.DataFrame([post.to_dict() for post in post_objects])

Now that we have an expressive and powerful way of handling our data, we can create some more useful columns with:

# upload_date needs to be a datetime column for the .dt accessor to work
dataframe["upload_hour"] = dataframe["upload_date"].dt.hour
dataframe["upload_weekday"] = dataframe["upload_date"].dt.weekday  # Monday=0, Sunday=6
# assuming tagged_users and hashtags hold lists, .str.len() counts the items in each
dataframe["amt_tagged_users"] = dataframe["tagged_users"].str.len()
dataframe["amt_hashtags"] = dataframe["hashtags"].str.len()
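To see what these derived columns look like, here's a tiny self-contained example. The rows are made up and only stand in for scraped posts; real data will have many more columns.

```python
import pandas as pd

# Made-up rows standing in for scraped posts (not real Instagram data).
toy = pd.DataFrame(
    {
        "upload_date": pd.to_datetime(["2020-10-19 12:30:00", "2020-10-24 18:05:00"]),
        "tagged_users": [["alice", "bob"], []],
        "hashtags": [["python"], ["nyc", "sunset", "physics"]],
    }
)

toy["upload_hour"] = toy["upload_date"].dt.hour
toy["upload_weekday"] = toy["upload_date"].dt.weekday  # Monday=0, Sunday=6
toy["amt_tagged_users"] = toy["tagged_users"].str.len()  # length of each list
toy["amt_hashtags"] = toy["hashtags"].str.len()

print(toy[["upload_hour", "upload_weekday", "amt_tagged_users", "amt_hashtags"]])
```

The first row was uploaded on a Monday and the second on a Saturday, so `upload_weekday` comes out as 0 and 5.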

Now, to visualize it, we use pandas.plotting.scatter_matrix, which lets us view a matrix of scatter plots showing how each pair of variables relates! The columns we now have to work with are the

hour of the upload
day of the week
number of comments
number of likes
number of tagged users
number of hashtags

Using my own personal Instagram page (@chris_greening) and four of these columns, we get:

pd.plotting.scatter_matrix(dataframe[['likes', 'comments', 'amt_tagged_users', 'upload_hour']], figsize=(8,8))

Analyzing this scatter matrix, we can now look at how different variables interact with one another and get an idea of what we can do to better boost our engagement 🙌

For example, looking at the scatter plot that compares upload_hour and likes, we see a peak sometime around noon. This suggests that, for my account at least, posts uploaded around noon tend to get the most likes.
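A peak in a scatter plot can be driven by one or two standout posts, so it might be worth checking the averages too. Here's a small sketch using made-up numbers (not my real data) that groups by upload hour and looks at the mean likes and how many posts each hour has.

```python
import pandas as pd

# Made-up numbers purely for illustration; they are not real Instagram data.
example = pd.DataFrame(
    {
        "upload_hour": [9, 12, 12, 18, 18, 18],
        "likes": [40, 120, 100, 60, 70, 50],
    }
)

# Mean likes and number of posts for each upload hour.
by_hour = example.groupby("upload_hour")["likes"].agg(["mean", "count"])
best_hour = by_hour["mean"].idxmax()

print(by_hour)
print("Hour with the highest average likes:", best_hour)
```

The `count` column matters here: an hour with only one or two posts doesn't tell you much, however high its average is.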

This is just one relationship and there are plenty more to be discovered! Let me know what other relationships you found interesting in the comments below ❤️

📰 Additional resources

If you want to learn more about exploratory data analysis using instascrape, check out my other blog posts:

Exploratory data analysis of Instagram using instascrape and Python (Oct 22 '20)
Visualizing Instagram engagement with instascrape and Python (Oct 21 '20)

💻 The official repo

instascrape is always looking for more contributors, so come join us at the official repo:

chris-greening / instascrape

Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically

instascrape: powerful Instagram data scraping toolkit

Note: This module is no longer actively maintained.

DISCLAIMER:

Instagram has gotten increasingly strict with scraping and using this library can result in getting flagged for botting AND POSSIBLE DISABLING OF YOUR INSTAGRAM ACCOUNT. This is a research project and I am not responsible for how you use it. Independently, the library is designed to be responsible and respectful and it is up to you to decide what you do with it. I don't claim any responsibility if your Instagram account is affected by how you use this library.

What is it?

instascrape is a lightweight Python package that provides an expressive and flexible API for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.

Key features

…


