In this post, I'm going to show you how you can use Python to visualize the relationship between Instagram variables using a scatter matrix. This will allow you to optimize your engagement and work towards beating the infamous algorithm!
📚 Libraries
We'll be working in three distinct steps with four distinct libraries:
- Selenium: Automate a web browser to get the HTML
- BeautifulSoup: Parse and scrape the HTML
- Instascrape: Scrape the posts and load the data
- Pandas: Organize and visualize the data
📂 Gathering the posts
For the sake of brevity, I will mostly summarize the code for this part but a very similar full version can be found in this repo.
Since Instagram dynamically loads content, we aren't able to make simple GET requests and call it a day. Instead, we have to be clever and use a tool that can render JavaScript, simulating a user interacting with a webpage.
This is where selenium is going to come in handy; it's a library that lets us automate web browsers such as Google Chrome or Firefox programmatically using Python 🐍.
The script we use will:
- go to an Instagram page
- scroll it automatically
- gather the HTML at each scroll
- compare the HTML and find differences
One of the handy features that selenium provides is the ability to inject our own JavaScript into the browser to interact with the webpage. In this case, we will use this script to scroll the page continuously in a loop:
//JavaScript scroll script
window.scrollTo(0, document.body.scrollHeight);
var lenOfPage=document.body.scrollHeight;
return lenOfPage;
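To make the scroll-and-collect steps concrete, here is a minimal sketch in Python. It is not the script from the repo: the function name `collect_scrolled_html` and its parameters are made up for illustration, and it only assumes that `driver` behaves like a selenium WebDriver (it has `execute_script` and `page_source`).

```python
import time


def collect_scrolled_html(driver, max_scrolls=10, pause_seconds=1.0):
    """Scroll to the bottom repeatedly and keep the HTML seen after each scroll.

    `driver` is assumed to behave like a selenium WebDriver: it exposes
    `execute_script(...)` and a `page_source` attribute.
    """
    # Same idea as the JavaScript above: scroll down, report the page height.
    scroll_script = (
        "window.scrollTo(0, document.body.scrollHeight);"
        "return document.body.scrollHeight;"
    )
    snapshots = []
    last_height = None
    for _ in range(max_scrolls):
        height = driver.execute_script(scroll_script)
        time.sleep(pause_seconds)  # give the page time to load more posts
        snapshots.append(driver.page_source)
        if height == last_height:
            break  # the page stopped growing, so we are likely at the end
        last_height = height
    return snapshots
```

With a real browser you would create the driver first (for example `webdriver.Chrome()` from selenium), call `driver.get(...)` on the profile URL, and then pass the driver to this function.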
After each scroll, we use BeautifulSoup to get the unique shortcode of every post on the profile that we just scrolled. Each shortcode can then be passed to Post.from_shortcode to construct an instascrape.Post object for scraping.
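As a rough illustration of the shortcode step, the sketch below pulls shortcodes out of a list of HTML snapshots. Two things here are assumptions on my part: it uses the standard library's `html.parser` as a stand-in for BeautifulSoup so it runs without extra installs, and it assumes post links look like `/p/<shortcode>/`. The names `ShortcodeParser` and `extract_shortcodes` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumption: links to posts have the form /p/<shortcode>/
POST_HREF = re.compile(r"^/p/([A-Za-z0-9_-]+)/?")


class ShortcodeParser(HTMLParser):
    """Collect shortcodes from the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.shortcodes = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = POST_HREF.match(href)
        if match:
            self.shortcodes.append(match.group(1))


def extract_shortcodes(html_snapshots):
    """Return unique shortcodes in the order they first appear."""
    seen = []
    for html in html_snapshots:
        parser = ShortcodeParser()
        parser.feed(html)
        parser.close()
        for code in parser.shortcodes:
            if code not in seen:  # snapshots overlap, so skip repeats
                seen.append(code)
    return seen
```

Because each scroll returns the whole page again, consecutive snapshots overlap heavily; de-duplicating while preserving order keeps the posts in the order they appear on the profile.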
🔧 Scraping the data
Assuming we have created a list of Post objects called post_objects, we are now ready to scrape the data we need. Leveraging instascrape, all we have to do to scrape each post is:
for post in post_objects:
    post.scrape()
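If you are scraping a lot of posts, you may want something slightly more defensive than the bare loop. The sketch below is one possible variation, not part of instascrape: `scrape_all` is a hypothetical helper that only assumes each object has a `scrape()` method, pauses between requests, and keeps track of failures instead of stopping at the first one.

```python
import time


def scrape_all(posts, pause_seconds=2.0):
    """Call .scrape() on each post, pausing between requests.

    Returns (scraped, failed), where failed holds (post, error) pairs.
    """
    scraped, failed = [], []
    for post in posts:
        try:
            post.scrape()
        except Exception as error:  # broad on purpose: record it and move on
            failed.append((post, error))
        else:
            scraped.append(post)
        time.sleep(pause_seconds)  # spread the requests out
    return scraped, failed
```

The two-second default is an arbitrary placeholder; pick whatever pause makes sense for your situation.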
And that's it! Each scrape loads a ton of data points, ranging from the number of likes and the hashtags used to the tagged users and the upload datetime.
📊 Analysis and visualization
To get our data all neat and tidy, we're going to instantiate a pandas.DataFrame that will store our data:
import pandas as pd
dataframe = pd.DataFrame([post.to_dict() for post in post_objects])
Now that we have an expressive and powerful way of handling our data, we can create some more useful columns:
dataframe["upload_hour"] = dataframe['upload_date'].dt.hour
dataframe["upload_weekday"] = dataframe['upload_date'].dt.weekday
dataframe["amt_tagged_users"] = dataframe['tagged_users'].str.len()
dataframe["amt_hashtags"] = dataframe['hashtags'].str.len()
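To see what these derived columns look like, here is a tiny self-contained sketch. The two rows are made-up sample data (not scraped from any account), and the frame is called `sample` to keep it separate from the real `dataframe`; the column names mirror the ones used above.

```python
import pandas as pd

# Made-up rows standing in for scraped posts
sample = pd.DataFrame({
    "upload_date": pd.to_datetime(["2020-10-19 12:30:00", "2020-10-24 18:05:00"]),
    "tagged_users": [["friend_a", "friend_b"], []],
    "hashtags": [["python"], ["nyc", "photo", "sunset"]],
})

# .dt exposes datetime parts; weekday counts from Monday=0 to Sunday=6
sample["upload_hour"] = sample["upload_date"].dt.hour
sample["upload_weekday"] = sample["upload_date"].dt.weekday

# .str.len() also works on list-valued columns and returns each list's length
sample["amt_tagged_users"] = sample["tagged_users"].str.len()
sample["amt_hashtags"] = sample["hashtags"].str.len()

print(sample[["upload_hour", "upload_weekday", "amt_tagged_users", "amt_hashtags"]])
```

October 19, 2020 was a Monday and October 24 a Saturday, so the weekday column comes out as 0 and 5.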
Now, to visualize it, we use pandas.plotting.scatter_matrix, which lets us view a matrix of scatter plots showing how each pair of variables relates! For this example, we'll compare the
- hour of the upload
- day of the week
- number of comments
- number of likes
- number of tagged users
- number of hashtags
Using my own personal Instagram page (@chris_greening), we get:
pd.plotting.scatter_matrix(dataframe[['likes', 'comments', 'amt_tagged_users', 'upload_hour']], figsize=(8,8))
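The call above plots four of the six variables listed. If you want all six in one matrix, a sketch along these lines should work. The data here is random and synthetic, generated only so the example runs on its own; it says nothing about real engagement. The `Agg` backend is an assumption so the figure renders without a display.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen; drop this line in a notebook

import numpy as np
import pandas as pd

# Synthetic stand-in for the scraped dataframe (random values, illustration only)
rng = np.random.default_rng(seed=0)
n_posts = 50
synthetic = pd.DataFrame({
    "likes": rng.integers(50, 500, n_posts),
    "comments": rng.integers(0, 40, n_posts),
    "amt_tagged_users": rng.integers(0, 5, n_posts),
    "amt_hashtags": rng.integers(0, 30, n_posts),
    "upload_hour": rng.integers(0, 24, n_posts),
    "upload_weekday": rng.integers(0, 7, n_posts),
})

columns = [
    "likes", "comments", "amt_tagged_users",
    "amt_hashtags", "upload_hour", "upload_weekday",
]

# Returns a 2D array of matplotlib axes, one per pair of columns
axes = pd.plotting.scatter_matrix(synthetic[columns], figsize=(10, 10), diagonal="hist")

# Save the figure that the axes belong to
axes[0, 0].get_figure().savefig("scatter_matrix.png")
```

With six columns the result is a 6x6 grid, so a larger `figsize` than the `(8,8)` used above keeps the individual plots readable.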
Analyzing this scatter matrix, we can now look at how different variables interact with one another and get an idea of what we can do to better boost our engagement 🙌
For example, looking at the scatter plot that compares upload_hour and likes, we see a peak sometime around noon. This suggests that, on average, the best time for me to post to my Instagram is around noon.
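Eyeballing a scatter plot is a good start, but you can back up an "on average" claim with a quick groupby. The sketch below uses a handful of made-up rows (not my real numbers) to show the idea; `sample`, `summary`, and `best_hour` are just illustrative names.

```python
import pandas as pd

# Made-up posts: upload hour and likes
sample = pd.DataFrame({
    "upload_hour": [9, 12, 12, 18, 18, 21],
    "likes": [110, 240, 260, 150, 130, 90],
})

# Average likes per upload hour, plus how many posts back each average
summary = sample.groupby("upload_hour")["likes"].agg(["mean", "count"])

# Hour with the highest average likes
best_hour = summary["mean"].idxmax()

print(summary)
print("Best hour on average:", best_hour)
```

Keeping the `count` column next to the mean is a cheap sanity check: an hour with a great average but only one post behind it is not much evidence.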
This is just one relationship and there are plenty more to be discovered! Let me know what other relationships you found interesting in the comments below ❤️
📰 Additional resources
If you want to learn more about exploratory data analysis using instascrape, check out my other blog posts:
- Exploratory data analysis of Instagram using instascrape and Python (Chris Greening, Oct 22 '20)
- Visualizing Instagram engagement with instascrape and Python (Chris Greening, Oct 21 '20)
💻 The official repo
instascrape is always looking for more contributors, so come join us at the official repo:
chris-greening / instascrape
Powerful and flexible Instagram scraping library for Python, providing easy-to-use and expressive tools for accessing data programmatically
instascrape: powerful Instagram data scraping toolkit
Note: This module is no longer actively maintained.
DISCLAIMER:
Instagram has gotten increasingly strict with scraping and using this library can result in getting flagged for botting AND POSSIBLE DISABLING OF YOUR INSTAGRAM ACCOUNT. This is a research project and I am not responsible for how you use it. Independently, the library is designed to be responsible and respectful and it is up to you to decide what you do with it. I don't claim any responsibility if your Instagram account is affected by how you use this library.
What is it?
instascrape is a lightweight Python package that provides an expressive and flexible API for scraping Instagram data. It is geared towards being a high-level building block on the data scientist's toolchain and can be seamlessly integrated and extended with industry standard tools for web scraping, data science, and analysis.