Meta Spy: https://github.com/DEENUU1/meta-spy
Full code is available here: https://pastebin.com/QMmDUZtj
Demo: https://github.com/DEENUU1/meta-spy/blob/main/assets/instagram/imagescraper.gif?raw=true
Info
This article is based on Meta Spy (formerly Facebook Spy), a project I am still developing. This week I started adding commands for scraping data from Instagram. My plan is to extend the app to all Meta applications and to add a GUI built with the Flet framework, because typing these commands is getting tedious.
How to bypass login?
Bypassing Instagram's login process might sound like a daunting task, but it's surprisingly straightforward. We'll extract the sessionid key from a browser where we're already logged in and integrate it into the Selenium driver. Here's a step-by-step guide:
- Launch Instagram in your browser and press F12 to open the Developer Tools.
- In the Developer Tools, open the storage panel (the "Application" tab in Chrome, the "Storage" tab in Firefox).
- Locate and select the "Cookies" option, then choose cookies for instagram.com.
- Copy the sessionid value.
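The copied value ends up in the cookie dictionary that is later passed to Selenium's add_cookie. A minimal sketch of building that dictionary, assuming you keep the value out of the source code in an environment variable (the variable name INSTAGRAM_SESSIONID and the helper build_session_cookie are hypothetical and not part of Meta Spy):

```python
import os
from typing import Dict, Optional


def build_session_cookie(sessionid: Optional[str] = None) -> Dict[str, str]:
    """Return a cookie dict in the shape Selenium's add_cookie() accepts."""
    # INSTAGRAM_SESSIONID is a hypothetical variable name chosen for this sketch.
    value = sessionid or os.environ.get("INSTAGRAM_SESSIONID", "")
    if not value:
        raise ValueError("No sessionid given and INSTAGRAM_SESSIONID is not set")
    return {"name": "sessionid", "value": value, "domain": ".instagram.com"}


if __name__ == "__main__":
    # Placeholder value; use the one copied from your browser.
    print(build_session_cookie("your_sessionid_goes_HERE"))
```

Treat the sessionid like a password: anyone who has it can act as your logged-in account until the session expires.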
It's time to write some code
Now that we've covered the initial steps, it's time to dive into the code.
Setting Up Chrome Driver Options
To begin, we'll create a class with a static method that simplifies the configuration of the Chrome driver. This class will serve as the foundation for our scraper.
from typing import List
from time import sleep
from selenium.webdriver.common.by import By
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
class Scraper:
    @staticmethod
    def _chrome_driver_configuration() -> Options:
        chrome_options = Options()
        chrome_options.add_argument("--disable-notifications")
        chrome_options.add_argument("--disable-extensions")
        chrome_options.add_argument("--disable-popup-blocking")
        chrome_options.add_argument("--disable-default-apps")
        chrome_options.add_argument("--disable-infobars")
        chrome_options.add_argument("--disable-web-security")
        chrome_options.add_argument(
            "--disable-features=IsolateOrigins,site-per-process"
        )
        chrome_options.add_argument(
            "--enable-features=NetworkService,NetworkServiceInProcess"
        )
        chrome_options.add_argument("--profile-directory=Default")
        chrome_options.add_experimental_option("excludeSwitches", ["enable-logging"])
        return chrome_options
Implementing the Base Scraper Class
While this tutorial might appear to introduce more classes than necessary, it aligns with our modular approach to project development. This approach allows us to showcase the complete implementation of specific functionalities.
class BaseInstagramScraper(Scraper):
    def __init__(self, user_id: str, base_url: str) -> None:
        super().__init__()
        self._user_id = user_id
        self._base_url = base_url.format(self._user_id)
        self._driver = webdriver.Chrome(options=self._chrome_driver_configuration())
        self._driver.get(self._base_url)
        self._wait = WebDriverWait(self._driver, 10)
Scroll
Retrieving the full content from Instagram profiles requires scrolling, but it's not as simple as a one-time scroll-and-scrape process. When scrolling through a profile, data appears and disappears dynamically. As only a few rows of images are visible at a time, scrolling to the end and scraping the data is not feasible. To address this, we've created a function that provides a callback mechanism for dynamic content retrieval.
Our standard function scrolls the page down and captures all the visible content. However, in this case, dynamic data retrieval is necessary.
def scroll_page_callback(driver, callback) -> None:
    """
    Scrolls the page to load more data from a website
    """
    try:
        last_height = driver.execute_script("return document.body.scrollHeight")
        consecutive_scrolls = 0

        # Stop after three scrolls in a row that do not change the page height
        while consecutive_scrolls < 3:
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)
            new_height = driver.execute_script("return document.body.scrollHeight")

            if new_height == last_height:
                consecutive_scrolls += 1
            else:
                consecutive_scrolls = 0

            last_height = new_height

            # Collect whatever is visible after this scroll step
            callback(driver)
    except Exception as e:
        # Meta Spy uses its own logs.log_error helper here;
        # print keeps this snippet self-contained
        print(f"Error occurred while scrolling: {e}")
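To see when the loop above stops, it helps to run the same stop rule without a browser. The sketch below is an illustration only: FakeDriver is a hypothetical stand-in for the Selenium driver whose page height grows a few times and then stays constant, and scroll_with_callback repeats the logic of scroll_page_callback without the sleep call.

```python
from typing import Callable, List


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver with a scripted page height."""

    def __init__(self, heights: List[int]) -> None:
        self._heights = heights
        self._index = 0

    def execute_script(self, script: str):
        if script.startswith("return"):
            # Height query: report the current page height.
            return self._heights[self._index]
        # Scroll request: move to the next height, if there is one.
        self._index = min(self._index + 1, len(self._heights) - 1)
        return None


def scroll_with_callback(driver, callback: Callable, max_idle: int = 3) -> int:
    """Same stop rule as scroll_page_callback; returns the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    idle = 0
    scrolls = 0
    while idle < max_idle:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        # Count scrolls that did not load anything new; reset on growth.
        idle = idle + 1 if new_height == last_height else 0
        last_height = new_height
        callback(driver)
    return scrolls


if __name__ == "__main__":
    calls = []
    total = scroll_with_callback(FakeDriver([1000, 2000, 3000]), lambda d: calls.append(1))
    print(total, len(calls))
```

With the heights 1000, 2000 and 3000, the loop performs two scrolls that load new content and three idle ones before it gives up, and the callback runs after each of the five scrolls.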
Scraping data
Now, let's put all the pieces together and explore the main class responsible for scraping Instagram data.
class ProfileScraper(BaseInstagramScraper):
    def __init__(self, user_id: str) -> None:
        super().__init__(user_id, base_url=f"https://www.instagram.com/{user_id}/")
        self._driver.add_cookie(
            {
                "name": "sessionid",
                "value": "your_sessionid_goes_HERE",
                "domain": ".instagram.com",
            }
        )
        self._refresh_driver()

    def _refresh_driver(self) -> None:
        self._driver.refresh()
The ProfileScraper class inherits from BaseInstagramScraper, which already configures the Chrome driver and opens the profile page. The page has to be open first, because Selenium only accepts cookies for the domain that is currently loaded. We then add the sessionid cookie to the driver; make sure the "value" field contains your own sessionid. Finally, we call self._refresh_driver(), which reloads the page so that the newly added cookie takes effect.
    # Method of ProfileScraper
    def extract_images(self) -> List[str]:
        extracted_image_urls = []
        try:

            def extract_callback(driver):
                img_elements = self._driver.find_elements(
                    By.CLASS_NAME,
                    "x5yr21d.xu96u03.x10l6tqk.x13vifvy.x87ps6o.xh8yej3",
                )
                for img_element in img_elements:
                    src_attribute = img_element.get_attribute("src")
                    if src_attribute and src_attribute not in extracted_image_urls:
                        # print(f"Extracted image URL: {src_attribute}")
                        extracted_image_urls.append(src_attribute)

            scroll_page_callback(self._driver, extract_callback)
        except Exception as e:
            print(f"An error occurred while extracting images: {e}")

        return extracted_image_urls
The core of this class is the extract_images method, which returns a list of all scraped image URLs. Inside it we define the extract_callback function. It finds the image elements that are currently visible, reads their src attribute, and appends each URL to the extracted_image_urls list unless it is already there (uncomment the print line to also show each URL in the console).
Finally, we call scroll_page_callback with the Chrome driver and extract_callback as arguments, so the extraction runs after every scroll step.
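The duplicate check in extract_callback uses "not in" on a list, which scans the whole list for every image. For larger profiles, a set kept next to the list gives the same ordered result with constant-time lookups. A minimal sketch, where add_unique is a hypothetical helper and not part of Meta Spy:

```python
from typing import List, Optional, Set


def add_unique(urls: List[str], seen: Set[str], candidate: Optional[str]) -> bool:
    """Append candidate to urls if it is non-empty and new; return True when added."""
    if not candidate or candidate in seen:
        return False
    seen.add(candidate)
    urls.append(candidate)
    return True


if __name__ == "__main__":
    urls: List[str] = []
    seen: Set[str] = set()
    # Simulates overlapping callback passes; the example.com URLs are placeholders.
    for src in [
        "https://example.com/a.jpg",
        None,
        "https://example.com/b.jpg",
        "https://example.com/a.jpg",
    ]:
        add_unique(urls, seen, src)
    print(urls)
```

Inside extract_callback, the body of the for loop would then reduce to a single add_unique call with the src attribute of each element.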
With this comprehensive guide, you're well-equipped to dive into Instagram data scraping with Meta Spy. As we continue developing this project, expect more features and functionalities that expand its capabilities across all Meta applications. And don't forget, our plans to integrate Flet as a GUI promise to make the experience even more user-friendly. Happy scraping!
Running code
if __name__ == "__main__":
    scraper = ProfileScraper("sawardega_wataha")
    data = scraper.extract_images()
    print(len(data))
    print(data[0])
Pass the username of the Instagram account you want to scrape as the user_id argument of ProfileScraper.
Results
> python .\main.py
33 # This is the number of scraped URLs
# This is the full URL of the first scraped image
https://scontent-waw1-1.cdninstagram.com/v/t51.2885-15/387688415_1338700880368645_3875950289382108239_n.jpg?stp=dst-jpg_e35&efg=eyJ2ZW5jb2RlX3RhZyI6ImltYWdlX3VybGdlbi4xNDQweDE4MDAuc2RyIn0&_nc_ht=scontent-waw1-1.cdninstagram.com&_nc_cat=101&_nc_ohc=-w6WTMiiWj4AX-_Qfkt&edm=ACWDqb8BAAAA&ccb=7-5&ig_cache_key=MzIxMTM2ODUyNjYzMDkzMTEzMA%3D%3D.2-ccb7-5&oh=00_AfDoHMVh0dS6msk5yKaW9d81HCeCSgBUJzW82sKRHYRvwQ&oe=65433911&_nc_sid=ee9879