First of all, don't ask me why.
Project
InstaStalker = Python + Docker + Lambda + SNS (and a touch of EventBridge)
InstaStalker: Your daily stalking assistant.
The main goal is to scrape Instagram profiles and determine whether a profile is public or private. Since this is a tracking application, we assume the profile we want to follow is private. Lambda therefore triggers SNS to send an email when the profile's visibility changes to public. The whole scenario works without logging in (at least without my credentials).
Scraping
Scraping was entirely a trial-and-error stage. First I used beautifulsoup4 to fetch the page. In the beginning it worked: I got the whole page source (HTML) and parsed the "is_private" key with a regex. Voila! However, it only worked for a short time. Instagram lets you view only a few pages per IP without logging in, and it also changes many things frequently. Once you hit the limit, bs4 fetches the login page's HTML, which contains no is_private key.
V1 Code
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request

username = "amazonwebservices"
req = Request("https://instagram.com/" + username,
              headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read().decode("utf-8")
soup = BeautifulSoup(html, features="lxml")
scripts = soup.find_all("script")  # a script block contains the user info
...
# then the regex part, converting key-values into a pandas df
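The regex step I glossed over might look like this sketch. The html string below is a made-up stand-in for the real page source; the actual page embeds the flag inside a much larger JSON blob:

```python
import re

# Invented stand-in for the script block bs4 returns.
html = '<script>{"graphql":{"user":{"username":"amazonwebservices","is_private":false}}}</script>'

# Pull the boolean that follows the "is_private" key.
match = re.search(r'"is_private"\s*:\s*(true|false)', html)
is_private = match.group(1) == "true" if match else None
print(is_private)  # -> False (the sample profile is public)
```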
As I mentioned, I needed a new approach due to the IP restriction and frequent changes. While Googling, I found a way to get the data from an Instagram API endpoint. It was much better than parsing HTML.
V2 Code
import json
import requests

username = "amazonwebservices"
response = requests.get(
    "https://i.instagram.com/api/v1/users/web_profile_info/?username={0}".format(username),
    headers={"x-ig-app-id": "936619743392459"},
)
data = json.loads(response.content)
print(data["data"]["user"]["is_private"])
# False
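Once the per-IP limit hits, the endpoint hands back the login page's HTML instead of JSON, which is easy to detect: json.loads raises a JSONDecodeError. A minimal sketch, with an invented HTML string standing in for the real login page:

```python
import json
from json import JSONDecodeError

login_html = "<html><body>Please log in</body></html>"  # stand-in for the login page
try:
    data = json.loads(login_html)
except JSONDecodeError:
    data = None
    print("Hit the login wall: got HTML back, not JSON")
```

This is the same error the Lambda handler later catches.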
Still, I couldn't get around the login wall. Although the API retrieved the user information, after a few requests it again required logging in. I tried some try/except blocks and added headers (user agents, encoding, schema, etc.), but they never worked out. I also tried instaloader and got the same error: login required. I saw some proxy solutions, but they were paid services, and I wouldn't pay for them. Then I discovered RapidAPI, an API platform that provides a proxy. Finally, I found what I was looking for. I tried a few APIs and decided to move forward with one of them.
Code V3 (latest for now)
import json
import requests

def rapid_checker(user_name: str):
    url = "http://api_url/{0}".format(user_name)
    headers = {
        "X-RapidAPI-Key": "api_key",
        "X-RapidAPI-Host": "api_host",
    }
    response = requests.request("GET", url, headers=headers)
    if response.status_code == 200:
        response = json.loads(response.text)
        return {
            "username": response["username"],
            "fullname": response["full_name"],
            "bio": response["biography"],
            "is_private": response["is_private"],
            "follower_count": int(response["edge_followed_by"]["count"]),
            "following_count": int(response["edge_follow"]["count"]),
            "total_post": int(response["edge_owner_to_timeline_media"]["count"]),
            "profile_pic": response["profile_pic_url_hd"],
        }
    else:
        print("Couldn't retrieve data for {0}. Reason: {1}".format(
            user_name, json.loads(response.text)["message"]))
        return None
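To make the function's return shape concrete, here is the same field mapping applied to a made-up payload (all values are invented; the keys mirror the API response fields used above):

```python
sample = {  # invented values mimicking the API response
    "username": "amazonwebservices",
    "full_name": "Amazon Web Services",
    "biography": "Official account",
    "is_private": False,
    "edge_followed_by": {"count": 1000},
    "edge_follow": {"count": 10},
    "edge_owner_to_timeline_media": {"count": 50},
    "profile_pic_url_hd": "https://example.com/pic.jpg",
}

# The mapping rapid_checker applies to a successful response.
result = {
    "username": sample["username"],
    "fullname": sample["full_name"],
    "bio": sample["biography"],
    "is_private": sample["is_private"],
    "follower_count": int(sample["edge_followed_by"]["count"]),
    "following_count": int(sample["edge_follow"]["count"]),
    "total_post": int(sample["edge_owner_to_timeline_media"]["count"]),
    "profile_pic": sample["profile_pic_url_hd"],
}
print(result["is_private"], result["follower_count"])  # -> False 1000
```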
Lambda Deployment
Finally, I can gather data without any login errors. There is a daily limit on the API, but it is tolerable.
Then I moved on to the automation stage with Lambda. I created a lambda_function.py containing the gathering function and added the SNS code for the notifications. The whole script:
import boto3
import json
import requests
from json import JSONDecodeError

client = boto3.client("sns", region_name="region")

def rapid_checker(user_name: str):
    # same as the code above.
    ...

def lambda_handler(event, context):
    try:
        topic_arn = "sns_arn"
        try:
            username = event["user_name"]
        except KeyError:
            username = "amazonwebservices"
        print("Username:", username)
        result = rapid_checker(username)
        if result is not None:
            print(result)
            if not result["is_private"]:
                client.publish(
                    TopicArn=topic_arn,
                    Subject="Profile is now public!",
                    Message="'{0}' profile is now public! \n"
                            "Follower Count: {1} \n"
                            "Following Count: {2} \n"
                            "Total Post: {3} \n"
                            "Profile Pic: {4}".format(
                        result["username"],
                        result["follower_count"],
                        result["following_count"],
                        result["total_post"],
                        result["profile_pic"],
                    ),
                )
                print("Message sent to SNS!")
            print("Checked for:", username, "and the visibility status is:", result["is_private"])
            return result
        else:
            print("Couldn't retrieve data for {0}. Reason: API Limit Reached!".format(username))
    except JSONDecodeError as error:
        print(error)
        return "Couldn't retrieve data for {0}".format(username)
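The Message body the handler builds can be previewed locally before wiring up SNS. The result dict below uses invented numbers in place of a real rapid_checker return value:

```python
result = {  # invented sample of what rapid_checker returns
    "username": "amazonwebservices", "follower_count": 1000,
    "following_count": 10, "total_post": 50,
    "profile_pic": "https://example.com/pic.jpg",
}

message = ("'{0}' profile is now public!\n"
           "Follower Count: {1}\n"
           "Following Count: {2}\n"
           "Total Post: {3}\n"
           "Profile Pic: {4}").format(
    result["username"], result["follower_count"],
    result["following_count"], result["total_post"], result["profile_pic"])
print(message)
```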
Next, I created a Dockerfile and pushed the image to ECR. Almost ready!
FROM public.ecr.aws/lambda/python:3.8
RUN pip install requests
COPY lambda_function.py .
CMD [ "lambda_function.lambda_handler" ]
Then I created a Lambda function using a Docker image and tested it. Everything works perfectly. Finally, I created an EventBridge Rule for the scheduled runs.
This is how you can build your own "homemade" Instagram stalking application using AWS.
Lastly, don't ask me why.
(All solutions arise from a need...)