ProxyShare.io

How To Use Mobile Proxies For IG Scraping

Scraping social media accounts is quite challenging, and Instagram is no exception. Even when you limit yourself to public profiles, these platforms make it very hard for data engineers to extract and use the data their users publish. And it's not so much that they want to protect privacy; rather, they value that data for their own purposes. In today's article, we're introducing a way to scrape public Instagram profiles.

Is scraping Instagram legal?

Discussing the legal aspects of web scraping is an endless topic. According to Instagram's Terms and Conditions, scraping posts, comments or media is not permitted; scraping publicly available data, however, is. This means that as long as you limit your scraping activity to publicly available data from Instagram and stay away from copyright-protected content, you're good to go!

How hard is it to scrape public Instagram profiles?

Instagram is very well protected. Even though it doesn't throw captchas or "Forbidden"-style pages at you, it will pick up even the slightest scraping intent and try to prevent it. The usual way it does so is by redirecting your requests to its log-in page.

Setting up a scraper for Instagram profiles

In the following section of this blog post, we're going to create a simple web scraper for Instagram profiles. The technologies we're going to use for this project are:

  • NodeJS
  • Playwright
  • TypeScript

We're going to focus on the scraper itself rather than on feeding it profiles (i.e. building an API). You can then take the result and integrate it into your own project, so that in the end you have a complete Instagram profile scraper.
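If you're starting from a fresh project, one possible setup looks like this (a minimal sketch; package versions and your tsconfig are up to you):

npm init -y
npm install playwright dotenv
npm install -D typescript ts-node @types/node
npx playwright install chromium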

Setting up the browser

Assuming you're somewhat familiar with NodeJS and TypeScript, open your entry file (ours is index.ts) and let's get started by initialising a new browser:

import { Browser, Page, chromium } from "playwright";
import dotenv from "dotenv";

const scrape: () => Promise<void> = async(): Promise<void> => {

    /* Load environment variables */
    dotenv.config(); 

    const server:   string | undefined = process.env.PROXY_SERVER;
    const username: string | undefined = process.env.PROXY_USERNAME;
    const password: string | undefined = process.env.PROXY_PASSWORD;
    if (!server || !username || !password) 
        throw new Error("Proxy credentials not provided");

    let browser:    Browser | undefined;
    let page:       Page | undefined;
    try {
        browser = await chromium.launch({
            headless: process.env.NODE_ENV === "production",
            proxy: { server, username, password },
            args: [
                "--no-zygote",
                "--no-sandbox",
                "--disable-gpu",
                "--disable-web-security",
                "--disable-dev-shm-usage",
                "--ignore-certificate-errors",
                "--disable-site-isolation-trials",
                "--allow-running-insecure-content",
            ]
        });
        page = await browser.newPage();
    } catch (e: unknown) { 
        console.error((<Error>e).message);
    } finally {
        if (page) await page.close();
        if (browser) await browser.close();
    }

}

(async () => {
    await scrape();
})();

Please note that we've already added support for proxies in our code. That's because, without proxies, Instagram will block your requests. Furthermore, poor quality proxies will result in a redirect to Instagram's log-in page, as previously discussed.

We encourage you to test with different proxies. We even host a free proxy server, for which you can get login information on our Discord or website: https://www.proxyshare.io/

However, we know the free ones won't work with Instagram. To get past this issue, we recommend using good quality proxies. From experience, we see a very high success rate with our 4G/LTE mobile proxies.
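For reference, the proxy credentials our script reads from the environment can live in a .env file shaped like this (host, port and credentials below are placeholders, not real values):

PROXY_SERVER=http://proxy.example.com:8080
PROXY_USERNAME=your-username
PROXY_PASSWORD=your-password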

Fetching HTML data from Instagram

Now let's move on. So far, our code only launches a browser, creates a page and exits. Next, we'll adjust it to navigate to an Instagram URL.

First, let's modify the scrape function to accept one parameter:

const scrape: (url: string) => Promise<void> = async(url: string): Promise<void> => {

Next, let's use this parameter to navigate to the URL:

/* ... */
        page = await browser.newPage();
        await page.goto(url);
/* ... */
}

(async () => {
    await scrape("https://instagram.com/instagram");
})();
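Before we move on to parsing, it's worth checking that the proxy actually got us through. As discussed earlier, Instagram redirects blocked requests to its log-in page, so a simple guard right after navigation can catch that. This is a sketch, and the "/accounts/login" URL fragment is our assumption about what that redirect looks like:

/* ... */
        await page.goto(url);
        /* If the proxy was rejected, we end up on the log-in page */
        if (page.url().includes("/accounts/login"))
            throw new Error("Redirected to log-in page - try a better proxy");
/* ... */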

Parsing the HTML

To extract the data from the HTML document, you can either return the page content and parse it with an HTML parser (like node-html-parser, for example), or use the already open browser to evaluate some JS code inside the page.

We usually recommend the second, because, let's be honest: NodeJS is nice and powerful, but what better environment to run JavaScript than an actual browser!?
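That said, should you prefer the first option, a rough sketch with node-html-parser might look like the fragment below (to run inside the try block after navigation; the h2 selector is purely illustrative):

/* ... */
import { parse } from "node-html-parser";
/* ... */
        /* Grab the rendered HTML out of the browser and parse it in Node */
        const html: string = await page.content();
        const root = parse(html);
        const heading = root.querySelector("h2");
        console.log(heading?.text ?? "Not found");
/* ... */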

This being said, let's navigate to Instagram in our own browser, open DevTools and identify the selectors we're interested in. Whenever you see randomised CSS classes, it's best to use XPath to select elements from the DOM.

If you used an incognito window, you've probably seen the "Allow cookies" pop-up. Let's grab its XPath and make sure we accept it, so we don't get any errors later:

/* ... */
        await page.goto(url, { waitUntil: "networkidle" });
        const cookieBtn: ElementHandle<HTMLElement | SVGElement> | null = await page.locator(`xpath=/html/body/div[6]/div[1]/div/div[2]/div/div/div/div/div[2]/div/button[1]`).elementHandle();
        if (cookieBtn) await cookieBtn.click();
/* ... */

Now, if we want to select elements by xPath inside our browser, let's register a JavaScript function that will allow us to do so:

/* Declare window object to prevent TypeScript errors */
declare const window: Window &
typeof globalThis & {
    getElementByXpath: (xpath: string) => HTMLElement | null;
}

const scrape: (url: string) => Promise<Record<string, string> | null> = async(url: string): Promise<Record<string, string> | null> => {
/* ... */
        data = await page.evaluate(() => {
            window.getElementByXpath = (xpath: string): HTMLElement | null => {
                return <HTMLElement | null>document.evaluate(xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null)?.singleNodeValue;
            };
/* ... */

All we have to do now is to actually scrape the data from this page. For the purpose of this tutorial, we're going to extract:

  • the username
  • the total number of posts
  • the total number of followers
  • the total number of following
  • the external URL

/* ... */
        data = await page.evaluate(() => {
            window.getElementByXpath = (xpath: string): HTMLElement | null => {
                return <HTMLElement | null>document.evaluate(xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null)?.singleNodeValue;
            };
            const usernameEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/div[1]/div[1]/h2`);
            const username: string = usernameEl?.textContent || "Not found";
            const postsEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/ul/li[1]/button/span/span`);
            const posts: string = postsEl?.textContent || "Not found";
            const followersEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/ul/li[2]/button/span/span`);
            const followers: string = followersEl?.textContent || "Not found";
            const followingEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/ul/li[3]/button/span/span`);
            const following: string = followingEl?.textContent || "Not found";
            const urlEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/div[3]/div[3]/div/a`);
            const url: string = urlEl?.getAttribute("href") || "Not found";
            return { username, posts, followers, following, url };
        });
/* ... */
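Note that the counts come back exactly as Instagram displays them, i.e. as strings like "4,204" or "12.5M", rather than as numbers. If you need numeric values downstream, a small helper can normalise them; the parseCount function below is a hypothetical sketch, not part of the scraper itself:

/* Hypothetical helper: turn Instagram's display strings
   ("4,204", "12.5K", "691M") into plain numbers */
const parseCount = (value: string): number | null => {
    const match: RegExpMatchArray | null = value.replace(/,/g, "").match(/^([\d.]+)([KM]?)$/i);
    if (!match) return null;
    const base: number = parseFloat(match[1]);
    const suffix: string = match[2].toUpperCase();
    return suffix === "M" ? base * 1_000_000 : suffix === "K" ? base * 1_000 : base;
};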

The full script

For a better overview, here is the entire script we've been building in this tutorial:

import { Browser, ElementHandle, Page, chromium } from "playwright";
import dotenv from "dotenv";

/* Declare window object to prevent TypeScript errors */
declare const window: Window &
typeof globalThis & {
    getElementByXpath: (xpath: string) => HTMLElement | null;
}

const scrape: (url: string) => Promise<Record<string, string> | null> = async(url: string): Promise<Record<string, string> | null> => {

    /* Load environment variables */
    dotenv.config(); 

    /* Check if proxy credentials are provided */
    const server:   string | undefined = process.env.PROXY_SERVER;
    const username: string | undefined = process.env.PROXY_USERNAME;
    const password: string | undefined = process.env.PROXY_PASSWORD;
    if (!server || !username || !password) 
        throw new Error("Proxy credentials not provided");

    /* Initialize browser and page */
    let browser:    Browser | undefined;
    let page:       Page | undefined;
    let data:       Record<string, string> | null = null;
    try {
        browser = await chromium.launch({
            headless: process.env.NODE_ENV === "production",
            proxy: { server, username, password },
            args: [
                "--no-zygote",
                "--no-sandbox",
                "--disable-gpu",
                "--disable-web-security",
                "--disable-dev-shm-usage",
                "--ignore-certificate-errors",
                "--disable-site-isolation-trials",
                "--allow-running-insecure-content",
            ]
        });
        page = await browser.newPage();
        /* Navigate to Instagram profile */
        await page.goto(url, { waitUntil: "networkidle" });
        /* Accept cookies */
        const cookieBtn: ElementHandle<HTMLElement | SVGElement> | null = await page.locator(`xpath=/html/body/div[6]/div[1]/div/div[2]/div/div/div/div/div[2]/div/button[1]`).elementHandle();
        if (cookieBtn) await cookieBtn.click();
        /* Wait for 100ms while the page loads */
        await page.waitForTimeout(100);
        /* Scrape data */
        data = await page.evaluate(() => {
            window.getElementByXpath = (xpath: string): HTMLElement | null => {
                return <HTMLElement | null>document.evaluate(xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null)?.singleNodeValue;
            };
            const usernameEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/div[1]/div[1]/h2`);
            const username: string = usernameEl?.textContent || "Not found";
            const postsEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/ul/li[1]/button/span/span`);
            const posts: string = postsEl?.textContent || "Not found";
            const followersEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/ul/li[2]/button/span/span`);
            const followers: string = followersEl?.textContent || "Not found";
            const followingEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/ul/li[3]/button/span/span`);
            const following: string = followingEl?.textContent || "Not found";
            const urlEl: HTMLElement | null = window.getElementByXpath(`/html/body/div[2]/div/div/div[2]/div/div/div[1]/div[2]/section/main/div/header/section/div[3]/div[3]/div/a`);
            const url: string = urlEl?.getAttribute("href") || "Not found";
            return { username, posts, followers, following, url };
        });
    } catch (e: unknown) { 
        /* Handle errors */
        console.error((<Error>e).message);
    } finally {
        /* Close browser and page */
        if (page) await page.close();
        if (browser) await browser.close();
    }
    /* Return scraped data */
    return data;
}

(async () => {
    const response: Record<string, string> | null = await scrape("https://instagram.com/instagram");
    console.log(response);
})();
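To try it out, fill in your .env file and run the script (assuming the ts-node setup from earlier; set NODE_ENV=production if you want the browser to run headless):

npx ts-node index.ts

If everything works, the script logs an object with the username, posts, followers, following and url fields to the console.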

Conclusions

Scraping public data from Instagram profiles is no easy task, but it is doable. For large-scale operations, to maximise your success, we recommend using good quality proxies. At ProxyShare.io, we offer 4G/LTE mobile proxies that do great with scraping Instagram (and any other social media platform). Furthermore, since we know how many resources such a project consumes, we don't charge by bandwidth usage: our customers get unlimited bandwidth!
