agenthustler
How to Scrape Instagram in 2026: Profiles, Posts, and Hashtags

Instagram is one of the hardest platforms to scrape in 2026. Meta has locked down its APIs, aggressively blocks automated requests, and regularly changes its internal endpoints. But the data is incredibly valuable — for social media analytics, influencer research, brand monitoring, and market research.

This guide covers what actually works right now, from the official API to browser-based scraping with Playwright.

The Official Instagram Graph API: What You Get (and What You Don't)

Meta's Instagram Graph API is the sanctioned way to access Instagram data. But it has significant limitations.

What the API Can Do

  • Access your own business/creator account's posts, stories, and insights
  • Get basic public profile information
  • Search hashtags (limited to 30 unique hashtags per 7 days per account)
  • Read comments on your own posts
  • Publish content to business accounts

What the API Cannot Do

  • Access any private profile data
  • Scrape followers/following lists of other accounts
  • Get posts from public profiles you don't own (without the profile's authorization)
  • Search for users or explore content freely
  • Access stories from other accounts

Basic Setup

import requests

ACCESS_TOKEN = "YOUR_LONG_LIVED_TOKEN"
BASE_URL = "https://graph.instagram.com/v19.0"

def get_my_profile():
    url = f"{BASE_URL}/me"
    params = {
        "fields": "id,username,account_type,media_count",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()

def get_my_media(limit=25):
    url = f"{BASE_URL}/me/media"
    params = {
        "fields": "id,caption,media_type,media_url,timestamp,like_count,comments_count,permalink",
        "limit": limit,
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()
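The `/me/media` endpoint returns results in pages: each response carries a `paging` object whose `next` field is a fully-qualified URL for the following page. A minimal sketch of a cursor-following helper (the name `paginate` and the injectable `fetch` parameter are my own additions, not part of the Graph API):

```python
def paginate(url, params=None, max_pages=5, fetch=None):
    """Follow Graph API cursor pagination, yielding items one page at a time.

    `fetch` defaults to requests.get; it is injectable so the pagination
    logic can be exercised without network access.
    """
    if fetch is None:
        import requests
        fetch = requests.get
    for _ in range(max_pages):
        if not url:
            break
        payload = fetch(url, params=params).json()
        for item in payload.get("data", []):
            yield item
        # The Graph API puts the fully-qualified next-page URL in paging.next
        url = payload.get("paging", {}).get("next")
        params = None  # the next URL already carries the query string
```

Usage would look like `for post in paginate(f"{BASE_URL}/me/media", {"fields": "id,caption", "access_token": ACCESS_TOKEN}):`.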

Hashtag Search

One of the more useful API features for research — but heavily rate-limited:

def search_hashtag(hashtag_name):
    # Step 1: Get hashtag ID
    url = f"{BASE_URL}/ig_hashtag_search"
    params = {
        "q": hashtag_name,
        "user_id": "YOUR_USER_ID",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    results = response.json().get("data", [])
    if not results:
        return {"error": f"No hashtag found for '{hashtag_name}'"}
    hashtag_id = results[0]["id"]

    # Step 2: Get recent media for that hashtag
    url = f"{BASE_URL}/{hashtag_id}/recent_media"
    params = {
        "user_id": "YOUR_USER_ID",
        "fields": "id,caption,media_type,permalink,timestamp",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()

Remember: you're limited to 30 unique hashtag searches per 7-day window. Plan carefully.
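Because the quota is per rolling 7-day window, it helps to track it client-side and fail fast before burning an API call. A sketch of such a tracker (the `HashtagBudget` class is hypothetical, and this only approximates Meta's server-side accounting):

```python
import time

class HashtagBudget:
    """Client-side approximation of the 30-unique-hashtags-per-7-days quota.

    Instagram counts unique hashtags per rolling 7-day window per account;
    this mirrors that locally so you can refuse a search before making it.
    """
    WINDOW = 7 * 24 * 3600
    LIMIT = 30

    def __init__(self, now=time.time):
        self._now = now    # injectable clock, so the logic is testable
        self._seen = {}    # hashtag -> timestamp of first search

    def can_search(self, hashtag):
        t = self._now()
        # Drop hashtags whose 7-day window has expired
        self._seen = {h: ts for h, ts in self._seen.items()
                      if t - ts < self.WINDOW}
        return hashtag in self._seen or len(self._seen) < self.LIMIT

    def record(self, hashtag):
        self._seen.setdefault(hashtag, self._now())
```

Re-searching an already-counted hashtag is free; only new unique hashtags consume the budget.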

Scraping Public Profiles with Requests

Instagram's web interface loads data through internal GraphQL endpoints. While these are undocumented and change frequently, they're the foundation of most scraping approaches.

Important disclaimer: Scraping Instagram outside their API may violate their Terms of Service. This information is for educational purposes. Always review and respect platform terms before scraping.

Profile Data via Web Endpoints

import requests
import json
import time

class InstagramScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36",
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest"
        })

    def get_profile(self, username):
        url = "https://www.instagram.com/api/v1/users/web_profile_info/"
        params = {"username": username}

        response = self.session.get(url, params=params, timeout=15)

        if response.status_code == 200:
            data = response.json()["data"]["user"]
            return {
                "username": data["username"],
                "full_name": data["full_name"],
                "bio": data["biography"],
                "followers": data["edge_followed_by"]["count"],
                "following": data["edge_follow"]["count"],
                "posts_count": data["edge_owner_to_timeline_media"]["count"],
                "is_private": data["is_private"],
                "is_verified": data["is_verified"],
                "profile_pic": data["profile_pic_url_hd"],
                "external_url": data.get("external_url")
            }
        elif response.status_code == 404:
            return {"error": "Profile not found"}
        else:
            return {"error": f"Status {response.status_code}"}

scraper = InstagramScraper()
profile = scraper.get_profile("natgeo")
if "error" not in profile:
    print(f"{profile['username']}: {profile['followers']:,} followers")

Getting Recent Posts

# These are methods of InstagramScraper above. _get_raw_profile fetches the
# same web_profile_info payload as get_profile but returns the raw user object.
def _get_raw_profile(self, username):
    response = self.session.get(
        "https://www.instagram.com/api/v1/users/web_profile_info/",
        params={"username": username},
        timeout=15
    )
    if response.status_code != 200:
        return None
    return response.json()["data"]["user"]

def get_user_posts(self, username, max_posts=50):
    profile_data = self._get_raw_profile(username)

    if not profile_data or profile_data.get("is_private"):
        return []

    edges = profile_data.get("edge_owner_to_timeline_media", {}).get("edges", [])

    posts = []
    for edge in edges[:max_posts]:
        node = edge["node"]
        posts.append({
            "id": node["id"],
            "shortcode": node["shortcode"],
            "url": f"https://www.instagram.com/p/{node['shortcode']}/",
            "caption": self._extract_caption(node),
            "likes": node.get("edge_liked_by", {}).get("count", 0),
            "comments": node.get("edge_media_to_comment", {}).get("count", 0),
            "timestamp": node["taken_at_timestamp"],
            "is_video": node["is_video"],
            "display_url": node["display_url"]
        })

    return posts

def _extract_caption(self, node):
    edges = node.get("edge_media_to_caption", {}).get("edges", [])
    if edges:
        return edges[0]["node"]["text"]
    return ""
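A small trick worth knowing when working with these nodes: historically, a post's `shortcode` has been a base64-style encoding of its numeric media ID, using a fixed 64-character alphabet. This mapping is undocumented and Meta could change it at any time, so treat the sketch below as an observation, not an API guarantee:

```python
# Alphabet historically used by Instagram shortcodes (undocumented; observed)
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789-_")

def media_id_to_shortcode(media_id):
    """Convert a numeric media ID to its URL shortcode."""
    media_id = int(str(media_id).split("_")[0])  # strip trailing _userid form
    code = ""
    while media_id > 0:
        media_id, rem = divmod(media_id, 64)
        code = ALPHABET[rem] + code
    return code

def shortcode_to_media_id(shortcode):
    """Convert a shortcode back to its numeric media ID."""
    media_id = 0
    for ch in shortcode:
        media_id = media_id * 64 + ALPHABET.index(ch)
    return media_id
```

This lets you correlate IDs seen in one endpoint with `/p/<shortcode>/` URLs seen in another without extra requests.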

Browser-Based Scraping with Playwright

When requests-based approaches get blocked (and they will — Instagram is aggressive about this), browser automation is the next step. Playwright renders the full page like a real browser, making it much harder to detect.

import asyncio
from playwright.async_api import async_playwright
import json

async def scrape_profile_playwright(username):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 800},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0.0.0 Safari/537.36"
        )
        page = await context.new_page()

        profile_data = {}

        async def handle_response(response):
            if "web_profile_info" in response.url:
                try:
                    data = await response.json()
                    profile_data.update(data)
                except Exception:
                    pass  # non-JSON responses (images, scripts) are expected

        page.on("response", handle_response)

        await page.goto(f"https://www.instagram.com/{username}/")
        await page.wait_for_load_state("networkidle")

        rendered_data = await page.evaluate("""
            () => {
                const meta = document.querySelector('meta[property="og:description"]');
                return meta ? meta.content : null;
            }
        """)

        await browser.close()

        return {
            "api_data": profile_data,
            "og_description": rendered_data
        }

result = asyncio.run(scrape_profile_playwright("natgeo"))
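The `og:description` fallback is useful because it survives even when the JSON endpoints are blocked. The tag has historically read something like "1.2M Followers, 300 Following, 4,521 Posts - ...", though the exact wording is not guaranteed. A sketch of a parser for that format (the function name and regex are my own):

```python
import re

_MULT = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_og_description(text):
    """Extract follower/following/post counts from og:description content.

    Handles abbreviated counts like '1.2M' and comma-grouped ones like
    '4,521'. Returns an empty dict if the text doesn't match.
    """
    if not text:
        return {}
    result = {}
    pattern = r"([\d.,]+)\s*([KMB]?)\s+(Followers|Following|Posts)"
    for num, suffix, label in re.findall(pattern, text):
        value = float(num.replace(",", "")) * _MULT[suffix]
        result[label.lower()] = int(round(value))
    return result
```

Feed it the `og_description` field returned by `scrape_profile_playwright` as a sanity check against the intercepted API data.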

Scrolling Through Posts

To get more than the initial page of posts, you need to scroll:

async def scrape_posts_with_scroll(username, target_count=100):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        posts_data = []
        seen_urls = set()

        async def capture_posts(response):
            if "/graphql/query" in response.url:
                try:
                    data = await response.json()
                    edges = (data.get("data", {})
                            .get("user", {})
                            .get("edge_owner_to_timeline_media", {})
                            .get("edges", []))
                    for edge in edges:
                        url = edge["node"].get("shortcode")
                        if url and url not in seen_urls:
                            seen_urls.add(url)
                            posts_data.append(edge["node"])
                except Exception:
                    pass  # ignore non-JSON or unrelated GraphQL responses

        page.on("response", capture_posts)

        await page.goto(f"https://www.instagram.com/{username}/")
        await page.wait_for_timeout(3000)

        while len(posts_data) < target_count:
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

            at_bottom = await page.evaluate("""
                () => window.innerHeight + window.scrollY >= document.body.scrollHeight - 100
            """)
            if at_bottom:
                break

        await browser.close()
        return posts_data[:target_count]
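One failure mode of the loop above: if Instagram stops serving new posts before the page reaches the bottom, the scroll loop keeps spinning. A pure decision function for when to give up (a sketch; `should_stop` is a hypothetical helper you would call with `len(posts_data)` appended after each scroll round):

```python
def should_stop(counts, target, patience=3):
    """Decide whether to stop scrolling.

    counts: cumulative number of captured posts after each scroll round.
    Stop when the target is reached, or when the count has not grown for
    `patience` consecutive rounds (the feed has stopped loading).
    """
    if counts and counts[-1] >= target:
        return True
    if len(counts) <= patience:
        return False
    return counts[-1] == counts[-1 - patience]
```

Tracking progress this way avoids both infinite loops and giving up too early on a slow connection.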

Scraping Hashtag Pages

Hashtag exploration is valuable for content strategy and trend analysis:

async def scrape_hashtag(tag, max_posts=50):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        posts = []

        async def capture_hashtag_data(response):
            if "graphql" in response.url or "api/v1/tags" in response.url:
                try:
                    data = await response.json()
                    sections = data.get("data", {}).get("recent", {}).get("sections", [])
                    for section in sections:
                        medias = section.get("layout_content", {}).get("medias", [])
                        for media in medias:
                            node = media.get("media", {})
                            if node:
                                posts.append({
                                    "id": node.get("pk"),
                                    "shortcode": node.get("code"),
                                    "caption": (node.get("caption", {}) or {}).get("text", ""),
                                    "likes": node.get("like_count", 0),
                                    "comments": node.get("comment_count", 0),
                                    "owner": node.get("user", {}).get("username")
                                })
                except Exception:
                    pass  # ignore responses that don't match this shape

        page.on("response", capture_hashtag_data)

        await page.goto(f"https://www.instagram.com/explore/tags/{tag}/")
        await page.wait_for_timeout(5000)

        for _ in range(5):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

        await browser.close()
        return posts[:max_posts]

results = asyncio.run(scrape_hashtag("webdevelopment"))
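For trend analysis, the captions you just scraped are often more valuable than the post metadata: counting which other hashtags co-occur with your target tag reveals adjacent niches. A small helper over the `scrape_hashtag` output (the function name is my own):

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def top_cooccurring_tags(posts, exclude=None, n=10):
    """Count hashtags across scraped captions, most common first.

    posts: dicts with a 'caption' key, as produced by scrape_hashtag.
    exclude: tags to omit (typically the tag you searched for).
    """
    exclude = {t.lower() for t in (exclude or [])}
    counter = Counter()
    for post in posts:
        for tag in HASHTAG_RE.findall(post.get("caption") or ""):
            tag = tag.lower()
            if tag not in exclude:
                counter[tag] += 1
    return counter.most_common(n)
```

Passing `exclude=["webdevelopment"]` keeps the searched tag itself out of the ranking.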

Handling Instagram's Anti-Bot Detection

Instagram is one of the most aggressive platforms when it comes to blocking scrapers. Here's what you'll face:

Common Blocks

  • Login walls: Instagram shows a login popup after viewing a few profiles
  • Rate limiting: Too many requests from one IP triggers temporary blocks
  • Challenge pages: CAPTCHA-like verification screens
  • Account lockouts: If you're using a logged-in session, the account can get suspended

Mitigation Strategies

import random
import time

import requests

class ResilientScraper:
    def __init__(self, proxies=None):
        self.proxies = proxies or []
        self.request_count = 0
        self.min_delay = 3
        self.max_delay = 8

    def _get_proxy(self):
        if self.proxies:
            return random.choice(self.proxies)
        return None

    def _rate_limit(self):
        delay = random.uniform(self.min_delay, self.max_delay)
        time.sleep(delay)
        self.request_count += 1

        if self.request_count % 20 == 0:
            pause = random.uniform(30, 60)
            print(f"Cooling down for {pause:.0f}s after {self.request_count} requests")
            time.sleep(pause)

    def scrape_profile(self, username):
        self._rate_limit()
        proxy = self._get_proxy()

        session = requests.Session()
        if proxy:
            session.proxies = {"http": proxy, "https": proxy}

        session.headers.update({
            "User-Agent": self._random_user_agent(),
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml"
        })

        try:
            response = session.get(
                "https://www.instagram.com/api/v1/users/web_profile_info/",
                params={"username": username},
                timeout=15
            )

            if response.status_code == 429:
                # Note: this retries indefinitely; cap retries in production
                print("Rate limited, backing off for 5 minutes")
                time.sleep(300)
                return self.scrape_profile(username)

            return response.json()
        except Exception as e:
            print(f"Error scraping {username}: {e}")
            return None

    def _random_user_agent(self):
        agents = [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
        ]
        return random.choice(agents)
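The recursive retry in `scrape_profile` above will loop forever against a hard block. A bounded alternative is exponential backoff with jitter; here is a generic sketch (`with_backoff` is a hypothetical wrapper, with the sleep function injectable so the policy can be tested without waiting):

```python
import random
import time

def with_backoff(fn, max_retries=4, base=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    fn should return a value on success and raise on a retryable failure.
    After max_retries failed retries, the last exception propagates.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # 2s, 4s, 8s, ... plus up to 1s of jitter to avoid lockstep retries
            sleep(base ** (attempt + 1) + random.uniform(0, 1))
```

Wrapping a scrape call looks like `with_backoff(lambda: scraper.scrape_profile("natgeo"))`.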

Practical Use Case: Social Media Analytics Dashboard

Let's put it all together with a real use case — building a competitive analytics tracker:

import json
from datetime import datetime

def analyze_competitor(username):
    scraper = InstagramScraper()

    profile = scraper.get_profile(username)
    if "error" in profile:
        return profile

    posts = scraper.get_user_posts(username, max_posts=30)

    if not posts:
        return {"profile": profile, "analytics": "No public posts"}

    total_likes = sum(p["likes"] for p in posts)
    total_comments = sum(p["comments"] for p in posts)
    avg_likes = total_likes / len(posts)
    avg_comments = total_comments / len(posts)

    engagement_rate = (
        (avg_likes + avg_comments) / profile["followers"] * 100
        if profile["followers"] > 0 else 0
    )

    if len(posts) >= 2:
        newest = posts[0]["timestamp"]
        oldest = posts[-1]["timestamp"]
        days_span = (newest - oldest) / 86400
        posts_per_week = len(posts) / (days_span / 7) if days_span > 0 else 0
    else:
        posts_per_week = 0

    return {
        "profile": profile,
        "analytics": {
            "avg_likes": round(avg_likes),
            "avg_comments": round(avg_comments),
            "engagement_rate": f"{engagement_rate:.2f}%",
            "posts_per_week": round(posts_per_week, 1),
            "top_post": max(posts, key=lambda p: p["likes"]),
            "analyzed_posts": len(posts)
        }
    }

competitors = ["competitor1", "competitor2", "competitor3"]
for comp in competitors:
    print(f"\nAnalyzing @{comp}...")
    result = analyze_competitor(comp)
    print(json.dumps(result["analytics"], indent=2))
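One caveat about the mean-based engagement rate above: a single viral post skews it badly. Using the median instead gives a figure closer to a typical post; a sketch (the function name is my own):

```python
import statistics

def robust_engagement(posts, followers):
    """Median-based engagement rate in percent.

    Less sensitive to one viral outlier than the mean used in
    analyze_competitor; posts are dicts with 'likes' and 'comments' keys.
    """
    if not posts or followers <= 0:
        return 0.0
    med_likes = statistics.median(p["likes"] for p in posts)
    med_comments = statistics.median(p["comments"] for p in posts)
    return (med_likes + med_comments) / followers * 100
```

Reporting both the mean and the median makes it obvious when an account's numbers are carried by one or two outlier posts.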

When to Use a Managed Scraping Solution

Building and maintaining Instagram scrapers is a significant time investment. The platform changes constantly, proxies get burned, and login sessions expire. For ongoing projects, a managed solution is often the pragmatic choice.

I maintain an Instagram Scraper on Apify that handles proxy rotation, session management, and adapts to Instagram's frequent changes automatically. It supports profiles, posts, hashtags, and comments — all through a simple API.

When does it make sense to build your own vs. use a managed tool?

| Factor | Build Your Own | Managed Solution |
|---|---|---|
| Volume | <100 profiles/day | 100+ profiles/day |
| Maintenance time | You have time to fix breakages | You need reliability |
| Proxy costs | You already have proxies | Proxies included |
| Data needs | Very custom extraction | Standard profile/post data |

Key Takeaways

  1. The official API is limited but stable. Use it for your own account analytics and basic hashtag research.

  2. Requests-based scraping works but breaks often. Instagram changes endpoints every few weeks. Budget time for maintenance.

  3. Playwright is more resilient because it renders the full page, but it's slower and resource-intensive.

  4. Rate limiting is non-negotiable. Random delays, proxy rotation, and cool-down periods aren't optional — they're required for any scraper that needs to run for more than a day.

  5. Private profiles are off-limits. There's no ethical or reliable way to scrape private account data. Don't try.

  6. Always have a backup plan. Instagram scraping is an arms race. Whatever works today might not work next week. Design your systems to handle failures gracefully.

The landscape changes fast, but the fundamentals stay the same: respect rate limits, handle errors, and keep your scraping code modular so you can swap out broken components without rewriting everything.
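That modularity can be made concrete with a fallback chain: try the cheapest strategy first and escalate only when it fails. A sketch (the `scrape_with_fallback` helper is hypothetical; the strategies would be your own API, requests, and Playwright implementations):

```python
def scrape_with_fallback(username, strategies):
    """Try each scraping strategy in order until one returns usable data.

    strategies: list of (name, callable) pairs, e.g. official API first,
    then the requests-based scraper, then Playwright. Each callable takes
    a username and returns a dict; raising or returning an 'error' dict
    counts as failure and moves on to the next strategy.
    """
    errors = {}
    for name, strategy in strategies:
        try:
            result = strategy(username)
            if result and "error" not in result:
                return {"source": name, "data": result}
            errors[name] = (result or {}).get("error", "empty result")
        except Exception as e:
            errors[name] = str(e)
    return {"source": None, "errors": errors}
```

Because each strategy is swappable, a broken scraper degrades the pipeline instead of killing it, and the `errors` dict tells you which layer to fix.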


What's your experience with Instagram scraping? Found any approaches I didn't cover? Let me know in the comments.
