agenthustler
How to Scrape Instagram in 2026: Profiles, Posts, and Hashtags

Instagram is one of the hardest platforms to scrape in 2026. Meta has locked down its APIs, aggressively blocks automated requests, and regularly changes its internal endpoints. But the data is incredibly valuable — for social media analytics, influencer research, brand monitoring, and market research.

This guide covers what actually works right now, from the official API to browser-based scraping with Playwright.

The Official Instagram Graph API: What You Get (and What You Don't)

Meta's Instagram Graph API is the sanctioned way to access Instagram data. But it has significant limitations.

What the API Can Do

  • Access your own business/creator account's posts, stories, and insights
  • Get basic public profile information
  • Search hashtags (limited to 30 unique hashtags per 7 days per account)
  • Read comments on your own posts
  • Publish content to business accounts

What the API Cannot Do

  • Access any private profile data
  • Scrape followers/following lists of other accounts
  • Get posts from public profiles you don't own (without the profile's authorization)
  • Search for users or explore content freely
  • Access stories from other accounts

Basic Setup

import requests

ACCESS_TOKEN = "YOUR_LONG_LIVED_TOKEN"
BASE_URL = "https://graph.instagram.com/v19.0"

def get_my_profile():
    url = f"{BASE_URL}/me"
    params = {
        "fields": "id,username,account_type,media_count",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()

def get_my_media(limit=25):
    url = f"{BASE_URL}/me/media"
    params = {
        "fields": "id,caption,media_type,media_url,timestamp,like_count,comments_count,permalink",
        "limit": limit,
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()
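The `/me/media` endpoint returns results in pages: each response carries a `paging` object whose `next` field is a fully-qualified URL for the following page. A minimal sketch of a cursor-following helper (the name `paginate` and the injectable `fetch` parameter are my own additions, not part of the Graph API):

```python
def paginate(url, params=None, max_pages=5, fetch=None):
    """Follow Graph API cursor pagination, yielding items one page at a time.

    `fetch` defaults to requests.get; it is injectable so the pagination
    logic can be exercised without network access.
    """
    if fetch is None:
        import requests
        fetch = requests.get
    for _ in range(max_pages):
        if not url:
            break
        payload = fetch(url, params=params).json()
        for item in payload.get("data", []):
            yield item
        # The Graph API puts the fully-qualified next-page URL in paging.next
        url = payload.get("paging", {}).get("next")
        params = None  # the next URL already carries the query string
```

Usage would look like `for post in paginate(f"{BASE_URL}/me/media", {"fields": "id,caption", "access_token": ACCESS_TOKEN}):`.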

Hashtag Search

One of the more useful API features for research — but heavily rate-limited:

def search_hashtag(hashtag_name):
    # Step 1: Get hashtag ID
    url = f"{BASE_URL}/ig_hashtag_search"
    params = {
        "q": hashtag_name,
        "user_id": "YOUR_USER_ID",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    results = response.json().get("data", [])
    if not results:
        return {"error": f"No hashtag found for '{hashtag_name}'"}
    hashtag_id = results[0]["id"]

    # Step 2: Get recent media for that hashtag
    url = f"{BASE_URL}/{hashtag_id}/recent_media"
    params = {
        "user_id": "YOUR_USER_ID",
        "fields": "id,caption,media_type,permalink,timestamp",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()

Remember: you're limited to 30 unique hashtag searches per 7-day window. Plan carefully.
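Because the quota is per rolling 7-day window, it helps to track it client-side and fail fast before burning an API call. A sketch of such a tracker (the `HashtagBudget` class is hypothetical, and this only approximates Meta's server-side accounting):

```python
import time

class HashtagBudget:
    """Client-side approximation of the 30-unique-hashtags-per-7-days quota.

    Instagram counts unique hashtags per rolling 7-day window per account;
    this mirrors that locally so you can refuse a search before making it.
    """
    WINDOW = 7 * 24 * 3600
    LIMIT = 30

    def __init__(self, now=time.time):
        self._now = now    # injectable clock, so the logic is testable
        self._seen = {}    # hashtag -> timestamp of first search

    def can_search(self, hashtag):
        t = self._now()
        # Drop hashtags whose 7-day window has expired
        self._seen = {h: ts for h, ts in self._seen.items()
                      if t - ts < self.WINDOW}
        return hashtag in self._seen or len(self._seen) < self.LIMIT

    def record(self, hashtag):
        self._seen.setdefault(hashtag, self._now())
```

Re-searching an already-counted hashtag is free; only new unique hashtags consume the budget.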

Scraping Public Profiles with Requests

Instagram's web interface loads data through internal GraphQL endpoints. While these are undocumented and change frequently, they're the foundation of most scraping approaches.

Important disclaimer: Scraping Instagram outside their API may violate their Terms of Service. This information is for educational purposes. Always review and respect platform terms before scraping.

Profile Data via Web Endpoints

import requests
import json
import time

class InstagramScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36",
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest"
        })

    def get_profile(self, username):
        url = "https://www.instagram.com/api/v1/users/web_profile_info/"
        params = {"username": username}

        response = self.session.get(url, params=params, timeout=15)

        if response.status_code == 200:
            data = response.json()["data"]["user"]
            return {
                "username": data["username"],
                "full_name": data["full_name"],
                "bio": data["biography"],
                "followers": data["edge_followed_by"]["count"],
                "following": data["edge_follow"]["count"],
                "posts_count": data["edge_owner_to_timeline_media"]["count"],
                "is_private": data["is_private"],
                "is_verified": data["is_verified"],
                "profile_pic": data["profile_pic_url_hd"],
                "external_url": data.get("external_url")
            }
        elif response.status_code == 404:
            return {"error": "Profile not found"}
        else:
            return {"error": f"Status {response.status_code}"}

scraper = InstagramScraper()
profile = scraper.get_profile("natgeo")
if "error" not in profile:
    print(f"{profile['username']}: {profile['followers']:,} followers")

Getting Recent Posts

# These are methods of InstagramScraper above. _get_raw_profile fetches the
# same web_profile_info payload as get_profile but returns the raw user object.
def _get_raw_profile(self, username):
    response = self.session.get(
        "https://www.instagram.com/api/v1/users/web_profile_info/",
        params={"username": username},
        timeout=15
    )
    if response.status_code != 200:
        return None
    return response.json()["data"]["user"]

def get_user_posts(self, username, max_posts=50):
    profile_data = self._get_raw_profile(username)

    if not profile_data or profile_data.get("is_private"):
        return []

    edges = profile_data.get("edge_owner_to_timeline_media", {}).get("edges", [])

    posts = []
    for edge in edges[:max_posts]:
        node = edge["node"]
        posts.append({
            "id": node["id"],
            "shortcode": node["shortcode"],
            "url": f"https://www.instagram.com/p/{node['shortcode']}/",
            "caption": self._extract_caption(node),
            "likes": node.get("edge_liked_by", {}).get("count", 0),
            "comments": node.get("edge_media_to_comment", {}).get("count", 0),
            "timestamp": node["taken_at_timestamp"],
            "is_video": node["is_video"],
            "display_url": node["display_url"]
        })

    return posts

def _extract_caption(self, node):
    edges = node.get("edge_media_to_caption", {}).get("edges", [])
    if edges:
        return edges[0]["node"]["text"]
    return ""
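A small trick worth knowing when working with these nodes: historically, a post's `shortcode` has been a base64-style encoding of its numeric media ID, using a fixed 64-character alphabet. This mapping is undocumented and Meta could change it at any time, so treat the sketch below as an observation, not an API guarantee:

```python
# Alphabet historically used by Instagram shortcodes (undocumented; observed)
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789-_")

def media_id_to_shortcode(media_id):
    """Convert a numeric media ID to its URL shortcode."""
    media_id = int(str(media_id).split("_")[0])  # strip trailing _userid form
    code = ""
    while media_id > 0:
        media_id, rem = divmod(media_id, 64)
        code = ALPHABET[rem] + code
    return code

def shortcode_to_media_id(shortcode):
    """Convert a shortcode back to its numeric media ID."""
    media_id = 0
    for ch in shortcode:
        media_id = media_id * 64 + ALPHABET.index(ch)
    return media_id
```

This lets you correlate IDs seen in one endpoint with `/p/<shortcode>/` URLs seen in another without extra requests.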

Browser-Based Scraping with Playwright

When requests-based approaches get blocked (and they will — Instagram is aggressive about this), browser automation is the next step. Playwright renders the full page like a real browser, making it much harder to detect.

import asyncio
from playwright.async_api import async_playwright
import json

async def scrape_profile_playwright(username):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 800},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0.0.0 Safari/537.36"
        )
        page = await context.new_page()

        profile_data = {}

        async def handle_response(response):
            if "web_profile_info" in response.url:
                try:
                    data = await response.json()
                    profile_data.update(data)
                except Exception:
                    pass  # non-JSON responses (images, scripts) are expected

        page.on("response", handle_response)

        await page.goto(f"https://www.instagram.com/{username}/")
        await page.wait_for_load_state("networkidle")

        rendered_data = await page.evaluate("""
            () => {
                const meta = document.querySelector('meta[property="og:description"]');
                return meta ? meta.content : null;
            }
        """)

        await browser.close()

        return {
            "api_data": profile_data,
            "og_description": rendered_data
        }

result = asyncio.run(scrape_profile_playwright("natgeo"))
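The `og:description` fallback is useful because it survives even when the JSON endpoints are blocked. The tag has historically read something like "1.2M Followers, 300 Following, 4,521 Posts - ...", though the exact wording is not guaranteed. A sketch of a parser for that format (the function name and regex are my own):

```python
import re

_MULT = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def parse_og_description(text):
    """Extract follower/following/post counts from og:description content.

    Handles abbreviated counts like '1.2M' and comma-grouped ones like
    '4,521'. Returns an empty dict if the text doesn't match.
    """
    if not text:
        return {}
    result = {}
    pattern = r"([\d.,]+)\s*([KMB]?)\s+(Followers|Following|Posts)"
    for num, suffix, label in re.findall(pattern, text):
        value = float(num.replace(",", "")) * _MULT[suffix]
        result[label.lower()] = int(round(value))
    return result
```

Feed it the `og_description` field returned by `scrape_profile_playwright` as a sanity check against the intercepted API data.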

Scrolling Through Posts

To get more than the initial page of posts, you need to scroll:

async def scrape_posts_with_scroll(username, target_count=100):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        posts_data = []
        seen_urls = set()

        async def capture_posts(response):
            if "/graphql/query" in response.url:
                try:
                    data = await response.json()
                    edges = (data.get("data", {})
                            .get("user", {})
                            .get("edge_owner_to_timeline_media", {})
                            .get("edges", []))
                    for edge in edges:
                        url = edge["node"].get("shortcode")
                        if url and url not in seen_urls:
                            seen_urls.add(url)
                            posts_data.append(edge["node"])
                except Exception:
                    pass  # ignore non-JSON or unrelated GraphQL responses

        page.on("response", capture_posts)

        await page.goto(f"https://www.instagram.com/{username}/")
        await page.wait_for_timeout(3000)

        while len(posts_data) < target_count:
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

            at_bottom = await page.evaluate("""
                () => window.innerHeight + window.scrollY >= document.body.scrollHeight - 100
            """)
            if at_bottom:
                break

        await browser.close()
        return posts_data[:target_count]
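One failure mode of the loop above: if Instagram stops serving new posts before the page reaches the bottom, the scroll loop keeps spinning. A pure decision function for when to give up (a sketch; `should_stop` is a hypothetical helper you would call with `len(posts_data)` appended after each scroll round):

```python
def should_stop(counts, target, patience=3):
    """Decide whether to stop scrolling.

    counts: cumulative number of captured posts after each scroll round.
    Stop when the target is reached, or when the count has not grown for
    `patience` consecutive rounds (the feed has stopped loading).
    """
    if counts and counts[-1] >= target:
        return True
    if len(counts) <= patience:
        return False
    return counts[-1] == counts[-1 - patience]
```

Tracking progress this way avoids both infinite loops and giving up too early on a slow connection.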

Scraping Hashtag Pages

Hashtag exploration is valuable for content strategy and trend analysis:

async def scrape_hashtag(tag, max_posts=50):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        posts = []

        async def capture_hashtag_data(response):
            if "graphql" in response.url or "api/v1/tags" in response.url:
                try:
                    data = await response.json()
                    sections = data.get("data", {}).get("recent", {}).get("sections", [])
                    for section in sections:
                        medias = section.get("layout_content", {}).get("medias", [])
                        for media in medias:
                            node = media.get("media", {})
                            if node:
                                posts.append({
                                    "id": node.get("pk"),
                                    "shortcode": node.get("code"),
                                    "caption": (node.get("caption", {}) or {}).get("text", ""),
                                    "likes": node.get("like_count", 0),
                                    "comments": node.get("comment_count", 0),
                                    "owner": node.get("user", {}).get("username")
                                })
                except Exception:
                    pass  # ignore responses that don't match this shape

        page.on("response", capture_hashtag_data)

        await page.goto(f"https://www.instagram.com/explore/tags/{tag}/")
        await page.wait_for_timeout(5000)

        for _ in range(5):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)

        await browser.close()
        return posts[:max_posts]

results = asyncio.run(scrape_hashtag("webdevelopment"))
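For trend analysis, the captions you just scraped are often more valuable than the post metadata: counting which other hashtags co-occur with your target tag reveals adjacent niches. A small helper over the `scrape_hashtag` output (the function name is my own):

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r"#(\w+)")

def top_cooccurring_tags(posts, exclude=None, n=10):
    """Count hashtags across scraped captions, most common first.

    posts: dicts with a 'caption' key, as produced by scrape_hashtag.
    exclude: tags to omit (typically the tag you searched for).
    """
    exclude = {t.lower() for t in (exclude or [])}
    counter = Counter()
    for post in posts:
        for tag in HASHTAG_RE.findall(post.get("caption") or ""):
            tag = tag.lower()
            if tag not in exclude:
                counter[tag] += 1
    return counter.most_common(n)
```

Passing `exclude=["webdevelopment"]` keeps the searched tag itself out of the ranking.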

Handling Instagram's Anti-Bot Detection

Instagram is one of the most aggressive platforms when it comes to blocking scrapers. Here's what you'll face:

Common Blocks

  • Login walls: Instagram shows a login popup after viewing a few profiles
  • Rate limiting: Too many requests from one IP triggers temporary blocks
  • Challenge pages: CAPTCHA-like verification screens
  • Account lockouts: If you're using a logged-in session, the account can get suspended

Mitigation Strategies

import random
import time

import requests

class ResilientScraper:
    def __init__(self, proxies=None):
        self.proxies = proxies or []
        self.request_count = 0
        self.min_delay = 3
        self.max_delay = 8

    def _get_proxy(self):
        if self.proxies:
            return random.choice(self.proxies)
        return None

    def _rate_limit(self):
        delay = random.uniform(self.min_delay, self.max_delay)
        time.sleep(delay)
        self.request_count += 1

        if self.request_count % 20 == 0:
            pause = random.uniform(30, 60)
            print(f"Cooling down for {pause:.0f}s after {self.request_count} requests")
            time.sleep(pause)

    def scrape_profile(self, username):
        self._rate_limit()
        proxy = self._get_proxy()

        session = requests.Session()
        if proxy:
            session.proxies = {"http": proxy, "https": proxy}

        session.headers.update({
            "User-Agent": self._random_user_agent(),
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml"
        })

        try:
            response = session.get(
                "https://www.instagram.com/api/v1/users/web_profile_info/",
                params={"username": username},
                timeout=15
            )

            if response.status_code == 429:
                # Note: this retries indefinitely; cap retries in production
                print("Rate limited, backing off for 5 minutes")
                time.sleep(300)
                return self.scrape_profile(username)

            return response.json()
        except Exception as e:
            print(f"Error scraping {username}: {e}")
            return None

    def _random_user_agent(self):
        agents = [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
        ]
        return random.choice(agents)
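The recursive retry in `scrape_profile` above will loop forever against a hard block. A bounded alternative is exponential backoff with jitter; here is a generic sketch (`with_backoff` is a hypothetical wrapper, with the sleep function injectable so the policy can be tested without waiting):

```python
import random
import time

def with_backoff(fn, max_retries=4, base=2.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff plus jitter on failure.

    fn should return a value on success and raise on a retryable failure.
    After max_retries failed retries, the last exception propagates.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            # 2s, 4s, 8s, ... plus up to 1s of jitter to avoid lockstep retries
            sleep(base ** (attempt + 1) + random.uniform(0, 1))
```

Wrapping a scrape call looks like `with_backoff(lambda: scraper.scrape_profile("natgeo"))`.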

Practical Use Case: Social Media Analytics Dashboard

Let's put it all together with a real use case — building a competitive analytics tracker:

import json
from datetime import datetime

def analyze_competitor(username):
    scraper = InstagramScraper()

    profile = scraper.get_profile(username)
    if "error" in profile:
        return profile

    posts = scraper.get_user_posts(username, max_posts=30)

    if not posts:
        return {"profile": profile, "analytics": "No public posts"}

    total_likes = sum(p["likes"] for p in posts)
    total_comments = sum(p["comments"] for p in posts)
    avg_likes = total_likes / len(posts)
    avg_comments = total_comments / len(posts)

    engagement_rate = (
        (avg_likes + avg_comments) / profile["followers"] * 100
        if profile["followers"] > 0 else 0
    )

    if len(posts) >= 2:
        newest = posts[0]["timestamp"]
        oldest = posts[-1]["timestamp"]
        days_span = (newest - oldest) / 86400
        posts_per_week = len(posts) / (days_span / 7) if days_span > 0 else 0
    else:
        posts_per_week = 0

    return {
        "profile": profile,
        "analytics": {
            "avg_likes": round(avg_likes),
            "avg_comments": round(avg_comments),
            "engagement_rate": f"{engagement_rate:.2f}%",
            "posts_per_week": round(posts_per_week, 1),
            "top_post": max(posts, key=lambda p: p["likes"]),
            "analyzed_posts": len(posts)
        }
    }

competitors = ["competitor1", "competitor2", "competitor3"]
for comp in competitors:
    print(f"\nAnalyzing @{comp}...")
    result = analyze_competitor(comp)
    print(json.dumps(result["analytics"], indent=2))
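One caveat about the mean-based engagement rate above: a single viral post skews it badly. Using the median instead gives a figure closer to a typical post; a sketch (the function name is my own):

```python
import statistics

def robust_engagement(posts, followers):
    """Median-based engagement rate in percent.

    Less sensitive to one viral outlier than the mean used in
    analyze_competitor; posts are dicts with 'likes' and 'comments' keys.
    """
    if not posts or followers <= 0:
        return 0.0
    med_likes = statistics.median(p["likes"] for p in posts)
    med_comments = statistics.median(p["comments"] for p in posts)
    return (med_likes + med_comments) / followers * 100
```

Reporting both the mean and the median makes it obvious when an account's numbers are carried by one or two outlier posts.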

When to Use a Managed Scraping Solution

Building and maintaining Instagram scrapers is a significant time investment. The platform changes constantly, proxies get burned, and login sessions expire. For ongoing projects, a managed solution is often the pragmatic choice.

I maintain an Instagram Scraper on Apify that handles proxy rotation, session management, and adapts to Instagram's frequent changes automatically. It supports profiles, posts, hashtags, and comments — all through a simple API.

When does it make sense to build your own vs. use a managed tool?

| Factor | Build Your Own | Managed Solution |
|---|---|---|
| Volume | <100 profiles/day | 100+ profiles/day |
| Maintenance time | You have time to fix breakages | You need reliability |
| Proxy costs | You already have proxies | Proxies included |
| Data needs | Very custom extraction | Standard profile/post data |

Key Takeaways

  1. The official API is limited but stable. Use it for your own account analytics and basic hashtag research.

  2. Requests-based scraping works but breaks often. Instagram changes endpoints every few weeks. Budget time for maintenance.

  3. Playwright is more resilient because it renders the full page, but it's slower and resource-intensive.

  4. Rate limiting is non-negotiable. Random delays, proxy rotation, and cool-down periods aren't optional — they're required for any scraper that needs to run for more than a day.

  5. Private profiles are off-limits. There's no ethical or reliable way to scrape private account data. Don't try.

  6. Always have a backup plan. Instagram scraping is an arms race. Whatever works today might not work next week. Design your systems to handle failures gracefully.

The landscape changes fast, but the fundamentals stay the same: respect rate limits, handle errors, and keep your scraping code modular so you can swap out broken components without rewriting everything.
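That modularity can be made concrete with a fallback chain: try the cheapest strategy first and escalate only when it fails. A sketch (the `scrape_with_fallback` helper is hypothetical; the strategies would be your own API, requests, and Playwright implementations):

```python
def scrape_with_fallback(username, strategies):
    """Try each scraping strategy in order until one returns usable data.

    strategies: list of (name, callable) pairs, e.g. official API first,
    then the requests-based scraper, then Playwright. Each callable takes
    a username and returns a dict; raising or returning an 'error' dict
    counts as failure and moves on to the next strategy.
    """
    errors = {}
    for name, strategy in strategies:
        try:
            result = strategy(username)
            if result and "error" not in result:
                return {"source": name, "data": result}
            errors[name] = (result or {}).get("error", "empty result")
        except Exception as e:
            errors[name] = str(e)
    return {"source": None, "errors": errors}
```

Because each strategy is swappable, a broken scraper degrades the pipeline instead of killing it, and the `errors` dict tells you which layer to fix.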


What's your experience with Instagram scraping? Found any approaches I didn't cover? Let me know in the comments.
