Instagram is one of the hardest platforms to scrape in 2026. Meta has locked down its APIs, aggressively blocks automated requests, and regularly changes its internal endpoints. But the data is incredibly valuable — for social media analytics, influencer research, brand monitoring, and market research.
This guide covers what actually works right now, from the official API to browser-based scraping with Playwright.
The Official Instagram Graph API: What You Get (and What You Don't)
Meta's Instagram Graph API is the sanctioned way to access Instagram data. But it has significant limitations.
What the API Can Do
- Access your own business/creator account's posts, stories, and insights
- Get basic public profile information
- Search hashtags (limited to 30 unique hashtags per 7 days per account)
- Read comments on your own posts
- Publish content to business accounts
What the API Cannot Do
- Access any private profile data
- Scrape followers/following lists of other accounts
- Get posts from public profiles you don't own (without the profile's authorization)
- Search for users or explore content freely
- Access stories from other accounts
Basic Setup
```python
import requests

ACCESS_TOKEN = "YOUR_LONG_LIVED_TOKEN"
BASE_URL = "https://graph.instagram.com/v19.0"

def get_my_profile():
    url = f"{BASE_URL}/me"
    params = {
        "fields": "id,username,account_type,media_count",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()

def get_my_media(limit=25):
    url = f"{BASE_URL}/me/media"
    params = {
        "fields": "id,caption,media_type,media_url,timestamp,like_count,comments_count,permalink",
        "limit": limit,
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()
```
Hashtag Search
One of the more useful API features for research — but heavily rate-limited:
```python
def search_hashtag(hashtag_name):
    # Step 1: Get the hashtag ID
    url = f"{BASE_URL}/ig_hashtag_search"
    params = {
        "q": hashtag_name,
        "user_id": "YOUR_USER_ID",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    hashtag_id = response.json()["data"][0]["id"]

    # Step 2: Get recent media for that hashtag
    url = f"{BASE_URL}/{hashtag_id}/recent_media"
    params = {
        "user_id": "YOUR_USER_ID",
        "fields": "id,caption,media_type,permalink,timestamp",
        "access_token": ACCESS_TOKEN
    }
    response = requests.get(url, params=params)
    return response.json()
```
Remember: you're limited to 30 unique hashtag searches per 7-day window. Plan carefully.
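Because the quota counts *unique* hashtags over a rolling 7-day window, it's worth tracking which tags you've already queried before spending a slot. A minimal sketch of a local quota tracker; the cache file name, storage format, and helper names are my own, not part of Meta's API:

```python
import json
import time
from pathlib import Path

QUOTA_FILE = Path("hashtag_quota.json")  # hypothetical local cache file
WINDOW_SECONDS = 7 * 24 * 3600           # Meta's 7-day rolling window
MAX_UNIQUE_HASHTAGS = 30

def hashtags_remaining():
    """Return how many new unique hashtags can still be searched this window."""
    record = json.loads(QUOTA_FILE.read_text()) if QUOTA_FILE.exists() else {}
    cutoff = time.time() - WINDOW_SECONDS
    # Drop entries that have aged out of the rolling window
    record = {tag: ts for tag, ts in record.items() if ts > cutoff}
    QUOTA_FILE.write_text(json.dumps(record))
    return MAX_UNIQUE_HASHTAGS - len(record)

def register_hashtag(tag):
    """Record a hashtag search; return False if it would exceed the quota."""
    record = json.loads(QUOTA_FILE.read_text()) if QUOTA_FILE.exists() else {}
    if tag in record:  # repeat searches of an already-counted tag are free
        return True
    if len(record) >= MAX_UNIQUE_HASHTAGS:
        return False
    record[tag] = time.time()
    QUOTA_FILE.write_text(json.dumps(record))
    return True
```

Call `register_hashtag(name)` before `search_hashtag(name)` and skip the API call when it returns False.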
Scraping Public Profiles with Requests
Instagram's web interface loads data through internal GraphQL endpoints. While these are undocumented and change frequently, they're the foundation of most scraping approaches.
Important disclaimer: Scraping Instagram outside their API may violate their Terms of Service. This information is for educational purposes. Always review and respect platform terms before scraping.
Profile Data via Web Endpoints
```python
import requests

class InstagramScraper:
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36",
            "X-IG-App-ID": "936619743392459",
            "X-Requested-With": "XMLHttpRequest"
        })

    def get_profile(self, username):
        url = "https://www.instagram.com/api/v1/users/web_profile_info/"
        params = {"username": username}
        response = self.session.get(url, params=params)
        if response.status_code == 200:
            data = response.json()["data"]["user"]
            return {
                "username": data["username"],
                "full_name": data["full_name"],
                "bio": data["biography"],
                "followers": data["edge_followed_by"]["count"],
                "following": data["edge_follow"]["count"],
                "posts_count": data["edge_owner_to_timeline_media"]["count"],
                "is_private": data["is_private"],
                "is_verified": data["is_verified"],
                "profile_pic": data["profile_pic_url_hd"],
                "external_url": data.get("external_url")
            }
        elif response.status_code == 404:
            return {"error": "Profile not found"}
        else:
            return {"error": f"Status {response.status_code}"}

scraper = InstagramScraper()
profile = scraper.get_profile("natgeo")
print(f"{profile['username']}: {profile['followers']:,} followers")
```
Getting Recent Posts
These methods extend the `InstagramScraper` class above. `_get_raw_profile` (not shown) fetches the raw `user` dict from the same `web_profile_info` endpoint that `get_profile` uses:

```python
    def get_user_posts(self, username, max_posts=50):
        profile_data = self._get_raw_profile(username)
        if not profile_data or profile_data.get("is_private"):
            return []
        edges = profile_data.get("edge_owner_to_timeline_media", {}).get("edges", [])
        posts = []
        for edge in edges[:max_posts]:
            node = edge["node"]
            posts.append({
                "id": node["id"],
                "shortcode": node["shortcode"],
                "url": f"https://www.instagram.com/p/{node['shortcode']}/",
                "caption": self._extract_caption(node),
                "likes": node.get("edge_liked_by", {}).get("count", 0),
                "comments": node.get("edge_media_to_comment", {}).get("count", 0),
                "timestamp": node["taken_at_timestamp"],
                "is_video": node["is_video"],
                "display_url": node["display_url"]
            })
        return posts

    def _extract_caption(self, node):
        edges = node.get("edge_media_to_caption", {}).get("edges", [])
        if edges:
            return edges[0]["node"]["text"]
        return ""
```
Browser-Based Scraping with Playwright
When requests-based approaches get blocked (and they will — Instagram is aggressive about this), browser automation is the next step. Playwright renders the full page like a real browser, making it much harder to detect.
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_profile_playwright(username):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context(
            viewport={"width": 1280, "height": 800},
            user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/124.0.0.0 Safari/537.36"
        )
        page = await context.new_page()
        profile_data = {}

        # Capture the JSON payload Instagram's own frontend fetches internally
        async def handle_response(response):
            if "web_profile_info" in response.url:
                try:
                    data = await response.json()
                    profile_data.update(data)
                except Exception:
                    pass  # non-JSON or truncated response; ignore

        page.on("response", handle_response)
        await page.goto(f"https://www.instagram.com/{username}/")
        await page.wait_for_load_state("networkidle")

        # Fallback: the og:description meta tag also carries follower/post counts
        rendered_data = await page.evaluate("""
            () => {
                const meta = document.querySelector('meta[property="og:description"]');
                return meta ? meta.content : null;
            }
        """)
        await browser.close()
        return {
            "api_data": profile_data,
            "og_description": rendered_data
        }

result = asyncio.run(scrape_profile_playwright("natgeo"))
```
Scrolling Through Posts
To get more than the initial page of posts, you need to scroll:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_posts_with_scroll(username, target_count=100):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        posts_data = []
        seen_shortcodes = set()

        # Each scroll triggers a GraphQL request for the next page of posts
        async def capture_posts(response):
            if "/graphql/query" in response.url:
                try:
                    data = await response.json()
                    edges = (data.get("data", {})
                             .get("user", {})
                             .get("edge_owner_to_timeline_media", {})
                             .get("edges", []))
                    for edge in edges:
                        shortcode = edge["node"].get("shortcode")
                        if shortcode and shortcode not in seen_shortcodes:
                            seen_shortcodes.add(shortcode)
                            posts_data.append(edge["node"])
                except Exception:
                    pass  # ignore non-JSON responses

        page.on("response", capture_posts)
        await page.goto(f"https://www.instagram.com/{username}/")
        await page.wait_for_timeout(3000)

        while len(posts_data) < target_count:
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)
            at_bottom = await page.evaluate(
                "() => window.innerHeight + window.scrollY >= document.body.scrollHeight - 100"
            )
            if at_bottom:
                break

        await browser.close()
        return posts_data[:target_count]
```
Scraping Hashtag Pages
Hashtag exploration is valuable for content strategy and trend analysis:
```python
import asyncio
from playwright.async_api import async_playwright

async def scrape_hashtag(tag, max_posts=50):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()
        posts = []

        async def capture_hashtag_data(response):
            if "graphql" in response.url or "api/v1/tags" in response.url:
                try:
                    data = await response.json()
                    sections = data.get("data", {}).get("recent", {}).get("sections", [])
                    for section in sections:
                        medias = section.get("layout_content", {}).get("medias", [])
                        for media in medias:
                            node = media.get("media", {})
                            if node:
                                posts.append({
                                    "id": node.get("pk"),
                                    "shortcode": node.get("code"),
                                    "caption": (node.get("caption") or {}).get("text", ""),
                                    "likes": node.get("like_count", 0),
                                    "comments": node.get("comment_count", 0),
                                    "owner": node.get("user", {}).get("username")
                                })
                except Exception:
                    pass  # ignore non-JSON responses

        page.on("response", capture_hashtag_data)
        await page.goto(f"https://www.instagram.com/explore/tags/{tag}/")
        await page.wait_for_timeout(5000)
        for _ in range(5):
            await page.evaluate("window.scrollBy(0, 1000)")
            await page.wait_for_timeout(2000)
        await browser.close()
        return posts[:max_posts]

results = asyncio.run(scrape_hashtag("webdevelopment"))
```
Handling Instagram's Anti-Bot Detection
Instagram is one of the most aggressive platforms when it comes to blocking scrapers. Here's what you'll face:
Common Blocks
- Login walls: Instagram shows a login popup after viewing a few profiles
- Rate limiting: Too many requests from one IP triggers temporary blocks
- Challenge pages: CAPTCHA-like verification screens
- Account lockouts: If you're using a logged-in session, the account can get suspended
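Before deciding how to react, it helps to classify which of these blocks a response represents. A rough heuristic classifier; the marker strings and status-code mapping are my own observations, not a documented contract, and will drift as Instagram changes:

```python
def classify_block(status_code, body_text):
    """Heuristically classify an Instagram response.

    Returns one of: 'ok', 'rate_limited', 'not_found', 'login_wall',
    'challenge', 'unknown'. Marker strings are assumptions, not a stable API.
    """
    if status_code == 429:
        return "rate_limited"        # back off hard, rotate IP
    if status_code == 404:
        return "not_found"           # profile deleted or renamed
    if status_code in (301, 302) or "accounts/login" in body_text:
        return "login_wall"          # anonymous quota exhausted
    if status_code == 403 or "challenge" in body_text:
        return "challenge"           # verification screen; stop this session
    if status_code == 200:
        return "ok"
    return "unknown"
```

Routing each category to a different recovery strategy (long sleep, proxy rotation, session retirement) keeps one block type from poisoning the whole run.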
Mitigation Strategies
```python
import random
import time

import requests

class ResilientScraper:
    def __init__(self, proxies=None):
        self.proxies = proxies or []
        self.request_count = 0
        self.min_delay = 3
        self.max_delay = 8

    def _get_proxy(self):
        if self.proxies:
            return random.choice(self.proxies)
        return None

    def _rate_limit(self):
        delay = random.uniform(self.min_delay, self.max_delay)
        time.sleep(delay)
        self.request_count += 1
        if self.request_count % 20 == 0:
            pause = random.uniform(30, 60)
            print(f"Cooling down for {pause:.0f}s after {self.request_count} requests")
            time.sleep(pause)

    def scrape_profile(self, username):
        self._rate_limit()
        proxy = self._get_proxy()
        session = requests.Session()
        if proxy:
            session.proxies = {"http": proxy, "https": proxy}
        session.headers.update({
            "User-Agent": self._random_user_agent(),
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml"
        })
        try:
            response = session.get(
                "https://www.instagram.com/api/v1/users/web_profile_info/",
                params={"username": username},
                timeout=15
            )
            if response.status_code == 429:
                print("Rate limited, backing off")
                time.sleep(300)
                return self.scrape_profile(username)  # retry after cooling down
            return response.json()
        except Exception as e:
            print(f"Error scraping {username}: {e}")
            return None

    def _random_user_agent(self):
        agents = [
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"
        ]
        return random.choice(agents)
```
Practical Use Case: Social Media Analytics Dashboard
Let's put it all together with a real use case — building a competitive analytics tracker:
This builds on the `InstagramScraper` class from earlier:

```python
import json

def analyze_competitor(username):
    scraper = InstagramScraper()
    profile = scraper.get_profile(username)
    if "error" in profile:
        return profile
    posts = scraper.get_user_posts(username, max_posts=30)
    if not posts:
        return {"profile": profile, "analytics": "No public posts"}

    total_likes = sum(p["likes"] for p in posts)
    total_comments = sum(p["comments"] for p in posts)
    avg_likes = total_likes / len(posts)
    avg_comments = total_comments / len(posts)
    engagement_rate = (
        (avg_likes + avg_comments) / profile["followers"] * 100
        if profile["followers"] > 0 else 0
    )

    # Estimate posting frequency from the newest and oldest post timestamps
    if len(posts) >= 2:
        newest = posts[0]["timestamp"]
        oldest = posts[-1]["timestamp"]
        days_span = (newest - oldest) / 86400
        posts_per_week = len(posts) / (days_span / 7) if days_span > 0 else 0
    else:
        posts_per_week = 0

    return {
        "profile": profile,
        "analytics": {
            "avg_likes": round(avg_likes),
            "avg_comments": round(avg_comments),
            "engagement_rate": f"{engagement_rate:.2f}%",
            "posts_per_week": round(posts_per_week, 1),
            "top_post": max(posts, key=lambda p: p["likes"]),
            "analyzed_posts": len(posts)
        }
    }

competitors = ["competitor1", "competitor2", "competitor3"]
for comp in competitors:
    print(f"\nAnalyzing @{comp}...")
    result = analyze_competitor(comp)
    if "analytics" in result:
        print(json.dumps(result["analytics"], indent=2, default=str))
    else:
        print(f"Skipped: {result.get('error')}")
```
When to Use a Managed Scraping Solution
Building and maintaining Instagram scrapers is a significant time investment. The platform changes constantly, proxies get burned, and login sessions expire. For ongoing projects, a managed solution is often the pragmatic choice.
I maintain an Instagram Scraper on Apify that handles proxy rotation, session management, and adapts to Instagram's frequent changes automatically. It supports profiles, posts, hashtags, and comments — all through a simple API.
When does it make sense to build your own vs. use a managed tool?
| Factor | Build Your Own | Managed Solution |
|---|---|---|
| Volume | <100 profiles/day | 100+ profiles/day |
| Maintenance time | You have time to fix breakages | You need reliability |
| Proxy costs | You already have proxies | Proxies included |
| Data needs | Very custom extraction | Standard profile/post data |
Key Takeaways
The official API is limited but stable. Use it for your own account analytics and basic hashtag research.
Requests-based scraping works but breaks often. Instagram changes endpoints every few weeks. Budget time for maintenance.
Playwright is more resilient because it renders the full page, but it's slower and resource-intensive.
Rate limiting is non-negotiable. Random delays, proxy rotation, and cool-down periods aren't optional — they're required for any scraper that needs to run for more than a day.
Private profiles are off-limits. There's no ethical or reliable way to scrape private account data. Don't try.
Always have a backup plan. Instagram scraping is an arms race. Whatever works today might not work next week. Design your systems to handle failures gracefully.
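One concrete way to handle failures gracefully is to wrap every scraper call in a retry decorator with exponential backoff, so a transient block degrades a run instead of killing it. A minimal sketch; the retry counts and delays are illustrative defaults, not tuned values:

```python
import functools
import random
import time

def with_backoff(max_retries=3, base_delay=5):
    """Retry a failing function with exponential backoff plus jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_retries:
                        raise  # out of retries: surface the error to the caller
                    # Double the delay each attempt, with jitter to avoid patterns
                    delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                    print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
                    time.sleep(delay)
        return wrapper
    return decorator
```

Applied as `@with_backoff(max_retries=2)` on something like `scrape_profile`, it isolates retry logic from scraping logic, which keeps components swappable when one of them breaks.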
The landscape changes fast, but the fundamentals stay the same: respect rate limits, handle errors, and keep your scraping code modular so you can swap out broken components without rewriting everything.
What's your experience with Instagram scraping? Found any approaches I didn't cover? Let me know in the comments.