Scraping post interactions (comments, likes, replies, timestamps, user handles) lets you analyze engagement quality, sentiment, and growth. Below is a practical, safe-by-design approach you can implement with the reference code in this repo:
https://github.com/Instagram-Automations/instagram-post-scraper
Ground Rules (Read First)
Respect terms & law: Always follow Instagram’s Terms of Use and local laws. Prefer first-party APIs where possible.
Be transparent: If you process user data, disclose it and store only what you need.
Throttle requests: Human-like pacing reduces the chance of blocks.
For an implementation baseline, review the examples in the repo: instagram-post-scraper.
Approaches (Choose Based on Your Needs)
1) Official/Partner APIs (Safest)
For Business/Creator accounts you control, use Meta’s Graph APIs to fetch comments and replies reliably.
Pros: Stable, policy-compliant, structured data.
Cons: Limited to authorized assets; no public-wide coverage.
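As a rough sketch of what the official route can look like, the helper below only builds the request URL for the comments edge of a media object you are authorized to read. The API version string, the `comments_url` name, and the exact field list are assumptions for illustration; check them against the current Graph API reference before relying on them.

```python
from urllib.parse import urlencode

GRAPH_BASE = "https://graph.facebook.com"
API_VERSION = "v19.0"  # assumption: pin whichever version your app targets


def comments_url(media_id: str, access_token: str, limit: int = 50) -> str:
    """Build the URL for the comments edge of one media object (hypothetical helper)."""
    params = {
        # Assumed field names; verify against the current API reference.
        "fields": "id,text,timestamp,username,like_count",
        "limit": limit,
        "access_token": access_token,
    }
    return f"{GRAPH_BASE}/{API_VERSION}/{media_id}/comments?{urlencode(params)}"
```

Send the resulting URL with any HTTP client and keep the access token out of logs and source control.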
2) Lightweight HTML Parsing (Fastest Setup)
For public posts, fetch the post page, extract the embedded JSON (sharedData / GraphQL), and parse:
Post metadata: id, shortcode, caption, owner
Edges: comment text, user, timestamp, likes on comments
Add pagination by following end_cursor tokens.
See pagination patterns in the repo’s utilities: GitHub code.
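The cursor-following idea can be sketched independently of any particular endpoint. In the example below, `fetch_page` is a hypothetical callable you supply; it is assumed to return a dict with `edges` (each holding a `node` with an `id`) and `page_info` (with `has_next_page` and `end_cursor`). The real payload shape may differ, so treat the key names as placeholders.

```python
from typing import Callable, Iterator, Optional


def paginate(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: int = 50,
) -> Iterator[dict]:
    """Yield unique nodes, following end_cursor until the data runs out."""
    seen = set()
    cursor = None
    for _ in range(max_pages):  # hard cap so a bad cursor cannot loop forever
        page = fetch_page(cursor)
        fresh = 0
        for edge in page.get("edges", []):
            node = edge["node"]
            if node["id"] in seen:
                continue  # duplicate edge, skip it
            seen.add(node["id"])
            fresh += 1
            yield node
        info = page.get("page_info", {})
        # Stop on an all-duplicate page or when the server reports no more pages.
        if fresh == 0 or not info.get("has_next_page"):
            break
        cursor = info.get("end_cursor")
```

Passing the fetcher in as a callable keeps the loop testable with canned pages and lets you swap HTML parsing for a headless session without touching the pagination logic.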
3) Headless Browser (Most Robust)
Use Playwright/Puppeteer to render dynamic content and infinite-scroll the comments/likers drawer.
Rotate proxies, set viewport & language, and inject think-time between actions.
Ideal when HTML endpoints change or require user interaction to reveal more items.
Reference flows: example scripts in the repo.
What to Capture (Minimal Clean Schema)
Post
post_id, shortcode, owner_username, caption, taken_at, like_count, comment_count
Comment
comment_id, post_id, author_username, text, created_at, like_count, parent_id (for replies)
Liker
post_id, username, fetched_at
The repo includes patterns for mapping these fields cleanly:
https://github.com/Instagram-Automations/instagram-post-scraper
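One way to pin the schema down in code is a set of frozen dataclasses. The field names come from the lists above; the Python types (for example `datetime` for timestamps and `Optional[str]` for `parent_id`) are assumptions, not something the repo is known to prescribe.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Post:
    post_id: str
    shortcode: str
    owner_username: str
    caption: str
    taken_at: datetime
    like_count: int
    comment_count: int


@dataclass(frozen=True)
class Comment:
    comment_id: str
    post_id: str
    author_username: str
    text: str
    created_at: datetime
    like_count: int = 0
    parent_id: Optional[str] = None  # set only for replies


@dataclass(frozen=True)
class Liker:
    post_id: str
    username: str
    fetched_at: datetime
```

Frozen instances are hashable, which makes it easy to drop them into a set when deduplicating in memory.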
Anti-Block & Reliability Checklist
Rotate IPs/ASNs: Use residential or datacenter (DC) proxies with pool rotation.
Session hygiene: Reuse authenticated sessions cautiously; refresh cookies when needed.
Human cadence: Use randomized delays and typing-like interactions in headless runs.
Retry/backoff: Exponential backoff on 429/403; circuit-break on repeated errors.
Pagination guards: Stop when has_next_page=false or duplicate edge IDs appear.
Deduplicate: Use (post_id, comment_id) or (post_id, username, created_at) as keys.
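The retry/backoff item from the checklist could look roughly like this. It is a minimal sketch: `send` is a hypothetical callable returning a `(status, body)` pair, `CircuitOpen` is an illustrative exception name, and the delay values are arbitrary starting points rather than tuned numbers.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {403, 429}  # statuses treated as "slow down" signals


class CircuitOpen(Exception):
    """Raised when repeated throttling suggests we should stop for now."""


def fetch_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    for attempt in range(max_attempts):
        status, body = send()
        if status == 200:
            return body
        if status not in RETRYABLE:
            raise RuntimeError(f"unexpected status {status}")
        if attempt < max_attempts - 1:
            # Exponential ceiling (1s, 2s, 4s, ...) with random jitter below it.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
    raise CircuitOpen(f"gave up after {max_attempts} throttled attempts")
```

Injecting `sleep` keeps the function fast to test, and catching `CircuitOpen` at the job level gives you one place to pause the whole run.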
Basic Workflow (Step-by-Step)
Input: Provide a shortcode or full post URL.
Fetch: Request the post page or open it in a headless session.
Extract: Parse embedded JSON for edges (comments/likers).
Paginate: Follow end_cursor until exhausted or you reach a limit.
Normalize: Map fields to the schema above; strip emojis/HTML safely.
Store: Save to SQLite/Postgres/CSV with unique indices.
Monitor: Log rate limits, error codes, and cursor positions to resume.
You’ll find working patterns for steps 2–6 inside:
instagram-post-scraper (GitHub)
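For the Store step, a unique index lets the database handle deduplication for you. The sketch below uses Python's built-in `sqlite3` module; the table layout mirrors the Comment schema above, while the function names `open_store` and `save_comments` are made up for this example.

```python
import sqlite3
from typing import Iterable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS comments (
            post_id         TEXT NOT NULL,
            comment_id      TEXT NOT NULL,
            author_username TEXT,
            text            TEXT,
            created_at      TEXT,
            like_count      INTEGER,
            parent_id       TEXT,
            UNIQUE (post_id, comment_id)
        )
        """
    )
    return conn


def save_comments(conn: sqlite3.Connection, rows: Iterable[dict]) -> int:
    """Insert rows, silently skipping duplicates; return how many were new."""
    before = conn.total_changes
    conn.executemany(
        """
        INSERT OR IGNORE INTO comments
            (post_id, comment_id, author_username, text,
             created_at, like_count, parent_id)
        VALUES
            (:post_id, :comment_id, :author_username, :text,
             :created_at, :like_count, :parent_id)
        """,
        list(rows),
    )
    conn.commit()
    return conn.total_changes - before
```

Because duplicates are ignored at insert time, a resumed run can safely re-fetch the last page it saw without creating double rows.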
Tips for Comments vs. Likers
Comments: Often available via GraphQL edges; replies nest under threaded_comments. Ensure recursion depth and parent IDs are handled.
Likers: Typically revealed after clicking the “likes” count/modal. With headless browsing, scroll the modal and harvest batches until no new entries are found.
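Handling nested replies usually comes down to flattening the tree while recording each reply's parent. The sketch below assumes each comment node is a dict with an `id`, optional `text`, and an optional `threaded_comments` list of child nodes; the real payload may wrap replies differently, so adjust the key names to what you actually receive.

```python
from typing import Iterable, Iterator, Optional


def flatten_comments(
    nodes: Iterable[dict],
    post_id: str,
    parent_id: Optional[str] = None,
    depth: int = 0,
    max_depth: int = 3,
) -> Iterator[dict]:
    """Yield one flat row per comment, parents before their replies."""
    for node in nodes:
        yield {
            "comment_id": node["id"],
            "post_id": post_id,
            "text": node.get("text", ""),
            "parent_id": parent_id,
        }
        replies = node.get("threaded_comments") or []
        # Depth guard: stop descending past max_depth levels of replies.
        if replies and depth < max_depth:
            yield from flatten_comments(
                replies, post_id, node["id"], depth + 1, max_depth
            )
```

The flat rows line up with the Comment schema, so top-level comments carry `parent_id=None` and replies point at the comment they answer.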
Quick Win Ideas
Sentiment & keywords: Run basic NLP on comments to rank posts by audience mood.
Creator vetting: Cross-reference commenters with follower counts to spot engaged micro-influencers.
Campaign QA: Compare like/comment velocity before/after promotions.
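The campaign QA idea can be tried with nothing more than comment timestamps. This is a small sketch, assuming you already hold `created_at` values as `datetime` objects; `comment_velocity` is an illustrative name, and "comments per hour" is just one reasonable way to define velocity.

```python
from datetime import datetime
from typing import Iterable


def comment_velocity(
    timestamps: Iterable[datetime], start: datetime, end: datetime
) -> float:
    """Return comments per hour within the half-open window [start, end)."""
    hours = (end - start).total_seconds() / 3600
    if hours <= 0:
        raise ValueError("end must be after start")
    count = sum(1 for t in timestamps if start <= t < end)
    return count / hours
```

Call it once for an equal-length window on each side of the promotion start and compare the two numbers; a ratio well above 1 suggests the promotion lifted engagement.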
Final Note & CTA
You can adapt and ship a production-ready interaction scraper by starting from the examples and patterns here:
https://github.com/Instagram-Automations/instagram-post-scraper
Explore the code, copy a starter, and extend it for your stack; pagination, dedupe, and anti-block are already outlined. Dive in: instagram-post-scraper on GitHub.