DEV Community

Tsvetan Gerginov
Building a Resilient Instagram Scraper With Selenium — What Mimicking Human Behavior Actually Looks Like

Up front: this is a personal/research tool for downloading from public Instagram profiles. Use it responsibly and within Instagram's Terms of Service and your local laws. This post is about the engineering — specifically, what it takes to make browser automation behave less like a bot — not about evading anyone.

Scraping any modern social platform is less a parsing problem and more a behavioral one. The HTML is the easy part. The hard part is that the site is actively watching how you act, and the moment you act like a script — instant scroll to the bottom, requests at machine speed, no pauses — you hit a challenge page and you're done.

I built InstagramWrapperPostScraper as a Python + Selenium tool that drives a real Microsoft Edge browser to download photos, videos, and captions from public profiles. The interesting engineering isn't "how do I find the image URL" — it's "how do I make a browser automation script move through a page the way a person would." MIT licensed, Python 3.10+.

Why a real browser instead of an API or HTTP

There are three broad ways to pull data off Instagram, and they fail differently:

  • API approaches run into rate limits fast and require credentials/tokens that get throttled
  • Plain HTTP scraping is brittle and trivially detectable — no JS execution, obvious request patterns
  • Driving a real browser (this approach) executes the actual page JS, renders like a human's session, and can keep working through temporary rate-limit blocks

The tradeoff: a real browser is slower and heavier. But for a personal-scale download tool, reliability beats speed.

The actually-interesting part: human-like behavior

The most recent version (0.0.2) is almost entirely about making the scroll behavior look human, and this is the part I'd point any automation person to. A naive scraper does scrollTo(bottom) and fires requests as fast as the network allows. This one deliberately doesn't:

  • Randomized scroll steps — it scrolls 50–90% of the viewport at a time, not straight to the bottom
  • Occasional scroll-ups — sometimes it scrolls back up, the way a human re-reads something
  • Random pauses — 2–5 seconds between actions instead of hammering
  • Longer initial waits — 4–7 seconds when first opening a profile (bumped up from 3–5s)
  • Periodic challenge checks — every 10 scrolls it checks whether a rate-limit/challenge page has appeared
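The list above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names, the 15% scroll-up probability, and the `is_blocked` callback are assumptions made for the example, while the 50–90% step, 2–5 second pause, 4–7 second initial wait, and every-10-scrolls check come from the list. `driver` is any object with Selenium's `execute_script` method.

```python
import random
import time

SCROLL_UP_CHANCE = 0.15  # assumed value; the post only says "occasional"


def plan_scroll(viewport_height, rng=random):
    """Return (pixels, pause_seconds) for one human-like scroll step."""
    pixels = int(viewport_height * rng.uniform(0.50, 0.90))  # 50-90% of viewport
    if rng.random() < SCROLL_UP_CHANCE:
        pixels = -(pixels // 2)  # sometimes scroll back up a bit
    return pixels, rng.uniform(2.0, 5.0)  # 2-5 s pause after the step


def scroll_session(driver, steps, is_blocked, check_every=10,
                   sleep=time.sleep, rng=random):
    """Scroll `steps` times; return False if a challenge check fires."""
    sleep(rng.uniform(4.0, 7.0))  # longer initial wait on profile open
    viewport = driver.execute_script("return window.innerHeight")
    for step in range(1, steps + 1):
        pixels, pause = plan_scroll(viewport, rng)
        driver.execute_script("window.scrollBy(0, arguments[0]);", pixels)
        sleep(pause)
        # Periodic check: only every `check_every` scrolls, not every step.
        if step % check_every == 0 and is_blocked(driver):
            return False
    return True
```

Passing `sleep` and `rng` in as parameters is a design choice for the sketch: it lets you run the loop in a test with a seeded `random.Random` and no real waiting.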

That last point connects to the other 0.0.2 improvement: a dedicated _is_challenge_page() method that recognizes captcha/challenge pages by checking the URL plus DOM selectors, rather than naively grepping the page source. Source-string matching misfires in both directions: it gives false positives when the matched words show up in ordinary page content, and it silently stops matching the moment Instagram tweaks copy. Checking structure is more robust.
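A URL-plus-selectors check could look something like the sketch below. It is not the tool's `_is_challenge_page()` implementation: the path prefixes and CSS selectors are hypothetical placeholders you would replace with whatever the live challenge pages actually use. `driver` is expected to expose Selenium's `current_url` and `find_elements(by, value)`.

```python
from urllib.parse import urlparse

# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR,
# inlined so the sketch runs without Selenium installed.
CSS = "css selector"

# Hypothetical markers, for illustration only.
CHALLENGE_PATH_PREFIXES = ("/challenge/", "/accounts/suspended/")
CHALLENGE_SELECTORS = (
    "form[action*='challenge']",
    "iframe[src*='captcha']",
)


def is_challenge_page(driver):
    """True if the URL path or the DOM structure looks like a challenge page."""
    path = urlparse(driver.current_url).path
    if path.startswith(CHALLENGE_PATH_PREFIXES):
        return True
    # find_elements returns an empty list when nothing matches,
    # so it never raises the way find_element does.
    return any(driver.find_elements(CSS, sel) for sel in CHALLENGE_SELECTORS)
```

Matching on the parsed path rather than the raw URL string means a profile whose username merely contains "challenge" does not trip the check.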

There's also better end-of-profile detection — it retries scroll up/down 5 times before concluding it's actually reached the bottom, instead of giving up after one attempt — and a carousel retry path that handles duplicate slide URLs and skips blocked slides.
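One way to sketch the end-of-profile retry and the duplicate-slide handling is shown below. The names, the 400-pixel nudge, the 2-second pause, and the convention that a blocked slide shows up as `None` are all assumptions for the example; only the "retry up/down 5 times" behavior and the dedupe/skip idea come from the description above.

```python
import time

GET_HEIGHT = "return document.body.scrollHeight"


def reached_end(driver, retries=5, pause=2.0, sleep=time.sleep):
    """True only if page height stays unchanged across `retries` up/down nudges."""
    last_height = driver.execute_script(GET_HEIGHT)
    for _ in range(retries):
        driver.execute_script("window.scrollBy(0, -400);")  # nudge up
        sleep(pause)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause)
        if driver.execute_script(GET_HEIGHT) != last_height:
            return False  # more content loaded, so this was not the bottom
    return True


def unique_slides(urls):
    """Drop blocked slides (assumed to be None) and duplicates, keeping order."""
    return list(dict.fromkeys(u for u in urls if u))
```

`dict.fromkeys` preserves insertion order, so the carousel's slide order survives the dedupe.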

Clean output structure

One thing I cared about: the downloads should be usable, not a flat dump of files. Each post gets its own folder, carousels keep their slide order, and every post's caption is saved alongside the media:

downloads/
└── username/
    ├── images/
    │   ├── post_1/
    │   │   ├── username_1.jpg
    │   │   └── description.txt
    │   └── post_2/
    │       ├── username_2_01.jpg   ← carousel slide 1
    │       ├── username_2_02.jpg   ← carousel slide 2
    │       └── description.txt
    └── videos/
        └── post_3/
            ├── username_3.mp4
            └── description.txt
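The naming scheme in that tree can be expressed as a small path builder. This is a sketch inferred from the layout shown, not the tool's own code: the function name and parameters are hypothetical, and `PurePosixPath` is used only so the example produces the same separators on every OS (a real downloader would likely use `pathlib.Path`).

```python
from pathlib import PurePosixPath


def post_paths(root, username, post_number, kind, ext, slides=1):
    """Return (media_paths, caption_path) for one post.

    kind is "images" or "videos"; slides > 1 means a carousel.
    """
    folder = PurePosixPath(root) / username / kind / f"post_{post_number}"
    if slides == 1:
        names = [f"{username}_{post_number}.{ext}"]
    else:
        # Zero-padded slide index keeps carousel order stable when sorted.
        names = [
            f"{username}_{post_number}_{i:02d}.{ext}"
            for i in range(1, slides + 1)
        ]
    return [folder / name for name in names], folder / "description.txt"
```

For example, `post_paths("downloads", "username", 2, "images", "jpg", slides=2)` yields the two `username_2_01.jpg` / `username_2_02.jpg` paths under `images/post_2/` plus the `description.txt` path next to them.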

Honest limitations

I'd rather you know the walls before you hit them. Straight from the README:

  • Public profiles only — private profiles need the scraper account to follow them
  • Edge only — no Chrome or Firefox support; it relies on Edge WebDriver
  • Instagram UI changes break selectors — when that happens, update Selenium/Edge and retry. This is the permanent tax on scraping anything you don't control
  • Rate limits still apply — on very large profiles (1000+ posts) expect pauses and retries; the human-like behavior reduces blocks, it doesn't make you invincible
  • No proxy support — every request comes from your real IP

Those last two are deliberately on the label. This isn't a tool that pretends to be undetectable, and I'd be suspicious of any that did.

The takeaway worth stealing

Even if you never touch Instagram, the general lesson ports to any Selenium/Playwright automation: the gap between "works once on my machine" and "works repeatedly" is almost entirely about timing and behavioral realism. Randomized waits, partial scrolls, structural (not string-based) state detection, and retry-with-backoff are the difference between a script that runs and a script that keeps running.
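Of those four, retry-with-backoff is the most portable, so here is a generic sketch. It is an illustration under my own assumptions (exponential doubling from a 2-second base, up to 50% random jitter, a 60-second cap), not the retry logic the tool ships with.

```python
import random
import time


def retry_with_backoff(action, attempts=4, base_delay=2.0, max_delay=60.0,
                       sleep=time.sleep, rng=random):
    """Call action() until it succeeds or `attempts` tries are used up."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of tries: surface the last error
            # Exponential backoff: base, 2x base, 4x base, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter so repeated runs do not retry on an identical schedule.
            sleep(delay + rng.uniform(0, delay / 2))
```

In a Selenium script, `action` would typically be a small function that loads a page or downloads one file and raises when it hits a block page. Catching bare `Exception` keeps the sketch short; in practice you would narrow it to the errors you actually expect.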

If you've built browser automation that has to survive a hostile, frequently-changing site, I'd like to hear which behavioral tricks actually moved the needle for you.
