Web Scraping — Scrape data from your instagram page with Nodejs, Playwright and Firebase.

Divine Hycenth on June 24, 2020

An introduction to web scraping with playwright, nodejs and firebase. Prerequisites If you want to follow along this tutorial...
 
Aleccc • Edited

Worth noting that the documentation describes alternatives to hard-coded wait times (page.waitForTimeout). Some commands, such as fill and click, have auto-waiting built in. You can also explicitly wait for an element to appear in the DOM.

// Playwright waits for #search element to be in the DOM
await page.fill('#search', 'query');

// Wait for #search to appear in the DOM.
await page.waitForSelector('#search', { state: 'attached' });

https://playwright.dev/path=docs%2Fcore-concepts.md&q=auto-waiting#version=master
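
As a minimal sketch of the same idea, the two calls above can be wrapped in a small helper so a script never sleeps for a fixed time. The name fillWhenReady and the 10-second default timeout are assumptions made for illustration; they are not part of the article or of Playwright. The page argument is expected to be a Playwright Page.

```javascript
// Hypothetical helper (not from the article): wait for an element, then fill it.
// `page` is assumed to be a Playwright Page object.
async function fillWhenReady(page, selector, value, timeoutMs = 10000) {
  // Resolves once the element is attached to the DOM; rejects if
  // timeoutMs elapses first, so a missing element fails loudly.
  await page.waitForSelector(selector, { state: 'attached', timeout: timeoutMs });
  // fill() also auto-waits for the element to be actionable before typing.
  await page.fill(selector, value);
}
```

Calling await fillWhenReady(page, '#search', 'query') would then stand in for a waitForTimeout followed by a fill.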

Divine Hycenth

Hi Aleccc,

I've updated the article to use this approach as recommended in the docs. Thank you for pointing that out :)

Johnny Dev

Hi Divine,
Just note that this approach currently only works on localhost with firebase serve; it fails when you deploy it to the cloud. From what I've observed, Firebase can't locate the browser binaries used for scraping and therefore can't launch the browsers. I'm still looking for a way to work around this behaviour. Do you have any ideas?
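
One way to check this hypothesis is to log where Playwright expects its browser binary and whether that file exists in the deployed environment. The sketch below is a diagnostic only, not a fix for the deployment problem. The function name describeBrowserBinary is an assumption made for illustration; executablePath() is the Playwright BrowserType method that reports the expected binary location.

```javascript
const fs = require('fs');

// Hypothetical diagnostic (not from the article): report where Playwright
// expects a browser binary and whether that file exists in this environment.
// `browserType` is assumed to be e.g. require('playwright').chromium.
function describeBrowserBinary(browserType, existsSync = fs.existsSync) {
  const executablePath = browserType.executablePath();
  return { executablePath, exists: existsSync(executablePath) };
}
```

Logging describeBrowserBinary(require('playwright').chromium) both locally and inside the deployed function would show whether the two environments disagree about the binary.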

Divine Hycenth

Hi Johnny,

I apologize for my late response.
What you said is true, and I haven't figured out a way to make it run on Firebase in the cloud. I would be glad to hear about it if you've figured that out :).

Thank you for your patience.

amm297

Hi, did any of you find a solution for this bug?

restyler

Nice writeup! If anyone decides to run this script from a datacenter, I would definitely recommend using clean (preferably residential) proxies to avoid getting your accounts flagged, and saving your cookies so you can reuse them later (this was actually mentioned here in the comments).
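
As a rough sketch of the proxy suggestion, Playwright's launch options accept a proxy object with server, username and password fields. The helper name buildLaunchOptions and the proxy address in the usage note are placeholders made up for illustration.

```javascript
// Hypothetical helper (not from the article): build Playwright launch options
// that route browser traffic through a proxy when one is configured.
function buildLaunchOptions(proxyServer, username, password) {
  const options = { headless: true };
  if (proxyServer) {
    options.proxy = { server: proxyServer };
    // Credentials are optional; only attach them when a username is given.
    if (username) {
      options.proxy.username = username;
      options.proxy.password = password;
    }
  }
  return options;
}
```

The result would be passed to the launcher, for example chromium.launch(buildLaunchOptions('http://proxy.example.com:3128', 'user', 'secret')).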

I've recently published a simple tutorial on Instagram scraping and discovering micro-influencers via Node.js and MySQL.


Good luck!
 
andyajhis

Do you know how to save the login session after the browser closes, so that I can scrape again and again?

Cicada1033➿

Go to your project directory and run the command below in the terminal:
npx playwright open --save-storage websitename.json

A browser will open. Navigate to the website, sign in (and solve any captcha), then close the browser. You will notice that a file named "websitename.json" has been created.

Now, in Playwright, create your browser context using the code below:

const context = await browser.newContext({
  storageState: "websitename.json"
});

You are now automatically logged in. :)
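
The same state file can also be written from a script instead of the CLI, since a Playwright browser context exposes storageState({ path }). The sketch below assumes that approach. The helper names openContext and saveSession are made up for illustration, and browser is expected to be a Playwright Browser.

```javascript
const fs = require('fs');

// Hypothetical helper (not from the thread): reuse a saved session when the
// state file exists, otherwise start with a fresh, logged-out context.
async function openContext(browser, statePath, existsSync = fs.existsSync) {
  const options = existsSync(statePath) ? { storageState: statePath } : {};
  return browser.newContext(options);
}

// Hypothetical helper: write the context's cookies and localStorage to disk
// so that a later run can load them again.
async function saveSession(context, statePath) {
  await context.storageState({ path: statePath });
}
```

A script would call openContext at startup, log in only if the session turns out to be missing or expired, and call saveSession before closing the browser.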

Divine Hycenth

I haven't tried that. It may be possible, but I'm sure it's not going to work if you randomly spin up browsers using Playwright.
Let me know if you are able to do that :)

Alex Serebriakov

scaling headless browsers is genuinely hard. concurrency limits, page crashes, worker restarts...

snapapi.pics handles the scaling layer. you just fire the API requests, they deal with the browser fleet

Alexander Ivanov

Why not Functions Framework? Is the use of Firebase tools really necessary?