An introduction to web scraping with Playwright, Node.js and Firebase.
Prerequisites
If you want to follow along this tutorial...
Worth noting that the documentation has alternatives to hard-coding the wait times (page.waitForTimeout). Some commands, like fill and click, have auto-waits built in. Or you can explicitly wait for an object to appear in the DOM.
https://playwright.dev/path=docs%2Fcore-concepts.md&q=auto-waiting#version=master
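To illustrate the point, here is a minimal sketch of a login step that relies on auto-waiting instead of fixed sleeps. The function name submitLogin and all three selectors are hypothetical; the real ones depend on the site being scraped.

```javascript
// Sketch: rely on Playwright's built-in auto-waiting instead of fixed sleeps.
// `page` is expected to be a Playwright Page; the selectors are hypothetical.
async function submitLogin(page, { username, password }) {
  // fill() and click() wait for the element to be actionable before acting,
  // so no page.waitForTimeout() call is needed in front of them.
  await page.fill('input[name="username"]', username);
  await page.fill('input[name="password"]', password);
  await page.click('button[type="submit"]');

  // Explicitly wait for an element assumed to exist only after login.
  await page.waitForSelector('nav', { timeout: 10000 });
}
```

With a real browser you would pass in the object returned by context.newPage(). If the selector never appears, waitForSelector rejects with a timeout error rather than silently continuing, which tends to be easier to debug than a sleep that was too short.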
Hi Aleccc,
I've updated the article to use this approach as recommended in the docs. Thank you for pointing that out :)
Hi Divine,
Just note that this approach currently only works on localhost with firebase serve; it fails when you deploy it to the cloud. From what I've observed, Firebase can't figure out where the browser binaries used for scraping are stored, and therefore can't launch the browsers. I am still looking for a way to change this behaviour. Do you have any ideas?
Hi Johnny,
I apologize for my late response.
What you said is true, and I haven't figured out a way to make it run on Firebase in the cloud. I'd be glad to hear if you've figured that out :).
Thank you for your patience.
Hi, did any of you find a solution for this bug?
Nice write-up! If someone decides to launch this script from a datacenter, I would definitely recommend using clean (preferably residential) proxies to avoid your accounts being flagged, and saving your cookies to reuse them later (this was actually mentioned here in the comments).
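For the proxy part, Playwright accepts a proxy object with server, username and password fields when launching a browser. Below is a rough sketch of a helper that turns a single proxy URL into that shape. The helper name toPlaywrightProxy and the example host are made up for illustration.

```javascript
// Sketch: convert a proxy URL such as "http://user:pass@host:8080" into the
// { server, username, password } object that Playwright's launch() accepts.
// The function name is hypothetical, not part of Playwright.
function toPlaywrightProxy(proxyUrl) {
  const url = new URL(proxyUrl);
  const proxy = { server: `${url.protocol}//${url.host}` };

  // Credentials are percent-encoded inside URLs, so decode them before use.
  if (url.username) proxy.username = decodeURIComponent(url.username);
  if (url.password) proxy.password = decodeURIComponent(url.password);
  return proxy;
}
```

You could then launch with something like chromium.launch({ proxy: toPlaywrightProxy(process.env.PROXY_URL) }), where PROXY_URL is an assumed environment variable holding your provider's endpoint.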
I've recently published a simple tutorial on Instagram scraping and discovering micro-influencers via Node.js and MySQL.
How to scrape Instagram followers with Node.js, put results to MySQL, and discover micro-influencers
restyler ・ Oct 3 ・ 9 min read
Good luck!
Do you know how to save the login session after the browser closes, so I can scrape again and again?
Go to your project directory.
In the terminal, run the command below:
npx playwright open --save-storage websitename.json
A browser will open. Navigate to the website and sign in or solve the captcha,
then close the browser. You will notice that a file "websitename.json" has been created.
Now, in Playwright, set your browser context using the code below:
const context = await browser.newContext({
  storageState: "websitename.json"
});
You are now automatically logged in. :)
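The same idea can also be done from inside the script, without the CLI step. This is a small sketch, assuming you want to reuse the state file when it exists and write it again after each run. The helper names openContext and saveSession are hypothetical; newContext({ storageState }) and context.storageState({ path }) are the Playwright calls doing the work.

```javascript
const fs = require('fs');

// Open a context that reuses a saved session when the state file exists,
// and falls back to a fresh (logged-out) context when it does not.
async function openContext(browser, statePath) {
  const options = fs.existsSync(statePath) ? { storageState: statePath } : {};
  return browser.newContext(options);
}

// Write the context's cookies and localStorage to disk so that the next run
// can start from the same logged-in state.
async function saveSession(context, statePath) {
  await context.storageState({ path: statePath });
}
```

A typical run would call openContext(browser, "websitename.json"), log in only if the site still asks for it, scrape, and then call saveSession(context, "websitename.json") before closing the browser. Keep the state file out of version control, since it contains your session cookies.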
I haven't tried that. Maybe it's possible, but I'm sure it's not going to work if you randomly spin up browsers using Playwright.
Let me know if you are able to do that :)
Scaling headless browsers is genuinely hard: concurrency limits, page crashes, worker restarts...
snapapi.pics handles the scaling layer. You just fire the API requests, and they deal with the browser fleet.
Why not Functions Framework? Is the use of Firebase tools really necessary?