The parent puppeteer script runs in the Node.JS runtime, whereas the callback to page.evaluate runs in the Chromium browser runtime, headlessly by default ("headless" basically just means that it runs in the background, so you can't visibly see it running). Passing complex data between runtimes is often not possible, because the different runtimes don't know how to interpret it, and DOM elements (DOM is the way the browser interprets HTML) are internally very complex. So to simplify the message passing, Puppeteer uses a serialized format that both runtimes can easily understand. The drawback is that any data that can't be converted to this serialized format is lost.
You can think of "serialized" as meaning something like flat, like a string of letters or binary digits. JSON is a typical serialization format and is useful because it allows the "flattening" of "deep" structures. For example, the JavaScript object { a: { b: 1 } } nests b within a, yet it can be serialized to the JSON string {"a":{"b":1}}. Why is this flat? Well, it's simply the character {, followed by ", followed by a, etc., so it can be read left-to-right; even though the object it represents is a tree structure.
Puppeteer does much of this JSON serialization "under the hood", so you often don't need to worry about it; but JSON can't serialize DOM nodes, because they contain circular structures, e.g. *{ a: { b: *{ a: ... } } } (where * represents a reference to the exact same object). So you need to return only things that JSON can represent — strings, numbers, booleans, null, arrays, and objects containing other JSON-able stuff.
constelementData=awaitpage.evaluate(()=>{constel=document.querySelector('h1')return{textContent:el.textContent,// string — OKchildElementCount:el.childElementCount,// number — OKclassName:el.className,// string — OKouterHTML:el.outerHTML,// string — OK}})console.log(elementData)// {// textContent: 'Posted on Mar 15'// childElementCount: 1,// className: 'fs-xs color-base-60'// outerHTML: '<p class="fs-xs color-base-60">Posted on <time datetime="2022-03-15T02:18:47Z" class="date-no-year" title="Tuesday, March 15, 2022, 2:18:47 AM">Mar 15</time></p>',// }
I think Puppeteer and Selenium both suffer from the same problem -- Selenium has been around longer, likely before websites were more complicated with single page magic and dynamic content. The last thing I remembered is that if you query for something before it exists/mounts/renders you'll get nothing back, so you need to wait/poll for it to be available. I thought Puppeteer has helpers around this, or this is why people start to reach for additional libraries on top of these tools to help with this problem. It's been a while since I've touched Puppeteer or Selenium, but I do remember the pains of working with them in single page applications.
@ctsstc yeah, Puppeteer gives you various APIs, such as page.waitForSelector, to deal with that, but it can be finnicky knowing exactly what you need to wait for and avoiding race conditons.
For further actions, you may consider blocking this person and/or reporting abuse
We're a place where coders share, stay up-to-date and grow their careers.
Thank you Lionel-rowe! ( Still confused lol)
The parent puppeteer script runs in the Node.JS runtime, whereas the callback to
page.evaluate
runs in the Chromium browser runtime, headlessly by default ("headless" basically just means that it runs in the background, so you can't visibly see it running). Passing complex data between runtimes is often not possible, because the different runtimes don't know how to interpret it, and DOM elements (DOM is the way the browser interprets HTML) are internally very complex. So to simplify the message passing, Puppeteer uses a serialized format that both runtimes can easily understand. The drawback is that any data that can't be converted to this serialized format is lost.You can think of "serialized" as meaning something like flat, like a string of letters or binary digits. JSON is a typical serialization format and is useful because it allows the "flattening" of "deep" structures. For example, the JavaScript object
{ a: { b: 1 } }
nestsb
withina
, yet it can be serialized to the JSON string{"a":{"b":1}}
. Why is this flat? Well, it's simply the character{
, followed by"
, followed bya
, etc., so it can be read left-to-right; even though the object it represents is a tree structure.Puppeteer does much of this JSON serialization "under the hood", so you often don't need to worry about it; but JSON can't serialize DOM nodes, because they contain circular structures, e.g.
*{ a: { b: *{ a: ... } } }
(where*
represents a reference to the exact same object). So you need to return only things that JSON can represent — strings, numbers, booleans, null, arrays, and objects containing other JSON-able stuff.Omg @lionelrowe
Big thank you to explain this with full of kind detail!
This is really easy to understand! You are the best!👍👍👍👍👍
I think Puppeteer and Selenium both suffer from the same problem -- Selenium has been around longer, likely before websites were more complicated with single page magic and dynamic content. The last thing I remembered is that if you query for something before it exists/mounts/renders you'll get nothing back, so you need to wait/poll for it to be available. I thought Puppeteer has helpers around this, or this is why people start to reach for additional libraries on top of these tools to help with this problem. It's been a while since I've touched Puppeteer or Selenium, but I do remember the pains of working with them in single page applications.
Edit: puppeteer.github.io/puppeteer/docs...
@ctsstc yeah, Puppeteer gives you various APIs, such as
page.waitForSelector
, to deal with that, but it can be finnicky knowing exactly what you need to wait for and avoiding race conditons.