DEV Community

Discussion on: Working alone is so exhausting so I created my own assistant

happping_min profile image
Min

Thank you, lionel-rowe! (Still confused, lol)

lionelrowe profile image
lionel-rowe

The parent Puppeteer script runs in the Node.js runtime, whereas the callback passed to page.evaluate runs in the Chromium browser runtime, headlessly by default ("headless" basically just means that it runs in the background, so you can't see it running). Passing complex data between runtimes is often not possible, because one runtime doesn't know how to interpret the other's internal data, and DOM elements (the DOM is the way the browser represents HTML) are internally very complex. So to simplify the message passing, Puppeteer uses a serialized format that both runtimes can easily understand. The drawback is that any data that can't be converted to this serialized format is lost.
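
To make the "lost data" point concrete, here is a minimal sketch in plain Node.js. It uses a JSON round trip as a stand-in for crossing between runtimes. That is an assumption for illustration only, since Puppeteer's real message passing has more to it, and crossRuntimeBoundary is a made-up helper name.

```javascript
// Hypothetical helper: approximates crossing between runtimes with a JSON round trip.
// Illustration only; this is not how Puppeteer is implemented internally.
function crossRuntimeBoundary(value) {
    return JSON.parse(JSON.stringify(value))
}

const original = {
    title: 'Hello',            // string: survives
    count: 3,                  // number: survives
    greet() { return 'hi' },   // function: dropped
    missing: undefined,        // undefined: dropped
    when: new Date(0),         // Date: becomes an ISO string
}

const received = crossRuntimeBoundary(original)

console.log(received)
// { title: 'Hello', count: 3, when: '1970-01-01T00:00:00.000Z' }
```

Anything that has no JSON representation either disappears or changes type on the way across, which is roughly what happens to values that can't be serialized.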

You can think of "serialized" as meaning something like flat, like a string of letters or binary digits. JSON is a typical serialization format and is useful because it allows the "flattening" of "deep" structures. For example, the JavaScript object { a: { b: 1 } } nests b within a, yet it can be serialized to the JSON string {"a":{"b":1}}. Why is this flat? Well, it's simply the character {, followed by ", followed by a, etc., so it can be read left to right, even though the object it represents is a tree structure.
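
The same example can be tried directly in Node.js or a browser console. This is just a small sketch using the built-in JSON.stringify and JSON.parse; the variable names are arbitrary.

```javascript
const nested = { a: { b: 1 } }

// Serialize: the tree becomes one flat string of characters
const flat = JSON.stringify(nested)
console.log(flat)                  // {"a":{"b":1}}
console.log(typeof flat)           // string
console.log([...flat].slice(0, 3)) // [ '{', '"', 'a' ]

// Deserialize: the receiving side rebuilds an equivalent tree
const rebuilt = JSON.parse(flat)
console.log(rebuilt.a.b)           // 1
console.log(rebuilt === nested)    // false (a copy, not the same object)
```

Note that the receiving side gets a copy with the same shape, never the original object itself.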

Puppeteer does much of this JSON serialization "under the hood", so you often don't need to worry about it; but JSON can't serialize DOM nodes, because they contain circular structures, e.g. *{ a: { b: *{ a: ... } } } (where * represents a reference to the exact same object). So you need to return only things that JSON can represent — strings, numbers, booleans, null, arrays, and objects containing other JSON-able stuff.

const elementData = await page.evaluate(() => {
    const el = document.querySelector('p.fs-xs.color-base-60')

    return {
        textContent: el.textContent, // string — OK
        childElementCount: el.childElementCount, // number — OK
        className: el.className, // string — OK
        outerHTML: el.outerHTML, // string — OK
    }
})

console.log(elementData)
// {
//     textContent: 'Posted on Mar 15',
//     childElementCount: 1,
//     className: 'fs-xs color-base-60',
//     outerHTML: '<p class="fs-xs color-base-60">Posted on <time datetime="2022-03-15T02:18:47Z" class="date-no-year" title="Tuesday, March 15, 2022, 2:18:47 AM">Mar 15</time></p>',
// }
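
To see the circular-structure problem in isolation, here is a rough sketch that runs in plain Node.js. The two plain objects only stand in for a DOM node and its parent (an assumption for illustration; real DOM nodes differ in the details), but the failure mode of JSON.stringify is the same idea.

```javascript
// Two plain objects that point at each other, like a DOM node and its parent
const parent = { name: 'parent' }
const child = { name: 'child', parent }
parent.child = child

let errorName = null
try {
    JSON.stringify(parent) // parent -> child -> parent -> ...
} catch (err) {
    errorName = err.name
}
console.log(errorName) // TypeError

// Picking out plain values first avoids the problem
const safe = JSON.stringify({ name: child.name, parentName: child.parent.name })
console.log(safe) // {"name":"child","parentName":"parent"}
```

This is the same move as returning textContent or className from the page instead of the element itself.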
 
happping_min profile image
Min

Omg @lionelrowe
A big thank you for explaining this in such kind detail!
This is really easy to understand! You are the best!👍👍👍👍👍

ctsstc profile image
Cody Swartz • Edited

I think Puppeteer and Selenium both suffer from the same problem -- Selenium has been around longer, likely since before websites got more complicated with single-page magic and dynamic content. The last thing I remember is that if you query for something before it exists/mounts/renders, you'll get nothing back, so you need to wait/poll for it to be available. I thought Puppeteer had helpers around this, or maybe this is why people reach for additional libraries on top of these tools. It's been a while since I've touched Puppeteer or Selenium, but I do remember the pains of working with them in single-page applications.

Edit: puppeteer.github.io/puppeteer/docs...
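
The wait/poll idea can be sketched without any browser at all. The helper below is hypothetical (pollUntil is not a Puppeteer or Selenium API, and the interval and timeout defaults are arbitrary); it just shows the general pattern of retrying a check until something shows up or time runs out.

```javascript
// Hypothetical helper: repeatedly calls `check` until it returns a truthy value
// or the timeout elapses.
async function pollUntil(check, { interval = 50, timeout = 2000 } = {}) {
    const deadline = Date.now() + timeout
    while (true) {
        const result = await check()
        if (result) return result
        if (Date.now() >= deadline) {
            throw new Error(`Timed out after ${timeout} ms`)
        }
        // Sleep before trying again
        await new Promise(resolve => setTimeout(resolve, interval))
    }
}

// Simulate content that "mounts" 200 ms from now
let mounted = null
setTimeout(() => { mounted = { text: 'Hello' } }, 200)

const pollDemo = pollUntil(() => mounted).then(el => {
    console.log(el.text) // Hello
    return el
})
```

Querying once at the start would have returned null; polling gives the content time to appear, at the cost of having to pick a sensible timeout.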

lionelrowe profile image
lionel-rowe

@ctsstc yeah, Puppeteer gives you various APIs, such as page.waitForSelector, to deal with that, but it can be finicky knowing exactly what you need to wait for and avoiding race conditions.
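
As a rough sketch of how that can look, the helper below waits for an element and then reads its text through the handle that page.waitForSelector resolves with, rather than querying a second time. getTextWhenReady is a made-up name, the 5000 ms default is arbitrary, and page is assumed to be a Puppeteer Page you have already opened.

```javascript
// Hypothetical helper; `page` is assumed to be a Puppeteer Page.
async function getTextWhenReady(page, selector, timeout = 5000) {
    // Resolves with an element handle once the selector matches;
    // rejects if the timeout passes first
    const handle = await page.waitForSelector(selector, { timeout })
    // Read from the same handle and return a plain string,
    // which serializes cleanly back to Node.js
    return handle.evaluate(el => el.textContent)
}
```

Usage would be something like const heading = await getTextWhenReady(page, 'h1'). Reusing the handle narrows one race window (the element being replaced between the wait and a second query), but it doesn't help if the text itself is filled in later.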