DEV Community

Discussion on: Working alone is so exhausting so I created my own assistant

lionelrowe profile image
lionel-rowe • Edited

As I understand it, what Puppeteer gives me is not an HTML element but something else. (Can anybody explain this in an easy way? I am helpless lol)

The function passed as a callback runs in the context of the headless browser, with all the relevant web APIs available. So within that function, document.querySelector gives a real live DOM element that you can manipulate however you like:

const isItARealDiv = await page.evaluate(() => {
    // Runs inside the browser page, so DOM classes like HTMLDivElement exist here
    return document.querySelector("div") instanceof HTMLDivElement
})

isItARealDiv // true

The problem is that when the return value is passed back to the parent script, everything is serialized and then deserialized again, something similar to JSON.parse(JSON.stringify(result)). DOM elements can't be properly serialized and get converted to undefined, so you need to return only the serializable data that you need (text content, specific attributes, inner/outer HTML, etc.).
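To make the round trip concrete, here is a minimal sketch in plain Node.js, no browser needed. It assumes the JSON approximation mentioned above; `roundTrip` and `Point` are made-up names for illustration, and Puppeteer's actual transport may differ in details.

```javascript
// Hypothetical helper: approximates the serialize/deserialize step with JSON.
function roundTrip(value) {
    const serialized = JSON.stringify(value)
    // JSON.stringify returns undefined (not a string) for unsupported top-level values
    return serialized === undefined ? undefined : JSON.parse(serialized)
}

// Illustrative class with a method, standing in for any "rich" object
class Point {
    constructor(x, y) {
        this.x = x
        this.y = y
    }
    norm() {
        return Math.hypot(this.x, this.y)
    }
}

// Plain data survives unchanged
const plain = roundTrip({ text: 'hello', count: 3, tags: ['a', 'b'] })

// Richer values are degraded: the Date becomes a string, the function
// property is dropped, and the Point comes back as a plain object without norm()
const lossy = roundTrip({ when: new Date(0), fn: () => 1, point: new Point(3, 4) })

// A value with no JSON form at all comes back as undefined
const topLevelFn = roundTrip(() => 1)

console.log(plain)
console.log(lossy)
console.log(topLevelFn)
```

The takeaway matches the advice above: pick out strings, numbers, and plain objects before returning.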

happping_min profile image

Thank you Lionel-rowe! ( Still confused lol)

lionelrowe profile image

The parent Puppeteer script runs in the Node.js runtime, whereas the callback to page.evaluate runs in the Chromium browser runtime, headlessly by default ("headless" basically just means that it runs in the background, so you can't see it running). Passing complex data between runtimes is often not possible, because the different runtimes don't know how to interpret it, and DOM elements (DOM is the way the browser interprets HTML) are internally very complex. So to simplify the message passing, Puppeteer uses a serialized format that both runtimes can easily understand. The drawback is that any data that can't be converted to this serialized format is lost.

You can think of "serialized" as meaning something like flat, like a string of letters or binary digits. JSON is a typical serialization format and is useful because it allows the "flattening" of "deep" structures. For example, the JavaScript object { a: { b: 1 } } nests b within a, yet it can be serialized to the JSON string {"a":{"b":1}}. Why is this flat? Well, it's simply the character {, followed by ", followed by a, etc., so it can be read left-to-right, even though the object it represents is a tree structure.
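The flattening described above can be tried directly in Node.js. This is just a small sketch using the same { a: { b: 1 } } object; the variable names are arbitrary.

```javascript
// A "deep" object: b is nested inside a
const nested = { a: { b: 1 } }

// Serializing gives one flat string of characters
const flat = JSON.stringify(nested)
console.log(flat) // {"a":{"b":1}}
console.log(Array.from(flat)) // the same string, one character per array entry

// Deserializing rebuilds an equivalent tree on the other side
const rebuilt = JSON.parse(flat)
console.log(rebuilt.a.b) // 1

// It is a copy, not the original object
console.log(rebuilt === nested) // false
```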

Puppeteer does much of this JSON serialization "under the hood", so you often don't need to worry about it; but JSON can't serialize DOM nodes, because they contain circular structures, e.g. *{ a: { b: *{ a: ... } } } (where * represents a reference to the exact same object). So you need to return only things that JSON can represent — strings, numbers, booleans, null, arrays, and objects containing other JSON-able stuff.
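To see why a circular structure is a problem for JSON, here is a sketch with two plain objects that point at each other. They are only stand-ins (real DOM nodes are more complicated and may behave differently when stringified), and the names `a`, `b`, and `safe` are made up for the example.

```javascript
// Two plain objects that reference each other, loosely like
// a parent node and its child pointing back at the parent
const a = { name: 'a' }
const b = { name: 'b', parent: a }
a.child = b // a -> b -> a is now a cycle

// JSON has no way to write "the same object again", so this throws
let errorName = null
try {
    JSON.stringify(a)
} catch (err) {
    errorName = err.name
}
console.log(errorName) // TypeError

// Picking out only plain values avoids the cycle entirely
const safe = { name: a.name, childName: a.child.name }
console.log(JSON.stringify(safe)) // {"name":"a","childName":"b"}
```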

const elementData = await page.evaluate(() => {
    // Selector chosen to match the <p> element shown in the output below
    const el = document.querySelector('p.fs-xs')

    // Return only plain, serializable values picked off the element
    return {
        textContent: el.textContent, // string: OK
        childElementCount: el.childElementCount, // number: OK
        className: el.className, // string: OK
        outerHTML: el.outerHTML, // string: OK
    }
})

// elementData:
// {
//     textContent: 'Posted on Mar 15',
//     childElementCount: 1,
//     className: 'fs-xs color-base-60',
//     outerHTML: '<p class="fs-xs color-base-60">Posted on <time datetime="2022-03-15T02:18:47Z" class="date-no-year" title="Tuesday, March 15, 2022, 2:18:47 AM">Mar 15</time></p>',
// }
happping_min profile image

Omg @lionelrowe
A big thank you for explaining this in such kind detail!
This is really easy to understand! You are the best!👍👍👍👍👍

ctsstc profile image
Cody Swartz • Edited

I think Puppeteer and Selenium both suffer from the same problem -- Selenium has been around longer, likely before websites were more complicated with single page magic and dynamic content. The last thing I remembered is that if you query for something before it exists/mounts/renders you'll get nothing back, so you need to wait/poll for it to be available. I thought Puppeteer has helpers around this, or this is why people start to reach for additional libraries on top of these tools to help with this problem. It's been a while since I've touched Puppeteer or Selenium, but I do remember the pains of working with them in single page applications.


lionelrowe profile image

@ctsstc yeah, Puppeteer gives you various APIs, such as page.waitForSelector, to deal with that, but it can be finicky knowing exactly what you need to wait for and avoiding race conditions.
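For anyone wondering what "wait/poll until it exists" looks like in general, here is a rough sketch in plain Node.js. It is not how page.waitForSelector is implemented; `waitFor`, `mounted`, and `demo` are hypothetical names, and the "element" is simulated with a timer so the example runs without a browser.

```javascript
// Hypothetical polling helper: calls check() repeatedly until it returns
// something truthy, or throws once the timeout has passed.
async function waitFor(check, { timeout = 1000, interval = 20 } = {}) {
    const deadline = Date.now() + timeout
    while (true) {
        const result = await check()
        if (result) return result
        if (Date.now() >= deadline) {
            throw new Error(`Timed out after ${timeout} ms`)
        }
        // Sleep briefly before polling again
        await new Promise((resolve) => setTimeout(resolve, interval))
    }
}

// Simulate content that only "mounts" after 100 ms, like a late-rendering component
let mounted = null
setTimeout(() => { mounted = { text: 'Hello' } }, 100)

async function demo() {
    // Querying right away would give null; polling waits for it to appear
    const found = await waitFor(() => mounted)

    // Something that never appears ends in a timeout instead of hanging forever
    let timedOut = false
    try {
        await waitFor(() => null, { timeout: 50 })
    } catch {
        timedOut = true
    }
    return { found, timedOut }
}

demo().then(console.log)
```

The finicky part mentioned above is choosing the check: waiting for the wrong condition either times out or returns before the data you actually want has rendered.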