
Adding a little feature to the headless_chrome crate in Rust

Alistair Roche

This is a brief account of how I fulfilled a small feature request a user made against the headless_chrome crate. There’s nothing particularly fancy or interesting about the change, but I thought a write-up might make it easier for others to contribute in future. It could also be interesting for anyone curious about how Puppeteer and Chrome DevTools work under the hood.

For context, headless_chrome is a Rust crate for driving Chrome using the Chrome DevTools Protocol, the same protocol that all of DevTools uses to interface with the browser.

If you give Chrome a special command line flag (--remote-debugging-port), it’ll open up a WebSocket and let you call methods by sending JSON to it.

You might’ve heard of Puppeteer, which is the equivalent Node library to headless_chrome, and is maintained by the Chrome DevTools team.

(Side note: There’s a more mature, and probably generally better, cross-browser Rust crate called Fantoccini that uses WebDriver. It also supports async / await! The main reason to use headless_chrome is if you want to do the same things DevTools can. My personal use case involves recording code coverage information, inspecting and modifying network requests and responses, and opening and driving multiple “incognito windows” in the same browser.)

So what was the feature request? It’s called “Easy way to search within descendants of an element”. Here’s the body:

Hi,
I'm trying to write a web scraper using this library, and along with #73 , it would be very useful to be able to run find_element(s) on an Element, so I could say find_elements("tr") then do find_element(x) to find specific columns of the table.

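In the DevTools console (and when using JavaScript in the browser generally) you can search the whole document with document.querySelector(selector). You can also search only within an element you already hold, by calling querySelector on that element instead of on document.

Right now headless_chrome only supports the first one. If you wanted the equivalent of the second, you’d have to fold the whole search into a single CSS selector, for example with a descendant combinator such as "tr td".
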
Sometimes (e.g. when the element you want to query inside doesn’t have an easy-to-use ID, or the only way to identify it is via its text content) this is annoying. You might just want to walk the tree of elements yourself rather than constructing a CSS selector on the fly to do it for you. There’s a good reason Puppeteer, Fantoccini and other "drive the browser" libraries support it (along with the HTML DOM API, of course).

In headless_chrome, Tab.find_element looks like this:

pub fn find_element(&self, selector: &str) -> Fallible<Element<'_>> {
    trace!("Looking up element via selector: {}", selector);

    let root_node_id = self.get_document()?.node_id;
    self.run_query_selector_on_node(root_node_id, selector)
}

Where Tab.run_query_selector_on_node looks like this:

pub fn run_query_selector_on_node(
    &self,
    node_id: NodeId,
    selector: &str,
) -> Fallible<Element<'_>> {
    let node_id = self
        .call_method(dom::methods::QuerySelector { node_id, selector })
        .map_err(NoElementFound::map)?
        .node_id;

    Element::new(&self, node_id)
}

That dom::methods::QuerySelector thing is a struct representing the parameters to this method in the Chrome DevTools Protocol: DOM.querySelector. It takes a ‘node id’ (just an integer) and a CSS selector as a string, and returns a node id.
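
To make the wire format a bit more concrete, here is a minimal sketch of the JSON text that a DOM.querySelector call boils down to. The function names (query_selector_message, escape_json) and the hand-rolled string formatting are assumptions made purely for illustration. This shows the shape of the message, not the crate's own serialization code.

```rust
/// Escape the two characters that would break a JSON string literal.
/// (A real serializer handles many more cases; this is only a sketch.)
fn escape_json(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Build the JSON text for a `DOM.querySelector` call.
/// `call_id` lets the caller match the eventual response to this request.
fn query_selector_message(call_id: u32, node_id: u32, selector: &str) -> String {
    format!(
        r#"{{"id":{},"method":"DOM.querySelector","params":{{"nodeId":{},"selector":"{}"}}}}"#,
        call_id,
        node_id,
        escape_json(selector)
    )
}

fn main() {
    // Search under node 1 for the div used later in this post.
    println!("{}", query_selector_message(1, 1, "div#position-test"));
}
```

The browser replies with a message carrying the same id and a result containing the nodeId of the first match, which is where the node id read in run_query_selector_on_node comes from.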

Tab.find_element is the equivalent of document.querySelector, so to tell the protocol’s DOM.querySelector method that we want to search the entire document, we pass it the node ID of the document (i.e. the root node of the DOM tree). Tab.find_elements is very similar, except it uses DOM.querySelectorAll and returns a Vec of Elements.

The Element struct just contains the element’s node id (so we can identify it when calling protocol methods) and a reference back to Tab (stored under the confusing-and-probably-should-change name of parent), which allows us to call methods on Tab, e.g. in Element.click:

pub fn click(&self) -> Fallible<&Self> {
    trace!("Clicking element {:?}", &self);

    self.scroll_into_view()?;

    let midpoint = self.get_midpoint()?;

    self.parent.click_point(midpoint)?;
    Ok(self)
}

All the methods on Element (including the ones called above, like scroll_into_view and get_midpoint) end up calling Tab to actually call protocol methods over the wire.
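
The relationship between the two types can be modelled in a few lines. This is a toy sketch, not the crate's real definitions: the Tab here only records what it was asked to do, the labels passed to call_method are made up, and the real types carry more state and return Fallible values.

```rust
use std::cell::RefCell;

/// Stand-in for the real Tab: it only records which calls were "sent".
struct Tab {
    sent: RefCell<Vec<String>>,
}

impl Tab {
    fn new() -> Self {
        Tab { sent: RefCell::new(Vec::new()) }
    }

    /// Hypothetical helper: the label is an arbitrary string, not a protocol method name.
    fn call_method(&self, label: &str, node_id: u32) {
        self.sent.borrow_mut().push(format!("{} on node {}", label, node_id));
    }
}

/// An element is just a node id plus a borrowed handle to its tab.
/// The lifetime `'a` stops an Element from outliving the Tab it points at.
struct Element<'a> {
    parent: &'a Tab,
    node_id: u32,
}

impl<'a> Element<'a> {
    fn new(parent: &'a Tab, node_id: u32) -> Self {
        Element { parent, node_id }
    }

    fn scroll_into_view(&self) -> &Self {
        self.parent.call_method("scroll_into_view", self.node_id);
        self
    }

    /// Like the real click, this scrolls first and then delegates to the tab.
    fn click(&self) -> &Self {
        self.scroll_into_view();
        self.parent.call_method("click", self.node_id);
        self
    }
}

fn main() {
    let tab = Tab::new();
    let element = Element::new(&tab, 42);
    element.click();
    for line in tab.sent.borrow().iter() {
        println!("{}", line);
    }
}
```

The point of the sketch is that Element owns no connection of its own. Everything it does is expressed as "ask the tab to do something to my node id".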

So adding find_element and find_elements to Element should be pretty straightforward, given that the DOM.querySelector protocol method already requires you to specify the id of the node you want to search under (the document / root node, in the case of Tab.find_element). That straightforwardness is probably a good thing, given that this post (with all its introductory context) is already a bit long.

Before we dive into implementing the new methods on Element, let’s write some tests. I like to use doctests on the methods themselves when I can, because it kills two birds (documenting and testing) with one stone. E.g. here’s the rustdoc output for Tab.find_element:

And I can run that test like this:

cargo test --doc -- Tab::find_element

We should be able to adapt it for Element.find_element.

By the way, the HTML being served up by the file server in that example above looks like this:

<body>
    <div>
        <div id="foobar"></div>
    </div>
    <div id="position-test">
        <div id="within"></div>
        <div id="strictly-above"></div>
        <div id="strictly-below"></div>
        <div id="strictly-left"></div>
        <div id="strictly-right"></div>
    </div>
</body>

So how about we make a test that grabs the ‘position-test’ div and looks within it for a div with ID strictly-above? Something like this:

let containing_element = initial_tab.navigate_to(&file_server.url())?
    .wait_until_navigated()?
    .find_element("div#position-test")?;
let inner_element = containing_element.find_element("#strictly-above")?;
let attrs = inner_element.get_attributes()?.unwrap();
assert_eq!(attrs["id"], "strictly-above");

And with a dummy implementation of Element.find_element, we compile successfully but fail the test (because of the arbitrarily set node_id):

pub fn find_element(&self, selector: &str) -> Fallible<Self> {
    let node_id = 5;
    Element::new(self.parent, node_id)
}

Let’s try this instead:

pub fn find_element(&self, selector: &str) -> Fallible<Self> {
    self.parent
        .run_query_selector_on_node(self.node_id, selector)
}

And run it like this:

> cargo test --doc -- Element::find_element --nocapture           
  Compiling headless_chrome v0.9.0 (/home/alistair/code/rust/headless_chrome)
   Finished dev [unoptimized + debuginfo] target(s) in 3.64s
  Doc-tests headless_chrome

running 1 test
test src/browser/tab/element/mod.rs - browser::tab::element::Element::find_element (line 77) ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 12 filtered out

Good news! I think that’s worthy of a commit.
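
To see why passing self.node_id instead of the document's node id is all it takes to scope the search, here is a toy in-memory model of the idea. Everything in it is an assumption for illustration: the Node struct, the sample_dom helper, and the restriction to matching on an id attribute. Chrome does the real work on the other side of the protocol.

```rust
/// A node in a toy DOM: an `id` attribute plus the indices of its children.
struct Node {
    html_id: &'static str,
    children: Vec<usize>,
}

/// A cut-down version of the test page; index 0 plays the document root.
fn sample_dom() -> Vec<Node> {
    vec![
        Node { html_id: "", children: vec![1, 3] },              // 0: body
        Node { html_id: "", children: vec![2] },                 // 1: unnamed div
        Node { html_id: "foobar", children: vec![] },            // 2
        Node { html_id: "position-test", children: vec![4, 5] }, // 3
        Node { html_id: "within", children: vec![] },            // 4
        Node { html_id: "strictly-above", children: vec![] },    // 5
    ]
}

/// Depth-first, document-order search of the descendants of `start`.
/// The start node itself is never returned, only nodes beneath it.
fn query_selector_on_node(nodes: &[Node], start: usize, html_id: &str) -> Option<usize> {
    for &child in &nodes[start].children {
        if nodes[child].html_id == html_id {
            return Some(child);
        }
        if let Some(found) = query_selector_on_node(nodes, child, html_id) {
            return Some(found);
        }
    }
    None
}

fn main() {
    let nodes = sample_dom();
    // Searching from the root is the Tab.find_element case.
    println!("{:?}", query_selector_on_node(&nodes, 0, "foobar"));
    // Searching from node 3 is the Element.find_element case:
    // "foobar" lives outside that subtree, so it is not found.
    println!("{:?}", query_selector_on_node(&nodes, 3, "foobar"));
    println!("{:?}", query_selector_on_node(&nodes, 3, "strictly-above"));
}
```

The same function serves both cases. Only the starting node changes, which mirrors how Element.find_element reuses Tab.run_query_selector_on_node.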

And now a doctest for Element.find_elements:

let containing_element = initial_tab.navigate_to(&file_server.url())?
    .wait_until_navigated()?
    .find_element("div#position-test")?;
let inner_divs = containing_element.find_elements("div")?;
assert_eq!(inner_divs.len(), 5);

With an implementation like this:

pub fn find_elements(&self, selector: &str) -> Fallible<Vec<Self>> {
    self.parent
        .run_query_selector_all_on_node(self.node_id, selector)
}

Which also requires adding a new method to Tab, as a companion to Tab.run_query_selector_on_node:

pub fn run_query_selector_all_on_node(
    &self,
    node_id: NodeId,
    selector: &str,
) -> Fallible<Vec<Element<'_>>> {
    let node_ids = self
        .call_method(dom::methods::QuerySelectorAll { node_id, selector })
        .map_err(NoElementFound::map)?
        .node_ids;

    node_ids
        .iter()
        .map(|node_id| Element::new(&self, *node_id))
        .collect()
}

Okay, that passes:

> cargo test --doc -- Element::find_elements --nocapture                    
   Compiling headless_chrome v0.9.0 (/home/alistair/code/rust/headless_chrome)
    Finished dev [unoptimized + debuginfo] target(s) in 3.06s
   Doc-tests headless_chrome

running 1 test
test src/browser/tab/element/mod.rs - browser::tab::element::Element::find_elements (line 114) ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 13 filtered out
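
One detail of run_query_selector_all_on_node that may not be obvious: the map closure yields one Fallible per node id, yet collect produces a single Fallible holding a Vec. That works because Rust can collect an iterator of Result values into a Result of a collection, stopping at the first error. Here is a small standalone sketch of that behaviour. The make_element function is a hypothetical stand-in for Element::new, and plain String errors replace the crate's error type.

```rust
/// Hypothetical stand-in for `Element::new`: pretend node id 0 is invalid.
fn make_element(node_id: u32) -> Result<String, String> {
    if node_id == 0 {
        Err("node id 0 is not a valid element".to_string())
    } else {
        Ok(format!("Element({})", node_id))
    }
}

/// Same shape as the iterator chain in `run_query_selector_all_on_node`:
/// each item is a Result, and `collect` folds them into one Result<Vec<_>, _>.
fn make_elements(node_ids: &[u32]) -> Result<Vec<String>, String> {
    node_ids.iter().map(|id| make_element(*id)).collect()
}

fn main() {
    // Every id is valid, so we get Ok with all three elements.
    println!("{:?}", make_elements(&[3, 4, 5]));
    // The second id fails, so the whole call returns that error.
    println!("{:?}", make_elements(&[3, 0, 5]));
}
```

The upshot for callers of find_elements is that they get either every element or an error, never a partially filled Vec.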

And that should be enough to close the issue! I've updated the changelog and opened a PR.

Unfortunately my continuous integration setup is a bit unreliable at the moment — I have problems chiefly with Travis timing out because of the way it caches huge amounts of build artifacts. I’ve been putting it off, but not being able to have it reliably, automatically test across stable & nightly and Linux / Windows / Mac is really starting to get to me. Could be fodder for the next blog post 🙂

If you have any questions or are thinking about contributing, feel free to reach out. And definitely let me know if there are things I'm doing that I could be doing better.
