DEV Community

Marie Creel

Web Scraping: Use ARIA attributes to crawl accessible components

If you're a developer working on the front end of a web application, chances are that you've been asked to take web accessibility into account when building a new custom component. While using semantic HTML can help address many issues with accessibility and should be our first step when building accessible components, there are more complex JavaScript components which require ARIA implementation to be fully accessible. However, these ARIA attributes aren't just useful for assistive technology users; you can also leverage these attributes to scrape data from server-generated content using a headless browser.

When are ARIA attributes necessary?

Consider the combobox role. If you've ever laughed at a silly suggested search when typing a query into Google, then you've directly interacted with a combobox. Essentially, a combobox is a text input that is associated with a list of suggested values. When you type into the text input, a list of links appears below the input, and these links likely autocomplete the phrase you're typing. You can click on one of the links to autocomplete your query, or you can use the arrow keys to move up and down through the list to select an option.

To make the combobox accessible to all users, ARIA attributes must be added to the different components of the combobox and changed throughout the course of interaction so that assistive technology users know when the results appear, which result they've selected, and how to interact with those results. Assistive technologies will then access those ARIA attributes and (hopefully) communicate relevant information to the user, though this is highly dependent on what browser and assistive tech combo the user is using.

A screenshot of Google predictive search results for the incomplete query "how do stars". A list of results appears below the text input.

Comboboxes are pretty ubiquitous across the web. However, there isn't a <combobox> element that allows us to build one using semantic HTML. You could build a combobox using <input> and <datalist> elements, but as of right now most browsers do not support images or links in <datalist> elements. This is a problem for some developers, since they are often implementing a custom component which requires certain features beyond text content. eCommerce sites may want to display images of recommended products, links for current promotions, or a number of other options which require more complex markup than the <select>, <input>, and <datalist> elements can provide. Or, more commonly in my experience, a developer may be refactoring an existing component for accessibility and may be required to maintain as much of the original markup as possible to keep the project at an appropriate scale.

Okay, but what do ARIA attributes do?

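ARIA attributes most commonly describe two types of information about an element:

  • The current state of the element, such as whether a combobox's list of results is currently expanded (aria-expanded).
  • The relationships between the element and other elements on the page, such as which listbox a combobox owns (aria-owns) or which element an input controls (aria-controls).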

There are other types of information that can be communicated through ARIA attributes as well, and this information is hugely important for ensuring the accessibility of custom components on your site. As developers, we can also leverage these attributes, especially the attributes that describe relationships between elements, to get the data we need from a custom component when scraping a site.

For example, let's say we want to collect the top five suggested search results for a particular search string on Google. How do we get that information programmatically? We need some information about the relationship between the search input and the predictive search results to get the data we need, and ARIA attributes, if implemented correctly, can give us that information for free!

When should I use ARIA attributes to scrape a site?

TL;DR: If the information you need is presented in a component made accessible using ARIA attributes, then you could use ARIA attributes to scrape the page. Below, I talk a bit about a specific instance where I used ARIA attributes to scrape a combobox.

My foray into web scraping began with a personal project near and dear to my closet; I'm currently building a React Native port of Lolibrary's search engine. Lolibrary is a nonprofit organization that documents the history of a particular sub-style of Japanese alternative fashion, lolita fashion. There are dozens of fashion brands that have been selling unique frocks since the early 2000s, and hardcore Harajuku historians regularly archive the original selling price, stock photos, measurements, and more for different releases in the Lolibrary database. Lolibrary is also an important resource to ensure you're not being scammed when buying pieces secondhand, since the lolita fashion secondhand market is rife with scalpers and poorly-made replicas. For all intents and purposes, Lolibrary is considered the primary reference library for lolita fashion, and for that reason it's an important resource for the lolita fashion community.

Example reference page for a dress on Lolibrary.org. There's an image of the dress, along with the release year, original price, garment measurements, and lots of additional information.

I've always wished that there was a Lolibrary app so that I could search for particular releases without visiting the mobile site on my phone (sorry Lolibrary devs 😔), so earlier this year I began working on a pocket app port of the search engine that fetches search results using the Lolibrary search API. I've got the basic functionality working well; however, I have not yet implemented search filters like category, color, year, etc. The values for each of these filters are locked behind a password-protected API, and are otherwise only available on the search screen in the combobox elements for each filter.

Screenshot of search filter comboboxes on Lolibrary.org. There are multiple filters: category, brand, features, colorway, tags, and year

As is typical of comboboxes, the connected list of filter values is empty and hidden until the user interacts with each filter input, and the filter values are added to the dropdown list of options using JavaScript. I thought it could be possible that these values were fetched using a request to the Lolibrary search API, but when I monitored the network tab in devtools while interacting with these comboboxes, I didn't see any requests sent. Upon further inspection, I realized that the app was built using Vue, and the values for each filter were likely fetched and stored somewhere in the props during a server-side rendering step.

At this point, I came to the conclusion that I would have to collect the filter data without the Lolibrary API if I wanted to use it. I decided that I would build my own API to serve Lolibrary filter values, and I would scrape Lolibrary to get the filter information. Because displaying the filter values required interaction, it wasn't possible to scrape the page using a package like cheerio, which parses static HTML and does not run the page's JavaScript, so I decided to use puppeteer instead.
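To make that difference concrete, here is a minimal sketch of opening a page in a headless browser with puppeteer. The `withPage` helper and the URL passed to it are assumptions for illustration, not code from my scraper; the puppeteer module is passed in as an argument so the sketch has no hard dependency. The calls it makes (`launch`, `newPage`, `goto`, `close`) are standard puppeteer methods.

```javascript
// Hypothetical helper: open a URL in a headless browser and hand the page to a callback.
// `puppeteer` is the imported puppeteer module, injected to keep the sketch testable.
async function withPage(puppeteer, url, callback) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // wait until network activity settles so client-side rendering can finish
    await page.goto(url, { waitUntil: "networkidle2" });
    return await callback(page);
  } finally {
    // always close the browser, even if the callback throws
    await browser.close();
  }
}
```

In a real script this might be called as `withPage(require("puppeteer"), searchUrl, scrapeFilters)`, where `searchUrl` and `scrapeFilters` are placeholders for the search page URL and your own scraping function.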

Show me some code!

Note: you can see the full source code on my GitHub. The entire source code is < 100 lines, so it's not a long read.

To start the scraping process, I inspected the combobox elements on the Lolibrary search page to identify which selectors I would need to target on the page. The general structure of the combobox for each filter looks like this:

<div class="input-group pb-2">
    <label class="control-label">Category</label> 
    <div dir="auto" class="v-select vs--multiple vs--searchable" style="width: 100%;"> 
        <div id="vs1__combobox" role="combobox" aria-expanded="false" aria-owns="vs1__listbox" aria-label="Search for option" class="vs__dropdown-toggle">
            <div class="vs__selected-options"> 
                <input placeholder="Tap to filter" aria-autocomplete="list" aria-labelledby="vs1__combobox" aria-controls="vs1__listbox" type="search" autocomplete="off" class="vs__search">
            </div> 
            <div class="vs__actions">
                <button type="button" title="Clear Selected" aria-label="Clear Selected" class="vs__clear" style="display: none;">
                    <svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">
                        <path d="M6.895455 5l2.842897-2.842898c.348864-.348863.348864-.914488 0-1.263636L9.106534.261648c-.348864-.348864-.914489-.348864-1.263636 0L5 3.104545 2.157102.261648c-.348863-.348864-.914488-.348864-1.263636 0L.261648.893466c-.348864.348864-.348864.914489 0 1.263636L3.104545 5 .261648 7.842898c-.348864.348863-.348864.914488 0 1.263636l.631818.631818c.348864.348864.914773.348864 1.263636 0L5 6.895455l2.842898 2.842897c.348863.348864.914772.348864 1.263636 0l.631818-.631818c.348864-.348864.348864-.914489 0-1.263636L6.895455 5z">
                        </path>
                    </svg>
                </button> 
                <svg xmlns="http://www.w3.org/2000/svg" width="14" height="10" role="presentation" class="vs__open-indicator">
                    <path d="M9.211364 7.59931l4.48338-4.867229c.407008-.441854.407008-1.158247 0-1.60046l-.73712-.80023c-.407008-.441854-1.066904-.441854-1.474243 0L7 5.198617 2.51662.33139c-.407008-.441853-1.066904-.441853-1.474243 0l-.737121.80023c-.407008.441854-.407008 1.158248 0 1.600461l4.48338 4.867228L7 10l2.211364-2.40069z">
                    </path>
                </svg> 
                <div class="vs__spinner" style="display: none;">Loading...</div>
            </div>
        </div> 
        <ul id="vs1__listbox" role="listbox" style="display: none; visibility: hidden;">
        </ul> 
    </div> 
    <!---->
</div>

From this snippet, I'm interested in three selectors:

  • .input-group > label.control-label: this is the name of the filter associated with the combobox. This string will be the key we use to access the values for each filter, so we need to store it in a hash and send it to our database along with the associated filter values.
  • .v-select > .vs__dropdown-toggle[role="combobox"]: this is the combobox wrapper div, and it has role="combobox", so I know from the combobox role specification that it will have many useful ARIA attributes attached. The input we need to interact with is contained within this div as well.
  • ul[role="listbox"]: I'm really interested in using the id on this element as a selector for the filter values. I will get the id by grabbing the aria-owns attribute from the combobox element.
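The relationship described in the last bullet can be sketched without a browser at all. Assuming a combobox's attributes have already been read into a plain object, the hypothetical `relatedSelectors` helper below (not part of the original scraper) derives a selector for the listbox the combobox owns and one for the input that controls that listbox:

```javascript
// Hypothetical helper: given a combobox's attributes as a plain object,
// build selectors for the elements its ARIA attributes point to.
function relatedSelectors(comboboxAttributes) {
  const listboxId = comboboxAttributes["aria-owns"];
  if (!listboxId) {
    throw new Error("combobox has no aria-owns attribute");
  }
  return {
    // the listbox is referenced by its ID
    listbox: `#${listboxId}`,
    // the input points at the same listbox through aria-controls
    input: `input[aria-controls='${listboxId}']`,
    // aria-expanded is a string attribute, so compare against "true"
    expanded: comboboxAttributes["aria-expanded"] === "true",
  };
}

const selectors = relatedSelectors({
  role: "combobox",
  "aria-expanded": "false",
  "aria-owns": "vs1__listbox",
});
console.log(selectors.listbox); // #vs1__listbox
console.log(selectors.input); // input[aria-controls='vs1__listbox']
```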

To start, I store the first two selectors in variables. I want to build a map with the filter names as keys and empty arrays as the values so that I can easily push the filter values into the array. I also want to associate each filter name with the appropriate listbox ID, so I will grab the ID from the listbox as well.

  const filterNameSelector = ".input-group > label.control-label";
  const filterComboboxSelector =
    ".v-select > .vs__dropdown-toggle[role='combobox']";
  // ...
  // get the filter names
  const filtersHandle = await page
    .waitForSelector(filterNameSelector)
    .then(() => {
      return page.$$(filterNameSelector);
    });
  // set up the filter map
  // (assumes `filters` was declared as an empty object in the elided code above)
  for (let i = 0; i < filtersHandle.length; i++) {
    // key for each filter
    const header = await filtersHandle[i].evaluate((node) => node.innerText);
    const listboxId = await filtersHandle[i].evaluate((node) => {
      // the next sibling should be the div that contains both the combobox and listbox
      const sibling = node.nextElementSibling;
      // the listbox appears after the combobox in the DOM
      const id = sibling.children[1].id;
      return id;
    });
    filters[header.toLowerCase()] = { values: [], listboxId: listboxId };
  }

After this step, we have an object that looks something like this:

{
  category: { values: [], listboxId: 'vs1__listbox' },
  brand: { values: [], listboxId: 'vs2__listbox' },
  features: { values: [], listboxId: 'vs3__listbox' },
  colorway: { values: [], listboxId: 'vs4__listbox' },
  tags: { values: [], listboxId: 'vs5__listbox' },
  year: { values: [], listboxId: 'vs6__listbox' }
}
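Because every entry records its listboxId, this shape can also be inverted into a lookup from listbox ID to filter name, which is handy whenever an aria-owns value needs to be mapped back to a filter. The `indexByListboxId` helper below is a hypothetical addition rather than part of the original script:

```javascript
// Hypothetical helper: invert the filters map into { listboxId: filterName }.
function indexByListboxId(filters) {
  const index = {};
  for (const [name, { listboxId }] of Object.entries(filters)) {
    index[listboxId] = name;
  }
  return index;
}

const exampleFilters = {
  category: { values: [], listboxId: "vs1__listbox" },
  brand: { values: [], listboxId: "vs2__listbox" },
};
const listboxIndex = indexByListboxId(exampleFilters);
console.log(listboxIndex["vs2__listbox"]); // brand
```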

In the second half, we need to interact with the input and scrape the values that appear within the listbox. This is where the ARIA attributes on the combobox and input elements become useful:

  // interact with the filter comboboxes to get filter values
  const filterComboboxesHandle = await page
    .waitForSelector(filterComboboxSelector)
    .then(() => {
      return page.$$(filterComboboxSelector);
    });
  for (let i = 0; i < filterComboboxesHandle.length; i++) {
    const ariaOwns = await filterComboboxesHandle[i].evaluate(
      (node) => node.attributes["aria-owns"].nodeValue
    );
    // focus on the input
    await page.waitForSelector(`input[aria-controls='${ariaOwns}']`);
    await page.click(`input[aria-controls='${ariaOwns}']`);
    let filterName = "";
    for (const key of Object.keys(filters)) {
      // compare the ariaOwns attribute with the listbox ID we collected earlier
      if (filters[key].listboxId === ariaOwns) {
        filterName = key;
        break;
      }
    }
    // now that the listbox is visible, we can select it and scrape the values
    const filterListboxHandle = await page
      .waitForSelector(`#${ariaOwns}`, { visible: true })
      .then(() => {
        return page.$(`#${ariaOwns}`);
      });
    const filterValues = await filterListboxHandle.evaluate((node) => {
      let values = [];
      for (const child of node.children) {
        values.push(child.textContent.trim());
      }
      return values;
    });
    filters[filterName].values = filterValues;
    // click another element to clear browser focus.
    // if we don't do this, the focus will be stuck in the first input,
    // so the second listbox will never show when we click it.
    await page.click(".card-header");
  }

Let's break this down:

  1. Use the combobox selector we defined earlier to grab all the combobox elements on the page with page.$$(filterComboboxSelector).
  2. For each combobox, we grab the aria-owns attribute using vanilla JS. Then, we iterate over the filters in the filters hash and compare aria-owns to the listboxId stored in the filter-specific hash.
  3. Interact with the input element that controls the listbox we're interested in. The aria-controls attribute should match the listbox ID from the previous step. If we don't interact with the input, the listbox will remain invisible and empty (it's quite shy! 😭).
  4. If the aria-owns attribute matches the listbox ID for a particular filter, we scrape the text contents of the listbox, remove white space, and push the contents to the values array for that specific filter.

All in all, using the ARIA attributes was a neat way for me to identify relationships between elements I was scraping and correctly label the data I needed.

Conclusion

Learning about web accessibility and how ARIA attributes work is worthwhile in its own right. I encourage you to learn the basics of web accessibility so that everyone, regardless of cognitive or physical ability, can have fair and equal access to your applications. However, if you need an extra push, I hope this article shows how ARIA attributes enable software, such as screen readers and web scrapers, to access the content made accessible by those attributes.
