DEV Community

Benjamin Mock

How to programmatically extract data from a webpage (e.g. your dev.to reading list)

Sometimes you need to extract the information presented on a webpage in a structured form. Let's take your dev.to reading list as an example and imagine that we want to extract a list of the articles shown there. So scroll through your reading list, open the browser console, and let's get going!

Let's inspect the elements that contain the links to the articles.

We can see that the anchor which contains the link has the class item. So let's try to grab all elements on the page with class="item".

document.querySelectorAll('.item')

This will return a NodeList with the selected elements.

Next, we want to convert this NodeList to an array, because a NodeList supports forEach but lacks array methods such as map and reduce. We use Array.from for that:

Array.from(document.querySelectorAll('.item'))
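
The two steps can also be wrapped in a small helper. This is only a sketch: selectAll is a hypothetical name that does not appear in the original snippets, and root is assumed to be anything with a querySelectorAll method, normally document.

```javascript
// Hypothetical helper (not from the original post): run a selector
// and return a real array instead of a NodeList.
function selectAll(root, selector) {
  // Array.from copies any iterable or array-like value into an array.
  return Array.from(root.querySelectorAll(selector));
}

// In the browser console:
//   const items = selectAll(document, '.item');
//   Array.isArray(items);                            // true
//   typeof document.querySelectorAll('.item').map;   // 'undefined'
```

Passing the root in as a parameter means the same helper also works on a single element, for example to search only inside one container.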

We now have an array of the selected DOM elements, which contain all the necessary information. To get an array of just the links, we can simply read the href property of each element.

Array.from(document.querySelectorAll('.item'))
  .map(a => a.href)

But it would be nicer to also have the title. So let's have a look at the DOM structure again:

We can see that the title is contained in a div with the class item-title inside the already selected anchor. So we can call querySelector on that anchor to get the title:

Array.from(document.querySelectorAll('.item')).map(a => ({
  href: a.href,
  title: a.querySelector('.item-title').innerText,
}))

To access the text content of a DOM node, we can use the innerText property.
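
If an entry has no .item-title child, querySelector returns null and reading innerText from it throws a TypeError. A slightly more defensive sketch follows, assuming the same item and item-title class names; the function name extractArticles and the empty-string fallback are illustrative choices, not part of the original snippet.

```javascript
// Hypothetical helper: collect { href, title } objects from a root node.
function extractArticles(root) {
  return Array.from(root.querySelectorAll('.item')).map(a => {
    const titleNode = a.querySelector('.item-title');
    return {
      href: a.href,
      // Fall back to an empty string when the title element is missing;
      // trim() removes surrounding whitespace from the rendered text.
      title: titleNode ? titleNode.innerText.trim() : '',
    };
  });
}

// In the browser console:
//   extractArticles(document);
```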

Well done! We now have all the information of our reading list as structured content.
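
One way to take that structured content elsewhere is to serialize it as JSON. This is a small sketch, assuming the array has the { href, title } shape built earlier; toJson is a hypothetical helper name and the sample entry is made up.

```javascript
// Hypothetical helper: turn the extracted array into a JSON string.
function toJson(articles) {
  // The third argument indents the output by two spaces for readability.
  return JSON.stringify(articles, null, 2);
}

// Made-up sample entry for illustration only.
const sample = [{ href: 'https://dev.to/example/post', title: 'Example post' }];
console.log(toJson(sample));
```

In consoles that provide the copy() utility, such as Chrome DevTools, wrapping the call as copy(toJson(...)) places the string on the clipboard.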

If you want to get your links as a pastable markdown snippet, you wouldn't return an object from the map callback, but a string of the form [title](href). Afterwards you can use reduce to boil the array down to a single string that contains the links as a list.

Array.from(document.querySelectorAll('.item'))
  .map(a => `[${a.querySelector('.item-title').innerText}](${a.href})`)
  .reduce((acc, e) => `${acc}\n* ${e}`, '')
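
Note that starting reduce from an empty string leaves a leading newline in the result. An alternative sketch, not the approach shown above, formats each bullet inside map and combines them with join. The name toMarkdownList is hypothetical, the sample data is made up, and the input is assumed to be an array of { href, title } objects.

```javascript
// Hypothetical helper: build a markdown bullet list without a leading newline.
function toMarkdownList(articles) {
  return articles
    .map(({ title, href }) => `* [${title}](${href})`)
    .join('\n');
}

// Made-up sample data for illustration only.
const articles = [
  { href: 'https://dev.to/example/first', title: 'First post' },
  { href: 'https://dev.to/example/second', title: 'Second post' },
];
console.log(toMarkdownList(articles));
// * [First post](https://dev.to/example/first)
// * [Second post](https://dev.to/example/second)
```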
