Everton Tenorio

A JavaScript scraper for the Wikipedia Academy Award List.

Scraping the Academy Award winners listed on Wikipedia with cheerio and saving them to a CSV file.

Today's post is a simple demonstration of how to scrape data with JavaScript and the cheerio library, using the list of Academy Award winners taken directly from Wikipedia.

First, install the necessary packages:

npm install cheerio axios

The URL used is:

const url = 'https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films';
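
A side note on the URL: the %E2%80%93 part is the percent-encoded form of the en dash in the page title. The sketch below only illustrates that with the built-in decodeURI and encodeURI functions; the variable names encodedUrl and readableUrl are made up for this example and are not part of the scraper.

```javascript
// The page title contains an en dash (–), percent-encoded as %E2%80%93.
const encodedUrl = 'https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films';

// decodeURI turns the UTF-8 escape sequence back into the readable character.
const readableUrl = decodeURI(encodedUrl);
console.log(readableUrl);

// encodeURI reverses the step and gives back the form used in the article.
console.log(encodeURI(readableUrl) === encodedUrl);
```

Either form should point at the same page; the encoded one is simply safer to paste into source code.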

Next, we fetch the page with axios, parse the HTML with cheerio's load function, and prepare two arrays: one for the column headers and one for the rows of the table:

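import fs from 'node:fs';
import axios from 'axios';
import * as cheerio from 'cheerio';

// Top-level await: run the file as an ES module.
const { data: html } = await axios.get(url);
const $ = cheerio.load(html);

const theadData = [];
const tableData = [];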


Now we select elements and traverse the DOM. The $ function returns Cheerio objects, which we walk with find and each:

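// Collect the header cells (<th>) found in each table body.
$('tbody').each((i, column) => {
  const columnData = [];
  $(column)
    .find('th')
    .each((j, cell) => {
      // trim() drops surrounding whitespace, including every newline
      columnData.push($(cell).text().trim());
    });
  theadData.push(columnData);
});

// The headers of the first table become the first line of the CSV.
tableData.push(theadData[0]);

// Collect the data cells (<td>) of every row; rows without <td> are skipped.
$('table tr').each((i, row) => {
  const rowData = [];
  $(row)
    .find('td')
    .each((j, cell) => {
      rowData.push($(cell).text().trim());
    });

  if (rowData.length) tableData.push(rowData);
});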
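
The rows collected above are plain arrays of strings, so a cleaning pass could be added before saving. The sketch below is not part of the original scraper: cleanCell and cleanRows are hypothetical helpers, the sample rows are invented, and it assumes the cell text may carry bracketed footnote markers such as "[1]" or "[a]", which is common on Wikipedia but worth checking against the real output.

```javascript
// Hypothetical helper: strip bracketed footnote markers such as "[1]" or "[a]"
// and collapse runs of whitespace left over from the HTML.
function cleanCell(text) {
  return text
    .replace(/\[[^\]]*\]/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Apply the helper to every cell of every row (an array of arrays, like tableData).
function cleanRows(rows) {
  return rows.map((row) => row.map((cell) => cleanCell(cell)));
}

// Sample rows made up for illustration; they are not taken from the real page.
const sampleRows = [
  ['Film', 'Year', 'Awards', 'Nominations'],
  ['Example Film[a]', '2001\n', '3', '5 [1]'],
];

console.log(cleanRows(sampleRows));
```

Note that the regular expression removes any bracketed text, so a title that legitimately contains square brackets would lose that part; a stricter pattern may be needed in that case.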

Glad you still know jQuery...

Finally, save the data as is, without any further processing 😅, to a .csv file with fs.writeFileSync.

Note that I used ";" as the delimiter.

const csvContent = tableData
    .map((row) => row.join(';')) 
    .join('\n');

fs.writeFileSync('academy_awards.csv', csvContent, 'utf-8');
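
A plain join works as long as no cell contains the delimiter itself. As a hedged alternative, the sketch below quotes a field when it contains the delimiter, a double quote, or a line break, doubling any inner quotes as is conventional for CSV. The helpers escapeCsvField and toCsv and the sample rows are hypothetical and not part of the original script.

```javascript
// Hypothetical helper: quote a field only when it contains the delimiter,
// a double quote, or a line break; inner quotes are doubled.
function escapeCsvField(value, delimiter = ';') {
  const text = String(value);
  const needsQuotes =
    text.includes(delimiter) || text.includes('"') || /[\r\n]/.test(text);
  return needsQuotes ? `"${text.replace(/"/g, '""')}"` : text;
}

// Build the CSV text from an array of rows, like tableData.
function toCsv(rows, delimiter = ';') {
  return rows
    .map((row) => row.map((cell) => escapeCsvField(cell, delimiter)).join(delimiter))
    .join('\n');
}

// Sample rows made up for illustration.
const sample = [
  ['Film', 'Year'],
  ['A; B', '2001'],
  ['Say "Hi"', '2002'],
];

console.log(toCsv(sample));
```

The result of toCsv(tableData) could then be passed to fs.writeFileSync in place of csvContent.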

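Run the script. Because it uses top-level await, Node must treat the file as an ES module, for example by setting "type": "module" in package.json: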

node scraper.js


I’ve written other tutorials here on dev.to about scraping with Go and Python. If this article helped you or you enjoyed it, consider contributing with a donation.
