Scraping Soccer Data with Node.js and Puppeteer

#scraping #sports #node

I recently needed sports data, soccer data in particular. I am writing this post because I had to overcome a few initial problems along the way. You should be able to follow my reasoning and the path that led me to a solution.

For this tutorial, I picked flashscore.com, a website that covers plenty of leagues and provides fixtures and live matches.

I started with the following basic script:

const axios = require('axios');

// performing a GET request
axios.get('https://www.flashscore.com/')
    .then(response => {
        // handling the success: print the returned HTML
        const html = response.data;
        console.log(html);
    })
    // handling error (written to stderr so it does not end up in redirected output)
    .catch(error => {
        console.error(error);
    });

To investigate what the script returns, I redirected its output into a test.html file:

node scraper.js > test.html

After opening the HTML file in my browser, I quickly realized that all the match information shown on the original website was missing. This was no big surprise, as I expected the content to be rendered by JavaScript.
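The same check can be done without opening the browser, by searching the returned HTML for the class name that the rendered match rows carry. The following is only a minimal sketch: `hasMatchRows` is a hypothetical helper, the two HTML strings are made-up samples, and the `event__match` marker is an assumption based on the selectors used in the Puppeteer script in this post.

```javascript
// Hypothetical helper: returns true if the raw HTML string mentions the
// class-name marker that rendered match rows carry (assumed: 'event__match').
function hasMatchRows(html, marker = 'event__match') {
  return typeof html === 'string' && html.includes(marker);
}

// Made-up sample of what a plain HTTP client might receive: an empty shell
const staticHtml = '<html><body><div id="live-table"></div></body></html>';
// Made-up, shortened sample of the markup after the page's scripts have run
const renderedHtml = '<div class="event__match event__match--oneLine">...</div>';

console.log(hasMatchRows(staticHtml));   // false
console.log(hasMatchRows(renderedHtml)); // true
```

In the axios script, the same helper could be called on `response.data` to see at a glance whether the match rows made it into the response.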

As the script above is written in Node.js, I started to play around with Puppeteer, a Node library that provides a high-level API to control headless Chrome or Chromium.

After some time, I ended up with the following piece of code:

const puppeteer = require('puppeteer');

// initiating Puppeteer
puppeteer
    .launch()
    .then(async browser => {
        try {
            // opening a new page and navigating to Flashscore
            const page = await browser.newPage();
            await page.goto('https://www.flashscore.com/');
            await page.waitForSelector('body');

            // running code inside the page to read the match rows
            const grabMatches = await page.evaluate(() => {
                const allLiveMatches = document.body.querySelectorAll('.event__match--oneLine');

                // collecting one object per match row
                const scrapeItems = [];
                allLiveMatches.forEach(item => {
                    try {
                        const homeTeam = item.querySelector('.event__participant--home').innerText;
                        const awayTeam = item.querySelector('.event__participant--away').innerText;
                        const currentHomeScore = item.querySelector('.event__scores.fontBold span:nth-of-type(1)').innerText;
                        const currentAwayScore = item.querySelector('.event__scores.fontBold span:nth-of-type(2)').innerText;
                        scrapeItems.push({
                            homeTeam: homeTeam,
                            awayTeam: awayTeam,
                            currentHomeScore: currentHomeScore,
                            currentAwayScore: currentAwayScore,
                        });
                    } catch (err) {
                        // skipping rows that lack a team name or score element
                    }
                });
                return { liveMatches: scrapeItems };
            });

            // outputting the scraped data
            console.log(grabMatches);
        } finally {
            // closing the browser, even if scraping failed
            await browser.close();
        }
    })
    // handling any errors
    .catch(err => {
        console.error(err);
    });
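One thing to keep in mind: waiting for `body` only guarantees that the page shell exists, not that the match rows have been rendered. A possible refinement, sketched below under assumptions, is to wait for the match-row selector itself. `waitForMatches` is a hypothetical helper and the 10-second default is an arbitrary choice; it relies only on Puppeteer's `page.waitForSelector(selector, { timeout })`, which rejects when nothing matches in time.

```javascript
// Hypothetical helper: resolves to true once at least one element matching
// `selector` exists on the page, or to false if waiting fails.
// `page` is expected to be a Puppeteer Page object.
async function waitForMatches(page, selector = '.event__match--oneLine', timeoutMs = 10000) {
  try {
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return true;
  } catch (err) {
    // waitForSelector rejects when the timeout elapses without a match;
    // any other failure while waiting also ends up here
    return false;
  }
}
```

In the script above, this could take the place of the `waitForSelector` call on `body`, for example `const found = await waitForMatches(page);`, printing a hint and skipping the evaluation step when `found` is `false` (which may simply mean that no matches are listed at that moment).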

Now I ran the script again with the following command:

node scraper.js

This time, I retrieved a beautiful list of live matches with their teams and current scores.
Of course, plenty of work could still go into sorting the data by league, country, and so on.
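As a rough idea of what such post-processing could look like, here is a minimal sketch that groups match objects by one of their properties. Note the assumptions: the script above does not scrape a `league` field, so that property, the `groupMatchesBy` helper, and the sample data are all hypothetical.

```javascript
// Hypothetical helper: groups match objects by the value of one property.
// Matches without that property end up in the 'unknown' group.
function groupMatchesBy(matches, key) {
  const groups = Object.create(null); // no inherited keys to collide with
  for (const match of matches) {
    const groupName = match[key] !== undefined ? match[key] : 'unknown';
    if (!groups[groupName]) {
      groups[groupName] = [];
    }
    groups[groupName].push(match);
  }
  return groups;
}

// Made-up sample data; `league` is an assumed extra field
const sample = [
  { league: 'League X', homeTeam: 'Team A', awayTeam: 'Team B', currentHomeScore: '1', currentAwayScore: '0' },
  { league: 'League Y', homeTeam: 'Team C', awayTeam: 'Team D', currentHomeScore: '2', currentAwayScore: '2' },
  { league: 'League X', homeTeam: 'Team E', awayTeam: 'Team F', currentHomeScore: '0', currentAwayScore: '3' },
];

const byLeague = groupMatchesBy(sample, 'league');
console.log(Object.keys(byLeague));       // [ 'League X', 'League Y' ]
console.log(byLeague['League X'].length); // 2
```

The same helper would work for a country or any other field, as long as the scraper collects it for each match.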

For my use case, this snippet was enough. If you aim for more serious scraping, you may as well pick a general sports or soccer API (e.g. sportdataapi.com or xmlsoccer.com).

Happy Scraping :-)

Top comments (1)

Cleber Fernandes • Apr 14 '22

Hello,
How do I insert this fetched data into a database ... using MySQL?