DEV Community


How to Build an API To Perform Web Scraping in Spring Boot

Arghya Ghosh
I am a student, designer, and beginner developer.
・5 min read

Many online services do not offer APIs to give access to their public data. At the same time, they might have all this data available on their website. In such a circumstance, why not scrape it?

Web scraping is a complicated subject, and performing it consistently may require an equally complex solution. If that is your case, I recommend this article, where I delve into building a robust, modern web scraping script that rotates IPs and User-Agents.

In most cases, you do not need such a sophisticated system. In fact, an API that is capable of scraping data on-the-fly from a template-consistent website should be enough.

Let’s see how to build such an API to scrape data from a particular website in Spring Boot.

Please note that the code will be written in Kotlin, but the same result can be achieved in Java as well.
...

1. Adding the Required Dependencies

First, you need a library to perform web scraping in Spring Boot.

Since Kotlin is interoperable with Java, you can use any Java web scraping library. Out of the many options that are available, I highly recommend jsoup.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors. — jsoup: Java HTML Parser

So, you need to add jsoup to your project’s dependencies.

If you are a Gradle user, add this dependency to your project’s build file:

implementation "org.jsoup:jsoup:1.13.1"

Otherwise, if you are a Maven user, add the following dependency to your project’s build POM:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
</dependency>

Now, you have all you need to start scraping data in Spring Boot.

2. Defining Your Scraping Logic

Since your scraping logic is based on how your target web page is structured, you must define it according to your goals. Be aware that every time the template of this page is changed, your logic should be updated accordingly.
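One way to limit the impact of template changes is to keep your page-specific identifiers in a single place, so a layout update only requires touching one file. A minimal sketch (the constant names are my own; only "thetable" comes from the target page):

```kotlin
// Centralizing page-specific identifiers makes template changes a one-file fix.
object CovidPageSelectors {
    const val TABLE_ID = "thetable"
    const val ROW_TAG = "tr"
    const val CELL_TAG = "td"
}
```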

The main advantage of defining an API to perform such an operation is that data is scraped on-the-fly. This means that every time the API is called, up-to-date data is always returned.

In this tutorial, I am going to show how to build an API whose goal is to scrape the COVID-19 pandemic by country and territory Wikipedia page. Its purpose is to retrieve statistics on COVID-19 and return them in a human-readable format.

Firstly, you need to create a new connection to your target web page through the [connect](https://jsoup.org/apidocs/org/jsoup/Jsoup.html#connect(java.lang.String)) method. Please note that you might be required to set a valid user agent, a specific set of [headers](https://jsoup.org/apidocs/org/jsoup/Connection.html#headers(java.util.Map)), or [cookies](https://jsoup.org/apidocs/org/jsoup/Connection.html#cookies(java.util.Map)) to prevent your connection from being refused.
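For instance, a connection configured with a custom user agent, an extra header, and cookies might look like the following sketch (the user agent string, header, cookie, and timeout values are placeholders, not requirements of the target page):

```kotlin
import org.jsoup.Jsoup

// Sketch: configuring the jsoup connection before fetching the page.
// All header, cookie, and timeout values below are illustrative.
val document = Jsoup
    .connect("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
    .userAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    .header("Accept-Language", "en-US")
    .cookies(mapOf("session" to "placeholder-value"))
    .timeout(10_000) // milliseconds
    .get()
```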

Secondly, you can call the get() method to fetch and parse the desired HTML file. The result is represented by a Document object, which offers whatever you need to navigate through the DOM to find, extract, and manipulate data. You can get HTML elements either by using DOM traversal or CSS selectors.
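As a quick comparison of the two approaches, both snippets below extract the same cell from an inline HTML fragment (parsed locally, so no network connection is needed):

```kotlin
import org.jsoup.Jsoup

val html = """<table id="thetable"><tbody><tr><td>42</td></tr></tbody></table>"""
val doc = Jsoup.parse(html)

// DOM traversal
val viaDom = doc.getElementById("thetable")
    ?.getElementsByTag("td")
    ?.first()
    ?.text()

// CSS selector
val viaCss = doc.select("#thetable td").first()?.text()

// both print "42"
println(viaDom)
println(viaCss)
```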

This is what your scraping logic will look like:
CovidDataDto.kt

data class CovidDataDto(
    var country: String? = null,
    var cases: Int? = null,
    var deaths: Int? = null,
    var recoveries: Int? = null
)

ScrapeCovidData.kt

import org.jsoup.HttpStatusException
import org.jsoup.Jsoup

fun retrieveCovidData() : List<CovidDataDto> {
    val covidDataList = ArrayList<CovidDataDto>()

    try {
        // retrieving the desired web page
        val webPage = Jsoup
            .connect("https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory")
            .get()

        val tbody = webPage
            .getElementById("thetable")
            .getElementsByTag("tbody")[0]

        val rows = tbody
            .children()
            .drop(2) // dropping the headers

        for (row in rows) {
            val country = row
                .getElementsByTag("a")[0]
                .text()

            val tds = row
                .getElementsByTag("td")

            // skipping the footer
            if (tds.size < 3)
                continue

            val cases = tds[0].text().replace(",", "").toIntOrNull()
            val deaths = tds[1].text().replace(",", "").toIntOrNull()
            val recoveries = tds[2].text().replace(",", "").toIntOrNull()

            covidDataList.add(
                CovidDataDto(
                    country,
                    cases,
                    deaths,
                    recoveries
                )
            )
        }
    } catch (e : HttpStatusException) {
        // an error occurred while connecting to the page

        // logging errors
        // ...

        throw e
    }

    return covidDataList
}

As you can see, CovidDataDto is only a DTO class used to carry data. Keep in mind that when dealing with such APIs, it may be useful to return CSV content directly. Spring Boot allows you to do so as described here.
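As a rough sketch of such a CSV endpoint (the path and manual string-building here are illustrative; a dedicated CSV library would be preferable in production):

```kotlin
@GetMapping("data/csv", produces = ["text/csv"])
fun getCovidDataAsCsv() : ResponseEntity<String> {
    // naive CSV serialization for illustration purposes only
    val csv = buildString {
        appendLine("country,cases,deaths,recoveries")
        for (data in retrieveCovidData()) {
            appendLine("${data.country},${data.cases},${data.deaths},${data.recoveries}")
        }
    }
    return ResponseEntity(csv, HttpStatus.OK)
}
```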

What really matters is the retrieveCovidData method, where the scraping logic lies. Thanks to jsoup, retrieving the desired data by navigating through the DOM of the downloaded web page is straightforward and no further explanation is required.

Based on my experience, consider that while connecting to your target web page and downloading it, many errors may occur. In order to make your code more robust, I strongly recommend adding retry logic, as described here.
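A minimal retry wrapper might look like the following (the attempt count and delay are arbitrary example values; since jsoup's HttpStatusException extends IOException, this also covers HTTP errors):

```kotlin
import java.io.IOException

// Sketch: retry an operation a few times before giving up.
fun <T> withRetries(maxAttempts: Int = 3, delayMillis: Long = 1_000, block: () -> T): T {
    var lastError: IOException? = null
    repeat(maxAttempts) { attempt ->
        try {
            return block()
        } catch (e: IOException) {
            lastError = e
            if (attempt < maxAttempts - 1) Thread.sleep(delayMillis)
        }
    }
    throw lastError!!
}

// usage: val covidData = withRetries { retrieveCovidData() }
```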

3. Putting It All Together

Let’s create a controller and define an API to test out the scraping logic defined above.

CovidDataController.kt

@RestController
@RequestMapping("/covid")
class CovidDataController {
    @GetMapping("data")
    fun getCovidData() : ResponseEntity<List<CovidDataDto>> {
        return ResponseEntity(
            retrieveCovidData(),
            HttpStatus.OK
        )
    }
}

Now, by reaching http://localhost:8080/covid/data (8080 being Spring Boot's default port) you will get the following response:

[
  {
    "country": "United States",
    "cases": 28897871,
    "deaths": 518720
  },
  {
    "country": "India",
    "cases": 11096731,
    "deaths": 157051,
    "recoveries": 10775169
  },
  {
    "country": "Brazil",
    "cases": 10551259,
    "deaths": 255018,
    "recoveries": 9411033
  },
  ...
  {
    "country": "Vanuatu",
    "cases": 1,
    "deaths": 0,
    "recoveries": 1
  }
]

Et voilà! Your API that scrapes COVID-19 data on-the-fly is ready!

Conclusion

In this article, we looked at how to build an API in Spring Boot and Kotlin to scrape data on-the-fly from a specific web page. This is especially useful when dealing with online services that do not offer APIs to access their public data and you need an up-to-date version of it.

Thanks for reading! I hope that you found this article helpful. Feel free to reach out to me with any questions, comments, or suggestions.
