Welcome back to my tutorial about web scraping with Nuxt 3. This is the second part of a four-part series:
- Introduction and setting up
- Backend scraping service in Nuxt
- Displaying the results on frontend
- Going live on Netlify
In this part you will learn how to extract data from a website and return the result from a backend service using Nuxt 3.
Backend in Nuxt
That’s another cool thing about Nuxt – you don’t need any special configuration to enable its server-side part. It is there, ready for you. Just create a TypeScript file /server/api/hello.ts
exporting the following method:
export default defineEventHandler(() => {
  return 'hello'
})
And it will become available at localhost:3000/api/hello
. When you hit this URL in your browser, you’ll see a “hello” text printed out. With a slight modification, we can start returning a JSON object instead of plain text:
export default defineEventHandler(() => {
  return {
    hello: 'world'
  }
})
Congrats! You have learned how to create backend services in Nuxt! Of course, there is much more to it, but as you can see, the basic concept is ridiculously easy to implement. It is worth noting that server-side code runs in the context of the runtime server and not inside the client’s browser. Thanks to the nature of the architecture, you even have access to the Node.js req
and res
objects plus some other features coming from the underlying h3 HTTP framework. One important aspect of this is that “what happens on the server, stays on the server”: unless you expose the data through the returned object, it is sealed away from the grim and dark world of the public internet. But that’s another topic, worth its own article one day in the future. We still have a scraper to write.
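To give you a taste, here is a minimal sketch of how a handler can inspect the incoming request through h3’s helpers (the name query parameter is made up for illustration):
export default defineEventHandler((event) => {
  // getQuery is an h3 utility, auto-imported in Nuxt server routes
  const query = getQuery(event) // e.g. /api/hello?name=Nuxt
  return `hello ${query.name ?? 'world'}`
})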
Fetching the data from a remote source
We now know how to build a service that will serve “the number of the last JEP available” when someone opens it, so we just need to do the actual job and scrape that piece of information from the official website.
I decided to name my new Nuxt API service “peek.ts”, because it will be peeking into the JEP Index.
First, let’s get the data so we can start digging into it. Since life in Nuxt is (usually) easy, this whole operation is nothing but a one-liner:
const jepHTMLData = await $fetch<string>('https://openjdk.org/jeps/0')
What we are doing here is utilizing the $fetch
function that ships naturally with Nuxt, and in this most basic use case it does what the name implies – it fetches the data from the URL given as a parameter. Since $fetch
is declared as an asynchronous operation, we await its result before moving further. If we didn’t, and let the JavaScript code continue, the next line would probably execute way before $fetch
finished its HTTP request, and there would be no data to work with!
The last aspect - <string>
- is a cool TS-related feature that allows us to tell the TypeScript engine we will be getting string data back from the call. Thanks to that, TypeScript knows jepHTMLData
will contain string data. Provided you have your ESLint properly configured, you’d start getting red squiggly lines every time you try to treat it as something other than a JavaScript string. This can warn you in advance about potential bugs that would otherwise appear only once you start your application.
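One more thing worth knowing: $fetch throws when the request fails, so if you want the service to degrade gracefully, you can wrap the call. A minimal sketch, assuming we just want to log the problem and carry on with an empty string:
let jepHTMLData = ''
try {
  jepHTMLData = await $fetch<string>('https://openjdk.org/jeps/0')
} catch (err) {
  // $fetch throws on network failures and non-2xx responses
  console.error('failed to fetch the JEP Index', err)
}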
Ok. So right now we have jepHTMLData
, a loooong string containing the whole source code from the initial <!DOCTYPE
to the final </html>
tag. What’s next?
DOM-parsing the data
Now we can continue the dumb way or the smart way. The dumb way would involve a lot of .substring
and .indexOf
calls while we tried to extract the desired data from the HTML string manually. To be honest, maybe it would have spared me some headaches, because I actually spent most of the time on this project trying to choose the right npm-based DOM parsing tool. But it was worth it.
As you may have already guessed, the more mature approach to such a task is to leverage the powers of some seasoned HTML (or XML in general) parser, feed it our HTML string, and then navigate the DOM model the tool automatically builds for us. The one downside of having a service running on the server is that we cannot just use the DOMParser
that ships built-in with browsers. And Node.js doesn’t have a native equivalent. So let the search for a suitable npm package commence.
Surprisingly to me, there doesn’t seem to be any “no-brainer” JS library for this case. I gave up on two candidates because of their hostile and undocumented nature, which left me unable to understand them enough to actually start using them, before I dug into node-html-parser
, which then became my champion.
The problem with DOM parsers is that they inflate the formerly concise HTML markup into a wild structure of objects nested in each other, so you immediately lose track of what’s going on. Without proper documentation it becomes really hard to traverse the DOM. Fortunately, node-html-parser
offers mimics of the document’s native functions querySelector
and querySelectorAll
, which work exactly like their browser counterparts – you feed them a CSS selector and they yield the matching HTMLElements. So let’s use them and finish our quest for the last JEP number.
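To see those mimics in action before unleashing them on the real page, here is a tiny self-contained sketch using a made-up snippet instead of the actual JEP Index markup:
import { parse } from 'node-html-parser'

// parse a small fabricated table and query it like a browser DOM
const demo = parse('<table class="jeps"><tr><td class="jep">123</td></tr></table>')
console.log(demo.querySelectorAll('.jeps').length) // 1
console.log(demo.querySelector('.jep')?.text) // "123"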
Now we need to step aside and analyze the source page to figure out what we are looking for. It used to be easier, but the page structure changed during spring 2024. Now the data on the JEP Index is organized into a couple of HTML tables. We will be interested in In-flight JEPs, a list of JEPs that are currently being processed, and Submitted JEPs, which are “drafts” that have already passed the entry level of discussion and gained the “Submitted” status. The latter are identified by a somewhat cryptic number corresponding to the issue number from the JDK Bug System Jira, where they first emerge when someone starts dreaming of implementing them. Apart from the last JEP number, I decided to broaden the task and get the last draft number as well, while we are at it.
Checking the source code, we can notice that the <table>
elements with the data we seek have the CSS class .jeps
. Thanks to the conservative nature of the JEP Index page, the inner structure of those tables is also pretty straightforward, and it is not overwhelmed with a swarm of chaotic divs, unlike many modern sites created via UI frameworks. Once we grab the tables, we can navigate through their rows and read the value from a child <td>
, which is even marked with the special class .jep
(as if the creators foresaw us trying to dig for it). Formerly, the rows were ordered so the most recent numbers came last. Now the In-flight JEPs table has the latest value first, while Submitted JEPs doesn't seem to be ordered at all. That makes the task a little harder, but we'll come through it.
Back to our code. It all starts with parsing the input via node-html-parser's simple parse
method:
import { parse } from 'node-html-parser'

const jepPage = parse(jepHTMLData)
This turns the raw string fetched from the remote source into a root HTMLElement
with all the content nested inside as childNodes
. We can easily apply the selector methods on it to delve deeper into the virtual DOM tree.
Let's start with reading the "In-flight JEPs" table:
// entries are listed in a couple of <table class="jeps"> elements
const tables = jepPage.querySelectorAll('.jeps')
// first table contains "Process JEPs" => skip
// second table contains "In-flight JEPs"
// - those already have a proper JEP number,
//   so the "latest JEP" will be found here
const jepTable = tables[1]
Now let’s get all the table rows and jump right to the first one:
// entries are ordered "newest first" - data will be in first <tr>
const latestJEPRow = jepTable?.querySelectorAll('tr')?.at(0)
Notice the optional chaining syntax (the “?” after each part of the expression). In case you don’t already know it, this is a great way of avoiding runtime errors when trying to access attributes or methods on an undefined variable. Written like this, JavaScript evaluates the whole expression as undefined
without triggering an error in case any part of the chain is undefined
. Normally this won’t happen, but what if there were a problem on the OpenJDK website and our scraper got a foul page with no tables, or tables with no rows? Better safe than sorry…
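A contrived illustration of the behavior:
// any undefined link makes the whole chain evaluate to undefined
// instead of throwing a TypeError
type MaybeTable = { rows?: string[] } | undefined
const table: MaybeTable = undefined
const firstRow = table?.rows?.at(0)
console.log(firstRow) // undefined, no crash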
Alright, folks, now we are really getting close. We’ve got the target row; now let’s filter out the target cell:
// the target number resides inside the (only) <td class="jep">
// element nested among the <tr>'s children
const latestJEPCell = latestJEPRow?.querySelector('.jep')
The final step is but a piece of cake:
// finally the value is a sole TextNode inside <td> we just grabbed
// and we can access it directly via .text attribute
const latestJEPNo = latestJEPCell?.text
Getting the last draft number is a bit harder, because the table is unordered. But since the draft numbers are based on Jira tickets, we can tell the values keep increasing. So we just need to go through all the rows and track the highest number:
// third table contains "Submitted JEPs"
// - they passed the first level and are likely to be worked on,
//   so we use this one
const draftTable = tables[2]
// entries are not clearly ordered
// - we need to find the correct row by going through all of them
let lastDraftNo = -1
const draftTableRows = draftTable?.querySelectorAll('tr')
draftTableRows?.forEach((row) => {
  // get the current value
  const draftNoCell = row?.querySelector('.jep')
  const draftNo = draftNoCell?.text
  // if it is higher than the previous max, keep it
  if (draftNo && parseInt(draftNo) > lastDraftNo) {
    lastDraftNo = parseInt(draftNo)
  }
})
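As a side note, the same scan can be expressed with array helpers, since querySelectorAll returns a plain array. This sketch is behaviorally equivalent – purely a matter of taste:
// functional variant of the "track the maximum" loop above
const lastDraftNo = (draftTable?.querySelectorAll('tr') ?? [])
  .map((row) => parseInt(row.querySelector('.jep')?.text ?? ''))
  .filter((no) => !Number.isNaN(no))
  .reduce((max, no) => Math.max(max, no), -1)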
Returning results
Yay! We have the data we came for. Now the last step is to return them. First, let’s do it in a plain way:
return {
  latestJEPNo,
  lastDraftNo
}
When we now call http://localhost:3000/api/peek, we should see a JSON object with both numbers filled in (values valid at November 12th, 2023).
This could be it, but as a proponent of TypeScript, I suggest we go one step further and introduce a type to conclusively describe the data we will return.
I defined the following type:
export type JEPData = {
  fetched: Date,
  lastJEP: number,
  lastDraft: number,
}
Apart from the numbers of the last JEP and the last JEP Draft (and notice that I really make them numbers, not strings), I also decided to add the date of fetching. Maybe some of our callers will find it handy: they could store our response and only re-fetch the data after a certain period has passed, or they could start building a time series. Who knows?
Then we only need to modify our return clause a little bit:
return {
  fetched: new Date(),
  lastJEP: parseInt(latestJEPNo || '-1'), // extracted as a string
  lastDraft: lastDraftNo // already a number
}
Because I deliberately changed the format from a string (extracted from the HTML data) to a number (more logical when we want to talk about numbers), we must deal with the parsing. Since there is a non-zero chance that the values scraped from the website will be undefined, we must provide a default value. TypeScript knows that and would complain if you tried to omit it.
A default of -1
seems reasonable because it also indicates that something went wrong during processing. It would be a good idea, though, to document such behavior of your service. There are also other options, like not returning anything at all, or enhancing the response type with a status
message and explaining the potential problem there. The choice is yours.
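For illustration, that status-enhanced variant could look something like this – purely a sketch, not what our service actually returns:
export type JEPData = {
  fetched: Date,
  lastJEP: number,
  lastDraft: number,
  status: 'ok' | 'error',
  statusMessage?: string,
}
And to wrap things up, here is a condensed sketch of the whole peek.ts with all the pieces from this article combined. The import path for JEPData is an assumption – you could just as well declare the type in the same file:
// server/api/peek.ts - condensed sketch of the full service
import { parse } from 'node-html-parser'
import type { JEPData } from '~/types/jep' // hypothetical location of the type

export default defineEventHandler(async (): Promise<JEPData> => {
  const jepHTMLData = await $fetch<string>('https://openjdk.org/jeps/0')
  const jepPage = parse(jepHTMLData)
  const tables = jepPage.querySelectorAll('.jeps')
  // In-flight JEPs: ordered newest first, grab the first row's number
  const latestJEPNo = tables[1]?.querySelectorAll('tr')?.at(0)?.querySelector('.jep')?.text
  // Submitted JEPs: not clearly ordered, track the maximum
  let lastDraftNo = -1
  tables[2]?.querySelectorAll('tr')?.forEach((row) => {
    const no = parseInt(row.querySelector('.jep')?.text ?? '')
    if (!Number.isNaN(no) && no > lastDraftNo) lastDraftNo = no
  })
  return {
    fetched: new Date(),
    lastJEP: parseInt(latestJEPNo || '-1'),
    lastDraft: lastDraftNo,
  }
})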
Conclusion
We managed to build our first Nuxt backend service and saw how to handle web scraping smoothly in it. Our application now returns data, and we could even stop here. But to explore the Nuxt frontend a little as well, we will work on Displaying the results on frontend in the next tutorial.
Again, if you didn't understand something mentioned in this article, or have more questions to ask or problems to solve, reach out to me in the comments.