Straight from the reddit
This is another request from Reddit. The first request was about getting games in early access from steamdb. Here’s the link to the write-up. This request is from u/McLambo, who is allegedly using it for a school project. I’m a big fan of roller coasters and this was a fun scrape to do.
I also want to give a shout out to Duane Marden (follow him on twitter, @rcdbdotcom) who develops and maintains Roller Coaster Database. I really like and respect people like him, who have these awesome side projects that provide a lot of cool value to the world just for fun. I tried to be respectful and careful while doing this scrape.
Scrape prep
Our goal with this scrape was to get all North American roller coasters. You can easily adjust the code for any continent or the whole world.
The first thing I did was see if I could find any direct URLs that use query parameters so I could navigate directly to where I wanted to go. I was not disappointed. The search structure was very simple to use. Navigating to https://rcdb.com/r.htm?ol=1&ot=2 displayed the first page of all roller coasters in North America. Query param ol is the region while ot is the type of thing we are searching for (roller coaster, amusement park, elements, etc).
In order to paginate, you simply provide a page query param, like https://rcdb.com/r.htm?page=2&ot=2&ol=1. This makes paginating through very simple. The only kind of tricky part is figuring out when you are done. Luckily the total number of roller coasters is provided at the top, so we just plucked that number, divided it by the number of coasters per page (24), and rounded up. At the time of this writing, there are 3,061 total coasters in North America that have existed according to rcdb.com, or 128 pages of results.
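To make the arithmetic concrete, here is a minimal sketch of the page math and the paginated URL. The helper names pageCount and pageUrl are made up for illustration and are not part of the scraper itself; the page size of 24 and the ol, ot and page params are the ones described above.

```typescript
// Page size observed on rcdb.com result pages at the time of the post.
const COASTERS_PER_PAGE = 24;

// Hypothetical helper: how many result pages are needed for `total` coasters.
function pageCount(total: number, perPage: number = COASTERS_PER_PAGE): number {
  return Math.ceil(total / perPage);
}

// Hypothetical helper: URL of one results page for North American coasters.
function pageUrl(page: number): string {
  const params = new URLSearchParams({ ol: '1', ot: '2', page: String(page) });
  return `https://rcdb.com/r.htm?${params.toString()}`;
}

const lastPage = pageCount(3061);
console.log(lastPage);          // 128
console.log(pageUrl(lastPage)); // https://rcdb.com/r.htm?ol=1&ot=2&page=128
```

3,061 divided by 24 is about 127.5, so rounding up is what picks up the final, partially filled page.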
The code
Our weapons of choice for this scrape are Axios and Cheerio. Axios once again proved to be incredibly reliable and easy to use in order to make the http requests. I used cheerio to parse the html and it did 95% of the work.
The first step I took was to calculate how many pages I’d have so I knew how many total loops I’d need to do. I did this by, like I said earlier, plucking out the total number of coasters and then dividing by the amount per page (24).
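import axios from 'axios';
import * as cheerio from 'cheerio';

// The snippets in this post use await, so they are assumed to run inside an async function.
const domain = 'https://rcdb.com';
const region = 'r.htm?ol=1&ot=2';
// Fetch the first results page and read the total coaster count from it.
const axiosResponse = await axios.get(`${domain}/${region}`);
const $ = cheerio.load(axiosResponse.data);
const totalRollerCoasters = parseInt($('.int').text(), 10);
// Assume 24 per page
const totalPages = Math.ceil(totalRollerCoasters / 24);
console.log('total pages', totalPages);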
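One thing worth guarding against when plucking a count out of a page is a thousands separator. I don't know whether rcdb.com ever renders the total with a comma, so treat this as a defensive sketch under that assumption; parseCount is a made-up helper name.

```typescript
// Hypothetical helper: strips everything except digits before parsing,
// so "3,061" and "3061" both come out as 3061.
function parseCount(text: string): number {
  const digitsOnly = text.replace(/[^0-9]/g, '');
  if (digitsOnly === '') {
    throw new Error(`No digits found in "${text}"`);
  }
  return parseInt(digitsOnly, 10);
}

// parseInt stops at the first non-digit character, which is the gotcha here.
console.log(parseInt('3,061', 10)); // 3
console.log(parseCount('3,061'));   // 3061
```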
Once I had the number of loops I would need (128 in this case) I just started looping. At the start of each loop, I’d navigate to that page, get the number of rows on the page, and then loop through those. I’d pluck the link and send it into my getDetails() function. That function returns an object full of roller coaster details and that is what I pushed into my rollerCoasters array.
const rollerCoasters: any[] = [];
// Pages are 1-based, so the bound is inclusive to make sure the last page is visited.
for (let page = 1; page <= totalPages; page++) {
    const axiosResponsePaginated = await axios.get(`${domain}/${region}&page=${page}`);
    const paginated$ = cheerio.load(axiosResponsePaginated.data);
    const rows = paginated$('#report tbody tr');
    for (let i = 0; i < rows.length; i++) {
        const row$ = cheerio.load(rows[i]);
        // The second cell of each row holds the link to the coaster's detail page.
        const link = row$('td:nth-of-type(2) a').attr('href');
        if (link) {
            const rollerCoaster = await getDetails(`${domain}${link}`);
            console.log('link', link, rollerCoaster);
            rollerCoasters.push(rollerCoaster);
        }
    }
}
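Gluing the domain and the href together with a template string works as long as every href starts with a slash. A slightly more forgiving option is the standard URL constructor, sketched below. The helper name toAbsoluteUrl and the sample path /1.htm are made up for illustration; I'm not claiming that is what rcdb.com's links look like.

```typescript
// Hypothetical helper: resolves an href against the site root, whether the
// href is root-relative, relative, or already absolute.
function toAbsoluteUrl(href: string, base: string = 'https://rcdb.com'): string {
  return new URL(href, base).toString();
}

console.log(toAbsoluteUrl('/1.htm'));                 // https://rcdb.com/1.htm
console.log(toAbsoluteUrl('1.htm'));                  // https://rcdb.com/1.htm
console.log(toAbsoluteUrl('https://rcdb.com/1.htm')); // https://rcdb.com/1.htm
```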
getDetails()
From this point forward it’s just leveraging cheerio to pluck the information. A few items were kind of tricky and those are the ones I want to address here.
Operating information was the first part that was a bit tricky. The html looked something like this if the coaster was still running:
and like this if the coaster was removed:
The tricky part about these two items is that there isn’t any element directly surrounding certain items, like “Removed”, so how do I tell whether a coaster is active or not? The method I took was to just get the text and see if it includes ‘removed’. If it does, I set active to false and then use nth-of-type(1) for the start time and nth-of-type(2) for the ending time. If it didn’t have ‘removed’, I assumed it was still active, flagged it as such, and only collected the start date, which was nth-of-type(1).
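// html() can return null if #feature is missing, so fall back to an empty string.
const featuredHtml = $('#feature').html() || '';
// The operating info is assumed to be the third <br>-separated chunk of #feature.
const operatingInfoHtml$ = cheerio.load(`<div>${featuredHtml.split('<br>')[2] || ''}</div>`);
const operatingMess = operatingInfoHtml$('div').text();
if (operatingMess.toLowerCase().includes('removed')) {
    // Removed coasters carry two <time> elements: opening and closing.
    rollerCoaster.active = false;
    rollerCoaster.started = operatingInfoHtml$('div time:nth-of-type(1)').attr('datetime');
    rollerCoaster.ended = operatingInfoHtml$('div time:nth-of-type(2)').attr('datetime');
}
else {
    // Anything without 'removed' is assumed to still be operating.
    rollerCoaster.active = true;
    rollerCoaster.started = operatingInfoHtml$('div time:nth-of-type(1)').attr('datetime');
}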
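The decision itself is easy to pull out into a pure function, which makes it testable without fetching a page. This is only a sketch: parseOperatingInfo and the OperatingInfo shape are names I made up, and the sample strings are invented rather than copied from rcdb.com.

```typescript
interface OperatingInfo {
  active: boolean;
  started?: string;
  ended?: string;
}

// Hypothetical helper. `text` is the visible text of the operating line and
// `datetimes` holds the datetime attributes of its <time> elements, in order.
function parseOperatingInfo(text: string, datetimes: string[]): OperatingInfo {
  const removed = text.toLowerCase().includes('removed');
  const info: OperatingInfo = { active: !removed, started: datetimes[0] };
  if (removed) {
    // Only removed coasters have a second <time> for the closing date.
    info.ended = datetimes[1];
  }
  return info;
}

// Invented sample inputs, for illustration only.
console.log(parseOperatingInfo('Operating since 1999', ['1999-05-01']));
console.log(parseOperatingInfo('Removed, operated 1980 to 1995', ['1980', '1995']));
```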
I get the rest of the details pretty simply with css selectors and I won’t list them all here. The track features were laid out very simply, so I could just loop through them, set the th as the key of my object and the td as my value. The only exception was the row for elements, which I wanted comma separated.
const rows = $('#statTable tr');
for (let i = 0; i < rows.length; i++) {
    const row$ = cheerio.load(rows[i]);
    if (row$('th').text().toLowerCase() === 'elements') {
        const elements = row$('td a');
        const elementsToPush: any[] = [];
        for (let elementIndex = 0; elementIndex < elements.length; elementIndex++) {
            elementsToPush.push(cheerio.load(elements[elementIndex])('a').text());
        }
        rollerCoaster[row$('th').text().toLowerCase()] = elementsToPush.join(', ');
    }
    else {
        rollerCoaster[row$('th').text().toLowerCase()] = row$('td').text();
    }
}
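Stripped of the cheerio calls, the row handling boils down to the sketch below. The StatRow shape, the statsToRecord name, and the sample rows are all made up for illustration; the point is just the th-as-key, td-as-value rule with the elements exception.

```typescript
// Hypothetical shape for one already-extracted stats row.
interface StatRow {
  header: string;      // text of the <th>
  cellText: string;    // full text of the <td>
  linkTexts: string[]; // text of each <a> inside the <td>
}

// Hypothetical helper: lowercased header becomes the key, cell text the value.
// The "elements" row is the exception and is joined from its link texts.
function statsToRecord(rows: StatRow[]): Record<string, string> {
  const record: Record<string, string> = {};
  for (const row of rows) {
    const key = row.header.trim().toLowerCase();
    record[key] = key === 'elements' ? row.linkTexts.join(', ') : row.cellText.trim();
  }
  return record;
}

// Invented sample rows.
console.log(statsToRecord([
  { header: 'Length', cellText: '1000 ft', linkTexts: [] },
  { header: 'Elements', cellText: 'Chain LiftLoop', linkTexts: ['Chain Lift', 'Loop'] },
]));
```

The second sample row shows why the links are joined separately: taking the text of the whole cell would run the element names together.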
The end
Finally, I use a nice package that converts json to csv (json2csv) and then write it to a file. Nice and easy.
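import * as fs from 'fs';
import * as json2csv from 'json2csv';

// Convert the array of coaster objects to CSV text and write it to disk.
const csv = json2csv.parse(rollerCoasters);
fs.writeFile('rollerCoasters.csv', csv, (err) => {
    if (err) {
        console.log('err while saving file', err);
    }
});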
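If you are curious what a conversion like this has to deal with, here is a small hand-rolled sketch. To be clear, this is not json2csv's implementation or API, just an illustration with made-up helper names (escapeCsvField, toCsv) of two things that matter for this data: the comma-separated elements value has to be quoted, and coasters can have different sets of keys.

```typescript
// Hypothetical helper: quote a field if it contains a comma, quote, or newline.
function escapeCsvField(value: unknown): string {
  const text = value === undefined || value === null ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Hypothetical helper: header row is the union of all keys, in first-seen order.
function toCsv(records: Record<string, unknown>[]): string {
  const headers: string[] = [];
  for (const record of records) {
    for (const key of Object.keys(record)) {
      if (!headers.includes(key)) {
        headers.push(key);
      }
    }
  }
  const lines = [headers.map(escapeCsvField).join(',')];
  for (const record of records) {
    lines.push(headers.map((header) => escapeCsvField(record[header])).join(','));
  }
  return lines.join('\n');
}

// Invented sample records.
console.log(toCsv([
  { name: 'A', elements: 'Loop, Corkscrew' },
  { name: 'B', height: '10' },
]));
// name,elements,height
// A,"Loop, Corkscrew",
// B,,10
```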
The post Jordan Scrapes the Roller Coaster Database appeared first on JavaScript Web Scraping Guy.