Lex Martinez

Posted on Dec 5, 2017 • Edited on Nov 16, 2021

Build a Car Price Scraper-Optimizer Using Puppeteer

#puppeteer #node #webscraping #javascript

Originally published on my blog

Puppeteer is an awesome Node.js library that provides a lot of commands to control a headless (or not) Chromium instance and automate navigation with a few lines of code. In this post we are going to use Puppeteer's superpowers to build a car information scraper for a second-hand car catalog and choose the best option.

A few days ago I was reading, with my teammate and good friend @mafesernaarbole, about web scraping and the different online tools she needed for a personal project. Looking through articles and repositories we found Puppeteer, which is a high-level API to control headless Chrome over the DevTools Protocol. That great tool piqued our interest and, although in the end it wasn't useful for her, we both said "Hell yeah! We have to do something with this!!". A couple of days later I told her Puppeteer would be a great topic for my blog's first article... and here I am. I hope you enjoy it.

Our Case Study

The idea is pretty simple: there is a second-hand car catalog in our country, Colombia, called tucarro.com.co. Given the make and model of a vehicle, tucarro.com.co offers you a list of matching second-hand cars that are for sale across the country. The thing is, the potential customer has to go through those results one by one and work out which is the best choice (or choices).

So our goal is to create a small Node.js app that navigates the catalog website and searches as a human would. Then we take the first page of results and scrape its information (specifically the car's year, kilometers traveled and price... and of course the ad URL). Finally, with that information and some optimization algorithm, we offer the customer the best choice (or choices) based on price and kilometers traveled.

Disclaimer: This exercise is for academic purposes only, with no commercial interest. We don't store any of the extracted data, which belongs to tucarro.com.co. The app's source code is distributed under the MIT license, which exempts us from any responsibility for derivative work. We highly recommend using it responsibly.

Initial Setup

We are about to create a Node.js application, so the first step, of course, is to create a new npm project in a new directory. With the -y parameter, package.json is created with default values:

$ npm init -y

Then add the puppeteer dependency to your project:

$ npm install --save puppeteer

# or, if you prefer Yarn:
$ yarn add puppeteer

Finally, in our package.json file, add the following script:

"scripts": {
  "start": "node index.js"
}

This script simplifies running our app: now we can do it with just the npm start command.

Important: As you'll see soon, Puppeteer's API is promise-based and we use JavaScript's async/await syntax to work with it, so we need a recent version of Node.js that supports that syntax (for this article we use v9.2.0; async/await has been supported since v7.6).
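If you want the app to fail early on an old runtime, a small guard like the one below could go at the top of index.js. This is just a sketch: supportsAsyncAwait is a hypothetical helper name of ours, not something Puppeteer or Node.js provides, and it only compares the major and minor parts of the version string.

```javascript
'use strict'

// Hypothetical helper (not part of the original project): returns true when
// the given Node.js version string is at least 7.6.
function supportsAsyncAwait(version) {
  const parts = version.split('.').map(Number)
  const major = parts[0]
  const minor = parts[1]
  return major > 7 || (major === 7 && minor >= 6)
}

// process.versions.node holds the running version without the leading "v"
if (!supportsAsyncAwait(process.versions.node)) {
  console.error('Node ' + process.versions.node + ' is too old: v7.6 or newer is needed for async/await')
  process.exit(1)
}
```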

Let's Rock

With our npm project successfully configured, the next step is, yes, coding. Let's create our index.js file. Here is the skeleton for our Puppeteer app:

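'use strict'

const puppeteer = require('puppeteer')

async function run() {
  // Launch a Chromium instance and open a new tab
  const browser = await puppeteer.launch()
  const page = await browser.newPage()

  // Wait for the browser (and its process) to close before finishing
  await browser.close()
}

run()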

Basically, we import the puppeteer dependency at the top, then declare an async function that wraps all browser/Puppeteer interactions. Inside it we get a Chromium browser instance and open a new tab (page). In the last lines we close the browser (and its process) and, finally, run the async function.
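One thing the skeleton doesn't cover is what happens when a step in the middle throws: the browser process would stay open. A possible way around it, sketched here under our own assumptions, is to wrap the work in try/finally. The withPage name and the launcher parameter are hypothetical; launcher is simply anything that exposes launch(), such as the object returned by require('puppeteer').

```javascript
'use strict'

// Hypothetical helper (not part of the original project): opens a browser
// and a tab, runs `task` with the tab, and always closes the browser.
async function withPage(launcher, task) {
  const browser = await launcher.launch()
  try {
    const page = await browser.newPage()
    return await task(page)
  } finally {
    // Runs whether task() resolved or threw
    await browser.close()
  }
}

// Usage sketch (assumes puppeteer is installed):
// withPage(require('puppeteer'), page => page.goto('https://www.tucarro.com.co/'))
//   .catch(console.error)
```

Passing the launcher in as an argument is only a convenience so the helper can be tried out with a fake object, without starting a real browser.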

Navigating to our target site

Going to a specific website is a simple task using our tab instance (page). We just need to use the goto method:

 await page.goto('https://www.tucarro.com.co/')

Here is how the site looks in the browser

Searching

Our goal is to find and scrape the first page of results without any kind of filter, that is, for all makes. To do that we just need to interact with the website and click the Buscar button, which we can do with the click method of the page instance.

 await page.waitForSelector('.nav-search-submit')
 await page.click('button[type=submit]');

Note that the first line makes our script wait for a specific element to load. We use it to make sure that the Buscar button has rendered before we click it. The second line just clicks the button, which takes us to the following screen.

The surprise here is that motorcycles were loaded too, so we need to use the category link for cars and trucks, Carros y Camionetas, using of course the same click method, after first checking that the link has rendered.

 await page.waitForSelector('#id_category > dd:nth-child(2) > h3 > a')
 await page.click('#id_category > dd:nth-child(2) > h3 > a');

And there we go, now we have our car results page... let's scrape it!

Note: the selectors used in this section were found by exploring the site's HTML as delivered to the browser and/or using the browser's Copy selector option.
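Since both steps in this section follow the same "wait for the element, then click it" pattern, it could be folded into a tiny helper. This is only a sketch: waitAndClick is a hypothetical name of ours, and page is assumed to expose waitForSelector() and click() the way a Puppeteer page does.

```javascript
'use strict'

// Hypothetical helper (not part of the original project): waits until the
// element matching `selector` is rendered, then clicks that same element.
async function waitAndClick(page, selector) {
  await page.waitForSelector(selector)
  await page.click(selector)
}

// Usage sketch inside run(), with the category selector from this section:
// await waitAndClick(page, '#id_category > dd:nth-child(2) > h3 > a')
```

Using one selector for both calls also guarantees that the element we waited for is the element we click.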

Scrape it!

With our results page we just need to iterate over the DOM nodes and extract the information. Fortunately puppeteer can help us with that too.

await page.waitForSelector('.ch-pagination')
const cars = await page.evaluate(() => {
  // This callback runs inside the page, so the DOM is available here
  const results = Array.from(document.querySelectorAll('li.results-item'))
  return results.map(result => {
    return {
      link: result.querySelector('a').href,
      price: result.querySelector('.ch-price').textContent,
      name: result.querySelector('a').textContent,
      year: result.querySelector('.destaque > strong:nth-child(1)').textContent,
      kms: result.querySelector('.destaque > strong:nth-child(3)').textContent
    }
  })
})

 console.log(cars)

In the script above we use the evaluate method to inspect the results inside the page. Then, with some query selectors, we iterate over the results list and extract the information from each node, producing output like this for each item/car:

{ link: 'https://articulo.tucarro.com.co/MCO-460314674-ford-fusion-2007-_JM',
    price: '$ 23.800.000 ',
    name: ' Ford Fusion V6 Sel At 3000cc',
    year: '2007',
    kms: '102.000 Km' }
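One caveat: querySelector returns null when nothing matches, so an ad that lacks one of those elements would make the whole evaluate call throw. A defensive variant could read each field through a small helper like the sketch below. textOf is a hypothetical name of ours, and because the evaluate callback runs in the browser, the helper would have to be declared inside that callback to be usable there.

```javascript
'use strict'

// Hypothetical helper (not part of the original project): returns the
// trimmed text of the first element matching `selector` inside `node`,
// or null when nothing matches.
function textOf(node, selector) {
  const el = node.querySelector(selector)
  return el ? el.textContent.trim() : null
}

// Usage sketch inside the map callback:
// price: textOf(result, '.ch-price')
```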

Oh yeah! We got the information, and in a JSON-like structure. However, if we want to optimize it we need to normalize the data; after all, calculations are a bit complicated with those Km and $ symbols, aren't they? So we are going to change our results map fragment like this:

  return results.map(result => {
     return {
       link: result.querySelector('a').href,
       price: Number((result.querySelector('.ch-price').textContent).replace(/[^0-9-]+/g,"")),
       name: result.querySelector('a').textContent,
       year: Number(result.querySelector('.destaque > strong:nth-child(1)').textContent),
       kms: Number((result.querySelector('.destaque > strong:nth-child(3)').textContent).replace(/[^0-9-]+/g,""))
     }
   });

Sure enough, regular expressions save the day: we have numbers where we want numbers.
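To see what that regular expression does on its own, here is the same normalization pulled out into a standalone function and applied to the sample values from the output above. The toNumber name is ours, purely for illustration.

```javascript
'use strict'

// Hypothetical helper mirroring the normalization above: drops every
// character that is not a digit or a hyphen, then converts to a Number.
function toNumber(text) {
  return Number(text.replace(/[^0-9-]+/g, ''))
}

console.log(toNumber('$ 23.800.000 ')) // 23800000
console.log(toNumber('102.000 Km'))    // 102000
```

Keep in mind that a text with no digits at all ends up as an empty string, and Number('') is 0, so a missing value silently becomes zero.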

Optimization time!!

At this point we have already had a taste of Puppeteer, which was our main goal for this article. In this last section we are going to use a simple heuristic to get the best car choice based on the scraped data. Basically, we'll create a heuristic function that calculates a score for each vehicle, so we can rate them and choose the best option. For that purpose we consider the following points:

For each variable we assign a weight based on its importance to the potential customer (price gets 4, and year and kms get 3 each).
Since kms and price should be minimized, we use their values as fraction denominators.
To keep the numbers manageable we scale the variables: each price is divided by 1 million, and year and kms by 1 thousand.

This is the final formula. Disclaimer: this is a hypothetical formula, used only to complete this exercise, so it has no mathematical or scientific value in real life.

score = 4 (1/price) + 3 (year) + 3 (1/kms)

And the code snippet with that formula

let car = {score: 0}
for (let i = 0; i < cars.length; i++) {
  // price is scaled to millions, year and kms to thousands
  cars[i].score = (4 * (1 / (cars[i].price / 1000000))) + (3 * (cars[i].year / 1000)) + (3 * (1 / (cars[i].kms / 1000)))
  // Keep the highest-scoring car seen so far
  if (cars[i].score > car.score) {
    car = cars[i]
  }
}
console.log(car)
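To get a feeling for the numbers, the same formula can be run outside Puppeteer on a few sample cars. The sketch below is ours: scoreCar and bestCar are hypothetical names, the first car is the Ford Fusion from the sample output above, and the other two are invented only for illustration. We also assume that an ad with a price or kms of 0 should be skipped, because dividing by zero yields Infinity and such an ad would always win.

```javascript
'use strict'

// Same formula as above: price scaled to millions, year and kms to thousands
function scoreCar(car) {
  const price = car.price / 1000000
  const year = car.year / 1000
  const kms = car.kms / 1000
  return 4 * (1 / price) + 3 * year + 3 * (1 / kms)
}

// Returns a copy of the highest-scoring car, or null if none qualifies.
// Assumption: cars whose price or kms is not greater than 0 are ignored.
function bestCar(cars) {
  let best = null
  for (const car of cars) {
    if (!(car.price > 0 && car.kms > 0)) continue
    const scored = Object.assign({}, car, { score: scoreCar(car) })
    if (best === null || scored.score > best.score) best = scored
  }
  return best
}

const sample = [
  { name: 'Ford Fusion V6 Sel At 3000cc', price: 23800000, year: 2007, kms: 102000 },
  { name: 'Invented car A', price: 30000000, year: 2012, kms: 60000 },
  { name: 'Invented car B (missing kms)', price: 15000000, year: 2010, kms: 0 }
]

console.log(bestCar(sample))
```

For the Ford Fusion the three terms are roughly 0.168, 6.021 and 0.029, so the score is about 6.218. Notice that with this scaling the year term contributes far more than price and kms do, which is one more reason to treat the formula as an exercise only.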

Finally, with Puppeteer we visit the link of the winning ad and take a screenshot:

 await page.goto(car.link)
 await page.waitForSelector('.gallery__thumbnail')
 await page.screenshot({path: 'result.png', fullPage: true});

And that's it!

The Puppeteer API documentation can be found right here!

The complete source code for this exercise can be found in this GitHub repo.

Yeah! The optimization section should be improved with some machine learning technique or optimization algorithm, but that is fabric for another t-shirt.

Thanks for reading! Comments, suggestions and DMs are welcome!

Top comments (3)

Robin Kretzschmar • Dec 6 '17

Hi Lex, thank you for taking the time to explain Puppeteer with an easy-to-understand example!
I have stumbled upon Puppeteer quite a few times now but never gave it a chance. Now I'm considering giving it one for automated testing :)