Jerry Ng

Posted on Mar 21, 2022 • Originally published at jerrynsh.com

How I Scraped Michelin Guide Using Go

#go #programming #showdev #datascience

At the beginning of the automobile era, Michelin, a tire company, created a travel guide, including a restaurant guide. Through the years, Michelin stars have become very prestigious due to their high standards and very strict anonymous testers. Michelin Stars are incredibly coveted. Gaining just one can change a chef's life; losing one, however, can change it as well.

Inspired by this Reddit post, my initial intention was to collect restaurant data from the official Michelin Guide (in CSV file format) so that anyone can map Michelin Guide Restaurants from all around the world on Google My Maps (see an example).

What follows is my thought process on how I collect all restaurant details from the Michelin Guide using Go with Colly. The final dataset is available free to be downloaded here.

Overview

Project goals and planning
How to do no harm to the website
The scraper and code walkthrough

Before we start, I just wanted to point out that this is not a complete tutorial about how to use Colly. Colly is unbelievably easy to use, I’d highly recommend you to go through the official documentation to get started.

Now that that is out of the way, let’s start!

Project Goals

There are 2 main objectives here —

Collect “high-quality” data directly from the official Michelin Guide website
Leave a minimal footprint as possible to the website

So, what does “high-quality” mean? I want anyone to be able to use the data directly without having to perform any form of data munging. Hence, the data collected has to be consistent, accurate, and parsed correctly.

What are we collecting

Prior to starting this web-scraping project, I made sure that there are no existing APIs that provide these data; at least as of the time of writing this.

After scanning through the main page along with a couple of restaurant detail pages, I eventually settled for the following (i.e. as my column headers):

Name
Address
Location
MinPrice
MaxPrice
Currency
Longitude
Latitude
PhoneNumber
Url (Link of the restaurant on guide.michelin.com)
WebsiteUrl (The restaurant's own website)
Award (1 to 3 MICHELIN Stars and Bib Gourmand)

In this scenario, I am leaving out the restaurant description (see "MICHELIN Guide’s Point Of View”) as I don’t find them particularly useful. Having that said, feel free to submit a PR if you’re interested! I’d be more than happy to work with you.

On the other hand, having the restaurants’ address, longitude, and latitude are particularly useful when it comes to mapping them out on maps.

Here’s an example of our restaurant model:

// model.go

type Restaurant struct {
    Name        string
    Address     string
    Location    string
    MinPrice    string
    MaxPrice    string
    Currency    string
    Cuisine     string
    Longitude   string
    Latitude    string
    PhoneNumber string
    Url         string
    WebsiteUrl  string
    Award       string
}

Estimations

Let’s do a quick estimation of the scraper. Firstly, what is the total number of restaurants that are expected to be present in our dataset?

Looking at the website’s data, there should be a total of 6,502 restaurants (rows).

With each page containing 20 restaurants, our scraper will be visiting about ~325 pages; the last page of each category might not contain 20 restaurants.

Right tool, for the right job

Today, there is a handful of tools, frameworks, and libraries out there for web scraping or data extraction. Heck, there’s even a tonne of Web Scraping SaaS (e.g. Octoparse) in the market that requires no code at all.

Personally, I prefer to build my own scraper due to flexibility reasons. On top of that, using a SaaS often comes with a price along with its second (often unspoken) cost — its own learning curve!

Developer Tools (DevTool)

Part of the process of selecting the right library or frameworks for web scraping was to perform DevTooling on the pages.

The first step that I often take after opening up the DevTool was to immediately disable JavaScript and do a quick refresh of the page. This helps me to quickly identify how content is being rendered on the website.

Open Chrome DevTool → Cmd/Ctrl + Shift + P → Disable JavaScript

Generally speaking, there are 2 main distinctions of how content is being generated/rendered on a website:

Server-side rendering
JavaScript rendering (i.e. dynamically-loaded content)

Easy for us, the Michelin Guide website content is loaded using server-side rendering.

What if the site is rendered using JavaScript

Sidetrack for a moment — what if the site content is rendered using JavaScript? Then, we won't be able to scrape the desired data directly. Instead, we would need to check the ‘Network’ tab to see if it’s making any HTTP API calls to retrieve the content data.

Otherwise, we would need to use a JavaScript rendering (headless) browser such as Splash or Selenium to scrape the content.

Go Colly vs. Scrapy vs. Selenium

My initial thought was to use Scrapy — a feature-rich and extensible web scraping framework with Python. However, using Scrapy in this scenario seems like overkill to me goal was rather simple and does not require any complex features such as using a handling JavaScript rendering, middlewares, data pipelines, etc.

With this in mind, I decided to use Colly, a fast and elegant web scraping framework for Golang due to its simplicity and the great developer experience it provides.

Lastly, I’m not a fan of web scraping tools such as Selenium or Puppeteer due to their relative “chunkiness” and speed. Though, they are a lifesaver when it comes to scenarios where you need to scrape JavaScript rendered websites that do not fetch data through HTTP API.

Minimizing Footprint

The first rule of web scraping — do no harm to the website. I’ll highly recommend you to read these scraping tips provided by Colly. Essentially, these tips are pretty much tool agnostic.

Cache your responses, always

During development, it’s often inevitable to retry requests. Colly provides us the capability to cache our responses with ease. With caching, we can:

Greatly reduce the load to the website
Have a much better development experience as retrying with cache is way faster

// app/app.go

// ...

cacheDir := filepath.Join(cachePath)

c := colly.NewCollector(
    colly.CacheDir(cacheDir),
    colly.AllowedDomains(allowedDomain),
)

Add delays between requests

When traversing through multiple pages (~325 in our case), it’s always a good idea to add delay in between requests. This allows the website to process our requests without being overloaded; we want to absolutely avoid causing any form of disruption to the site.

// app/app.go

// ...

c.Limit(&colly.LimitRule{
    Delay:       2 * time.Second,
    RandomDelay: 2 * time.Second,
})

Adding delays could also help to mitigate anti-scraping measures such as IP banning.

The Scraper

In this section, I’ll run through only the important parts (and considerations) of the scraper code.

Selectors

Personally, I prefer to use XPath to query elements of an HTML page in order to extract data. If you’re into web scraping, I’d highly recommend you to learn XPath; it will make your life a lot easier. Here’s my favorite cheat sheet for using XPath.

To avoid the cluttering of long ugly XPath within our main application code, I often like to put them into a separate file. You could of course make use of CSS selectors instead.

Entry point

To start building our scraper application, we start by identifying our entry point, i.e. the starting URLs. In our case, I’ve chosen the all-restaurants main page” (filtered by the type of Award/Distinction) as the starting URLs.

// app/const.go

// ...

type startUrl struct {
    Award string
    Url   string
}

var urls = []startUrl{
    {"3 MICHELIN Stars", "https://guide.michelin.com/en/restaurants/3-stars-michelin/"},
    {"2 MICHELIN Stars", "https://guide.michelin.com/en/restaurants/2-stars-michelin/"},
    {"1 MICHELIN Star", "https://guide.michelin.com/en/restaurants/1-star-michelin/"},
    {"Bib Gourmand", "https://guide.michelin.com/en/restaurants/bib-gourmand"},
}

Why not simply start from guide.michelin.com/en/restaurants?

By intentionally stating my starting URLs based on the types of Michelin Award, I would not need to extract the Michelin Award of the restaurant from the HTML. Rather, I could just fill the Award column directly based on my starting URL; one less XPath to maintain (yay)!

Collectors

Our scraper application consists of 2 collectors —

One (collector) to parse information such as location, longitude, and latitude from the main (starting) page
Another (detailCollector) to collect details such as an address, price, phone number, etc. from each individual restaurant. Also writes data in rows into our output CSV file.

How to pass context across Colly collectors

As we are only writing to our CSV file at detailCollector level, we will need to pass our extracted data from collector to detailCollector. Here’s how I did it:

// app/const.go

// ...

app.collector.OnXML(restaurantXPath, func(e *colly.XMLElement) {
    url := e.Request.AbsoluteURL(e.ChildAttr(restaurantDetailUrlXPath, "href"))

    location := e.ChildText(restaurantLocationXPath)
    longitude := e.ChildAttr(restaurantXPath, "data-lng")
    latitude := e.ChildAttr(restaurantXPath, "data-lat")

    e.Request.Ctx.Put("location", location)
    e.Request.Ctx.Put("longitude", longitude)
    e.Request.Ctx.Put("latitude", latitude)

    app.detailCollector.Request(e.Request.Method, url, nil, e.Request.Ctx, nil)
})

With this, the location, longitude, and latitude information can be passed down to our detailCollector via Context (reference).

Parsers

I’ve written a couple of utility parsers to extract specific information from the extracted raw strings. As they are rather straightforward, I will not go through them.

Finally, our entire scraper app looks like this:

package app

import (
    "encoding/csv"
    "net/http"

    "os"
    "path/filepath"
    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/extensions"
    "github.com/ngshiheng/michelin-my-maps/model"
    "github.com/ngshiheng/michelin-my-maps/util/logger"
    "github.com/ngshiheng/michelin-my-maps/util/parser"

    log "github.com/sirupsen/logrus"
)

type App struct {
    collector       *colly.Collector
    detailCollector *colly.Collector
    writer          *csv.Writer
    file            *os.File
    startUrls       []startUrl
}

func New() *App {
    // Initialize csv file and writer
    file, err := os.Create(filepath.Join(outputPath, outputFileName))
    if err != nil {
        log.WithFields(log.Fields{"file": file}).Fatal("cannot create file")
    }

    writer := csv.NewWriter(file)

    csvHeader := model.GenerateFieldNameSlice(model.Restaurant{})
    if err := writer.Write(csvHeader); err != nil {
        log.WithFields(log.Fields{
            "file":      file,
            "csvHeader": csvHeader,
        }).Fatal("cannot write header to file")
    }

    // Initialize colly collectors
    cacheDir := filepath.Join(cachePath)

    c := colly.NewCollector(
        colly.CacheDir(cacheDir),
        colly.AllowedDomains(allowedDomain),
    )

    c.Limit(&colly.LimitRule{
        Parallelism: parallelism,
        Delay:       delay,
        RandomDelay: randomDelay,
    })

    extensions.RandomUserAgent(c)
    extensions.Referer(c)

    dc := c.Clone()

    return &App{
        c,
        dc,
        writer,
        file,
        urls,
    }
}

// Crawl Michelin Guide Restaurants information from app.startUrls
func (app *App) Crawl() {
    defer logger.TimeTrack(time.Now(), "crawl")
    defer app.file.Close()
    defer app.writer.Flush()

    app.collector.OnResponse(func(r *colly.Response) {
        log.Info("visited ", r.Request.URL)
        r.Request.Visit(r.Ctx.Get("url"))
    })

    app.collector.OnScraped(func(r *colly.Response) {
        log.Info("finished ", r.Request.URL)
    })

    // Extract url of each restaurant from the main page and visit them
    app.collector.OnXML(restaurantXPath, func(e *colly.XMLElement) {
        url := e.Request.AbsoluteURL(e.ChildAttr(restaurantDetailUrlXPath, "href"))

        location := e.ChildText(restaurantLocationXPath)
        longitude := e.ChildAttr(restaurantXPath, "data-lng")
        latitude := e.ChildAttr(restaurantXPath, "data-lat")

        e.Request.Ctx.Put("location", location)
        e.Request.Ctx.Put("longitude", longitude)
        e.Request.Ctx.Put("latitude", latitude)

        app.detailCollector.Request(e.Request.Method, url, nil, e.Request.Ctx, nil)
    })

    // Extract and visit next page links
    app.collector.OnXML(nextPageArrowButtonXPath, func(e *colly.XMLElement) {
        e.Request.Visit(e.Attr("href"))
    })

    // Extract details of each restaurant and write to csv file
    app.detailCollector.OnXML(restaurantDetailXPath, func(e *colly.XMLElement) {
        url := e.Request.URL.String()
        websiteUrl := e.ChildAttr(restarauntWebsiteUrlXPath, "href")

        name := e.ChildText(restaurantNameXPath)

        address := e.ChildText(restaurantAddressXPath)

        priceAndCuisine := e.ChildText(restaurantpriceAndCuisineXPath)
        price, cuisine := parser.SplitUnpack(priceAndCuisine, "•")
        price = parser.TrimWhiteSpaces(price)

        minPrice, maxPrice, currency := parser.ParsePrice(price)

        phoneNumber := e.ChildText(restarauntPhoneNumberXPath)
        formattedPhoneNumber := parser.ParsePhoneNumber(phoneNumber)

        restaurant := model.Restaurant{
            Name:        name,
            Address:     address,
            Location:    e.Request.Ctx.Get("location"),
            MinPrice:    minPrice,
            MaxPrice:    maxPrice,
            Currency:    currency,
            Cuisine:     cuisine,
            Longitude:   e.Request.Ctx.Get("longitude"),
            Latitude:    e.Request.Ctx.Get("latitude"),
            PhoneNumber: formattedPhoneNumber,
            Url:         url,
            WebsiteUrl:  websiteUrl,
            Award:       e.Request.Ctx.Get("award"),
        }

        log.Debug(restaurant)

        if err := app.writer.Write(model.GenerateFieldValueSlice(restaurant)); err != nil {
            log.Fatalf("cannot write data %q: %s\n", restaurant, err)
        }
    })

    // Start scraping
    for _, url := range app.startUrls {
        ctx := colly.NewContext()
        ctx.Put("award", url.Award)
        app.collector.Request(http.MethodGet, url.Url, nil, ctx, nil)
    }

    // Wait until threads are finished
    app.collector.Wait()
    app.detailCollector.Wait()
}

Feel free to check out the full source code here.

Closing Remark

Initially, I wanted to map every single Michelin awarded restaurant on Google My Maps via its API. Unfortunately, not only does My Maps not have any API, it only allows up to 2,000 data points. Consequently, to build a map, you will have to manually import our CSV on My Maps.

As a foodie myself, the project was incredibly fun to build. What was more rewarding for me was seeing people making good use of the dataset on Kaggle.

If you happen to map out the restaurants or performed any form of data analytic work with the dataset, feel free to share it with me! Before we end, if you have any questions at all, feel free to reach out.

That’s all for today, thank you for reading!

This article was originally published on jerrynsh.com

The Essential Toolkit for Front-end Developers

Take a user-centric approach to front-end monitoring that evolves alongside increasingly complex frameworks and single-page applications.

Get The Kit

Top comments (3)

Jerry Ng • Mar 22 '22

Exactly! I'm glad i'm not the only one who thinks of it this way

Jacob Kim • Mar 21 '22

Cool article! I was trying to learn how to use Colly, and this post gave me some ideas. I'm a foodie as well so I might make something similar for my project :)