DEV Community

Div Rhino

How to build a web scraper with Go and Colly

Originally posted on divrhino.com

Some websites don't offer an API. In those cases, you can write a small web scraper to get the data you need. In this tutorial, we're going to build a web scraper with Go and the Colly package, and save the scraped data to a JSON file. Colly allows us to crawl pages, scrape their content and traverse the DOM.

Prerequisites

To follow along, you will need to have Go installed.

Setting up project directory

Let's get started. First, change into the directory where your projects are stored. In my case this is the "Sites" folder, but it may be different for you. Here we will create our project folder, called rhino-scraper.

cd Sites
mkdir rhino-scraper
cd rhino-scraper

In our rhino-scraper project folder, we'll create our main.go file. This will be the entry point of our app.

touch main.go

Initialising go modules

We will be using Go modules to handle dependencies in our project.

Running the following command will create a go.mod file.

go mod init example.com/rhino-scraper

We're going to be using the colly package to build our web scraper, so let's install that now by running:

go get github.com/gocolly/colly

You will notice that running the above command created a go.sum file. This file holds the checksums and versions of our direct and indirect dependencies. Go uses it to validate each dependency and confirm that none of them have been modified.

In the main.go file we created earlier, let's set up a basic package main and func main().

package main

func main() {}

Analysing the target page structure

For this tutorial we will be scraping some rhino facts from FactRetriever.com.

Below is a screenshot taken from the target page. We can see that each fact has a simple structure consisting of an id and a description.

[Screenshot: structure of a fact on the target page]

Creating the fact struct

In our main.go file, we can write a Fact struct type to represent the structure of a rhino fact. A fact will have:

  • an ID that will be of type int, and
  • a description that will be of type string.

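The Fact struct type, the ID field and the Description field are all capitalised, which exports them. The fields need to be exported so that the encoding/json package can read them when we save our facts to a JSON file later. The struct tags, such as `json:"id"`, set the key names used in the JSON output.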

package main

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {}
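
To see what the struct tags do, here is a minimal sketch that marshals a single Fact and prints the result. The encodeFact helper and the placeholder description are assumptions made for illustration; they are not part of the scraper we are building.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Fact mirrors the struct from the tutorial.
type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

// encodeFact is a hypothetical helper that returns the compact JSON
// encoding of a single fact.
func encodeFact(f Fact) (string, error) {
	b, err := json.Marshal(f)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// The description is placeholder text, not a scraped fact.
	out, err := encodeFact(Fact{ID: 1, Description: "placeholder description"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(out) // {"id":1,"description":"placeholder description"}
}
```

The keys in the output are lowercase because of the tags. Without them, encoding/json would use the field names ID and Description as the keys.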

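Inside of func main, we will create an empty slice to hold our facts. We will initialise it with length zero and append to it as we go. This slice will only be able to hold Facts. Note that Go refuses to compile a program with an unused variable or import, so the next few snapshots will not build on their own until we start using allFacts and colly.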

package main

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)
}
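
One reason to prefer make([]Fact, 0) over a bare declaration shows up when we encode the slice as JSON. The sketch below compares the two; the encode helper and the placeholder description are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Fact mirrors the struct from the tutorial.
type Fact struct {
	ID          int    `json:"id"`
	Description string `json:"description"`
}

// encode is a hypothetical helper that returns the JSON encoding of a
// slice of facts as a string.
func encode(facts []Fact) string {
	b, err := json.Marshal(facts)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	var nilFacts []Fact           // declared but never initialised
	emptyFacts := make([]Fact, 0) // initialised with length zero, as in the tutorial

	fmt.Println(encode(nilFacts))   // null
	fmt.Println(encode(emptyFacts)) // []

	// append grows the slice; the description is placeholder text.
	emptyFacts = append(emptyFacts, Fact{ID: 1, Description: "placeholder"})
	fmt.Println(len(emptyFacts)) // 1
}
```

If the scraper finds no facts, the initialised slice still produces an empty JSON array rather than null, which is usually easier for other programs to consume.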

Using the Colly package

We will be importing a package called colly to provide us with the methods and functionality we'll need to build our web scraper.

package main

import "github.com/gocolly/colly"

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)
}

Using the colly package, let's create a new collector and set its allowed domains to factretriever.com and www.factretriever.com.

package main

import "github.com/gocolly/colly"

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )
}
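
To get a feel for what an allow-list of domains does, here is a rough sketch using only the standard library. The isAllowed helper is hypothetical and is not how colly implements AllowedDomains; it only illustrates the idea of comparing a URL's host against a fixed list, which is why we list the domain both with and without www.

```go
package main

import (
	"fmt"
	"net/url"
)

// isAllowed is an illustrative helper, not part of colly. It reports
// whether the host of rawURL exactly matches one of the allowed domains.
func isAllowed(rawURL string, allowed []string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	host := u.Hostname()
	for _, domain := range allowed {
		if host == domain {
			return true
		}
	}
	return false
}

func main() {
	allowed := []string{"factretriever.com", "www.factretriever.com"}

	fmt.Println(isAllowed("https://www.factretriever.com/rhino-facts", allowed)) // true
	fmt.Println(isAllowed("https://example.com/", allowed))                      // false
}
```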

HTML structure of a list of facts

If we inspect the HTML structure, we will see that the facts are list items inside an unordered list that has the class of factsList. Each fact list item has been assigned an id. We will use this id later.

[Screenshot: HTML structure of the facts list]

Now that we know what the HTML structure is like, we can write some code to traverse the DOM. The colly package makes use of a library called goquery to interact with the DOM. goquery is like jQuery, but for Go.

Below is the code so far. We will go over the new lines step by step.

package main

import (
    "fmt"
    "log"
    "strconv"

    "github.com/gocolly/colly"
)

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )

    collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
        factId, err := strconv.Atoi(element.Attr("id"))
        if err != nil {
            log.Println("Could not get id")
        }

        factDesc := element.Text

        fact := Fact{
            ID:          factId,
            Description: factDesc,
        }

        allFacts = append(allFacts, fact)
    })
}

So, here's what's happening:

  • We import the fmt, log and strconv packages. fmt is not used until the next step, and Go refuses to compile a file with an unused import, so this snapshot will not build until then
  • We are using the OnHTML method. It takes two arguments. The first argument is a target selector and the second argument is a callback function that is called every time an element matching the selector is encountered
  • In the body of the callback, we read the id attribute of the matched element with element.Attr("id")
  • The id attribute is of type string, so we use strconv.Atoi to convert it to type int
  • The strconv.Atoi function returns an error as its second return value, so we do some basic error handling. If the conversion fails, we log a message and factId keeps its zero value of 0
  • We create a variable called factDesc to store the description text of each fact. Based on the Fact struct type we established earlier, we are expecting the fact description to be of type string
  • We create a new Fact struct for every list item we iterate over
  • Then we append the Fact struct to the allFacts slice
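
The string-to-int conversion can be tried on its own, without a scraper. Below is a small sketch of one possible way to handle it; the parseFactID helper and the sample inputs are assumptions for illustration. Unlike the callback above, this version skips values that are not valid numbers instead of storing them with an ID of 0.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseFactID is a hypothetical helper that converts the value of an
// id attribute to an int and wraps any conversion error with context.
func parseFactID(raw string) (int, error) {
	id, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid fact id %q: %w", raw, err)
	}
	return id, nil
}

func main() {
	// Sample attribute values, made up for this example.
	for _, raw := range []string{"12", "", "fact-3"} {
		id, err := parseFactID(raw)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Println("parsed id", id)
	}
}
```

Whether to skip such items or keep them is a design choice that depends on how clean the target page's markup is.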

Begin crawling and scraping

We want to have some visual feedback to let us know that our scraper is actually visiting the page. Let's do that now.

package main

import (
    "fmt"
    "log"
    "strconv"

    "github.com/gocolly/colly"
)

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )

    collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
        factId, err := strconv.Atoi(element.Attr("id"))
        if err != nil {
            log.Println("Could not get id")
        }

        factDesc := element.Text

        fact := Fact{
            ID:          factId,
            Description: factDesc,
        }

        allFacts = append(allFacts, fact)
    })

    collector.OnRequest(func(request *colly.Request) {
        fmt.Println("Visiting", request.URL.String())
    })

    collector.Visit("https://www.factretriever.com/rhino-facts")
}

Here's what's happening:

  • We use fmt.Println inside the OnRequest callback to output a "Visiting" message whenever we request a URL
  • We use the Visit() method to give our program a starting point

Now let's run our program in the terminal with the following command:

go run main.go

The output tells us that our collector visited the rhino facts page on FactRetriever.com.

Saving our data to JSON

We may want to use our scraped data in another place. So let's save it to a JSON file.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
    "strconv"

    "github.com/gocolly/colly"
)

type Fact struct {
    ID          int    `json:"id"`
    Description string `json:"description"`
}

func main() {
    allFacts := make([]Fact, 0)

    collector := colly.NewCollector(
        colly.AllowedDomains("factretriever.com", "www.factretriever.com"),
    )

    collector.OnHTML(".factsList li", func(element *colly.HTMLElement) {
        factId, err := strconv.Atoi(element.Attr("id"))
        if err != nil {
            log.Println("Could not get id")
        }
        factDesc := element.Text

        fact := Fact{
            ID:          factId,
            Description: factDesc,
        }

        allFacts = append(allFacts, fact)
    })

    collector.OnRequest(func(request *colly.Request) {
        fmt.Println("Visiting", request.URL.String())
    })

    collector.Visit("https://www.factretriever.com/rhino-facts")

    writeJSON(allFacts)
}

// writeJSON marshals the facts and writes them to rhinofacts.json.
func writeJSON(data []Fact) {
    file, err := json.MarshalIndent(data, "", " ")
    if err != nil {
        log.Println("Unable to create json file")
        return
    }

    // os.WriteFile replaces ioutil.WriteFile, which is deprecated since Go 1.16.
    if err := os.WriteFile("rhinofacts.json", file, 0644); err != nil {
        log.Println("Unable to write rhinofacts.json")
    }
}

Here's what's happening in the code above:

  • We import the encoding/json package so we can marshal our data to JSON
  • We import the os package, which provides an interface to operating system functionality, including writing files
  • We create a function called writeJSON that takes one parameter of type []Fact
  • Inside the function body, we use json.MarshalIndent to marshal the data we pass in
  • The MarshalIndent function returns the JSON encoding of the data and also returns an error
  • Some error handling. If we get an error here, we will just print a log message saying we were unable to create a JSON file
  • We then use os.WriteFile to write our JSON-encoded data to a file called "rhinofacts.json". On Go versions older than 1.16, ioutil.WriteFile from the io/ioutil package does the same job
  • If the file does not exist yet, WriteFile creates it with the permissions code of 0644. If it already exists, its contents are replaced
  • If the write fails, we log a message instead of silently discarding the error

Our writeJSON function is ready to use. We call it at the end of func main and pass allFacts to it.

Now if we go back to the terminal and run the command go run main.go, all our scraped rhino facts will be saved in a JSON file called "rhinofacts.json".

Conclusion

In this tutorial, you learnt how to build a web scraper with Go and the Colly package. If you enjoyed this article and you'd like more, consider following Div Rhino on YouTube.

Congratulations, you did great. Keep learning and keep coding!

GitHub logo divrhino / rhino-scraper

Learn how to build a web scraper with Go and colly. Video tutorial available on the Div Rhino YouTube channel.



