Exploring Web Scraping with Go: A Practical Approach

Introduction

This guide will aim to walk you through the process of creating a web scraper using Go and the Colly web scraping framework.

We will start with the basics of getting Go installed on your computer, then progress step by step: extracting data from the web, parsing that data for what we want, and finally exporting it to JSON.

This is intended to be fairly comprehensive and should hopefully serve as a solid foundation for those interested in web scraping with Go and Colly. If you are already familiar with some of the concepts, feel free to skip ahead to the sections that interest you.

Something I want to note early on is that some of the code in this project may not follow best practices or be the most efficient. This was my first project ever using Go and it served to help familiarize myself with the language, so if there are mistakes, I apologize and will gladly fix them if they are pointed out.

Ethics

I feel compelled to speak on the ethics involved with web scraping before we proceed. The power of web scraping is vast, and with it comes the potential for the misuse of data. It's important not to violate copyright laws or overload servers with repeated requests. Web scraping resides in a bit of an ethical grey area. Be sure to familiarize yourself with the laws in your country regarding the enforcement of a website's terms of service before you proceed.

Requirements

This guide assumes that the reader has some knowledge in the following areas:

  • HTML

  • CSS

  • Basic to intermediate programming skills

  • Basics of Golang

  • Use of the command line

Installing Go

Whether you are running Linux, Mac, or Windows, getting Go installed is fairly straightforward.

The installation documentation on the official website walks you through the process of getting Go installed better than I can.

Once installed, make sure you can see your current version of Go by running:

$ go version

Now we are ready to get started!

Project Setup

Create the Directory

First, we need to create the project directory. I recommend creating a new repository on GitHub or a similar service and cloning it locally wherever you choose. You can also initialize the repository locally and push it to a remote repo at a later time.

If you're not familiar with Git, you can still follow along, but I would recommend becoming familiar with the basics. I won't go over them here as that is beyond the scope of this article. Nonetheless, it won't take long to get the hang of the fundamentals.

You can create your directory wherever you choose -- just be sure that you are performing the following commands from within that directory.

Initialize go.mod

Next we will initialize the go.mod file by running the following command:

$ go mod init "project-name"
// example:
// go mod init goscraper

This will create a go.mod file in your project directory. This file records your module's dependencies and helps create reproducible builds from system to system.

A companion file called go.sum, which contains a list of checksums for the project's dependencies, will be created automatically once you start adding dependencies.
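For reference, once the module is initialized and Colly has been installed (which we will do next), your go.mod will look roughly like this -- the module name, Go version, and Colly version shown here are only examples:

module goscraper

go 1.20

require github.com/gocolly/colly v1.2.0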

Installing Dependencies

Now we can install our dependencies. This project only has one -- the Colly framework.

We can install it by running this command in the root of our project directory:

$ go get -u github.com/gocolly/colly

Creating main.go

Now let's create a file called main.go at the root of your project directory and open it up in your preferred text editor.

Let's get the main skeleton of the program set up.

package main

func main() {

}

Let's move on to configuring Colly and actually extracting some data!

Getting Started with Colly

The first thing we need to do is import Colly into our main.go file.

To do this we will add the import at the top of the file, underneath the package name -- and while we're at it we will import a few other packages from the standard library. Now our main.go file looks like this:

package main

import (
    "log"
    "fmt"

    "github.com/gocolly/colly"
)

func main() {

}

Initializing the Collector

In order to begin working with Colly, we need to initialize a Collector object that will handle the network communication and execution of callbacks while a job is running. The collector is what reaches out to the domain, retrieves the response and then runs callbacks on that response such as OnHTML, OnRequest, etc.

Read more about the callbacks available with Colly

The main callback we will be working with is OnHTML, which will allow us to parse the HTML using CSS selectors and extract the data we are interested in.

We can set up the Collector with a simple line of code:

c := colly.NewCollector()

Setting Up Error Handling for Colly

Colly has a number of different callbacks available, one of which is OnError.

This will let us know when something has gone wrong with the response received.

Here is how we can initialize it:

func main() {
    // Create the 'collector' object
    c := colly.NewCollector()

    // Handle our errors
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request URL: %s\nError: %v", r.Request.URL, err)
    })
}

Now if there is an issue getting a response, it will print the error out. For the time being this is good enough to point us in the right direction if something goes wrong.

Setting the OnRequest Callback

The OnRequest callback is called any time a request gets made. Let's implement the callback so that every page we visit is printed to the console, giving us some feedback.

func main() {
    c := colly.NewCollector()

    // Print a message in the console when we visit a URL
    // e.g. "Visiting https://example.com/products"
    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL.string())
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request URL: %s\nError: %v", r.Request.URL, err)
    })
}

Now, for every page the scraper visits, we will get a message in the console showing the URL being visited.

Awesome!

Setting a User Agent

To help the website we are visiting see our requests as coming from a regular browser rather than a bot, we should set a user agent. The user agent is a string containing some information about the operating system and browser the request originates from.

Note: Many sites use other methods to prevent bots from scraping their sites such as rate limiting, requiring valid cookies or a session id, requiring certain headers to be present & more.

You can view your current user agent right now by opening a browser, going to Google, and searching "what is my user agent".

Here are a few of the most popular desktop user agents:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36 Edg/113.0.1774.42

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
(KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.3

To get a full list of popular user agents you can visit
useragents.me.

We are going to end up with a few values -- the search parameters for our bot, URL information, and our user agent -- that we will want to store in some kind of configuration.

Extra Credit: Create a function that will take a list of different user agents and randomly choose one for each request.
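If you want to attempt that, here is a minimal sketch of one approach -- the randomUserAgent helper and the agent strings are only illustrative, not part of the final code:

import "math/rand"

// randomUserAgent returns one user agent from a small pool at random
func randomUserAgent() string {
    agents := []string{
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
    }
    return agents[rand.Intn(len(agents))]
}

// Then, inside main(), set a fresh user agent on every request:
c.OnRequest(func(r *colly.Request) {
    r.Headers.Set("User-Agent", randomUserAgent())
})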

In order to limit the scope of our variables and avoid using global constants, we will define a config struct that will hold our configuration items.

In our main.go file, let's add the following code:

type config struct {
    Agent string
}

func main() {
    cfg := config{
        Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
    }

    // Rest of code...
}

Setting the URL

So our user agent is set, now we just need to set the URL for the site we would like to extract data from.

The title of this article is "Exploring Web Scraping With Go: A Practical Approach", and I intend to provide a practical example, rather than using the usual "web scraping practice" sites that make it trivial to retrieve the data.

Instead we will be scraping a real estate website in order to get a better view of the available listings and be able to sort and filter the data how we choose. This could be helpful for those looking to buy a new home, or potential investors
trying to find their next investment. I personally find it is much easier to look through listings in a spreadsheet and then visit the link if I find one interesting.

So we need to set our URL. In our case we will be using
https://www.realtor.com.

But we don't just want the base domain. We are going to be scraping listings for new houses for sale, so we will need the URL with all of the parameters & filters we want.

We can obtain this by going to the site in a browser, doing a search for what we want, and setting up all of the filters how we would like. The URL will reflect those filters, and we can simply copy the whole thing.

For this example, this is the full URL for what we will be scraping:

https://www.realtor.com/realestateandhomes-search/Miami_FL/beds-1/baths-1/type-single-family-home/price-100000-na/age-1+/pnd-hide/fc-hide/55pnd-hide/radius-1/sby-6

Let's quickly break it apart and see what's going on.

This is our base search URL:

https://www.realtor.com/realestateandhomes-search/

These are our search parameters:

Miami_FL/beds-1/baths-1/type-single-family-home/price-100000-na/age-1+/pnd-hide/fc-hide/55pnd-hide/radius-1/sby-6

The search parameters are where it can get confusing. In order, we are searching/filtering for the following:

  • located in Miami FL
  • minimum of 1 bedroom and 1 bathroom
  • single family homes only
  • minimum price $100,000 to hide listings with no or low prices
  • house is 1+ years old - hides new construction
  • hides all houses pending sale, foreclosures, and 55+ communities
  • search radius of 1 mile
  • sort by newest listings

Splitting the URL

We have our URL, so let's add it to our configuration. Simply add a new field to the struct and include it in our cfg.

However, we are going to keep the URL split apart for better readability, and to allow ease of changing the parameters if we want to modify the search.

type config struct {
    Agent   string
    BaseUrl string
}

func main() {
    cfg := config{
        Agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36"
        BaseUrl = "https://www.realtor.com/realestateandhomes-search/"
    }

    // our search parameters
    params := "Miami_FL/beds-1/baths-1/type-single-family-home/price-100000-na/age-1+/pnd-hide/fc-hide/55pnd-hide/radius-1/sby-6/"

    // Rest of the code...
}

We can also start pulling some of those search parameters into the config, allowing us to easily change the search. Remember to update the config struct as well.

type config struct {
    Agent    string
    Location string
    Radius   string
    HomeType string
    BaseUrl  string
}

func main() {
    cfg := config{
        Agent:    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        Location: "Miami_FL",
        Radius:   "radius-1",
        HomeType: "type-single-family-home",
        BaseUrl:  "https://www.realtor.com/realestateandhomes-search/",
    }

    params := fmt.Sprintf(
        "%s/beds-1/baths-1/%s/price-100000-na/age-1+/pnd-hide/fc-hide/55pnd-hide/%s/sby-6/",
        cfg.Location,
        cfg.HomeType,
        cfg.Radius,
    )
    // Rest of the code...
}

Choosing CSS Selectors

Now we need to go back to the web page in our browser and break out the developer tools. In Chrome and Firefox you should be able to open it with the F12 key, or CTRL + SHIFT + I.

For this example, we will only be grabbing the prices -- however you can use the same methods to extract any piece of data from the page.

Using the element selector in the developer tools, we can click on the price of any listing and see that it is stored inside of a <span> tag.

[Screenshot: the listing price inside a <span> tag in the developer tools]

We could use the class as a selector, and it might work fine. The problem is that many CSS classes are dynamically generated and will change when the site updates. That means our bot would break and we would need to fix the selectors.

With this website, it appears they are also using data-label attributes in addition to classes. Those labels are less likely to change as there is usually some underlying JavaScript code or testing framework that relies on those
attributes. This usually means less of a chance of the selectors breaking.

So now we have our selector. A <span> tag with a data-label="pc-price" attribute.

However, ideally we want to look at the entire <div> where all of the information is contained on the page. That way we can also get pieces of information other than the price.

Looking at the HTML and moving up a few parent containers finds us this:

[Screenshot: the parent property-wrap container in the developer tools]

The <div> with a class of property-wrap contains all of the property information such as price, bedrooms, bathrooms, address, etc. on each listing card on the web page.

Initializing the OnHTML Callback

Now that we have our selector, we can initialize the OnHTML callback. This is called whenever our collector finds an HTML element matching the selector we declare, and it then runs the code in the callback.

It will look something like this:

// Select the parent container
c.OnHTML("div.property-wrap", func(e *colly.HTMLElement) {
    // Select the price. ChildText returns just the text
    price := e.ChildText("span[data-label='pc-price']")

    log.Println("Price:", price)
})

There is just one more step remaining before we can extract the data. We need to call the c.Visit() method with our URL.

So, near the beginning of the main function, declare a url variable which is set to the cfg.BaseUrl plus our parameters.

func main() {
    // Earlier code omitted...

    // Concatenate cfg.BaseUrl + params
    url := cfg.BaseUrl + params

    // Callbacks omitted for brevity...

    // Visit the URL
    err := c.Visit(url)
    if err != nil {
        log.Fatal(err)
    }
}

Ignoring robots.txt

Whoops, I almost forgot: currently the bot will obey the site's robots.txt file, which prevents it from crawling this page. We can tell the bot to ignore that with the following line:

c.IgnoreRobotsTxt = true

You can add that line just above or below where we set the user agent.

Our code up to this point will look like this:

package main

import (
    "fmt"
    "log"

    "github.com/gocolly/colly"
)

type config struct {
    Agent    string
    Location string
    Radius   string
    HomeType string
    BaseUrl  string
}

func main() {
    cfg := config{
        Agent:    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        Location: "Miami_FL",
        Radius:   "radius-1",
        HomeType: "type-single-family-home",
        BaseUrl:  "https://www.realtor.com/realestateandhomes-search/",
    }

    params := fmt.Sprintf(
        "%s/beds-1/baths-1/%s/price-100000-na/age-1+/pnd-hide/%s/fc-hide/55pnd-hide/sby-6",
        cfg.Location,
        cfg.HomeType,
        cfg.Radius,
    )

    url := cfg.BaseUrl + params

    c := colly.NewCollector()

    c.UserAgent = cfg.Agent
    c.IgnoreRobotsTxt = true

    c.OnHTML("div.property-wrap", func(e *colly.HTMLElement) {
        price := e.ChildText("span[data-label='pc-price']")

        log.Println("Price:", price)
    })

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request URL: %s\nError: %v", r.Request.URL, err)
    })

    err := c.Visit(url)
    if err != nil {
        log.Fatal(err)
    }
}

Now go to your console and run go run main.go

You should see a list of prices print to your console.

[Screenshot: console output showing the extracted listing prices]

Congrats! You just extracted data from a website!

The whole process was fairly straightforward as well, really highlighting the remarkable power of web scraping as a tool.

Structuring the Data

At this point we need to pause and consider what we ultimately want the program to return to us, which in this situation is a JSON file with the prices of the listings.

Currently we only have the price, but in the future we might want to have the number of bedrooms & bathrooms, square footage, lot size, address, and maybe a link to the listing so we can visit it if we are interested.

With these considerations in mind, let's create a struct called house that encapsulates all of the features we aim to extract.

Note: I am storing the data as strings even though some of the values are technically numbers. Colly extracts everything as a string to begin with, and unless we need to run mathematical operations on the numbers, keeping them as strings makes them easier to work with. You can always convert a value to an int or float, perform operations on it, and convert it back to a string.

type house struct {
    Price string `json:"price"`
    // Beds string `json:"beds"`
    // Baths string `json:"baths"`
    // etc...
}

Now we need to modify our OnHTML callback so it makes use of the struct -- and we should probably create some kind of list to store the struct of data from each listing.

type house struct {
    Price string `json:"price"`
}

func main() {
    // Earlier code omitted for brevity...

    // Create a slice of house{}
    houses := []house{}

    c.OnHTML("div.property-wrap", func(e *colly.HTMLElement) {
        // Create a temporary "house"
        temp := house{}

        // Add the Price property
        temp.Price = e.ChildText("span[data-label='pc-price']")
        // temp.Beds = e.ChildText("selector")
        // ...

        // Append each "house" to the list of "houses"
        houses = append(houses, temp)
    })

    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL.String())
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request URL: %s\nError: %v", r.Request.URL, err)
    })

    err := c.Visit(url)
    if err != nil {
        log.Fatal(err)
    }

    // Loop over the slice of houses and print the price
    for _, h := range houses {
        fmt.Printf("Price: %s\n", h.Price)
    }
}

Now when you run go run main.go you will get an output like this:

[Screenshot: console output after looping over the houses slice]

Identical to our output before (minus the log timestamps), except we are looping over our list of houses and printing the prices. We now have our data nicely structured and ready to be exported to JSON.

Data Export

Let's create another function, placed above our main function, to handle exporting the data to JSON.

This function will take our list of houses and cfg.Location as arguments, and export a JSON file named with the location.

func writeHousesToJson(houses []house, location string) error {
    // Set the file path, in this case we are creating a scans directory
    // to hold all of our scans, and saving the file using the location
    // as the name of the file. (e.g. Miami_FL.json)
    filePath := filepath.Join("scans", fmt.Sprintf("%s.json", location))
    if err := os.MkdirAll(filepath.Dir(filePath), 0755); err != nil {
        return fmt.Errorf("error creating directory: %w", err)
    }

    // Create the file
    file, err := os.Create(filePath)
    if err != nil {
        return fmt.Errorf("error creating file: %w", err)
    }

    // Ensures the file gets closed if there is an error, or file operations
    // complete
    defer file.Close()

    // Encode the data to JSON and save it to the file
    jsonEncoder := json.NewEncoder(file)

    // Format it nicely
    jsonEncoder.SetIndent("", "  ")
    if err := jsonEncoder.Encode(houses); err != nil {
        return fmt.Errorf("error encoding houses to json: %w", err)
    }

    return nil
}

To use this function we will also need to import the encoding/json, os, and path/filepath packages. These are all part of Go's standard library so there is no need to go get any dependencies.

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
    "path/filepath"

    "github.com/gocolly/colly"
)

Now we can simply call this function in our main() function and it will export our data to a JSON file!

Let's remove our previous for loop, and replace it with a call to our new function. We can also add a simple log message to indicate the number of listings we've extracted.

func main() {
    // {Previous code omitted for brevity}

    // Call our function to write to JSON
    if err := writeHousesToJson(houses, cfg.Location); err != nil {
        log.Fatal(err)
    }

    // Log the number of listings extracted from the page
    log.Printf("Extracted: %d listings", len(houses))
}

Running go run main.go again will provide a different result now. If we look in our root directory, we will see a folder called scans.

[Screenshot: the scans folder in the project directory]

Inside that folder, you'll find your JSON file.

[Screenshot: the JSON file inside the scans folder]

Looking at our JSON file, everything appears in good order and is nicely formatted.

Congratulations! You just made a pretty simple albeit very powerful program capable of extracting data from the internet and exporting it into a common format.

[Screenshot: the nicely formatted JSON output]

With this foundation, you can add any fields you want. Just add the field to the house struct and then add the selector to the OnHTML callback.
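For example, adding a hypothetical Beds field would look something like this -- note that the data-label value below is a placeholder, so check the developer tools for the actual attribute on the listing card:

type house struct {
    Price string `json:"price"`
    Beds  string `json:"beds"`
}

// Inside the OnHTML callback:
temp.Beds = e.ChildText("span[data-label='pc-meta-beds']") // placeholder selector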

Data Processing

Most of the data you extract will require some degree of manipulation. Often, the text retrieved from selectors contains data we wish to exclude. For example, the dollar sign $ before the price, or perhaps you've ambitiously
implemented selectors for the bedrooms and bathrooms, only to realize that ChildText() is returning 2bed and 1.5bath when you just wanted the numeric values.

Fortunately, we can process these strings by using the aptly named strings package, which is part of the standard library of Go.

Here's how:

// Values that would be returned by ChildText()
beds := "2bed"
baths := "1.5bath"
price := "$300,000"

beds = strings.TrimSuffix(beds, "bed")
baths = strings.TrimSuffix(baths, "bath")
price = strings.TrimPrefix(price, "$")

fmt.Printf("%s; %s; %s", beds, baths, price)
// 2; 1.5; 300,000

Provided you are familiar with basic string manipulation techniques, processing the data is fairly straightforward.
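If you do need to treat a value as a number, the standard library's strconv package pairs nicely with strings. Here is a small sketch -- the parsePrice helper is hypothetical and not part of our scraper:

import (
    "strconv"
    "strings"
)

// parsePrice strips the "$" and thousands separators from a price
// string and parses the result as an integer.
func parsePrice(raw string) (int, error) {
    cleaned := strings.TrimPrefix(raw, "$")
    cleaned = strings.ReplaceAll(cleaned, ",", "")
    return strconv.Atoi(cleaned)
}

// Usage:
// n, err := parsePrice("$300,000") // n == 300000, err == nil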

Pagination

Extracting data from a single web page is straightforward, but we often encounter multiple pages of results.

The approach to handling pagination varies from site to site due to differences in structure. In this case, we need to find the HTML for the "next" button and look at its href attribute to determine whether it is a blank string or a valid link to the next page.

Here is an example of how it appears in the HTML:

[Screenshot: the next-page anchor tag in the HTML]

Below is how we can implement the logic to check for the next page and visit it.

Note: I noticed that on this site, the link to the next page doesn't include the search parameters. To maintain consistency and ensure we are only extracting the desired data, I split the URL at each forward slash "/", and then take sub index 3 which contains our next page (e.g. pg-2). That is then concatenated with our URL to give us the next page while including our search parameters.

// Create another OnHTML() callback
c.OnHTML("a[aria-label='Go to next page']", func(e *colly.HTMLElement) {
    // Grab the href value
    nextPage := e.Attr("href")

    // If the 'href' is not a blank string, process the URL
    if nextPage != "" {

        // Splits the URL at each "/", and selects subindex 3
        nextPage = strings.Split(nextPage, "/")[3]

        // Concatenate the page number with our URL
        nextPageURL := url + nextPage

        // Visit the next page
        err := c.Visit(nextPageURL)
        if err != nil {
            log.Println("Error visiting next page:", err)
        }
    } else {
        log.Printf("Extracted: %d listings", len(houses))

        if err := writeHousesToJson(houses, cfg.Location); err != nil {
            log.Fatalf("Error while writing files: %v", err)
        }

        os.Exit(0)
    }
})

Final Code

I took the liberty of creating a function, logStats(), to handle logging information to the console -- and added a timer to keep track of the runtime, which will print out at completion. Other than that, this is what your final code should look like.

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "os"
    "path/filepath"
    "strings"
    "time"

    "github.com/gocolly/colly"
)

type house struct {
    Price string `json:"price"`
}

type config struct {
    Agent    string
    Location string
    Radius   string
    HomeType string
    BaseUrl  string
}

func writeHousesToJson(houses []house, location string) error {
    filePath := filepath.Join("scans", fmt.Sprintf("%s.json", location))
    if err := os.MkdirAll(filepath.Dir(filePath), 0755); err != nil {
        return fmt.Errorf("error creating directory: %w", err)
    }

    file, err := os.Create(filePath)
    if err != nil {
        return fmt.Errorf("error creating file: %w", err)
    }

    defer file.Close()

    jsonEncoder := json.NewEncoder(file)

    jsonEncoder.SetIndent("", "  ")
    if err := jsonEncoder.Encode(houses); err != nil {
        return fmt.Errorf("error encoding houses to json: %w", err)
    }

    return nil
}

// Handles logging some info at the end of the extraction
func logStats(start time.Time, houses []house) {
    log.Println("Extraction completed successfully!")
    elapsed := time.Since(start)
    log.Printf("Extracted: %d listings", len(houses))
    log.Printf("Elapsed Time: %s", elapsed)
}

func main() {
    cfg := config{
        Agent:    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36",
        Location: "Miami_FL",
        Radius:   "radius-1",
        HomeType: "type-single-family-home",
        BaseUrl:  "https://www.realtor.com/realestateandhomes-search/",
    }

    start := time.Now()
    params := fmt.Sprintf(
        "%s/beds-1/baths-1/%s/price-100000-na/age-1+/pnd-hide/fc-hide/55pnd-hide/%s/sby-6/",
        cfg.Location,
        cfg.HomeType,
        cfg.Radius,
    )

    url := cfg.BaseUrl + params
    c := colly.NewCollector()
    c.UserAgent = cfg.Agent
    c.IgnoreRobotsTxt = true

    houses := []house{}

    // Get the price of each listing on the page
    c.OnHTML("div.property-wrap", func(e *colly.HTMLElement) {
        temp := house{}
        temp.Price = e.ChildText("span[data-label='pc-price']")

        houses = append(houses, temp)
    })

    // Check the next page button, if 'href' != "" then visit the page and
    // extract the prices of all of the listings
    c.OnHTML("a[aria-label='Go to next page']", func(e *colly.HTMLElement) {
        nextPage := e.Attr("href")

        if nextPage != "" {

            nextPage = strings.Split(nextPage, "/")[3]

            nextPageURL := url + nextPage

            err := c.Visit(nextPageURL)
            if err != nil {
                log.Println("Error visiting next page:", err)
            }
        } else {
            logStats(start, houses)

            if err := writeHousesToJson(houses, cfg.Location); err != nil {
                log.Fatalf("Error while writing files: %v", err)
            }

            os.Exit(0)
        }

    })

    // Logging statement on each URL request
    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL.String())
    })

    // Handles errors with the request
    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Request URL: %s\nError: %v", r.Request.URL, err)
    })

    err := c.Visit(url)
    if err != nil {
        log.Fatal(err)
    }

    if err := writeHousesToJson(houses, cfg.Location); err != nil {
        log.Fatalf("Error while writing files: %v", err)
    }

    logStats(start, houses)
}

Considerations

Our scraper works great, but in the world of web scraping it is pretty basic and has some limitations.

Current Limitations

Rate Limiting

If you are scraping a large number of pages, you may trigger some form of rate limiting on the server you are querying for the data. They may even implement an IP block and prevent your IP address from requesting resources from the server altogether.

To help prevent this from happening, you can implement a proxy. The Colly framework provides an easy way to set up a round-robin proxy switcher, which takes a list of proxies you provide and rotates between them.
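Here is a rough sketch of how that could look using Colly's proxy package -- the proxy addresses below are placeholders, so substitute your own:

import "github.com/gocolly/colly/proxy"

rp, err := proxy.RoundRobinProxySwitcher(
    "socks5://127.0.0.1:1337",
    "socks5://127.0.0.1:1338",
)
if err != nil {
    log.Fatal(err)
}

// Each request will now rotate through the proxies above
c.SetProxyFunc(rp)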

Fixed Queries

Our queries are currently hard coded. This approach isn't ideal, and can be improved by expanding the program to prompt
the user for search parameters, or handle command line arguments -- allowing you to set things such as the location, the minimum price, the radius of the search, etc. This makes the program more dynamic and significantly expands its search capabilities.
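As a sketch of what that might look like with the standard library's flag package (the flag names and defaults below are only examples):

import "flag"

// At the top of main(), read search values from the command line
location := flag.String("location", "Miami_FL", "location to search, e.g. Miami_FL")
radius := flag.String("radius", "radius-1", "search radius parameter")
flag.Parse()

cfg := config{
    Agent:    "Mozilla/5.0 ...", // same user agent as before
    Location: *location,
    Radius:   *radius,
    HomeType: "type-single-family-home",
    BaseUrl:  "https://www.realtor.com/realestateandhomes-search/",
}

Running go run main.go -location=Tampa_FL would then search Tampa instead of Miami.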

Better Storage

We are currently outputting everything to JSON. This is fine for a relatively small amount of data; however, if you need to store vast amounts of data or perform advanced queries on it, you may want to implement some form of database. You could choose between a SQL or NoSQL solution depending on your needs, which would allow much more powerful queries on the collected data.

Further Expansion

To enhance the output capabilities of our program, try adding a function that exports data to CSV format, or taking it a step further and setting up a database.
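A CSV exporter could mirror writeHousesToJson fairly closely using the standard library's encoding/csv package. Here is one possible sketch -- writeHousesToCsv is a new, hypothetical function:

import "encoding/csv"

func writeHousesToCsv(houses []house, location string) error {
    filePath := filepath.Join("scans", fmt.Sprintf("%s.csv", location))
    if err := os.MkdirAll(filepath.Dir(filePath), 0755); err != nil {
        return fmt.Errorf("error creating directory: %w", err)
    }

    file, err := os.Create(filePath)
    if err != nil {
        return fmt.Errorf("error creating file: %w", err)
    }
    defer file.Close()

    w := csv.NewWriter(file)

    // Header row, then one row per house
    if err := w.Write([]string{"price"}); err != nil {
        return err
    }
    for _, h := range houses {
        if err := w.Write([]string{h.Price}); err != nil {
            return err
        }
    }

    w.Flush()
    return w.Error()
}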

It is possible to declare more than one Collector object. For instance, you can have one Collector parse the listings, and another Collector can visit each listing and parse more detailed information.
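A rough sketch of that pattern using Colly's Clone() method follows -- the selectors here are placeholders, so you would need to inspect the listing pages for the real ones:

listCollector := colly.NewCollector()
detailCollector := listCollector.Clone()

// The list collector finds each listing's link and hands it off
listCollector.OnHTML("div.property-wrap a[href]", func(e *colly.HTMLElement) {
    link := e.Request.AbsoluteURL(e.Attr("href"))
    if err := detailCollector.Visit(link); err != nil {
        log.Println("Error visiting listing:", err)
    }
})

// The detail collector parses the individual listing page
detailCollector.OnHTML("div.listing-detail", func(e *colly.HTMLElement) {
    // Extract square footage, lot size, description, etc. here
})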

The Colly framework also provides built-in methods for asynchronous requests, random delays, website login handling, and more. These features enable you to construct a complex data extraction machine powered by Go.
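For example, enabling asynchronous requests with a randomized delay might look something like this sketch -- the domain glob and timings are arbitrary:

// Create the collector in async mode
c := colly.NewCollector(colly.Async(true))

// Limit concurrency and add a random delay between requests
if err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*realtor.*",
    Parallelism: 2,
    RandomDelay: 5 * time.Second,
}); err != nil {
    log.Fatal(err)
}

// Register callbacks as before, then:
if err := c.Visit(url); err != nil {
    log.Fatal(err)
}
c.Wait() // block until all queued requests have finished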

Additionally, improving error handling could be beneficial. Given that this is my first program in Go, I acknowledge that my error handling might not be as efficient or effective as it could be. Refactoring it to provide more robust error messages will likely enhance the maintainability of the program as it expands.

Conclusion

In summary, we've installed Go, constructed a bot to extract data, organized this data into a structured format for easy exportation to a JSON file, and introduced a method to handle multiple pages of results.

With this foundational knowledge, I strongly recommend that you conduct your own experiments to further solidify these concepts. Real learning happens in curious exploration, rather than simply following along with guides or tutorials. I see this resource as a guide - a springboard that helps propel you into exploring a subject and enables a deeper dive into the topic.

With that being said, I encourage you to read through the
Colly documentation and experiment with some of the features yourself.

Overall I found Go to be a pleasure to work with and found the Colly framework both powerful and easy to get the hang of.

If you made it this far and enjoyed this guide or found it useful at all, I would appreciate some feedback. This is the first "guide" or "article" I have ever written and sought for it to be informative, unlike a lot of the guides I see nowadays.

References

Good references for learning Go. I used these heavily.

Colly Documentation
