Scrapfly

Posted on Jul 25, 2024 • Originally published at scrapfly.io on Jul 24, 2024

Web Scraping With Go

#go #http #dataparsing #scrapingframeworks

The Go programming language is a popular statically typed language. Its memory management, garbage collection, and concurrency features make it an excellent choice for building extensive and scalable applications. But what about utilizing these capabilities through web scraping?

In this guide, we’ll take a deep dive into web scraping with Golang. We’ll start by exploring the core concepts of sending HTTP requests, parsing, crawling, and data manipulation. Then, we’ll go through a step-by-step guide to using Go Colly - the most popular package for building Go scrapers.

With Go, you can create a web scraper with only a few lines of code. Let’s get started!

Setup

Let's start by going through the steps required to build and run a Golang web scraper.

Setting Up Environment

To set the Go environment on your machine, start by downloading the Go release according to your operating system. After this, follow the installation guide to configure the PATH environment.

To verify your installation, use the below command to view the installed version:

$ go version
go version go1.20.3 windows/amd64

Installing Required Packages

In this Golang scraping guide, we'll be using a few packages:

goquery: An HTML parsing library that implements common jQuery features in Go, allowing for full-featured DOM tree manipulation.
htmlquery: An HTML querying package that enables parsing with XPath selectors.
colly: A popular scraping framework providing a clear interface for writing web crawlers.

Before installing the above packages, create a new project using the go mod init followed by the project name:

go mod init product-scraper

Next, add the below package requirements to the go.mod file:

module product-scraper

go 1.20

// only add the below code
require (
    github.com/PuerkitoBio/goquery v1.9.2
    github.com/antchfx/htmlquery v1.3.2 
    github.com/gocolly/colly v1.2.0
)

Finally, use the go mod tidy command to actually download all the required dependencies.

Web Scraping With Golang: Core Concepts

Just like any programming language, building an efficient Go scraper to perform web scraping is possible through native HTTP requests and basic parsing capabilities.

In the following sections, we'll review some Go web scraping quick start guides into the core concept required for extracting data.

🙋‍ The _ (underscore) identifier is frequently used throughout this guide. It's to ignore error handling when errors are returned. To process them, use err instead of _ and handle them: if (err != nil) {return err}.

Sending HTTP Requests

HTTP requests are the core of all web scrapers, allowing them to automatically retrieve data from various endpoints. In order to retrieve any endpoint's data, a request must be sent to the server to retrieve the data in the response body:

Illustration of a standard HTTP exchange

Request Method

Sending HTTP requests in a Golang scraper is possible through the native net/http package. Here's how to use it to send a simple GET request:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func main() {
    url := "https://httpbin.dev/headers"

    client := &http.Client{
        Timeout: 30 * time.Second, // Define client timeout
    }
    req, _ := http.NewRequest("GET", url, nil) // Create new request
    resp, _ := client.Do(req) // Send the request

    defer resp.Body.Close()

    // Read the HTML
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}

Above, an HTTP client is created with 30s timeout. Then, it's used to define a simple GET request to httpbin.dev/headers to retrieve the basic request details. Finally, we sent the request and read the response body.

To change the HTTP request method, all we have to do is declare the method to use:

func main() {
    url := "https://httpbin.dev/post"

    client := &http.Client{
        Timeout: 30 * time.Second, // Define client timeout
    }
    req, _ := http.NewRequest("POST", url, nil) // Specify request method
    resp, _ := client.Do(req)
    // ....
}

Request Headers

To set header values in a Go webscraper, set them while defining the request:

func main() {
    url := "https://httpbin.dev/headers"

    client := &http.Client{
        Timeout: 30 * time.Second,
    }
    req, _ := http.NewRequest("POST", url, nil)
    req.Header.Set("User-Agent", "Mozilla/5.0 (Android 12; Mobile; rv:109.0) Gecko/113.0 Firefox/113.0")
    req.Header.Set("Cookie", "cookie_key=cookie_value;")
    resp, _ := client.Do(req)
    // ...
}

// "Cookie": [
// "cookie_key=cookie_value"
// ],
// "User-Agent": [
// "Mozilla/5.0 (Android 12; Mobile; rv:109.0) Gecko/113.0 Firefox/113.0"
// ],

Above, we add a cookie header and override the client's User-Agent to match a real web browser.

Headers play a vital role in the web scraping context. Websites and antibot systems use them to detect automated requests, and hence block them. For further details on HTTP headers, refer to our dedicated guide on using headers for web scraping.

Request Body

Finally, here's how to pass a payload to an HTTP request in Go:

func main() {
    url := "https://httpbin.dev/anything"
    payload := "page=1&test=2&foo=bar"

    client := &http.Client{}
    req, _ := http.NewRequest("POST", url, strings.NewReader(payload)) // Pass the payload
    req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
    resp, _ := client.Do(req)
    defer resp.Body.Close()

    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}

HTML Parsing

HTML parsing is required to extract data points from a retrieved HTML web page. As mentioned earlier, we'll be using two Golang HTML parsers to extract data:

goquery for evaluating CSS selectors.
htmlquery for evaluating XPath selectors.

For our example, we'll use a mock target website and extract the product data on web-scraping.dev/products:

Product page on web-scraping.dev

We'll parse the above product data with both CSS and XPath selectors to demonstrate both methods. But before this, let's request our target web page to retrieve the HTML:

func scrapeProducts(url string) *http.Response {
    client := &http.Client{}
    req, _ := http.NewRequest("GET", url, nil)
    resp, _ := client.Do(req)
    return resp
}

Above, we define a scrapeProducts() function. It requests the target web page URL and returns its response.

Parsing With CSS Selectors

CSS selectors are by far the most common way to parse HTML documents and they're available in go through the goquery package.

Here's how we can use goquery and css selectors to extrac product data from our example page:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "strconv"

    "github.com/PuerkitoBio/goquery"
)

// Define a type for the extracted data
type Product struct {
    Name string
    Price float64
    Currency string
    Image string
    Desciption string
    Link string
}

func parseProducts(resp *http.Response) []Product {
    // Empty list to save the results
    var products []Product

    // Load the HTML into a document
    doc, _ := goquery.NewDocumentFromReader(resp.Body)

    // Find the element containing the product list
    selector := doc.Find("div.products > div")

    // Iterate over the product list
    for i := range selector.Nodes {
        sel := selector.Eq(i) // Define a selector for each product element
        priceStr := sel.Find("div.price").Text()
        price, _ := strconv.ParseFloat(priceStr, 64) // Change the price data type

        // Create a new Product and append the results
        product := Product{
            Name: sel.Find("a").Text(),
            Price: price,
            Currency: "$",
            Image: sel.Find("img").AttrOr("src", ""),
            Desciption: sel.Find(".short-description").Text(),
            Link: sel.Find("a").AttrOr("href", ""),
        }

        products = append(products, product)
    }
    return products
}

func scrapeProducts(url string) *http.Response {
    // Previous function definition
}

func main() {
    resp := scrapeProducts("https://web-scraping.dev/products")
    products := parseProducts(resp)
    jsonData, _ := json.MarshalIndent(products, "", " ") // Convert to JSON string

    fmt.Println(string(jsonData))
}

Let's break down the execution of the above Golang web scraping logic:

A request is sent to the target web page URL to retrieve its HTML.
A goquery document object is created using the response body.
The goquery Find() method is used to select the HTML element containing the product list. Then, it's iterated over to retrieve the individual product details and define each into a Product object.
The final products result list is logged as JSON.

The above Go scraping script results are saved to the products object. It can be saved to JSON or a CSV file. Here's what it should look like:

[
  {
    "Name": "Box of Chocolate Candy",
    "Price": 24.99,
    "Currency": "$",
    "Image": "https://web-scraping.dev/assets/products/orange-chocolate-box-medium-1.webp",
    "Desciption": "Indulge your sweet tooth with our Box of Chocolate Candy. Each box contains an assortment of rich, flavorful chocolates with a smooth, creamy filling. Choose from a variety of flavors including zesty orange and sweet cherry. Whether you're looking for the perfect gift or just want to treat yourself, our Box of Chocolate Candy is sure to satisfy.",
    "Link": "https://web-scraping.dev/product/1"
  },
  ....
]

For further details on parsing with CSS selectors, refer to our dedicated guide on parsing HTML with CSS selectors.

Parsing With XPath Selector

XPath provides a friendly syntax for selecting HTML elements through virtual attributes and matches while also being much more powerful than CSS selectors. This makes Xpath a great replacement for cases where goquery and CSS selectors aren't enough.

Here's how to use XPath selectors with htmlquery using our product example again:

import (
    // ....

    "github.com/antchfx/htmlquery"
)

// ....

func parseProducts(resp *http.Response) []Product {
    var products []Product

    doc, _ := htmlquery.Parse(resp.Body)
    selector := htmlquery.Find(doc, "//div[@class='products']/div")

    for _, sel := range selector {
        price, _ := strconv.ParseFloat(htmlquery.InnerText(htmlquery.FindOne(sel, ".//div[@class='price']")), 64)
        product := Product{
            Name: htmlquery.InnerText(htmlquery.FindOne(sel, ".//a")),
            Price: price,
            Currency: "$",
            Image: htmlquery.SelectAttr(htmlquery.FindOne(sel, ".//img"), "src"),
            Desciption: htmlquery.InnerText(htmlquery.FindOne(sel, ".//div[@class='short-description']")),
            Link: htmlquery.SelectAttr(htmlquery.FindOne(sel, ".//a"), "href"),
        }
        products = append(products, product)
    }

    return products
}

Since the code for requesting the target web page remains the same, we only update the parseProducts function to change our parsing from CSS selectors to XPath.

Similar to goquery, we use the Find() and FindOne() methods to select the desired elements. Then, InnerText() is used to select the element's text value and SelectAttr() is used to select attributes.

For further details on parsing with XPath selectors, including advanced navigation techniques, refer to our dedicated guide on parsing HTML with XPath.

Crawling

Crawling is an extremely useful technique in the data scraping context. It provides the web data scraper with navigation capabilities by requesting specific href tag links inside the HTML page.

Let's create a Golang web crawler based on the previous code snippet. We'll crawl over the product links and extract all variant links on each product document:

import (
    "fmt"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    var productLinks []string
    var variantLinks []string

    // First, request the main products page
    req, _ := http.NewRequest("GET", "https://web-scraping.dev/products", nil)
    resp, _ := http.DefaultClient.Do(req)
    doc, _ := goquery.NewDocumentFromReader(resp.Body)

    // Get each main product link
    doc.Find("h3 a").Each(func(i int, s *goquery.Selection) {
        link := s.AttrOr("href", "")
        productLinks = append(productLinks, link)
    })

    // Next, request each product link
    for _, value := range productLinks {
        req, _ := http.NewRequest("GET", value, nil)
        resp, _ := http.DefaultClient.Do(req)
        doc, _ := goquery.NewDocumentFromReader(resp.Body)

        doc.Find("a[href*='variant']").Each(func(i int, s *goquery.Selection) {
            link := s.AttrOr("href", "")
            variantLinks = append(variantLinks, link)
        })
    }
    fmt.Println(variantLinks)
}

The above code successfully scraped all the product variant links:

[
  "https://web-scraping.dev/product/1?variant=orange-small",
  "https://web-scraping.dev/product/1?variant=orange-medium",
  "https://web-scraping.dev/product/1?variant=orange-large",
  "https://web-scraping.dev/product/1?variant=cherry-small",
  "https://web-scraping.dev/product/1?variant=cherry-medium",
  "https://web-scraping.dev/product/1?variant=cherry-large",
  ....
]

The previous Golang webscraper example is minimal, representing the core crawling logic. For further web crawling details, refer to our dedicated guide crawling with Python. The concepts mentioned can be applied to this tutorial.

Example Go Scraper

Let's create a Go crawler based on the previous code snippet. It will crawl over web-scraping.dev/products pages to scrape every product detail:

package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
    "os"
    "regexp"
    "strconv"

    "github.com/PuerkitoBio/goquery"
)

type Review struct {
    Date string
    Rating int
    Text string
}

type Product struct {
    Name string
    Price float64
    Currency string
    Image string
    Desciption string
    Link string
    Reviews []Review
}

// Crawl the product reviews
func crawlReviews(url string) []Review {
    resp := requestPage(url)
    doc, _ := goquery.NewDocumentFromReader(resp.Body)
    // Find the product reviews in hidden script tag
    reviewsScript := doc.Find("script#reviews-data").Text()
    var reviews []Review
    json.Unmarshal([]byte(reviewsScript), &reviews)
    return reviews
}

// Parse the product details
func parseProducts(resp *http.Response) []Product {
    var products []Product

    doc, _ := goquery.NewDocumentFromReader(resp.Body)
    selector := doc.Find("div.products > div")

    for i := range selector.Nodes {
        sel := selector.Eq(i)
        price, _ := strconv.ParseFloat(sel.Find("div.price").Text(), 64)
        sel.Find("a").AttrOr("href", "")
        link := sel.Find("a").AttrOr("href", "")

        // crawl the product reviews from its product page
        reviews := crawlReviews(link)

        product := Product{
            Name: sel.Find("a").Text(),
            Price: price,
            Currency: "$",
            Image: sel.Find("img").AttrOr("src", ""),
            Desciption: sel.Find(".short-description").Text(),
            Link: link,
            Reviews: reviews,
        }
        products = append(products, product)
    }
    return products
}

// Get the max pages available for pagination
func getMaxPages(resp *http.Response) int {
    doc, _ := goquery.NewDocumentFromReader(resp.Body)
    pagingStr := doc.Find(".paging-meta").Text()
    // Find the max pages number using regex
    re := regexp.MustCompile(`(\d+) pages`)
    match := re.FindStringSubmatch(pagingStr)[1]
    maxPages, _ := strconv.Atoi(match)
    return maxPages
}

// Request a target URL with basic browser headers
func requestPage(url string) *http.Response {
    client := &http.Client{}
    req, _ := http.NewRequest("GET", url, nil)

    // Set browser-like headers
    req.Header.Set("Accept", "text/html")
    req.Header.Set("Accept-Encoding", "gzip, deflate, br")
    req.Header.Set("Accept-Language", "en-US,en;q=0.9")
    req.Header.Set("User-Agent", "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0")

    resp, _ := client.Do(req)
    return resp
}

// Main scrape logic
func scrapeProducts(url string) []Product {
    var data []Product
    resp := requestPage(url)
    maxPages := getMaxPages(resp)

    for pageNumber := 1; pageNumber <= maxPages; pageNumber++ {
        resp := requestPage(fmt.Sprintf("%s?page=%d", url, pageNumber))
        log.Printf("Scraping page: %s", resp.Request.URL)
        products := parseProducts(resp)
        data = append(data, products...)
    }

    return data
}

// Save the scraped data to a JSON file
func saveToJson(products []Product, fileName string) {
    file, _ := os.Create(fileName + ".json")
    defer file.Close()
    jsonData, _ := json.MarshalIndent(products, "", " ")
    file.Write(jsonData)
    fmt.Printf("Saved %d products to %s\n", len(products), fileName)
}

func main() {
    products := scrapeProducts("https://web-scraping.dev/products")
    saveToJson(products, "product_data")
}

The above Go web scraping logic seems quite complex. Let's break down its core functions:

requestPage: Requests a specific target web page URL with basic browser-like headers.
getMaxPages: Retrieves the total number of pages available for pagination.
parseProducts: Iterates over the product list and parses the data for each product.
crawlReviews: Crawls the review data by requesting the dedicated product URL.

This Go scraper starts by retrieving the total number of pages and then iterates over them while extracting each page's data. Finally, the results to a JSON file.

So far, we have explored creating a web scraper with Golang using only packages for sending HTTP requests and HTML parsing. Next, let's explore a dedicated scraping framework: Go Colly.

Web Scraping With Colly

Colly is one of the popular Golang web scraping libraries for building web scrapers and crawlers.

It supports various features to facilitate the web scraping tasks:

Built-in caching middleware.
Fast execution at 1k requests/sec on a single core.
Asynchronous, synchronous, and parallel execution.
Distributed scraping, request delays, and maximum concurrency.
Automatic cookie and session handling.

How Colly Works?

Before getting started with building web scrapers with Colly, let's review its core technical concepts.

Collectors

A Golang colly scraper requires at least one collector. A collector is the core component responsible for managing the entire scraping process. For example, below is a very minimal Colly collector:

func main() {
    c := colly.NewCollector()

    // Start scraping
    c.Visit("https://httpbin.dev")
}

Here, we use the NewCollector method to create a new collector with the default configuration. Then, we use the Visit to request the target web page and wait for the collector to finish using the Wait method.

A collector can be configured using an options object. Below are some of its common parameters:

Option	Description
`UserAgent`	Modifies the used User-Agent HTTP header.
`MaxDepth`	Limits the recursion depth of visited URLs when crawling. Set to 0 for infinite recursion.
`AllowedDomains`	List of allowed domains to visit. Set to blank to allow any domain.
`DisallowedDomains`	List of domains to blacklist.
`AllowURLRevisit`	Whether to allow multiple requests of the same URL.
`MaxBodySize`	Limit the retrieved response body in bytes, leave it to 0 for unlimited bandwidth.
`CacheDir`	A location to save cached files of sent GET requests for future use. If not specified, cache is disabled.
`IgnoreRobotsTxt`	Whether to ignore restrictions defined by the host robots.txt file.
`Async`	Whether to enable asynchronous network communication.

Here's an example of configuring a Colly collector. We can refer to a path environment variable or directly pass the options object values:

func main() {
    c := colly.NewCollector(
        colly.UserAgent("Mozilla/5.0 (Windows NT 6.1; rv:109.0) Gecko/20100101 Firefox/113.0"), 
        colly.Async(true),
        // Futher options
    )
}

For the full available options, refer to the official API specifications.

Callbacks

The next essential component of Golang Colly is callbacks. These are manually defined functions to handle specific events happening during the HTTP lifecycle.

Callbacks are triggered based on specific scraper events in a specified collector. Let's briefly mention them.

OnRequest

The OnRequest callback event is triggered just before an HTTP request is sent from a collector. It's useful for modifying the request configuration before it's sent or for taking other custom actions, such as logging.

collector.OnRequest(func(r *colly.Request) {
    log.Println("Scraping", r.URL)
    r.Headers.Set("key", "value")
})

OnResponse

The OnResponse callback event is the opposite of OnRequest. It's triggered once an HTTP response is received. It's useful for processing the response before being passed to other scraper components:

    collector.OnResponse(func(r *colly.Response) {
        log.Println("Request was executed with status code :", r.StatusCode)
        // Decompress the response body for example
        if r.Headers.Get("Content-Encoding") == "gzip" {
            reader, _ := gzip.NewReader(bytes.NewReader(r.Body))
            decompressedBody, _ := io.ReadAll(reader)
            // ....
        }
    })

OnHTML

The OnHTML callback is triggered once a specified CSS selector is found in the HTML. It's mainly used for HTML parsing or crawling by requesting specific HTML elements.

    collector.OnHTML("div.main", func(e *colly.HTMLElement) {
        // Crawl to other links
        productLink := e.Attr("href")
        collector.Visit(productLink)

        // Parse the HTML
        price := e.Attr("div.price")
    })

OnXML

The OnXML callback is the same as OnHTML, but it's triggered by receiving XML responses.

    collector.OnXML("sitemap", func(e *colly.XMLElement) {
        log.Println(e.Attr("loc"))
    })

OnError

The OnError callback is triggered when an error occurs while making HTTP requests.

    collector.OnError(func(r *colly.Response, err error) {
        log.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

OnScraped

The OnScraped callback is triggered once the response is processed by other callbacks : OnHTML and OnXML.

    collectorc.OnScraped(func(r *colly.Response) {
        log.Println("Finished scraping", r.Request.URL)
    })

Example Go Colly Scraper

In this section, we'll create a Colly web crawler to extract product data on web-scraping.dev/products.

First, let's start by creating a new collector to crawl product pages:

package main

import (
    "fmt"
    "log"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector(
        colly.CacheDir("./web-scraping.dev_cache"), // Cache responses to prevent multiple downloads
        colly.Async(true), // Enable asynchronous execution
    )

    // Clone the cerated collector
    searchCollector := c.Clone()

    // Paginate search pages
    searchCollector.OnHTML("div.paging", func(e *colly.HTMLElement) {
        links := e.ChildAttrs("a", "href")
        nextPage := links[len(links)-1]
        if strings.Contains(nextPage, "https://web-scraping.dev") {
            e.Request.Visit(nextPage)
        }
    })

    // Log before visiting each search page
    searchCollector.OnRequest(func(r *colly.Request) {
        log.Println("Visiting search page", r.URL.String())
    })  

    // On every product link, visit the product details page
    searchCollector.OnHTML("div.row.product div h3 a", func(e *colly.HTMLElement) {
        productLink := e.Attr("href")
        fmt.Println(productLink)
    })

    // Start scraping from the main products page
    searchCollector.Visit("https://web-scraping.dev/products")

    // Wait for the asynchronous searchCollector to
    searchCollector.Wait()
}

Let's break down the above Colly scraping code. We start by defining a new collector c with asynchronous execution and cache enabled. Then, we clone the defined c collector into a new searchCollector, which will be responsible for pagination requests.

Then, we utilize two callbacks:

OnHTML: To paginate the product listing pages, 5 in total.
OnRequest: To log the requested page URLs.

The above crawler output looks like the following:

2024/07/19 08:11:37 Visiting search page https://web-scraping.dev/products
https://web-scraping.dev/product/1
....

Now that we can retrieve all product page URLs let's request them and parse their data:

func main() {
    // ....

    // Log before visiting each search page
    searchCollector.OnRequest(func(r *colly.Request) {
        log.Println("Visiting search page", r.URL.String())
    })

    type Review struct {
        Date string
        Rating int
        Text string
    }

    type Product struct {
        Name string
        Price float64
        Currency string
        Image string
        Description string
        Link string
        Reviews []Review
    }

    var products []Product

    // Create a new collector
    productCollector := c.Clone()

    // On every product link, visit the product details page
    searchCollector.OnHTML("div.row.product div h3 a", func(e *colly.HTMLElement) {
        productLink := e.Attr("href")
        productCollector.Visit(productLink) // Request all the product pages using the productCollector
    })

    // Log before visiting each product page
    productCollector.OnRequest(func(r *colly.Request) {
        log.Println("Visiting product page", r.URL.String())
    })

    // Parse the productCollector responses by ierating over each product HTML element
    productCollector.OnHTML("body", func(e *colly.HTMLElement) {
        var reviews []Review
        reviewsScript := e.ChildText("script#reviews-data")
        json.Unmarshal([]byte(reviewsScript), &reviews)
        priceStr := strings.Split(e.ChildText("span.product-price"), "$")[1]
        price, _ := strconv.ParseFloat(priceStr, 64)

        product := Product{
            Name: e.ChildText("h3.card-title"),
            Price: price,
            Currency: "$", // Example currency, replace with actual scraping logic
            Image: e.ChildAttr("img", "src"),
            Description: e.ChildText(".description"),
            Link: e.Request.URL.String(),
            Reviews: reviews,
        }
        products = append(products, product)
    })

    // Start scraping from the main products page
    searchCollector.Visit("https://web-scraping.dev/products")

    // Wait for the collectors to finish
    searchCollector.Wait()
    productCollector.Wait()

    fmt.Println(products)
}

Above, we create a new collector productCollector to request product pages. Similar to the previous collector, we use OnRequest and OnHTML callbacks to log the requested log URLs and parse the retrieved HTML document. Finally, the script waits for both collectors to finish using colly's Wait() method.

Proxies

Websites and anti-bots define rate-limit rules to detect and block IP addresses exceeding their limits. Hence, rotating proxies to split the traffic across multiple IP addresses is essential to scrape at scale.

Colly provides a built-in proxy switcher to randomly change the IP address:

func main() {
    c := colly.NewCollector()

    p, _ := proxy.RoundRobinProxySwitcher(
        "socks5://some_proxy_domain:1234",
        "http://some_proxy_domain:1234",
    )
    c.SetProxyFunc(p)
    // ....
}

The above Colly data scraper uses the popular round-robin proxy rotation algorithm to switch IP addresses. However, a custom implementation can be passed:

var proxies []*url.URL = []*url.URL{
    &url.URL{Host: "socks5://some_proxy_domain:1234"},
    &url.URL{Host: "http://some_proxy_domain:1234"},
}

func randomProxySwitcher(_ *http.Request) (*url.URL, error) {
    return proxies[random.Intn(len(proxies))], nil
}

func main() {
    c := colly.NewCollector()
    c.SetProxyFunc(randomProxySwitcher)
    // ....
}

For further details, refer to our dedicated guide on using proxies for web scraping.

Parallel Scraping

The Go runtime is managed using lightweight threads called _Goroutines-, which allows for efficient parallel execution.

Colly leverages Go's goroutines to request pages in parallel, allowing for a decreased execution time while web scraping at scale. For example, let's crawl a website with and without using parallel execution and calculate the execution time.

Parallel execution:

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    start := time.Now()

    c := colly.NewCollector(
        colly.MaxDepth(3), // Recursion depth of visited URLs.
        colly.Async(true), // Enable asynchronous communication
    )

    // Parallelism can be controlled also by spawning fixed number of go routines
    c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 20})

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://web-scraping.dev/")

    c.Wait()
    end := time.Now()
    duration := end.Sub(start)
    fmt.Printf("Script finished in %v\n", duration)
        // Script finished in 7.1324493s
}

Synchronous execution

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
)

func main() {
    start := time.Now()

    c := colly.NewCollector(
        colly.MaxDepth(3),
    )

    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.Visit("https://web-scraping.dev/")

    end := time.Now()
    duration := end.Sub(start)
    fmt.Printf("Script finished in %v\n", duration)
        // Script finished in 52.3559989s
}

Compared to 52 seconds for the synchronous execution, it only took 7 seconds when using parallel request execution. That's a huge performance boost!

Other Golang Web Scraping Libraries

We have explored web scraping with Go using goquery and htmlquery as parsing packages, and Colly as a crawling framework. Other useful Go packages for web scraping are:

webloop - A Golang headless browser with the WebKit engine, similar to PhantomJS.
golang-selenium - A Selenium Webdriver client for Go.
geziyor - A web scraping and crawling framework similar to Colly with additional JavaScript rendering capabilities.

Powering With Scrapfly

Scrapfly is a web scraping API enabling data extraction at scale by providing:

Anti-scraping protection bypass - For bypassing websites' anti-scraping protection mechanisms, such as Cloudflare.
Millions of residential proxy IPs in over 50 countries - For preventing IP address blocking and throttling while also allowing for scraping from almost any geographical location.
JavaScript rendering - For scraping dynamic web pages through cloud headless browsers without running them yourself.
Easy to use Python and Typescript SDKs, as well as Scrapy integration.
And much more!

ScrapFly service does the heavy lifting for you

Here's how to web scrape with Go using Scrapfly. It's as simple as sending an HTTP request:

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "log"
    "net/http"
    "net/url"
    "strconv"
)

func main() {
    baseURL := "https://api.scrapfly.io/scrape"
    params := url.Values{}
    params.Add("key", "Your Scrapfly API key")
    params.Add("url", "https://web-scraping.dev/products") // Target web page URL
    params.Add("asp", strconv.FormatBool(true)) // Enable anti-scraping protection to bypass blocking
    params.Add("render_js", strconv.FormatBool(true)) // Enable rendering JavaScript (like headless browsers) to scrape dynamic content if needed
    params.Add("country", "us") // Set the proxy location to a specfic country
    params.Add("proxy_pool", "public_residential_pool") // Select the proxy pool
    URL := fmt.Sprintf("%s?%s", baseURL, params.Encode())

    client := &http.Client{}

    req, err := http.NewRequest("GET", URL, nil)
    if err != nil {
        log.Fatal(err)
    }

    res, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer res.Body.Close()

    body, err := io.ReadAll(res.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Parse the JSON response
    var result struct {
        Result struct {
            Content string `json:"content"`
        } `json:"result"`
    }

    err = json.Unmarshal(body, &result)
    if err != nil {
        log.Fatal(err)
    }

    // log the page HTML
    fmt.Println(result.Result.Content)
}

Try for FREE!

FAQ

To wrap up this guide, let's have a look at some frequently asked questions about web scraping with Go.

What is gocolly?

Colly is a web scraping framework for building scrapers and crawlers in Go using collectors and callbacks. It enables data extraction at scale throug a number of features including caching supprot, parallel execution, and automatic cookie and session handling.

What are the pros and cons of using Go for web scraping?

In terms of pros, Golang is known for its high performance and automatic garbage collection for effective memory management. This makes Go suitable for building and managing web scrapers at scale.

As for cons, data parsing and processing with Go can require a steep learning curve for its low-level data structure operations. Moreover, it lacks support for headless browser libraries found in other languages, such as Selenium, Playwright, and Puppeteer.

Go Scraping Summary

In this guide, we introduced using Go for web scraping. We started by going through the steps required to install and set up the Go environment. Then, we detailed the core concepts required to create Golang scrapers: sending HTTP requests, HTML parsing, and crawling.

Lastly, we wrapped up the guide by exploring Colly, the popular Go web crawling framework. We explored its components and how to use it for a real-life web scraping Golang example.

DEV Community