Most large web scraping projects in Go start by discovering and organizing URLs with a Golang web crawler. A crawler starts from an initial target URL (the "seed URL") and recursively visits the links it finds on each page to discover more pages.
This guide will teach you how to build and optimize a Golang web crawler using real-world examples, with some useful background information before we dive in.
Without further ado, let's see what fun stuff we have for you today!
What Is Web Crawling?
Web crawling, at its core, involves systematically navigating websites to extract useful data. A web crawler (often called a spider) fetches web pages, parses their content, and processes the information to meet specific goals, such as indexing or data aggregation. Let’s break it down:
Web crawlers send HTTP requests to retrieve web pages from servers and process the responses. Think of it like a polite handshake between your crawler and the website—“Hello, may I take your data for a spin?”
Once the page is fetched, the crawler extracts relevant data by parsing the HTML. DOM structures help break the page into manageable chunks, and CSS selectors act like a precise pair of tweezers, plucking out the elements you need.
Most websites spread their data across multiple pages. Crawlers need to navigate this labyrinth of pagination, all while respecting rate limits to avoid looking like a data-hungry bot on steroids.
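To make this fetch-parse-extract cycle concrete, here is a minimal sketch in Go that fetches a single page and pulls out every link with a CSS selector. It assumes the third-party goquery package (github.com/PuerkitoBio/goquery) and uses https://example.com as a placeholder URL:
package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Fetch the page (the "polite handshake").
    resp, err := http.Get("https://example.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the HTML into a queryable DOM.
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Use a CSS selector to pluck out every link on the page.
    doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        fmt.Println(href)
    })
}
In a real crawler, the extracted links would be queued, deduplicated, and fetched in turn; the later sections of this guide cover those pieces.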
Why Golang is Perfect for Web Crawling in 2025
If web crawling were a race, Golang would be the sports car you'd want in your garage. Its unique features make it the go-to language for modern web crawlers.
- Concurrency: Golang's goroutines allow you to make multiple requests simultaneously, scraping data faster than you can say "parallel processing."
- Simplicity: The language's clean syntax and minimalistic design keep your codebase manageable, even when tackling complex crawling projects.
- Performance: Golang is compiled, meaning it runs blazing fast—ideal for handling large-scale web crawling tasks without breaking a sweat.
- Robust Standard Library: With built-in support for HTTP requests, JSON parsing, and more, Golang's standard library equips you with everything you need to build a crawler from scratch.
Why struggle with clunky tools when you can fly through data like a Gopher on caffeine? Golang combines speed, simplicity, and power, making it the ultimate choice for crawling the web efficiently and effectively.
Is Web Crawling Legal?
The legality of web crawling is not a one-size-fits-all situation. It depends on how, where, and why you're crawling. While scraping public data is generally permissible, violating terms of service or bypassing anti-scraping measures can lead to legal troubles.
To stay on the right side of the law, here are a few golden rules:
- Respect the robots.txt directives (a minimal check is sketched after this list).
- Avoid scraping sensitive or restricted information.
- Seek permission if you're unsure about a website’s policy.
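As a starting point, here is a deliberately naive robots.txt check that uses only the standard library; the site URL is a placeholder, and a production crawler should rely on a dedicated robots.txt parsing library instead:
package main

import (
    "bufio"
    "fmt"
    "log"
    "net/http"
    "strings"
)

// disallowedPaths fetches robots.txt from the site and returns the
// Disallow rules that apply to all user agents ("User-agent: *").
// This is intentionally simplistic and ignores wildcards and Allow rules.
func disallowedPaths(site string) ([]string, error) {
    resp, err := http.Get(strings.TrimRight(site, "/") + "/robots.txt")
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()

    var rules []string
    applies := false
    scanner := bufio.NewScanner(resp.Body)
    for scanner.Scan() {
        line := strings.TrimSpace(scanner.Text())
        lower := strings.ToLower(line)
        switch {
        case strings.HasPrefix(lower, "user-agent:"):
            agent := strings.TrimSpace(line[len("user-agent:"):])
            applies = agent == "*"
        case applies && strings.HasPrefix(lower, "disallow:"):
            path := strings.TrimSpace(line[len("disallow:"):])
            if path != "" {
                rules = append(rules, path)
            }
        }
    }
    return rules, scanner.Err()
}

func main() {
    rules, err := disallowedPaths("https://www.example.com")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("Disallowed for *:", rules)
}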
How to Build Your First Golang Web Crawler?
Prerequisites
- Make sure you have the latest version of Go installed. You can download the installation package from the official Golang website and follow the instructions to install it.
- Choose your preferred IDE. This tutorial uses GoLand as the editor.
- Choose a Go web scraping library you are comfortable with. In this example, we will use chromedp.
To verify your Go installation, enter the following command in the terminal:
go version
If the installation is successful, you will see output similar to the following:
go version go1.23.4 windows/amd64
Create your working directory, and once inside, enter the following commands:
- Initialize the Go module:
go mod init crawl
- Install the chromedp dependency:
go get github.com/chromedp/chromedp
- Create the crawl.go file.
Now you can start writing your web crawler code.
Get Page Elements
Visit Lazada, and use the browser's developer tools (F12) to easily identify the page elements and selectors you need.
Get the input field and the search button next to it:
func searchProduct(keyword string) chromedp.Tasks {
    return chromedp.Tasks{
        // Wait for the search input to become visible.
        chromedp.WaitVisible(`input[type=search]`, chromedp.ByQuery),
        // Type the product keyword into the search box.
        chromedp.SendKeys("input[type=search]", keyword, chromedp.ByQuery),
        // Click the search button.
        chromedp.Click(".search-box__button--1oH7", chromedp.ByQuery),
    }
}
Get the price, title, and image elements from the product list:
func getNodes(nodes *[]*cdp.Node) chromedp.Tasks {
    return chromedp.Tasks{
        chromedp.WaitReady(`div[data-spm=list] .Bm3ON`, chromedp.ByQuery),
        scrollToBottom(), // Scroll to the bottom to ensure that all page elements are rendered.
        chromedp.Nodes("div[data-spm=list] .Bm3ON", nodes, chromedp.ByQueryAll),
    }
}

func scrollToBottom() chromedp.Action {
    return chromedp.ActionFunc(func(ctx context.Context) error {
        for i := 0; i < 4; i++ { // Scroll repeatedly to ensure that all pictures are loaded.
            _ = chromedp.Evaluate("window.scrollBy(0, window.innerHeight);", nil).Do(ctx)
            time.Sleep(1 * time.Second)
        }
        return nil
    })
}

func getProductData(ctx context.Context, node *cdp.Node) (*Product, error) {
    product := new(Product)
    err := chromedp.Run(ctx,
        chromedp.WaitVisible(".Bm3ON img[type=product]", chromedp.ByQuery, chromedp.FromNode(node)),
        chromedp.AttributeValue(".Bm3ON img[type=product]", "src", &product.Img, nil, chromedp.ByQuery, chromedp.FromNode(node)),
        chromedp.AttributeValue(".Bm3ON .RfADt a", "title", &product.Title, nil, chromedp.ByQuery, chromedp.FromNode(node)),
        chromedp.TextContent(".Bm3ON .aBrP0 span", &product.Price, chromedp.ByQuery, chromedp.FromNode(node)),
    )
    if err != nil {
        return nil, err
    }
    return product, nil
}
Set Up Chrome Environment
For easier debugging, we can launch a Chrome browser in PowerShell and specify a remote debugging port.
chrome.exe --remote-debugging-port=9223
We can access http://localhost:9223/json/list to retrieve the browser's exposed remote debugging address, webSocketDebuggerUrl.
[
{
"description": "",
"devtoolsFrontendUrl": "/devtools/inspector.html?ws=localhost:9223/devtools/page/85CE4D807D11D30D5F22C1AA52461080",
"id": "85CE4D807D11D30D5F22C1AA52461080",
"title": "localhost:9223/json/list",
"type": "page",
"url": "http://localhost:9223/json/list",
"webSocketDebuggerUrl": "ws://localhost:9223/devtools/page/85CE4D807D11D30D5F22C1AA52461080"
}
]
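Copying webSocketDebuggerUrl by hand works, but you can also fetch it programmatically. Below is a small sketch using only the standard library; it assumes Chrome is already running locally with --remote-debugging-port=9223:
package main

import (
    "encoding/json"
    "fmt"
    "log"
    "net/http"
)

// debugTarget mirrors the fields we care about from /json/list.
type debugTarget struct {
    Type                 string `json:"type"`
    WebSocketDebuggerURL string `json:"webSocketDebuggerUrl"`
}

func main() {
    resp, err := http.Get("http://localhost:9223/json/list")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var targets []debugTarget
    if err := json.NewDecoder(resp.Body).Decode(&targets); err != nil {
        log.Fatal(err)
    }

    // Print the debugger URL of the first page target.
    for _, t := range targets {
        if t.Type == "page" {
            fmt.Println(t.WebSocketDebuggerURL)
            return
        }
    }
    log.Fatal("no page target found")
}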
Run the Code
The full code is as follows:
package main

import (
    "context"
    "encoding/json"
    "flag"
    "log"
    "time"

    "github.com/chromedp/cdproto/cdp"
    "github.com/chromedp/chromedp"
)

type Product struct {
    ID    string `json:"id"`
    Img   string `json:"img"`
    Price string `json:"price"`
    Title string `json:"title"`
}
func main() {
    product := flag.String("product", "iphone15", "your product keyword")
    url := flag.String("webSocketDebuggerUrl", "", "your websocket url")
    flag.Parse()

    // Attach to an existing Chrome instance if a debugger URL was supplied,
    // otherwise let chromedp launch its own browser.
    var baseCxt context.Context
    if *url != "" {
        baseCxt, _ = chromedp.NewRemoteAllocator(context.Background(), *url)
    } else {
        baseCxt = context.Background()
    }

    ctx, cancel := chromedp.NewContext(
        baseCxt,
    )
    defer cancel()

    var nodes []*cdp.Node
    err := chromedp.Run(ctx,
        chromedp.Navigate(`https://www.lazada.com.my/`),
        searchProduct(*product),
        getNodes(&nodes),
    )
    // Check the navigation/search error before processing any nodes.
    if err != nil {
        log.Fatal(err)
    }

    products := make([]*Product, 0, len(nodes))
    for _, v := range nodes {
        product, err := getProductData(ctx, v)
        if err != nil {
            log.Println("run node task err: ", err)
            continue
        }
        products = append(products, product)
    }

    jsonData, _ := json.Marshal(products)
    log.Println(string(jsonData))
}
func searchProduct(keyword string) chromedp.Tasks {
    return chromedp.Tasks{
        // Wait for the search input to become visible.
        chromedp.WaitVisible(`input[type=search]`, chromedp.ByQuery),
        // Type the product keyword into the search box.
        chromedp.SendKeys("input[type=search]", keyword, chromedp.ByQuery),
        // Click the search button.
        chromedp.Click(".search-box__button--1oH7", chromedp.ByQuery),
    }
}

func getNodes(nodes *[]*cdp.Node) chromedp.Tasks {
    return chromedp.Tasks{
        chromedp.WaitReady(`div[data-spm=list] .Bm3ON`, chromedp.ByQuery),
        scrollToBottom(), // Scroll to the bottom to ensure that all page elements are rendered.
        chromedp.Nodes("div[data-spm=list] .Bm3ON", nodes, chromedp.ByQueryAll),
    }
}

func scrollToBottom() chromedp.Action {
    return chromedp.ActionFunc(func(ctx context.Context) error {
        for i := 0; i < 4; i++ { // Scroll repeatedly to ensure that all pictures are loaded.
            _ = chromedp.Evaluate("window.scrollBy(0, window.innerHeight);", nil).Do(ctx)
            time.Sleep(1 * time.Second)
        }
        return nil
    })
}

func getProductData(ctx context.Context, node *cdp.Node) (*Product, error) {
    product := new(Product)
    err := chromedp.Run(ctx,
        chromedp.WaitVisible(".Bm3ON img[type=product]", chromedp.ByQuery, chromedp.FromNode(node)),
        chromedp.AttributeValue(".Bm3ON img[type=product]", "src", &product.Img, nil, chromedp.ByQuery, chromedp.FromNode(node)),
        chromedp.AttributeValue(".Bm3ON .RfADt a", "title", &product.Title, nil, chromedp.ByQuery, chromedp.FromNode(node)),
        chromedp.TextContent(".Bm3ON .aBrP0 span", &product.Price, chromedp.ByQuery, chromedp.FromNode(node)),
    )
    if err != nil {
        return nil, err
    }
    product.ID = node.AttributeValue("data-item-id")
    return product, nil
}
By running the following command, you will retrieve the scraped data results:
go run .\crawl.go -product="YOUR_KEYWORD" -webSocketDebuggerUrl="YOUR_WEBSOCKETDEBUGGERURL"
[
{
"id": "3792910846",
"img": "https://img.lazcdn.com/g/p/a79df6b286a0887038c16b7600e38f4f.png_200x200q75.png_.webp",
"price": "RM3,809.00",
"title": "Apple iPhone 15"
},
{
"id": "3796593281",
"img": "https://img.lazcdn.com/g/p/627828b5fa28d708c5b093028cd06069.png_200x200q75.png_.webp",
"price": "RM3,319.00",
"title": "Apple iPhone 15"
},
{
"id": "3794514070",
"img": "https: //img.lazcdn.com/g/p/6f4ddc2693974398666ec731a713bcfd.jpg_200x200q75.jpg_.webp",
"price": "RM3,499.00",
"title": "Apple iPhone 15"
},
{
"id": "3796440931",
"img": "https://img.lazcdn.com/g/p/8df101af902d426f3e3a9748bafa7513.jpg_200x200q75.jpg_.webp",
"price": "RM4,399.00",
"title": "Apple iPhone 15"
},
......
{
"id": "3793164816",
"img": "https://img.lazcdn.com/g/p/b6c3498f75f1215f24712a25799b0d19.png_200x200q75.png_.webp",
"price": "RM3,799.00",
"title": "Apple iPhone 15"
},
{
"id": "3793322260",
"img": "https: //img.lazcdn.com/g/p/67199db1bd904c3b9b7ea0ce32bc6ace.png_200x200q75.png_.webp",
"price": "RM5,644.00",
"title": "[Ready Stock] Apple iPhone 15 Pro"
},
{
"id": "3796624559",
"img": "https://img.lazcdn.com/g/p/81a814a9c829afa200fbc691c9a0c30c.png_200x200q75.png_.webp",
"price": "RM6,679.00",
"title": "Apple iPhone 15 Pro (1TB)"
}
]
Advanced Techniques for Scaling Your Web Crawler
A basic crawler will only get you so far. To gather data effectively without being blocked or overloaded, you must implement techniques that balance speed, reliability, and resource optimization.
Let’s explore some advanced strategies to ensure your crawler excels under heavy workloads.
Pace Your Requests and Reuse Sessions
When web crawling, bombarding a server with too many requests in a short time is a surefire way to get detected and banned. Websites often monitor the frequency of requests from the same client, and a sudden spike can trigger anti-bot mechanisms. The example below reuses a single HTTP client and waits a fixed interval between requests to keep the crawl polite:
package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

func main() {
    // Create a reusable HTTP client so connections can be kept alive.
    client := &http.Client{}

    // URLs to crawl
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    }

    // Interval between requests (e.g., 2 seconds)
    requestInterval := 2 * time.Second

    for _, url := range urls {
        // Create a new HTTP request
        req, err := http.NewRequest("GET", url, nil)
        if err != nil {
            fmt.Println("Error:", err)
            continue
        }
        req.Header.Set("User-Agent", "Mozilla/5.0 (compatible; WebCrawler/1.0)")

        // Send the request
        resp, err := client.Do(req)
        if err != nil {
            fmt.Println("Error:", err)
            continue
        }

        // Read and print the response
        body, _ := io.ReadAll(resp.Body)
        fmt.Printf("Response from %s:\n%s\n", url, body)
        resp.Body.Close()

        // Wait before sending the next request
        time.Sleep(requestInterval)
    }
}
Avoid Duplicate Links
There's nothing worse than wasting resources crawling the same URL twice. Implement a robust deduplication system by maintaining a URL set (e.g., a hash map or a Redis database) to track already-visited pages. Not only does this save bandwidth, but it also ensures your crawler operates efficiently and doesn’t miss new pages.
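Here is a minimal in-memory sketch of such a deduplication set (the visitedSet type and MarkVisited method are illustrative names, not from any library); a mutex keeps it safe across goroutines, and a Redis set or bloom filter could replace the map for very large crawls:
package main

import (
    "fmt"
    "sync"
)

// visitedSet is a simple thread-safe URL deduplication set.
type visitedSet struct {
    mu   sync.Mutex
    seen map[string]bool
}

func newVisitedSet() *visitedSet {
    return &visitedSet{seen: make(map[string]bool)}
}

// MarkVisited returns true the first time a URL is seen, false afterwards.
func (v *visitedSet) MarkVisited(url string) bool {
    v.mu.Lock()
    defer v.mu.Unlock()
    if v.seen[url] {
        return false
    }
    v.seen[url] = true
    return true
}

func main() {
    set := newVisitedSet()
    urls := []string{"https://example.com/a", "https://example.com/b", "https://example.com/a"}
    for _, u := range urls {
        if set.MarkVisited(u) {
            fmt.Println("crawl:", u)
        } else {
            fmt.Println("skip duplicate:", u)
        }
    }
}
Normalizing URLs (lowercasing the host, stripping fragments and tracking parameters) before calling MarkVisited makes deduplication even more effective.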
Proxy Management to Avoid IP Bans
Scraping at scale often triggers anti-bot measures, leading to IP bans. To avoid this, integrate proxy rotation into your crawler.
- Use proxy pools to distribute requests across multiple IPs.
- Rotate proxies dynamically to make your requests appear as if they originate from different users and locations (see the sketch after this list).
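Here is a round-robin rotation sketch built on the standard library's http.Transport; the proxy addresses are placeholders for your own pool, and https://httpbin.org/ip is just a convenient endpoint for checking which IP the request appears to come from:
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "net/url"
    "sync/atomic"
)

// proxyPool rotates through a list of proxy URLs in round-robin order.
type proxyPool struct {
    proxies []*url.URL
    next    uint64
}

// pick satisfies the http.Transport.Proxy signature and returns the next proxy.
func (p *proxyPool) pick(_ *http.Request) (*url.URL, error) {
    n := atomic.AddUint64(&p.next, 1)
    return p.proxies[n%uint64(len(p.proxies))], nil
}

func main() {
    raw := []string{
        "http://user:pass@proxy1.example.com:8080", // placeholder proxies
        "http://user:pass@proxy2.example.com:8080",
    }
    pool := &proxyPool{}
    for _, r := range raw {
        u, err := url.Parse(r)
        if err != nil {
            log.Fatal(err)
        }
        pool.proxies = append(pool.proxies, u)
    }

    // Each new connection picks the next proxy in the pool.
    client := &http.Client{Transport: &http.Transport{Proxy: pool.pick}}

    resp, err := client.Get("https://httpbin.org/ip")
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    defer resp.Body.Close()
    body, _ := io.ReadAll(resp.Body)
    fmt.Println(string(body))
}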
Prioritize Specific Pages
Prioritizing specific pages helps streamline your crawling process and lets you focus on the links that matter. In the crawler above, we used CSS selectors to target just the elements that carry valuable product information.
However, if you are interested in all links on a page and want to prioritize pagination, you can maintain a separate queue and process pagination links first, as the following Colly-based example does.
package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

// ...

// Separate pagination links from other links, and track visited URLs.
var paginationURLs = []string{}
var otherURLs = []string{}
var visitedurls = make(map[string]bool)

func main() {
    // ...
}

func crawl(currenturl string, maxdepth int) {
    // ... (c is the colly collector created in the elided setup code)

    // ----- find and visit all links ----- //
    // Select the href attribute of all anchor tags.
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        // Get the absolute URL.
        link := e.Request.AbsoluteURL(e.Attr("href"))
        // Check whether the URL has already been visited.
        if link != "" && !visitedurls[link] {
            // Mark the URL as visited.
            visitedurls[link] = true
            if e.Attr("class") == "page-numbers" {
                paginationURLs = append(paginationURLs, link)
            } else {
                otherURLs = append(otherURLs, link)
            }
        }
    })

    // ...

    // Process pagination links first.
    for len(paginationURLs) > 0 {
        nextURL := paginationURLs[0]
        paginationURLs = paginationURLs[1:]
        visitedurls[nextURL] = true
        err := c.Visit(nextURL)
        if err != nil {
            fmt.Println("Error visiting page:", err)
        }
    }

    // Then process the remaining links.
    for len(otherURLs) > 0 {
        nextURL := otherURLs[0]
        otherURLs = otherURLs[1:]
        visitedurls[nextURL] = true
        err := c.Visit(nextURL)
        if err != nil {
            fmt.Println("Error visiting page:", err)
        }
    }
}
Scrapeless Scraping API: Effective Crawling Tool
Why is Scrapeless Scraping API more Ideal?
Scrapeless Scraping API is designed to simplify the process of extracting data from websites. It can navigate the most complex web environments, effectively managing dynamic content and JavaScript rendering.
Besides, Scrapeless Scraping API leverages a global network that spans 195 countries, supported by access to over 70 million residential IPs. With a 99.9% uptime and exceptional success rates, Scrapeless easily overcomes challenges like IP blocks and CAPTCHA, making it a robust solution for complex web automation and AI-driven data collection.
With our advanced Scraping API, you can access the data you need without having to write or maintain complex scraping scripts!
Advantages of Scraping API
Scrapeless supports high-performance JavaScript rendering, allowing it to handle dynamic content (such as data loaded via AJAX or JavaScript) and scrape modern websites that rely on JS for content delivery.
- Affordable Pricing: Scrapeless is designed to offer exceptional value.
- Stability and Reliability: With a proven track record, Scrapeless provides steady API responses, even under high workloads.
- High Success Rates: Say goodbye to failed extractions; Scrapeless promises 99.99% successful access to Google SERP data.
- Scalability: Handle thousands of queries effortlessly, thanks to the robust infrastructure behind Scrapeless.
Get the cheap and powerful Scrapeless Scraping API now!
Scrapeless offers a reliable and scalable web scraping platform at competitive prices, ensuring excellent value for its users:
- Scraping Browser: From $0.09 per hour
- Scraping API: From $1.00 per 1k URLs
- Web Unlocker: $0.20 per 1k URLs
- Captcha Solver: From $0.80 per 1k URLs
- Proxies: $2.80 per GB
By subscribing, you can enjoy discounts of up to 20% on each service. Do you have specific requirements? Contact us today, and we'll provide even greater savings tailored to your needs!
How to use Scrapeless Scraping API?
It is very easy to use the Scrapeless Scraping API to crawl Lazada data; a single request gets you all the data you want. To call the Scrapeless API quickly, follow these steps:
- Step 1. Log in to Scrapeless
- Step 2. Click the "Scraping API"
- Step 3. Find our "Lazada" API and enter it:
- Step 4. Fill in the complete information of the product you want to crawl in the operation box on the left. When you used chromedp to crawl data earlier, you already obtained the ID of each product; now you only need to add that ID to the itemId parameter to get more detailed data about the product, then select the expression language you want:
- Step 5. Click "Start Scraping", and the product's crawling results will appear in the preview box on the right:
You can refer to our Golang sample code, or visit our API documentation for other languages.
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

type Payload struct {
    Actor string         `json:"actor"`
    Input map[string]any `json:"input"`
    Proxy map[string]any `json:"proxy"`
}

func sendRequest() error {
    host := "api.scrapeless.com"
    url := fmt.Sprintf("https://%s/api/v1/scraper/request", host)
    token := "YOUR_TOKEN"

    headers := map[string]string{"x-api-token": token}

    inputData := map[string]any{
        "itemId": "3792910846", // Replace with the itemId you want to obtain.
        "site":   "my",
    }

    proxy := map[string]any{
        "country": "ANY",
    }

    payload := Payload{
        Actor: "scraper.lazada",
        Input: inputData,
        Proxy: proxy,
    }

    jsonPayload, err := json.Marshal(payload)
    if err != nil {
        return err
    }

    req, err := http.NewRequest("POST", url, bytes.NewBuffer(jsonPayload))
    if err != nil {
        return err
    }

    for key, value := range headers {
        req.Header.Set(key, value)
    }

    client := &http.Client{}
    resp, err := client.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    body, err := io.ReadAll(resp.Body)
    if err != nil {
        return err
    }

    fmt.Printf("body %s\n", string(body))
    return nil
}

func main() {
    err := sendRequest()
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
}
After running, you will get the following detailed data, including links, images, SKUs, reviews, same-seller items, and more.
Due to space constraints, we only show a portion of the crawling results here. You can visit our Dashboard and get a Free Trial to quickly crawl and get the full crawling results!
{
"Breadcrumb": [
{
"title": "Mobiles & Tablets",
"url": "https://www.lazada.com.my/shop-mobiles-tablets/"
},
{
"title": "Smartphones",
"url": "https://www.lazada.com.my/shop-mobiles/"
},
{
"title": "Apple iPhone 15"
}
],
"deliveryOptions": {
"21911329880": [
{
"badge": false,
"dataType": "delivery",
"deliveryWorkTimeMax": "2025-01-27T23:27+08:00[GMT+08:00]",
"deliveryWorkTimeMin": "2025-01-24T23:27+08:00[GMT+08:00]",
"description": "For local items, you can expect your item within 2-4 working days. <br/>Shipping fee is determined by the total size/weight of the products purchased from the seller.<br/><br/><a href=\"https://www.lazada.com.my/helpcenter/shipping_delivery/#answer-faq-whatisshippingfee-ans\" target=\"_blank\">Find out more</a>",
"duringTime": "Guaranteed by 24-27 Jan",
"fee": "RM4.90",
"feeValue": 4.9,
"hasTip": true,
"title": "Standard Delivery",
"type": "standard"
},
{
"badge": true,
"dataType": "service",
"description": "",
"feeValue": 0,
"hasTip": true,
"title": "Cash on Delivery not available",
"type": "noCOD"
}
],
"21911329881": [
{
"badge": false,
"dataType": "delivery",
"deliveryWorkTimeMax": "2025-01-27T23:27+08:00[GMT+08:00]",
"deliveryWorkTimeMin": "2025-01-24T23:27+08:00[GMT+08:00]",
"description": "For local items, you can expect your item within 2-4 working days. <br/>Shipping fee is determined by the total size/weight of the products purchased from the seller.<br/><br/><a href=\"https://www.lazada.com.my/helpcenter/shipping_delivery/#answer-faq-whatisshippingfee-ans\" target=\"_blank\">Find out more</a>",
"duringTime": "Guaranteed by 24-27 Jan",
"fee": "RM4.90",
"feeValue": 4.9,
"hasTip": true,
"title": "Standard Delivery",
"type": "standard"
},
{
"badge": true,
"dataType": "service",
"description": "",
"feeValue": 0,
"hasTip": true,
"title": "Cash on Delivery not available",
"type": "noCOD"
}
],
...
Further readings
- Full Steps to Scrape Shopee Product Details
- How to Scrape Google Trends Data Quickly and Easily?
- Get the Steps for Tracking Cheap Flights using Google Flights API!
Golang Crawling Best Practices and Considerations
Parallel Crawling and Concurrency
Scraping pages one at a time is inefficient: with only a single goroutine handling tasks, your crawler spends most of its time waiting for responses and processing data before it can move on to the next page.
Taking advantage of Go's concurrency features to crawl pages in parallel can significantly reduce your overall crawling time.
However, you must manage concurrency carefully to avoid overwhelming the target server and triggering anti-bot restrictions.
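As a concrete illustration, here is a minimal sketch of bounded concurrency using only the standard library. A buffered channel acts as a semaphore so only a few requests run at once; the URLs and the limit of 2 are placeholders you would tune for your target site:
package main

import (
    "fmt"
    "net/http"
    "sync"
)

func main() {
    urls := []string{
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
        "https://example.com/page4",
    }

    // A buffered channel works as a semaphore limiting concurrent requests.
    const maxConcurrent = 2
    sem := make(chan struct{}, maxConcurrent)

    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go func(u string) {
            defer wg.Done()
            sem <- struct{}{}        // acquire a slot
            defer func() { <-sem }() // release the slot

            resp, err := http.Get(u)
            if err != nil {
                fmt.Println("Error:", err)
                return
            }
            resp.Body.Close()
            fmt.Println(u, "->", resp.Status)
        }(url)
    }
    wg.Wait()
}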
Crawling JavaScript-Rendered Pages in Go
Although Colly is a great web crawling tool with many built-in features, it cannot crawl JavaScript-rendered pages (dynamic content). It can only fetch and parse static HTML, and dynamic content does not exist in a website's static HTML.
However, you can integrate with a headless browser or a JavaScript engine to crawl dynamic content.
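Building on the chromedp example above, one common pattern is to let a headless browser render the page and then hand the final HTML to a static parser. The sketch below assumes the third-party goquery package for CSS selectors and uses https://example.com as a placeholder:
package main

import (
    "context"
    "fmt"
    "log"
    "strings"
    "time"

    "github.com/PuerkitoBio/goquery"
    "github.com/chromedp/chromedp"
)

func main() {
    ctx, cancel := chromedp.NewContext(context.Background())
    defer cancel()
    ctx, cancel = context.WithTimeout(ctx, 60*time.Second)
    defer cancel()

    // Let headless Chrome render the page, then grab the final HTML.
    var html string
    err := chromedp.Run(ctx,
        chromedp.Navigate("https://example.com"),
        chromedp.OuterHTML("html", &html, chromedp.ByQuery),
    )
    if err != nil {
        log.Fatal(err)
    }

    // Parse the rendered HTML with a static parser and CSS selectors.
    doc, err := goquery.NewDocumentFromReader(strings.NewReader(html))
    if err != nil {
        log.Fatal(err)
    }
    doc.Find("a[href]").Each(func(i int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        fmt.Println(href)
    })
}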
Ending Thoughts
You have learned how to build a Golang web crawler, from a basic chromedp scraper to techniques for scaling it up. Keep in mind that while building your own crawler is a great starting point, you must overcome anti-bot measures to access modern websites.
Rather than struggling with manual configuration that is likely to fail, consider the Scrapeless Scraping API, the most reliable solution for bypassing any anti-bot system.