## Why Go for Web Scraping?
Go offers serious advantages for web scraping: compiled binaries with zero dependencies, goroutines for massive concurrency, low memory footprint, and excellent HTTP libraries. If you are building scrapers that need to handle thousands of URLs per minute, Go is worth considering.
Let's compare the three most popular Go scraping libraries.
## Colly: The Batteries-Included Framework
Colly is the most popular Go scraping framework. It handles request queuing, rate limiting, caching, and parallelism out of the box:
```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/gocolly/colly/v2"
)

type Product struct {
	Name  string
	Price string
	URL   string
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
		colly.Async(true),
		colly.MaxDepth(2),
	)

	// Rate limiting
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 4,
		Delay:       2 * time.Second,
	})

	var (
		mu       sync.Mutex // callbacks run concurrently in async mode
		products []Product
	)
	c.OnHTML(".product-card", func(e *colly.HTMLElement) {
		product := Product{
			Name:  e.ChildText(".product-title"),
			Price: e.ChildText(".product-price"),
			URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
		}
		mu.Lock()
		products = append(products, product)
		mu.Unlock()
	})

	// Follow pagination
	c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnError(func(r *colly.Response, err error) {
		log.Printf("Error %d: %s", r.StatusCode, err)
	})

	c.Visit("https://example.com/products")
	c.Wait()
	fmt.Printf("Scraped %d products\n", len(products))
}
```
**Best for:** crawling websites with pagination, respecting robots.txt, built-in caching.
## goquery: The jQuery-Style Parser
goquery gives you jQuery-like selectors for HTML parsing. It is lower-level than Colly — you handle HTTP requests yourself:
```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func scrapeHackerNews() {
	resp, err := http.Get("https://news.ycombinator.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
		title := s.Text()
		href, exists := s.Attr("href")
		if exists {
			fmt.Printf("%d. %s\n   %s\n\n", i+1, title, href)
		}
	})
}

func main() {
	scrapeHackerNews()
}
```
**Best for:** simple page parsing, full control over HTTP, lightweight projects.
## Rod: The Headless Browser Controller
Rod controls Chromium browsers for JavaScript-heavy sites. It is Go's answer to Puppeteer:
```go
package main

import (
	"fmt"
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/launcher"
)

func scrapeSPA() {
	// Launch headless browser
	url := launcher.New().Headless(true).MustLaunch()
	browser := rod.New().ControlURL(url).MustConnect()
	defer browser.MustClose()

	page := browser.MustPage("https://spa-website.com")

	// Wait for dynamic content to load
	page.MustWaitLoad()
	page.MustWaitIdle()

	// Wait for a specific element
	page.MustElement(".results-container").MustWaitVisible()

	// Scroll to trigger lazy loading (Eval expects a JS function definition)
	for i := 0; i < 5; i++ {
		page.MustEval(`() => window.scrollTo(0, document.body.scrollHeight)`)
		time.Sleep(2 * time.Second)
	}

	// Extract data
	elements := page.MustElements(".result-item")
	for _, el := range elements {
		title := el.MustElement(".title").MustText()
		price := el.MustElement(".price").MustText()
		fmt.Printf("%s: %s\n", title, price)
	}
}

func main() {
	scrapeSPA()
}
```
**Best for:** JavaScript-rendered sites, SPAs, sites requiring interaction.
## Head-to-Head Comparison
| Feature | Colly | goquery | Rod |
|---|---|---|---|
| JS rendering | No | No | Yes |
| Built-in concurrency | Yes | No | Limited |
| Rate limiting | Built-in | Manual | Manual |
| Memory usage | Low | Very low | High |
| robots.txt | Automatic | Manual | Manual |
| Learning curve | Medium | Easy | Medium |
| Best use case | Crawling | Parsing | SPAs |
## Combining Libraries
The real power comes from combining them. Use Rod to render JS, then goquery to parse:
```go
import (
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/go-rod/rod"
)

func renderAndParse(url string) (*goquery.Document, error) {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage(url)
	page.MustWaitLoad()

	html := page.MustHTML()
	return goquery.NewDocumentFromReader(strings.NewReader(html))
}
```
## Handling Anti-Bot at Scale
For production Go scrapers:
- Use ScraperAPI as an HTTP proxy to handle blocks automatically
- Rotate residential proxies with ThorData
- Monitor success rates with ScrapeOps
## Which One Should You Pick?
- **Starting a new project?** Start with Colly. Its built-in features save hours.
- **Just parsing HTML?** Use goquery. It is fast and simple.
- **Need JavaScript rendering?** Rod is the most ergonomic pure-Go option (chromedp is the main alternative).
- **Need all three?** Colly for crawling + Rod for JS pages + goquery for parsing.
## Conclusion
Go's concurrency model makes it excellent for web scraping at scale. Colly handles 90% of use cases, goquery gives you surgical precision for parsing, and Rod tackles JavaScript-heavy sites. Choose based on your target sites and scale requirements.