agenthustler

Web Scraping with Go: Colly vs goquery vs Rod

Why Go for Web Scraping?

Go offers serious advantages for web scraping: compiled binaries with zero dependencies, goroutines for massive concurrency, low memory footprint, and excellent HTTP libraries. If you are building scrapers that need to handle thousands of URLs per minute, Go is worth considering.

Let's compare the three most popular Go scraping libraries.

Colly: The Batteries-Included Framework

Colly is the most popular Go scraping framework. It handles request queuing, rate limiting, caching, and parallelism out of the box:

package main

import (
    "fmt"
    "log"
    "sync"
    "time"

    "github.com/gocolly/colly/v2"
)

type Product struct {
    Name  string
    Price string
    URL   string
}

func main() {
    c := colly.NewCollector(
        colly.AllowedDomains("example.com"),
        colly.Async(true),
        colly.MaxDepth(2),
    )

    // Rate limiting: at most 4 parallel requests, 2s delay between them
    if err := c.Limit(&colly.LimitRule{
        DomainGlob:  "*",
        Parallelism: 4,
        Delay:       2 * time.Second,
    }); err != nil {
        log.Fatal(err)
    }

    // Async collectors run callbacks concurrently, so guard shared state
    var (
        mu       sync.Mutex
        products []Product
    )

    c.OnHTML(".product-card", func(e *colly.HTMLElement) {
        product := Product{
            Name:  e.ChildText(".product-title"),
            Price: e.ChildText(".product-price"),
            URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
        }
        mu.Lock()
        products = append(products, product)
        mu.Unlock()
    })

    // Follow pagination
    c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        e.Request.Visit(link)
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Printf("Error %d: %s", r.StatusCode, err)
    })

    if err := c.Visit("https://example.com/products"); err != nil {
        log.Fatal(err)
    }
    c.Wait()

    fmt.Printf("Scraped %d products\n", len(products))
}

Best for: crawling sites with pagination, respecting robots.txt, and projects that benefit from built-in caching.

goquery: The jQuery-Style Parser

goquery gives you jQuery-like selectors for HTML parsing. It is lower-level than Colly — you handle HTTP requests yourself:

package main

import (
    "fmt"
    "log"
    "net/http"
    "github.com/PuerkitoBio/goquery"
)

func scrapeHackerNews() {
    resp, err := http.Get("https://news.ycombinator.com")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        log.Fatalf("unexpected status: %s", resp.Status)
    }

    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
        title := s.Text()
        href, exists := s.Attr("href")
        if exists {
            fmt.Printf("%d. %s\n   %s\n\n", i+1, title, href)
        }
    })
}

func main() {
    scrapeHackerNews()
}

Best for: Simple page parsing, when you want full control over HTTP, lightweight projects.

Rod: The Headless Browser Controller

Rod controls Chromium browsers for JavaScript-heavy sites. It is Go's answer to Puppeteer:

package main

import (
    "fmt"
    "time"
    "github.com/go-rod/rod"
    "github.com/go-rod/rod/lib/launcher"
)

func scrapeSPA() {
    // Launch headless browser
    url := launcher.New().Headless(true).MustLaunch()
    browser := rod.New().ControlURL(url).MustConnect()
    defer browser.MustClose()

    page := browser.MustPage("https://spa-website.com")

    // Wait for dynamic content to load
    page.MustWaitLoad()
    page.MustWaitIdle()

    // Wait for specific element
    page.MustElement(".results-container").MustWaitVisible()

    // Scroll to trigger lazy loading
    for i := 0; i < 5; i++ {
        page.MustEval(`window.scrollTo(0, document.body.scrollHeight)`)
        time.Sleep(2 * time.Second)
    }

    // Extract data
    elements := page.MustElements(".result-item")
    for _, el := range elements {
        title := el.MustElement(".title").MustText()
        price := el.MustElement(".price").MustText()
        fmt.Printf("%s: %s\n", title, price)
    }
}

func main() {
    scrapeSPA()
}

Best for: JavaScript-rendered sites, SPAs, sites requiring interaction.

Head-to-Head Comparison

| Feature | Colly | goquery | Rod |
| --- | --- | --- | --- |
| JS rendering | No | No | Yes |
| Built-in concurrency | Yes | No | Limited |
| Rate limiting | Built-in | Manual | Manual |
| Memory usage | Low | Very low | High |
| robots.txt | Automatic | Manual | Manual |
| Learning curve | Medium | Easy | Medium |
| Best use case | Crawling | Parsing | SPAs |

Combining Libraries

The real power comes from combining them. Use Rod to render JS, then goquery to parse:

import (
    "github.com/PuerkitoBio/goquery"
    "github.com/go-rod/rod"
    "strings"
)

func renderAndParse(url string) (*goquery.Document, error) {
    browser := rod.New().MustConnect()
    defer browser.MustClose()

    page := browser.MustPage(url)
    page.MustWaitLoad()

    html := page.MustHTML()
    return goquery.NewDocumentFromReader(strings.NewReader(html))
}

Handling Anti-Bot at Scale

For production Go scrapers:

  • Use ScraperAPI as an HTTP proxy to handle blocks automatically
  • Rotate residential proxies with ThorData
  • Monitor success rates with ScrapeOps

Which One Should You Pick?

  • Starting a new project? Start with Colly. Its built-in features save hours.
  • Just parsing HTML? Use goquery. It is fast and simple.
  • Need JavaScript rendering? Rod is your only option in pure Go.
  • Need all three? Colly for crawling + Rod for JS pages + goquery for parsing.

Conclusion

Go's concurrency model makes it excellent for web scraping at scale. Colly handles 90% of use cases, goquery gives you surgical precision for parsing, and Rod tackles the JavaScript-heavy sites. Choose based on your target sites and scale requirements.
