## Why Go for Web Scraping?
Go offers serious advantages for web scraping: compiled binaries with zero dependencies, goroutines for massive concurrency, low memory footprint, and excellent HTTP libraries. If you are building scrapers that need to handle thousands of URLs per minute, Go is worth considering.
Let's compare the three most popular Go scraping libraries.
## Colly: The Batteries-Included Framework
Colly is the most popular Go scraping framework. It handles request queuing, rate limiting, caching, and parallelism out of the box:
```go
package main

import (
	"fmt"
	"log"
	"sync"
	"time"

	"github.com/gocolly/colly/v2"
)

type Product struct {
	Name  string
	Price string
	URL   string
}

func main() {
	c := colly.NewCollector(
		colly.AllowedDomains("example.com"),
		colly.Async(true),
		colly.MaxDepth(2),
	)

	// Rate limiting
	c.Limit(&colly.LimitRule{
		DomainGlob:  "*",
		Parallelism: 4,
		Delay:       2 * time.Second,
	})

	var (
		mu       sync.Mutex // callbacks run concurrently in async mode
		products []Product
	)
	c.OnHTML(".product-card", func(e *colly.HTMLElement) {
		product := Product{
			Name:  e.ChildText(".product-title"),
			Price: e.ChildText(".product-price"),
			URL:   e.Request.AbsoluteURL(e.ChildAttr("a", "href")),
		}
		mu.Lock()
		products = append(products, product)
		mu.Unlock()
	})

	// Follow pagination
	c.OnHTML("a.next-page", func(e *colly.HTMLElement) {
		e.Request.Visit(e.Attr("href"))
	})

	c.OnError(func(r *colly.Response, err error) {
		log.Printf("Error %d: %s", r.StatusCode, err)
	})

	c.Visit("https://example.com/products")
	c.Wait()
	fmt.Printf("Scraped %d products\n", len(products))
}
```
**Best for:** crawling websites with pagination, respecting robots.txt, built-in caching.
## goquery: The jQuery-Style Parser
goquery gives you jQuery-like selectors for HTML parsing. It is lower-level than Colly — you handle HTTP requests yourself:
```go
package main

import (
	"fmt"
	"log"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

func scrapeHackerNews() {
	resp, err := http.Get("https://news.ycombinator.com")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	doc.Find(".titleline > a").Each(func(i int, s *goquery.Selection) {
		title := s.Text()
		href, exists := s.Attr("href")
		if exists {
			fmt.Printf("%d. %s\n   %s\n\n", i+1, title, href)
		}
	})
}

func main() {
	scrapeHackerNews()
}
```
**Best for:** simple page parsing, full control over HTTP, lightweight projects.
## Rod: The Headless Browser Controller
Rod controls Chromium browsers for JavaScript-heavy sites. It is Go's answer to Puppeteer:
```go
package main

import (
	"fmt"
	"time"

	"github.com/go-rod/rod"
	"github.com/go-rod/rod/lib/launcher"
)

func scrapeSPA() {
	// Launch headless browser
	url := launcher.New().Headless(true).MustLaunch()
	browser := rod.New().ControlURL(url).MustConnect()
	defer browser.MustClose()

	page := browser.MustPage("https://spa-website.com")

	// Wait for dynamic content to load
	page.MustWaitLoad()
	page.MustWaitIdle()

	// Wait for a specific element
	page.MustElement(".results-container").MustWaitVisible()

	// Scroll to trigger lazy loading (Eval expects a JS function definition)
	for i := 0; i < 5; i++ {
		page.MustEval(`() => window.scrollTo(0, document.body.scrollHeight)`)
		time.Sleep(2 * time.Second)
	}

	// Extract data
	elements := page.MustElements(".result-item")
	for _, el := range elements {
		title := el.MustElement(".title").MustText()
		price := el.MustElement(".price").MustText()
		fmt.Printf("%s: %s\n", title, price)
	}
}

func main() {
	scrapeSPA()
}
```
**Best for:** JavaScript-rendered sites, SPAs, sites requiring interaction.
## Head-to-Head Comparison
| Feature | Colly | goquery | Rod |
|---|---|---|---|
| JS rendering | No | No | Yes |
| Built-in concurrency | Yes | No | Limited |
| Rate limiting | Built-in | Manual | Manual |
| Memory usage | Low | Very low | High |
| robots.txt | Automatic | Manual | Manual |
| Learning curve | Medium | Easy | Medium |
| Best use case | Crawling | Parsing | SPAs |
## Combining Libraries
The real power comes from combining them. Use Rod to render JS, then goquery to parse:
```go
import (
	"strings"

	"github.com/PuerkitoBio/goquery"
	"github.com/go-rod/rod"
)

func renderAndParse(url string) (*goquery.Document, error) {
	browser := rod.New().MustConnect()
	defer browser.MustClose()

	page := browser.MustPage(url)
	page.MustWaitLoad()

	html := page.MustHTML()
	return goquery.NewDocumentFromReader(strings.NewReader(html))
}
```
## Handling Anti-Bot at Scale
For production Go scrapers:
- Use ScraperAPI as an HTTP proxy to handle blocks automatically
- Rotate residential proxies with ThorData
- Monitor success rates with ScrapeOps
## Which One Should You Pick?
- **Starting a new project?** Start with Colly. Its built-in features save hours.
- **Just parsing HTML?** Use goquery. It is fast and simple.
- **Need JavaScript rendering?** Rod is the most ergonomic pure-Go option (chromedp is the main alternative).
- **Need all three?** Colly for crawling + Rod for JS pages + goquery for parsing.
## Conclusion
Go's concurrency model makes it excellent for web scraping at scale. Colly handles 90% of use cases, goquery gives you surgical precision for parsing, and Rod tackles JavaScript-heavy sites. Choose based on your target sites and scale requirements.