Scrapy πŸ•·οΈ, but in Go: Building High-Performance Scrapers without the Boilerplate

GoScrapy: Harnessing Go's performance for blazingly fast web scraping, inspired by Python's Scrapy framework.

Hi everyone πŸ‘‹

Web scraping can start out pretty basic. You loop through some pages, grab the HTML, pull out what you need, and store it somewhere. But when you try to scale that up, dealing with a ton of requests, retries, and cookies across different sites, it turns into a real hassle fast.

I remember using Scrapy in Python, and it handled all of that without me having to think too hard about the details. It felt structured. When I switched over to Go, I found good tools for the basics, but I missed that kind of ready-made setup. So that's why I ended up building GoScrapy.

What is GoScrapy? (A quick intro)

GoScrapy is a framework that tries to mimic the experience of Scrapy, but in Go. It doesn't just fetch HTML; it manages the whole data-extraction process from start to finish. And it leans on Go's built-in concurrency, so you get fast performance without juggling goroutines yourself all the time. I think that part is what makes it stand out, especially if you are coming from other languages.

Getting started

There is a CLI tool that feels similar to Scrapy's. You install it with go install; it requires Go 1.22 or later.

go install github.com/tech-engine/goscrapy/cmd/...@latest

Then verify the installation with gos -v.

Starting a project is simple too:

gos startproject my_scraper

It sets up everything for you, like the module and files.

gos startproject books_to_scrape

πŸš€ GoScrapy generating project files. Please wait!

πŸ“¦ Initializing Go module: books_to_scrape...
go: creating new go.mod: module books_to_scrape
go: to add module requirements and sums:
        go mod tidy
βœ”οΈ  books_to_scrape\base.go
βœ”οΈ  books_to_scrape\constants.go
βœ”οΈ  books_to_scrape\errors.go
βœ”οΈ  books_to_scrape\job.go
βœ”οΈ  main.go
βœ”οΈ  books_to_scrape\record.go
βœ”οΈ  books_to_scrape\settings.go
βœ”οΈ  books_to_scrape\spider.go

πŸ“¦ Do you want to resolve dependencies now (go mod tidy)? [Y/n]: y
πŸ“¦ Resolving dependencies...

The output shows it scaffolding the project with all the necessary files, so you don't have to create them manually. It even asks if you want to run go mod tidy right away. Pretty handy, though it can take a moment to finish.

Project Structure

Once you have the project, the structure looks like this for books_to_scrape:

books_to_scrape/
β”œβ”€β”€ main.go               # The starting point
└── books_to_scrape/
    β”œβ”€β”€ base.go           # Initializing the engine, middlewares etc
    β”œβ”€β”€ constants.go      # Shared stuff
    β”œβ”€β”€ errors.go         # Custom errors, if you need any
    β”œβ”€β”€ job.go            # Parameters for your spider
    β”œβ”€β”€ record.go         # The structure of data you scrape
    β”œβ”€β”€ settings.go       # Configs like pipelines
    └── spider.go         # Where the actual extraction happens
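record.go is where the shape of your scraped data lives. As a rough sketch of what such a struct might look like for books_to_scrape (the field names and csv tags here are my assumption for illustration, not the generated template; check your own record.go):

```go
package main

import "fmt"

// Record sketches the kind of struct record.go defines: one field per
// value the spider extracts. The field names and csv tags are assumed
// here to illustrate how an exporter could map fields to CSV columns.
type Record struct {
	Title string `csv:"title"`
	Price string `csv:"price"`
	Stock string `csv:"stock"`
}

func main() {
	r := Record{Title: "A Light in the Attic", Price: "Β£51.77", Stock: "In stock"}
	fmt.Println(r.Title, r.Price)
}
```

Keeping the record a plain struct means the rest of the pipeline (CSV export, dedup, etc.) can stay generic over it.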

main.go sets up the context, instantiates the spider, starts the first request, and waits for it to finish. It keeps things straightforward.

// main.go

// this template is autogenerated as well, like all the other files
func main() { 
    ctx := context.Background() 
    spider := books_to_scrape.New(ctx) 
    // start the first request
    spider.StartRequest(ctx, nil) 

    // Wait for the spider to finish. Passing true means the spider
    // exits as soon as it's done. If you are running goscrapy as a
    // server, set it to false (the default) and it will keep running,
    // accepting more jobs via the StartRequest() method.

    if err := spider.Wait(true); err != nil { 
        log.Fatal(err) 
    } 
}

Writing Spiders

spider.go is where you describe how to handle requests and parse responses. There are two main things you use all the time: s.Request and s.Parse.

s.Request(ctx) is what you call for each new fetch. It gives you a request object where you can chain things like Url, Meta, or Headers. Each request has to be its own object; don't try to reuse one across different pages or requests.

s.Parse(req, callback) is what actually sends the request off to the engine. You also tell it which function should handle the response once it comes back. Usually, you start with a StartRequest method that kicks things off.
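Putting those two together, a StartRequest can be sketched roughly like this (a hypothetical fragment based only on the calls shown in this post; the generated spider.go gives you the real skeleton, and the signature matches the spider.StartRequest(ctx, nil) call in main.go):

```go
// spider.go (hypothetical sketch, not the generated template)

// StartRequest kicks off the crawl: build the first request against
// the base URL and hand its response to parse.
func (s *Spider) StartRequest(ctx context.Context, job *Job) {
	req := s.Request(ctx)
	req.Url(s.baseUrl)
	s.Parse(req, s.parse)
}
```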

For example, with books.toscrape.com, the parse function grabs the product links, builds a full URL for each, and hands them off to parseProduct. Then it checks for a next page and, if there is one, feeds it back into parse.

// spider.go

func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
    // grab product links
    for _, productUrl := range resp.Css("article.product_pod h3 a").Attr("href") {
        req := s.Request(ctx)
        req.Url(fmt.Sprintf("%s/%s", s.baseUrl, productUrl))
        s.Parse(req, s.parseProduct)
    }

    // check for next page
    if next := resp.Css("li.next a").Attr("href"); len(next) > 0 {
        req := s.Request(ctx)
        req.Url(fmt.Sprintf("%s/%s", s.baseUrl, next[0]))
        s.Parse(req, s.parse) 
    }
}

func (s *Spider) parseProduct(ctx context.Context, resp core.IResponseReader) {
    product := resp.Css("article.product_page")

    // yield a Record
    s.Yield(&Record{
        Title: product.Css(".product_main h1").Text()[0],
        Price: product.Css(".price_color").Text()[0],
        Stock: product.Css(".availability").Text()[0],
    })
}

It handles pagination quite naturally, but you have to be careful to join the relative links against the correct base URL.

Middlewares and Pipelines

The framework has middlewares and pipelines, like Scrapy. Middlewares for things like retry-with-backoff or a duplicate filter are registered in settings.go as a slice:


// settings.go
var MIDDLEWARES = []middlewaremanager.Middleware{
    middlewares.Retry(), 
    middlewares.DupeFilter,
}

Pipelines take the yielded records and export them, say to CSV. Every record you Yield flows straight through them. It seems efficient for output, though I am not totally sure how it scales with huge datasets yet.

var export2CSV = pipelines.Export2CSV[*Record](pipelines.Export2CSVOpts{ 
    Filename: "output.csv", 
})
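To actually use that exporter, it has to be registered in settings.go alongside the middlewares. The slice name and element type below are my assumption, by analogy with the MIDDLEWARES slice shown earlier; check the generated settings.go for the exact declaration:

```go
// settings.go (sketch; slice name and type assumed, not verified)
var PIPELINES = []pipelinemanager.IPipeline[*Record]{
	export2CSV,
}
```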

TUI Dashboard

There is also a TUI dashboard for watching progress in the terminal. You can wire it up in base.go by creating tui.New(app.Logger()) and a telemetry hub, then starting it in a goroutine.

//base.go

// Tweak base.go to wire up the TUI
dashboard := tui.New(app.Logger())
// create a telemetry hub
hub := ts.NewTelemetryHub()
// add the dashboard as an observer
hub.AddObserver(dashboard)
// set the telemetry hub to the app
app.WithTelemetry(hub)

go func() {
    _ = gos.StartWithTUI(ctx, app, dashboard)
}()

It shows the stats visually, which is nice if you like that kind of feedback.

Anyway, that's the gist of it

GoScrapy is still early, v0.x, and under active development. If you want to check it out or help, feel free.

I think it could be useful for Go folks doing scraping, but it might need more tweaks.

Thank you for reading this far πŸ’š
