Hi everyone 👋
Web scraping can start out pretty basic: you loop through some pages, grab the HTML, pull out what you need, and store it somewhere. But when you try to scale it up, handling a ton of requests, or figuring out retries and cookies for different sites, it turns into a real hassle quickly.
I remember using Scrapy in Python, and it handled all that stuff without me having to think too hard about the details. It felt structured, you know. When I switched over to Go, I found good tools for the basics, but I missed that kind of ready-made setup. So that's why I ended up building GoScrapy.
What is GoScrapy? (A quick intro)
GoScrapy is basically a framework that tries to recreate the Scrapy experience in Go. It is not just for pulling HTML; it manages the whole data-extraction process from start to finish. And it uses Go's built-in concurrency, so you get fast performance without juggling goroutines yourself all the time. I think that part is what makes it stand out, especially if you are coming from other languages.
To get going with it
There is a CLI tool that feels similar to Scrapy's. You install it with go install, and it needs Go 1.22 or later.
go install github.com/tech-engine/goscrapy/cmd/...@latest
Then verify the installation with gos -v.
Starting a project is simple too
gos startproject my_scraper
It sets up everything for you, like the module and files.
gos startproject books_to_scrape
🚀 GoScrapy generating project files. Please wait!
📦 Initializing Go module: books_to_scrape...
go: creating new go.mod: module books_to_scrape
go: to add module requirements and sums:
        go mod tidy
✔️ books_to_scrape\base.go
✔️ books_to_scrape\constants.go
✔️ books_to_scrape\errors.go
✔️ books_to_scrape\job.go
✔️ main.go
✔️ books_to_scrape\record.go
✔️ books_to_scrape\settings.go
✔️ books_to_scrape\spider.go
📦 Do you want to resolve dependencies now (go mod tidy)? [Y/n]: y
📦 Resolving dependencies...
The output shows it scaffolding the project with all the necessary files, so you don't have to create them manually. It even asks if you want to run go mod tidy right then. Pretty handy, I guess, though sometimes it takes a second to finish.
Project Structure
Once you have the project, the structure looks like this for something like books_to_scrape:
books_to_scrape/
├── main.go              # the starting point
└── books_to_scrape/
    ├── base.go          # initializes the engine, middlewares, etc.
    ├── constants.go     # shared constants
    ├── errors.go        # custom errors, if you need any
    ├── job.go           # parameters for your spider
    ├── record.go        # the structure of the data you scrape
    ├── settings.go      # configs like pipelines
    └── spider.go        # where the actual extraction happens
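To give a feel for record.go: it just holds the shape of the data you scrape. Here is a minimal standalone sketch with the same fields the spider yields later in this post; the csv struct tags are my assumption about how a CSV exporter might name columns, not the exact generated template.

```go
package main

import "fmt"

// Record mirrors the fields the spider in this post yields.
// The csv tags are illustrative assumptions, not the exact
// content of the generated record.go.
type Record struct {
	Title string `csv:"title"`
	Price string `csv:"price"`
	Stock string `csv:"stock"`
}

func main() {
	r := Record{Title: "A Light in the Attic", Price: "£51.77", Stock: "In stock"}
	fmt.Println(r.Title)
}
```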
main.go sets up the context, instantiates the spider, starts the first request, and waits for it to finish. It keeps things straightforward.
// main.go
// this template is autogenerated as well, like all the other files
func main() {
	ctx := context.Background()
	spider := books_to_scrape.New(ctx)

	// start the first request
	spider.StartRequest(ctx, nil)

	// Wait blocks until the spider finishes. With true, the spider
	// exits as soon as it's done. If you are running goscrapy as a
	// server, leave it false (the default) and it keeps running,
	// accepting more jobs via the StartRequest() method.
	if err := spider.Wait(true); err != nil {
		log.Fatal(err)
	}
}
Writing Spiders
In spider.go, that's where you write how to handle requests and parse responses. There are two main things you use all the time: s.Request and s.Parse.
s.Request(ctx) is what you call each time for a new fetch. It gives you a request object where you can chain things like Url, Meta, or Headers. Each request has to be its own object; don't try to reuse one across different pages or requests.
s.Parse(req, callback) is what actually sends the request off to the engine, and it tells the engine which function should handle the response once it comes back. Usually, you start with a StartRequest method that kicks things off.
For example, with books.toscrape.com, in the parse function you grab the product links, build a full URL for each, and hand each one to parseProduct. Then you check for a next page, and if there is one, follow it with parse again.
// spider.go
func (s *Spider) parse(ctx context.Context, resp core.IResponseReader) {
// grab product links
for _, productUrl := range resp.Css("article.product_pod h3 a").Attr("href") {
req := s.Request(ctx)
req.Url(fmt.Sprintf("%s/%s", s.baseUrl, productUrl))
s.Parse(req, s.parseProduct)
}
// check for next page
if next := resp.Css("li.next a").Attr("href"); len(next) > 0 {
req := s.Request(ctx)
req.Url(fmt.Sprintf("%s/%s", s.baseUrl, next[0]))
s.Parse(req, s.parse)
}
}
func (s *Spider) parseProduct(ctx context.Context, resp core.IResponseReader) {
product := resp.Css("article.product_page")
// yield a Record
s.Yield(&Record{
Title: product.Css(".product_main h1").Text()[0],
Price: product.Css(".price_color").Text()[0],
Stock: product.Css(".availability").Text()[0],
})
}
Pagination falls out of this pattern naturally, but you have to be careful when joining the base URL with relative links.
Middlewares and Pipelines
The framework has middlewares and pipelines like Scrapy. Middlewares, for things like retry with backoff or duplicate filtering, are registered in settings.go as a slice:
// settings.go
var MIDDLEWARES = []middlewaremanager.Middleware{
middlewares.Retry(),
middlewares.DupeFilter,
}
Pipelines take the yielded records and export them, say to CSV; every yielded record flows straight through them. It seems efficient for output, though I am not totally sure yet how it scales with huge datasets.
var export2CSV = pipelines.Export2CSV[*Record](pipelines.Export2CSVOpts{
Filename: "output.csv",
})
TUI Dashboard
There is also a TUI dashboard for watching progress in the terminal. You can wire it up in base.go by creating a tui.New(app.Logger()), adding it to a telemetry hub, and starting it in a goroutine.
//base.go
// Tweak base.go to wire up the TUI
dashboard := tui.New(app.Logger())
// create a telemetry hub
hub := ts.NewTelemetryHub()
// add the dashboard as an observer
hub.AddObserver(dashboard)
// set the telemetry hub to the app
app.WithTelemetry(hub)
go func() {
_ = gos.StartWithTUI(ctx, app, dashboard)
}()
It shows stats visually, which is nice if you are fond of that.
Anyway, that's the gist of it
GoScrapy is still early (v0.x) and under active development. If you want to check it out or help, feel free.
- GitHub: github.com/tech-engine/goscrapy
- Discord: Join our community
I think it could be useful for Go folks doing scraping, but it might need more tweaks.
Thank you for reading this far 🙏
