<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Goscrapy</title>
    <description>The latest articles on DEV Community by Goscrapy (@goscrapy).</description>
    <link>https://dev.to/goscrapy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3880027%2F3e0c8ca1-b308-47ae-bffe-429c8d83b776.png</url>
      <title>DEV Community: Goscrapy</title>
      <link>https://dev.to/goscrapy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/goscrapy"/>
    <language>en</language>
    <item>
      <title>Scrapy 🕷️, but in Go: Building High-Performance Scrapers without the Boilerplate</title>
      <dc:creator>Goscrapy</dc:creator>
      <pubDate>Wed, 15 Apr 2026 16:28:56 +0000</pubDate>
      <link>https://dev.to/goscrapy/scrapy-but-in-go-building-high-performance-scrapers-without-the-boilerplate-49k3</link>
      <guid>https://dev.to/goscrapy/scrapy-but-in-go-building-high-performance-scrapers-without-the-boilerplate-49k3</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf9gdmzxhcoou8gcrt4g.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsf9gdmzxhcoou8gcrt4g.gif" alt="GoScrapy: Harnessing Go's perfomance for blazingly fast web scraping, inspired by Python's Scrapy framework." width="880" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hi everyone 👋&lt;/p&gt;

&lt;p&gt;Web scraping can start out pretty basic. You just loop through some pages, grab the HTML, pull out what you need, and store it somewhere. But then when you try to make it bigger, like dealing with a ton of requests or figuring out retries and cookies for different sites, it turns into a real hassle quick.&lt;/p&gt;

&lt;p&gt;I remember using &lt;strong&gt;Scrapy&lt;/strong&gt; in Python, and it handled all that stuff without me having to think too hard about the details. It felt structured, you know. When I switched over to Go, there are good tools for the basics, but I missed that kind of ready-made setup. So thats why I ended up making &lt;strong&gt;GoScrapy&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is GoScrapy? (A quick intro)
&lt;/h2&gt;

&lt;p&gt;GoScrapy is basically this &lt;strong&gt;framework that tries to mimic the experience of Scrapy but in Go&lt;/strong&gt;. It is not just for pulling HTML, it manages the whole process from start to finish for extracting data. And it uses Go's built in concurrency, so you &lt;strong&gt;get fast performance without messing around with goroutines yourself&lt;/strong&gt; all the time. I think that part is what makes it stand out, especially if you are coming from other languages.&lt;/p&gt;

&lt;h2&gt;
  
  
  To get going with it
&lt;/h2&gt;

&lt;p&gt;There is a CLI tool that feels similar to Scrapy's. You install it with go install and it needs Go 1.22 or later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;install &lt;/span&gt;github.com/tech-engine/goscrapy/cmd/...@latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then verify the installation with &lt;code&gt;gos -v&lt;/code&gt;. &lt;/p&gt;

&lt;p&gt;Starting a project is simple too&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gos startproject my_scraper
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It sets up everything for you, like the module and files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;gos startproject books_to_scrape

🚀 GoScrapy generating project files. Please wait!

📦 Initializing Go module: books_to_scrape...
go: creating new go.mod: module books_to_scrape
go: to add module requirements and sums:
        go mod tidy
✔️  books_to_scrape\base.go
✔️  books_to_scrape\constants.go
✔️  books_to_scrape\errors.go
✔️  books_to_scrape\job.go
✔️  main.go
✔️  books_to_scrape\record.go
✔️  books_to_scrape\settings.go
✔️  books_to_scrape\spider.go

📦 Do you want to resolve dependencies now (go mod tidy)? [Y/n]: y
📦 Resolving dependencies...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output shows it's scaffolding the project will all the necessary file, so you don't have to manually. It even asks if you want to run go mod tidy right then. Pretty handy, I guess, though sometimes it takes a second to finish.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Structure
&lt;/h2&gt;

&lt;p&gt;Once you have the project, the structure looks like this for something like books_to_scrape&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;books_to_scrape/
├── main.go               # The starting point
└── books_to_scrape/
    ├── base.go           # Initializing the engine, middlewares etc
    ├── constants.go      # Shared stuff
    ├── errors.go         # Custom errors, if you need any
    ├── job.go            # Parameters for your spider
    ├── record.go         # The structure of data you scrape
    ├── settings.go       # Configs like pipelines
    └── spider.go         # Where the actual extraction happens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;main.go&lt;/code&gt; sets up the context, instantiate the spider, starts the request, and waits for it to finish. It keeps things straightforward.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// main.go&lt;/span&gt;

&lt;span class="c"&gt;// this template will be autogenerate as well like all the files&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Background&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="n"&gt;spider&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;books_to_scrape&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="c"&gt;// start the first request&lt;/span&gt;
    &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 

    &lt;span class="c"&gt;// wait for the spider to finish, true mean, the spider would&lt;/span&gt;
    &lt;span class="c"&gt;// exit as soon as it's done. But if you are running goscrapy&lt;/span&gt;
    &lt;span class="c"&gt;// as a server, you can set it to false(default), and it will keep running, accepts more jobs via the StartRequest() method.&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;spider&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Wait&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
        &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Fatal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;}&lt;/span&gt; 
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing Spiders
&lt;/h2&gt;

&lt;p&gt;In &lt;code&gt;spider.go&lt;/code&gt;, thats where you write how to handle requests and parse responses. There are two main things you use all the time: &lt;code&gt;s.Request&lt;/code&gt; and &lt;code&gt;s.Parse&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;s.Request(ctx)&lt;/code&gt; is what you call each time for a new fetch. It gives you a request object where you can chain things like &lt;code&gt;Url&lt;/code&gt;, &lt;code&gt;Meta&lt;/code&gt;, or &lt;code&gt;Headers&lt;/code&gt;. Each request has to be its own thing don't try to reuse them for different pages/requests.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;s.Parse(req, callback)&lt;/code&gt; is what actually sends the request off to the engine. You also tell it which function should handle the response once it comes back. Usually, you start with a &lt;code&gt;StartRequest&lt;/code&gt; method that kicks things off.&lt;/p&gt;

&lt;p&gt;For example, with &lt;code&gt;books.toscrape.com&lt;/code&gt;, in the parse function you grab product links, make a full URL, and parse each one with &lt;code&gt;parseProduct&lt;/code&gt;. Then check for the next page, and if there is one, follow it back to parse.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// spider.go&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IResponseReader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// grab product links&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productUrl&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"article.product_pod h3 a"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"href"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s/%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;productUrl&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parseProduct&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// check for next page&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"li.next a"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Attr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"href"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%s/%s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;baseUrl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;next&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
        &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Spider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;parseProduct&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IResponseReader&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"article.product_page"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;// yield a Record&lt;/span&gt;
    &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Yield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;Title&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".product_main h1"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;Price&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".price_color"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;Stock&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;product&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;".availability"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Text&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It handles pagination kind of naturally, but you have to be careful with the base URL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Middlewares and Pipelines
&lt;/h2&gt;

&lt;p&gt;The framework has middlewares and pipelines like Scrapy. Middlewares for things like retry with backoff or dupe filter are set in &lt;code&gt;settings.go&lt;/code&gt; as a slice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
&lt;span class="c"&gt;// settings.go&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;MIDDLEWARES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;middlewaremanager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Middleware&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;middlewares&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Retry&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
    &lt;span class="n"&gt;middlewares&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DupeFilter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pipelines take the yielded records and export them, say to CSV. Every yield goes right there. It seems efficient for output, though I am not totally sure how it scales with huge datasets yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;export2CSV&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pipelines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Export2CSV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Record&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="n"&gt;pipelines&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Export2CSVOpts&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="n"&gt;Filename&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"output.csv"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  TUI Dashboard
&lt;/h2&gt;

&lt;p&gt;There is also this TUI dashboard for watching progress in the terminal. You can add it in &lt;code&gt;base.go&lt;/code&gt; by making a &lt;code&gt;tui.New(app.Logger())&lt;/code&gt;, a telemetry hub, and starting it in a goroutine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;//base.go&lt;/span&gt;

&lt;span class="c"&gt;// Tweak base.go to wire up the TUI&lt;/span&gt;
&lt;span class="n"&gt;dashboard&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tui&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Logger&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c"&gt;// create a telemetry hub&lt;/span&gt;
&lt;span class="n"&gt;hub&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewTelemetryHub&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c"&gt;// add the dashboard as an observer&lt;/span&gt;
&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AddObserver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dashboard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;// set the telemetry hub to the app&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithTelemetry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hub&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;go&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gos&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;StartWithTUI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dashboard&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It shows stats visually, which is nice if you are fond of that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anyway, thats the gist of it
&lt;/h2&gt;

&lt;p&gt;GoScrapy is still in early versions, v0.x, and under development. If you want to check it out or help, feel free.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/tech-engine/goscrapy" rel="noopener noreferrer"&gt;github.com/tech-engine/goscrapy&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord&lt;/strong&gt;: &lt;a href="https://discord.gg/FPvxETjYPH" rel="noopener noreferrer"&gt;Join our community&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I think it could be useful for Go folks doing scraping, but it might need more tweaks.&lt;/p&gt;

&lt;p&gt;Thank you for reading this far 💚&lt;/p&gt;

</description>
      <category>webscraping</category>
      <category>go</category>
      <category>goscrapy</category>
    </item>
  </channel>
</rss>
