
Go crawler Colly: from beginner to expert in 10 minutes

Introduction

This article covers how to use Colly, the overall design of its code architecture, and a collection of usage examples.

Colly is a crawler framework written in Go. It is not a finished product, but it offers expressiveness and flexibility similar to its Python counterparts (BeautifulSoup, Scrapy).

The name Colly is short for Collector, and the Collector is also the core of Colly.

The Colly Official Docs are not extensive, and the news there is fairly dated; active development happens mainly on GitHub.

Concepts

Architecture

Conceptually, Colly's design is split into two layers, a core layer and a parsing layer:

  • Collector: the heart of Colly. This component handles network communication and executes the registered event callbacks while a Collector job runs.
  • Parser: this layer is implicit and not covered by the official docs. Libraries such as goquery and htmlquery parse the fetched pages into jQuery-like objects, giving the HTML both XPath and CSS selectors (see the sketch after this list).
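
For a sense of what the parsing layer looks like, colly's HTMLElement exposes the underlying goquery selection through its DOM field, so the full jQuery-like API is available inside a callback. A minimal sketch (the URL and selectors are placeholders):

package main

import (
    "fmt"

    "github.com/PuerkitoBio/goquery"
    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    // e.DOM is a *goquery.Selection, so we can drop down to
    // goquery whenever the HTMLElement helpers are not enough.
    c.OnHTML("ul", func(e *colly.HTMLElement) {
        e.DOM.Find("li").Each(func(_ int, s *goquery.Selection) {
            fmt.Println(s.Text())
        })
    })

    c.Visit("https://example.com/")
}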

Typically, a crawler's workflow life cycle looks roughly like this (sketched below):

  • Build a client
  • Send requests
  • Receive the response data
  • Parse the response data
  • Process the data you need
  • Persist it
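
To make that life cycle concrete, here is roughly what those steps look like written by hand with net/http and goquery (persistence omitted, URL is a placeholder); this is the boilerplate that Colly wraps for you:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/PuerkitoBio/goquery"
)

func main() {
    // Build a client and send the request
    resp, err := http.Get("https://example.com/")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Parse the response data into a jQuery-like document
    doc, err := goquery.NewDocumentFromReader(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // Process the data we need (persisting it would follow here)
    doc.Find("a").Each(func(_ int, s *goquery.Selection) {
        href, _ := s.Attr("href")
        fmt.Println(s.Text(), "->", href)
    })
}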

Colly wraps these concepts: events are registered on each step, and the data is cleaned up through those event callbacks. Abstractly speaking, Colly is oriented around the process rather than the object. The rough working architecture is shown in the figure below.

(Figure: Colly's event-driven pipeline)

From the concepts above, we can see that Colly is an event-based crawler: developers register event handlers themselves, and those handlers drive the whole pipeline.

Colly provides the following event handlers (a sketch showing their firing order follows the list):

  • OnRequest: called before a request is made
  • OnError: called if an error occurs during the request
  • OnResponseHeaders: called after the response headers are received
  • OnResponse: called after the response is received
  • OnHTML: called right after OnResponse if the received content is HTML
  • OnXML: called right after OnHTML if the received content is HTML or XML
  • OnScraped: called after the OnXML callbacks
  • OnHTMLDetach: deregisters an OnHTML event handler; once detached, a handler that has not yet run will no longer be executed
  • OnXMLDetach: deregisters an OnXML event handler; once detached, a handler that has not yet run will no longer be executed
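
A minimal sketch that registers a handler at each stage, so the firing order (OnRequest -> OnResponse -> OnHTML -> OnScraped, or OnError on failure) can be observed directly; the URL is just a placeholder:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    c := colly.NewCollector()

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("OnRequest:", r.URL)
    })
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("OnError:", err)
    })
    c.OnResponse(func(r *colly.Response) {
        fmt.Println("OnResponse:", len(r.Body), "bytes")
    })
    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println("OnHTML:", e.Text)
    })
    c.OnScraped(func(r *colly.Response) {
        fmt.Println("OnScraped:", r.Request.URL)
    })

    c.Visit("https://example.com/")
}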

Reference

goquery

htmlquery

Utilities

Basic usage

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // Instantiate default collector
    c := colly.NewCollector(
        // Visit only domains: hackerspaces.org, wiki.hackerspaces.org
        colly.AllowedDomains("hackerspaces.org", "wiki.hackerspaces.org"),
    )

    // On every a element which has href attribute call callback
    c.OnHTML("a[href]", func(e *colly.HTMLElement) {
        link := e.Attr("href")
        // Print link
        fmt.Printf("Link found: %q -> %s\n", e.Text, link)
        // Visit link found on page
        // Only those links are visited which are in AllowedDomains
        c.Visit(e.Request.AbsoluteURL(link))
    })

    // Before making a request print "Visiting ..."
    c.OnRequest(func(r *colly.Request) {
        fmt.Println("Visiting", r.URL.String())
    })

    // Start scraping on https://hackerspaces.org
    c.Visit("https://hackerspaces.org/")
}

Error handling

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    // Create a collector
    c := colly.NewCollector()

    // Set HTML callback
    // Won't be called if error occurs
    c.OnHTML("*", func(e *colly.HTMLElement) {
        fmt.Println(e)
    })

    // Set error handler
    c.OnError(func(r *colly.Response, err error) {
        fmt.Println("Request URL:", r.Request.URL, "failed with response:", r, "\nError:", err)
    })

    // Start scraping
    c.Visit("https://definitely-not-a.website/")
}

Handling local files

words.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Document title</title>
</head>
<body>
<p>List of words</p>
<ul>
    <li>dark</li>
    <li>smart</li>
    <li>war</li>
    <li>cloud</li>
    <li>park</li>
    <li>cup</li>
    <li>worm</li>
    <li>water</li>
    <li>rock</li>
    <li>warm</li>
</ul>
<footer>footer for words</footer>
</body>
</html>
The program below registers a file:// protocol transport so the Collector can read the file from the local directory:

package main

import (
    "fmt"
    "net/http"

    "github.com/gocolly/colly/v2"
)

func main() {

    t := &http.Transport{}
    t.RegisterProtocol("file", http.NewFileTransport(http.Dir(".")))

    c := colly.NewCollector()
    c.WithTransport(t)

    words := []string{}

    c.OnHTML("li", func(e *colly.HTMLElement) {
        words = append(words, e.Text)
    })

    c.Visit("file://./words.html")

    for _, p := range words {
        fmt.Printf("%s\n", p)
    }
}

Using a proxy switcher

With a ProxySwitcher you can crawl through a whole pool of proxy IPs. Only round-robin (RR) switching is built in, though; if you need a different balancing algorithm, you have to implement it yourself (see the sketch after the example below).

package main

import (
    "bytes"
    "log"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/proxy"
)

func main() {
    // Instantiate default collector
    c := colly.NewCollector(colly.AllowURLRevisit())

    // Rotate two socks5 proxies
    rp, err := proxy.RoundRobinProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
    if err != nil {
        log.Fatal(err)
    }
    c.SetProxyFunc(rp)

    // Print the response
    c.OnResponse(func(r *colly.Response) {
        log.Printf("Proxy Address: %s\n", r.Request.ProxyURL)
        log.Printf("%s\n", bytes.Replace(r.Body, []byte("\n"), nil, -1))
    })

    // Fetch httpbin.org/ip five times
    for i := 0; i < 5; i++ {
        c.Visit("https://httpbin.org/ip")
    }
}

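Since SetProxyFunc accepts any colly.ProxyFunc, a custom balancing strategy is just a function. Here is a minimal sketch of a random switcher, reusing the same placeholder proxy addresses as above:

package main

import (
    "log"
    "math/rand"
    "net/http"
    "net/url"

    "github.com/gocolly/colly"
)

// randomProxySwitcher picks a proxy at random for every request;
// swap the selection logic for whatever balancing algorithm you need.
func randomProxySwitcher(proxies ...string) (colly.ProxyFunc, error) {
    urls := make([]*url.URL, len(proxies))
    for i, p := range proxies {
        u, err := url.Parse(p)
        if err != nil {
            return nil, err
        }
        urls[i] = u
    }
    return func(_ *http.Request) (*url.URL, error) {
        return urls[rand.Intn(len(urls))], nil
    }, nil
}

func main() {
    c := colly.NewCollector(colly.AllowURLRevisit())

    rp, err := randomProxySwitcher("socks5://127.0.0.1:1337", "socks5://127.0.0.1:1338")
    if err != nil {
        log.Fatal(err)
    }
    c.SetProxyFunc(rp)

    c.Visit("https://httpbin.org/ip")
}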

Random delay

These options shape the crawler's behavior, such as rate limits and random delays, so that anti-bot systems are less likely to detect and block us.

package main

import (
    "fmt"
    "time"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/debug"
)

func main() {
    url := "https://httpbin.org/delay/2"

    // Instantiate default collector
    c := colly.NewCollector(
        // Attach a debugger to the collector
        colly.Debugger(&debug.LogDebugger{}),
        colly.Async(true),
    )

    // Limit the number of threads started by colly to two
    // when visiting links whose domains match the "*httpbin.*" glob
    c.Limit(&colly.LimitRule{
        DomainGlob:  "*httpbin.*",
        Parallelism: 2,
        RandomDelay: 5 * time.Second,
    })

    // Start scraping in four threads on https://httpbin.org/delay/2
    for i := 0; i < 4; i++ {
        c.Visit(fmt.Sprintf("%s?n=%d", url, i))
    }
    // Start scraping on https://httpbin.org/delay/2
    c.Visit(url)
    // Wait until threads are finished
    c.Wait()
}

Multi-threaded request queue

package main

import (
    "fmt"

    "github.com/gocolly/colly"
    "github.com/gocolly/colly/queue"
)

func main() {
    url := "https://httpbin.org/delay/1"

    // Instantiate default collector
    c := colly.NewCollector(colly.AllowURLRevisit())

    // create a request queue with 2 consumer threads
    q, _ := queue.New(
        2, // Number of consumer threads
        &queue.InMemoryQueueStorage{MaxSize: 10000}, // Use default queue storage
    )

    c.OnRequest(func(r *colly.Request) {
        fmt.Println("visiting", r.URL)
        if r.ID < 15 {
            r2, err := r.New("GET", fmt.Sprintf("%s?x=%v", url, r.ID), nil)
            if err == nil {
                q.AddRequest(r2)
            }
        }
    })

    for i := 0; i < 5; i++ {
        // Add URLs to the queue
        q.AddURL(fmt.Sprintf("%s?n=%d", url, i))
    }
    // Consume URLs
    q.Run(c)

}

Async

By default, Colly works synchronously. Asynchronous mode can be enabled with the Async option. In async mode, we must call Wait to wait for the Collector's jobs to finish.

package main

import (
    "fmt"

    "github.com/gocolly/colly/v2"
)

func main() {

    urls := []string{
        "http://webcode.me",
        "https://example.com",
        "http://httpbin.org",
        "https://www.perl.org",
        "https://www.php.net",
        "https://www.python.org",
        "https://code.visualstudio.com",
        "https://clojure.org",
    }

    c := colly.NewCollector(
        colly.Async(),
    )

    c.OnHTML("title", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })

    for _, url := range urls {

        c.Visit(url)
    }

    c.Wait()

}

Maximum depth

Depth controls how many levels of links away from the entry URL the collector follows when visited pages contain further links. With MaxDepth(1), only the links found on the entry page are visited; a MaxDepth of 0 (the default) means unlimited recursion.

package main

import (
   "fmt"

   "github.com/gocolly/colly"
)

func main() {
   // Instantiate default collector
   c := colly.NewCollector(
      // MaxDepth is 1, so only the links on the scraped page
      // are visited, and no further links are followed
      colly.MaxDepth(1),
   )

   // On every a element which has href attribute call callback
   c.OnHTML("a[href]", func(e *colly.HTMLElement) {
      link := e.Attr("href")
      // Print link
      fmt.Println(link)
      // Visit link found on page
      e.Request.Visit(link)
   })

   // Start scraping on https://en.wikipedia.org
   c.Visit("https://en.wikipedia.org/")
}

Reference

gocolly

colly
