
Learning Golang (some rough notes) - S01E10 - Concurrency (Web Crawler)

Robin Moffatt · Originally published at rmoff.net · 4 min read

👉 A Tour of Go : sync.Mutex

In the previous exercise I felt my absence of a formal CompSci background with the introduction of binary search trees, and now I am conscious of it again in learning about mutexes. I’d heard of them before, mostly when Oracle performance folk were talking about wait types - TIL it stands for mutual exclusion!

What if we … want to make sure only one goroutine can access a variable at a time to avoid conflicts?

This concept is called mutual exclusion, and the conventional name for the data structure that provides it is mutex.

Exercise: Web Crawler

👉 A Tour of Go : Exercise: Web Crawler

This was quite a fun one once I wrapped my head around it. It gave a healthy dose of copy & paste advice in the form of the previous example, which I used to implement the first requirement.

Don’t fetch the same URL twice

I created a URLs struct to hold a map of URLs and a boolean of whether they have been crawled or not, and included a mutex so that the map can be read and updated safely during concurrent execution.

type URLs struct {
    c map[string]bool
    mux sync.Mutex
}

var u URLs = URLs{c: make(map[string]bool)}

The URLs type implements two methods - one to check if a given URL has been crawled, and the other to mark it as such. Note the pointer receivers: with a value receiver each call would lock a copy of the mutex (go vet flags this as "passes lock by value"), so the locking would do nothing.

func (u *URLs) IsCrawled(url string) bool {
    fmt.Printf("\n👀 Checking if %v has been crawled…", url)
    u.mux.Lock()
    defer u.mux.Unlock()
    if !u.c[url] {
        fmt.Printf("…it hasn't\t")
        return false
    }
    fmt.Printf("…it has\t")
    return true
}

func (u *URLs) Crawled(url string) {
    u.mux.Lock()
    u.c[url] = true
    u.mux.Unlock()
}

To the main Crawl function I then added calls to these functions and a conditional return:

// Check if the URL has been crawled already
if u.IsCrawled(url) {
    return
}
fmt.Printf("\n➡️ Crawling %v", url)
body, urls, err := fetcher.Fetch(url)
// Mark the URL as crawled (assumes that if there's an error you don't want to retry it)
u.Crawled(url)

As the comment notes, we assume that if a URL has been crawled, then we mark it as such, regardless of error status. If I was feeling adventurous I guess I could implement some kind of retry logic with incremental backoff…but I’m keeping it simple for now :)

Fetch URLs in parallel

This one I assumed could be done by simply using a goroutine for the nested Crawl calls. What it actually did was just fetch the first URL and exit:

-> Checking if https://golang.org/ has been crawled……it hasn't  
    found: https://golang.org/ "The Go Programming Language"

Off to Google we went and found this answer on StackOverflow, which showed the use of WaitGroups (nice example here). I ripped this off shamelessly into my code and it almost worked…

👀 Checking if https://golang.org/ has been crawled……it hasn't    
➡️ Crawling https://golang.org/
    ->✅ found: https://golang.org/ "The Go Programming Language"

👀 Checking if https://golang.org/pkg/ has been crawled……it hasn't    
➡️ Crawling https://golang.org/pkg/
    ->✅ found: https://golang.org/pkg/ "Packages"

👀 Checking if https://golang.org/ has been crawled……it has   
👀 Checking if https://golang.org/cmd/ has been crawled……it hasn't    
➡️ Crawling https://golang.org/cmd/
    ->⚠️ not found: https://golang.org/cmd/

👀 Checking if https://golang.org/pkg/fmt/ has been crawled……it hasn't    
➡️ Crawling https://golang.org/pkg/fmt/
    ->✅ found: https://golang.org/pkg/fmt/ "Package fmt"

👀 Checking if https://golang.org/ has been crawled……it has   
👀 Checking if https://golang.org/pkg/ has been crawled……it has   
👀 Checking if https://golang.org/pkg/os/ has been crawled……it hasn't 
➡️ Crawling https://golang.org/pkg/os/
    ->✅ found: https://golang.org/pkg/os/ "Package os"

👀 Checking if https://golang.org/ has been crawled……it has   
👀 Checking if https://golang.org/pkg/ has been crawled……it has   
👀 Checking if https://golang.org/cmd/ has been crawled……it has   

but then threw a panic

panic: sync: negative WaitGroup counter

goroutine 1 [running]:
sync.(*WaitGroup).Add(0xc0000a4010, 0xffffffffffffffff)
    /usr/local/go/src/sync/waitgroup.go:74 +0x1ec
sync.(*WaitGroup).Done(0xc0000a4010)
    /usr/local/go/src/sync/waitgroup.go:99 +0x34
main.Crawl(0x1100e8c, 0x13, 0x4, 0x110fb60, 0xc0000801b0, 0xc0000a4010)
    /Users/rmoff/go/src/webcrawler/webcrawler.go:68 +0x676
main.main()
    /Users/rmoff/go/src/webcrawler/webcrawler.go:73 +0x98

A bit of Googling confirmed what the error itself suggests: panic: sync: negative WaitGroup counter comes about because Done had been called enough times to take the WaitGroup's counter below zero.

This happened because every execution of Crawl includes

defer wg.Done()

but the corresponding

wg.Add(1)

was only added in the nested call to Crawl and not the initial invocation from main(). Adding this into main() then made everything work just great.

func Crawl(url string, depth int, fetcher Fetcher, wg *sync.WaitGroup) {
    defer wg.Done()

    if depth <= 0 {
        return
    }

    // Check if the URL has been crawled already
    if u.IsCrawled(url) {
        return
    }
    fmt.Printf("\n➡️ Crawling %v", url)
    body, urls, err := fetcher.Fetch(url)
    // Mark the URL as crawled (assumes that if there's an error you don't want to retry it)
    u.Crawled(url)

    if err != nil {
        fmt.Println(err)
        return
    }
    fmt.Printf("\n\t->✅ found: %s %q\n", url, body)
    for _, z := range urls {
        wg.Add(1)
        // Fetch each linked URL in its own goroutine
        go Crawl(z, depth-1, fetcher, wg)
    }

}

func main() {
    wg := &sync.WaitGroup{}
    wg.Add(1)
    Crawl("https://golang.org/", 4, fetcher, wg)
    wg.Wait()
}
