Daniel McMahon
Lucas - A Webscraper in Go


Lucas is a webscraper built using Go and the Colly library. He's also an adorable spider on YouTube!


I wanted to experiment a little with Go, as it's a programming language used by certain teams at my office and not something I often get hands-on with. I figured it would be fun to try to deconstruct a random online website and see if I could come up with a programmatic way to extract its HTML data in a form that could be perceived as useful.

The project started out as a fun experiment, but after taking it as far as I have, I decided to stop working on it for a few reasons, which I'll outline below.

The Morality of Web Scraping

As will be apparent to many out there, web scraping is a bit of a 'grey' area, morally speaking. From all the research and reading I've done on the subject, I usually boil it down to this:

'if you are not violating fair terms of use, are abiding by a company's robots.txt file & are notifying them that you are doing so, then you should be in the clear'

This is a highly generalised statement for a complex issue, but I find it's a nice guideline to try to follow.
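To make the robots.txt part of that guideline concrete, here is a minimal sketch of checking a path against a robots.txt body before crawling it. This is deliberately simplified (it only honours Disallow rules under the wildcard user-agent); a real crawler should use a full parser:

```go
package main

import (
	"fmt"
	"strings"
)

// allowedByRobots is a deliberately simplified robots.txt check: it only
// honours "Disallow" lines under the "*" user-agent, which is enough to
// sanity-check a crawl target before scraping.
func allowedByRobots(robotsTxt, path string) bool {
	applies := false
	for _, line := range strings.Split(robotsTxt, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "User-agent:") {
			applies = strings.TrimSpace(strings.TrimPrefix(line, "User-agent:")) == "*"
		} else if applies && strings.HasPrefix(line, "Disallow:") {
			rule := strings.TrimSpace(strings.TrimPrefix(line, "Disallow:"))
			if rule != "" && strings.HasPrefix(path, rule) {
				return false
			}
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /cart.php\nDisallow: /account"
	fmt.Println(allowedByRobots(robots, "/Dresses"))  // true
	fmt.Println(allowedByRobots(robots, "/cart.php")) // false
}
```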

When it comes to e-commerce, online websites can hold some highly profitable data, data which they may not even necessarily surface on the frontend of their stores but will still have floating around as JSON in their static page content.

Some types of information e-commerce stores may leave floating around in their HTML include:

  • Stock levels (things like low in stock flags or even exact stock numbers!)
  • Size availability
  • Pricing/Sales Rates

For general users this data can go unnoticed, but accessing it can be as simple as right-clicking on a storefront and selecting 'view page source'.
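As an illustration, here is a sketch of how such a hidden JSON blob might be pulled out of a page's source. The productData variable and its fields are hypothetical, not taken from any real store:

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Product mirrors the kind of data a store might embed in its page source.
type Product struct {
	Name     string  `json:"name"`
	Price    float64 `json:"price"`
	Stock    int     `json:"stock"`
	LowStock bool    `json:"low_stock"`
}

// extractProduct pulls a JSON blob out of a hypothetical inline script tag.
func extractProduct(pageSource string) (Product, error) {
	var p Product
	re := regexp.MustCompile(`var productData = (\{.*\})`)
	match := re.FindStringSubmatch(pageSource)
	if match == nil {
		return p, fmt.Errorf("no product JSON found")
	}
	err := json.Unmarshal([]byte(match[1]), &p)
	return p, err
}

func main() {
	// hypothetical HTML, mimicking the JSON stores leave in static content
	page := `<script>var productData = {"name":"Floral Dress","price":39.99,"stock":3,"low_stock":true};</script>`
	p, err := extractProduct(page)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s: %.2f (stock: %d, low stock: %v)\n", p.Name, p.Price, p.Stock, p.LowStock)
	// Floral Dress: 39.99 (stock: 3, low stock: true)
}
```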

The 'Drawbacks' of Structured data

When it comes to designing a web scraper, you want to analyse the structure of your 'target' site's pages. Ask yourself these kinds of questions:

  • Is there an easy pattern to follow to access product-specific pages?
  • Is there a replicated structure across multiple pages to allow easy HTML parsing?

On the more technical side of things consider the following:

  • Does the website update its HTML often?
  • Can you account for missing data during your web scraping?
  • How will you store the data? Will a DB be fast enough?
  • Do you plan to handle multi-threading/distributed scraping?
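On the missing-data point, here's a small sketch: a naive strings.Split(...)[1] on a field like a product code will panic the moment one page omits the separator, so it's worth wrapping extractions defensively. safeCode is a hypothetical helper, not code from the scraper itself:

```go
package main

import (
	"fmt"
	"strings"
)

// safeCode guards against missing data: a bare strings.Split(raw, "#")[1]
// will panic with an index-out-of-range error when the separator is absent.
func safeCode(raw string) string {
	parts := strings.Split(raw, "#")
	if len(parts) < 2 {
		return "" // field missing on this page; store an empty code instead of crashing
	}
	return parts[1]
}

func main() {
	fmt.Println(safeCode("Item Code: #DRESS-123")) // DRESS-123
	fmt.Println(safeCode(""))                      // prints an empty line, no panic
}
```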

How the scraper works

So onto the fun part - the code! I will talk in generalised terms here so the practices used can be applied to any form of e-commerce store.

Data Storage

In order to store the scraped website's data I decided to roll with a Postgres DB, for no other reason than that I was familiar with it and it was easy to set up via a docker-compose file.

version: '3'
services:
  lucas:
    container_name: lucas
    image: postgres:9.6-alpine
    ports:
      - '5432:5432'
    environment:
      POSTGRES_DB: 'lucas_db'
      POSTGRES_USER: 'user'
With this basic PSQL DB I was able to set up a basic table by running the following command with the input file below:

psql -h localhost -U user lucas_db -f dbsetup.sql


\c lucas_db

CREATE TABLE floryday(
  index serial,
  product text,
  code text,
  description text,
  price decimal(53, 4)
);

As you can see from the table there were a few basic details I decided to scrape from the web pages in question:

  • index: this was just used as a unique id
  • product: name of the product item
  • code: code of the product in question
  • description: the description text of the product
  • price: the price of the product

As listed above, there are additional fields you might be interested in extracting in your own experiments, like size & availability.

Go Dependencies

I was a little sloppy in my service setup in that I did not rely on a Go dependency management tool like dep; instead I just took care of installing the dependencies manually (as there were only 3 it wasn't so bad). These were the three main external libraries I used; each can be installed with the command go get <repo-name>:

  • github.com/gocolly/colly
  • github.com/fatih/color
  • github.com/lib/pq

To make this setup a little easier I set up a Dockerfile to keep track of the installation:

FROM golang:1.11

MAINTAINER Daniel McMahon <>

WORKDIR /opt/lucas

ADD . /opt/lucas

ENV PORT blergh

# installing our golang dependencies
RUN go get -u github.com/gocolly/colly && \
  go get -u github.com/fatih/color && \
  go get -u github.com/lib/pq

CMD go run lucas.go

The main logic

In short the main code does the following:

  • Starts at a seed url
  • Scans all the links on the page
  • Looks for a page that matches a certain regex, i.e. -Dress- (we know from some basic checks that these pages all share a similar structure and are usually the product pages we are interested in)
  • Define a Struct to represent our clothing values of interest
  • Write the Struct to the postgres DB
  • Continue up to a size of 200 writes

Here is the main bulk of the code, with comments explaining the logic. It is not optimized and still quite rough around the edges, but its key functionality is in place:


// as our scraper will only use one file this will be our main package
package main

// importing dependencies
import (
    "database/sql"
    "encoding/json"
    "fmt"
    "log"
    "os"
    "strconv"
    "strings"

    "github.com/fatih/color"
    "github.com/gocolly/colly"
    _ "github.com/lib/pq" // postgres driver, imported only for its side effects
)

// setting up a datastructure to represent a form of Clothing
type Clothing struct {
    Name        string
    Code        string
    Description string
    Price       float64
}

// setting up a function to write to our db
func dbWrite(product Clothing) {
    const (
        host   = "localhost"
        port   = 5432
        user   = "user"
        // password = ""
        dbname = "lucas_db"
    )

    psqlInfo := fmt.Sprintf("host=%s port=%d user=%s "+
        "dbname=%s sslmode=disable",
        host, port, user, dbname)

    db, err := sql.Open("postgres", psqlInfo)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.Ping()
    if err != nil {
        log.Fatal(err)
    }

    // some debug print logs
    log.Print("Successfully connected!")
    fmt.Printf("%s, %s, %s, %f\n", product.Name, product.Code, product.Description, product.Price)

    sqlStatement := `
    INSERT INTO floryday (product, code, description, price)
    VALUES ($1, $2, $3, $4)`
    _, err = db.Exec(sqlStatement, product.Name, product.Code, product.Description, product.Price)
    if err != nil {
        log.Fatal(err)
    }
}

// our main function - using a colly collector
func main() {
    // creating our new colly collector
    c := colly.NewCollector(
        // colly.AllowedDomains("..."),
        // colly.MaxDepth(5), // keeping crawling limited for our initial experiments
    )

    // clothing detail scraping collector
    detailCollector := c.Clone()

    // setting our array of clothing to a capacity of 200
    clothes := make([]Clothing, 0, 200)

    // printing a visiting message for debug purposes
    c.OnRequest(func(r *colly.Request) {
        log.Println("Visiting", r.URL.String())
    })

    // find and visit all href links -> this can be optimised later
    c.OnHTML(`a[href]`, func(e *colly.HTMLElement) {
        link := e.Attr("href")

        // hardcoded urls to skip -> these aren't relevant for products
        if strings.Index(link, "/cart.php") > -1 ||
            strings.Index(link, "/login.php") > -1 ||
            strings.Index(link, "/account") > -1 ||
            strings.Index(link, "/privacy-policy.html") > -1 {
            return
        }

        clothingURL := e.Request.AbsoluteURL(link)

        // this was a way to determine the page was definitely a product
        // if it contained -Dress- we were good to scrape
        if strings.Contains(clothingURL, "-Dress-") {
            // activate the detailCollector
            color.Green("Crawling Link Validated -> Commencing Crawl for %s", clothingURL)
            detailCollector.Visit(clothingURL)
        } else {
            color.Red("Validation Failed -> Cancelling Crawl for %s", clothingURL)
        }
    })

    // extract details of the clothing
    detailCollector.OnHTML(`div[class=prod-right-in]`, func(e *colly.HTMLElement) {
        // some html parsing to get the exact values we want
        title := e.ChildText(".prod-name")
        code := strings.Split(e.ChildText(".prod-item-code"), "#")[1]
        stringPrice := strings.TrimPrefix(e.ChildText(".prod-price"), "€ ")
        price, err := strconv.ParseFloat(stringPrice, 64) // conversion to float64
        if err != nil {
            color.Red("err in parsing price -> %s", err)
        }
        description := e.ChildText(".grid-uniform")

        clothing := Clothing{
            Name:        title,
            Code:        code,
            Description: description,
            Price:       price,
        }

        // writing as we go to the DB
        // TODO optimise to handle bulk array uploads instead of one at a time
        dbWrite(clothing)

        // appending to our output array...
        clothes = append(clothes, clothing)
    })

    // start scraping at our seed address (the real storefront URL is deliberately omitted)
    c.Visit("https://www.<store-name>.com/Dresses")

    // dump json to the standard output
    enc := json.NewEncoder(os.Stdout)
    enc.SetIndent("", "  ")
    enc.Encode(clothes)
}


After getting the basic functionality set up and working, I decided to experiment with multiple product seed pages. I discovered that this particular store laid out its main products on a single listing page (whose name I'll leave out); this could be paginated with a simple /p2 at the end of the url, or /p3, /p4, all the way up to p80-something! On average there were around 40+ products on each of these pages. I implemented a simple for loop that iterated over this seed url, updating it each time. With a little more logic I could in essence set up the crawler to hit every Dress product this store had on sale (and I'm sure that with a small regex tweak the logic could be applied to the other fashion categories the retailer had on offer).
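That pagination loop can be sketched roughly like this. buildSeedURLs and the base URL are illustrative only, since the real store isn't named:

```go
package main

import "fmt"

// buildSeedURLs sketches the pagination described above: the listing page
// accepts a /p2, /p3, ... suffix, so iterating a counter yields every seed
// URL for the crawler to visit.
func buildSeedURLs(base string, pages int) []string {
	urls := []string{base} // page 1 has no suffix
	for i := 2; i <= pages; i++ {
		urls = append(urls, fmt.Sprintf("%s/p%d", base, i))
	}
	return urls
}

func main() {
	// each of these would be handed to c.Visit(u) in turn
	for _, u := range buildSeedURLs("https://store.example/Dresses", 3) {
		fmt.Println(u)
	}
}
```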

It was at this point that I decided to stop work on the project, as I had achieved the basic goals I set out to achieve and learned, perhaps a little too easily, how malicious this innocent project could turn. It was becoming a one-way street towards essentially having a DB that contained the entire storefront of this e-commerce site.

There are definitely optimisations that would be required to achieve this goal, but I imagine that with the use of goroutines you could parallelise this scraper and speed up the process enough to potentially scrape the entire website in a short timespan.
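As a rough sketch of what that parallelisation might look like, here is a generic goroutine worker pool rather than the project's actual code; scrapePage is a stand-in for a real fetch:

```go
package main

import (
	"fmt"
	"sync"
)

// scrapePage stands in for a real page fetch; the hypothetical return value
// is the number of products found on that page.
func scrapePage(url string) int {
	return 40 // the store averaged around 40 products per listing page
}

// scrapeAll fans the seed URLs out over a fixed pool of goroutines, which is
// one way to parallelise the per-page crawls, and tallies the product count.
func scrapeAll(urls []string, workers int) int {
	jobs := make(chan string)
	var wg sync.WaitGroup
	var mu sync.Mutex
	total := 0

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range jobs {
				n := scrapePage(u)
				mu.Lock() // guard the shared counter across goroutines
				total += n
				mu.Unlock()
			}
		}()
	}

	for _, u := range urls {
		jobs <- u
	}
	close(jobs)
	wg.Wait()
	return total
}

func main() {
	urls := []string{"/Dresses", "/Dresses/p2", "/Dresses/p3"}
	fmt.Println(scrapeAll(urls, 2)) // 120 (3 pages x 40 products)
}
```

Colly itself also ships an async mode (passing colly.Async(true) to the collector and calling c.Wait() at the end), which may be simpler than hand-rolling a pool like this.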

Closing Thoughts

I had some fun trying to reverse engineer the website's HTML structure, figuring out how they were displaying product pages, how to extract the right product hrefs to crawl, and which data to obtain and write to a DB. It was enjoyable, but it all felt a little too close to the sun for my legal liking.

I have deliberately not referenced the 'real' website's name in this example for the sake of its anonymity, but the underlying principles should be applicable to most major online e-commerce retailers.

I was amazed by how easy it was to get up and running with the Colly library. I definitely suggest testing it out, but be careful with what data you decide to scrape/store, and investigate your target's robots.txt file to ensure you have permission to hit their website.

Any thoughts/opinions/comments feel free to leave below.

See you next time!

Top comments

Adrian B.G.

Nice, I also wrote a simple scraper based on GoCrawl. Simple as in it does not care about the content; it only visits the webpages of a single domain. I needed to trigger server-side render cache for Google crawlers (prerender).