Comparing the same web scraper in Haskell, Python, Go

Ryan Westlund ・ 4 min read

So this project started with a need - or, not really a need, but an annoyance I realized would be a good opportunity to strengthen my Haskell, even if the solution probably wasn't worth it in the end.

There's a blog I follow (Fake Nous) that uses WordPress, meaning its comment section mechanics and account system are as convoluted and nightmarish as Haskell's package management. In particular, I wanted to see if I could stop relying on the kludgy WordPress notifications, which only seem to work occasionally, and instead write a web scraper that fetches the page, finds the recent comments element, and checks whether a new comment has been posted.

I've done the bulk of the job now: I wrote a Haskell script that outputs the "Name on Post" string of the most recent comment. And I thought it'd be interesting to compare the Haskell solution to Python and Go solutions.

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TupleSections #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE MultiWayIf #-}
{-# LANGUAGE ViewPatterns #-}

import Network.HTTP.Req
import qualified Text.HTML.DOM as DOM
import qualified Text.XML.Cursor as Cursor
import qualified Text.XML.Selector as Selector
import qualified Data.XML.Types as Types
import qualified Text.XML as XML
import Data.Text (Text, unpack)
import Control.Monad

main = do
    resp <- runReq defaultHttpConfig $ req GET (https "fakenous.net") NoReqBody lbsResponse mempty
    let dom = Cursor.fromDocument $ DOM.parseLBS $ responseBody resp
        recentComments = XML.toXMLNode $ Cursor.node $ head $ Selector.query "#recentcomments" $ dom
        newest = head $ Types.nodeChildren recentComments
    putStrLn $ getCommentText newest

getCommentText commentElem =
    let children = Types.nodeChildren commentElem
    in concatMap unwrap children

unwrap :: Types.Node -> String
unwrap (Types.NodeContent (Types.ContentText s)) = unpack s
unwrap e = unwrap $ head $ Types.nodeChildren e

My Haskell clocs in at 25 lines, although if you remove the unused language extensions it comes down to 21 (the other four are only in there because they're "go-to" extensions for me). So 21 is a fairer count. If you don't count imports as lines of code, it's 13.

Writing this was actually not terribly difficult; of the 5 or so hours I probably put into it in the end, about 90% was spent struggling with package management (the worst aspect of Haskell). In the end I resorted to Stack, even though this is a single-file script that should be able to compile with just ghc.

I'm proud of my work though, and thought it reflected fairly well on a language to do this so concisely. My enthusiasm dropped a bit when I wrote a Python solution:

import requests
from bs4 import BeautifulSoup

file = requests.get("https://fakenous.net").text

dom = BeautifulSoup(file, features='html.parser')
recentcomments = dom.find(id = 'recentcomments')
print(''.join(list(recentcomments.children)[0].strings))

6 lines to Haskell's 21, or 4 to 13. Damn. I'm becoming more and more convinced nothing will ever displace my love for Python.

Of course, you can attribute some of Haskell's relative size to having an inferior library, but still.

Here's a Go solution:

package main

import (
    "fmt"
    "net/http"

    "github.com/ericchiang/css"
    "golang.org/x/net/html"
)

func main() {
    var resp, err = http.Get("https://fakenous.net")
    must(err)
    defer resp.Body.Close()
    tree, err := html.Parse(resp.Body)
    must(err)
    sel, err := css.Compile("#recentcomments > *:first-child")
    must(err)
    // It will only match one element.
    for _, elem := range sel.Select(tree) {
        var name = elem.FirstChild
        var on = name.NextSibling
        fmt.Printf("%s%s%s\n", unwrap(name), unwrap(on), unwrap(on.NextSibling))
    }

}

func unwrap(node *html.Node) string {
    if node.Type == html.TextNode {
        return node.Data
    }
    return unwrap(node.FirstChild)
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}

32 lines, including imports. So at least Haskell came in shorter than Go. I'm proud of you, Has- oh never mind, that's not a very high bar to clear.

It would be reasonable to object that the Python solution is so brief because it doesn't need a main function, but in real Python applications you generally still want that. But even if I modify it:

import requests
from bs4 import BeautifulSoup

def main():
    file = requests.get("https://fakenous.net").text
    dom = BeautifulSoup(file, features='html.parser')
    recentcomments = dom.find(id = 'recentcomments')
    print(''.join(list(recentcomments.children)[0].strings))

if __name__ == '__main__': main()

It only clocs in at 8 lines, including imports.

An alternate version of the Go solution that doesn't hardcode the number of nodes (since the Python and Haskell ones don't):

package main

import (
    "fmt"
    "net/http"

    "github.com/ericchiang/css"
    "golang.org/x/net/html"
)

func main() {
    var resp, err = http.Get("https://fakenous.net")
    must(err)
    defer resp.Body.Close()
    tree, err := html.Parse(resp.Body)
    must(err)
    sel, err := css.Compile("#recentcomments > *:first-child")
    must(err)
    // It will only match one element.
    for _, elem := range sel.Select(tree) {
        fmt.Printf("%s\n", textOfNode(elem))
    }

}

func textOfNode(node *html.Node) string {
    var total string
    var elem = node.FirstChild
    for elem != nil {
        total += unwrap(elem)
        elem = elem.NextSibling
    }
    return total
}

func unwrap(node *html.Node) string {
    if node.Type == html.TextNode {
        return node.Data
    }
    return unwrap(node.FirstChild)
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}

Though it ends up being 39 lines.

Maybe Python's lead would decrease if I implemented the second half, having the scripts save the last comment they found in a file, read it on startup, and update if it's different and notify me somehow (email could be an interesting test). I doubt it, but if people like this post I'll finish them.
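That second half could look roughly like the following in Python. This is a minimal sketch, not the finished scripts: the state file name `last_comment.txt` and the function names `check_for_new_comment` and `notify` are made up for illustration, and the notification is just a print where an email would go.

```python
from pathlib import Path

# Hypothetical state file; the name is an assumption for this sketch.
STATE_FILE = Path("last_comment.txt")


def check_for_new_comment(latest, state_file=STATE_FILE):
    """Return True and update the state file if `latest` differs from the saved comment."""
    # On the very first run there is no state file, so anything counts as new.
    previous = state_file.read_text(encoding="utf-8") if state_file.exists() else None
    if latest == previous:
        return False
    state_file.write_text(latest, encoding="utf-8")
    return True


def notify(comment):
    # Placeholder: a real script might send an email here instead.
    print(f"New comment: {comment}")
```

The scraper's output string would be passed in as `latest`, and `notify(latest)` called only when `check_for_new_comment(latest)` returns True.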

Edit: I finished them.

Posted on Apr 1 by:

Ryan Westlund

@yujiri8

I'm a programmer, writer, and philosopher. My GitHub account is yujiri8; all my content besides code is at yujiri.xyz.

Discussion

A Haskell one-liner:

(toListOf $ responseBody . to (decodeUtf8With lenientDecode) . html . allAttributed (folded . only "recentcomments") . children) <$> (get "https://fakenous.net")
 

Can you give some context for this? When I plug it in, even with all the imports I used, almost everything in there is undefined.

I'm ready with the full version that does the saving and emailing me for all three languages, but I'm holding off on posting now because I don't want to finalize if the Haskell can be improved by that much.

 

Apologies, I should have thought of this earlier. Anyway, adding more details:

-- file : Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Lens (to, only, toListOf, folded)
import Data.Text.Encoding.Error (lenientDecode)
import Data.Text.Lazy.Encoding (decodeUtf8With)
import Network.Wreq (responseBody, get)
import Text.Taggy.Lens (html, children, allAttributed)

main = (toListOf $ responseBody . to (decodeUtf8With lenientDecode) . html . allAttributed (folded . only "recentcomments") . children) <$> (get "https://fakenous.net") >>= print 

The dependencies can be put in a dev-to.cabal file:

-- dev-to.cabal
cabal-version:       2.4
name:                dev-to
version:             0.1.0.0
license-file:        LICENSE
author:              Providence Salumu
maintainer:          Providence <dot> Salumu <at> smunix <dot> com
extra-source-files:  CHANGELOG.md

executable dev-to
  main-is:             Main.hs
  build-depends:       base ^>=4.13.0.0
                     , lens
                     , bytestring
                     , http-client
                     , text
                     , taggy
                     , taggy-lens
                     , wreq
  default-language:    Haskell2010

Doing the saving and emailing you would be a simpler addition.

You can clone my repo from github.com/smunix/dev-to

Ah. Still, that doesn't seem to be a complete solution. I ran it with cabal run and the output is the object:

[[NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Gerardo"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1130#comment-1909")], eltChildren = [NodeContent "The Failings of Analytic Philosophy"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://yujiri.xyz"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "Yujiri"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1908")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Paul Lake"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=327#comment-1907")], eltChildren = [NodeContent "Studies in Irrationality: Marxism"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Dave"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1905")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList 
[("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","http://www.daviddfriedman.com"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "David Friedman"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1674#comment-1904")], eltChildren = [NodeContent "Do Religious People Believe Religion?"]})]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Gerardo"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1130#comment-1909")], eltChildren = [NodeContent "The Failings of Analytic Philosophy"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://yujiri.xyz"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "Yujiri"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1908")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Paul Lake"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=327#comment-1907")], eltChildren = [NodeContent "Studies in Irrationality: Marxism"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Dave"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1905")], 
eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","http://www.daviddfriedman.com"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "David Friedman"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1674#comment-1904")], eltChildren = [NodeContent "Do Religious People Believe Religion?"]})]]

Instead of the text.

I also wouldn't consider that one line. If I were to really use that code, I'd certainly break it into 2-4. Still, it is an impressive improvement! I'll have to look more into those libraries.
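The missing step is flattening that tree into text. The idea is sketched here in Python rather than Haskell to keep it short, with plain strings standing in for `NodeContent` and dicts standing in for `NodeElement`. The dict layout is a stand-in invented for illustration, not any library's API.

```python
def node_text(node):
    """Concatenate all text inside a node, depth-first."""
    if isinstance(node, str):  # stands in for NodeContent
        return node
    # Stands in for NodeElement: recurse into every child, not just the first.
    return "".join(node_text(child) for child in node["children"])


# Hand-built copy of the first <li> from the output above.
first_comment = {
    "name": "li",
    "children": [
        {"name": "span", "children": ["Gerardo"]},
        " on ",
        {"name": "a", "children": ["The Failings of Analytic Philosophy"]},
    ],
}

print(node_text(first_comment))  # Gerardo on The Failings of Analytic Philosophy
```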

 

One may also argue that your Python code uses the BeautifulSoup library, which has already done the hard work of parsing the HTML/XML for you!

(though in fairness, I don't know enough about Haskell or Go to comment on how "bare metal" those pieces of code are).

 

True, Beautiful Soup seems much higher-level than the other libraries. Though I am using at least two non-standard libraries in every language. Go has great high-level HTTP in the stdlib but needed 2 HTML/traversal libraries just to get there. For Haskell I'm using 5 libraries: req, html-conduit, dom-selector, xml-conduit and xml-types. There might be a way to cut down on those, but I really couldn't find it, because some of those libraries are just like 'provides HTML helpers for XML types' or something.

 

I would recommend Colly (github.com/gocolly/colly) to get a better comparison since you are using BeautifulSoup for Python. Both scraper libraries have superb APIs.

 

Wow! I didn't know about that library. That does much more for me here than even BeautifulSoup! New and improved Go version:

package main

import (
    "fmt"

    "github.com/gocolly/colly"
)

func main() {
    var col = colly.NewCollector()
    col.OnHTML("#recentcomments > *:first-child", func(e *colly.HTMLElement) {
        fmt.Println(e.Text)
    })
    col.Visit("https://fakenous.net")
}

That gets it down to about the same number of "meaningful" lines as Python. Technically I could drop 2 more lines by putting the function inline, but I wouldn't do that IRL.

 

Would love to see more posts like this!

 

Great post, haven't tried Haskell yet, looks interesting. Can you do a performance test on each version? The LOC is surely a factor but knowing the performance would be even better.

 

Python seems to average about 1.6 seconds. The first run was 3 seconds which is probs because of filesystem caching or TLS resumption. Go is averaging about 1.25 and Haskell about 1.35.

I don't think performance really means much here though, because on such a short program, factors like the time to start the interpreter and parse source code, write to the console, etc, are much more significant than they should be. The Haskell binary dynamically loads 11 system libraries while the Go binary only loads 2 dynamically, and that might account for the speed difference there. I've heard dynamic linking increases startup costs.
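For anyone wanting to reproduce numbers like these, one rough way to average wall-clock time over several runs is sketched below. This is not how the figures above were measured; the function name `average_runtime` is a placeholder. The timing includes process startup, which is exactly the caveat just mentioned.

```python
import subprocess
import sys
import time


def average_runtime(command, runs=5):
    """Run `command` `runs` times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard the program's output so console writes don't skew the timing.
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

It would be called with the command for each version in turn, for example `average_runtime([sys.executable, "scraper.py"])`, where `scraper.py` is a hypothetical file name.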

 

I was expecting performance write up.
Lol