Haskell/Python/Go web scraper comparison completed

Ryan Westlund ・5 min read

Yesterday I posted Comparing the same web scraper in Haskell, Python, Go and the post was well-liked, despite my comparisons not being all that fair (the Python HTML library I used was higher-level than the Haskell and Go ones, and commenters pointed out alternatives that would significantly improve Haskell and Go). So here I am with the completed version of the task: the script should save the most recent comment to a file, read that on startup, and email me if it's different. It must also fetch the most recent post as well as the most recent comment. The email includes each one only if it has changed, and the subject says in natural English which ones changed.
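The notification rule described above can be sketched as a small pure function. This is only an illustration in Python, not the code from any of the three solutions below; the name `build_notification` and its return shape are assumptions made for the example.

```python
def build_notification(old_post, new_post, old_comment, new_comment):
    """Return (subject, body) for the email, or None if nothing changed.

    Hypothetical helper: the three scripts in this post inline this logic.
    """
    post_changed = old_post != new_post
    comment_changed = old_comment != new_comment
    if not (post_changed or comment_changed):
        return None
    parts, lines = [], []
    if post_changed:
        parts.append("Post")
        lines.append(f"New post: {new_post}")
    if comment_changed:
        parts.append("Comment")
        lines.append(f"New comment: {new_comment}")
    # The subject names exactly the items that changed.
    subject = f"New {' and '.join(parts)} on Fake Nous"
    return subject, "\n".join(lines)
```

Keeping the decision separate from the fetching and mailing makes all four cases (nothing, post only, comment only, both) easy to check by hand.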

I'll be using the improved Go approach here. I'm not using the improved Haskell approach because it depended on lenses, a functional programming concept I don't grok yet. (The journey to learn Haskell seems to never end...) So we'll bear in mind that my Haskell solution is selling it short.

I also feel like I should say now that this isn't supposed to be an "obviously Python > Haskell > Go" post. While I do think conciseness is one of the most important traits (and do have that order of liking for the languages), I understand there are other important traits that aren't being reflected here.

Without further ado, here's the Haskell solution I came up with:

{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE LambdaCase #-}

import Network.HTTP.Req
import qualified Text.HTML.DOM as DOM
import qualified Text.XML.Cursor as Cursor
import qualified Text.XML.Selector as Selector
import qualified Data.XML.Types as Types
import qualified Text.XML as XML
import Data.Text (Text, unpack)
import Control.Exception
import Control.Monad
import System.IO.Error
import System.Process
import System.Exit

main = do
    resp <- runReq defaultHttpConfig $ req GET (https "fakenous.net") NoReqBody lbsResponse mempty
    let dom = Cursor.fromDocument $ DOM.parseLBS $ responseBody resp
        newPostElem = XML.toXMLNode $ Cursor.node $ head $
            -- it seems like more selectors after :first-child don't take effect, bug?
            Selector.query "#recent-posts-2 > ul > li:first-child" $ dom
        newCommentElem = XML.toXMLNode $ Cursor.node $ head $
            Selector.query "#recentcomments > li:first-child" $ dom
        -- Got the elements. Now get their text.
        newPost = getElementText $ (Types.nodeChildren $ newPostElem) !! 1
        newComment = getElementText newCommentElem
    -- Read saved last stuff.
    file <- catch (readFile "recent-hs") (\case e | isDoesNotExistError e -> pure " \n ")
    let [oldPost, oldComment] = lines file
    when (oldPost == newPost && oldComment == newComment) $ exitSuccess
    -- Mail.
    let (subjectPiece, msg) = if oldPost /= newPost && oldComment /= newComment
        then ("Post and Comment", "New Post: " ++ newPost ++ "\nNew Comment: " ++ newComment)
        else if oldPost /= newPost
            then ("Post", "New Post: " ++ newPost)
            else ("Comment", "New Comment: " ++ newComment)
    readProcess "mutt" ["user@example.com", "-s", "New " ++ subjectPiece ++ " on Fake Nous"] msg
    -- Write out the new file.
    writeFile "recent-hs" $ newPost ++ "\n" ++ newComment

getElementText commentElem =
    let children = Types.nodeChildren commentElem
    in foldl (++) "" $ unwrap <$> children

unwrap :: Types.Node -> String
unwrap (Types.NodeContent (Types.ContentText s)) = unpack s 
unwrap e = unwrap $ head $ Types.nodeChildren e

39 lines.

As you can see, I'm using mutt to send emails.

Much of this length comes from having a weaker HTML API. If I were using a library like the ones I use for Python and Go, I can imagine this being cut down to 32 or so. If I discount import lines, that would actually pull it ahead of Python.

package main

import (
    "io/ioutil"
    "os"
    "os/exec"
    "strings"

    "github.com/gocolly/colly"
)

func main() {
    var newPost, oldPost, newComment, oldComment string
    var col = colly.NewCollector()
    col.OnHTML("#recent-posts-2 > ul > *:first-child > a", func(e *colly.HTMLElement) {
        newPost = e.Text
    })
    col.OnHTML("#recentcomments > *:first-child", func(e *colly.HTMLElement) {
        newComment = e.Text
    })
    col.Visit("https://fakenous.net")
    fileBytes, err := ioutil.ReadFile("recent-go")
    if !os.IsNotExist(err) {
        must(err)
        var fileLines = strings.Split(string(fileBytes), "\n")
        oldPost, oldComment = fileLines[0], fileLines[1]
        if oldPost == newPost && oldComment == newComment {
            return
        }
    }
    // Send email.
    var subjectPiece = "Post and Comment"
    var msg = "New post: " + newPost + "\nNew comment: " + newComment
    if oldPost == newPost {
        subjectPiece = "Comment"
        msg = "New comment: " + newComment
    } else if oldComment == newComment {
        subjectPiece = "Post"
        msg = "New post: " + newPost
    }
    var cmd = exec.Command("mutt", "user@example.com", "-s", "New "+subjectPiece+" on Fake Nous")
    pipe, err := cmd.StdinPipe()
    must(err)
    _, err = pipe.Write([]byte(msg))
    must(err)
    must(pipe.Close())
    err = cmd.Run()
    must(err)
    // Save new latest.
    err = ioutil.WriteFile("recent-go", []byte(newPost+"\n"+newComment), 0644)
    must(err)
}

func must(err error) {
    if err != nil {
        panic(err)
    }
}

52 lines. This gocolly library is able to replace both the ones I was using before while saving many lines, and its selector support seems to be as powerful as a browser's. It's much more restricted in purpose though.

As is usual for Go, a good 20% of the program is dedicated to error-checking and handling, even though I'm just panicking on every error :(

I feel I should mention in Go's defense that 9 of these lines are just closing braces/parens. If I consider an alternate count excluding imports, I should probably also consider an alternate count excluding "pseudo-blank" lines.

Python solution:

import requests
from bs4 import BeautifulSoup

import os, sys, subprocess

file = requests.get("https://fakenous.net").text
dom = BeautifulSoup(file, features='html.parser')

recent_posts = dom.find(id = 'recent-posts-2').find('ul')
# The first child is '\n'.
new_post = list(recent_posts.children)[1].find('a').string

recent_comments = dom.find(id = 'recentcomments')
new_comment = ''.join(list(recent_comments.children)[0].strings)

try:
    with open('recent-py') as f:
        # splitlines drops the trailing newlines, so values compare directly.
        old_post, old_comment = f.read().splitlines()
        if old_post == new_post and old_comment == new_comment:
            sys.exit()
except FileNotFoundError:
    old_post = old_comment = ''

subject_piece = 'Post and Comment'
msg = f'New post: {new_post}\nNew comment: {new_comment}'
if old_post == new_post:
    subject_piece = 'Comment'
    msg = f'New comment: {new_comment}'
elif old_comment == new_comment:
    subject_piece = 'Post'
    msg = f'New post: {new_post}'

mutt = subprocess.run(['mutt', 'user@example.com', '-s', f'New {subject_piece} on Fake Nous'],
    input = msg.encode('utf8'), check = True)

with open('recent-py', 'w') as f:
    f.write(new_post + '\n' + new_comment + '\n')

A slim 28 lines.

To account for the alternate counts: if you discount imports and the package declaration, the counts become:

  • Haskell - 28 (with the inferior API)
  • Python - 25
  • Go - 42

If you discount pseudo-blank lines, Go reaches 43. (Haskell and Python don't have any pseudo-blank lines.) If you discount both, Go is 34.
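For anyone who wants to reproduce counts like these, here is a rough sketch of a counter in Python. It is not the method I actually used (I counted by hand); `count_lines`, its parameters, and the two sample strings are all made up for the illustration. It skips blank lines and whole-line comments, and can optionally skip import lines and "pseudo-blank" lines such as a lone closing brace.

```python
def count_lines(source, comment_prefix, is_import=None, pseudo_blank=frozenset()):
    """Count non-blank, non-comment lines, with optional extra exclusions.

    is_import: optional predicate applied to each stripped line.
    pseudo_blank: set of stripped lines (e.g. "}") that should not count.
    """
    total = 0
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith(comment_prefix):
            continue
        if is_import is not None and is_import(line):
            continue
        if line in pseudo_blank:
            continue
        total += 1
    return total


# Tiny made-up samples, only to show the calls.
PY_SAMPLE = "import os\n\n# a comment\ndef f():\n    return os.getcwd()\n"
GO_SAMPLE = "func main() {\n    f(func() {\n    })\n}\n"

py_all = count_lines(PY_SAMPLE, "#")
py_no_imports = count_lines(
    PY_SAMPLE, "#", is_import=lambda l: l.startswith(("import ", "from "))
)
go_all = count_lines(GO_SAMPLE, "//")
go_dense = count_lines(GO_SAMPLE, "//", pseudo_blank={"}", ")", "})"})
```

One limitation: the import predicate sees one line at a time, so a multi-line Go `import ( ... )` block would need extra state to skip properly.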

Maybe I should start doing posts like this more often. It was fun, and I think what I learned about the libraries - and the pointer on my next Haskell concept to learn - made it worth the work.

Ryan Westlund

@yujiri8

I'm a programmer, writer, and philosopher. My GitHub account is yujiri8; all my content besides code is at yujiri.xyz.

Discussion


What I suspect is that all three versions traverse an overlapping section of the DOM from scratch for each of the two queries. I've written a Haskell version (using lens) that doesn't do the redundant traversal. I would like to see the corresponding Python and Go memoized traversal versions:

{-# LANGUAGE LambdaCase #-}

import Network.Wreq ( get, responseBody )
import Text.Taggy.Lens ( html, allAttributed, named, children, elements, contents, element )
import Data.Text.Encoding.Error ( lenientDecode )
import Data.Text.Lazy.Encoding ( decodeUtf8With )
import Data.Text ( unpack )
import Control.Lens ( (^..), (^.), ix, to, universe, only )
import Control.Exception ( catch )
import System.IO.Error ( isDoesNotExistError )
import System.Process ( readProcess )
import System.Exit ( exitSuccess )
import Control.Monad ( when )

main = do
    r <- get "https://fakenous.net"
    let markup = decodeUtf8With lenientDecode $ r ^. responseBody
        branch = markup ^.. html . allAttributed (ix "id" . only "widget-area") . children . traverse . element
        q1 = branch ^.. ix 2 . elements . named (only "ul") . children . ix 0 . to universe . traverse . contents
        q2 = branch ^.. ix 1 . elements . named (only "ul") . children . ix 0 . elements . contents
        newComment = unpack $ q1 !! 1 <> q1 !! 0 <> q1 !! 2
        newPost = unpack $ q2 !! 0 
    -- Read saved last stuff.
    file <- catch (readFile "recent-hs") (\case e | isDoesNotExistError e -> pure " \n ")
    let [oldPost, oldComment] = lines file
    when (oldPost == newPost && oldComment == newComment) exitSuccess
    -- Mail.
    let (subjectPiece, msg)
          | oldPost /= newPost && oldComment /= newComment = ("Post and Comment", "New Post: " ++ newPost ++ "\nNew Comment: " ++ newComment)
          | oldPost /= newPost = ("Post", "New Post: " ++ newPost)
          | otherwise = ("Comment", "New Comment: " ++ newComment)
    readProcess "mutt" ["user@example.com", "-s", "New " ++ subjectPiece ++ " on Fake Nous"] msg
    -- Write out the new file.
    writeFile "recent-hs" $ newPost ++ "\n" ++ newComment```

 

Interesting, but I don't think that would be a good change to make in general, since the cost of redundant traversal isn't relevant for the use case.

Also you didn't do triple backquotes so the code shows up unformatted...

 

Thanks for the tip. It's just an interesting case of composability. In Haskell this optimization comes cheap: it means adding a single let binding. I was curious what it would amount to in Go and Python.
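As a rough idea of what the shared traversal could look like in Python, here is a sketch that finds the common ancestor once and then runs both queries beneath it. It uses the standard library's `xml.etree.ElementTree` on a small, well-formed, made-up snippet; the markup in `SAMPLE` and the function name `latest` are assumptions for the example, and only the ids (`widget-area`, `recent-posts-2`, `recentcomments`) come from the code in this thread. Real-world HTML is rarely well-formed XML, so an actual scraper would do the same thing with an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's sidebar markup.
SAMPLE = """
<body>
  <div id="widget-area">
    <section id="recent-posts-2"><ul>
      <li><a href="/p/2">Second post</a></li>
      <li><a href="/p/1">First post</a></li>
    </ul></section>
    <section><ul id="recentcomments">
      <li>Alice on <a href="/p/2#c9">Second post</a></li>
      <li>Bob on <a href="/p/1#c3">First post</a></li>
    </ul></section>
  </div>
</body>
"""


def latest(markup):
    """Return (newest post title, newest comment text) from the sidebar."""
    root = ET.fromstring(markup.strip())
    # Locate the shared ancestor once...
    area = root.find(".//*[@id='widget-area']")
    # ...then run both queries beneath it instead of from the document root.
    post = area.find(".//*[@id='recent-posts-2']/ul/li/a")
    comment = area.find(".//*[@id='recentcomments']/li")
    return post.text, "".join(comment.itertext())
```

Here too the change amounts to a single extra binding (`area`); whether it saves measurable time on a page this small is a separate question.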

 

I find lines to be far less relevant than characters when comparing the lengths of programs written in different languages/styles.

                                     Haskell   Go     Python
As Written                           2199      1621   1235
Strip Comments and Imports           1535      1402   1088
Strip Blank Lines and Indentation    1395      1257   1016

There's not a huge difference here, but this comparison is less favorable to Haskell than the line count.
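For reference, the three rows of the table above could be computed along these lines. This is a sketch under assumptions, not the exact procedure behind those numbers: it only recognises whole-line comments and single-line imports, and `char_counts` and its parameter names are invented for the example.

```python
def char_counts(source, comment_prefix, import_prefixes):
    """Return character counts after progressively stripping the source.

    comment_prefix: string that starts a whole-line comment, e.g. "#".
    import_prefixes: tuple of strings that start an import line.
    """
    lines = source.splitlines()
    # Step 1: drop whole-line comments and import lines.
    kept = [
        l for l in lines
        if not l.strip().startswith(comment_prefix)
        and not l.strip().startswith(import_prefixes)
    ]
    # Step 2: additionally drop blank lines and surrounding whitespace.
    dense = [l.strip() for l in kept if l.strip()]
    return {
        "as_written": len(source),
        "no_comments_or_imports": len("\n".join(kept)),
        "no_blank_lines_or_indentation": len("\n".join(dense)),
    }
```

Calling `char_counts(text, "#", ("import ", "from "))` on a Python file gives one count per table row; the Haskell and Go files would need their own comment and import prefixes.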

Some metrics that are actually interesting to me when comparing implementations are:

  • How long did each take to write initially?
  • How correct were they after the initial implementation?
  • How long did it take to debug until they worked correctly?
  • How much time would it take to explain to a new dev with no experience in the language?
  • How long would it take to review a change by that new dev and feel confident in its correctness?
 

You make a good point. I neglected the importance of line length.

I think a lot of the reason my Haskell has such long lines is that the workarounds that purify non-pure things tend to involve a lot of helpers like runReq and defaultHttpConfig, which don't add as much semantic complexity as an equivalent increase in characters would normally suggest.

I'd say one of the reasons I value line count is that it affects how much code (semantically) I can fit on the screen at once, which is a big factor in how easily I can read and maintain it. Counting by characters also makes it seem better to use very short variable names, which can hurt readability as much as verbosity can, whereas if you count by lines, identifier length isn't a factor (unless it makes you wrap).

Though I have to say that for most of the metrics you mention, Haskell would be disadvantaged just because it's difficult to learn and I'm not a master of it yet. Writing the Haskell version took me a couple hours, but a more seasoned Haskeller could surely have written it as fast as I wrote the others. And I of course spent a few hours just struggling with Haskell package management (I do every time I touch Haskell for anything). As far as writing time, the time to find libraries and read documentation on ones I wasn't already familiar with was also a big factor for both Haskell and Go.

Line count (or character count) on the other hand is less dependent on how experienced I am with the language.

As for the last metric, though, I think Haskell would win on that even despite my lesser experience because it encodes the most information into the type system. With Python I'd have to do the most testing, with Go I'd have to worry about nil and pointers and other gotchas, and with Haskell, if it type checks, I can almost be certain it works.

 

For sending mails you could also use smtp-mail where you just enter your mailserver.
example:

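{-# LANGUAGE OverloadedStrings #-}
import Network.Mail.Mime as Mime
import Network.Mail.SMTP
import Data.String (fromString)
import qualified Data.Text.Lazy as LT
-- smtpHost, smtpPort and content are placeholders for your own values.
sendMail' smtpHost (fromIntegral smtpPort) (Mime.simpleMail' "user@example.com" (fromString "noreply@myserver.internet") "Title" (LT.pack content))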

but don't ask me how to do that in python or go ;)
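For the Python side, a rough equivalent using only the standard library (`smtplib` and `email.message`) might look like the sketch below. The helper names `build_message` and `send` are made up for the example, the addresses are the placeholders from the snippet above, and it assumes a mail server that accepts plain SMTP without authentication or TLS, which many do not.

```python
import smtplib
from email.message import EmailMessage


def build_message(subject, body, sender, recipient):
    """Assemble a plain-text email; nothing is sent here."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def send(msg, host, port=25):
    """Hand the message to an SMTP server (assumed: no auth, no TLS)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

Usage would be something like `send(build_message("Title", content, "noreply@myserver.internet", "user@example.com"), smtp_host)`, where `content` and `smtp_host` are your own values.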


and if you fear lenses: wait until you get to know recursion-schemes & do fixpoint-parsing/traversal of HTML-DOM[1] ;)
I think I have to write a post about that -.-


  1. Oh yes. There is always another rabbit-hole to fall into after the next craziest thing you learn.. :D

 

In the last paragraph, you discount pseudo-blank lines and yet the count for Go goes up instead of down - think there's a typo here in the final number of lines.

I agree with Dhwaneet: I think a performance test or two would be worthwhile, as number of lines is a fairly useless metric when the numbers are as close as this (unless we're talking about competitive programming, that is).

 

> In the last paragraph, you discount pseudo-blank lines and yet the count for Go goes up instead of down - think there's a typo here in the final number of lines.

I don't think so. The 42 is discounting only imports and the package declaration, and the 43 is discounting only pseudo-blank lines. 34 is the count for discounting both.

I didn't count performance because I don't think performance matters for this job. And I don't think the numbers are very close - 25 to 34 is a 36% increase, and while pseudo-blank lines don't add anything semantically, they do take up visual space.

But since there's such a demand for performance testing, I'll do that next time I do a comparison :)

 

If you add a performance comparison too, it would be really good. My guess: Go would blow the others away.

 

I didn't do performance comparison this time because I don't think it matters much for this task, and because the vast majority of the running time for all three is waiting on the network (and because the time to start the interpreter or load shared libraries and stuff would probably account for a big part of the difference). Though it's true, I did try out the half-complete version and Go was the fastest (but the difference between it and Haskell was really small). I could pick out something performance sensitive next time?
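One way to check the "mostly waiting on the network" claim is to time the fetch and the parsing separately. Below is a minimal sketch in Python; `timed` and `profile` are hypothetical helpers, and the `fetch` and `parse` callables are whatever the script already does for those two phases. It measures wall-clock time inside the process only, so interpreter startup and library loading are not included.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def profile(fetch, parse):
    """Time the fetch and parse phases separately.

    fetch: zero-argument callable returning the page.
    parse: callable taking the page and returning the extracted data.
    """
    page, fetch_s = timed(fetch)
    data, parse_s = timed(parse, page)
    return data, {"fetch": fetch_s, "parse": parse_s}
```

If the `fetch` figure dwarfs the `parse` figure in every language, comparing the three on this task says more about the network than about the languages.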