DEV Community

Discussion on: Comparing the same web scraper in Haskell, Python, Go

smunix
Providence Salumu

A Haskell one-liner:

(toListOf $ responseBody . to (decodeUtf8With lenientDecode) . html . allAttributed (folded . only "recentcomments") . children) <$> (get "https://fakenous.net")
yujiri8
Ryan Westlund Author

Can you give some context for this? When I plug it in, even with all the imports I used, almost everything in there is undefined.

I have the full version ready for all three languages, the one that saves the results and emails me, but I'm holding off on posting it because I don't want to finalize the comparison if the Haskell can be improved by that much.

smunix
Providence Salumu

Apologies, I should have thought of this earlier. Anyway, adding more details:

-- file : Main.hs
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Control.Lens (to, only, toListOf, folded)
import Data.Text.Encoding.Error (lenientDecode)
import Data.Text.Lazy.Encoding (decodeUtf8With)
import Network.Wreq (responseBody, get)
import Text.Taggy.Lens (html, children, allAttributed)

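-- Fetch the page, decode it leniently as UTF-8, parse the HTML, and print the
-- children of every element that has an attribute value "recentcomments".
main :: IO ()
main = (toListOf $ responseBody . to (decodeUtf8With lenientDecode) . html . allAttributed (folded . only "recentcomments") . children) <$> (get "https://fakenous.net") >>= print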

The dependencies can be put in a dev-to.cabal file:

-- dev-to.cabal
cabal-version:       2.4
name:                dev-to
version:             0.1.0.0
license-file:        LICENSE
author:              Providence Salumu
maintainer:          Providence <dot> Salumu <at> smunix <dot> com
extra-source-files:  CHANGELOG.md

executable dev-to
  main-is:             Main.hs
  build-depends:       base ^>=4.13.0.0
                     , lens
                     , bytestring
                     , http-client
                     , text
                     , taggy
                     , taggy-lens
                     , wreq
  default-language:    Haskell2010

Adding the saving and the emailing on top of this would be a simple extension.

You can clone my repo from github.com/smunix/dev-to

yujiri8
Ryan Westlund Author

Ah. Still, that doesn't seem to be a complete solution. I ran it with cabal run and the output is the object:

[[NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Gerardo"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1130#comment-1909")], eltChildren = [NodeContent "The Failings of Analytic Philosophy"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://yujiri.xyz"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "Yujiri"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1908")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Paul Lake"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=327#comment-1907")], eltChildren = [NodeContent "Studies in Irrationality: Marxism"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList [("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Dave"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1905")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})]}),NodeElement (Element {eltName = "li", eltAttrs = fromList 
[("class","recentcomments")], eltChildren = [NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","http://www.daviddfriedman.com"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "David Friedman"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1674#comment-1904")], eltChildren = [NodeContent "Do Religious People Believe Religion?"]})]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Gerardo"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1130#comment-1909")], eltChildren = [NodeContent "The Failings of Analytic Philosophy"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://yujiri.xyz"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "Yujiri"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1908")], eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Paul Lake"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=327#comment-1907")], eltChildren = [NodeContent "Studies in Irrationality: Marxism"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeContent "Dave"]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1704#comment-1905")], 
eltChildren = [NodeContent "How Can You Put a Price on Human Life?"]})],[NodeElement (Element {eltName = "span", eltAttrs = fromList [("class","comment-author-link")], eltChildren = [NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","http://www.daviddfriedman.com"),("rel","external nofollow ugc"),("class","url")], eltChildren = [NodeContent "David Friedman"]})]}),NodeContent " on ",NodeElement (Element {eltName = "a", eltAttrs = fromList [("href","https://fakenous.net/?p=1674#comment-1904")], eltChildren = [NodeContent "Do Religious People Believe Religion?"]})]]

Instead of the text.

I also wouldn't consider that one line. If I were to really use that code, I'd certainly break it into 2-4. Still, it is an impressive improvement! I'll have to look more into those libraries.
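
For the missing last step, going from the printed node structure to plain text, here is a minimal sketch. It assumes only the shape visible in the output above: a node is either text content or an element with children. The `Node` and `Element` types below are local stand-ins so the example compiles with `base` alone (the real ones come from the taggy package and use `Text` and a hash map), and `nodeText` and `sample` are hypothetical names, not part of any library.

```haskell
module Main where

-- Local stand-ins mirroring the shape of the printed output above.
-- Assumption: the real taggy types have the same constructors and fields.
data Node
  = NodeContent String
  | NodeElement Element

data Element = Element
  { eltName     :: String
  , eltAttrs    :: [(String, String)]
  , eltChildren :: [Node]
  }

-- Concatenate every text fragment under a node, depth first.
nodeText :: Node -> String
nodeText (NodeContent t) = t
nodeText (NodeElement e) = concatMap nodeText (eltChildren e)

-- One "recentcomments" list item, copied from the output above.
sample :: Node
sample =
  NodeElement (Element "li" [("class", "recentcomments")]
    [ NodeElement (Element "span" [("class", "comment-author-link")]
        [NodeContent "Gerardo"])
    , NodeContent " on "
    , NodeElement (Element "a" [("href", "https://fakenous.net/?p=1130#comment-1909")]
        [NodeContent "The Failings of Analytic Philosophy"])
    ])

main :: IO ()
main = putStrLn (nodeText sample)
```

Running this prints `Gerardo on The Failings of Analytic Philosophy`. Mapping a function like `nodeText` over each list item would give one line per recent comment, which is closer to what the other versions print.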