DEV Community

loading...

Learning F# — Web Scraping With F# Data

citizen428 profile image Michael Kohl Originally published at citizen428.net ・4 min read

In my ongoing efforts to learn F# properly, I recently stumbled upon the F# Data library, which implements type providers and other useful tools for working with data in CSV, HTML, JSON and XML formats.

To try it out I hacked together a quick .fsx script which uses the library's HTML parser to scrape the top three F# posts from one of my favorite programming websites, dev.to. Please note that the site does have an API, which would be the preferred way of doing this, the following is for demonstration purposes only.

Simple Scraping

The first step is to reference the downloaded Fsharp.Data assembly relative to the script's directory and importing its contents with the open keyword:

#r "fsharp-data/lib/net45/FSharp.Data.dll"

open FSharp.Data
Enter fullscreen mode Exit fullscreen mode

Next we can use HtmlDocument.Load to fetch and parse dev.to's top F# posts page from which we extract the individual articles via the CssSelect method and the appropriate CSS selector:

let doc = HtmlDocument.Load("https://dev.to/t/fsharp/top/infinity")
let posts = doc.CssSelect(".single-article")
Enter fullscreen mode Exit fullscreen mode

We then iterate over the posts, extract the title, author and reaction count and take the top three posts (which are already sorted by number of reactions):

let top3 =
    posts
    |> Seq.map(fun post ->
        post |> firstChildText ".content h3",
        post |> firstChildText "h4 > a" |> cleanName,
        post |> firstChildText ".reactions-count .engagement-count-number")
    |> Seq.take 3
Enter fullscreen mode Exit fullscreen mode

This uses two small helpers, firstChildText which is a little convenience function to smooth over the fact that CssSelect doesn't yet support the :fist-child pseudo-selector, and cleanName which removes unnecessary characters around the author's name:

let firstChildText selector (post : HtmlNode) =
    post.CssSelect(selector).[0].DirectInnerText().Trim()

let cleanName (name : string) = name.Replace("・", "")
Enter fullscreen mode Exit fullscreen mode

Lastly, we use a simple for loop with pattern matching to output the results:

for (title, author, reactions) in top3 do
    printf "\"%s\" (%s): %s\n" title author reactions
Enter fullscreen mode Exit fullscreen mode

Running this on my Mac with .NET Core produces the following output (which hopefully will include this post in the future 😉):

$ fsharpi --exec devto.fsx
"F# for JS Devs" (Jason): 84
"Mere Functional Programming in F#" (Kasey Speakman): 65
"F# is Pretty Cool" (Ben Lovy): 51
Enter fullscreen mode Exit fullscreen mode

The full script can be found in this gist.

Asynchronous scraping

After I finished the first version of this script, I thought that this could also be a good opportunity to finally explore F#'s async programming model. So I decided to write a second version which fetches the top posts for F#, Elm, and Haskell, and displays the top five across all three tags. The overall code is quite similar to the previous script, so I'll skip over all the parts I reused. The full script is again available in a gist.

To generate a sequence of the top five posts, the following function is used:

let top5 =
    ["fsharp"; "elm"; "haskell"]
    |> getPosts
    |> Seq.collect(fun posts ->
        posts
        |> Seq.map(fun post ->
            post |> firstChildText ".content h3",
            post |> firstChildText "h4 > a" |> cleanName,
            post |> firstChildText ".reactions-count .engagement-count-number"))
    |> Seq.sortBy(fun (_, _, score) -> -(int score))
    |> Seq.take 5
Enter fullscreen mode Exit fullscreen mode

This invokes the getPosts helper, which uses Async.Parallel to execute three calls to fetchTagAsync in parallel, as well as Async.RunSynchronously to wait for all of them to finish:

let getPosts tags =
    tags
    |> List.map fetchTagAsync
    |> Async.Parallel
    |> Async.RunSynchronously
Enter fullscreen mode Exit fullscreen mode

fetchTagAsync is relatively simple: inside an async workflow it first constructs a URL for the given tag, then uses an asynchronous let! binding to wait for the result of the computation, before returning the posts:

let fetchTagAsync tag =
    async {
        let url = sprintf "https://dev.to/t/%s/top/infinity" tag
        let! doc = HtmlDocument.AsyncLoad(url)
        return doc.CssSelect(".single-article")
    }
Enter fullscreen mode Exit fullscreen mode

When executing this it produces the following result:

$ fsharpi --exec devto.fsx
"What the heck is polymorphism?" (Jan van Brügge): 212
"Tour of an Open-Source Elm SPA" (Richard Feldman): 140
"How I (Finally) Built an App in Elm" (Ali Spittel): 110
"F# for JS Devs" (Jason): 84
"Elm 0.19 brings better collections" (Robin Heggelund Hansen): 78
Enter fullscreen mode Exit fullscreen mode

Summary

Despite only having started seriously looking at F# around 2 months ago, I already feel quite productive in it. Its pragmatic support for functional and object programming (but not object-oriented, see Don Syme's talk "F# Code I Love"), as well as the good standard library, make for a pleasant developer experience and the language really lends itself to solving problems in terms of data and transformations.

Discussion (5)

pic
Editor guide
Collapse
anjankant profile image
Anjan Kant

Here is an article which explains 10 easy steps (dev.to/anjankant/visual-studio-lea...) to scrape whole website.

Collapse
citizen428 profile image
Michael Kohl Author

This was more about Fsharp.Data than it was about scraping per se.

Collapse
anjankant profile image
Anjan Kant

Okay! But good overview to introduce F#,I never used it but it has good capabilities then can use. I scraped so many websites but with c#, htmlagilitypack, LINQ,regex all this blend. Really I enjoyed all these :)

Thread Thread
citizen428 profile image
Michael Kohl Author

My go to scraping solution is Python + Scrapy. Really nice, especially with the hosting options at ScrapingHub.

Thread Thread
anjankant profile image
Anjan Kant

If I,LL take time then definitely try with python (very known for web scraping according google search), I think python also work around DOM scraping, I don't know how fast and efficient it's. But c# has very good capabilities with htmlagilitypack, LINQ, Regex and few more c# features. In starting took time but later settled down smoothly.