Parallel processing is an exciting topic, especially for heavy processes that can be split into smaller tasks, executed in isolation, aggregated, and presented as a single outcome.
Clojure provides several ways to accomplish this kind of job. Futures are a handy approach for splitting a big task across parallel threads that are eventually gathered by the process that started them, which aggregates their results to produce the same outcome faster.
Talking Futures
You can think of a future as a piece of code that runs on its own, independent thread. More specifically, future is a Clojure macro that takes a set of expressions and executes them in another thread. The macro returns a reference to the triggered future, which allows us to communicate with it via some helper functions:
- future?: Verifies the provided argument is a future.
- future-done?: Takes a future as argument and verifies whether it has been completed.
- future-cancel: Attempts to cancel the given future.
- future-cancelled?: Verifies if the future passed as argument was cancelled.
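A quick REPL session illustrates these helpers; the Thread/sleep is just a stand-in for real work:

```clojure
;; A future that simulates slow work before yielding 42.
(def answer (future (Thread/sleep 100) 42))

(future? answer)           ;; => true
(future-done? answer)      ;; => false while the sleep is still running
(deref answer)             ;; blocks until the future finishes => 42
(future-done? answer)      ;; => true
(future-cancel answer)     ;; => false: it already completed
(future-cancelled? answer) ;; => false
```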
The result of a future is cached once it completes. We can attempt to get its outcome even before it has finished by calling (deref my-future) (or the reader shorthand @my-future). In that case, the thread doing the deref blocks until the future is done.
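A small example of that blocking behavior; note that deref also accepts a timeout in milliseconds and a fallback value, a handy alternative to blocking indefinitely:

```clojure
(def slow (future (Thread/sleep 500) :done))

;; Blocks the calling thread for roughly 500 ms, then yields :done.
(deref slow)

;; The 3-argument form waits at most 100 ms and returns the fallback
;; value :timed-out if the future has not finished by then.
(deref (future (Thread/sleep 5000) :late) 100 :timed-out)
```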
Getting the Stargazers from GitHub
We are interested in getting the total number of stars that Clojure has on GitHub. That is, the sum of stars across every repository that belongs to the Clojure account.
This information can be readily fetched from the GitHub API. The GET /users/:username/repos endpoint already responds with the star count of each repository, so we could just loop over the returned JSON array, pick the star counts, and sum them up.
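Taking the listing endpoint at face value, a single request would do. A minimal sketch, assuming the clj-http and cheshire libraries (the article does not show which HTTP or JSON libraries it uses):

```clojure
(ns stars.direct
  (:require [clj-http.client :as http]
            [cheshire.core :as json]))

(defn total-stars
  "Pure helper: sums the stargazers_count of already-parsed repo maps."
  [repos]
  (reduce + (map :stargazers_count repos)))

(defn sum-stars-direct
  "One request: list the repos and sum the star counts they carry."
  [username]
  (let [url  (str "https://api.github.com/users/" username "/repos")
        body (:body (http/get url))]
    (total-stars (json/parse-string body true))))
```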
For the sake of this article, let's assume the endpoint above does not return star counts, only the names of the repositories, so that we have to hit the GET /repos/:owner/:repo endpoint to get each star count. This means that for each repository name, a request to GET /repos/:owner/:repo will be launched.
Getting the stars from each repository this way is a sequential, synchronous task, and each request might take a considerable amount of time to complete. Once all requests have responded successfully, the reduce function comes into play to sum up all the stars.
The following function, get-star-count, executes the HTTP request synchronously to fetch the repository details and then extracts the stargazers_count attribute from the response, which holds the star count.
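A minimal sketch of such a function, assuming clj-http for HTTP and cheshire for JSON parsing (the original listing and its namespaces are not shown, so these library choices are an assumption):

```clojure
(ns stars.github
  (:require [clj-http.client :as http]
            [cheshire.core :as json]))

(defn star-count
  "Pure helper: pulls the star count out of a parsed repo payload."
  [repo-json]
  (:stargazers_count repo-json))

(defn get-star-count
  "Synchronously fetches one repository and returns its star count."
  [owner repo]
  (let [url (str "https://api.github.com/repos/" owner "/" repo)]
    (-> (http/get url) :body (json/parse-string true) star-count)))
```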
Now, let’s take a look at the sum-stars function. It provides two arities: one that takes only the username, and another that also requires a sequence of repository names:
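A self-contained sketch of those two arities. The repo-names and get-star-count collaborators are stubbed here so the snippet runs on its own; in the real code they perform the HTTP calls described above:

```clojure
;; Hypothetical stubs standing in for the real HTTP-backed functions.
(defn repo-names [username] ["clojure" "clojurescript" "core.async"])
(defn get-star-count [username repo] 1)

(defn sum-stars
  ;; 1-arity: one extra call to fetch the repo names, then delegate.
  ([username]
   (sum-stars username (repo-names username)))
  ;; 2-arity: fetch every repo synchronously, one at a time, then sum.
  ([username repos]
   (reduce + (map #(get-star-count username %) repos))))
```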
Note that the 1-arity version calls the 2-arity one, passing the result of github/repo-names as the second argument (repos). That involves an additional HTTP call. We can keep that call from interfering with the timing of sum-stars by providing a predefined sequence of repository names, or by storing the result of github/repo-names in a var and passing it to the function.
Each repository is passed to get-star-count, blocking the main thread until every single request finishes synchronously. The results are then aggregated by reduce.
Futures in Action
Now we are going to implement the same functionality using futures. This time, for each repository, a future that executes GET /repos/:owner/:repo will be launched.
Every HTTP request now runs in parallel. The futures are triggered with the future macro; once each one has its response, a list comprehension lets them meet again so they can be reduced to a single result.
Note that some requests might take longer than others, depending on network throughput, the latency of the GitHub server, or even the speed of the processor running the threads. The worst case for this scenario is therefore the slowest request: its duration determines how long sum-stars takes to complete.
The 1-arity function remains pretty much the same as in the synchronous version. All the repos of the given username are passed in to the 2-arity function.
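A condensed, self-contained sketch of the 2-arity version, with the HTTP call stubbed out so it runs standalone (the 1-arity wrapper would delegate exactly as in the synchronous version). The doall is a deliberate choice: it forces the lazy for so every future starts before we begin waiting on results:

```clojure
;; Hypothetical stub for the real HTTP call; the sleep mimics latency.
(defn get-star-count [username repo]
  (Thread/sleep 100)
  1)

(defn sum-stars-async
  [username repos]
  ;; Launch one future (one thread) per repository. doall forces the
  ;; lazy `for`, so all requests are in flight before any deref.
  (let [futures (doall
                 (for [repo repos]
                   (future (get-star-count username repo))))]
    ;; Block on each future in turn and sum the star counts.
    (reduce + (map deref futures))))
```

With five stubbed repositories at 100 ms each, the parallel version finishes in roughly 100 ms instead of the 500 ms a sequential loop would take.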
The interesting part is the for expression, which builds a list comprehension of futures: a sequence of futures, each one a handle to a thread working independently. Having this sequence lets us communicate with the futures using the helper functions explained earlier, or simply deref them to get their results.
The aggregation pipeline passes the sequence of futures to map with deref. This blocks the main thread until every parallel thread is done. Once derefed, the sequence of futures becomes a sequence of responses with the actual star counts, which are then handed to the reduce function that simply sums them up.
Benchmarking
Let’s see how fast the sync and async versions perform. To do that, we are going to use the handy time macro that is part of clojure.core.
Considerations
- Executed on MacBook Pro, macOS Catalina 10.15.1, 2.2GHz 6-Core Intel i7, 16 GB 2400 MHz DDR4.
- The Clojure account on Github is used for the tests.
- github/repo-names is limited to return 30 repos, even though the Clojure account owns 85+.
- Based on the point above, the tests consider 30 repositories from the Clojure account.
```clojure
user=> (def username "clojure")
user=> (def repos (github/repo-names username))
user=>
user=> (time (core/sum-stars username repos))
"Elapsed time: 13841.095288 msecs"
20263
user=>
user=> (time (core/sum-stars-async username repos))
"Elapsed time: 587.558128 msecs"
20263
```
As you can see, both versions return the same result, 20263, which represents the total stars of the first 30 Clojure repositories on GitHub. But the elapsed time for the synchronous version is almost 14 seconds, versus roughly 588 ms for the asynchronous one.
Conclusion
Parallel processing provides a powerful way to split a heavy task into smaller, independent problems whose results can be aggregated once they are all done. Futures in Clojure are an easy-to-use way to accomplish this kind of task.
Next time you are dealing with a function that takes too long to complete, consider using futures.
Do not forget to clone and play around with the code of this article.
Happy coding!
The cover image is a Photo by Jamie Street on Unsplash