With a personal life that gets in the way of writing, I've been on a short hiatus, but I do plan to return to my overall story. In the meantime, I was playing with Clojure transducers last night and thought I'd write some notes.
Sudoku
During a discussion on query algebra for graph databases with a friend, he mentioned constraint propagation and pointed me at Peter Norvig's Sudoku Solver. I'm a little rusty on Python so I worked my way through it, which was a fun thing to do over my New Year break.
As an advocate of Functional Programming (FP), the statefulness of the solution made me feel particularly uncomfortable, though it did not look to be too tied to the solution. I was tempted to rewrite it in more familiar terms until I saw that the post includes a link to Justin Kramer's Clojure version (thank you, Justin!).
Short Circuits
As a sometimes-lazy functional language, Clojure has many operations for processing sequences of data. Using these as much as possible is idiomatic for Clojure, as it tends to isolate concerns (reducing the likelihood of bugs), creates abstractions that are favorable for refactoring, and on occasion gives libraries the latitude to improve performance. So I often look for opportunities to use sequence processing, even when the implementation may be less obvious. That doesn't mean I would necessarily do this in production code, but the exploration is pedagogical, and occasionally leads me to new insights.
One aspect of Norvig's code is that it terminates processing early when it discovers an inconsistency in the data (e.g. 7 may be the only option left for this square, but it is already used elsewhere in the same column). This allows the algorithm to move to the next possibility. To do this, instead of returning an updated data structure from the search functions, Norvig's code returns the value False
. (Cue the howls from advocates of statically typed languages. After all, he doesn't even use an Option monad! 😉).
In general, Clojure's sequence processing does not have an option to terminate early. Of course, exceptions can always be thrown, but this is not really appropriate for handling expected data, and it is definitely not "good form" for FP.
In Justin's case, he wrote a function called reduce-true
which performs the usual operation of reduce
, but terminates early whenever the result becomes a false value (false
or nil
). This function uses a loop
/recur
construct, which is straightforward to follow. This is exactly how I've always done such things in the past.
But this phrase "terminates early" is one that is specifically referenced by Clojure Transducers. This made me wonder if there was something I should consider here.
Transducers
Clojure Transducers are a mechanism introduced in Clojure 1.7. From the changelog in 2014:
Transducers is a new way to decouple algorithmic transformations from their application in different contexts. Transducers are functions that transform reducing functions to build up a "recipe" for transformation.
The transducers mechanism supplies the "separation of concerns" that I mentioned earlier and results in several benefits, most especially in performance when data goes through multiple transformations. But the thing that always stood out for me is that transducers can process a sequence to contain more or fewer elements than the initial sequence. This is a requirement that keeps coming up for me, and to date, I have been avoiding using transducers to fill this gap.
Multiple Elements
The main place where standard map
operations let me down is where a single value in an input sequence may result in multiple values in the output sequence. Consider an example of a modified FizzBuzz. For an input sequence of:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
Then the output should be:
[1, 2, :fizz, 4, :buzz, :fizz, 7, 8, :fizz, :buzz, 11, :fizz, 13, 14, :fizz, :buzz, 16]
The numbers smaller than 15 can all be managed with a simple condition-based function:
(defn fizzbuzz1 [x]
(condp #(zero? (mod %2 %1)) x
3 :fizz
5 :buzz
x))
(map fizzbuzz1 (range 1 15))
=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14)
But what about the case for 15? We can't just return 2 values, as this will result in a pair object in the result stream:
(defn fizzbuzz2 [x]
(condp #(zero? (mod %2 %1)) x
15 [:fizz :buzz]
3 :fizz
5 :buzz
x))
(map fizzbuzz2 (range 1 17))
=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14 [:fizz :buzz] 16)
One approach is to use a loop instead, note that the test for 15 could be avoided, but I'm keeping it in place to parallel the above function:
(loop [[n & ns] (range 1 17) result []]
(if-not n
result
(let [appended (fizzbuzz2 n)
result (if (sequential? appended)
(into result appended)
(conj result appended))]
(recur ns result))))
=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14 :fizz :buzz 16)
This works, but it mixes the mechanism (the loop
/recur
) with the specific operation that we want. It also has a specific test for the datatype being returned, to determine how to add the data.
We can fix the second issue, by using the same datatype (a vector) for everything. This wraps scalars as a single-element vector...
(defn fizzbuzz3 [x]
(condp #(zero? (mod %2 %1)) x
15 [:fizz :buzz]
3 [:fizz]
5 [:buzz]
[x]))
(loop [[n & ns] (range 1 17) result []]
(if-not n
result
(recur ns (into result (fizzbuzz3 n)))))
But with consistent return types, we can now go one better with the fizzbuzz
function and skip the test for 15:
(defn fizzbuzz4 [x]
(or
(seq
(cond-> []
(zero? (mod x 3)) (conj :fizz)
(zero? (mod x 5)) (conj :buzz)))
[x]))
(loop [[n & ns] (range 1 17) result []]
(if-not n
result
(recur ns (into result (fizzbuzz3 n)))))
Finally, we can revisit the mechanism, which is the loop
, and consider the map
operation again. If we just look a the simple application of this operation, we get the intermediate structure:
(map fizzbuzz4 (range 1 17))
=> ([1] [2] (:fizz) [4] (:buzz) (:fizz) [7] [8] (:fizz) (:buzz) [11] (:fizz) [13] [14] (:fizz :buzz) [16])
This sequence of sequences can be joined into a single sequence with mapcat
:
(mapcat fizzbuzz4 (range 1 17))
=> (1 2 :fizz 4 :buzz :fizz 7 8 :fizz :buzz 11 :fizz 13 14 :fizz :buzz 16)
This does better than the previous iterations:
-
fizzbuzz4
only tests for divisibility by 3 and 5, and automatically handles 15. - There is little consideration of the mechanism for processing the list and building the result.
The only thing that may be of concern here is that all results for the fizzbuzz4
function are sequences, even when the result is a single value. This creates a minor overhead and is also exposing something of how the data is joined in the mapcat
operation. Nevertheless, this works well, and it is the approach that I have used numerous times in the past.
Transducer Approach
Since transducers can output multiple values in a sequence for a given input element, is this something that we can use here?
To reiterate the documentation for transducers, a transducer is just an operation that returns a function with 3 arities. These follow the pattern we see in a reduce
operation. Indeed, the transducer function is given a reducing function to create a context for the multiple arity function to use:
(fn [reducing-function]
(fn
([] ...) ;; initialize
([result] ...) ;; completion
([result input] ...))) ;; step
The initialize function gets called first and is used for any initialization required. If initialization starts without an initial value, then the reducing function would be called with no arguments here. This reducing function varies according to the use of the transducer, but it can be provided directly for the transduce
function.
e.g. if the code that calls the transducer was provided the +
function for reduction as in:
(transduce xf + [1 2 3 4])
Then the init
function would call (+)
(which is 0). If the xf
function is identity
, then this will give the same result as:
(reduce + [1 2 3 4])
The step function is where the work happens. It processes the input
argument and chooses whether or not to append to the result
argument.
Finally, the completion function flushes any data and provides the final result.
Step
The step function takes the current result
and the next element to be processed. If the transducer is a filter, then it may choose to not add something to the result. If the transducer adds multiple elements (as in the FizzBuzz example above), then all of these can be added to the result here.
While learning about building a transducer, I was curious about what had been accumulated in the result
already. This depends entirely on the type of operation that is being done with the transducer.
- If it is called with
sequence
then theresult
will benil
, and the reducing function will buffer the data, as a lazy sequence. - If
(into [] ...)
is used, then aTransientVector
is created and populated. - If
transduce
is called, then the initial value for the result is either provided or will be the result of calling the 0-arity form of the reducing function. - If the transducer is called via
eduction
, then the result will depend on the way that the iterator is called. Printing to repl will provide aresult
ofnil
. Calling viareduce
will have aresult
of the data being reduced into.
In other worse, there is no guarantee of what kind of result will be provided, nor on the reduction function that will be used to interact with it. These values must be considered opaque.
If we want to know what has been added to the result
already, then the appropriate way to do it is with state of our own. For instance, the dedupe
transducer uses a volatile value to save the last thing appended.
Take 2
Mixing in the operation of fizzbuzz4
with the transducer operation, we get:
(defn fizzbuzz*
[]
(fn [rf]
(fn
([] (rf))
([result] (rf result))
([result input]
(let [fb (cond-> []
(zero? (mod input 3)) (conj :fizz)
(zero? (mod input 5)) (conj :buzz))]
(if (seq fb)
(reduce rf result fb)
(rf result input)))))))
It would be possible to use the existing fizzbuzz4
function and reduce the returned value into the result, but this approach avoids the creation of that wrapping vector.
The first few lines of this transducer are all boilerplate, which is always an unfortunate thing, but it is not too onerous here. The interesting work all occurs in the final 2-arity function. First of all, the fb
value is set to an empty vector which is then appended to if the input is divisible by 3 and/or 5. If this fb
vector is not empty, then each of the values are put into the result with the reducing function (via reduce
), or else the reducing function is used to add the current input number.
How can this be used? Like any other reducer, it can be comp
-ed into a transducer pipeline, or else it can be given to a function that provides a transducer context:
(transduce (fizzbuzz*) conj [] (range 1 17))
(into [] (fizzbuzz*) (range 1 17))
(sequence (fizzbuzz*) (range 1 17))
(eduction (fizzbuzz*) (range 1 17))
The first 2 will provide a vector with the correct output, the third (sequence
) will create a lazy sequence, and the final one (eduction
) will create an object that can be iterated on. All of them will result in the desired seq.
Truncating Results
Recall that I started this because I wanted to create a reduction operation that terminated early. So how is that managed?
Transducers can terminate early by calling reduced
on the result being returned from the step function. This indicates to the calling process that it must not process any more data, and call the completion function.
Let's return to Justin's reduce-true function. This operates just like a reduce
operation, unless the reduction function returns a false value at any point, in which case it will immediately return that value with no further processing.
To do this we need to know what data the reducing function generates before adding it to the result, since we are not able to look at the result value. We also don't know what the reducing function will look like, since different contexts can use a different function.
I decided to opt for the approach used by dedupe
which keeps some state using a volatile value. This gives us the option of checking the result when invoking the user-supplied function and setting an exit flag if it is nil. When this happens, the step function can use reduced
on the result to indicate the end of the reduction.
So this is my approach:
(defn reduce-true
[f val coll]
(let [end (volatile! false)
terminating (fn [rf]
(fn
([] (rf))
([result] (rf result))
([result input]
(let [nxt (rf result input)]
(if @end
(reduced nxt)
nxt)))))
ff (fn
([a] a)
([a b]
(let [r (f a b)]
(when-not r (vreset! end true))
r)))]
(transduce terminating ff val coll)))
The first line of the let
creates the termination flag and initializes it to false
. The terminating
value is then set to our terminating transducer.
We can see here that the step function gets the result of calling the reducing function, and then checks the flag to determine if the final result needs to have reduced
called or not. If this reduced
function gets called, then the process will finish.
Next, we can see that the user-supplied reducing function is wrapped in a function called ff
that will set this exit
flag if the user-supplied function ever returns a false value. A single arity function is also provided for cleanup, though this is just a no-op.
With this transducer set up, we can now do a terminating reduction using transduce
, starting with the user-supplied val
and processing the provided collection coll
.
Wrap Up
Is this better?
For the provided example, I would say no. Justin's reduce-true
function is clearer in operation, despite revealing the mechanism for processing the collection. Given that the loop
/recur
construct of this function has no construction overhead, then there won't be any performance benefit either.
An approach like this transducer-based function may be more flexible and integrate with other transducers a little better (if the function returns the transducer directly and does not call transduce
itself). But for the provided application there isn't any benefit.
However, I now have a better idea of where things go when building my own transducer. When I first started playing with this, I was focused on the transduce
function, which gave me the wrong idea of what the reduction function would be and the datatype of the result
arguments. Writing this post helped me fill in some of the edges of my knowledge that I hadn't realized were lacking.
So now I know how to avoid mapcat
and how to create early-terminating sequence processing. That's a little more about transducers than I was comfortable with when I started. 🙂
Happy New Decade everyone!
Top comments (0)