Programming Should Be About Transforming Data [...] I don't want to hide data I want to transform it.
- Dave Thomas, Programming Elixir
Whether it's calculating students' final grades, turning mouse clicks into t-shirt orders, or reading and publishing temperature data from a RPi, programming is basically distilling data from one form into another. Oftentimes, it doesn't feel like pure data transformation because we try to handle how the data is stored as well as its transformation in one step. This can make code awfully confusing, especially to anyone that has to maintain the code.
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
- Martin Fowler
I've got a couple of examples from past projects that I'd like to share, going from how I'd written algorithms when I was more concerned about the place of my datapoints, and then after I discovered partitioning/chunking functions.
What is Partitioning/Chunking?
Partitioning and chunking, in the context of this article, is subdividing existing collections, streams, and enumerables/iterables, into pieces more useful to your algorithm. This chunking can be done either by:
- Size (from ClojureDocs)
;; partition a list of 22 items into 5 (20/4) lists of 4 items
;; the last two items do not make a complete partition and are dropped.
(partition 4 (range 22))
;;=> ((0 1 2 3) (4 5 6 7) (8 9 10 11) (12 13 14 15) (16 17 18 19))
- Max Size (from LoDash)
_.chunk(['a', 'b', 'c', 'd'], 2);
// => [['a', 'b'], ['c', 'd']]
_.chunk(['a', 'b', 'c', 'd'], 3);
// => [['a', 'b', 'c'], ['d']]
- A mapping function to group contiguous items in the stream that map to the same value (from Elixir Docs)
Enum.chunk_by([1, 2, 2, 3, 4, 4, 6, 7, 7], &(rem(&1, 2) == 1))
# => [[1], [2, 2], [3], [4, 4, 6], [7, 7]]
Basically, yes... divide and conquer.
Parsing Arrays from Binary Streams
If you've ever had to get raw data off of a wire or read in a proprietary/lesser-used binary format, then you've probably had to write your own parser for it. And more often than not, you've had to read an array whose length is defined by an earlier declared variable. Endianness aside, that in and of itself is not the most fun. The format is usually akin to:
Value | Type | Size |
---|---|---|
Sync Word (0xC011FEFE) | int32 | 4 |
Timestamp (ยตsec) | int64 | 8 |
nElements | int32 | 4 |
Data elements | int32[] | 4*nElements |
Now, normally how I would have normally built something to read in the packet (without validation) like:
public class DataFrame {
public readonly DateTime Timestamp;
public readonly int[] Data;
public DataFrame(DateTime timestamp, int[] data) {
Timestamp = timestamp;
Data = data;
}
public static FromBytes(byte[] rawData) {
var timeUsec = BitConverter.ToInt64(rawData, 4); // Skip the sync word
var timestamp = DateTime.FromFileTimeUtc(timeUsec);
var count = BitConverter.ToInt32(rawData, 12);
var data = new int[count];
// Skip the first 16 bytes before we start reading the data
for (int i = 0; i < count; i++) {
data[i] = BitConverter.ToInt32(rawData, 16 + i * 4);
}
return new DataFrame(timestamp, data);
}
}
Granted, I have to care a lot about where the header information is, but there has to be a way to make my intention clearer on reading the array of data into data
. The same thing, written in F#, feels a little cleaner to me:
type DateFrame = { Timestamp : DateTime; Data : int array }
let fromBytes data =
let time = BitConverter.ToInt64(data, 4)
let numItems = BitConverter.ToInt32(data, 12)
{
Timestamp = DateTime.FromFileTimeUtc time
Data = data
|> Array.skip 16
|> Array.chunkBySize 4
|> Array.take numItems
|> Array.map (fun x -> BitConverter.ToInt32(x, 0))
}
Chunking data
into 4 byte increments allows me to very cleanly pass info to the BitConverter
to make the conversion to 32-bit integers. I don't have to keep track of offsets, and have effectively removed the entire category of offset or "off by x" errors from my code!
Only parts of the data matter
Let's face it: sometimes you just don't care about a lot of your data.
Image credit to this blog post
Let's say you're a delivery company and you want to make sure that your drivers are obeying speed limits. You really don't care about times that they're actually doing that, but when they go over the speed limit they might want to figure out why. Maybe they were going downhill and started to brake late. Perhaps they were texting and not paying attention. Who knows? But you're supposed to take the speed limit from your map API and velocity from the GPS and figure out when, and for how long, a driver was going over the speed limit. Let's see what it might look like to find drivers that were speeding for more than one minute:
public class DriverData {
public readonly DateTime Timestamp;
public readonly double Speed;
public readonly double SpeedLimit;
public DriverData(DateTime time, double speed, double speedLimit) {
Timestamp = time;
Speed = speed;
SpeedLimit = speedLimit;
}
}
public class Driver {
public readonly Name;
public IEnumerable<DriverData> DrivingHistory;
public IEnumerable<(DateTime, TimeSpan)> SpeedingMoreThanAMinute() {
// If the collection has no values or only one value, result is empty
if (DrivingHistory == null || DrivingHistory.Count() <= 1) {
yield break;
}
var isSpeeding = false;
var speedingStart = new DateTime(0);
var lastTime = new DateTime(0);
var i = 0;
foreach (var point in DrivingHistory) {
var isSpeedingNow = point.Speed > point.SpeedLimit;
if (isSpeedingNow) {
if (!isSpeeding) {
speedingStart = point.Timestamp;
}
isSpeeding = true;
lastTime = point.Timestamp;
}
else {
if (isSpeeding) { // We were speeding, and now we're not
var duration = (lastTime - speedingStart).TotalSeconds();
if (duration >= 60) {
yield return (speedingStart, duration);
}
}
isSpeeding = false;
}
}
yield break;
}
}
Yuck. That's a lot of ugly logic! There are six possible conditions there, and we have to keep all of those internal state variables and their functions in our head as we write this out. We need all of this overhead to find out when our drivers are speeding? Surely you can't be serious!
This where chunking by a function comes in handy. Trying this in Elixir (EDIT: slight fix thanks to Aleksei Matiushkin!) :
defmodule DriverData do
defstruct [:timestamp, :speed, :speed_limit]
def is_speeding_one_minute_plus(data)
data
|> Enum.drop_while(& &1[speed] <= &1[speed_limit])
|> Enum.chunk_by(& &1[speed] > &1[speed_limit])
|> Enum.take_every(2)
|> Enum.map(fun x ->
%{start => hd(x)[timestamp],
duration => DateTime.diff(List.last(x)[timestamp] - hd(x)[timestamp])
end)
|> Enum.filter(& &1[duration] >= 60)
end
end
Since this is one of five languages I've used in this post (six if you count English!), I'll walk you through what's going on since it's less simple. The Elixir |>
operator takes the previously evaluated expression and uses this as the first argument in the next function call. This functions the same way as Clojure's ->
macro, but differently than F#'s |>
operator (which uses the previous statement as the last argument in the next function). I use Enum.drop_while/2
to drop off all the points at the beginning from where the driver isn't speeding, and then I use Enum.chunk_by/2
then break the list into chunks based on whether or not the driver's speed exceeds the speed limit, and then use Enum.take_every/2
to take every 2nd element (as specified by the 2
in the arguments). That leaves me with only Lists of times where the driver was speeding. I then use Enum.map/2
to turn those lists into maps with keys of :start
and :duration
, and then filter those based on a duration >= 60
seconds. We did the whole thing solely by operating on the chunks of data we cared about, filtered out what we didn't need, and turned the output into something we could use, all without using any conditional branches or any internal state.
Conclusion
Admittedly, it took me quite a while to wrap my head around chunking/ partitioning data into useful bits. It's hard to make that transition from how I'm used to developing algorithms in over a decade of procedural and OO programming, to thinking about everything as just data transformations. If the concept seems difficult for you, just understand that lots of other people, including me, have been there too! If you find yourself building considerable amount of internal state just to parse some data out of an iterable/ enumerable collection, use that as a code smell to see if chunking is right for your problem.
Happy coding!
Top comments (2)
My R library disk.frame(github.com/xiaodaigh/disk.frame) follows exactly this chunking strategy.
Ahh, you're right! That's what I get for hastily typing it up. Thanks for the catch!