Wormhole - Data Collection Cheat Sheet and Library in 4 Languages

Jason Steinhauser
15+ years of analysis and development. Father of 3. Passionate about testing, functional programming, and pretty graphs.

During the Advent of Code this past year, I was trying to enhance my knowledge of Elixir, as well as functional programming in general. Sometimes I found a function with no analogue in most other languages I've used (a glaringly obvious one being Enum.reduce_while), and other times I was writing functions I'd used often in other languages (Clojure's frequencies would've been mighty handy!). I finally decided to bite the bullet: make a list of the collection manipulation functions I use often, implement them in the other languages I've learned or am currently trying to learn, and discover new functions that I wouldn't want to live without!

What Is Wormhole?

It's this, really:

jdsteinhauser / wormhole

Some of my most used functions implemented in different languages

Wormhole

You ever think, "Hey, I wish this language had the capability of some other language I like"? Enter Wormhole.

Motivation

During the Advent of Code 2018, I found myself writing the same functions in Elixir that I knew I had used in Clojure, F#, or some other language. In order to prevent myself from doing this in the future, I decided to build a library to house all of these helpful functions across languages that I either knew or wanted to learn.

Desired functions

These are the functions that I've used and that I'd like to have in multiple languages. Some of them are already implemented in a given language, so I won't reimplement them there. Each implementation will list the functions as implemented in the language as well as links to their documentation.

  • map
  • filter
  • reduce
  • reduce_while
  • chunk
  • chunk_by
  • juxt
  • min_by
  • max_by
  • frequencies
  • group_by
  • scan
  • inc
  • dec
  • zip

I've got an addiction. I love learning new languages. With learning new languages, you end up finding functions, classes, and concepts that you wish that you had in other languages. Sometimes, those functions are named different things and it gets confusing when you switch between languages. I end up doing a lot of data collection manipulation, and so I decided to start with what I knew best and branch out from there!

What Functions Am I Looking For?

For a non-exhaustive list, I wanted to have at least the following:

  • Collection basics: map, filter, reduce, and scan
  • Chunking data: chunk, chunk_by
  • Common stats: min_by, max_by, group_by, frequencies
  • Other hella useful things: reduce_while, juxt, identity

What Languages Am I Targeting?

For now, I have filled in my perceived gaps in functions in C#, Clojure, and Elixir. I have an F# solution that I'll be comfortable with early this week, and I've started looking at a comprehensive list of Ruby functions as well. After that... well, I'm not entirely sure! I think I'm going to go through Rust, JavaScript, Java, and possibly Kotlin and Python 3 to see what other handy things I can implement across all those languages.

Will These Be Deployed to Package Managers?

Yes... but not right now. I need to get the documentation to a suitable state. I've pulled down several packages before but I've never pushed mine up to any! I'm sure that will end up being a blog post in and of itself.

Current Cheat Sheet

Here's a summary of the languages I've targeted so far, with documentation links to each function that either already exists, or that I've implemented in Wormhole.

  • F# contains a Seq.windowed function, but it only moves the chunk one element at a time.

Why Is This Stuff Useful?

Well, some of the functions are either self-explanatory or already written about in several other articles. I'll cover some of the lesser known ones and why I personally found them useful.

Chunking

I've written about chunk and chunk_by before, but in case you missed it, check out my previous article!
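
For a quick refresher, here is a minimal JavaScript sketch of the two ideas. The names chunk and chunkBy and their signatures are assumptions modeled on Elixir's Enum.chunk_every and Enum.chunk_by, not necessarily what Wormhole exposes.

```javascript
// Hypothetical helpers; names and signatures are assumptions modeled on
// Elixir's Enum.chunk_every/2 and Enum.chunk_by/2, not Wormhole's API.

// Split an iterable into consecutive arrays of `size` items.
// The last chunk may be shorter if the items don't divide evenly.
function chunk(iterable, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks = [];
  let current = [];
  for (const item of iterable) {
    current.push(item);
    if (current.length === size) {
      chunks.push(current);
      current = [];
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Start a new chunk every time keyFn returns a different value
// than it did for the previous item.
function chunkBy(iterable, keyFn) {
  const chunks = [];
  let current = [];
  let lastKey;
  for (const item of iterable) {
    const key = keyFn(item);
    if (current.length > 0 && key !== lastKey) {
      chunks.push(current);
      current = [];
    }
    current.push(item);
    lastKey = key;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

console.log(chunk([1, 2, 3, 4, 5], 2));
// → [[1, 2], [3, 4], [5]]
console.log(chunkBy([1, 3, 4, 6, 7], (n) => n % 2 === 0));
// → [[1, 3], [4, 6], [7]]
```

Unlike a sliding window, each element lands in exactly one chunk.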

Reduce While

I'll admit that this one probably doesn't come up often. Sometimes you don't want to reduce an entire sequence - just up to a certain point. Unfortunately, reduce is typically all or nothing. That doesn't really work when you have a potentially infinite series of data. However, Elixir's reduce_while helped me keep my solution for AoC 2018 Day 1 Part 2 compact. I'm hoping to find more real-world use cases for it... but it's still one of my favorite data processing functions I've found.
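
To make the idea concrete, here is a rough JavaScript sketch of a reduce that can stop early, run against an endlessly repeating input. The reduceWhile and cycle helpers, and the ["cont", acc] / ["halt", acc] return shape, are assumptions for illustration that loosely mirror Elixir's {:cont, acc} and {:halt, acc} tuples. The example is in the spirit of that puzzle (find the first running total that shows up twice), not my actual solution.

```javascript
// Hypothetical sketch modeled on Elixir's Enum.reduce_while/3; the
// ["cont" | "halt", acc] return shape is an assumption for illustration.
function reduceWhile(iterable, initial, reducer) {
  let acc = initial;
  for (const item of iterable) {
    const [signal, next] = reducer(acc, item);
    acc = next;
    if (signal === "halt") break; // stop consuming the iterable early
  }
  return acc;
}

// Repeat a finite array forever, so the input is an infinite sequence.
function* cycle(items) {
  if (items.length === 0) return; // nothing to repeat
  while (true) yield* items;
}

// Keep a running total and halt at the first total seen twice.
function firstRepeatedTotal(changes) {
  const start = { total: 0, seen: new Set([0]) };
  const result = reduceWhile(cycle(changes), start, ({ total, seen }, change) => {
    const next = total + change;
    if (seen.has(next)) return ["halt", { total: next, seen }];
    seen.add(next);
    return ["cont", { total: next, seen }];
  });
  return result.total;
}

console.log(firstRepeatedTotal([1, -2, 3, 1])); // → 2
```

A plain reduce would never return here, because the cycled input has no end; the halt signal is what makes the fold terminate.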

Juxt

I admit that, at first glance, juxt looks like nothing special. It takes an array of functions that operate on the same parameters and returns a single function that takes those parameters and returns an array with the result of each function. Why use that?

I've ported this function from Clojure into other work projects before. For instance, I had a very large collection of data (1MM+ entries!) and I couldn't afford to iterate over them multiple times. I used juxt to compose my analysis functions together so that I only had to iterate over the collection one time.
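
Here is a minimal sketch of what such a port might look like in JavaScript. The juxt below follows the behavior of Clojure's function as I understand it; the describe example and its three small analysis functions are made up for illustration.

```javascript
// Hypothetical port of Clojure's juxt: takes functions and returns one
// function that applies each of them to the same arguments.
function juxt(...fns) {
  return (...args) => fns.map((fn) => fn(...args));
}

// Three independent "analysis" functions, bundled into one.
const describe = juxt(
  (n) => n * n,            // square
  (n) => n % 2 === 0,      // is it even?
  (n) => String(n).length  // number of digits
);

// The collection is traversed once; each item yields one row of results.
console.log([3, 10, 250].map((n) => describe(n)));
// → [[9, false, 1], [100, true, 2], [62500, true, 3]]
```

Each function still runs once per item, but the collection itself is only walked a single time.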

Similarly, since a keyword in Clojure can be treated as a function for retrieving a value out of a map with that key ((:foo {:foo 5 :bar 3}) returns 5), you can combine several keywords with juxt to pull data out of a collection of maps and return the results in something like a table format. I wrote about that as part of a previous post on dense Clojure code.

Frequencies

Because sometimes, you just need a histogram. frequencies provides that in one single function!
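
As a rough sketch of the idea in JavaScript (the function name mirrors Clojure's; returning a Map is my own assumption for this example):

```javascript
// Hypothetical port of Clojure's frequencies: count how often each
// distinct item occurs. A Map preserves first-seen order and accepts
// any value as a key.
function frequencies(iterable) {
  const counts = new Map();
  for (const item of iterable) {
    counts.set(item, (counts.get(item) || 0) + 1);
  }
  return counts;
}

// Strings are iterable, so this counts characters.
console.log([...frequencies("mississippi")]);
// → [["m", 1], ["i", 4], ["s", 4], ["p", 2]]
```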

Conclusion

Hopefully someone out there will find this useful, either as a cheat sheet or as a library. In the near-term, I will be investigating Ruby and Rust (in that order) to see what other handy functions I could foresee using across multiple languages. I'll also put Wormhole up as a package in your favorite package managers soon, and probably write about the things I do/don't like about each.

Happy coding, and I'd love to hear about other general purpose data manipulation functions you've found useful!

Discussion (6)

Pieter Slabbert

In Clojure, you can use reduced to stop before you have processed the entire sequence.

Cameron Desautels

I was going to say the same! Here's what that looks like:

(reduce (fn [acc x]
          (+ acc x))
        (range 11))
;; => 55

(reduce (fn [acc x]
          (if (> x 3)
            (reduced acc)
            (+ acc x)))
        (range 11))
;; => 6

So we already have that capability built-in (without introducing a new function). Clojure also already has max-key and min-key, though perhaps with a slightly different interface than what you might have expected.

Jason Steinhauser Author

I was unaware that this function existed! I'll have to take a look into it to see how it could've helped in a few cases. Thanks for letting me know!

Mihail Malo

Do you know if there's anything juxt-like that would help in this case?

There is often a case where I have multiple indexes, so I end up doing something like this:

const cats = [
  { name: "Aeris", id: 0x00, isFavourite: true },
  { name: "Juri", id: 0x01 },
  { name: "Dante", id: 0x03 },
  { name: "Frankenstein", id: 0xff }
]
const byName = new Map()
const byId = new Map()
for (const cat of cats) {
  byName.set(cat.name, cat)
  byId.set(cat.id, cat)
}

If this wasn't as common, I'd probably investigate making a function that takes a predicate and an array and makes an iterator of entries that new Map() can consume.
But like this, I only iterate once to populate multiple Maps.

Plus there are the cases where I receive an object (including from JSON), so normal iteration wouldn't work:

const cats = {
  Aeris: { id: 0x00, isFavourite: true },
  Juri: { id: 0x01 },
  Dante: { id: 0x03 },
  Frankenstein: { id: 0xff }
}
const byName = new Map()
const byId = new Map()
for (const name of Object.keys(cats)) {
  const cat = { name, ...cats[name] }
  byName.set(name, cat)
  byId.set(cat.id, cat)
}

Something like "multiple reducers in one iteration"

Jason Steinhauser Author

That is an interesting case that I hadn't considered before. I will definitely have to look into it while exploring the JavaScript ecosystem more thoroughly!

Mihail Malo

In OOP it would probably be this:

class CatLookup {
  byName = new Map
  byId = new Map
  add(cat) {
    this.byName.set(cat.name, cat)
    this.byId.set(cat.id, cat)
  }
  addMany(cats) {
    for (const cat of cats) this.add(cat)
  }
  constructor(cats = []) {
    this.addMany(cats)
  }
}

const c = new CatLookup([
  { name: "Aeris", id: 0x00, isFavourite: true },
  { name: "Juri", id: 0x01 },
  { name: "Dante", id: 0x03 },
  { name: "Frankenstein", id: 0xff }
])