I've always thought data structures are cool, but you know what's cooler? Seeing them in the wild!
While going through Elixir's docs, I saw that Elixir uses linked lists under the hood for its linear data structure. I thought this was cool, but something struck me: I understood arrays and linked lists, but I had no idea how they relate to programming languages. It's bothered me ever since, and I needed to find out why linked lists were used - hence this article!
So, back to the article: from what I've found so far, there are three reasons why Elixir does this (and I could be totally wrong, feel free to correct me!). Let's go through them one by one:
Immutable Data
In Elixir (and most functional languages, actually), data are immutable. This is an important first step to understanding why linked lists are used, so let's explore this.
Immutable means that once a piece of data is declared, it can never be mutated/changed.
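A quick sketch of what that looks like in Elixir - "removing" an element hands you back a brand new list, and the original is untouched:
list = [1, 2, 3]
List.delete(list, 1) # => [2, 3] - a new list
list # => [1, 2, 3] - the original is unchanged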
I'll assume you know how arrays work under the hood (check out my other article if you want a refresher). Let's take a look at what happens if we try to implement immutability with an array!
An array is defined as a contiguous block of memory. The problem with this is that an array of 5 elements is still just ONE array, and when we add/delete an element, we are MUTATING it. How can we have immutability with arrays then? It's not impossible, but let's look at why it's not practical.
If we want to enforce true immutability with arrays, we need to make a full copy of the old array every time we want to add/delete an element.
This means that if you have an array of size 5 and you want to add a new item, your memory usage instantly doubles (because you need to keep the old array as-is, and you also need to make a new array with the same elements). And that's just space complexity - there's also the time cost of copying every single element!
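Elixir actually has a contiguous data structure too - the tuple - and it behaves exactly this way. A minimal sketch:
tuple = {5, 6, 7, 8}
put_elem(tuple, 0, 1) # => {1, 6, 7, 8} - a full copy with one element changed
tuple # => {5, 6, 7, 8} - the original still exists, untouched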
A linked list doesn't have the same constraint, as the nodes are all stored separately in memory, which means we don't have to worry nearly as much about those space/time costs when adding/deleting nodes in the list.
This gives us our first reason why Elixir uses a linked list. However, that's not the whole story - here's where the recursive structure and structural/tail sharing jump in, and everything starts making sense.
Recursive structure
Did you notice that linked lists are recursive by definition?
For example, A -> B -> C -> D is a linked list, but so is B -> C -> D, and C -> D, and so on - each linked list is just a substructure of another linked list!
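You can see this recursion directly in Elixir, where every list is literally a head element plus a smaller list:
list = [1, 2, 3, 4] # sugar for [1 | [2 | [3 | [4 | []]]]]
[head | tail] = list # pattern matching splits it apart
head # => 1
tail # => [2, 3, 4] - itself a complete linked list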
Well, that wasn't very exciting on its own, but it's vital to the next piece of the puzzle!
Fun Fact: This recursive nature, coupled with the fact that data have to be immutable (so you can't have a loop counter), is why functional languages are usually associated with recursion - they kinda have to be!
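For example, here's how you'd sum a list without a loop counter (a minimal sketch - the module name MyList is just for illustration):
defmodule MyList do
  # Base case: the sum of an empty list is 0.
  def sum([]), do: 0
  # Recursive case: the head plus the sum of the (smaller) tail.
  def sum([head | tail]), do: head + sum(tail)
end
MyList.sum([1, 2, 3, 4]) # => 10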
Structural/Tail Sharing
So, we know linked lists are recursive in nature. Combined with the immutable nature of the language, we know that the data can never change.
This is interesting, because now we can confidently say that A -> B -> C -> D is a different list from B -> C -> D (even though one recursively contains the other), and because we have that guarantee (along with the fact that a list CAN'T change), we don't have to define the same data twice - we can just reuse existing linked lists! This is called structural sharing.
Awesome, isn't it? Let's look at an example:
list = [5,6,7,8]
list_one = [1 | list]
list_two = [2 | list]
Now we have THREE different lists: list, list_one and list_two. But list_one and list_two share the same tail (list itself), and the only difference between them is their head.
This means that there will be a total of only 6 elements in memory. Adding to list has a low memory cost, while retaining the immutability that we desire.
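Here's roughly what that looks like in memory (each number is one node):
list_one: 1 --+
              +--> 5 -> 6 -> 7 -> 8 (this shared tail is list itself)
list_two: 2 --+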
Reusable baby!
If you want to read a little more, you can look into tries (prefix trees), which use the exact same concept of sharing data/prefixes!
Garbage Collection & Caching?
I'm not quite sure about these two, but I've heard that linked lists are good for GCs, and that tail sharing makes a good candidate for locality of reference/caching (which I don't get, because the nodes aren't stored in the same place). Would appreciate it if someone wants to chime in!
Closing Note
Side note: in actuality this isn't so much about Elixir, since it compiles down to Erlang - but it's not really about Erlang either, because most functional languages do pretty much the same thing. Elixir is just what prompted my curiosity, hence the ties to it.
While writing this article, I found that I had to write in depth on how arrays work before I could dive into the Elixir part, so I've published that as another article over here instead; do read that to gain a better understanding of the tradeoffs!
I also didn't really talk about Big O notation, because I felt it might add unnecessary reading time and complexity to the article - but it's pretty vital and fundamental to computer science, so I suggest you brush up a little on it.
If you're a podcast kind of person, there's the BaseCS podcast by CodeNewbie, co-hosted by Vaidehi Joshi and Saron.
If you'd rather read, Vaidehi Joshi's blog post version, BaseCS on Medium (which I believe is what inspired the podcast), is great too.
As for video, MyCodeSchool is beyond amazing and is practically where I learned everything I know now - highly recommended!
Other than that, hope you guys enjoyed the article as much as I enjoyed writing it!
Sources
https://elixir-lang.org/getting-started/basic-types.html#linked-lists - The piece that prompted this article
Top comments (11)
Hi, Edison! Thank you for your really interesting and useful articles!
One note regarding this article: this link leads to the dev homepage instead of this one :) dev.to/edisonywh/recursion-tail-ca...
Thanks for pointing that out! But that's really weird, because I remember deleting that sentence altogether, and I can't see it anywhere now. Not sure how you saw the stale version!
Probably it was cached by the service worker. Now I actually see the article without that sentence.
Hello!
I don't understand "Adding to list has low memory cost, while retaining the immutability that we desire."
I have a list A -> B -> C
I want to add D to the end of the list.
Currently, next(C) == nil.
I expect that next(C) should be D.
I need to mutate C, so I must copy it.
But if C was changed then I need to change B...
So I end up with a full copy of the list just to add an element at the end.
Looks like this is a suboptimal algorithm.
Can you please explain how it really works?
Good point! Yes, you're right - I should've clarified that I in fact meant prepending.
In the end, it's just a normal linked list that Elixir uses under the hood, so it inherits the same Big O time cost for appending. What you're describing is a little more than that - we also have to maintain immutability for structural sharing - so yes, you're right in that regard too: we'd need to copy the entire linked list and only then append to it.
If we look back at my structural sharing example, we have 3 lists, and all three of them point to the same existing list (structural sharing). But watch what happens if I choose to append instead.
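Something like this (the appended element 9 is just for illustration):
list_three = list ++ [9] # ++ copies every node of the left-hand list before appending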
Although list_three has the same values [5,6,7,8] at its front, it does not point to the same nodes in list_one and list_two. I hope that makes sense! There's another really good article that even comes with benchmarks, showing this in action :)
You can also see it in the official language guide here too.
Just to add to that, some functional languages implement a “copy on write” mechanism that only copies the content of the list when mutated.
I'd love to hear about how that is implemented these days - I'd really appreciate it if you gave me a poke if you find/write a nice article about that.
AFAIK, in statically typed languages where the linked list is part of the core/stdlib (as opposed to something like JS, where you implement it as nested length-2 arrays: [4,[3,[2,[1]]]]), they are actually stored as decently sized arrays with a pointer to the next slice. (And to the previous one - since the data structure is already quite large, they are often doubly-linked lists. What's another pointer?)
This still makes them more expensive to index into than a contiguous/dense array, but it makes both indexing AND iteration several orders of magnitude faster than a naive one-node-per-element linked list.
Pretty much the same thing that turned binary trees into the massively wide trees used today.
There's more complexity, especially since this cancels some size optimizations of structural sharing, but that's what the compiler is for.
Interesting article! FYI "data" is plural, so "... data are immutable" instead of "... datas are immutable" (colloquially "... data is immutable" is also acceptable).
Cool, thanks for pointing out :D
Sure, the baby is reusable, but how many times is a baby really reused, in most contexts?
I don't know, don't you share your babies with other people? Seems really efficient that way!
Looks like I've just hit a jackpot billion dollar ideas, co-founder anyone?