Inspired by Adrian Colyer's The Morning Paper series, Fabian Giesen's Papers I Like posts, and the Papers We Love talks, I decided to start writing posts about interesting papers I read.
Pat Helland's Immutability Changes Everything describes why we need immutability and how we can adopt it now that storage and computational power make it affordable.
The key concept around immutability is append-only computing: observed facts are recorded and kept forever, while the results that can be derived from their analysis are computed on demand.
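A minimal sketch of append-only computing, assuming a toy ledger: the `Transaction` and `Ledger` names and the amounts are made up for illustration and do not come from the paper. Facts are only ever appended, and the balance is a derived result recomputed on demand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """An observed fact. frozen=True makes instances immutable."""
    account: str
    amount: int  # positive = credit, negative = debit (hypothetical unit: cents)


class Ledger:
    """Append-only record of facts: entries are never updated or deleted."""

    def __init__(self):
        self._entries = []

    def append(self, tx):
        self._entries.append(tx)

    def balance(self, account):
        # Derived result: recomputed from the recorded facts on demand.
        return sum(tx.amount for tx in self._entries if tx.account == account)


ledger = Ledger()
ledger.append(Transaction("alice", 1000))
ledger.append(Transaction("alice", -250))
# A mistake is corrected by appending a compensating entry, not by erasing.
ledger.append(Transaction("alice", 250))
print(ledger.balance("alice"))  # 1000
```

The compensating entry is the same trick accountants use: the wrong entry stays in the record, and a new one cancels it out.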
Is this new?
Interestingly, this isn't a completely new concept.
A well-known example is accounting, which is entirely based on an append-only record of transactions.
Accountants don't use erasers or they go to jail.
Pre-telephone "distributed systems" are another historical example.
Messages used to be logged on forms consisting of append-only sections, so that all the participants always had access to the full message history.
DataSets and Big Data
The paper defines a DataSet as a fixed and (semantically) immutable collection of data with a unique identifier.
Immutability is especially important when dealing with Big Data.
If the amount of data that needs to be processed requires distribution, the only sane way to deal with it is to guarantee that it won't change.
This enables idempotent functional calculations, which can be distributed and can adopt a let-it-fail approach: if something fails, that's OK, because it will simply be restarted.
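Here is a small sketch of the let-it-fail idea, assuming a simulated worker that crashes at random. The function names, the failure rate and the retry limit are all hypothetical. Because the input is immutable and the task is a pure function, restarting it can never produce a different answer.

```python
import random

# An immutable input partition: a tuple cannot be modified in place.
PARTITION = (3, 1, 4, 1, 5, 9, 2, 6)


def partition_sum(partition):
    """Pure function: same immutable input, same output, no side effects."""
    return sum(partition)


def run_with_retries(task, data, max_attempts=10, failure_rate=0.5, rng=None):
    """Re-run a task until it succeeds; safe because the task is idempotent."""
    rng = rng or random.Random(0)
    for attempt in range(1, max_attempts + 1):
        try:
            if rng.random() < failure_rate:
                raise RuntimeError("simulated worker crash")
            return task(data), attempt
        except RuntimeError:
            continue  # let it fail: just restart the task
    raise RuntimeError("task did not succeed within max_attempts")


result, attempts = run_with_retries(partition_sum, PARTITION)
print(result)  # 31, regardless of how many attempts were needed
```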
It's important to note that the requirement is for immutability to be semantic: as long as we don't change the content, we can choose to physically arrange the data the way we want.
This allows us to organize the data in the most efficient way for the task, and we can also have multiple different representations of the same DataSet.
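To illustrate semantic immutability, the sketch below keeps the same made-up DataSet in two physical layouts, row-oriented and column-oriented. The identifier, field names and values are assumptions for the example only. Both layouts give the same answers because the content is the same.

```python
# Hypothetical DataSet: the identifier and contents are made up for illustration.
DATASET_ID = "sales-2024-01"

# Row-oriented layout: convenient for reading whole records.
rows = (
    ("eu", 120),
    ("us", 200),
    ("eu", 80),
)

# Column-oriented layout of the same DataSet: convenient for aggregations.
columns = {
    "region": tuple(r[0] for r in rows),
    "amount": tuple(r[1] for r in rows),
}


def total_from_rows(region):
    return sum(amount for reg, amount in rows if reg == region)


def total_from_columns(region):
    return sum(
        amount
        for reg, amount in zip(columns["region"], columns["amount"])
        if reg == region
    )


print(total_from_rows("eu"), total_from_columns("eu"))  # 200 200
```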
Normalization is very important in a database designed for update.
The only reason to normalize immutable DataSets may be to reduce the storage they require.
Immutability and evolving data
Sometimes we want to keep track of the evolution of some data.
We can do that in an immutable way, but we need to keep track of all the changes.
One way to do that is to keep track of version history.
This can be done in a strongly consistent way (linear version history) or in an eventually consistent way (directed acyclic graph of version history, like what you usually work with in Git).
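The two shapes of version history can be sketched with parent links. This is only an illustration, not how Git stores its objects, and the `Version`, `commit` and `ancestors` names are hypothetical. A linear history gives every version one parent, while a merge in a DAG has more than one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    id: str
    content: str
    parents: tuple = ()  # ids of the versions this one was derived from


history = {}


def commit(id, content, parents=()):
    # Versions are only ever added; existing ones are never modified.
    if id in history:
        raise ValueError(f"version {id} already exists")
    history[id] = Version(id, content, tuple(parents))
    return history[id]


def ancestors(id):
    """All versions reachable from `id` by following parent links."""
    seen, stack = set(), [id]
    while stack:
        current = stack.pop()
        for parent in history[current].parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Linear history: each version has exactly one parent.
commit("v1", "draft")
commit("v2", "draft, reviewed", parents=["v1"])

# DAG history: two replicas diverge from v2, then a merge has two parents.
commit("v3a", "edit made on replica A", parents=["v2"])
commit("v3b", "edit made on replica B", parents=["v2"])
commit("v4", "merge of A and B", parents=["v3a", "v3b"])

print(sorted(ancestors("v4")))  # ['v1', 'v2', 'v3a', 'v3b']
```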
Versioning techniques have also been adopted to improve the performance of database systems.
With multi-version concurrency control, updates don't overwrite existing data.
Instead, they create a new, isolated, version.
This technique has been adopted to allow concurrent access to a database by different transactions while protecting them from observing inconsistent data.
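A toy sketch of the idea, assuming a single logical clock and no real transactions, locking or garbage collection of old versions. `MVCCStore` is a hypothetical name and not the API of any database. Writes append a new version, and a reader only sees the versions that existed when it took its snapshot.

```python
class MVCCStore:
    """Toy multi-version store: writes add versions, reads use a snapshot."""

    def __init__(self):
        self._versions = {}  # key -> list of (commit_timestamp, value)
        self._clock = 0

    def write(self, key, value):
        # Never overwrite: append a new version stamped with a new timestamp.
        self._clock += 1
        self._versions.setdefault(key, []).append((self._clock, value))
        return self._clock

    def snapshot(self):
        # A reader remembers the timestamp at which it started.
        return self._clock

    def read(self, key, snapshot_ts):
        # Return the newest version that is visible to the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None


store = MVCCStore()
store.write("balance", 100)
reader_ts = store.snapshot()  # a long-running reader starts here
store.write("balance", 70)    # a concurrent writer adds a new version

print(store.read("balance", reader_ts))         # 100
print(store.read("balance", store.snapshot()))  # 70
```

The reader keeps seeing 100 even after the second write, which is exactly the protection from inconsistent data described above.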
A log-structured merge-tree records changes by appending them to a log.
This allows for very efficient insertions, but it would require keeping the whole history around forever and would make reads slow.
To get around this, the log is periodically compacted, removing duplicates and reorganizing the data for more efficient reads.
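The append-then-compact cycle can be sketched as follows. This is a heavily simplified illustration with a single in-memory log. A real log-structured merge-tree keeps an in-memory table plus sorted files on disk that are merged across levels, none of which is modeled here. `TinyLog` and `TOMBSTONE` are hypothetical names.

```python
TOMBSTONE = object()  # marker recording that a key was deleted


class TinyLog:
    """Toy log-structured store: one in-memory log, no levels, no disk."""

    def __init__(self):
        self._log = []  # list of (key, value), oldest first

    def put(self, key, value):
        self._log.append((key, value))  # insertion is a cheap append

    def delete(self, key):
        self._log.append((key, TOMBSTONE))

    def get(self, key):
        # Without compaction a read may have to scan the whole history.
        for k, v in reversed(self._log):
            if k == key:
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Keep only the latest entry per key, drop deleted keys, sort by key.
        latest = {}
        for k, v in self._log:
            latest[k] = v
        live = [(k, v) for k, v in latest.items() if v is not TOMBSTONE]
        self._log = sorted(live, key=lambda kv: kv[0])

    def __len__(self):
        return len(self._log)


db = TinyLog()
db.put("a", 1)
db.put("b", 2)
db.put("a", 3)
db.delete("b")
print(len(db))      # 4 entries before compaction
db.compact()
print(len(db))      # 1 entry after compaction
print(db.get("a"))  # 3
```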
Immutability in file-systems
Immutable data has been widely employed in file systems too.
A log-structured file system treats the disk as a circular log.
This gives very good write performance (especially on magnetic disks) and it offers features like snapshotting and easy recovery from crashes.
Of course, this is not free: as the disk becomes full there is the need to reclaim the blocks that don't contain useful information anymore to make space for new data.
Distributed file-systems like GFS and HDFS exploit immutability to achieve high availability.
Files are composed of immutable blocks and they have a single writer.
With this design, block-level replication is possible without the need to worry about update anomalies.
Immutability in distributed systems
Immutability provides benefits even when working with a distributed datastore using consistent hashing.
If data is mutable we need to settle for eventual consistency while rebalancing.
With immutable data we don't need to worry about stale versions present in different nodes.
If a node has some information, that's the only value it will ever assume.
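A rough sketch of why rebalancing gets simpler, assuming a bare-bones hash ring without virtual nodes or replication. The node names, keys and the choice of SHA-256 are all made up for the example. When a node is added, keys are just copied to their new owner. Since values never change, a copy that is already there must be identical, so no conflict resolution is needed.

```python
import bisect
import hashlib


def ring_position(value):
    """Map a string to a position on the ring (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        self._points = sorted((ring_position(n), n) for n in nodes)
        self._positions = [p for p, _ in self._points]

    def owner(self, key):
        # The owner is the first node clockwise from the key's position.
        i = bisect.bisect_right(self._positions, ring_position(key))
        return self._points[i % len(self._points)][1]


def rebalance(stores, ring):
    """Copy every key to its owner under `ring`, with no conflict resolution."""
    for store in list(stores.values()):
        for key, value in list(store.items()):
            target = stores.setdefault(ring.owner(key), {})
            # Immutable data: any copy already on the target must be identical.
            assert target.get(key, value) == value
            target[key] = value


# Hypothetical node names and facts, used only for illustration.
facts = {f"event-{i}": f"payload-{i}" for i in range(20)}
old_ring = Ring(["node-a", "node-b"])
stores = {}
for key, value in facts.items():
    stores.setdefault(old_ring.owner(key), {})[key] = value

new_ring = Ring(["node-a", "node-b", "node-c"])
rebalance(stores, new_ring)
```

Old copies left behind on the previous owner are harmless in this sketch: they can be served or dropped at any time, because they can never be stale.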
SSD wear-leveling algorithms treat blocks as immutable and use copy-on-write to spread writes more evenly across the whole unit.
Hard disks embrace immutability with an approach similar to log-structured file systems to improve performance and operate at higher density.
The dark sides
Immutability is very useful but it isn't a silver bullet.
Embracing it requires us to make some trade-offs.
Denormalization increases the amount of storage required.
Copy-on-write increases disk activity, in a phenomenon called write amplification.
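As a back-of-the-envelope illustration, write amplification can be expressed as the ratio between the bytes physically written and the bytes the application asked to write. The block size and update size below are assumptions picked for the example, not measurements.

```python
def write_amplification(logical_bytes, block_size, blocks_rewritten):
    """Ratio of bytes physically written to bytes logically changed."""
    physical_bytes = block_size * blocks_rewritten
    return physical_bytes / logical_bytes


# Hypothetical case: changing 512 bytes inside one 4096-byte block.
# Copy-on-write writes a whole new block instead of patching the old one.
print(write_amplification(logical_bytes=512, block_size=4096, blocks_rewritten=1))  # 8.0
```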
It's important to keep this in mind when designing a system.
This post originally appeared on my personal blog