Curtis Fenner

Posted on Aug 17, 2019

References, Identity, and Mutability

#programminglanguages #immutable #mutability #pointers

I see a lot of confusion stem from imprecise jargon. One particular piece of jargon which is terribly imprecise is "reference". A question I heard recently in an interview went like this:

Are objects passed by value or by reference in Java?

On the surface this looks like a reasonable question (especially if you hang around C++ programmers, who need to care very much about the difference). However, (and such a simple looking question being asked should be a hint) this is a trick question. It doesn't have a simple answer of either "by value" or "by reference", because those terms don't actually clarify anything.

So, if such a simple question is hard to answer directly, what is a reference anyway?

References

"References" to objects are common in procedural languages. If you know JavaScript, Java, Lua, C#, Ruby, Python (or many others), you've worked with object references.

In Python, you can see what we mean by "reference" in the following code:

a = [1, 2]
b = a
b.append(3)
print(a) # [1, 2, 3]

We directly modified b, but because b was referring to the same object as a, the modification b.append is also visible via a. In Python, you can see if two variables refer to the same object using the is keyword:

x = [1, 2]
y = [1, 2]
print(x is x) # True
print(x is y) # False
print(x == y) # True

This leads us to the two main properties of a reference:

A reference holds the identity of some object or value
A reference is a capability to mutate (aka modify) some object or value (or to view modifications that have occurred to some value)

The first snippet indicates the mutation property: we can use either a or b to modify the list that was initially [1, 2] and we can use either a or b to view the list resulting from those modifications.

The second snippet indicates the identity property: we can have two objects that have the same contents, but which are somehow still different objects. Being able to distinguish objects this way is most useful when in combination with the mutation property; if they are the same object, a modification will be visible through both references; if they are different objects, the modification will only affect the object pointed to by one reference.

The two properties are separate

While object references in most languages usually combine those two properties, you can have a programming language with only one, or neither!

No mutation, no identity: Haskell

Haskell is fairly different from most 'mainstream' programming languages. It is a purely functional programming language. Other purely functional languages include Elm, Pony, Erlang, and Idris. (If you want to become a better programmer, I recommend learning at least a little Haskell. Not because you need to program in Haskell, but because it's a totally different way of looking at code, and lessons apply to other languages. I recommend the easy-to-read and free Learn You a Haskell for Great Good!)

In Haskell, after a value is created, it cannot be modified. Thus, if I write

a = [1, 2, 3]

there is no subsequent way that I can change a so that it instead holds the value [1, 2, 3, 4]. Instead, I can create new variables and new values which are based on a:

-- keep `a` the same, but get a list which has
-- the contents of `a` followed by a `4`
-- and call it `a_with_four`:
a_with_four = a ++ [4]

The reason Haskell works this way is because you get a very powerful property, which is that when the same input goes in to any piece of a program, the same output will always come out. This is a very useful property when testing and refactoring code!

A consequence of this model is that objects don't have an identity beyond their contents. Unlike Python, there's no way to ask if [1, 2, 3] is the "same" list as another [1, 2, 3] -- they contain the same values, so they are the same.

No mutation, with identity: Scala & Lua's Strings

Scala is a language which is inspired by purely-functional languages like Haskell. Similar to Haskell, it's typical for objects to be copied-but-changed rather than directly modified. However, unlike Haskell, Scala does allow the programmer to mutate objects. This is largely because its lineage and interop with Java, but also because mutation can occasionally be a very useful programming tool.

This lands Scala in another place: Most values in Scala cannot be mutated, but they still have identities!

For example,

val a = Seq(1, 2, 3)
val b = Seq(1, 2, 3)
val x = a

a == b // true
a eq b // false
a eq x // true

Now, what use is identity if the object can't be changed? Why do we need to distinguish objects which are exactly the same?

One important reason may be performance. For example, in order to compare two lists, we have to compare each of their elements. If the lists are millions of elements long, this may take a while! However, if we know that they are the same object, we can skip all of that work.

One clear demonstration of this can be seen in Haskell. Haskell allows lists to be infinite, in which case comparing them will cause the program to hang, even if you compare an object to itself!

a = [1..]

a == a --> hangs forever

Generalizations of this approach may be extremely useful in the internals of some data structures and algorithms.

String Interning

A good example of the usefulness of identity in immutable objects is string interning, a feature of some programming languages and runtimes. This is where strings that have the same contents are made to live in the same places in memory. Some languages guarantee that this happens (Lua), and some do it only sometimes to reduce memory usage and speed up some comparisons (Java, JavaScript).

Strings are very frequently compared. For example, in JavaScript & Lua, obj.my_property needs to look up the string "my_property" and compare it to the other keys of obj. If doing this required checking every character in the string every time you used a key in an object, the language would be many times slower than it needed to be.

Lua interns all strings, such that you have a guarantee that if two strings have the same contents, they live at the same place in memory. This means that comparing two strings is extremely fast, as fast as comparing two numbers, because you can compare their location rather than their contents! This also has other benefits for keeping the implementation of Lua simple.

Mutation, but no identity: Ę̵͓̱͍̻̘̐ļ̸͎̩̀̽͗̇̐͜͝d̸̤̝͇̬̄̀̍͊͒̀̑͐͛r̶̡̢̬̦̲̟͓̪̲̯͉͔̖̈́͐͘͜ỉ̶̛͎̈́͒̚t̸̛̺̭̭̫̦̱̩̙̜͖̺͔̾̈͂̿̽͂͝c̸̢̡̦̝̯̗̘͒̀̐͂̃̃̈͌̿͋͜ḩ̸̓̃̿̽̈́̐̀̄̌̎̑͛̍̚

I introduced two axis, and said that you could choose one, or both, or neither. If you're a seasoned programmer or mathematician, you're probably already picturing this chart (if not, charts like this are a great way to find edge cases!):

_	Objects have identity	Objects don't have identity
Objects are immutable	Scala, interned strings	Haskell
Objects are mutable	Python, Java	?????

The question is, what goes in the last entry in the table?

I don't know of a real language that goes there. It's probably not a good idea to make a language that goes there. In fact, it's almost definitely a terrible idea to use any language that goes there. Since I don't know one that already exists, let's make our own!

Let's call the language Eldritch, because it would probably be appropriate for coding up the Necronomicon. Eldritch is going to be just like Lua, with one tiny change. Modifying an object will also modify all objects that are shaped the same way:

local a = {1, 2, 3}
local b = {1, 2, 3}

-- append `4` to the end of `a`
table.insert(a, 4)

-- here, we see the difference between Eldritch and Lua:
print(table.concat(b, ", ")) --> 1, 2, 3, 4

Now, the more you think about this, the more it should become obvious that using such a language would be, at the least, a massive pain.

Empty lists could not stay empty for long. You basically cannot use plain arrays, because if anyone else is using arrays, you're going to hit a problem if you end up with the same array. Accomplishing anything would require manually "tagging" objects to make sure that they never coincidentally look like other objects, negating all of the benefits of automatic memory management. Truly an eldritch abomination 😈.

However, that doesn't change the fact that you could design, implement, and then use such a language! Eldritch is the final corner of the table; you can have objects that don't have identity but can be mutated!

_	Objects have identity	Objects don't have identity
Objects are immutable	Scala, interned strings	Haskell
Objects are mutable	Python, Java	💀Eldritch🐉☠

Other pieces of references

Of course, this is just one way of breaking down what a "reference" is.

There are other things to look at, such as

whether or not references can be "repointed" at a different object. C++ & style references for one cannot, while most languages like Java, Python, JavaScript are repointed by using =
whether or not using the reference invalidates the object, as in "moving" an object in Rust or C++
whether or not assignment is "shallow" or "deep"; in C, assigning a struct value to a variable essentially copies the contents of the struct, while assigning a pointer is just making another reference to the same object. This can have dramatic implications for performance!
whether or not the reference is held by other variables, or other threads. This matters a lot for concurrent programming and is one of the main motivations behind Rust's borrow checker

...and many more things. Programming languages are varied and a very deep, interesting topic!

This was a very deep, rambly dive into just a single technical word. The takeaway should be that the concepts are more important than the terminology. While it's important to not say the wrong thing, it's much more important to actually understand the ideas you're talking about. Playing with ideas and taking them apart like this is a great way to get deeper understanding!