Robin Alexander Dorstijn

Posted on Aug 21, 2022

On Data and Objects

#programming #computerscience #rust #dataorientedprogramming

At my employer there is a functionally unlimited book budget, which I have been making good use of. That is why I have lately been thinking a lot about meta-programming. You know what I am talking about: paradigms, programming styles, data structures as philosophical constructs represented in the real world with series of 1s and 0s. Computer science stuff.

Why I am confused

One of my main sources of confusion is the tedinski blog. It is really cool, but armed with only a physics degree, I find that this stuff goes way over my head most of the time. More practical for me is the book "Data-Oriented Programming" by Yehonathan Sharvit.

Tedinski in particular tries to describe the difference between objects and data, but he stops short of actually defining the difference. Sharvit simply assumes the difference to be understood. This inspired some meditation on my part that is materializing as this post.

My (hypo)thesis

Thinking about this, I formed a hypothesis for a good definition of data and objects. Through writing this post I have become rather pleased with my understanding and would consider it, if not correct, at least meritorious. My definitions for objects and data are as follows:

Data is a collection of queryable facts. An object is a collection of executable behaviors.

For facts I had a clear understanding:

A queryable fact is an independent statement that can be derived from a machine state that is assumed to be truthful.

By queryable I mean that my machine can retrieve particular aspects of the data in any order, so that it can pass them on to subprocesses that perform actions based upon this data.

Describing behavior

Prolog and graph databases

What remained unclear to me was my understanding of behavior. I felt like I lacked a clear definition for it. I also could not point out in code what exactly was "data" and what was "objects".

This brings me to Prolog. Now heads up, I am as clueless about Prolog as the next guy. I am completely incapable of writing anything useful in that language, but the ideas it presents are useful to me, especially since it works with symbols that are extremely close to my definition of data.

In the book "Designing Data-Intensive Applications", Martin Kleppmann introduces the core concept of graph databases using Datalog, a subset of Prolog. The core idea is that in Prolog we can define relations as follows:

/* Contents of relation.pl */
relation(a, b).
relation(b, c).
related(X, Y) :- relation(X, Z), relation(Z, Y).

In the Prolog command line we can then ask if these facts are true:

| ?- relation(a, b).
yes

and if a and c are related:

| ?- related(a, c).
true

or if d and e are related

| ?- related(d, e).
no

With this we can create edges that describe data. For example, say we want to state: John has a job as a programmer, John lives in Illinois, and Illinois is in the USA. We can represent that as follows in Prolog.

/* Contents of edges.pl */
edge(john, job, programmer).
edge(john, in, il).
edge(il, in, usa).
in(X, Y) :- edge(X, in, Y).
in(X, Y) :- edge(X, in, Z), in(Z, Y).

The latter two lines are rules. They state that in(a, b) is true if the fact edge(a, in, b) exists, or if a is recursively in b. I personally think this is beautiful. Rules like this allow you to write rather complex queries with relative ease. Take for example the query "Which people have jobs as programmers in the United States?". This can be expressed with the following Prolog query.

GNU Prolog 1.5.0 (64 bits)
Compiled Apr 23 2022, 09:20:55 with gcc
Copyright (C) 1999-2022 Daniel Diaz

| ?- [edges]. /* Load facts in edges.pl */
compiling edges.pl for byte code...
edges.pl compiled, 6 lines read - 985 bytes written, 6 ms

(1 ms) yes
| ?- in(X, usa), edge(X, job, programmer).

X = john ? 

yes

And with that we have created a very basic graph database. Neat.
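To connect this with the Rust examples later in the post, here is a rough sketch of the same tiny graph database in Rust. It is only my own translation, not anything from Kleppmann's book: the names Edge, EDGES, is_in and programmers_in are made up for illustration, and the sketch assumes the "in" edges never form a cycle.

```rust
// A fact is a (subject, predicate, object) triple, mirroring edge/3 in edges.pl.
type Edge = (&'static str, &'static str, &'static str);

const EDGES: [Edge; 3] = [
    ("john", "job", "programmer"),
    ("john", "in", "il"),
    ("il", "in", "usa"),
];

// Counterpart of the in/2 rules: x is in y if there is a direct "in" edge
// from x to y, or if x is directly in some z that is itself in y.
// Assumption: the "in" edges contain no cycles, otherwise this recurses forever.
fn is_in(edges: &[Edge], x: &str, y: &str) -> bool {
    edges
        .iter()
        .filter(|(s, p, _)| *s == x && *p == "in")
        .any(|(_, _, z)| *z == y || is_in(edges, z, y))
}

// Counterpart of the query: in(X, place), edge(X, job, programmer).
fn programmers_in(edges: &[Edge], place: &str) -> Vec<&'static str> {
    edges
        .iter()
        .filter(|(_, p, o)| *p == "job" && *o == "programmer")
        .map(|(s, _, _)| *s)
        .filter(|s| is_in(edges, s, place))
        .collect()
}

fn main() {
    println!("{:?}", programmers_in(&EDGES, "usa")); // prints ["john"]
}
```

The array of triples plays the role of the facts, while is_in and programmers_in play the roles of the rule and the query. Unlike Prolog, I have to spell out the search order myself.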

Prolog syntax as the archetype data/behavior language

Summarizing the above, I would describe Prolog in the following way: there are queries, which Prolog evaluates using facts and rules to answer what conditions would make the query true. I would like to fit this to the definition of data and objects I presented at the beginning of the article. It follows directly from my definition that Prolog facts are data. With rules and queries the considerations are more complicated.

Rules are facts in that "a fact in(X, Y) can be generated from the facts in(X, Z) and in(Z, Y)", but it feels unfair to call a rule data. You could argue that once it is written to a file, it is data in the context of the operating system: the query "what is the current state of the file edges.pl?" has a factual answer, which my computer can retrieve. However, in the context of Prolog, I would argue that it is not data, since it is not independent. A rule can only generate new facts in the presence of other facts.

I would rather call the symbol "rule" behavior, since it describes how the system behaves. This in turn would make rule symbols objects. Trying to generalize this, I would say that behaviors have no meaning without data. If we take this to be true, then every statement in a programming language that is meaningful on its own must be data.

Applying the theory: Rust

This makes sense to me when I look at Rust. Rust is a relatively new language that has really challenged a lot of my understanding of programming in an excellent way. I have even written some productive code in it already.

In Rust you can define structs. Structs in Rust are very similar to structs in C or C++. Here is some syntax, which I find very self-explanatory.

struct Person {
   name: String,
   job: String,
   location: (f64, f64)
}

One of the sources of confusion for me is how the word object is used colloquially. Namely, I often see instantiated structs referred to as "objects". Let's look at instantiation in Rust.

// String fields need owned values, so the string literals are converted.
let john = Person {
  name: "John".to_string(),
  job: "Programmer".to_string(),
  location: (41.8793343, -87.6289326)
};

Previously I would say that "the variable john holds a Person object." However in a discussion of objects and data I would rather say "the variable john holds Person data".
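To make "Person data" a bit more concrete, here is a minimal sketch of what I mean by queryable facts. The describe function and the reading of location as (latitude, longitude) are my own assumptions, added purely for illustration.

```rust
struct Person {
    name: String,
    job: String,
    location: (f64, f64), // assumed to be (latitude, longitude)
}

// Hypothetical "subprocess": a free function that acts on facts it is handed.
// It knows nothing about Person, only about the facts themselves.
fn describe(name: &str, job: &str) -> String {
    format!("{} works as a {}", name, job)
}

fn main() {
    let john = Person {
        name: "John".to_string(),
        job: "Programmer".to_string(),
        location: (41.8793343, -87.6289326),
    };

    // Each field is an independent fact that can be queried in any order.
    let (latitude, longitude) = john.location;
    println!("{}", describe(&john.name, &john.job)); // prints John works as a Programmer
    println!("{}, {}", latitude, longitude);
}
```

Nothing here behaves; the struct only answers questions about John, and any behavior lives in the function that receives those answers.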

Rust does allow for the definition of behaviors on structs, however. It does so through impl blocks. Let me show an example.

impl Person {
  fn greet (&self) {
    println!("Hello, I am {}.", self.name);
  }
}

This would imply that now I can make john greet his new friends.

john.greet();

However, now it is much harder to describe the contents of the john variable. Is it an object or is it data? I would say that john is now a stateful object, or that the object in john has encapsulated Person data, especially since Rust makes fields private by default.
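The effect of private fields only shows up across module boundaries, so here is a small sketch of the encapsulation I have in mind. The module name people, the constructor new, and the pared-down Person with only two fields are assumptions for the sake of the example, and greet returns its text here instead of printing it.

```rust
mod people {
    pub struct Person {
        // No `pub` on the fields: they can only be read inside `people`.
        name: String,
        job: String,
    }

    impl Person {
        // Hypothetical constructor; outside code cannot build a Person by hand
        // because it cannot name the private fields.
        pub fn new(name: &str, job: &str) -> Self {
            Person {
                name: name.to_string(),
                job: job.to_string(),
            }
        }

        // The only way for outside code to get at the facts is through behavior.
        pub fn greet(&self) -> String {
            format!("Hello, I am {} and I work as a {}.", self.name, self.job)
        }
    }
}

fn main() {
    let john = people::Person::new("John", "programmer");
    println!("{}", john.greet()); // prints Hello, I am John and I work as a programmer.
    // println!("{}", john.name); // does not compile: field `name` is private
}
```

From outside the module, john can no longer be queried for facts; it can only be asked to behave. That is what pushes it towards the object side of my definition.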

Resolving a confusion

This would also explain why JSON-like dictionary structures are referred to as data by Sharvit. Technically, loading a JSON file creates dictionary and list objects, which even have methods, but really this structure is much more like data, since it presents a number of queryable facts. It also explains why libraries like lodash (and all of its offspring in other languages) do not extend (or subclass) the map/array/number classes, but instead define a set of functions that take those generic structures as arguments.
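As a sketch of that lodash style in Rust: the function below is a hypothetical helper loosely modelled on lodash's pick, written as a free function over a generic map instead of as a method on some Person class. The record layout with string keys and string values is an assumption that keeps the example short.

```rust
use std::collections::HashMap;

// Hypothetical helper, loosely modelled on lodash's `pick`: it copies the
// requested keys out of a generic record and silently skips missing ones.
fn pick(record: &HashMap<String, String>, keys: &[&str]) -> HashMap<String, String> {
    keys.iter()
        .filter_map(|key| {
            record
                .get(*key)
                .map(|value| ((*key).to_string(), value.clone()))
        })
        .collect()
}

fn main() {
    // A JSON-like record: nothing but queryable facts.
    let john: HashMap<String, String> = HashMap::from([
        ("name".to_string(), "John".to_string()),
        ("job".to_string(), "Programmer".to_string()),
        ("state".to_string(), "il".to_string()),
    ]);

    let job_only = pick(&john, &["job", "salary"]);
    println!("{:?}", job_only.get("job")); // prints Some("Programmer")
}
```

The same pick works on any record of this shape, because it depends on the data structure and not on what the record represents.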

Conclusion

I feel like this could be the start of an exploration of a more rigorous definition of data, objects and queries. With these definitions in hand, it would be simpler to explore data oriented programming principles, as well as deep diving into topics like immutability. Definitions have purpose and their quality often is defined by the rigor with which they are stated.

Top comments (1)

Tulio Rodrigues • Sep 23 '22

In the case of your Rust example, I would call the struct a class and the impl a companion object. Have a look at Scala, which makes this distinction.