Genix

Posted on Jun 29

The Nature of Data

#career #systemdesign #programming #architecture

Abstract

The word data is one of the most heavily used and least precisely defined terms in computing. Standard references settle on an informal characterization of data as “raw facts,” but none of them says what structural properties a collection of facts actually needs before it can be addressed, compared, or processed by a machine. This is not just a philosophical loose end. It propagates directly into system design as schema drift, type-confusion vulnerabilities, unsound equality checks, and serialization defects, and each of these traces back to an unstated assumption about what index set or value set a piece of data is supposed to inhabit. This paper works through data, representation, and interpretation across three registers. First, we ground the concept etymologically and lexically, tracing the word datum from its Latin root through to contemporary technical usage, in order to expose a structural implication that the informal definitions all share but never state outright. Second, we formalize data as a function from an index set to a value set, with explicit totality, functionality, and codomain-consistency conditions, and we show how this framing subsumes sequences, tuples, and multisets as special cases rather than treating them as separate notions. Third, we introduce the data–information–knowledge hierarchy as the interpretive scaffolding through which machines, which operate at the data level and nothing more, are made to produce outcomes that mean something to human beings, and we argue that engineering this transition is really the central problem of data representation. From the formal definition we derive four corollaries linking specific violations of its conditions to recurring engineering failure modes, namely schema drift, type confusion, unsound equality, and serialization ambiguity, and we walk through a worked example for each. 
The paper closes by separating data from representation, arguing that the choice of representation is always contingent and sits outside the data itself, and that grasping this distinction is a prerequisite for reasoning clearly about encoding, type systems, and protocols.

Introduction

Every working system that stores, transmits, or processes information operates, in the end, on what its designers call “data.” The term shows up in database textbooks, programming language specifications, legal frameworks, and information-science curricula, yet it somehow resists a precise, operational definition. Database texts describe data as “raw facts” awaiting processing [1]; information-science references describe data as facts organized for communication [2]; international guidelines call data “facts, concepts, or instructions expressed in a standardized way, suitable for being communicated, interpreted, or processed, whether by people or by automated systems” [3]; and general dictionaries treat data as “facts or things known, used as the basis for inference or calculation” [4]. None of these is wrong, exactly, but none of them is formal either. None specifies what structure a collection of facts must have before it can be indexed, copied, compared for equality, or serialized without ambiguity.
This gap is not an academic inconvenience. It shows up in production systems as a recognizable family of bugs. Schema drift happens when two components silently disagree about what an index or field name means, even while operating on byte-identical data. Type confusion [5] happens when a fixed byte sequence gets reinterpreted under a type assumption different from the one it was originally written under. Unsound equality happens when two values are treated as equal just because they share a representation, even though they differ in the value set the comparison is silently assuming. And serialization ambiguity happens when a wire format under-specifies the index set or value set badly enough that two conformant implementations end up disagreeing on what the data actually means. Each of these has the same root cause: an implicit assumption about the domain, codomain, or mapping behind a piece of data was never written down, and so it could never be checked.
Closing this gap takes more than just picking one informal definition over another. It requires working through, in order, what data actually are, how data come to mean something, and how that meaning gets encoded back into data a machine can process. In other words, it requires a unified account of data, representation, and interpretation. That is what this paper sets out to do.

Scope and Assumptions

This paper is concerned with data the way computing systems treat it: stored, addressed, transmitted, interpreted. It is not trying to give a complete philosophical account of the data–information–knowledge distinction, and it does not take a side in the debate over whether data are theory-laden or “given” in some stronger epistemological sense. The formal treatment here is deliberately minimal. It introduces only the mathematical structure needed to make the engineering corollaries precise, nothing beyond that. Readers looking for a fuller philosophical treatment should start with Floridi [6] and Zins [7].

What Is Data?

A simple example is probably the best place to start. Look at the cross-section of a tree trunk and you find facts there without anyone needing to explain what a tree even is. The number of growth rings gives a reasonable estimate of the tree’s age, since each ring corresponds to one year of growth laid down during the annual cycle of dormancy and activity. The texture and color of the bark hint at the species, and warped grain, discoloration, or insect tunnels point to its health. Look closer still, and the density of the wood, the direction of the grain, and the moisture content explain the piece’s weight and volume. None of these facts were created the moment someone noticed them. They were already there in the tree, available to anyone who knew how to read them. That correspondence, between a physical state and a fact about the world you can read off from it, is about the simplest illustration there is of what this paper means by data.

Etymology and Lexical History

The word data is the plural of the Latin datum, the past participle of dare, “to give.” A datum is, quite literally, a “given”: a piece of information taken as a starting point for reasoning rather than derived from something else. In everyday speech the plural form gets treated as a singular collective noun all the time (“the data is clear”), but in technical writing the plural is preferred: data are observations, measurements, or recorded facts, not one undivided thing.
The Latin origin carries real structural weight that survives into technical usage. A datum is something received from the world, not something invented by the observer who happens to record it. Standard references preserve this sense, but they never quite formalize it.

Existing Definitions and Their Shared Structure

Different sources define the term in different, only partly overlapping ways. Rob, Morris, and Coronel describe data as raw facts, facts that have not yet been processed enough for their meaning to become apparent [1]. Ratzan instead defines the companion term: information is a coherent collection of data, organized in a particular way and meaningful as a result [2]. The two definitions agree on more than they disagree about. In both, data come before meaning rather than supplying it directly.
General dictionaries describe roughly the same shape from a different angle. Webster’s New International Dictionary treats data as something given or accepted: facts or principles granted, the basis from which an inference or a line of reasoning proceeds, or the material an idealized system is built from [8]. The Oxford English Dictionary is more compact about it: data are facts or things known, used as the basis for inference or calculation [4]. The UNESCO definition adds a structural requirement that the others leave out:

Facts, concepts, or instructions expressed in a standardized way, suitable for being communicated, interpreted, or processed, whether by people or by automated systems. [3]

“Expressed in a standardized way” is really the key phrase here. It gestures at, without actually specifying, the requirement that data have some structure that makes indexing, comparison, and transmission possible in the first place. The informal definitions all agree that data precede meaning, but none of them says what structural properties make data addressable to begin with.

The Structural Implication

Put these definitions together and they point at something none of them states outright. Data have no shape that belongs to them independently of some particular point of view. A fact only counts as data once it sits inside a framework, whether of meaning, purpose, or use, relative to which it is relevant, organized, coherent, and useful. A growth ring means nothing to a system that has no concept of seasons; a voltage means nothing to a circuit built to ignore it. Data presuppose some prior agreement, however informal, about what is even worth attending to.
This structural implication is exactly what the formal definition in Section 3 pins down. Before getting there, two properties of data that are easy to overlook deserve a closer look.

The Context-Dependence of Data

The same physical signal can be different data depending on the framework applied to it. Take five bytes with hexadecimal values 0x41 0x52 0x4C 0x49 0x5A. Under one set of assumptions this is the ASCII string "ARLIZ"; under another it is five unsigned 8-bit integers, 65, 82, 76, 73, and 90; under a third it is the first five bytes of some arbitrary binary file format; under a fourth, a fragment of a network packet payload. The bytes themselves are identical every time. What changes is not the underlying signal but the interpretation imposed on top of it. Meaning is not contained in the bytes. It arises from the agreement between whatever system produced them and whatever system reads them.
This example also shows the practical cost of ignoring context-dependence: if the framework is never stated, different components will end up applying different frameworks to the same bytes, and they will produce different, incompatible results. The defects described in Section 1 are exactly this phenomenon, just at scale.
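The five-byte example can be reproduced directly. The sketch below is illustrative only: the variable names are hypothetical, and the third reading (a single big-endian integer) is an added assumption standing in for "some arbitrary binary format," not something the text specifies.

```python
# The five bytes from the text, read under three different frameworks.
raw = bytes([0x41, 0x52, 0x4C, 0x49, 0x5A])

# Framework 1: each byte is an ASCII character.
as_text = raw.decode("ascii")

# Framework 2: each byte is an unsigned 8-bit integer.
as_uint8 = list(raw)

# Framework 3 (assumed for illustration): all five bytes form one
# big-endian unsigned integer.
as_big_endian = int.from_bytes(raw, "big")

print(as_text)             # ARLIZ
print(as_uint8)            # [65, 82, 76, 73, 90]
print(hex(as_big_endian))  # 0x41524c495a
```

Nothing in `raw` selects one of these readings over the others. The choice is made by the code that reads the bytes, which is the point the paragraph above makes.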

The Perishability of Data

A common assumption early in a technical career is that data are fixed and objective, that once you record a fact, the fact just sits there. This is only partly true, and the part that isn’t true matters quite a lot. Walliman points out that data are not only imprecise but transient and perishable [9]. A poll measuring voting intentions ahead of an election will not produce the same result as an otherwise identical poll run at a different time, even with the same sampling method and, in principle, the same respondents.
“Perishable” does not mean the data physically vanish. A measurement recorded yesterday is still sitting there to be read today. What perishes is its currency, its claim to describe the present state of whatever it measured. Data perish in a second sense too. Secondhand reports and biased accounts get presented as settled fact all the time, because no method of recording information is perfectly reliable, and some distortion creeps in at almost every stage of transmission. This is not some rare edge case confined to opinion polling. It is a basic property of how facts move from the world into a recorded form. The engineering implication is that reliable data require explicit attention to precision, integrity, provenance, and timestamp, and these properties follow naturally once you adopt the formal definition in Section 3.

A Formal Definition of Data

The informal definitions reviewed in Section 2 are consistent with each other, but none of them is formal. None says what structure a collection of facts needs before it can be indexed, compared for equality, or serialized. Mathematics gives us a sharper handle on the same idea. In mathematics, data are most naturally thought of as a function: a mapping from some index set to some value set.

Primitive Notions

Definition 1 (Index Set and Value Set). An index set $I$ is a set whose elements serve as positions or addresses. A value set $V$ is a set whose elements serve as the possible contents at a position.

Neither $I$ nor $V$ is constrained any further at this level of abstraction. $I$ may be finite or infinite, ordered or unordered, and $V$ can contain elements of any kind at all. What matters is just that both are fixed before a collection of data gets constructed.

Definition 2 (Datum). A datum is an element $v \in V$, drawn from some value set $V$, designated as the content held at a single index.

The Function-Theoretic Definition

Definition 3 (Collection of Data). A collection of data $D$ over index set $I$ and value set $V$ is a total function
$D : I \to V$ assigning to every index $i \in I$ exactly one value $D(i) \in V$.

For $D$ to be well-defined, three conditions need to hold.

Condition 1 (Totality).

$D$ must be defined for every $i \in I$. A structure that leaves some indices unassigned is a partial function, and any operation that assumes totality is unsound when applied to it without first checking the domain.

Condition 2 (Functionality).

For every $i \in I$, $D(i)$ must denote a single, determinate value in $V$. A mapping that assigns multiple candidate values to the same index (two fields claiming the same byte offset under different schema versions, for example) is not a function, and therefore not a single well-defined collection of data.

Condition 3 (Codomain Consistency).

Every value $D(i)$ must belong to the same value set $V$ for all $i \in I$, where $V$ is fixed in advance of constructing $D$.
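The three conditions can be turned into executable checks. What follows is a minimal sketch under stated assumptions: the index set is a finite Python iterable, the value set is given as a membership predicate, and the function names `build_from_pairs` and `check_collection` are hypothetical, introduced here only for illustration.

```python
def build_from_pairs(pairs):
    """Build a mapping from (index, value) pairs, enforcing Condition 2.

    Two different values for the same index mean the input is not a function.
    """
    mapping = {}
    for index, value in pairs:
        if index in mapping and mapping[index] != value:
            raise ValueError(
                f"functionality: index {index!r} maps to both "
                f"{mapping[index]!r} and {value!r}"
            )
        mapping[index] = value
    return mapping


def check_collection(index_set, mapping, in_value_set):
    """Return a list of violations of Conditions 1 and 3 (empty if none)."""
    index_set = set(index_set)
    keys = set(mapping)
    problems = []
    # Condition 1 (totality): every index must have a value.
    for index in sorted(index_set - keys, key=repr):
        problems.append(f"totality: index {index!r} has no value")
    # Keys outside I mean the mapping is not a function on I at all.
    for index in sorted(keys - index_set, key=repr):
        problems.append(f"domain: key {index!r} is not in the index set")
    # Condition 3 (codomain consistency): every value must lie in V.
    for index, value in mapping.items():
        if index in index_set and not in_value_set(value):
            problems.append(f"codomain: D({index!r}) = {value!r} is not in V")
    return problems


is_float = lambda v: isinstance(v, float)  # V = floats, as an example
partial = build_from_pairs([(0, 1.5), (1, 2.0)])
print(check_collection(range(3), partial, is_float))
# ['totality: index 2 has no value']
```

A dictionary already guarantees at most one value per key, so Condition 2 can only be violated while the mapping is being assembled. That is why it is checked in the builder rather than in the validator.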

Equality

The function-theoretic definition gives us a precise criterion for equality that differs in an important way from mere representational identity.

Definition 4 (Functional Equality). Two collections of data $D_1 : I_1 \to V_1$ and $D_2 : I_2 \to V_2$ are equal, written $D_1 = D_2$, if and only if $I_1 = I_2$, $V_1 = V_2$, and $D_1(i) = D_2(i)$ for every $i \in I_1$.

This is a stronger condition than byte-for-byte identity of an encoding. Two structures can share an identical encoding while differing in $V$, exactly as the five-byte example in Section 2 shows, and Definition 4 correctly classifies them as distinct collections of data. The reverse also holds: two collections can be equal under Definition 4 while differing in their encodings, because the encoding plays no part in the definition at all.
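Both directions of this claim can be checked in a few lines. The sketch below is an illustration rather than a definitive model: the `Collection` class and the `collection` helper are hypothetical, and the value set is represented by a string label such as "uint8", which is a simplifying assumption.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Collection:
    """A collection of data: index set, value-set label, and the mapping."""
    index_set: frozenset
    value_set: str   # label naming V (an assumption of this sketch)
    values: tuple    # sorted (index, value) pairs


def collection(values_by_index, value_set):
    pairs = tuple(sorted(values_by_index.items()))
    return Collection(frozenset(values_by_index), value_set, pairs)


# Same encoding, different data: one byte string read under two value sets.
encoding = b"ARLIZ"
chars = collection(dict(enumerate(encoding.decode("ascii"))), "ascii-char")
ints = collection(dict(enumerate(encoding)), "uint8")
print(chars == ints)   # False

# Different encodings, same data: raw bytes versus JSON text.
from_bytes = collection(dict(enumerate(b"ARLIZ")), "uint8")
from_json = collection(dict(enumerate(json.loads("[65, 82, 76, 73, 90]"))), "uint8")
print(from_bytes == from_json)   # True
```

The generated dataclass equality compares the index set, the value-set label, and the mapping, which mirrors the three clauses of Definition 4. The encoding never enters the comparison.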

Relation to Standard Mathematical Structures

The function-theoretic definition deliberately generalizes several structures that get used informally as synonyms for “data.”

Sequence / Array.
The special case $I = \{0, 1, \dots, n-1\}$: a function from a finite, contiguous, totally ordered index set into $V$. This is the canonical definition of a sequence in the mathematical literature [10], and it is the formal starting point for arrays. Arrays are just data over the particular index set that integer addressing supplies.

Tuple.
The special case where $I$ is small and fixed, and $V$ is allowed to vary per index ($D(i) \in V_i$). Tuples relax Condition 3 deliberately and openly, which is a structurally different move from the implicit relaxation that produces type-confusion defects (see the discussion of engineering implications).

Multiset.
Corresponds to the case where $I$ is considered only up to a permutation. A multiset is a quotient of the function-theoretic structure, not some alternative to it: the underlying function still exists, and the multiset view simply ignores the ordering of $I$.

Relation / Table.
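A relation with $k$ attributes over domains $V_1, \dots, V_k$ is, under this definition, a set of functions $t$ from $\{1,\dots,k\}$ into $V_1 \cup \cdots \cup V_k$ with $t(j) \in V_j$ for each $j$, that is, a subset of $V_1 \times \cdots \times V_k$. A primary key in the relational model is a subset of the indices $\{1,\dots,k\}$ whose values suffice to identify each function in the set uniquely.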

The unifying point is that each of these structures is just a special case of Definition 3, distinguished by the properties imposed on $I$, $V$, or both. Moving between them is a matter of specifying or relaxing constraints on those two sets, nothing more exotic than that.
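The four special cases can be written side by side as mappings over different index sets. This is a rough sketch: all names and sample values are made up for illustration, and a plain `dict` stands in for the function $D$.

```python
from collections import Counter

# Sequence / array: I = {0, ..., n-1}, one value set V (here: str).
sequence = {0: "a", 1: "b", 2: "a"}
as_list = [sequence[i] for i in range(len(sequence))]

# Tuple: small fixed I, with a declared value set V_i per index.
field_types = {"name": str, "age": int}   # hypothetical schema
person = {"name": "Ada", "age": 36}       # made-up record
tuple_ok = all(isinstance(person[i], t) for i, t in field_types.items())

# Multiset: the same function with the identity of the indices forgotten.
permuted = {0: "a", 1: "a", 2: "b"}
same_multiset = Counter(sequence.values()) == Counter(permuted.values())
same_function = sequence == permuted

# Relation: a set of tuples over the same attribute indices.
relation = {("Ada", 36), ("Alan", 41)}    # made-up rows
key_column = 0                            # candidate key: attribute 0
key_is_unique = len({row[key_column] for row in relation}) == len(relation)

print(as_list)                       # ['a', 'b', 'a']
print(same_multiset, same_function)  # True False
print(tuple_ok, key_is_unique)       # True True
```

The multiset comparison shows the quotient at work: `sequence` and `permuted` are different functions, yet they become indistinguishable once the index ordering is ignored.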

A Worked Example

A weather station records temperature once per hour. Formally this is a function $D : \mathbb{N} \to \mathbb{R}$, where the index set is the hour number and the value set is temperature in degrees Celsius. The station’s log for one day is the restriction of $D$ to $I = \{0, 1, \dots, 23\}$, which is an array. The highest temperature recorded is $\max_{i \in I} D(i)$, a computation that is well-defined precisely because $I$ is finite and $D$ is total on it. If a sensor fails and no reading gets recorded at hour 14, the structure is no longer total on $I$. It has become a partial function, and any operation that assumes totality (computing the daily average, for instance) is unsound unless the missing value is handled explicitly.
This example shows why the three conditions matter in practice. Engineering decisions about how to handle missing values, what type a field carries, and whether two records actually count as the same record all reduce to questions about $I$, $V$, and the conditions in Definitions 3 and 4.
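The weather-station example translates almost line for line into code. The readings below are synthetic, generated by a made-up formula, and `daily_mean` is a hypothetical helper. The sketch shows only one way of handling the missing value explicitly, namely refusing to compute.

```python
from statistics import mean

HOURS = range(24)  # the index set I = {0, ..., 23}

# Synthetic readings in degrees Celsius (made up for illustration).
log = {hour: 10.0 + 0.5 * hour for hour in HOURS}


def daily_mean(readings, hours):
    """Average over `hours`, refusing to run on a partial function."""
    missing = [h for h in hours if h not in readings]
    if missing:
        raise ValueError(f"log is partial; no reading for hours {missing}")
    return mean(readings[h] for h in hours)


daily_max = max(log[h] for h in HOURS)   # well-defined: I finite, D total
complete_mean = daily_mean(log, HOURS)

# The sensor fails at hour 14: the log is no longer total on I.
del log[14]
missing_hours = [h for h in HOURS if h not in log]

try:
    daily_mean(log, HOURS)
    outcome = "computed"
except ValueError:
    outcome = "refused"

print(daily_max, complete_mean)   # 21.5 15.75
print(missing_hours, outcome)     # [14] refused
```

Raising an error is only one policy. Interpolating, skipping the hour, or recording an explicit "no reading" value are alternatives, and each amounts to a different choice of $I$ or $V$.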

Data, Information, and Knowledge

The words data, information, and knowledge get used interchangeably in casual conversation, but they describe three distinct things. The distinction matters in practice. Confusing them leads to systems that produce numbers when people actually need answers. This section looks at the relationship between the three concepts, then argues that computing systems operate exclusively at the data level, and that the transition from data to information is something engineers have to design deliberately rather than get for free.

The Hierarchy

The relationship between data, information, and knowledge is standardly described as a hierarchy [11], [12]. This paper only really depends on the first two transitions in that hierarchy.
Go back to the tree cross-section from Section 2. A raw count of growth rings is data: fifty-three concentric lines, nothing more. Data are syntactic: they have structure but not yet meaning. Noticing that fifty-three rings means the tree is roughly fifty-three years old, and that the unusually narrow rings around year thirty line up with a known regional drought, turns that count into information: data organized and placed in context, in keeping with Ratzan’s definition [2]. Information answers questions like “what happened?” and “how much?” It is still factual and objective, but now it is interpretable by a human being. Knowing how to read ring width as a record of annual rainfall, reliably enough to apply the same method to a tree nobody has studied before, is knowledge: information understood thoroughly enough to be put to new use.

Example 1. The progression from data to knowledge can be put concisely:

Data: 39.5

Information: Patient temperature is 39.5 °C, recorded at 14:32 on 2026-06-28.

Knowledge: This patient has a significant fever. Combined with readings from the past six hours showing a rising trend, the attending physician should be notified immediately.

The number 39.5 is just a datum drawn from $\mathbb{R}$. The interpretive layer that attaches a unit, timestamp, and patient identity is what constitutes information. The clinical judgment that combines this information with prior cases and medical guidelines is what constitutes knowledge.
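The step from data to information in Example 1 can be sketched as a type that carries the interpretive layer alongside the number. Everything named here is an assumption made for illustration: the `TemperatureReading` class, the patient identifier, and above all the 38.0 threshold, which is a stand-in rule and not clinical guidance. A single threshold is also far short of the judgment the text calls knowledge.

```python
from dataclasses import dataclass
from datetime import datetime

# Data: a bare element of the value set, with no interpretation attached.
datum = 39.5


@dataclass(frozen=True)
class TemperatureReading:
    """Information: the datum plus the context needed to interpret it."""
    value: float
    unit: str
    patient_id: str        # hypothetical identifier
    recorded_at: datetime


reading = TemperatureReading(
    value=datum,
    unit="°C",
    patient_id="patient-001",
    recorded_at=datetime(2026, 6, 28, 14, 32),
)

FEVER_THRESHOLD_C = 38.0   # illustrative assumption only


def exceeds_threshold(r: TemperatureReading) -> bool:
    """A rule that is only meaningful once the unit is known."""
    if r.unit != "°C":
        raise ValueError(f"cannot interpret a value in unit {r.unit!r}")
    return r.value >= FEVER_THRESHOLD_C


print(exceeds_threshold(reading))   # True
```

The rule cannot be applied to `datum` at all, because a bare float carries no unit to check. That asymmetry is the difference between the data tier and the information tier.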

What Computing Systems Do

Computers operate at the level of data as defined in Section 3. A CPU does not understand information; it manipulates bits. A memory array does not know what its contents mean; it just stores and retrieves bytes at addresses. Shannon’s foundational account of communication is explicit about this: the engineering problem is the reliable transmission of symbols, regardless of what they mean [13]. Meaning is the problem of the sender and the receiver, not the channel.
The transition from data to information is something software engineers build into their systems on purpose, through type systems, schemas, protocols, and documentation. When a programmer writes int temperature = 39;, what they are really asserting is that this particular sequence of bits represents a temperature, measured in some unit, at some precision. The meaning lives in the programmer’s head and in the documentation, not in the machine. The machine just stores 0x00000027 and nothing else.
This is not a criticism of computers; it is a description of how they work. Understanding it is the first step toward seeing why encoding schemes, type systems, and representation formats exist in the first place. Each one answers the same question: how to encode information, which has meaning, as data, which does not, in a way that preserves the meaning reliably.
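The stored form of `int temperature = 39;` can be inspected with Python's standard `struct` module. The sketch assumes a 32-bit signed integer, which the text implies by writing 0x00000027 but which the declaration itself does not guarantee on every platform. The byte order is likewise an assumption, so both orders are shown.

```python
import struct

# 39 as a 32-bit signed integer, big-endian (assumed width and byte order).
big = struct.pack(">i", 39)
# The same value on a little-endian layout.
little = struct.pack("<i", 39)

print(big.hex())      # 00000027
print(little.hex())   # 27000000

# Reading the bytes back under the type they were written with.
as_int = struct.unpack(">i", big)[0]

# Reading the very same bytes under a different type assumption.
as_float = struct.unpack(">f", big)[0]

print(as_int)              # 39
print(as_float == 39.0)    # False
```

Nothing in the four bytes records that they hold a temperature, an integer, or a particular byte order. Reading them as a 32-bit float yields a tiny subnormal number instead of 39, a small instance of the type confusion described in the introduction.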

The Gap Easy Access Does Not Close

Having access to information does not give you the understanding needed to tell whether that information is relevant, reliable, or correctly applied. A search engine can return information in milliseconds, but it cannot, on its own, supply the judgment needed to evaluate it. The gap between having information and having knowledge is not something retrieval speed closes. This is an old observation, but it is still worth restating in a technical context: systems that expose data as if it were information, or information as if it were knowledge, end up misleading their users in predictable ways.

Relationship to the DIKW Hierarchy

Ackoff’s data–information–knowledge–wisdom hierarchy [11] and its many descendants in the information-science literature [7], [12] consistently treat the data tier as their foundation without ever defining it formally. The function-theoretic definition in Section 3 is meant as a precise characterization of exactly that tier, one that is compatible with, and presupposed by, every DIKW-style account. It does not compete with the hierarchy. It fills the structural gap the hierarchy leaves open at the bottom.

Similarly, Floridi’s General Definition of Information (GDI) [6] characterizes information as data that are well-formed, meaningful, and (in some variants) truthful, while treating data themselves as a primitive that GDI deliberately leaves unspecified. The function-theoretic definition supplies the structural detail that GDI leaves open, without disturbing the relationship between data and information that GDI describes.

References

[1] P. Rob, S. Morris, and C. Coronel, Database systems: Design, implementation, and management, 10th ed. Boston, MA: Cengage Learning, 2013.
[2] L. Ratzan, Understanding information systems. Chicago, IL: ALA Editions, 2004.
[3] UNESCO, “Information for all programme: Guidelines on information literacy.” https://www.unesco.org, 2008.
[4] Oxford University Press, “Oxford English Dictionary.” Online edition. https://www.oed.com, 2023.
[5] MITRE Corporation, “CWE-843: Access of resource using incompatible type (‘Type confusion’),” 2023.
[6] L. Floridi, Information: A very short introduction. Oxford: Oxford University Press, 2010.
[7] C. Zins, “Conceptual approaches for defining data, information, and knowledge,” Journal of the American Society for Information Science and Technology, vol. 58, no. 4, pp. 479–493, 2007, doi: 10.1002/asi.20508.
[8] Webster’s new international dictionary of the English language, 2nd ed. Springfield, MA: G. & C. Merriam Co., 1934.
[9] N. Walliman, Research methods: The basics. London: Routledge, 2011.
[10] D. E. Knuth, The art of computer programming, volume 1: Fundamental algorithms, 3rd ed. Reading, MA: Addison-Wesley, 1997.
[11] R. L. Ackoff, “From data to wisdom,” Journal of Applied Systems Analysis, vol. 16, pp. 3–9, 1989.
[12] J. Rowley, “The wisdom hierarchy: Representations of the DIKW hierarchy,” Journal of Information Science, vol. 33, no. 2, pp. 163–180, 2007, doi: 10.1177/0165551506070706.
[13] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

DEV Community