Today I'm sharing a foundational principle that I keep encountering in my research into how data systems are built. It appears in different forms but always serves the same core purpose: removing redundancy through indirection.
Let me walk you through four patterns that share the same substrate: favoring links over repetition.
Dictionary Encoding
Dictionary encoding is perhaps the clearest expression. Instead of storing repeated values directly, we create a dictionary (lookup table) and store references to entries in that dictionary.
Let's consider this array:
Fruits: ["apple", "banana", "apple", "cherry", "banana", "apple"]
We can transform it to the following:
Dictionary: {0: "apple", 1: "banana", 2: "cherry"}
Fruits: [0, 1, 0, 2, 1, 0]
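The transformation above can be sketched in a few lines of Python (a minimal illustration, not a production encoder):

```python
def dictionary_encode(values):
    """Store each unique value once and replace occurrences with integer codes."""
    value_to_code = {}  # value -> code, assigned in order of first appearance
    codes = []
    for v in values:
        if v not in value_to_code:
            value_to_code[v] = len(value_to_code)
        codes.append(value_to_code[v])
    # Invert to code -> value so the codes can be decoded back
    dictionary = {code: value for value, code in value_to_code.items()}
    return dictionary, codes

fruits = ["apple", "banana", "apple", "cherry", "banana", "apple"]
dictionary, codes = dictionary_encode(fruits)
# dictionary == {0: "apple", 1: "banana", 2: "cherry"}
# codes == [0, 1, 0, 2, 1, 0]
```

Decoding is just a lookup per code, so the original array is always recoverable.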
The benefit is an immediate reduction in storage space, but we've also established a single source of truth for each unique value. Hmm, where have you heard that before...
Database Normalization
Database normalization takes this same principle but applies it to relational data structures.
For example, instead of repeating customer information across every order record, we create separate tables and link them through foreign keys.
Denormalized:
Orders: [order_id, customer_name, customer_email, product_name, quantity]
Normalized:
Customers: [customer_id, name, email]
Products: [product_id, name, price]
Orders: [order_id, customer_id, product_id, quantity]
This isn't just about storage efficiency, but also about data integrity. When customer information changes, there's only one place to update it. We've eliminated the possibility of inconsistent data.
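A minimal Python sketch of this split, using plain dicts to stand in for tables (the column names follow the example above; the id-assignment scheme and sample rows are illustrative):

```python
# Denormalized rows: customer and product details repeated on every order
denormalized = [
    {"order_id": 1, "customer_name": "Ada", "customer_email": "ada@example.com",
     "product_name": "Widget", "quantity": 2},
    {"order_id": 2, "customer_name": "Ada", "customer_email": "ada@example.com",
     "product_name": "Gadget", "quantity": 1},
]

customers, products, orders = {}, {}, []
for row in denormalized:
    # Assign each unique customer and product an id the first time it appears
    cust_key = (row["customer_name"], row["customer_email"])
    if cust_key not in customers:
        customers[cust_key] = len(customers) + 1
    prod_key = row["product_name"]
    if prod_key not in products:
        products[prod_key] = len(products) + 1
    # The order now stores foreign keys instead of repeated details
    orders.append({"order_id": row["order_id"],
                   "customer_id": customers[cust_key],
                   "product_id": products[prod_key],
                   "quantity": row["quantity"]})

# Ada appears once in `customers`; both orders reference her by id.
```

Updating Ada's email now means changing one entry in `customers`, and every order linked to her sees the change.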
String Interning
String interning ensures that identical string literals share the same memory location. Instead of creating multiple string objects with the same content, the runtime maintains a pool of unique strings and returns references to existing instances.
As a case study, see this entry from the Java Language Specification describing how interning works in Java. The concept exists in Python as well. And as a final example, here's a post from VictoriaMetrics on how they applied it in their solution.
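In CPython, strings built at runtime are not automatically interned, but the standard library exposes the pool through `sys.intern` (a small demonstration; the identity of non-interned strings is an implementation detail):

```python
import sys

a = "".join(["post", "gres"])   # constructed at runtime, so not auto-interned
b = "".join(["post", "gres"])
print(a is b)                    # False in CPython: two distinct objects

a, b = sys.intern(a), sys.intern(b)
print(a is b)                    # True: both references point at the pooled copy
```

After interning, equality checks between `a` and `b` can short-circuit to a pointer comparison, which is exactly the link-instead-of-copy trade we keep seeing.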
German Strings
"German Strings" is a clever string optimization technique. The approach stores a 4-character prefix directly within the string header, avoiding pointer dereferences for common string operations. The key insight is that most string operations only need to examine the beginning of a string.
Let's consider this full string: "PostgreSQL is awesome".
This is the German string structure:
[length: 21][prefix: "Post"][pointer] -> "PostgreSQL is awesome"
This creates an indirection pattern where the prefix enables fast string comparisons and filtering operations without dereferencing pointers, since most mismatches can be detected by comparing just the first few characters.
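Here's a toy model of that comparison path in Python (the tuple merely mimics the header layout; real implementations use a fixed-size binary struct and a raw pointer):

```python
# Toy "German string": (length, 4-byte prefix, payload).
# In a real implementation the payload lives behind a pointer;
# here the third tuple slot stands in for that indirection.
def make_german(s: str):
    return (len(s), s[:4], s)

def maybe_equal(g1, g2):
    """Cheap rejection using only length and prefix, never touching the payload."""
    return g1[0] == g2[0] and g1[1] == g2[1]

a = make_german("PostgreSQL is awesome")
b = make_german("PostgreSQL is awesome")
c = make_german("MySQL is awesome too!")

print(maybe_equal(a, c))  # False: same length, but prefixes "Post" vs "MySQ" differ
print(maybe_equal(a, b))  # True: a full payload comparison is still required
```

Only candidates that survive the length-and-prefix check pay for the pointer chase, which is why mismatch-heavy workloads like filtering benefit most.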
However, German strings aren't always optimal. The overhead per string can be problematic for certain workloads. As the team at Polar Signals described, for low-cardinality string columns (like airport codes or status enums), simple dictionary encoding provides a 75% memory reduction compared to German strings.
Conclusion
The pattern of creating links to canonical sources is prevalent because it addresses fundamental challenges: storage efficiency, data consistency, and maintainability.
The next time you encounter repeated data in your systems, ask yourself: "Could I consider a link here instead?" The answer might lead you to refactor towards more elegant, efficient, and maintainable solutions.
Thanks for reading! Until next time!