Today I'm sharing a foundational principle that I keep encountering in my research into how data systems are built. It appears in different forms but always serves the same core purpose: removing redundancy through indirection.
Let me walk you through four patterns that share the same substrate: favoring links over repetition.
Dictionary Encoding
Dictionary encoding is perhaps the clearest expression. Instead of storing repeated values directly, we create a dictionary (lookup table) and store references to entries in that dictionary.
Let's consider this array:
Fruits: ["apple", "banana", "apple", "cherry", "banana", "apple"]
We can transform it to the following:
Dictionary: {0: "apple", 1: "banana", 2: "cherry"}
Fruits: [0, 1, 0, 2, 1, 0]
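The transformation above can be sketched in a few lines of Python (a minimal illustration, not a production encoder):

```python
def dictionary_encode(values):
    """Store each unique value once and replace occurrences with integer codes."""
    value_to_code = {}  # value -> code, assigned in order of first appearance
    codes = []
    for v in values:
        if v not in value_to_code:
            value_to_code[v] = len(value_to_code)
        codes.append(value_to_code[v])
    # Invert to code -> value so the codes can be decoded back
    dictionary = {code: value for value, code in value_to_code.items()}
    return dictionary, codes

fruits = ["apple", "banana", "apple", "cherry", "banana", "apple"]
dictionary, codes = dictionary_encode(fruits)
# dictionary == {0: "apple", 1: "banana", 2: "cherry"}
# codes == [0, 1, 0, 2, 1, 0]
```

Decoding is just a lookup per code, so the original array is always recoverable.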
The benefit is an immediate reduction in storage space, but we've also established a single source of truth for each unique value. Hmm, where have you heard that before...
Database Normalization
Database normalization takes this same principle but applies it to relational data structures.
For example, instead of repeating customer information across every order record, we create separate tables and link them through foreign keys.
Denormalized:
Orders: [order_id, customer_name, customer_email, product_name, quantity]
Normalized:
Customers: [customer_id, name, email]
Products: [product_id, name, price]
Orders: [order_id, customer_id, product_id, quantity]
This isn't just about storage efficiency, but also about data integrity. When customer information changes, there's only one place to update it. We've eliminated the possibility of inconsistent data.
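A minimal Python sketch of this split, using plain dicts to stand in for tables (the column names follow the example above; the id-assignment scheme and sample rows are illustrative):

```python
# Denormalized rows: customer and product details repeated on every order
denormalized = [
    {"order_id": 1, "customer_name": "Ada", "customer_email": "ada@example.com",
     "product_name": "Widget", "quantity": 2},
    {"order_id": 2, "customer_name": "Ada", "customer_email": "ada@example.com",
     "product_name": "Gadget", "quantity": 1},
]

customers, products, orders = {}, {}, []
for row in denormalized:
    # Assign each unique customer and product an id the first time it appears
    cust_key = (row["customer_name"], row["customer_email"])
    if cust_key not in customers:
        customers[cust_key] = len(customers) + 1
    prod_key = row["product_name"]
    if prod_key not in products:
        products[prod_key] = len(products) + 1
    # The order now stores foreign keys instead of repeated details
    orders.append({"order_id": row["order_id"],
                   "customer_id": customers[cust_key],
                   "product_id": products[prod_key],
                   "quantity": row["quantity"]})

# Ada appears once in `customers`; both orders reference her by id.
```

Updating Ada's email now means changing one entry in `customers`, and every order linked to her sees the change.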
String Interning
String interning ensures that identical string literals share the same memory location. Instead of creating multiple string objects with the same content, the runtime maintains a pool of unique strings and returns references to existing instances.
As a case study, see this entry from the Java Language Specification describing how interning works in Java. The concept exists in Python as well. And as a final example, here's a post from VictoriaMetrics on how they applied it in their solution.
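In CPython, strings built at runtime are not automatically interned, but the standard library exposes the pool through `sys.intern` (a small demonstration; the identity of non-interned strings is an implementation detail):

```python
import sys

a = "".join(["post", "gres"])   # constructed at runtime, so not auto-interned
b = "".join(["post", "gres"])
print(a is b)                    # False in CPython: two distinct objects

a, b = sys.intern(a), sys.intern(b)
print(a is b)                    # True: both references point at the pooled copy
```

After interning, equality checks between `a` and `b` can short-circuit to a pointer comparison, which is exactly the link-instead-of-copy trade we keep seeing.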
German Strings
"German Strings" is a clever string optimization technique. The approach stores a 4-character prefix directly within the string header, avoiding pointer dereferences for common string operations. The key insight is that most string operations only need to examine the beginning of a string.
Let's consider this full string: "PostgreSQL is awesome".
This is the German string structure:
[length: 21][prefix: "Post"][pointer] -> "PostgreSQL is awesome"
This creates an indirection pattern where the prefix enables fast string comparisons and filtering operations without dereferencing pointers, since most mismatches can be detected by comparing just the first few characters.
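Here's a toy model of that comparison path in Python (the tuple merely mimics the header layout; real implementations use a fixed-size binary struct and a raw pointer):

```python
# Toy "German string": (length, 4-byte prefix, payload).
# In a real implementation the payload lives behind a pointer;
# here the third tuple slot stands in for that indirection.
def make_german(s: str):
    return (len(s), s[:4], s)

def maybe_equal(g1, g2):
    """Cheap rejection using only length and prefix, never touching the payload."""
    return g1[0] == g2[0] and g1[1] == g2[1]

a = make_german("PostgreSQL is awesome")
b = make_german("PostgreSQL is awesome")
c = make_german("MySQL is awesome too!")

print(maybe_equal(a, c))  # False: same length, but prefixes "Post" vs "MySQ" differ
print(maybe_equal(a, b))  # True: a full payload comparison is still required
```

Only candidates that survive the length-and-prefix check pay for the pointer chase, which is why mismatch-heavy workloads like filtering benefit most.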
However, German strings aren't always optimal. The overhead per string can be problematic for certain workloads. As the team at Polar Signals described, for low-cardinality string columns (like airport codes or status enums), simple dictionary encoding provides a 75% memory reduction compared to German strings.
Conclusion
The pattern of creating links to canonical sources is prevalent because it addresses fundamental challenges: storage efficiency, data consistency, and maintainability.
The next time you encounter repeated data in your systems, ask yourself: "Could I consider a link here instead?" The answer might lead you to refactor towards more elegant, efficient, and maintainable solutions.
Thanks for reading! Until next time!