TL;DR: I sort encrypted strings using an external index: the sum of weighted Unicode code points for the first N characters with exponential positional weights, followed by quantization. Monotonicity is preserved, but accuracy predictably degrades after the first few characters. Not a cryptographic scheme; some ordering information leaks by design.
The essence of the problem
Some time ago, while implementing a project, I ran into the problem of sorting encrypted data in a database. I’d like to share the solution.
I won’t go into detail describing the entire application. I'll just say that, according to the required architecture, almost all data in the database must be stored exclusively in encrypted form: usernames, file names, tags, comments, dates, and so on (with the exception of identifiers and some system fields). That is, the table structure should be open, but the contents should not be. The encryption is symmetric: the same key is used for both encryption and decryption. This means that without the encryption key, even with a full database dump an attacker should not obtain any original data. And this is where two problems immediately arose: searching by the data without fully decrypting it, and sorting encrypted data.
The first problem, with some caveats, is solved fairly simply. To search encrypted data, it is enough to additionally store hashes of the original values we plan to search on. This allows exact-match lookups (for example, users by login or files by tag) without revealing the original values. Yes, this won’t allow pattern searches, but it’s quite acceptable for the project’s goals.
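As a rough sketch of that idea (the key, function name, and lookup flow here are my own illustration, not the project's actual code), a deterministic keyed hash can be stored next to the ciphertext and matched with a plain equality condition:

import hmac
import hashlib

# Hypothetical application-side secret, kept separate from the database
SEARCH_KEY = b"secret-search-key"

def search_token(value: str) -> str:
    """Deterministic keyed hash stored alongside the encrypted value."""
    canonical = value.casefold()  # match regardless of letter case
    return hmac.new(SEARCH_KEY, canonical.encode("utf-8"), hashlib.sha256).hexdigest()

# Looking up a user by login then reduces to comparing search_token("alice")
# against the stored token column with an ordinary equality WHERE clause.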
But sorting encrypted data turned out to be much more serious. So much so that I had to think hard about a solution.
The solution
A quick search showed that the problem is far from new, but there are no standard approaches to solving it. In other words, each specific task is solved individually. Yes, there are some clever algorithms that allow sorting directly on encrypted data, but they are anything but simple. In a nutshell, here’s what exists at the moment:
- OPE (Order-Preserving Encryption): the order of ciphertexts matches the order of plaintexts. Convenient for ORDER BY, but frequency analysis can recover the original data.
- ORE (Order-Revealing Encryption): the order does not match directly, but there is a comparison procedure for ciphertexts. Safer than OPE, but still reveals relative ordering.
- SSE (Searchable Symmetric Encryption) and FHE (Fully Homomorphic Encryption): look serious and could theoretically fit (though almost certainly with colossal computational overhead).
I rejected the first option myself. The other options rejected me: there were no simple solutions for PostgreSQL (which I was using at the time), and diving that deep into cryptography — let alone writing my own database extension — was not in my plans. I had to think.
So, we need to sort encrypted strings in the database without decrypting them. The solution should be simple to implement and acceptable from a security standpoint. Small sorting inaccuracies are allowed.
The only option that came to mind was to compute some numeric indices for the data we intend to sort by and store them separately. That is, instead of sorting by the data itself, we can sort by their indices. The index values should correlate with the encrypted strings but should not allow reverse reconstruction of those strings. The approach seemed reasonable, so I started thinking along these lines.
First, I decided it would be most convenient to convert strings into numbers whose magnitude would be relevant to the content of the original strings. Unicode code points can be used for the conversion.
Second, for indexing, we can consider only the first N characters rather than the entire string. Thus, each of these N characters should contribute to the final index, with the first character given the highest weight and each subsequent one progressively less.
Initially, I tried positional scaling with base 10. That is, I multiplied each character's code point by 10 raised to a power decreasing from the start of the substring, then summed the results. That base turned out to be too small: because code points are far larger than 10, the contributions of neighboring positions overlapped, and many different sequences of code points produced the same sums (collisions were rampant). Increasing the base fixed this.
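A toy comparison, using raw code points and no canonicalization (my own illustration, not code from the project), shows where base 10 breaks down and why a larger base does not:

def positional_sum(s: str, base: int) -> int:
    """Sum of code points scaled positionally by the given base (no quantization)."""
    n = len(s)
    return sum(ord(ch) * base ** (n - 1 - pos) for pos, ch in enumerate(s))

# With base 10, contributions of adjacent positions overlap and distinct
# strings collide: both sums below are 1080.
assert positional_sum("an", 10) == positional_sum("bd", 10)

# With a base wider than the spread of the alphabet, the sums stay apart:
# 3990 versus 4020.
assert positional_sum("an", 40) != positional_sum("bd", 40)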
As a result, the weight of an individual character is computed as the product of its code point and a constant — the positional weight — raised to the power of its position from the end of the substring:
WEIGHT = CODEPOINT × (WEIGHT_FACTOR ^ (N - 1 - POSITION))
Before the calculation, we need to canonicalize the string so that visually and lexically equivalent character variants have the same weight (with additional special-case handling for some alphabets).
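For example (my own illustration), case folding plus NFC normalization makes the composed and decomposed forms of an accented letter indistinguishable before weighting:

import unicodedata

composed = "\u00c9"      # "É" as a single code point
decomposed = "E\u0301"   # "E" followed by a combining acute accent

# Both canonicalize to the same lowercase "é" (U+00E9)
assert unicodedata.normalize("NFC", composed.casefold()) == \
       unicodedata.normalize("NFC", decomposed.casefold()) == "\u00e9"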
With the right choice of parameters, the weights of the higher positions exponentially dominate the lower ones. We just need the final index to comfortably fit into a 64-bit BIGINT for database storage.
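A rough worst-case estimate (my own arithmetic, not from the original project) shows why the chosen parameters fit comfortably:

INDEX_LENGTH = 6
WEIGHT_FACTOR = 40
MAX_CODEPOINT = 0x10FFFF  # largest possible Unicode code point (1,114,111)

# Position weights for N = 6: 40^5, 40^4, ..., 40^0
weights = [WEIGHT_FACTOR ** i for i in range(INDEX_LENGTH - 1, -1, -1)]
# [102400000, 2560000, 64000, 1600, 40, 1], summing to 105,025,641

max_index = MAX_CODEPOINT * sum(weights)  # roughly 1.17e14
assert max_index < 2 ** 63 - 1            # well inside a signed 64-bit BIGINT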
It would seem the solution is acceptable. But it turned out that this index is too informative: the large difference in character weights and the limited alphabet allow a dictionary attack (we can recover a substring by sequentially subtracting precomputed weights from the index). We need to “smear” the result to increase the number of collisions while preserving deterministic sort order. This is solved by quantization: it is enough to divide the index by a fixed divisor and discard the remainder:
FINAL_WEIGHT = WEIGHT_SUM // (2 ^ COARSE_SHIFT)
This preserves monotonicity but turns the set of exact values into a finite number of buckets. Substrings that fall into the same bucket will be considered equal (the database will resolve their relative order on its own).
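To make the effect concrete (the weight sums below are arbitrary numbers of my own choosing), with COARSE_SHIFT = 17 every bucket spans 2^17 = 131,072 consecutive exact values:

COARSE_SHIFT = 17
QUANT_DIVISOR = 2 ** COARSE_SHIFT  # 131072

# Nearby weight sums collapse into the same bucket...
assert 10_000_000 // QUANT_DIVISOR == 10_050_000 // QUANT_DIVISOR
# ...while sums further apart keep their relative order.
assert 10_000_000 // QUANT_DIVISOR < 11_000_000 // QUANT_DIVISOR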
After a bit of experimentation, I settled on N = 6, WEIGHT_FACTOR = 40, COARSE_SHIFT = 17. Essentially, this is an engineering compromise that provides deterministic sorting (heavily biased toward the initial characters) regardless of the internal representation (UTF-16/UTF-32). The result is resistant to clumping on the first few characters but does allow an attack with a limited dictionary: the first two of six characters can be recovered with very high probability. The rest are concealed by quantization. At the same time, sorting error beyond the second character inevitably increases. The code:
from typing import Union
import unicodedata

INDEX_LENGTH = 6
WEIGHT_FACTOR = 40
COARSE_SHIFT = 17

# Precomputed position weights
_WEIGHTS = tuple(WEIGHT_FACTOR ** i for i in range(INDEX_LENGTH - 1, -1, -1))

# Precomputed quantization divisor
QUANT_DIVISOR = 2 ** COARSE_SHIFT

# Unicode normalization form used during canonicalization
NORMALIZATION_FORM = "NFC"


def _canonicalize(value: str) -> str:
    """Case-insensitive canonicalization of a string."""
    return unicodedata.normalize(NORMALIZATION_FORM, value.casefold())


def get_index(value: Union[str, int]) -> int:
    """Compute a quantized sorting index for ORDER BY clauses."""
    if isinstance(value, int):
        value = str(value)
    s = _canonicalize(value)
    idx = 0
    for pos, ch in enumerate(s[:INDEX_LENGTH]):
        idx += ord(ch) * _WEIGHTS[pos]
    return idx // QUANT_DIVISOR
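A quick sanity check of the behavior (illustrative strings of my own choosing, evaluated with the code above):

# Ordering on the leading characters is preserved:
assert get_index("apple") < get_index("banana") < get_index("cherry")

# Case differences are canonicalized away:
assert get_index("Apple") == get_index("apple")

# Only the first six characters contribute, so longer tails do not change the index:
assert get_index("database") == get_index("databases")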
Thus, the original data in the database remain hidden, but they can be sorted by the first six characters. At the same time, only the first two “leak”. I considered the resulting solution good enough. Have thoughts on this? I’d love to hear them in the comments.
Project repository — github.com/artabramov/hidden (at the moment, another solution is used).