Artem

Posted on Jun 13

Sorting Encrypted Strings with a Leaked-Order Index

#security #database #python

TL;DR: This is not a cryptographic construction. It is a pragmatic engineering compromise for applications where encrypted storage is required but approximate alphabetical ordering is still useful. I sort encrypted strings using an external index: the sum of weighted Unicode code points for the first N characters with exponential positional weights, followed by quantization. Monotonicity is preserved, but accuracy predictably degrades after the first few characters. Not a cryptographic scheme; some ordering information leaks by design.

The problem

Some time ago, while implementing a project, I ran into the problem of sorting encrypted data in a database. I’d like to share the solution.

I won’t go into detail describing the entire application. I’ll just say that, according to the required architecture, almost all data in the database must be stored exclusively in encrypted form: usernames, file names, tags, comments, dates, and so on (with the exception of identifiers and some system fields). That is, the table structure should be open, but the contents should not be. The encryption is symmetric: the same key is used for both encryption and decryption. This means that without the encryption key, even with a full database dump an attacker should not obtain any original data. And this is where two problems immediately arose: searching by the data without fully decrypting it, and sorting encrypted data.

The first problem, with some caveats, is solved fairly simply. To search encrypted data, it is enough to additionally store hashes of the original values you plan to search on. This allows exact-match lookups (for example, users by login or files by tag) without storing the original values in plaintext. Yes, this won’t allow pattern searches, but it’s quite acceptable for the project’s goals.

But sorting encrypted data turned out to be significantly more difficult.

The solution

A quick search showed that the problem is far from new, but there are no standard approaches to solving it. In other words, each specific task is solved individually. Yes, there are some clever algorithms that allow sorting directly on encrypted data, but they are anything but simple. In a nutshell, here’s what exists at the moment:

OPE (Order-Preserving Encryption) — the order of ciphertexts matches the order of plaintexts. Convenient for ORDER BY, OPE leaks significant ordering and distribution information, making statistical attacks practical in many scenarios.
ORE (Order-Revealing Encryption) — the order does not match directly, but there is a comparison procedure for ciphertexts. Safer than OPE, but still reveals relative ordering.
SSE (Searchable Encryption) and FHE (Fully Homomorphic Encryption) — look serious and could theoretically fit (though almost certainly with colossal computational overhead).

I rejected the first option myself. The remaining options were either too complex to deploy or required substantial cryptographic infrastructure: there were no simple solutions for PostgreSQL (which I was using at the time), and diving that deep into cryptography—let alone writing my own database extension—was not in my plans. So I started looking for a simpler alternative.

So, we need to sort encrypted strings in the database without decrypting them. The solution should be simple to implement and acceptable from a security standpoint. Small sorting inaccuracies are allowed.

The only option that came to mind was to compute some numeric indices for the data we intend to sort by and store them separately. That is, instead of sorting by the data itself, we can sort by their indices. The index values should correlate with the original strings but should not allow straightforward reconstruction of them. The approach seemed reasonable, so I started thinking along these lines.

First, I decided it would be most convenient to convert strings into numbers whose magnitude would be relevant to the content of the original strings. Unicode code points can be used for the conversion.

Second, for indexing, you can consider only the first N characters rather than the entire string. Thus, each of these N characters should contribute to the final index, with the first character given the highest weight and each subsequent one progressively less.

Initially, I tried positional scaling with base 10. That is, I multiplied the character code points by 10 raised to a power decreasing from the start of the substring, then summed the result. That base turned out to be insufficient: due to carry propagation, many sequences of code points produced the same sums (collisions were rampant). But increasing the base fixed this.

As a result, the weight of an individual character is computed as the product of its code point and a constant—the positional weight—raised to the power of its position from the end of the substring:

WEIGHT = CODEPOINT × (WEIGHT_FACTOR ^ (N - 1 - POSITION))

Before the calculation, you need to canonicalize the string so that visually and lexically equivalent character variants have the same weight (with additional special-case handling for some alphabets).

With the right choice of parameters, the weights of the higher positions exponentially dominate the lower ones. You just need the final index to comfortably fit into a 64-bit BIGINT for database storage.

At first glance, the solution seemed acceptable. But it turned out that this index is too informative: the large difference in character weights and the limited alphabet allow a dictionary attack (you can recover a substring by sequentially subtracting precomputed weights from the index). We need to “smear” the result to increase the number of collisions while preserving deterministic sort order. This is solved by quantization: it is enough to divide the index by a fixed divisor and discard the remainder:

FINAL_WEIGHT = WEIGHT_SUM // (2 ^ COARSE_SHIFT)

This preserves monotonicity but turns the set of exact values into a finite number of buckets. Substrings that fall into the same bucket will be considered equal (the database will resolve their relative order on its own).

After a bit of experimentation, I settled on N = 6, WEIGHT_FACTOR = 40, COARSE_SHIFT = 17. Essentially, this is an engineering compromise that provides deterministic sorting (heavily biased toward the initial characters) regardless of the internal representation (UTF-16/UTF-32). The result is resistant to clumping on the first few characters but does allow attacks using a constrained dictionary. In practice, with the parameters above, the first few characters can often be inferred with high probability, while the remaining positions are obscured by quantization. At the same time, sorting error beyond the second character inevitably increases. The code:

from typing import Union
import unicodedata

INDEX_LENGTH = 6
WEIGHT_FACTOR = 40
COARSE_SHIFT = 17

# Precomputed position weights
_WEIGHTS = tuple(WEIGHT_FACTOR ** i for i in range(
    INDEX_LENGTH - 1, -1, -1))

# Precomputed quantization divisor
QUANT_DIVISOR = 2 ** COARSE_SHIFT

# Unicode normalization form used during canonicalization
NORMALIZATION_FORM = "NFC"


def _canonicalize(value: str) -> str:
    """Case-insensitive canonicalization of a string."""
    return unicodedata.normalize(NORMALIZATION_FORM, value.casefold())


def get_index(value: Union[str, int]) -> int:
    """Compute a quantized sorting index for ORDER BY clauses."""
    if isinstance(value, int):
        value = str(value)

    s = _canonicalize(value)

    idx = 0
    for pos, ch in enumerate(s[:INDEX_LENGTH]):
        idx += ord(ch) * _WEIGHTS[pos]

    return idx // QUANT_DIVISOR

Thus, the original data in the database remain hidden, but they can be sorted by the first six characters. This is not intended to provide cryptographic security of the sort key itself. The goal is to keep the original values encrypted while exposing only enough information to support approximate alphabetical ordering.

Example

Using the parameters described above (N = 6, WEIGHT_FACTOR = 40, COARSE_SHIFT = 17), the following strings produce sortable indices:

Value	Index
Alan	77939
Albert	77939
Alice	77943
Brian	78841
Brittany	78841
Charles	79423
Daniel	80074

Notice that Alan and Albert fall into the same bucket after quantization. The same happens with Brian and Brittany. From the database's perspective, each pair has the same sort index.

At the same time, the index is influenced by more than just the first character. For example, Alice receives a different index than Alan and Albert, even though all three names begin with A.

This is intentional. Quantization reduces precision and introduces collisions, making exact reconstruction of the indexed prefix more difficult. At the same time, useful ordering information is preserved: names with similar prefixes tend to receive nearby indices, while lexicographically later names generally receive larger indices.

The result is not perfectly alphabetical ordering. Instead, it is a deterministic approximation that preserves useful ordering information while intentionally sacrificing precision.

Limitations

This approach has several important limitations:

It is not a cryptographic sorting scheme.
Prefix information leaks by design.
Sorting accuracy decreases after the first few characters.
Large numbers of similar prefixes may collapse into the same bucket.
The approach is unsuitable when order leakage is unacceptable.
The index should be treated as auxiliary metadata rather than encrypted data.

For my use case, these trade-offs were acceptable. The goal was not perfect secrecy of the sort key, but rather practical ordering of encrypted data without introducing complex cryptographic infrastructure.

DEV Community

Sorting Encrypted Strings with a Leaked-Order Index

The problem

The solution

Example

Limitations

Top comments (0)