By: Rafael Calderon Robles | LinkedIn
Official Resources: NPM Package · Documentation · Live Demo · GitHub
In software engineering and security, we typically operate under the paradigm of exactness. Cryptographic hash functions like SHA-256 are designed for fragility: the "avalanche effect" ensures that a single bit flip in the input results in a statistically uncorrelated output. While this is ideal for integrity verification, it renders standard cryptography useless for similarity detection.
Real-world data is noisy. File formats mutate, code is refactored, and data is padded. To address this, we developed LavinHash, a high-performance fuzzy hashing library implemented in Rust. Its objective is not binary identity, but the quantification of semantic distance between digital objects.
This article dissects the internal architecture of the algorithm, analyzing how LavinHash transforms arbitrary byte streams into compact, comparable signatures in O(n) time and O(1) space.
1. The DLAH Concept: Orthogonality in Analysis
Most existing fuzzy hashing algorithms suffer from a specific blindness: they either focus exclusively on textual content (making them sensitive to reordering) or global statistics (making them insensitive to local detail).
LavinHash implements an architecture we designate as Dual-Layer Adaptive Hashing (DLAH). The core premise is to decouple the analysis into two orthogonal dimensions processed in parallel:
- Structural Layer: Analyzes the topology and information density of the file.
- Content Layer: Extracts position-invariant features based on byte patterns.
This separation allows the system to detect similarity even under adversarial conditions, such as heavy obfuscation (changing content but preserving structure) or restructuring (moving data blocks while preserving content).
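As a mental model, the signature produced by such a pipeline can be pictured as a pair of independent components. The sketch below is purely illustrative; the field names and sizes are assumptions drawn from the sections that follow, not LavinHash's actual types.

```rust
/// Illustrative shape of a DLAH-style signature. Names and sizes are
/// assumptions based on the sections below, not LavinHash's real API.
struct FuzzySignature {
    /// Structural layer: quantized entropy profile, one nibble per 256-byte block.
    entropy_profile: Vec<u8>,
    /// Content layer: 8,192-bit Bloom filter of content-triggered features.
    content_filter: [u8; 1024],
}
```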
2. Ingestion Phase: Normalization and Lazy Evaluation
The first engineering challenge is preprocessing. To ensure robustness against trivial noise (such as casing differences or variable whitespace in text files), the search space must be reduced without incurring a memory penalty.
Processing a 10 GB file by allocating a "clean" copy in RAM is inefficient. LavinHash solves this by leveraging Rust Iterators to implement Lazy Evaluation.
- Zero-Copy Abstraction: The data flows byte-by-byte through a transformation pipeline that normalizes the stream "on the fly," immediately before it is consumed by the hashing functions.
- Result: The spatial complexity of the preprocessing phase remains constant O(1), regardless of input size.
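As a rough sketch of what such a lazy pipeline can look like with standard Rust iterators (the specific normalization rules here, dropping ASCII whitespace and lowercasing, are illustrative assumptions, not LavinHash's exact rules):

```rust
use std::fs::File;
use std::io::{BufReader, Read};

/// Illustrative lazy normalization: lowercase ASCII and skip whitespace,
/// byte by byte, without ever materializing a cleaned copy of the input.
fn normalized_stream(path: &str) -> std::io::Result<impl Iterator<Item = u8>> {
    let reader = BufReader::new(File::open(path)?);
    Ok(reader
        .bytes()
        .filter_map(Result::ok)               // drop I/O errors for brevity
        .filter(|b| !b.is_ascii_whitespace()) // normalization rule (assumed)
        .map(|b| b.to_ascii_lowercase()))     // normalization rule (assumed)
}
```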
3. Structural Layer: Entropy Topology
In this layer, the algorithm ignores what the bytes say and focuses on how they are distributed. We analyze information density to generate a topological footprint of the file.
3.1 Shannon Entropy Profiling
The stream is segmented into fixed-size blocks (Ws = 256 bytes). For each block, we calculate Shannon Entropy, a mathematical measure of data unpredictability:
H(X) = -Σ p(xᵢ) × log₂(p(xᵢ))
- Low Entropy (H ≈ 0): Padding, zero-filled regions.
- High Entropy (H ≈ 8): Encrypted data, compressed streams, or dense machine code.
To optimize this computationally intensive operation, we utilize SIMD intrinsics (AVX2), allowing the CPU to compute multiple byte histograms and logarithms in parallel.
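A scalar reference version of the per-block computation looks roughly like this; the AVX2/SIMD path described above is omitted, and the 256-byte block size follows Ws from the text:

```rust
/// Scalar sketch of per-block Shannon entropy, H(X) = -Σ p(x) · log2 p(x).
/// Returns one entropy value per `block_size`-byte chunk of the input.
fn entropy_profile(data: &[u8], block_size: usize) -> Vec<f64> {
    data.chunks(block_size)
        .map(|block| {
            // Byte-frequency histogram for this block.
            let mut hist = [0u32; 256];
            for &b in block {
                hist[b as usize] += 1;
            }
            let len = block.len() as f64;
            hist.iter()
                .filter(|&&c| c > 0)
                .map(|&c| {
                    let p = c as f64 / len;
                    -p * p.log2()
                })
                .sum()
        })
        .collect()
}
```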
3.2 Vector Quantization
The resulting entropy value is a floating-point number. To optimize storage, we apply Linear Quantization. We map the continuous range [0.0, 8.0] to a discrete 4-bit space (a nibble, integers 0-15).
Q = ⌊H × 1.875⌋
This effectively performs aggressive dimensionality reduction, compressing the file's structure into a dense vector where every byte (two nibbles, each summarizing a 256-byte block) represents the topology of 512 bytes of original data.
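A minimal sketch of the quantization and nibble packing; the packing order (high nibble first) is an assumption for illustration:

```rust
/// Quantize an entropy value from [0.0, 8.0] into a 4-bit bucket (0–15).
/// The 1.875 factor maps 8.0 exactly onto 15.
fn quantize(h: f64) -> u8 {
    (h * 1.875).floor().min(15.0) as u8
}

/// Pack two 4-bit buckets per output byte (high nibble first, an assumed order).
fn pack_profile(entropies: &[f64]) -> Vec<u8> {
    entropies
        .chunks(2)
        .map(|pair| {
            let hi = quantize(pair[0]);
            let lo = pair.get(1).map(|&h| quantize(h)).unwrap_or(0);
            (hi << 4) | lo
        })
        .collect()
}
```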
4. Content Layer: Alignment and Adaptive Sampling
The fundamental failure mode of traditional hashing in similarity contexts is shift sensitivity. Inserting a single byte at the beginning of a file misaligns all subsequent fixed blocks, rendering the hashes disjoint. To solve this, LavinHash employs Context-Triggered Piecewise Hashing (CTPH).
4.1 Rolling Hash (BuzHash)
Instead of static blocking, we employ a sliding window of 64 bytes governed by a Rolling Hash. We utilize a cyclic polynomial algorithm (BuzHash). As the window advances by one step, the hash is updated by mathematically "subtracting" the outgoing byte and "adding" the incoming byte. This allows for O(1) update time per byte.
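A compact sketch of the cyclic-polynomial update follows. The substitution table (256 fixed pseudo-random u32 values) is passed in and its contents are omitted; note that with a 64-byte window and 32-bit state, the outgoing byte's rotation amount reduces to 64 mod 32 = 0.

```rust
const WINDOW: usize = 64;

/// BuzHash-style rolling state over a 64-byte window.
struct RollingHash {
    hash: u32,
    window: [u8; WINDOW],
    pos: usize,
}

impl RollingHash {
    fn new() -> Self {
        RollingHash { hash: 0, window: [0; WINDOW], pos: 0 }
    }

    /// O(1) update: rotate the state one bit, cancel the contribution of the
    /// byte leaving the window (rotated WINDOW mod 32 = 0 times here), and
    /// mix in the incoming byte.
    fn roll(&mut self, table: &[u32; 256], incoming: u8) -> u32 {
        let outgoing = self.window[self.pos];
        self.window[self.pos] = incoming;
        self.pos = (self.pos + 1) % WINDOW;
        self.hash = self.hash.rotate_left(1)
            ^ table[outgoing as usize].rotate_left((WINDOW % 32) as u32)
            ^ table[incoming as usize];
        self.hash
    }
}
```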
4.2 Deterministic Feature Selection
Storing the hash of every window is infeasible. We rely on a deterministic trigger condition:
H_rolling mod M ≡ 0
When the window's hash satisfies this modulo congruence, the current window is "captured" as a feature. This synchronizes the algorithm with the content: if a text block moves position, its internal hash remains constant, and the trigger fires at the exact same relative point, recovering alignment.
- Adaptive Modulus (M): A critical innovation is the dynamic calculation of M based on file size. This ensures a uniform sampling density target (~1200 features), preventing oversampling in large files (performance degradation) or undersampling in small files (statistical insignificance).
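A sketch of how the trigger and the adaptive modulus might fit together. The ~1200 feature target comes from the text; rounding M to a power of two (so the modulo becomes a mask) and the lower bound of 16 are assumptions of this sketch, not confirmed details of LavinHash.

```rust
/// Choose M so that, with roughly one trigger hit every M bytes, a file of
/// `file_size` bytes yields about `target_features` features.
fn adaptive_modulus(file_size: u64, target_features: u64) -> u32 {
    let m = (file_size / target_features.max(1)).max(16); // lower bound: assumed
    m.next_power_of_two().min(u32::MAX as u64) as u32     // power of two: assumed
}

/// Trigger condition from the text: capture the current window as a feature
/// when the rolling hash is congruent to 0 modulo M.
fn is_trigger(rolling_hash: u32, modulus: u32) -> bool {
    rolling_hash % modulus == 0
}
```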
5. Probabilistic Storage: The Bloom Filter
Upon completion of the sampling phase, the system holds a collection of representative feature strings. Storing these raw strings is spatially inefficient. LavinHash projects these features into a Bloom Filter, a space-efficient probabilistic data structure.
Specification:
- Size (m): 8,192 bits (1024 bytes).
- Hash Functions (k): 3 (implemented via MurmurHash3 with distinct seeds).
Insertion Process:
For every captured feature:
- The feature is hashed three times.
- The resulting values determine three distinct bit indices in the array.
- These bits are set to 1.
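A sketch of the insertion step: the library reportedly uses MurmurHash3 with distinct seeds, but to keep this example dependency-free it substitutes std's DefaultHasher, seeded by hashing a counter first; the mechanics are the same.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const FILTER_BITS: usize = 8192; // m = 8,192 bits (1 KB)
const NUM_HASHES: u64 = 3;       // k = 3

/// Set three bits in the filter for one captured feature.
/// DefaultHasher stands in for MurmurHash3 here (sketch-only substitution).
fn insert_feature(filter: &mut [u8; FILTER_BITS / 8], feature: &[u8]) {
    for seed in 0..NUM_HASHES {
        let mut hasher = DefaultHasher::new();
        seed.hash(&mut hasher);    // distinct "seed" per hash function
        feature.hash(&mut hasher);
        let bit = (hasher.finish() as usize) % FILTER_BITS;
        filter[bit / 8] |= 1 << (bit % 8); // set the selected bit to 1
    }
}
```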
Engineering Justification:
The Bloom Filter allows us to test for set membership with extreme spatial efficiency. We accept a controlled theoretical false positive rate (~1.9%) in exchange for reducing megabytes of potential feature data into a fixed 1 KB block that fits entirely within the L1 cache of modern processors.
6. Similarity Metrics
The output of LavinHash is not a binary match but a similarity score Δ derived from two distinct metric distances.
6.1 Normalized Levenshtein Distance (Structure)
Applied to the quantized entropy vectors. This measures the minimum number of edit operations (insertions, deletions, substitutions) required to transform the topology of file A into file B. This metric is robust against the injection of malicious code blocks or section reordering.
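A reference implementation of the normalized edit distance over two quantized profiles, using the standard two-row dynamic program; a score of 0.0 means identical topologies, 1.0 means maximally different.

```rust
/// Levenshtein distance between two quantized entropy vectors,
/// normalized to [0.0, 1.0] by the longer length.
fn normalized_levenshtein(a: &[u8], b: &[u8]) -> f64 {
    let (n, m) = (a.len(), b.len());
    if n.max(m) == 0 {
        return 0.0; // two empty profiles are identical
    }
    let mut prev: Vec<usize> = (0..=m).collect(); // row 0: distance to empty prefix
    let mut curr = vec![0usize; m + 1];
    for i in 1..=n {
        curr[0] = i;
        for j in 1..=m {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            curr[j] = (prev[j] + 1)          // deletion
                .min(curr[j - 1] + 1)        // insertion
                .min(prev[j - 1] + cost);    // substitution
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[m] as f64 / n.max(m) as f64
}
```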
6.2 Jaccard Index (Content)
Applied to the Bloom Filters via bitwise operations.
J(A,B) = |A ∩ B| / |A ∪ B|
= popcount(BitsA ∧ BitsB) / popcount(BitsA ∨ BitsB)
We calculate the intersection of active bits over the union. This provides a pure measure of shared content, completely independent of the order in which data appears in the file.
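Since both filters are fixed-size bit arrays, the whole comparison reduces to a loop of ANDs, ORs, and population counts, roughly as follows (returning 0.0 for two empty filters is a convention of this sketch):

```rust
/// Jaccard index over two 1 KB Bloom filters via bitwise AND/OR and popcount,
/// as in the formula above.
fn jaccard(a: &[u8; 1024], b: &[u8; 1024]) -> f64 {
    let mut inter_bits = 0u32;
    let mut union_bits = 0u32;
    for (&x, &y) in a.iter().zip(b.iter()) {
        inter_bits += (x & y).count_ones();
        union_bits += (x | y).count_ones();
    }
    if union_bits == 0 {
        0.0 // both filters empty: treat as no shared content (sketch convention)
    } else {
        inter_bits as f64 / union_bits as f64
    }
}
```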
Conclusion
LavinHash demonstrates how the synthesis of Information Theory (Entropy), Streaming Algorithms (Rolling Hashes), and Probabilistic Data Structures (Bloom Filters) can solve problems that standard cryptography cannot address.
By decoupling structure from content and optimizing each layer for both throughput and memory footprint, LavinHash provides a robust mechanism for identifying the semantic "essence" of data amidst the noise of the real world.