Hello, I'm Maneshwar. I'm currently building FreeDevTools online, *one place for all dev tools, cheat codes, and TLDRs*: a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.
At a high level:
- SimHash is a hashing/fingerprinting algorithm developed by Moses Charikar (originally 2002) for near-duplicate detection.
 - Unlike a cryptographic hash (where you want completely different outputs for slightly different inputs), SimHash is built so that similar inputs produce similar hashes (i.e., small Hamming distance between their fingerprints).
 - It is often used in large-scale document/web-page deduplication, spam detection, content clustering, etc.
 - In practice, the fingerprint size often used is 64 bits — which gives a good balance of collision resistance and manageable size.
 
So the big idea: given a document (or text, or other large data), compute a 64-bit fingerprint.
When you compare two documents, compute the Hamming distance of their fingerprints (i.e., number of differing bits).
A small distance means the documents are likely very similar; a large distance means not so much.
Hamming distance – what is it?
- The Hamming distance between two equal-length bit-strings is simply the number of bit positions in which they differ.
 - Example:
 
  1011001  
  1001101  
    ^ ^    ← differences  
  Hamming distance = 2
- In binary terms: if you have two 64-bit integers `a` and `b`, you compute `d = a XOR b`, then count the number of ‘1’ bits in `d`. That count = Hamming distance.
- Why useful: because fingerprinting reduces large documents into a small fixed length; then you can compare with O(1) cost per pair (just XOR + popcount) instead of comparing full documents. A minimal sketch follows below.
 
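Here is that comparison in Python (a minimal sketch: the name `hamming_distance` is illustrative, not from any library, and `int.bit_count` needs Python 3.10+):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two 64-bit fingerprints differ."""
    return (a ^ b).bit_count()  # XOR the fingerprints, then popcount

# The two fingerprints from the worked example later in this post:
f1 = 0x8C3A5F7E9ECB3F35
f2 = 0x8873BD1EAEEBA7AF
print(hamming_distance(f1, f2))  # -> 20
```
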
Why 64 bits
- A 64-bit fingerprint gives 2⁶⁴ possible values. Very large, reducing the chance of accidental collisions. For many practical purposes this is enough.
 - Also computationally efficient: 64 bits is a native word size on many machines, easy to manipulate.
 - Many implementations of SimHash use 64 bits (or sometimes 32, but 64 is preferred).
 
SimHash + Hamming => “similar inputs → similar output”
- Because of how SimHash is constructed (we’ll detail this in the next section), similar documents will differ only in a few feature-bits, so their fingerprints differ only in a few bits → small Hamming distance. For truly random/unrelated inputs, you expect about half the bits to differ (so Hamming distance ~32 for a 64-bit fingerprint).
 - That property makes SimHash part of the family of locality-sensitive hashing (LSH) techniques: you can efficiently search for “other items within Hamming distance ≤ k” across many fingerprints (a brute-force sketch follows below).
 
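As a toy illustration of that query (brute force only; web-scale systems use smarter indexes, e.g. the permuted-table scheme of Manku et al., 2007, rather than scanning every fingerprint), a hypothetical helper might look like this:

```python
def near_duplicates(query: int, fingerprints: list[int], k: int) -> list[int]:
    """Return every stored fingerprint within Hamming distance k of `query`."""
    return [f for f in fingerprints if (query ^ f).bit_count() <= k]

# Example: the first two entries differ from the query by 0 and 1 bits.
store = [0xDEADBEEFCAFEF00D, 0xDEADBEEFCAFEF00F, 0x0123456789ABCDEF]
print(near_duplicates(0xDEADBEEFCAFEF00D, store, k=3))
```
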
How SimHash Works – step by step in detail
Let’s walk through the algorithm more deeply.
1. Feature extraction
- Given a document (text, web page, etc.), you extract a set of features. These might be words, word n-grams, shingles, or other units relevant to your domain.
 - Each feature is assigned a weight (for example word frequency, TF-IDF weight, or importance). The idea: more important features should contribute more to the fingerprint. (Two simple extractors are sketched below.)
 
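A minimal sketch of feature extraction (the function names and the plain term-frequency weighting are assumptions for illustration, not a prescribed scheme):

```python
from collections import Counter

def word_features(text: str) -> dict[str, int]:
    """Weight each lowercase word token by its frequency."""
    return dict(Counter(text.lower().split()))

def shingle_features(text: str, k: int = 3) -> dict[str, int]:
    """Alternative: overlapping word k-grams ('shingles'), frequency-weighted."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return dict(Counter(grams))

print(word_features("the cat sat on the mat"))
# {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```
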
2. Hash each feature
- For each feature, compute a standard hash (say, a 64-bit hash) of that feature (e.g., via FNV, Murmur, or another hash).
 - That gives, for each feature, a 64-bit pattern of bits (for a 64-bit fingerprint).
 
3. Accumulate signed vector
- Initialize a vector V of 64 elements, all set to 0 (each element holds a signed real number).
 - For each feature:
- Look at its 64-bit hash. For each bit position i (0 to 63):
 - If the bit is 1 → add the feature’s weight to V[i]
 - If the bit is 0 → subtract the feature’s weight from V[i]
 
 - After processing all features, each element V[i] is the sum of weights (+ and -) for that bit position.
 
4. Produce fingerprint
- Now convert the vector V into a final 64-bit fingerprint F:
- For each position i: if V[i] ≥ 0 → set bit i of F to 1; else (V[i] < 0) → set bit i to 0.
 
 - So F is a 64-bit value. Two documents that share many features and weights will tend to have many positions where the sign of V[i] is the same → their fingerprints will share many bits. (A compact sketch of steps 1–4 follows below.)
 
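Putting steps 1–4 together, here is a compact sketch (using MD5 truncated to 64 bits as the per-feature hash is my assumption, chosen for portability; FNV or Murmur would work just as well):

```python
import hashlib

def feature_hash64(feature: str) -> int:
    """Step 2: a stable 64-bit hash of one feature."""
    return int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")

def simhash(weighted_features: dict[str, float]) -> int:
    v = [0.0] * 64                      # Step 3: signed accumulator per bit position
    for feature, weight in weighted_features.items():
        h = feature_hash64(feature)
        for i in range(64):
            if (h >> i) & 1:
                v[i] += weight          # bit i is 1 -> add the weight
            else:
                v[i] -= weight          # bit i is 0 -> subtract the weight
    fingerprint = 0
    for i in range(64):                 # Step 4: sign of V[i] -> bit i of F
        if v[i] >= 0:
            fingerprint |= 1 << i
    return fingerprint

# Near-identical feature sets land close in Hamming space:
a = simhash({"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1})
b = simhash({"the": 2, "cat": 1, "sat": 1, "on": 1, "rug": 1})
print((a ^ b).bit_count())  # usually a small distance, well below ~32
```
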
5. Compare fingerprints
- Given two fingerprints F1 and F2 (64-bit each): compute `d = HammingDistance(F1, F2)` (i.e., the count of differing bits) by taking `xor = F1 ^ F2`, then the popcount of `xor`.
 - The lower the `d`, the more similar the original documents likely are. (hexdocs.pm)
Why you need a distribution
- Because “distance = 9” doesn’t mean much by itself unless you know what typical distances are in your domain.
 - By analysing many pairs, you can see which distances correlate with “documents are the same template”, “documents are unrelated”, etc. (a helper for building this histogram is sketched below).
 - Then you can say e.g., “If distance ≤ 10 → treat as ‘very similar’, if distance ≤ 20 → ‘some similarity’, if > 30 → ‘probably unrelated’”.
 
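A minimal helper for building that histogram from a sample of fingerprints (an illustrative sketch; O(n²) pairs is fine for calibrating thresholds on a sample, too slow for a full corpus):

```python
from collections import Counter
from itertools import combinations

def distance_distribution(fingerprints: list[int]) -> Counter:
    """Histogram of pairwise Hamming distances across a corpus sample."""
    return Counter((a ^ b).bit_count() for a, b in combinations(fingerprints, 2))

# hist[d] = number of pairs at Hamming distance d
hist = distance_distribution([0x1234, 0x1236, 0xFFFFFFFFFFFFFFFF])
print(sorted(hist.items()))  # [(1, 1), (58, 1), (59, 1)]
```
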
Example threshold guide (not absolute)
| Distance (out of 64) | Approx % similarity | Interpretation | 
|---|---|---|
| 0 | 100% | Identical docs | 
| 1–3 | 95%+ | Minor edits (e.g. punctuation) | 
| 4–10 | ~84–94% | Same template, small content changes | 
| 11–25 | ~61–83% | Some similarity (layout or partially same text) | 
| 26–40 | ~38–59% | Likely unrelated but maybe some shared boilerplate | 
| 41–64 | ≤ ~36% | Unrelated documents (essentially random) | 
(Exact percentages follow from computing “similarity” = (64 – distance)/64.)
Why you rarely see distance = 64
- Because for two random 64-bit values, the expected distance is ~32.
 - The probability of all 64 bits differing (i.e., distance = 64) is 1/2⁶⁴, which is vanishingly small.
 - So even for completely unrelated docs, you’ll typically see ~30 bits differ, not 64 (the quick simulation below confirms this).
 
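A quick simulation backs this up (a sketch; the seed is arbitrary):

```python
import random

random.seed(42)  # reproducible run
distances = [
    (random.getrandbits(64) ^ random.getrandbits(64)).bit_count()
    for _ in range(10_000)
]
print(sum(distances) / len(distances))  # ~32, as expected for unrelated values
print(max(distances))                   # well below 64 even across 10,000 pairs
```
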
Worked Example (with hex & bits)
Let’s do a detailed worked walkthrough.
Say we have document A and document B. We compute SimHash fingerprints:
- A → fingerprint F₁ = `0x8c3a5f7e9ecb3f35` (hex)
- B → fingerprint F₂ = `0x8873bd1eaeeba7af` (hex)
Convert to binary (grouped into 4-bit nibbles):
F₁ = 1000 1100 0011 1010 0101 1111 0111 1110 1001 1110 1100 1011 0011 1111 0011 0101
F₂ = 1000 1000 0111 0011 1011 1101 0001 1110 1010 1110 1110 1011 1010 0111 1010 1111
(All 64 bits of each fingerprint are shown.)
Compute XOR = F₁ ^ F₂, then count the number of 1 bits in the XOR: 20 bits differ.
Hamming distance = 20.
Similarity (by simple measure) = (64 – 20)/64 = 44/64 = 0.6875 → 68.75%.
Interpretation: A and B share ~69% of fingerprint bits. Per the threshold table above, a distance of 20 falls in the “some similarity” band (11–25): perhaps shared layout or partially overlapping text, but far from near-duplicates.
If the distance had been 9, similarity ≈ (64 – 9)/64 = 55/64 ≈ 85.94% → much more similar in fingerprint space. You can verify this arithmetic in a couple of lines, as shown below.
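To double-check the arithmetic (plain Python, no libraries):

```python
f1 = 0x8C3A5F7E9ECB3F35
f2 = 0x8873BD1EAEEBA7AF

d = (f1 ^ f2).bit_count()  # use bin(f1 ^ f2).count("1") on Python < 3.10
print(d)              # 20
print((64 - d) / 64)  # 0.6875 -> 68.75% bit similarity
```
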
Similarity in practice
If you are building a checklist, static-site, icon-library, or search/filter/edit service, you will likely have lots of content pages (SVGs, icon metadata, etc.). Using SimHash (or similar) you can:
- Detect duplicates or near-duplicates of icon pages, descriptions, etc.
 - Filter out redundant content (e.g., same icon in multiple libraries) or merge similar items.
 - Provide “similar icon” recommendations: you can fingerprint page content and find pages with small Hamming distance.
 - Monitor changes over time: if a page is updated slightly, you’ll see a small change in the fingerprint (small distance from the previous one); a big change in the fingerprint signals a major update.
 
Given this, you’ll want to build a distribution of Hamming distances across your corpus so you know what “distance = 10” means for your data, and pick thresholds accordingly (e.g., “distance ≤ 8 → nearly identical”, “distance 9–20 → similar icons/layout”, etc.).
I’ve been building FreeDevTools: a collection of UI/UX-focused tools crafted to simplify workflows, save time, and reduce friction when searching for tools and materials.
Any feedback or contributors are welcome!
It’s online, open-source, and ready for anyone to use.
👉 Check it out: FreeDevTools
⭐ Star it on GitHub: freedevtools