Hello, I'm Maneshwar. I'm currently building FreeDevTools online, *one place for all dev tools, cheat codes, and TLDRs*: a free, open-source hub where developers can quickly find and use tools without the hassle of searching all over the internet.
At a high level:
- SimHash is a hashing/fingerprinting algorithm developed by Moses Charikar (originally 2002) for near-duplicate detection.
 - Unlike a cryptographic hash (where you want completely different outputs for slightly different inputs), SimHash is built so that similar inputs produce similar hashes (i.e., small Hamming distance between their fingerprints).
 - It is often used in large-scale document/web-page deduplication, spam detection, content clustering, etc.
 - In practice, the fingerprint size often used is 64 bits — which gives a good balance of collision resistance and manageable size.
 
So the big idea: given a document (or text, or other large data), compute a 64-bit fingerprint.
When you compare two documents, compute the Hamming distance of their fingerprints (i.e., number of differing bits).
A small distance means the documents are likely very similar; a large distance means not so much.
Hamming distance – what is it?
- The Hamming distance between two equal-length bit-strings is simply the number of bit positions in which they differ.
 - Example:
 
  1011001  
  1001101  
    ^ ^    ← differences  
  Hamming distance = 2
- In binary terms: if you have two 64-bit integers `a` and `b`, you compute `d = a XOR b`, then count the number of ‘1’ bits in `d`. That count = Hamming distance.
- Why useful: because fingerprinting reduces large documents into a small fixed length; then you can compare with O(1) cost per pair (just XOR + popcount) instead of comparing full documents. A minimal sketch follows below.
 
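Here is that comparison in Python (a minimal sketch: the name `hamming_distance` is illustrative, not from any library, and `int.bit_count` needs Python 3.10+):

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two 64-bit fingerprints differ."""
    return (a ^ b).bit_count()  # XOR the fingerprints, then popcount

# The two fingerprints from the worked example later in this post:
f1 = 0x8C3A5F7E9ECB3F35
f2 = 0x8873BD1EAEEBA7AF
print(hamming_distance(f1, f2))  # -> 20
```
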
Why 64 bits
- A 64-bit fingerprint gives 2⁶⁴ possible values. Very large, reducing the chance of accidental collisions. For many practical purposes this is enough.
 - Also computationally efficient: 64 bits is a native word size on many machines, easy to manipulate.
 - Many implementations of SimHash use 64 bits (or sometimes 32, but 64 is preferred).
 
SimHash + Hamming => “similar inputs → similar output”
- Because of how SimHash is constructed (we’ll detail this in the next section), similar documents will differ only in a few feature-bits, so their fingerprints differ only in a few bits → small Hamming distance. For truly random/unrelated inputs, you expect about half the bits to differ (so Hamming distance ~32 for a 64-bit fingerprint).
 - That property makes SimHash part of the family of locality-sensitive hashing (LSH) techniques: you can efficiently search for “other items within Hamming distance ≤ k” across many fingerprints (a brute-force sketch follows below).
 
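As a toy illustration of that query (brute force only; web-scale systems use smarter indexes, e.g. the permuted-table scheme of Manku et al., 2007, rather than scanning every fingerprint), a hypothetical helper might look like this:

```python
def near_duplicates(query: int, fingerprints: list[int], k: int) -> list[int]:
    """Return every stored fingerprint within Hamming distance k of `query`."""
    return [f for f in fingerprints if (query ^ f).bit_count() <= k]

# Example: the first two entries differ from the query by 0 and 1 bits.
store = [0xDEADBEEFCAFEF00D, 0xDEADBEEFCAFEF00F, 0x0123456789ABCDEF]
print(near_duplicates(0xDEADBEEFCAFEF00D, store, k=3))
```
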
How SimHash Works – step by step in detail
Let’s walk through the algorithm more deeply.
1. Feature extraction
- Given a document (text, web page, etc.), you extract a set of features. These might be words, word n-grams, shingles, or other units relevant to your domain.
 - Each feature is assigned a weight (for example word frequency, TF-IDF weight, or importance). The idea: more important features should contribute more to the fingerprint. (Two simple extractors are sketched below.)
 
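A minimal sketch of feature extraction (the function names and the plain term-frequency weighting are assumptions for illustration, not a prescribed scheme):

```python
from collections import Counter

def word_features(text: str) -> dict[str, int]:
    """Weight each lowercase word token by its frequency."""
    return dict(Counter(text.lower().split()))

def shingle_features(text: str, k: int = 3) -> dict[str, int]:
    """Alternative: overlapping word k-grams ('shingles'), frequency-weighted."""
    words = text.lower().split()
    grams = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return dict(Counter(grams))

print(word_features("the cat sat on the mat"))
# {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```
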
2. Hash each feature
- For each feature, compute a standard hash (say, a 64-bit hash) of that feature (e.g., via FNV, Murmur, or another hash).
 - That gives, for each feature, a 64-bit pattern of bits (for a 64-bit fingerprint).
 
3. Accumulate signed vector
- Initialize a vector V of 64 elements, all set to 0 (each element holds a signed real number).
 - For each feature:
- Look at its 64-bit hash. For each bit position i (0 to 63):
 - If the bit is 1 → add the feature’s weight to V[i]
 - If the bit is 0 → subtract the feature’s weight from V[i]
 
 - After processing all features, each element V[i] is the sum of weights (+ and -) for that bit position.
 
4. Produce fingerprint
- Now convert the vector V into a final 64-bit fingerprint F:
- For each position i: if V[i] ≥ 0 → set bit i of F to 1; else (V[i] < 0) → set bit i to 0.
 
 - So F is a 64-bit value. Two documents that share many features and weights will tend to have many positions where the sign of V[i] is the same → their fingerprints will share many bits. (A compact sketch of steps 1–4 follows below.)
 
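Putting steps 1–4 together, here is a compact sketch (using MD5 truncated to 64 bits as the per-feature hash is my assumption, chosen for portability; FNV or Murmur would work just as well):

```python
import hashlib

def feature_hash64(feature: str) -> int:
    """Step 2: a stable 64-bit hash of one feature."""
    return int.from_bytes(hashlib.md5(feature.encode("utf-8")).digest()[:8], "big")

def simhash(weighted_features: dict[str, float]) -> int:
    v = [0.0] * 64                      # Step 3: signed accumulator per bit position
    for feature, weight in weighted_features.items():
        h = feature_hash64(feature)
        for i in range(64):
            if (h >> i) & 1:
                v[i] += weight          # bit i is 1 -> add the weight
            else:
                v[i] -= weight          # bit i is 0 -> subtract the weight
    fingerprint = 0
    for i in range(64):                 # Step 4: sign of V[i] -> bit i of F
        if v[i] >= 0:
            fingerprint |= 1 << i
    return fingerprint

# Near-identical feature sets land close in Hamming space:
a = simhash({"the": 2, "cat": 1, "sat": 1, "on": 1, "mat": 1})
b = simhash({"the": 2, "cat": 1, "sat": 1, "on": 1, "rug": 1})
print((a ^ b).bit_count())  # usually a small distance, well below ~32
```
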
5. Compare fingerprints
- Given two fingerprints F1 and F2 (64-bit each): compute `d = HammingDistance(F1, F2)` (i.e., the count of differing bits) by taking `xor = F1 ^ F2`, then the popcount of `xor`.
 - The lower the `d`, the more similar the original documents likely are. (hexdocs.pm)
Why you need a distribution
- Because “distance = 9” doesn’t mean much by itself unless you know what typical distances are in your domain.
 - By analysing many pairs, you can see which distances correlate with “documents are the same template”, “documents are unrelated”, etc. (a helper for building this histogram is sketched below).
 - Then you can say e.g., “If distance ≤ 10 → treat as ‘very similar’, if distance ≤ 20 → ‘some similarity’, if > 30 → ‘probably unrelated’”.
 
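A minimal helper for building that histogram from a sample of fingerprints (an illustrative sketch; O(n²) pairs is fine for calibrating thresholds on a sample, too slow for a full corpus):

```python
from collections import Counter
from itertools import combinations

def distance_distribution(fingerprints: list[int]) -> Counter:
    """Histogram of pairwise Hamming distances across a corpus sample."""
    return Counter((a ^ b).bit_count() for a, b in combinations(fingerprints, 2))

# hist[d] = number of pairs at Hamming distance d
hist = distance_distribution([0x1234, 0x1236, 0xFFFFFFFFFFFFFFFF])
print(sorted(hist.items()))  # [(1, 1), (58, 1), (59, 1)]
```
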
Example threshold guide (not absolute)
| Distance (out of 64) | Approx % similarity | Interpretation | 
|---|---|---|
| 0 | 100% | Identical docs | 
| 1–3 | 95%+ | Minor edits (e.g. punctuation) | 
| 4–10 | ~84–94% | Same template, small content changes | 
| 11–25 | ~61–83% | Some similarity (layout or partially same text) | 
| 26–40 | ~38–59% | Likely unrelated but maybe some shared boilerplate | 
| 41–64 | ≤ ~36% | Unrelated documents (essentially random) | 
(Exact percentages follow from computing “similarity” = (64 – distance)/64.)
Why you rarely see distance = 64
- Because for two random 64-bit values, the expected distance is ~32.
 - The probability of all 64 bits differing (i.e., distance = 64) is 1/2⁶⁴, which is vanishingly small.
 - So even for completely unrelated docs, you’ll typically see ~30 bits differ, not 64 (the quick simulation below confirms this).
 
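A quick simulation backs this up (a sketch; the seed is arbitrary):

```python
import random

random.seed(42)  # reproducible run
distances = [
    (random.getrandbits(64) ^ random.getrandbits(64)).bit_count()
    for _ in range(10_000)
]
print(sum(distances) / len(distances))  # ~32, as expected for unrelated values
print(max(distances))                   # well below 64 even across 10,000 pairs
```
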
Worked Example (with hex & bits)
Let’s do a detailed worked walkthrough.
Say we have document A and document B. We compute SimHash fingerprints:
- A → fingerprint F₁ = `0x8c3a5f7e9ecb3f35` (hex)
- B → fingerprint F₂ = `0x8873bd1eaeeba7af` (hex)
Convert to binary (grouped into 4-bit nibbles):
F₁ = 1000 1100 0011 1010 0101 1111 0111 1110 1001 1110 1100 1011 0011 1111 0011 0101
F₂ = 1000 1000 0111 0011 1011 1101 0001 1110 1010 1110 1110 1011 1010 0111 1010 1111
(All 64 bits of each fingerprint are shown.)
Compute XOR = F₁ ^ F₂, then count the number of 1 bits in the XOR: 20 bits differ.
Hamming distance = 20.
Similarity (by simple measure) = (64 – 20)/64 = 44/64 = 0.6875 → 68.75%.
Interpretation: A and B share ~69% of fingerprint bits. Per the threshold table above, a distance of 20 falls in the “some similarity” band (11–25): perhaps shared layout or partially overlapping text, but far from near-duplicates.
If the distance had been 9, similarity ≈ (64 – 9)/64 = 55/64 ≈ 85.94% → much more similar in fingerprint space. You can verify this arithmetic in a couple of lines, as shown below.
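To double-check the arithmetic (plain Python, no libraries):

```python
f1 = 0x8C3A5F7E9ECB3F35
f2 = 0x8873BD1EAEEBA7AF

d = (f1 ^ f2).bit_count()  # use bin(f1 ^ f2).count("1") on Python < 3.10
print(d)              # 20
print((64 - d) / 64)  # 0.6875 -> 68.75% bit similarity
```
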
Similarity in practice
If you are building a checklist, static-site, icon-library, or search/filter/edit service, you will likely have lots of content pages (SVGs, icon metadata, etc.). Using SimHash (or similar) you can:
- Detect duplicates or near-duplicates of icon pages, descriptions, etc.
 - Filter out redundant content (e.g., same icon in multiple libraries) or merge similar items.
 - Provide “similar icon” recommendations: you can fingerprint page content and find pages with small Hamming distance.
 - Monitor changes over time: if a page is updated slightly, you’ll see a small change in the fingerprint (small distance from the previous one); a big change in the fingerprint signals a major update.
 
Given this, you’ll want to build a distribution of Hamming distances across your corpus so you know what “distance = 10” means for your data, and pick thresholds accordingly (e.g., “distance ≤ 8 → nearly identical”, “distance 9–20 → similar icons/layout”, etc.).
I’ve been building FreeDevTools: a collection of UI/UX-focused tools crafted to simplify workflows, save time, and reduce friction when searching for tools and materials.
Any feedback or contributors are welcome!
It’s online, open-source, and ready for anyone to use.
👉 Check it out: FreeDevTools
⭐ Star it on GitHub: freedevtools