Stack Overflowed

Posted on Jan 2

Find Duplicate Subtrees: Solution Explained

#webdev #coding #challenge

The “Find Duplicate Subtrees” problem asks you to identify all subtrees in a binary tree that appear more than once, meaning they have the same structure and the same node values. When duplicates exist, you return the root node of one instance of each duplicated subtree pattern. Two subtrees are considered duplicates if you can match every node in one subtree to the corresponding node in the other subtree with identical values and identical left and right child structure.

This is a deceptively deep tree problem because duplicates are not defined by node values alone. You are matching entire shapes. Two nodes with the same value are not necessarily duplicates unless their full descendant structure also matches.

Why naive comparisons are too slow

A brute-force way to solve this would compare every subtree against every other subtree. For each pair of candidate roots, you could run a deep equality check of both subtrees to confirm whether they match. That approach quickly becomes too slow because there can be many nodes, and deep comparisons are expensive.

The problem needs a strategy that can recognize repeated structures efficiently without repeatedly re-walking the same parts of the tree.

The key insight: represent each subtree with a unique signature

To detect duplicates efficiently, you need a way to describe a subtree so that identical subtrees produce identical descriptions. The standard idea is to create a “signature” for each subtree and then count how often each signature appears.

A signature must encode both structure and values. If it encodes only values, many different shapes could collide. If it encodes only structure, different value arrangements could collide. A correct signature includes both.

Serialization as a practical signature

A common signature approach is subtree serialization. Conceptually, you compute a string or token sequence that represents the subtree rooted at a node. The representation includes the node’s value and the serialized forms of its left and right subtrees, plus markers for null children, so structure is preserved.

When you compute this serialization consistently, two subtrees will serialize to the same sequence if and only if they are duplicates by the problem definition. That gives you a direct way to detect repeats.

Why postorder traversal fits perfectly

To build a subtree signature, you need the signatures of the left and right subtrees first. That naturally suggests postorder traversal, where you process children before the parent.

With postorder traversal, by the time you are computing a node’s signature, you already know the signatures of its children. This makes the computation straightforward and prevents partial or inconsistent representations.

Counting signatures to find duplicates

As you traverse the tree and compute a signature for each node, you record it in a frequency map. The first time a signature appears, you simply store it. The second time the same signature appears, you have found a duplicate pattern, and you add the current node to the result list as the representative of that duplicate subtree type.

This “add on the second occurrence” rule is important. If a subtree appears three or more times, you still only want to return it once. Recording it only when the count reaches two ensures you capture each duplicated pattern exactly once.

Why naive string serialization can be optimized

String serialization is conceptually clear, but if implemented carelessly, it can be slow because concatenating large strings repeatedly can become expensive. A common optimization is to assign compact identifiers to unique subtree patterns rather than storing long strings. Instead of building a full string, you treat the subtree as a triple consisting of the node value, left identifier, and right identifier, then map that triple to a unique integer id.

This turns signature construction into a constant-time operation per node after the mapping is set up. The underlying idea is the same: identical subtrees get identical IDs, and duplicates are detected via counting.

Why this approach guarantees correctness

Correctness comes from a precise one-to-one relationship between the subtree structure and its signature. If the signature encodes the node’s value and the full left and right subtree signatures, then two subtrees match exactly when their signatures match.

By traversing every node, you compute a signature for every possible subtree root. By counting those signatures, you identify exactly which subtree patterns appear more than once. Returning one root for each pattern on its second occurrence matches the problem’s output requirement.

Time and space complexity considerations

The traversal visits each node once, and each node’s signature is computed once. With an efficient signature representation, the runtime scales linearly with the number of nodes. Space usage comes from storing the signature map and frequency counts, which can also scale linearly in the worst case because many unique subtrees may exist.

This is an unavoidable cost because you must remember what you have seen in order to detect repeats reliably.

Why this problem is common in interviews

Find Duplicate Subtrees appears frequently in interviews because it tests whether candidates can transform a structural matching problem into a hashing or canonical-representation problem. Many people try to compare subtrees directly and get trapped in quadratic complexity.

It also checks whether you understand traversal order and why postorder is necessary when parent computation depends on child results.

What this problem teaches beyond duplicate detection

Beyond this specific task, the problem teaches a general technique for identifying repeated structures in recursive data. Whether it’s subtrees, sub-expressions, or nested configurations, canonical representation plus counting is a powerful pattern.

If you can clearly explain why duplicates require matching both structure and values, how postorder enables bottom-up signatures, and why counting signatures reveals duplicates efficiently, you demonstrate strong algorithmic maturity. That depth of understanding makes “Find Duplicate Subtrees” an excellent exercise in tree hashing, canonicalization, and structural pattern detection.

If you want more coding problems explained, check out:

DEV Community