DEV Community

Cover image for Autocomplete Algorithms
Khaled Elmorsy
Khaled Elmorsy

Posted on • Edited on

Autocomplete Algorithms

Disclaimer

First let's start by clearing something up, the key to fast autocomplete is in data structures.

When naïve ol' me set out to implement an autocomplete feature in the website I'm building, I searched for 'autocomplete algorithms', so this post is for fellow codepatratiots who think the same way.

1. The Problem

We all know what autocomplete outputs, but let's lay out some conventions that we'll use as we go along.

Autocomplete's primary function:

Given a query string ss of length s=m|s|=m , find all strings SS in a defined set D=S0,S1,S2...SnD={S_0, S_1, S_2 ... S_n} in which ss is a substring of SS .

For any brute force enthusiasts out there already rolling up their sleeves and preparing some for loops:

The solution should scale at most sublinearly with the number of strings. AKA, don't check all the strings...

2. Arrays

2.1 Naïve Search

First, the simplest approach we can use, which doesn't follow the added rule we imposed on ourselves, is a straightforward traversal of the string set S0,...Si,...Sn:S=mS_0, ... S_i, ... S_n: \overline{ |S| } = m and check whether s[0,k]s_{[0,k]} is a substring of SiS_i .

function autocomplete(query, strings) {
  const matches = []
  for (let string of strings) {
    if (string.includes(query)) matches.push(string)
  }
  return matches
}
Enter fullscreen mode Exit fullscreen mode
function autocomplete(query, strings) { const matches = [] for (let string of strings) { if (string.includes(query)) matches.push(string) } return matches } const strings = ['slow', 'sad', 'dont', 'do', 'this'] const matches = autocomplete('s',strings) console.log(matches) // It doesn't need to start with 's'

With this approach, we have to visit nn , and do mkmk comparisons each time, making it O(nmk)O(nmk) .

Don't let the example fool you, it may not seem like a lot of operations, but scale it to an array of 100,000 display names, each up to 50 characters long, then autocompleting just a 4 character string could need up to 20 million comparisons, and we need to run and feed the output off a server while the user is typing.(Given a small debounce but still..)

So we know we need to significantly cut down the number of operations we have to do, and we can actually achieve this with arrays too.

2.2 Prefix Binary Search

Instead of traversing the whole string set, we can use a binary search to cut down on the number of comparisons, but we need to address a major caveat.

Binary search relies on having a sorted search space to be able to discard half of it each time. The issue here is that we can't just sort the strings because we don't have a basis of comparison that accounts for the presence of substrings.

Strings apple cake table
Query able

In the alphabetically sorted string set above a binary search would check both 'cake' and 'apple', but what we really want to match is 'table' since it contains our query. This fact isn't represented above though, so we need a way to do that.

Also note that we need a way to capture multiple matches, still without fully traversing the complete set. We'll see that this fairly simple, almost as easy as going for a walk 😉.

2.2.1 Finish my Sentences

Let's dumb down our primary function definition into something more basic that only matched strings that start with our query, i.e. the query is the strings' prefix. So if our query was "cake", it could match "cake recipes" and "cake calories" but not "chocolate cake" or "Why don't I like cake" (you should go get that checked).

In that case, our basis of comparison is a left to right character match. This means that we can sort the string set according to each character from left to right, AKA alphabetically.

Now we can run a binary search and match a string with our query, and since we want multiple matches and we know that our string set is sorted alphabetically, i.e. according to their prefixes, we can simple walk up and down the array from our match, grabbing every other match and stopping when our full query isn't the prefix of the next string.

function autocomplete(query, strings) {
  const sortedStrings = strings.sort((a, b) => (a < b ? -1 : 1)); // Sort strings alphabetically
  const matchIndex = findIndex(query);
  if (matchIndex === -1) return [];
  const matches = [sortedStrings[matchIndex]];
  return matches.concat(farmMatches(query, sortedStrings, matchIndex));

  // Binary search for one of the matches.
  function findIndex(query, start = 0, end = sortedStrings.length - 1) {
    if (start > end) return -1;
    const midpoint = start + ((end - start) >> 1); // Bitshift for a quick floored halve
    // String.startsWith(), since we're checking the beginning of the match
    const string = sortedStrings[midpoint];
    if (string.startsWith(query)) return midpoint;
    [start, end] =
      query < sortedStrings[midpoint]
        ? [start, midpoint - 1]
        : [midpoint + 1, end];
    return findIndex(query, start, end);
  }

  /* Walk up and down the array from our current match looking for more matches.
       When we come across a non match stop. */
  function farmMatches(query, index) {
    const matches = [];
    for (let i = index + 1; i < sortedStrings.length; i++) {
      const string = sortedStrings[i];
      if (!string.startsWith(query)) break;
      matches.push(string);
    }
    for (let i = index - 1; i >= 0; i--) {
      const string = sortedStrings[i];
      if (!string.startsWith(query)) break;
      matches.push(string);
    }
    return matches;
  }
}
Enter fullscreen mode Exit fullscreen mode

Now to match all the strings in S0,...Si,...Sn:S=mS_0, ... S_i, ... S_n: \overline{ |S| } = m where s[0,k]s_{[0,k]} is a prefix, our time complexity breaks down into two main parts.

First, the binary search which is at worst O(klog2(n))O(k\log_2(n)) , where finding out if the query is a string's prefix is O(k)O(k) .

Then, for walking up and down the array from the initial match, we get O(k(z+2))O(k(z + 2)) where zz is the number of matches in the string set.

This reduces to O(klog2(n)+kz)O(k\log_2(n) + kz) or O(max(kz,klog2(n)))O(max(kz, k\log_2(n)))

You can observe this first hand, minus the kk character comparisons, by trying a few queries in the 10k word autocomplete. Particularly queries that have no matches and queries that have several.

If we had 1000 times the user base of the 100k display name example from earlier, we'd only need to visit at worst log2(100,000,000)=26.627\log_2(\mathord{100,000,000}) = 26.6 ≈ 27 strings, before finding our first match and starting the adjacent walk.

Right now, while we have a very performant algorithm, we're still limited to matching the prefixes of the strings. To generalize it to any inner substring, we won't change how we search, but what we search for.

2.3 Suffix Arrays

Suffix: A letter or group of letters added at the end of a word to make a new word -Cambridge Dictionary

Replace "word" with "string" (and "letters" with "characters") and we get the key to substring autocomplete. But we won't be adding to our strings, we'll generate all the possible suffixes of that string and add them to our string set with a reference to the original string.

For example, for the string "hello" we generate the following suffixes:

Prefix Suffix
hello
h ello
he llo
hel lo
hell o

We then link all of these suffixes with "hello", so that if any of them are matched with our current method, say a user types "el" which matches with "ello", then "hello" is added to the actual matches.

We'll use a map/dictionary and IDs to link suffixes to their respective strings to save on memory, and to work around colliding suffixes, i.e. 'cake' and 'joke' both have 'ke' as a suffix, we'll store an array of string IDs at each suffix.

This array of suffixes is aptly names a Suffix Array, and when created for multiple strings, it's a Generalized Suffix Array. For the sake of brevity, we won't repeat 'generalized' each time.

2.3.1 Procedure

Preprocessing

const strings = ['test', 'cake', 'joke', 'best', 'hello', 'suffix', 'etc'] // ;)

// 1. Declare object/dictionary to store ID:string pairs
const stringMap = {}

// 2. Declare an object to store Suffix:ID[] pairs
const suffixMap = {} // We'll see why an object soon

// 3. Iterate over the string set and
for (let string of strings) {
  // 1. Generate a unique ID based on the state of stringMap
  const id = Object.values(stringMap).length

  // 2. Add the string to stringMap with the ID as the key
  stringMap[id] = string

  // 3. Generate all suffixes - we can also just loop without storing them in an array
  const suffixes = [...string].map((_,i) => string.slice(i))

  // 4.Iterate over the suffixes to insert them
  for (let suffix of suffixes) {
    // 0. If our suffix isn't a key in suffixMap, add it with an empty array as its value
    suffixMap[suffix] ||= []

    // 1. Push the ID into the suffix's id array - With objects, this is O(1)
    suffixMap[suffix].push(id)
  }
}

// 4. Convert suffixMap to an array of {suffix, ids: id[]} objects
const suffixArray = Object.entries(suffixMap).map(([suffix, ids]) => ({suffix, ids}))

// 5. Sort the suffixArray according to the suffixes
const sortedSuffixArray = suffixArray.sort((a,b) => a.suffix < b.suffix ? -1 : 1)
Enter fullscreen mode Exit fullscreen mode
const strings = ['test', 'cake', 'joke', 'best', 'hello', 'suffix', 'etc']; const stringMap = {}; const suffixMap = {}; for (let string of strings) { const id = Object.values(stringMap).length; stringMap[id] = string; const suffixes = [...string].map((_,i) => string.slice(i)); for (let suffix of suffixes) { if(!suffixMap[suffix]) suffixMap[suffix] = []; suffixMap[suffix].push(id); }; }; const suffixArray = Object.entries(suffixMap).map(([suffix, ids]) => ({suffix, ids})); const sortedSuffixArray=suffixArray.sort((a,b) => a.suffix < b.suffix ? -1 : 1); console.log(sortedSuffixArray)

Note that storing all of the suffixes in the array is unnecessarily inefficient, in actual suffix array implementations, we only need to store the suffix's position in the string, and in our multi-string generalized model, we also store a string ID. That way, we can use these two indices to generate any suffix in our model. This is discussed and visually demonstrated in section 5.2.

Preprocessing Time Complexity

Looking at the complexity of constructing a suffix array for:

D=S0,...Si,...Sn:S=mD = S_0, ... S_i, ... S_n: \overline{ |S| } = m
  • Each string of length mm has mm suffixes.
  • There are n×mn×m suffixes in total.
  • Sorting n×mn×m suffixes with an efficient algorithm like merge sort is O(nmlog(nm))×O(SuffixComparison)O(nm \log(nm)) × O(Suffix Comparison)
  • Suffix lengths scale with mm so, comparing suffixes, which is linear is O(m)O(m)

So the final time complexity of building a suffix array is O(nm2log(nm))O(nm^2 \log(nm)) . This would not be performant on large sets with long strings.

Fortunately, there are more efficient algorithms that can build suffix arrays in linear time, such as DC3, essentially reducing preprocessing time complexity to O(nm)O(nm) .

Matching

Matching is similar to the prefix binary search example from earlier, since we're essentially prefix matching our query with suffixes instead of strings.

The only difference is that we're first collecting string ID's from each matched suffix then mapping those IDs to their relevant strings.

// 1.a Find one match with a binary search 
const firstMatch = binarySearch(query, sortedSuffixArray)
// 1.b Then traverse up and down the array from that match to get the rest of the matches. We only return an array of IDs in this step.
const matchIDs = farmMatches(query, firstMatches, sortedSuffixArray)

// 2. Remove duplicate IDs
const uniqueIDs = [...new Set(matchIDs)]

// 3. Map the IDs to their respective strings
const matchedStrings = uniqueIDs.map(id => stringMap[id])
Enter fullscreen mode Exit fullscreen mode
Time Complexity

Since we didn't inherently change our matching algorithm, and just tacked on the ID mapping, matching a query s[0,k]s_{[0,k]} breaks down to:

  • Finding the first match: O(klog2(nm))O(k\log_2(nm))
    • Very fast. Even 100 billion suffixes needs at worst 37 comparisons.
  • Farming matches around the first match: O(kz)O(kz)
    • where zz is the number of matched suffixes.
  • Removing duplicates: O(z)O(z)
  • Mapping match IDs to strings: O(z)O(z)
    • The number of unique matches is at worst, the number of matches.

This reduces to:

O(max(klog2(nm),kz)O(max(k\log_2(nm), kz)

We can see that matching is very efficient, at worst scaling linearly with the number of suffixes that match our query since the binary search component is so fast.

2.3.2 Interactive Example

Here's the same example from before, now using suffix arrays.

Once again, try different queries in the 10k word, now 28k suffix, and get a feel for how it scales in different cases like no matches, vs several.

2.3.3 Insertion and Deletion

We won't go too deep into inserting and deleting strings here but they're pretty straightforward.

To insert a string, generate an ID and the string's suffixes. Binary search for each suffix in the suffix array. If you find it, push the ID to its ID array, if not insert it into the suffix array using the last index returned by the binary search.

To delete a string, query all its suffixes and remove its ID from each ID array, if the ID array is now empty, remove the suffix. Then remove the ID and string from the string map.

Depending on how the suffix array is implemented and the algorithm used, it could be more efficient or even necessary to rebuild it completely with each change.

2.3.4 Are Suffix Arrays a Viable Solution

Yes! Matching a query scales really well with the size of the overall suffix array, and scales linearly with the number of matches.

We do have to preprocess the data, but even that is speedy using efficient algorithms.

But can we query our data and get results even faster, regardless of the size of the string set? We sure can (but only marginally faster).

3. Trie

The key data structure that makes this possible is a Trie (pronounced Try) which is a type of tree data structure where keys/edges between nodes represent the next character from the current position.

Example Trie


An example Trie of [salt, car, cart, trie, tree]
With 'tree' being queried

To query a trie, we start at the root node, and traverse to the next node by following the edge/key for the next character in our string.

When we reach the final node in our query, leaf nodes, connected through an edge with a special character ('$' in our case) represent ends of strings. Their value is either the string itself or its ID key in a separate map.

3.1 Building a Trie

Building basic tries is a straightforward process:

/**
 * A basic trie built from an array of strings.
 * Further strings and string arrays can be inserted to it.
 * 
 * The autocomplete method matches strings with the query as a prefix.
 */
function Trie(strings) {
  const root = Node();
  insertStringArray(strings);

  /**
   * Trie node with a value property and children object
   *
   * Values are reserved for leaf nodes that denote 
   * the end of a string, and they store the string itself.
   * 
   * Children keys represent the next relative character and point to 
   * the relevant nodes.
   */
  function Node(value) {
    const children = {}
    return {value, children}
  }

  /**
   * Traverse the trie from root along the string characters until
   * the edge for the next character isn't found, or the string is
   * fully traced.
   * 
   * returns the final traced node, and the remaining string ('' if fully traced)
   */
  function trace(string, node = root) {
      if (!node) return
      return trace(string.slice(1), node.children[string[0]] || null) ||
        {node, remainingString: string}
  }

  /** 
   * Insert string into the trie by tracing the portion of the string already
   * in the trie, then adding nodes and edges for each remaining character.
   * 
   * After the string is fully inserted, add a final node on the '$' edge with 
   * the string as its value, marking the end of that path.
   */ 
  function insertString(string){
    let {node, remainingString} = trace(string);
    for (let char of remainingString) {
      const nextNode = Node();
      node.children[char] = nextNode;
      node = nextNode;
    }
    // Add a leaf node to represent the end of a string's path
    node.children['$'] = Node(string)
  }

  // Iterate through the array and insert each string
  function insertStringArray(stringArray) {
    stringArray.forEach(insertString)
  }

  /**
   * Matches all strings in the trie that contain the query string
   * as their prefix.
   * 
   * Returns an array of the matched strings (empty if none are found).
   */
  function autocomplete(string) {
    const {node, remainingString} = trace(string);
    // If the fully query isn't matched, return no matches
    if (remainingString) return [];

    const matches = [];
    // Visit all children nodes and add leaf node values. Can be done in a few ways.
    const stack = [node];
    while (stack.length) {
      const currentNode = stack.pop();

      /* Only leaf nodes, which are end of strings, have values.
         If we visit one, add that string to the matches */
      if (currentNode.value) matches.push(currentNode.value)

      // Add the current node's children to the stack
      const childNodes = Object.values(currentNode.children)
      stack.push(...childNodes)
    }
    return matches;
  }

  return { root, autocomplete, insertString, insertStringArray }
}
Enter fullscreen mode Exit fullscreen mode
function Trie(strings) { const root = Node(); insertStringArray(strings); function Node(value) { const children = {} return {value, children} } function trace(string, node = root) { if (!node) return return trace(string.slice(1), node.children[string[0]] || null) || {node, remainingString: string} } function insertString(string){ let {node, remainingString} = trace(string); for (let char of remainingString) { const nextNode = Node(); node.children[char] = nextNode; node = nextNode; } node.children['$'] = Node(string) } // Iterate through the array and insert each string function insertStringArray(stringArray) { stringArray.forEach(insertString) } function autocomplete(string) { const {node, remainingString} = trace(string); if (remainingString) return []; const matches = []; const stack = [node]; while (stack.length) { const currentNode = stack.pop(); if (currentNode.value) matches.push(currentNode.value) const childNodes = Object.values(currentNode.children) stack.push(...childNodes) } return matches; } return { root, autocomplete, insertString, insertStringArray } } let example = Trie(['tree','trie','teal', 'test', 'true']) console.log(example.autocomplete('tr'))

Trie (😉) different autocomplete queries, or make more tries.

Removing Strings
To remove a string from a trie, we determine the longest singular branch from the 'end of string' node towards the root, where each node has only one child. This is to avoid affecting any other string paths.

To do this, we trace the string normally from the root. At each character edge, we check if the next node only has 1 child:

  • If it does, we mark that edge to be deleted.
    • If there's already a marked ancestor edge, then don't change it.
  • If the node has multiple children, unmark any marked ancestor edge and continue tracing.
String removal example
Removing the string tree from the trie.

Remove Method Code
function remove(string) {
  let currentNode = root;
  let parentToTrim; // Parent node to delete the edge from
  let edgeToRemove;
  for (let char of string) {
    // Get the next node in the string path, making current node the parent
    const nextNode = currentNode.children[char];
    // We can only remove the next node if it has no other branches
    if (Object.values(nextNode.children).length > 1) {
      // If the next node has multiple children, we can't remove trim from above it
      [parentToTrim, edgeToRemove] = [null, null];
    } else {
      /* If it's okay to remove the child, then only update the parent node
         if it was previously reset  */
      parentToTrim ||= currentNode;
      edgeToRemove ||= char;
    }
    currentNode = nextNode;
  }
  /* If we can't delete any node we visit during the trace then only
     delete the 'end of string' node */
  if (parentToTrim) {
    delete parentToTrim.children[edgeToRemove];
  } else {
    delete currentNode.children['$'];
  }
}
Enter fullscreen mode Exit fullscreen mode

3.2 Interactive Example

Here's an example I made to visually illustrate tries and autocompleting with them.

The source code is slightly different at parts to account for visualizing the results (adding query statuses to the nodes). The example also uses a string map with IDs to reduce clutter.

P.S. Let me know if you're interested in a post about creating this visualization, namely drawing tree structures. I'm already considering it, but more interest would definitely drive it forward.

3.3 Performance and Caveats

Performance

The best thing about using tries is that, inserting, removing and tracing strings of length S=m|S| = m is O(m)O(m) .

Building off of that, creating a trie of a set of strings D=S0,S1,...SnD = {S_0, S_1,...S_n} where the average string length is mm , is O(mn)O(mn) .

Autocompleting a query s[0,j]s_{[0,j]} , assuming the path exists in the trie, involves:

  1. Tracing the query: O(j)O(j)

  2. Visiting each leaf node that descends from the trace result:

    • The average leaf depth is the average string length mm . So the upper bound complexity to traverse to each leaf is O(mj)O(m-j) where m>jm>j .
    • The number of leaf nodes to visit is the number of matches zz
    • Time complexity is thus: O(z(mj))O(z(m-j)) which at worst is O(zm)O(zm)

Assuming zmax(j,m)z \gg max(j,m) , then autocompleting has an upper bound of O(z)O(z) .

Caveats
There are two main caveats to using basic tries as our autocomplete solution, first and foremost, it can only do prefix autocomplete, also, there are a lot of extra nodes that take up unnecessary storage space and traversal operations.

To address the prefix limitation, we can include suffixes into our model, and to reduce the node count, we'll compress all redundant node chains, where there's a straight chain from an inner node to a leaf with no inner branches.

4. Suffix Trees

Suffix trees are conceptually similar to tries, but they also also include all of the suffixes of each string, and branches are compressed with no redundant chains.

Suffix Trees

[left to right] Trie, Suffix Trie, and Suffix Tree of the string 'hello'
Note: The lack of terminating character '$' is discussed in 4.1

Suffix trees allow us to build highly scalable and relatively space efficient* autocomplete models.

Similar to suffix arrays, a suffix tree of multiple string is a Generalized Suffix Tree, and also like suffix arrays, we won't keep repeating 'generalized'.

* Not the most space efficient. This is discussed in section 5

4.1 Building a Suffix Tree

A key aspect of suffix tree creation and manipulation, is the branching and compression logic. This is slightly more involved that basic tries and affects how we query the tree and insert into it.

Querying the Tree

With edges being compressed multi-character strings, we need to change our querying logic slightly. The goal is still to traverse the tree in linear time depending on the length of the query.

We can do that by checking the edges for varying prefixes of the query, iteratively adding another character and checking again. Once we match an edge, we repeat with the remaining query at the next node.

function query(string, node = root) {
  let queryString = '';
  for (let char of string) {
    // Iteratively add characters to the query string
    queryString += char;

    // Each time, check if an edge matches that prefix
    const nextNode = node.children[queryString];

    // If an edge exists, recurse on its node with the remaining string
    if (nextNode) {
      const remainingString = string.slice(queryString.length);
      return query(remainingString, nextNode);
    }
  }
  /* Return both the last matched node and the remaining
     unmatched string (blank if it was a fully match) */
  return { node, remainingString: string };
}
Enter fullscreen mode Exit fullscreen mode

For a query ss where s=m|s| = m , this approach does indeed perform at O(m)O(m) since we simply iterate over the query, performing O(1)O(1) object property retrievals each time.

Inserting a Suffix

Let's expand a bit on insertion with compressed branches. First, our primary goal is to:

A. Have a path of edges and nodes that contains the full suffix/string we're inserting.
B. Ensure that the path ends at a leaf node containing the parent string's ID.

We can do this by:

  1. Querying the suffix in our tree.
  2. If the trace is incomplete:
    1. Insert an edge and node for the remaining string at the last traced node.
    2. If an edge from the last node shares a prefix with the remaining string, split that edge on the common prefix and insert under it.
  3. If the path ends on an inner node:
    1. Extend a leaf node from the end of the path, with a terminating character as its edge.
  4. Store the parent string ID at the leaf node at the end of the path.

Compressed Insertion and Splitting

Inserting lo in a suffix tree where the edge llo edge exists.

We won't include terminating character '$' in our edges except for terminating paths that end on inner nodes. Such as with the suffix e in tree shown below. This is just to make the tree easier to look at and faster to grasp at a sight without losing functionality.

Suffix Tree of 'tree'
Example of suffix tree for tree
Suffix Tree of the string 'tree'

Translating this process to code, we get:

function insert(string, id) {
  // Trace the string/suffix
  let {node, remainingString} = query(string)

  /* If there's no path for the full string, 
     Add or split an edge at the last node in the trace */
  if(remainingString) {
    node = addEdge(node, remainingString)
  }

   // If the path ends at an inner node, add a terminating leaf
    let leaf = Object.values(node.children).length
    ? (node.children['$'] ||= Node([]))
    : node;

  // Add the string ID to the leaf node
  leaf.value ||= [];
  leaf.value.push(id);

  function addEdge(node, string) {
    /* Add a new edge the main node if we can't split an edge
       at a common prefix */
    let targetParent = node;
    let newEdge = string;

    // Iterate over the  node's edges, looking for one to split.
    for(let [edge, subtree] of Object.entries(node.children)){
      // Check for the longest common prefix between the edge and string
      const lcp = getLCP(edge,string)

      // If a (non blank) LCP exists, split the edge at the lcp
      if (lcp) {
        // Add a new edge & node for the common prefix
        const prefixNode = Node();
        node.children[lcp] = prefixNode;

        // Move the original subtree under the new prefix node
        const remainingEdge = edge.slice(lcp.length); // Remove the lcp from the original edge
        prefixNode.children[remainingEdge] = subtree;
        delete node.children[edge]; // Delete the original edge

        /* If the common prefix is the complete string, then the prefix node 
           is the end of the path, and we don't need to add a new node */
        if (lcp === string) return prefixNode

        /* Otherwise, designate the prefix node as the one to add to,
           and remove the lcp from the edge to add */
        targetParent = prefixNode;
        newEdge = string.slice(lcp.length);

        break // Stop checking edges
      }
    }

    /* Add a new edge/node at the designated node 
       to complete the string path */
    const endNode = Node();
    targetParent.children[newEdge] = endNode;
    return endNode;
  }
}
Enter fullscreen mode Exit fullscreen mode

getLCP()
function getLCP(...strings) {
  let lcp = strings[0] || '';
  // Traverse through strings and update the LCP each time
  for(let string of strings.slice(1)) {
    for (let [i, char] of [...lcp].entries() ) {
      /* When the LCP and current string deviate stop matching
         trim the LCP and move on to the next string */ 
      if (char !== string[i]) {
        lcp = lcp.slice(0, i)
        break;
      }
    }
    if(!lcp) break // Avoid unneseccary iterations when there's no LCP
  }
  return lcp
}
Enter fullscreen mode Exit fullscreen mode

Autocompleting

Just like with tries, to autocomplete a query, we first trace it in the tree, then from the end of that trace, grab all the descendent leaf nodes.

Unlike trie, with compressed branches, there's a chance that our query's path with land in the middle of an edge, i.e. query: el, edge: ello. In that case, we consider the edge a match since it fully matches what remains of our query.

function autocomplete(string) {
  // Try to fully match the string with full edges
  const { node, remainingString } = query(string);

  let searchRoot;

  // If already fully matched, search from the end of the path
  if (!remainingString) {
    searchRoot = node;
  } else {
    /* If not fully matched, try to complete the match using part
       of an edge from the last node */
    for (let [edge, childNode] of Object.entries(node.children)) {
      // Get the common prefix between the edge and remaining string
      const lcp = getLCP(edge, remainingString);
      if (!lcp) continue; // Move on if they don't share a prefix

      /* If the prefix is the complete remaining string, then the 
         string is fully matched and we can search from that child */
      if (lcp === remainingString) {
        searchRoot = childNode;
      }
      break; // Stop checking edges
    }
  }
  if (!searchRoot) return [];
  return getAllMatches(searchRoot); // Search for leaf nodes
}
Enter fullscreen mode Exit fullscreen mode

These are the main notable differences between suffix trees and basic tries that we'll discuss here. Deletion is also different with compressed branches, but I'll leave that up to your ingenuity and research 😉.

4.3 Example

Here's an interactive suffix tree example, feel free to try different string sets and queries.

4.4 Performance

Inserting a string S[0,m]S_{[0,m]} , to a suffix tree, means adding mm suffixes each needing traversals that scale linearly with, hence O(m)O(m) each. Thus, adding one string has a time complexity of O(m2)O(m^2) , and adding nn strings is O(nm2)O(nm^2) .

Querying a string of length s[0,k]s_{[0,k]} is O(k)O(k) , which is similar to tries and is very performant.

Autocompleting a string s[0,k]s_{[0,k]} involves querying that string and then searching for all descendent leaf nodes. This has a time complexity of O(k+z)=O(max(k,z))O(k + z) = O(max(k,z)) where zz is the number of matches.

Compressed branches also allow us to store and traverse less nodes during our search, saving memory and reducing the constant time factor in average scenarios.

Efficient Algorithms
The performance discussed above is for naïve suffix tree construction. Algorithms have been developed that build suffix trees in linear time, or O(nm)O(nm) for D=S0,S1...Sn:S=mD={S_0,S_1...S_n}: \overline{ |S| } = m . A notable and known example is Ukkonen's algorithm.

5. Suffix Trees vs. Suffix Arrays

Now, having a handle on two solid methods to implement an autocomplete model, which is more viable in real world scenarios. Ultimately it all boils down to speed and size.

5.1 Speed

Looking at the operations we've been discussing till this point, we can try to draw any perceivable differences between both models.

For a model D=S0,S1...Sn:S=mD={S_0,S_1...S_n}: \overline{ |S| } = m

Task Suffix Tree Suffix Array Victor Comment
Preprocess DD O(nm)O(nm) O(nm)O(nm) Suffix Tree Both in linear time with efficient algorithms for capped string lengths
Insert S[0,j]S_{[0,j]} O(j2)O(j^2) O(j2log2(nm))O(j^2\log_2(nm)) Suffix Tree Depends on the implementation
Autocomplete s[0,k]s_{[0,k]} where
z=z = # of matches
O(k+z)O(k+z) O(klog2(nm)+kz)O(k\log_2(nm) + kz) Suffix Tree Both ultimately realistically scale with zz

In practice, both models can be built in linear time with the total number of characters, and modified relatively easily.

For autocompletion, the difference in searching for the initial query search O(k)O(k) vs. O(klog2(nm))O(k\log_2(nm)) is marginal since the scaling factor log2(nm)\log_2(nm) grows at a negligible rate with the total number of suffixes.

Thus using either suffix trees or suffix arrays is viable for a performant autocomplete model, with suffix trees giving a very slight edge in speed which would be hardly distinguishable in practical circumstances.

5.2 Storage

With the illustrative & visual implementation of suffix trees and arrays, the space complexity of both models is at worst the total number of stored suffixes O(nm2)O(nm^2) .

In practice, the suffixes and prefixes aren't stored as strings at each edge/value, they're mapped to indices that denote the parent string and the location in that string, and those indices are stored instead. This significantly reduces storage complexity to a linear O(nm)O(nm) .

Using indices for Suffix Trees and Arrays

Using indices for a suffix tree and array of **["hello","halo"]**.

While both models now scale linearly, suffix trees tend to grow reasonably larger than suffix arrays at worst cases. This because of their inner nodes.

We can deduce this by first considering the fact that both structures must at least store the indices for each possible suffix n×mn×m .

With suffix arrays, there's an array element for each suffix. So nmnm elements. In suffix trees, each suffix terminates at a leaf node, so there are nmnm leaves, but suffix trees can also have up to nm1nm - 1 inner nodes, making them grow at a worst case rate of 2nm12nm - 1 .

Thus, suffix arrays, require much less space than suffix trees since they only store references to the suffixes with no extraneous data.

6. Your Autocomplete Model

Ultimately, the approach you choose will depend on your needs coupled with the latest, most space and time efficient algorithms.

Suffix arrays in particular currently need a lot less space while still providing blazingly fast autocomplete on large data sets so they're a great choice that's easy to implement.

And, not to leave you empty handed, I made autokomplete, a fast and robust autocomplete module built around suffix arrays.

It uses DC3 to construct the array in linear time, and can match even faster than the performance discussed above using an extra optimization. Instead of walking up and down comparing suffixes around the first binary search match, use two modified binary searches to find the first and last matching suffixes, making it possible to just slice, map and return the range in between.

This is slightly less efficient with less matches, but significantly faster with a large number of matches, since we can skip comparing each matching suffix with the query.

Try out the module and feel free to contribute to it or recommend additional features. I have a couple in mind already.

Top comments (1)

Collapse
 
fruntend profile image
fruntend

Сongratulations 🥳! Your article hit the top posts for the week - dev.to/fruntend/top-10-posts-for-f...
Keep it up 👍