
Comparing an autocomplete algorithm in Python, Go, Haskell

Ryan Westlund ・ 9 min read

Since the web scraper comparison post was so well-liked, here's another comparison. The task is to implement the autocomplete algorithm from the Fish shell, minus case folding:

  • If what you've typed matches the beginning of any filenames, complete to that one or the biggest prefix common to all of them. If there's nothing that starts with what you've typed, proceed to the next rule.

  • If what you've typed matches a contiguous part of exactly one filename, even if it's not at the beginning, complete to that one. If no matches, proceed to the next rule.

  • If exactly one filename contains the characters you've typed in order, complete to that one.
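To make the rules concrete, here's a tiny self-contained sketch of my own (with made-up filenames, using only stdlib helpers) showing which rule fires for a few inputs:

```python
import os

# Hypothetical filenames for illustration.
files = ["notes.txt", "notebook.py", "todo.md"]

# Rule 1: "no" is a prefix of two files, so complete to their longest
# common prefix.
prefix_matches = [f for f in files if f.startswith("no")]
print(os.path.commonprefix(prefix_matches))  # -> note

# Rule 2: nothing starts with "odo", but exactly one file contains it.
contains = [f for f in files if "odo" in f]
print(contains[0])  # -> todo.md

# Rule 3: nothing contains "tm" contiguously, but todo.md has a 't'
# followed later by an 'm', so "tm" would complete to todo.md.
```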

(As you've probably guessed if you follow me, I was inspired to do this after my post on depth-first versus breadth-first autocompletion.)

I've generated a million test cases using a Python script I'll post at the end. It has functions to generate sets with specific solution types (for example, a set where partial completion from a prefix is possible, one where nothing can be completed, one where completion is possible from a contiguous non-prefix match), so all code paths are guaranteed to be exercised in somewhat realistic proportions.

Here's my Python implementation.

import json, datetime

def autocomplete(typed, possibilities):
    """The basic breadth-first algorithm: add the biggest prefix common to all continuations."""
    # Exclude things that don't match the typed prefix.
    matches = [p for p in possibilities if p.startswith(typed)]
    # str.startswith is slower than slicing, but more idiomatic, so it's a fairer comparison.
    # If there's nothing that starts with the given prefix, proceed to the next rule.
    if not matches: return autocomplete_contiguous(typed, possibilities)
    # Scan the common prefix 1 char at a time.
    i = len(typed)
    while True:
        char = None
        for s in matches:
            if len(s) <= i or char and s[i] != char:
                return s[:i]
            if not char:
                char = s[i]
        i += 1

def autocomplete_contiguous(typed, possibilities):
    """Fallen back to if nothing starts with typed. Checks for things that contain it."""
    matches = [p for p in possibilities if typed in p]
    if len(matches) == 1: return matches[0]
    if not matches: return autocomplete_in_order(typed, possibilities)
    return typed

def autocomplete_in_order(typed, possibilities):
    """Fallen back to if nothing contains the typed characters contiguously. Checks if anything has them in order."""
    matches = []
    for s in possibilities:
        lastmatch = 0
        for char in typed:
            index = s.find(char, lastmatch)
            # str.find returns -1 when the char isn't present after lastmatch.
            if index == -1: break
            lastmatch = index + 1
        else:
            # It's ambiguous.
            if matches: return typed
            matches.append(s)
    # At this point there are fewer than two matches.
    if matches: return matches[0]
    return typed


if __name__ == '__main__':

    with open('data') as f:
        sets = json.load(f)

    start = datetime.datetime.now()
    for s in sets:
        autocomplete(s['typed'], s['possibilities'])
    end = datetime.datetime.now()

    print(end - start)

Note that I don't start the clock until the JSON is parsed, because I'm comparing the autocomplete algorithm.
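(As an aside on timing methodology: datetime subtraction works fine here, but time.perf_counter is the clock usually recommended for benchmarking, since it's monotonic and high-resolution. A minimal sketch, with a stand-in workload instead of the real dataset:)

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(10**6))  # stand-in workload
elapsed = time.perf_counter() - start
print(f"Took {elapsed:.3f}s")
```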

If I discount the main block and associated imports so we're only measuring the autocomplete implementation, Python comes in at 30 lines. (I use cloc to count lines of code so comments/docstrings and blank lines are excluded.) Pretty impressive for such a complicated algorithm.
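As a side note, the third rule (autocomplete_in_order) is essentially a subsequence test, and there's a well-known Python idiom for that check using a single iterator. This is an alternative sketch, not the version I benchmarked:

```python
def is_subsequence(typed, candidate):
    # `ch in it` advances the iterator as it searches, so each character
    # must be found after the position of the previous one.
    it = iter(candidate)
    return all(ch in it for ch in typed)

print(is_subsequence("ac", "abcd"))  # -> True
print(is_subsequence("ca", "abcd"))  # -> False
```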

Running Python 3.7, it took about 5.4 seconds to run all million cases, and used 1.7 GB of memory to do so. To be honest, I expected worse.
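(The memory figures in this post come from watching top; for a more reproducible number, the stdlib resource module can report a process's own peak RSS, at least on Unix. A sketch; note that ru_maxrss is in kilobytes on Linux but bytes on macOS:)

```python
import resource

# Peak resident set size of the current process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak} (KB on Linux, bytes on macOS)")
```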

The Go implementation (with no concurrency):

package main

import (
    "encoding/json"
    "fmt"
    "io/ioutil"
    "strings"
    "time"
)

type Case struct {
    Typed         string
    Possibilities []string
}

func main() {
    var rawData, err = ioutil.ReadFile("data")
    if err != nil {
        panic(err)
    }
    var data = make([]Case, 0, 1000000)
    err = json.Unmarshal(rawData, &data)
    if err != nil {
        panic(err)
    }
    var before = time.Now()
    for i := range data {
        autocomplete(data[i].Typed, data[i].Possibilities)
    }
    var after = time.Now()
    fmt.Printf("Took %f\n", after.Sub(before).Seconds())
}

func autocomplete(typed string, possibilities []string) string {
    var matches []string
    // Exclude things that don't match the typed prefix.
    for _, s := range possibilities {
        if strings.HasPrefix(s, typed) {
            matches = append(matches, s)
        }
    }
    if len(matches) == 0 {
        return autocompleteContiguous(typed, possibilities)
    }
    // Scan the common prefix 1 char at a time.
    // Scan the common prefix 1 char at a time. The loop has no exit
    // condition; it always returns from inside, so no final return is needed.
    for i := len(typed); ; i++ {
        var char byte = '~' // A dumb hack: this char doesn't appear in the dataset.
        for _, s := range matches {
            if len(s) <= i || char != '~' && s[i] != char {
                return s[:i]
            }
            if char == '~' {
                char = s[i]
            }
        }
    }
}

func autocompleteContiguous(typed string, possibilities []string) string {
    var matches []string
    // Collect things that contain the typed string anywhere.
    for _, s := range possibilities {
        if strings.Contains(s, typed) {
            matches = append(matches, s)
        }
    }
    if len(matches) == 1 {
        return matches[0]
    } else if len(matches) == 0 {
        return autocompleteInOrder(typed, possibilities)
    }
    return typed
}

func autocompleteInOrder(typed string, possibilities []string) string {
    var matches []string
    // Look for things that contain the typed characters in order.
    for _, s := range possibilities {
        var lastmatch = 0
        for i, char := range typed {
            var index = strings.IndexRune(s[lastmatch:], char)
            if index == -1 {
                break
            }
            // The index will be offset by our slicing, so account for that.
            lastmatch += index + 1
            // If we just found the last character, record this as a match.
            if i+1 == len(typed) {
                // There's already a match. It's ambiguous.
                if len(matches) > 0 {
                    return typed
                }
                matches = append(matches, s)
            }
        }
    }
    // If we get here, there are fewer than two matches.
    if len(matches) == 1 {
        return matches[0]
    }
    return typed
}

64 lines of code after removing the main function and the Case struct definition. (I was originally going to store the cases in the data file as JSON arrays of [typed, possibilities], but Go doesn't have tuple types and can't easily parse a list of non-uniform elements, which is why I formatted them as JSON objects.)
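For reference, each entry in the data file is a JSON object with lowercase keys; Go's encoding/json matches them to the exported Typed/Possibilities struct fields case-insensitively. A quick sketch of one made-up entry:

```python
import json

# A hypothetical test case in the data file's format.
case = {"typed": "fi", "possibilities": ["filename", "filesystem"]}
line = json.dumps(case)
print(line)  # -> {"typed": "fi", "possibilities": ["filename", "filesystem"]}
```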

Of course, if I don't count lines of "code" that consist only of a closing brace or parenthesis, Go's line count drops to 44. I'm ambivalent about whether it's fair to count those: they don't represent any extra semantic complexity, but they do take up vertical space, reducing the amount of code I can fit on screen at once.

Runtime, though, is a stellar 0.5 seconds. About 1.4 GB of memory was used, but only 700 MB resident in physical memory (I'm viewing the usage in top). I ran Go 1.14.4.

The Haskell implementation is the hardest to measure the performance of due to lazy evaluation. Since I'm not doing anything with the results, I had to read about ways to force Haskell to evaluate, and even had to post on r/haskell when I couldn't figure out how to make sure the JSON was really parsed before starting the clock. I think this implementation is a fair trial.

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.Time
import Data.Text (Text)
import qualified Data.Text as Text
import Data.Aeson
import Data.Maybe
import Control.Exception
import Control.DeepSeq
import GHC.Generics

data Case = Case
  { typed :: !Text
  , possibilities :: ![Text]
  } deriving (Generic, Show, Read, Eq)

instance FromJSON Case
instance NFData Case

-- | The simple algorithm.
autocomplete :: Case -> Text
autocomplete (Case typed possibilities) =
  let suffixes = mapMaybe (Text.stripPrefix typed) possibilities
  in if not $ null suffixes
    then typed <> commonPrefix suffixes
    else autocompleteContiguous typed possibilities

commonPrefix :: [Text] -> Text
commonPrefix [] = ""
commonPrefix [x] = x
commonPrefix (x:y:xs) =
  case Text.commonPrefixes x y of
    Nothing -> ""
    Just (p,_,_) -> commonPrefix (p:xs)

autocompleteContiguous typed possibilities =
    let matches = filter (Text.isInfixOf typed) possibilities
    in if length matches > 1
        then typed
        else if length matches == 1
            then head matches
            else autocompleteSparse typed possibilities

autocompleteSparse typed possibilities =
    let matches = filter (isMatch typed) possibilities
    in if length matches == 1
        then head matches
        else typed
    where
        isMatch "" str = True
        isMatch substr str =
            case Text.findIndex (== Text.head substr) str of
                Nothing -> False
                Just index -> isMatch (Text.tail substr) (Text.drop (index + 1) str)

main = do
    cases <- (fromJust <$> decodeFileStrict' "data") :: IO [Case]
    mapM_ (evaluate . force) cases
    start <- getCurrentTime
    mapM_ (evaluate . autocomplete) cases
    end <- getCurrentTime
    putStrLn $ "Took " ++ (show $ diffUTCTime end start)

After removing main, the Case debacle and associated imports, and type signatures (I don't think any were really necessary here), this is 33 lines.

I didn't actually come up with this implementation of autocomplete and commonPrefix; the r/haskell folks helped me out. (Heck, I was gonna use String instead of Text. That would've been awful.)

Runtime without optimizations is 2.25 seconds; if I compile with -O2, it goes down to 1.2. Not bad, but I'm still a little disappointed that it's so far behind Go.

And the memory use! For some reason, it allocates 3 GB. The optimization flag doesn't affect it. I wouldn't be surprised if that's something I'm doing wrong, but sheesh. I'd hoped even a naive implementation wouldn't take up nearly twice as much memory as the dynamically typed Python one.

Update: I tried the +RTS -s flags suggested by someone on r/haskell and it claimed only 1.5 GB of memory was in use. But that can't be true, can it? top showed a 3 GB increase in active memory. Weird.


So that's how that went. Feel free to critique my solutions in the comments. Lastly, here's the script I used to generate the test cases:

import string, json, time, random

allowed_chars = string.ascii_lowercase + string.digits

def gendata():
    start = time.time()
    funcs = (
        mkset_full_prefix_completion,
        mkset_partial_prefix_completion,
        mkset_typed_no_completion,
        mkset_prefix_ambiguous,
        mkset_contain_ambiguous,
        mkset_contain_completion,
        mkset_sparse_ambiguous,
        mkset_sparse_completion,
    )
    sets = [random.choice(funcs)() for i in range(1000000)]
    end = time.time()
    print(f"generated data in {end - start}")
    with open('data', 'w') as f:
        json.dump(sets, f)

def mkset_full_prefix_completion():
    """Full completion from prefix is possible."""
    typed = randomchars(allowed_chars, 1, 4)
    target = typed + randomchars(allowed_chars, 5, 10) 
    # To ensure there's only one possibility, others can't have the last char in typed.
    new_chars = ''.join(c for c in allowed_chars if c != typed[-1])
    possibilities = [target,
        *(randomchars(new_chars, 5, 15) for i in range(random.randint(2, 19)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_partial_prefix_completion():
    """Partial completion from prefix is possible."""
    typed = randomchars(allowed_chars, 0, 3)
    common = typed + randomchars(allowed_chars, 1, 5)
    # Make sure there's one that uses a continuation the others don't, so full completion isn't possible.
    dropped_char = random.choice(allowed_chars)
    new_chars = ''.join(c for c in allowed_chars if c != dropped_char)
    possibilities = [common + dropped_char + randomchars(allowed_chars, 5, 10),
        *(common + randomchars(new_chars, 5, 10) for i in range(random.randint(2, 19)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_typed_no_completion():
    """The user's typed something with no completion."""
    typed = randomchars(allowed_chars, 5, 10)
    # Drop one char of the typed string from the alphabet, so no possibility can match any rule.
    dropped_char = random.choice(typed)
    new_chars = ''.join(c for c in allowed_chars if c != dropped_char)
    possibilities = [randomchars(new_chars, 5, 15) for i in range(random.randint(2, 20))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_prefix_ambiguous():
    """There are multiple things that start with what the user's typed, and immediately diverge."""
    typed = randomchars(allowed_chars, 1, 5)
    p1 = typed + randomchars(allowed_chars, 5, 10)
    # Pick another possibility that immediately diverges from p1.
    p2_chars = ''.join(c for c in allowed_chars if c != p1[len(typed)])
    p2 = typed + randomchars(p2_chars, 5, 10)
    possibilities = [p1, p2,
        *(randomchars(allowed_chars, 5, 15) for i in range(random.randint(2, 18)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_contain_ambiguous():
    """There's nothing that starts with what the user's typed, but two that contain it."""
    typed = randomchars(allowed_chars, 1, 5)
    p1 = randomchars(allowed_chars, 2, 5) + typed + randomchars(allowed_chars, 2, 5)
    # Exclude an arbitrary char from the filler alphabet; p1 and p2 both containing typed already makes this set ambiguous.
    new_chars = ''.join(c for c in allowed_chars if c != p1[len(typed)])
    p2 = randomchars(new_chars, 2, 5) + typed + randomchars(new_chars, 2, 5)
    possibilities = [p1, p2,
        *(randomchars(new_chars, 5, 15) for i in range(random.randint(2, 18)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_contain_completion():
    """There's nothing that starts with what the user's typed, but one that contains it."""
    typed = randomchars(allowed_chars, 1, 5)
    match = randomchars(allowed_chars, 2, 5) + typed + randomchars(allowed_chars, 2, 5)
    # Make sure nothing else can contain it.
    dropped_char = random.choice(typed)
    new_chars = ''.join(c for c in allowed_chars if c != dropped_char)
    possibilities = [match,
        *(randomchars(new_chars, 5, 15) for i in range(random.randint(2, 19)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_sparse_ambiguous():
    """There's nothing that starts with what the user's typed, nothing that contains it,
    but two that contain all the characters."""
    typed = randomchars(allowed_chars, 2, 5)
    p1 = intersperse_chars(typed, randomchars(allowed_chars, 5, 15))
    p2 = intersperse_chars(typed, randomchars(allowed_chars, 5, 15))
    # Make sure nothing else can contain it.
    dropped_char = random.choice(typed)
    new_chars = ''.join(c for c in allowed_chars if c != dropped_char)
    possibilities = [p1, p2,
        *(randomchars(new_chars, 5, 15) for i in range(random.randint(2, 18)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def mkset_sparse_completion():
    """There's nothing that starts with what the user's typed, nothing that contains it,
    but one that contains all the characters."""
    typed = randomchars(allowed_chars, 2, 5)
    match = intersperse_chars(typed, randomchars(allowed_chars, 5, 15))
    # Make sure nothing else can contain it.
    dropped_char = random.choice(typed)
    new_chars = ''.join(c for c in allowed_chars if c != dropped_char)
    possibilities = [match,
        *(randomchars(new_chars, 5, 15) for i in range(random.randint(2, 19)))]
    random.shuffle(possibilities)
    return {'typed': typed, 'possibilities': possibilities}

def randomchars(charset, lower, upper):
    return ''.join(random.choice(charset) for i in range(random.randint(lower, upper)))

def intersperse_chars(chars, string):
    """Scatter chars, in order, through string, so they appear as a subsequence without appearing contiguously."""
    final, string = string[0], string[1:]
    for c in chars:
        skip = random.randint(0, 3)
        final += c
        final += string[skip:]
        string = string[:skip]
    return final + string

if __name__ == '__main__': gendata()
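As a sanity check on the generator: the "full prefix completion" sets are supposed to have exactly one possibility starting with typed, and because the fillers are drawn from an alphabet missing typed's last character, they can never contain typed at all. Here's a self-contained sketch of that property (mirroring mkset_full_prefix_completion rather than importing it):

```python
import random, string

random.seed(0)  # reproducible sketch
allowed = string.ascii_lowercase + string.digits

for _ in range(1000):
    typed = ''.join(random.choices(allowed, k=random.randint(1, 4)))
    target = typed + ''.join(random.choices(allowed, k=random.randint(5, 10)))
    # Fillers come from an alphabet without typed's last char, so they
    # can't contain typed, let alone start with it.
    filler_chars = [c for c in allowed if c != typed[-1]]
    fillers = [''.join(random.choices(filler_chars, k=random.randint(5, 15)))
               for _ in range(random.randint(2, 19))]
    matches = [p for p in [target, *fillers] if p.startswith(typed)]
    assert matches == [target]
print("ok")
```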


Discussion


I think one of the reasons your Haskell code is slow and memory-hungry is that you use lists.

For your case (prefix search), the "correct" data structure would be a trie:
hackage.haskell.org/package/text-t...
This is a specialized tree; you can use it as is via fromList with (Text, [Text]), which is isomorphic to your Case type.

All lookups should then be O(log n) instead of O(n), better aligned in memory, and involve far fewer thunks and other memory indirections (costing RAM and speed).
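[The trie idea above can be sketched with a plain dict-based trie; this is the editor's own Python illustration of the data structure, not the text-trie package's API:]

```python
def trie_insert(trie, word):
    node = trie
    for ch in word:
        # Descend, creating child nodes as needed.
        node = node.setdefault(ch, {})
    node['$'] = word  # sentinel marking a complete word

def with_prefix(trie, prefix):
    # Walk down to the node for the prefix; if any step is missing,
    # nothing starts with the prefix.
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    # Collect every complete word in the subtree under that node.
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == '$':
                out.append(child)
            else:
                stack.append(child)
    return out

trie = {}
for w in ["filename", "filesystem", "todo"]:
    trie_insert(trie, w)
print(sorted(with_prefix(trie, "fi")))  # -> ['filename', 'filesystem']
```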

--

I know that "changing the algorithm" (compared to Python/Go) is kind of cheating. But that's what you should be allowed to do when the language gives you such easy methods for it ;)

Also, adding a

`using` parListChunk 32

to the evaluation would make it parallel ;)