Before the title of this post scares you away, let me assure you that there is a valuable programmer-life lesson embedded in this story.
I'll get right to it.
Part One: Discovering the Rabbit Hole
I began the weekend with some extra time on my hands and decided to spend some of it browsing the depths of HackerRank's inventory of software engineering interview questions. I'm sure many developers out there agree: this is an exercise in futility - many of those problems leave us walking away not with a reinforced sense of our own abilities, but with a heightened case of impostor syndrome.
I chose a random question, titled Determining DNA Health, and I read the problem description. The gist is that you're given an array of genes, an array of health scores associated with the genes, and a string representation of a strand of DNA composed of, among some noise, the available genes. The task is to calculate the "total health score" of each DNA strand that HackerRank generates and then output the minimum and maximum scores.
At first glance this does not seem too bad. But at second glance it seems extremely bad. One hint is the domain: representing DNA means lots of genes and very long DNA strings. Whatever the solution is, it will have to be efficient.
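To make that concrete, here is a naive brute-force sketch in Python (the inputs are made up for illustration - the real problem also restricts each query to a slice of the gene array). It re-scans the strand once per gene, which is exactly the kind of behavior the problem is designed to punish at scale:

```python
def total_health(genes, healths, strand):
    """Naively sum the health of every (possibly overlapping) gene occurrence."""
    total = 0
    for gene, health in zip(genes, healths):
        start = strand.find(gene)
        while start != -1:
            total += health
            start = strand.find(gene, start + 1)  # step by one to allow overlaps
    return total

# Tiny illustrative inputs, not HackerRank's test data
print(total_health(["a", "aa"], [1, 5], "caaab"))  # → 13 (three "a"s, two "aa"s)
```

With thousands of genes and strands millions of characters long, scanning once per gene is hopeless - hence the hunt for something better.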
So I thought about it for a while, then headed over to the discussion boards to see what other people had to say about it.
One user emphasized that this is a fantastic question - it provides one with the opportunity to take a deep dive into "string matching algorithms". This person went on to say that to solve this problem, you need to know quite a bit about such algorithms. KMP (Knuth-Morris-Pratt - what?) won't work here; you'll need much more advanced techniques than it has to offer. Aho-Corasick (WHAT?) is the right option, and you need to know why it's the correct one, they said.
I was actually quite intrigued. How much could there really be to know about string matching algorithms? How many could there be? What the heck is Aho-Corasick and why is it so perfect for this problem domain? Wandering through the developer wilderness, where many such depths exist, I decided to take a dive into this somehow compelling rabbit hole.
Part Two: Entering the Rabbit Hole
I started by watching some YouTube videos. I checked out one on KMP and another on AC (the latter was a great primer but missed some critical points about the algorithm - read the Wikipedia article on AC if you want to know more).
I immediately understood why KMP wasn't good enough and why AC was perfect. KMP searches for a single word in a text string in linear O(n) time, which is great! But our problem gives us an entire dictionary of words, and we have to search the text for all of them - running KMP once per word means re-scanning the whole text for every pattern. I learned how KMP works, but didn't venture far into its own rabbit tunnel.
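For reference, here is a compact KMP sketch in Python (my own condensed version, not from any particular source). The failure table - the length of the longest proper prefix that is also a suffix - is the very idea that generalizes into Aho-Corasick's suffix links:

```python
def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text,
    in O(len(text) + len(pattern)) time. Assumes a non-empty pattern."""
    # fail[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going: occurrences may overlap
    return matches

print(kmp_search("abababc", "abab"))  # → [0, 2]
```

Note that the text pointer `i` only ever moves forward; all the backtracking happens in the precomputed table.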
Aho-Corasick is perfect because it allows you to find matches from huge dictionaries of words, still in linear time. It uses a very well-named data structure called an automaton, which in this context is a trie-based state machine. The words in the dictionary are encoded in the trie, a special type of tree where each path from the root spells out a prefix, so words that share prefixes share nodes.
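A minimal trie sketch in Python (the node layout and names here are my own illustration, not the structure from my actual implementation):

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.word = None    # the dictionary word that ends at this node, if any

def build_trie(words):
    """Insert each word character by character, sharing common prefixes."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root

root = build_trie(["he", "she", "his", "hers"])
print(root.children["h"].children["e"].word)  # → he
```

Here "he", "his", and "hers" all hang off the same `h` node - that sharing is what makes the structure compact.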
What makes the trie an automaton are the extra links between the nodes: suffix (or "failure") links that tell the search where to resume when the next character doesn't match, so no character of the text is ever re-scanned. The AC algorithm happily flows through the automaton, emitting word matches as it finds them until the end of the string is reached. It's beautiful, really.
Part Three: Living in the Rabbit Hole
I was hooked. Why even leave this place? Why not take it a step further and implement this thing?
That is exactly what I did. Using the AC Wikipedia article as a guide, which has a great plaintext paragraph on how the algorithm works, I began constructing the data structure.
This involved building on my knowledge of graph data structures, breadth-first search, tree traversal, recursion, and more. It may have been the most fruitful exploration of these topics I've ever had, even counting my six years of formal education in the field. I hacked away at the thing, writing a set of robust unit tests to ensure the automaton had the exact structure it should once the builder functions completed their work.
The next implementation challenge was the AC search algorithm itself. Building the automaton is definitely the most complex part of the whole thing - both intellectually and in the work the structure does to construct itself from a dictionary - but the search algorithm is what makes the state machine useful, and it took hours of tweaking to get it just right.
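To make the two pieces concrete, here is a compact sketch of both in Python - my own condensed illustration of the textbook algorithm, not the code from my repo. The builder inserts each word into a trie, then runs a BFS to fill in the suffix links; the search function then just walks the automaton:

```python
from collections import deque

def build_automaton(words):
    """Encode words in a trie, then BFS to fill in the suffix (failure) links."""
    # Each node: outgoing edges, a failure link, and the words that end here
    trie = [{"next": {}, "fail": 0, "out": []}]
    for word in words:
        node = 0
        for ch in word:
            if ch not in trie[node]["next"]:
                trie.append({"next": {}, "fail": 0, "out": []})
                trie[node]["next"][ch] = len(trie) - 1
            node = trie[node]["next"][ch]
        trie[node]["out"].append(word)

    # BFS from the root: a node's suffix link is found by walking up the
    # parent's failure chain until some node has an edge for the same char.
    queue = deque(trie[0]["next"].values())
    while queue:
        node = queue.popleft()
        for ch, child in trie[node]["next"].items():
            fail = trie[node]["fail"]
            while fail and ch not in trie[fail]["next"]:
                fail = trie[fail]["fail"]
            trie[child]["fail"] = trie[fail]["next"].get(ch, 0)
            # Inherit the matches that end at the suffix-linked node
            trie[child]["out"] += trie[trie[child]["fail"]]["out"]
            queue.append(child)
    return trie

def search(trie, text):
    """Yield (end_index, word) for every dictionary word found in text."""
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in trie[node]["next"]:
            node = trie[node]["fail"]
        node = trie[node]["next"].get(ch, 0)
        for word in trie[node]["out"]:
            yield i, word

trie = build_automaton(["he", "she", "his", "hers"])
print(list(search(trie, "ushers")))  # → [(3, 'she'), (3, 'he'), (5, 'hers')]
```

Notice that finding "she" reports "he" for free - the suffix-link inheritance during the BFS is what makes that happen.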
I finally got things to the point where all of my test cases were passing; even very large dictionaries and query strings yielded the expected results. This was a very exciting moment in the rabbit hole I now called home.
I started adding some layers around the automaton, with extra features like suggesting words similar to the ones passed into it. This is like the "did you mean...?" feature you see in various applications.
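The shape of that idea can be faked in a few lines with Python's standard library - my actual feature sits on top of the automaton, but `difflib` shows what the ranking looks like (the dictionary here is made up for illustration):

```python
import difflib

# Hypothetical dictionary; the real feature layers on the automaton,
# but the idea is the same: rank dictionary words by similarity.
dictionary = ["apple", "apply", "ample", "maple"]
suggestions = difflib.get_close_matches("appel", dictionary, n=2, cutoff=0.6)
print(suggestions)
```

Both "apple" and "apply" come back for the misspelling "appel", ranked by similarity ratio.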
I'm even working on some other applications, like plagiarism detection, real-time misspelled-word underlining, and other tech that already exists. But the cool thing is, because of my time in this rabbit hole, I have a deep understanding of how these technologies work.
Here is a summary of things I didn't know and now do, and some things I knew but now know better:
- Some really cool string pattern matching algorithms
- The 'trie' data structure, how to implement it, and why it's useful
- More applications of breadth-first search (BFS)
- What an "automaton" is
- More recursion applications
- Tree traversal techniques
- How various applications that probably use AC under the hood actually work
- And more
Part Four: Lessons From the Rabbit Hole
The real essence of all this is that self-guided, deep-dive learning is one of the best kinds of learning. It's fun, it's hands-on, and the connections you make are invaluable. I now have this awesome data structure implemented that I can branch off into other projects, and I can continue improving its efficiency over time.
I will absolutely be going on more deep dives like this one, and I encourage you to do the same!
By the way...
In case you were wondering: no, I haven't passed all the HackerRank test cases yet =D
But I do know that the implementation is correct. HackerRank doesn't like how long it currently takes to process 100,000 queries (the bottleneck seems to be the construction of the automaton, which I am working on improving).
Can you imagine getting this question in a 45-minute interview? Yikes!
If you're interested, you can find my work on it here.
Top comments (2)
Thanks for the topic and the links. I need a way to generate lists of words (actual English words) that have no overlaps - meaning no word is embedded in any other word.
If you've ever had to do a lot of automated search/replace actions in texts, you might understand how the need arises. In the past I've used random strings as an interim step, but I now have a situation where those can't be tolerated in the process.
That had struck me as a rabbit hole I'd need to plunge into to find a solution, and your post has reminded me not to be scared of diving in.
Good topic - this is a subject that always comes up.