Fast Word Matching

#parsing #programming #algorithms

The problem

Let's say we have a problem recognizing if one word is the same as the other. In other words we want to understand if two words are identical or not. For example such a task will arise when you are writing your own parser, let's say on a web server. When a string is tokenized , then we should fastly compare two tokens and make a decision if they are equal.
If we are talking about the HTTP protocol, it has a request, a response, headers and so on. All of these types have a limited number of words like GET, POST, PUT, HOST. This is a so-called text protocol, there are also binary protocols.

Assume you are familiar with this, at least at a basic level. We won't consider how a parsing works, we only consider words matching. When a client is sending a request, our server must understand if this request is suitable and equal to a word from the protocol's RFC. One of the requirements is to do this as fast as possible.

Solutions

1. Naive approach

Some algorithms come to mind, like a naive approach that is just comparing each letter in a word. It's what someone came up with. The code in C might look like:

int word_matching(const char* word1, const char* word2, size_t len)
{
    size_t i=0;
    while(word1[i] == word2[i] && i < len)
        i++;
    return (i == len) ? 0 : -1;
}
int main (int argc, char** argv)
{
    char *a="HOST";
    char *b="HOST";
    int result;
    result=word_matching(a, b, 4);
    if(result == 0)
        printf("Words are identical\n");
    else
        printf("Words are different\n");
}

This is a pretty easy program. Let's analyze the complexity of the approach. In this case it is O(n). The longer the string, the more iterations to be done. With the function we should be careful and check the length of the strings too. Or compare if a symbol is zero(end sign). We assume that strings’ length are equal.
There are functions strcmp(), strncmp(), strcasecmp(), strncasecmp() in the standard library. Or even memcmp() in Unix-like OSes. Functions with 'n' in the name are absolutely the same, but safe for buffer overflow. All these functions also compare strings in lexicographic order. We say 'string' but keep in mind a 'word'.

2. Improving solution

Is there a way to increase the efficiency of the comparison function?

There is a function strstr() to search substring in a string. If we assume a substring is the entire string, this function seems to be also suitable for our goal. In other words, the pattern for searching is what the server receives as input. This function can be based on the Knuth-Morris-Pratt algorithm(KMP). Truly speaking, it's not the only one algorithm for searching for a pattern in a string.
Let’s have a look at the following code:

#include <stdio.h>
#include<string.h>
int main (int argc, char** argv)
{
   char *a="HOST";
   char *b="HOST";
   char* result=NULL;
   result=strstr(a,b);
   if(result != NULL)
       printf("words are identical, result=%s\n", result);
   else
       printf("Words are different\n");
}

The complexity is O(n+m), where n is a string length, m is a pattern length. But as we can see, the example above is not right. Let's take two words. One (a) is 'HOST' - is an HTTP header and imagine, the server got an HST(b) as an input. In this case strstr() will return a NULL pointer. It's OK. But, if the server takes a word like 'HOS'. What happens? Consider such a situation. strstr() returns a pointer to the beginning of the string HOST, because HOS is a substring of HOST. In this case one more check is required, which leads to making such an approach even more complex than the first one. Again, we assume, N == M, that gives a complexity of O(n).This does not give any advantages compared to the first solution.
3. Compare char by char

So this way, it is proposed to compare each symbol in the input string to symbols in a pattern.
Patterns are to be compared defined in the HTTP protocol RFC:

GET,                 
POST,           
HEAD,           
PUT,            
DELETE,         
CONNECT,        
OPTIONS,            
TRACE,              
PATCH,

Let me give an example (I skip header file):

http_method_t find_http_method_if_else(const char *sval)
{
    if(sval[0] == 'G' && sval[1] == 'E' && sval[2] == 'T')
        return GET;
    else if(sval[0] == 'P' && sval[1] == 'O' && sval[2] == 'S' && sval[3] == 'T')
        return POST;
    else if (sval[0] == 'H' && sval[1] == 'E' && sval[2] == 'A' && sval[3] == 'D')
        return HEAD;
    else if(sval[0] == 'P' && sval[1] == 'U' && sval[2] == 'T')
        return PUT;
    else if(sval[0] == 'D' && sval[1] == 'E' && sval[2] == 'L' && sval[3] == 'E' && sval[4] == 'T' && sval[5] == 'E')
        return DELETE;
    else if(sval[0] == 'C' && sval[1] == 'O' && sval[2] == 'N' && sval[3] == 'N' && sval[4] == 'E' && sval[5] == 'C' && sval[6] == 'T')
        return CONNECT;
    else if(sval[0] == 'O' && sval[1] == 'P' && sval[2] == 'T' && sval[3] == 'I' && sval[4] == 'O' && sval[5] == 'N' && sval[6] == 'S')
        return OPTIONS;
    else if(sval[0] == 'T' && sval[1] == 'R' && sval[2] == 'A' && sval[3] == 'C' && sval[4] == 'E')
        return TRACE;
    else if(sval[0] == 'P' && sval[1] == 'A' && sval[2] == 'T' && sval[3] == 'C' && sval[4] == 'H')
        return PATCH;
    return INVALID_METHOD;

Here, there is no loop against a string, which reduces CPU usage. It is better, we compare symbols with 'AND' condition. Depending on comparing status, the function returns a number (enumeration) which indicates a result. But still if we compare the word 'PATCH'(the last word in the function), the function will work 9 times (because the number of words is nine). It is the worst case, the best scenario will be GET. Of course, if you are sure of the probability of words (in our case GET is the most frequently encountered word, that's why this word was put at the first position). And again the same question arises. How can I accelerate?
4. Regular expressions

There is one popular approach used for parsing strings and comparing strings is to use regular expressions.
This method might be difficult to implement and debug, and does not have good complexity. Another disadvantage of regular expressions is different dialects (like Perl, POSIX etc), which are complex and complicated. Sometimes it makes you confused, especially if you have chosen regex for parsing protocols.

5. Use Finite State Machine

As shown above, we have a set of words like GET, POST etc, that is a limited set. So we can use this property and build the most efficient solution. In this way, there is no loop against letters, no dynamic allocations etc, pure static. And no steps back. Static (since keywords are known to us) gives O(1) complexity, that is constant time. It is known as the Finite State Machine (FSM). To create a matching function we should use switch-case statements, where each character is a state in this machine. There are disadvantages: FSM is created manually (FSM should recognize all valid words) and code is getting massive switch-case statements. I don’t want to consider the theory, all sources were put at the end of the article.

Here is a piece of code, which demonstrates finding methods starting from P: POST,PUT, PATCH. Again, I skip the header file where some custom enums are defined.

http_method_t str_to_http_method(const char *sval)
{
    http_method_t result=INVALID_METHOD;
    switch (sval[0])
    {
    case 'P':
        // POST, PUT, PATCH
        switch(sval[1])
        {
        case 'O':
            // POST
            switch(sval[2])
            {
            case 'S':
                switch(sval[3])
                {
                case 'T':
                    result = HTTP_METHOD_POST;
                    break;
                default:
                    return INVALID_METHOD;
                }
                break;
            default:
                return INVALID_METHOD;
            }
            break;
        case 'U':
            //PUT
            switch (sval[2])
            {
            case 'T':
                result = HTTP_METHOD_PUT;
                break;
            default:
                return INVALID_METHOD;
            }
            break;
        case 'A':
            //PATCH
            switch(sval[2])
            {
            case 'T':
                switch(sval[3])
                {
                case 'C':
                    switch(sval[4])
                    {
                    case 'H':
                        result = HTTP_METHOD_PATCH;
                        break;
                    default:
                        return INVALID_METHOD;
                    }
                    break;
                default:
                    return INVALID_METHOD;
                }
                break;
            default:
                return INVALID_METHOD;
            }
            break;
        default:
            return INVALID_METHOD;
        }
        break;
    }
    return result;
}

So, this code works with three HTTP requests, which start with a ‘P’ letter (start state). Let’s consider this. In the beginning, we have ‘P’ as the first character (we go to the 1st state of the machine), then if we have the next ‘A’, if this state is acceptable, we go to the second state of FSM and so on. If at any time a token is not matched, the machine is going to the last state (reject state) and we receive INVALID_METHOD (which is defined as -1).
Moreover, it’s possible to write a state machine for case insensitive words. It’s enough to add extra string like:

case 'P':
case 'p':

And this will not result in any adverse effects.

Wide strings

The article described a comparison of words containing only ASCII characters. For most purposes it is OK, because protocols are written using ASCII symbols. But if you go further and distribute your knowledge across all sorts of symbols, you should keep in mind, there are more than just ASCII symbols. ASCII symbols are useful for the Latin alphabet, but there are national alphabets encoded in UTF (multibyte symbols). For storing strings with such symbols, there are wide strings. This topic is out of the article and should be considered on your own.

Conclusion

Using static allocation and static comparing gives less flexibility but higher performance. In everyday life we should find an optimal solution or a compromise. This is also regarding programming. As described above, there are different approaches which have significant differences in performance.
Combining a few ways it's possible to achieve maximum performance and the necessary flexibility.
Such an approach, of course, can be applied to any internet protocol parsing, either text or binary. You can have a look at high performance parser like [6].

Sources:
[1] Extended Example: Text Parser State Machine. URL:https://w3.cs.jmu.edu/kirkpams/OpenCSF/Books/csf/html/StateModelImplementation.html
[2] KMP Algorithm for Pattern Searching URL: https://www.geeksforgeeks.org/kmp-algorithm-for-pattern-searching/
[3] Naive algorithm for Pattern Searching URL: https://www.geeksforgeeks.org/naive-algorithm-for-pattern-searching
[4] Thomas H. Cormen. Introduction to algorithms
[5] Robert David Graham, Peter C. Johnson. Finite State Machine Parsing for Internet Protocols: Faster Than You Think. URL: https://www.ieee-security.org/TC/SPW2014/papers/5103a185.PDF
[6] https://github.com/nodejs/http-parser/blob/main/http_parser.c