DEV Community

Noah11012
Noah11012

Posted on

String Internalization

Many modern interpreted programming languages like Python and Lua have a builtin feature that allows for O(1) comparisons of strings. Ordinarily, comparing two strings is a O(min(x, y)) operation that for huge strings can cause slow downs in performance. Compilers and interpreters can utilize this to their advantage to improve performance when checking against keywords and such. This will hopefully allow the programmer to focus on the more important tasks on hand like lexing, parsing, and generation of assembly instructions or byte code.

How does string internalization work? Like all other problems when related to the computer and the world, everything has a trade off. The two main ones are efficiency and memory usage and consumption. Some solution to a problem may have a small footprint when it comes to memory but might be slow comparatively to something that uses more memory but is more efficient. There are a few ways we can go about to internalize a string but we will do it how I have been doing it.

First, we need a data structure to store the strings themselves and to solve that we will use a stretchy buffer. If you don't know, they were invented by Sean Barrett and is similar to C++'s std::vector. Because this article is dealing with string internalization, we will not focus on the implementation details of a stretchy buffer.

// If buffer is NULL, create it and push the item.
// If buffer is full, reallocate and push the item.
#define buffer_push(buffer, item) ...

// Gets the length of a stretchy buffer.
#define buffer_length(buffer) ...

When it comes time to compare two internalized strings, what is actually being compared? Well, in our version, it is the memory address of the strings. If both are the same, then the strings are the same, if not, then the strings are not the same.

When we internalize a string, we will have to search the list and compare each entry against the one we are trying to internalize. This is the price we pay, but in the long is worth it as you can hold onto the memory address long as you want to.

/* Where all internalized strings will be stored. */
static char *inter_strings;

char const *string_internalize(char const *string)
{
    /* Search if the string already exists in the list. */
    int i = 0;
    while(i < buffer_length(inter_strings))
    {
        /* If found, return the internalized string. */
        if(string_compare(string, inter_strings))
            return inter_strings;
        ++i;
    }

    /* If not found, create internalized copy and return it. */
    char* new_string;
    int string_len = string_length(string);

    new_string = calloc(string_len + 1, sizeof(char));

    memory_copy(new_string, string, string_len);

    buffer_push(inter_strings, new_string);

    return new_string;
}

This is definitely a short one, but thought it was something that might be helpful.

Top comments (0)