Crystal Durham

Posted on Jul 10, 2020 • Edited on Jul 14, 2020

String interners in Rust

#rust #review #interners

This is basically a direct followup to Amos/fasterthanlime's blog post Small strings in Rust. If you haven't read that, I'd highly suggest reading it first: it covers the techniques that I've ~~stolen~~ used here, which I won't go over in much detail.

Amos covers two Rust crates that offer the "small string" optimization: small strings (of less than around 22 bytes) can be stored inline rather than on the heap, like the standard String type. If you have a large number of small strings, this can greatly reduce allocation pressure.

Rather than small string optimization, though, for certain use cases an interner is useful. An interner associates a "symbol" with each unique string that you want to manage. You can then very cheaply copy the symbol around and compare symbols, as they're just a single integer (typically 32 bits). If you need the actual contents of the string again (such as for display), you just ask the interner to translate back from your symbol to a string.

I took a look at a few of the top interning crates for Rust (there are quite a few!) as ranked by lib.rs and compared them to see what their allocation behavior was. The testing harness is basically a direct copy of the one Amos used for small strings.

Specifically, I've tested

string-interner version 0.7.1*
lasso version 0.2.2*
lalrpop-intern version 0.15.1
intaglio version 1.1.0
strena commit 1ddbf48b9f639ffbb4c89b0d51cb005fc0e3a4f7

* These crates have later versions published after this article -- see the postscript for updates.

I couldn't get the ASCII plot to reproduce properly here, so instead you get some xkcd-style plots. All measurements were done on a Windows machine.

Measurement code is available on GitHub.

`std::string::String`

As a baseline, here's the process with just the standard String type. We just create a list of owned strings for each of the strings on a word list of 7776 words.

    fn std_collect_words(&self) {
        crate::ALLOCATOR.set_active(true);
        let mut words: Vec<String> = Vec::with_capacity(WORDS.len());
        crate::ALLOCATOR.mark_point();
        for &word in WORDS {
            words.push(word.into());
            crate::ALLOCATOR.mark_point();
        }
        crate::ALLOCATOR.set_active(false);
    }

 total events | 7777
  peak bytes  | 241.0 KB
 ----------------------------
 alloc events | 7777
 alloc bytes  | 241.0 KB
 ----------------------------
 freed events | 0
 freed bytes  | 0 B

I see a ~190 KB allocation for the words vector, and then a small allocation for each string after.

`string-interner`

A data structure to cache strings efficiently, with minimal memory footprint and the ability to assicate the interned strings with unique symbols. These symbols allow for constant time comparisons and look-ups to the underlying interned string contents. Also, iterating through the interned strings is cache efficient.

Note the changing y axis.

    fn interner_collect_words(&self) {
        crate::ALLOCATOR.set_active(true);
        let mut words = string_interner::DefaultStringInterner::with_capacity(WORDS.len());
        crate::ALLOCATOR.mark_point();
        for &word in WORDS {
            words.get_or_intern(word);
            crate::ALLOCATOR.mark_point();
        }
        crate::ALLOCATOR.set_active(false);
    }

 total events | 7778
  peak bytes  | 588.4 KB
 ----------------------------
 alloc events | 7778
 alloc bytes  | 588.4 KB
 ----------------------------
 freed events | 0
 freed bytes  | 0 B

I see a ~540 KB allocation in two chunks for the interner, and then a small allocation for each string after.

`lasso`

A multithreaded and single threaded string interner that allows strings to be cached with a minimal memory footprint, associating them with a unique key that can be used to retrieve them at any time. A Rodeo allows O(1) internment and resolution and can be turned into a RodeoReader to allow for contention-free resolutions with both key to str and str to key operations. It can also be turned into a RodeoResolver with only key to str operations for the lowest possible memory usage.

Lasso is the only library in this set that has special support for interning from multiple threads. All other libraries always require exclusive access to intern new symbols. We measure the single-thread interner here to be more fair.

    fn lasso_collect_words(&self) {
        crate::ALLOCATOR.set_active(true);
        let mut words: lasso::Rodeo = lasso::Rodeo::with_capacity(WORDS.len());
        crate::ALLOCATOR.mark_point();
        for &word in WORDS {
            words.get_or_intern(word);
            crate::ALLOCATOR.mark_point();
        }
        crate::ALLOCATOR.set_active(false);
    }

 total events | 23
  peak bytes  | 591.8 KB
 ----------------------------
 alloc events | 20
 alloc bytes  | 592.1 KB
 ----------------------------
 freed events | 3
 freed bytes  | 312 B

I see a ~540 KB allocation in three chunks for the interner, and then a chunked allocation of ~4 KiB around every 550 symbols.

`lalrpop-intern`

Simple string interner used by LALRPOP

This test is designed to be a best-case scenario: we know the number of symbols ahead of time and tell the interner about it so it can hopefully pre-allocate. Lalrpop's interner is purely global, though, so we can't do that. This results in a very unfair comparison, so I'm going to defer showing LALRPOP's results until the section without pre-allocation.

`intaglio`

UTF-8 string and bytestring interner and symbol table. Used to implement storage for the Ruby Symbol table and the constant name table in Artichoke Ruby.

    fn intaglio_collect_words(&self) {
        crate::ALLOCATOR.set_active(true);
        let mut words = intaglio::SymbolTable::with_capacity(WORDS.len());
        crate::ALLOCATOR.mark_point();
        for &word in WORDS {
            words.intern(word).unwrap();
            crate::ALLOCATOR.mark_point();
        }
        crate::ALLOCATOR.set_active(false);
    }

 total events | 2
  peak bytes  | 606.2 KB
 ----------------------------
 alloc events | 2
 alloc bytes  | 606.2 KB
 ----------------------------
 freed events | 0
 freed bytes  | 0 B

... wait, what‽ Oh, right, intaglio has special handling when interning &'static str which doesn't bother to copy the strings and just refers to the static string you gave it. Smart, but for a fair comparison we need to stop it from doing that...

    fn intaglio_dyn_collect_words<'a>(&'a self) {
        crate::ALLOCATOR.set_active(true);
        let mut words = intaglio::SymbolTable::with_capacity(WORDS.len());
        crate::ALLOCATOR.mark_point();
        for &word in WORDS {
            words.intern(String::from(word)).unwrap();
            crate::ALLOCATOR.mark_point();
        }
        crate::ALLOCATOR.set_active(false);
    }

 total events | 7778
  peak bytes  | 660.6 KB
 ----------------------------
 alloc events | 7778
 alloc bytes  | 660.6 KB
 ----------------------------
 freed events | 0
 freed bytes  | 0 B

I see a ~610 KB allocation in two chunks for the interner, and then a small allocation for each string after.

`strena`

As opposed to most other interners, this interner stores all of the interned strings in a single concatenated string. This reduces allocation space required for the interned strings, as well as fragmentation of the memory held by the interner.

This is a new string interner I've written with the intent of reducing the amount of fragmented memory a string interner has to hold (thus this comparison). Hopefully it does well:

    fn strena_collect_words(&self) {
        crate::ALLOCATOR.set_active(true);
        let mut words = strena::Interner::with_capacity(strena::Capacity {
            symbols: WORDS.len(),
            bytes: WORDS.len() * 5, // google says average word length is 4.7
        });
        crate::ALLOCATOR.mark_point();
        for &word in WORDS {
            words.get_or_insert(word);
            crate::ALLOCATOR.mark_point();
        }
        crate::ALLOCATOR.set_active(false);
    }

 total events | 5
  peak bytes  | 260.8 KB
 ----------------------------
 alloc events | 4
 alloc bytes  | 260.8 KB
 ----------------------------
 freed events | 1
 freed bytes  | 38.9 KB

I see a ~180 KB allocation in three chunks for the interner, and then a single realloc around 5.5k symbols in, likely because our wordlist has an average word length greater than five.

The benchmark was basically chosen to show off strena's strong side, though, so digest it with that in mind. I can tell it exactly how to pre-allocate to fit the incoming data.

So what?

We've learned that intaglio is the only interner (of these highly ranked ones) that has special support for interning already-'static strings, and that lasso is the only published one that doesn't allocate every interned string separately. However, we also see that using an O(1) interner does have a noticeable memory impact over just a list of strings. At this scale, string-interner has approximately a 145% overhead; lasso, 145%; intaglio, 175%; strena, 10%.

lasso is already doing half of the clever things strena is doing, though; I'll see if it's possible to reduce lasso's memory overhead before publishing strena myself.

But this is a deliberately-crafted best-case scenario for strena, so

What if it's not best-case?

I've adjusted each of the test cases to default-construct the interner. Rapid-fire, how do each of the interners perform in this situation?

 total events | 7799
  peak bytes  | 323.4 KB
 ----------------------------
 alloc events | 7788
 alloc bytes  | 447.5 KB
 ----------------------------
 freed events | 11
 freed bytes  | 196.5 KB

 total events | 7826
  peak bytes  | 795.6 KB
 ----------------------------
 alloc events | 7802
 alloc bytes  | 1.1 MB
 ----------------------------
 freed events | 24
 freed bytes  | 540.8 KB

 total events | 71
  peak bytes  | 799.1 KB
 ----------------------------
 alloc events | 44
 alloc bytes  | 1.1 MB
 ----------------------------
 freed events | 27
 freed bytes  | 541.1 KB

 total events | 15602
  peak bytes  | 1.1 MB
 ----------------------------
 alloc events | 15578
 alloc bytes  | 1.6 MB
 ----------------------------
 freed events | 24
 freed bytes  | 737.3 KB

 total events | 6
  peak bytes  | 811.0 KB
 ----------------------------
 alloc events | 4
 alloc bytes  | 909.3 KB
 ----------------------------
 freed events | 2
 freed bytes  | 303.1 KB

Intaglio has a default capacity of 4096, so it really cheats this measurement. I've used a starting capacity of 0 for the dynamic test for a fair comparison, but keep in mind that you probably do want to prime the interner with a decent capacity estimate you plan to fill.

 total events | 7828
  peak bytes  | 861.2 KB
 ----------------------------
 alloc events | 7803
 alloc bytes  | 1.3 MB
 ----------------------------
 freed events | 25
 freed bytes  | 606.3 KB

 total events | 75
  peak bytes  | 254.0 KB
 ----------------------------
 alloc events | 39
 alloc bytes  | 426.1 KB
 ----------------------------
 freed events | 36
 freed bytes  | 213.1 KB

So what? (again)

For a simple overview, here's the overhead over the simple string collecting approach for each library, in peak memory usage, total bytes allocated, and total bytes freed:

string-interner: 145% / 145% / 175%
lasso: 150% / 145% / 175%
lalrpop: 240% / 260% / 275%
intaglio (dyn): 165% / 190% / 205%
strena: -10% / -5% / 5%

Whoops, I accidentally made my library look really good again. And keep in mind: this is something of a worst-case for interners, as we just straight insert a single symbol at a time for 7776 symbols.

In conclusion

I'll leave a final interpretation of the data I've gathered here to you, the reader. For me, I think I'd recommend using lasso currently, and I plan to see if I can upstream some of the cleverness in strena to lasso to decrease their memory usage closer to strena's.

One weekend later...

and we've got new versions of both string-interner (v0.11) and lasso (v0.3) taking advantage the approach strena uses to minimize memory usage!

 total events | 66
  peak bytes  | 319.9 KB
 ----------------------------
 alloc events | 41
 alloc bytes  | 492.3 KB
 ----------------------------
 freed events | 25
 freed bytes  | 213.4 KB

 total events | 41
  peak bytes  | 409.7 KB
 ----------------------------
 alloc events | 24
 alloc bytes  | 634.0 KB
 ----------------------------
 freed events | 17
 freed bytes  | 285.8 KB

The remaining difference in memory usage is likely down to different cache growth strategies. The different libraries are "ahead" at different points.

Additionally, both libraries have added (the ability to take advantage of) the &'static str optimization from intaglio. The fastest allocation is one you don't have to do, after all!

Open Source really did work in this instance, and I feel strena has served its purpose as a research project and doesn't need to be published. string-interner and lasso are better for it and you can take advantage of one of those instead.

DEV Community