Marc Ryan Riginding

Benchmarking or "How to count words really fast"

A colleague at work showed me this interesting GitHub repository: github.com/juditacs/wordcount.

The readme explains what the word-counting challenge is:

The task is to split a text and count each word's frequency, then print the list sorted by frequency in decreasing order. Ties are printed in alphabetical order.
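
To make that concrete, here is a tiny illustrative run (the input and the `./wordcount` binary name are placeholders, not from the repository). Note how apple and pear tie at two occurrences and are therefore printed alphabetically:

```
$ printf 'pear apple plum pear apple' | ./wordcount
apple	2
pear	2
plum	1
```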

@juditacs did a great job of aggregating different solutions in several programming languages and creating a Dockerfile to easily set up and benchmark all programs against a Hungarian Wikipedia text dump. They even wrote two blog posts about the word-counting challenge (1, 2).

The latest benchmark on the readme was done on 2016-06-13:

| Rank | Experiment | CPU seconds | User time | Contributor |
|------|------------|-------------|-----------|-------------|
| 1 | java -Xms2G -Xmx8G -classpath java:java/zah-0.6.jar WordCountOptimized | 140.64 | 191.44 | |
| 2 | java -Xms2G -Xmx8G -classpath java:java/trove-3.0.3.jar WordCountOptimized | 183.17 | 219.39 | |
| 3 | rust/wordcount/wordcount | 208.86 | 193.34 | Joshua Holmer |

Something I find very interesting is that the Java solution was the fastest one submitted. It was faster than the usual suspects: the C, C++, and Rust programs.

This piqued my interest! Why is Java in the same ballpark as the Rust program? Why does the Java baseline solution differ so much from the optimized one? And how can we make the Rust version as fast as, or even faster than, the Java solution?

Luckily the authors of the optimized java solution left some helpful comments:

Key tricks:

  1. Faster hashing
  2. Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array
  3. Process recurring tokens separately from singletons (big performance gain)
  4. Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost

Using the same techniques, we should be able to boost the Rust solution to the top.

Before we start, let's look at the baseline Rust implementation first. This is the Rust solution submitted by Joshua Holmer; I added some comments for people who are new to Rust.
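
In outline, the baseline looks like this (a sketch reconstructed from the diff in step 3 below; see the repository for the exact submitted code):

```rust
use std::collections::HashMap;
use std::io::{self, BufWriter, Read, Write};

fn main() {
    // Read all of stdin into one String (this validates UTF-8).
    let mut buffer = String::new();
    io::stdin().read_to_string(&mut buffer).ok();

    // Count occurrences; the keys borrow from the buffer, so no copies.
    let mut words: HashMap<&str, usize> = HashMap::new();
    for word in buffer.split_whitespace() {
        *words.entry(word).or_insert(0) += 1;
    }

    // Move the entries into a vector so they can be sorted.
    let mut sortable_words: Vec<(&str, usize)> =
        words.iter().map(|(w, c)| (*w, *c)).collect();

    // Sort alphabetically first, then by count; the second sort is
    // stable, so ties stay in alphabetical order.
    sortable_words.sort_by_key(|a| a.0);
    sortable_words.sort_by(|a, b| b.1.cmp(&a.1));

    let mut output = BufWriter::new(io::stdout());
    for word in sortable_words {
        writeln!(output, "{}\t{}", word.0, word.1).ok();
    }
}
```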

The Rust solution is straightforward and the code is fairly easy to understand. We will use this code as the basis for our optimized Rust solution.

Benchmarking

Hyperfine is a benchmarking tool written by the great sharkdp that measures the runtime of CLI commands and collects statistics over multiple runs. The baseline Rust program takes around 28 seconds to process 5M lines of Hungarian Wikipedia.

Start improving the Rust code

1. Faster hashing

Desired effect:
The programming challenge consists of basically two parts: gathering all words into a HashMap, and then moving them into a vector to sort them. Since most of the work happens inside the HashMap, using a faster HashMap should give us some speed.

Changes:
The Rust solution was submitted in April 2016, which means the Rust version used was around 1.10. Rust 1.36 shipped a new default HashMap implementation based on Google's SwissTable design. Just updating to the newest stable Rust gives us some performance improvement.
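
Updating the compiler is all this step needs, but a specialized hasher could push the idea further. A sketch using the fxhash crate (my choice for illustration; the submitted solution does not use it):

```rust
// Cargo.toml: fxhash = "0.2"
use fxhash::FxHashMap;

fn main() {
    let text = "a b b c c c";
    // Same API as std's HashMap, but SipHash is replaced by the much
    // cheaper (non-DoS-resistant) Fx hash function.
    let mut words: FxHashMap<&str, usize> = FxHashMap::default();
    for word in text.split_whitespace() {
        *words.entry(word).or_insert(0) += 1;
    }
    assert_eq!(words["c"], 3);
}
```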

Benchmark:

$ hyperfine -w 1 -m 5 \
    'cat 5M_wiki_dump.xml | ./wordcount_1.10' \
    'cat 5M_wiki_dump.xml | ./wordcount_1.37'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.10
  Time (mean ± σ):     28.131 s ±  0.363 s    [User: 27.292 s, System: 1.661 s]
  Range (min … max):   27.774 s … 28.718 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_1.37
  Time (mean ± σ):     23.357 s ±  0.335 s    [User: 22.644 s, System: 1.346 s]
  Range (min … max):   22.910 s … 23.708 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_1.37'
ran 1.20 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.10'

The updated Rust binary finishes in around 23 seconds. We already improved by 5 seconds: 20% faster than the original solution, just by updating Rust!

2. Buffer I/O, either using Buffered I/O classes or by explicitly reading chunks to an array

Desired effect:
We want to buffer our output to minimize the number of syscalls we make.

Changes:
The solution already uses BufWriter. No need to do anything here.
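
For illustration, here is what buffered output looks like in isolation (a standalone sketch, not a change to the solution):

```rust
use std::io::{self, BufWriter, Write};

fn main() {
    // Rust's stdout is line-buffered: without BufWriter, every writeln!
    // below would end in its own write syscall.
    let stdout = io::stdout();
    // BufWriter collects output in an 8 KiB buffer and flushes it in
    // large chunks; it also flushes automatically when dropped.
    let mut output = BufWriter::new(stdout.lock());
    for i in 0..1000 {
        writeln!(output, "line {}", i).ok();
    }
}
```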

Benchmark:
The runtime stayed the same because we did not change anything 🙂

3. Process recurring tokens separately from singletons (big performance gain)

Desired effect:
All benchmark optimizations walk a thin line between "smart hack" and cheating. This one works well with the test data (Hungarian Wikipedia) because, in natural-language text, a large share of distinct words occur exactly once.
Basically, we create two lists: one with words that occur two or more times, and one with words that occur only once.
The multiple-occurrence list needs to be sorted twice: first alphabetically, then, with a stable sort, by number of occurrences in decreasing order, so that ties stay in alphabetical order. The singleton list only has to be sorted alphabetically, because the count is the same for every entry (n = 1).
Reducing the amount of data that has to go through both sorts reduces the runtime.

Changes:

-    sortable_words.sort_by_key(|a| a.0);
-    sortable_words.sort_by(|a, b| b.1.cmp(&a.1));

+    let mut multiple: Vec<(&str, usize)> = vec![];
+    let mut single: Vec<(&str, usize)> = vec![];
+
+    for (word, count) in words.iter() {
+        if *count > 1 {
+            multiple.push((*word, *count));
+        } else {
+            single.push((*word, *count));
+        }
+    }
+
+    multiple.sort_by_key(|a| a.0);
+    multiple.sort_by(|a, b| b.1.cmp(&a.1));
+    single.sort_by_key(|a| a.0);

     let mut output = BufWriter::new(io::stdout());
-    for word in sortable_words {
+    for word in multiple {
+        writeln!(output, "{}\t{}", word.0, word.1).ok();
+    }
+
+    for word in single {
         writeln!(output, "{}\t{}", word.0, word.1).ok();
     }
 }

link to gist

Benchmark:

$ hyperfine -w 1 -m 5 \
    'cat 5M_wiki_dump.xml | ./wordcount_1.37' \
    'cat 5M_wiki_dump.xml | ./wordcount_single'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.37
  Time (mean ± σ):     23.809 s ±  0.278 s    [User: 23.079 s, System: 1.369 s]
  Range (min … max):   23.401 s … 24.152 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_single
  Time (mean ± σ):     23.257 s ±  0.398 s    [User: 22.533 s, System: 1.360 s]
  Range (min … max):   22.661 s … 23.729 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_single'
ran 1.02 ± 0.02 times faster than
'cat 5M_wiki_dump.xml | ./wordcount_1.37'

Our new binary finished 2% faster than the previous solution.

4. Work directly with raw bytes (safe for UTF-8), to reduce memory and avoid text decoding cost

Desired effect:
Another performance optimization is to work with raw bytes instead of UTF-8 Strings. In Rust, building a String from input bytes validates that they are UTF-8, a cost we can avoid by working on plain byte slices. To print the words, the byte slices are converted back with std::str::from_utf8_unchecked, which can only be called inside an unsafe block. Skipping the validity check is an unsafe operation, but because we know the input is valid UTF-8, we tell the compiler we know what we are doing by wrapping the call in unsafe.

Changes:
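The core of the byte-based version looks roughly like this (a sketch: the multiple/single split and the sorting from step 3 are unchanged and elided here):

```rust
use std::collections::HashMap;
use std::io::{self, BufWriter, Read, Write};
use std::str;

fn main() {
    // Read raw bytes: no UTF-8 validation pass over the whole input.
    let mut buffer: Vec<u8> = Vec::new();
    io::stdin().read_to_end(&mut buffer).ok();

    // Splitting on ASCII whitespace is safe for UTF-8, because the bytes
    // of a multi-byte code point are never in the ASCII range.
    let mut words: HashMap<&[u8], usize> = HashMap::new();
    for word in buffer.split(|b| b.is_ascii_whitespace()) {
        if !word.is_empty() {
            *words.entry(word).or_insert(0) += 1;
        }
    }

    // ... split into `multiple`/`single` and sort exactly as in step 3 ...

    let mut output = BufWriter::new(io::stdout());
    for (&word, &count) in &words {
        // We know the input was valid UTF-8, so we may skip re-validation.
        let word = unsafe { str::from_utf8_unchecked(word) };
        writeln!(output, "{}\t{}", word, count).ok();
    }
}
```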

Benchmark:

$ hyperfine -w 1 -m 5 \
    'cat 5M_wiki_dump.xml | ./wordcount_single' \
    'cat 5M_wiki_dump.xml | ./wordcount_u8'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_single
  Time (mean ± σ):     23.380 s ±  0.296 s    [User: 22.658 s, System: 1.367 s]
  Range (min … max):   23.053 s … 23.745 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_u8
  Time (mean ± σ):     22.713 s ±  0.438 s    [User: 21.975 s, System: 1.378 s]
  Range (min … max):   22.154 s … 23.159 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_u8' 
ran 1.03 ± 0.02 times faster than 
'cat 5M_wiki_dump.xml | ./wordcount_single'

Our improvements gained us another 3%.

Result

Benchmark baseline vs latest improvement:

$ hyperfine -w 1 -m 5 \
    'cat 5M_wiki_dump.xml | ./wordcount_1.10' \
    'cat 5M_wiki_dump.xml | ./wordcount_u8'

Benchmark #1: cat 5M_wiki_dump.xml | ./wordcount_1.10
  Time (mean ± σ):     27.722 s ±  0.483 s    [User: 26.926 s, System: 1.629 s]
  Range (min … max):   26.909 s … 28.146 s    5 runs

Benchmark #2: cat 5M_wiki_dump.xml | ./wordcount_u8
  Time (mean ± σ):     22.254 s ±  0.286 s    [User: 21.577 s, System: 1.315 s]
  Range (min … max):   21.950 s … 22.559 s    5 runs

Summary
'cat 5M_wiki_dump.xml | ./wordcount_u8' 
ran 1.25 ± 0.03 times faster than 
'cat 5M_wiki_dump.xml | ./wordcount_1.10'

All in all, our improvements gave us a binary that is about 25% faster than the baseline Rust solution.

Oldest comments (2)

Tim McNamara

Really like that you have broken down all of the steps and taken the time to benchmark each of them.

Some other suggestions:

Maybe stream stdin, rather than reading the whole thing at the start? io::stdin().read_to_end(&mut buffer).ok(); feels expensive.

Perhaps pick a specialized hash function that's happy with short strings.

Maybe use Vec::with_capacity(words.len()) rather than the vec![] macro when you construct multiple and single. That'll avoid several allocations. Although it also will probably waste a lot of memory.
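
Streaming stdin (the first suggestion) might look roughly like this sketch; note the map then has to own its keys, trading extra allocations for a much smaller buffer:

```rust
use std::collections::HashMap;
use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let mut words: HashMap<String, usize> = HashMap::new();
    // Hold one line in memory at a time instead of the whole dump.
    for line in stdin.lock().lines().filter_map(Result::ok) {
        for word in line.split_whitespace() {
            // to_string() allocates per token; a real version would
            // avoid the allocation for words already in the map.
            *words.entry(word.to_string()).or_insert(0) += 1;
        }
    }
    println!("{} distinct words", words.len());
}
```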

Marc Ryan Riginding

Great ideas, I'll update my code and measure.

I'm primarily interested in optimizing for runtime, but I think it would be interesting to see how the implementation differs if you optimize for space.