My first real post on dev.to, in September 2017, was the following:
I was trying to extract information from around 60 GB of CSV files corresponding to 139 million events. I started with Python to see how it behaved. The experiment was sparked by my frustration with Redshift and because I wanted to play with TrailDB, a library to query event series. My tests were unscientific but, after switching to Go (by copying and pasting code because I didn't really know the language back then), I was able to set up the DB 2.6 times faster than in Python and to query the data 2.54 times faster.
The topic of speed and performance is probably dear to everyone on this website, even if speed and performance can be relative to a context (see the concept of "fast enough"). You can see this topic permeate conversations on dev.to around the slowness of the web, memory occupation of browsers and desktop apps and other related topics. A couple of nice examples with long and interesting discussions attached:
Nobody would argue against cost-effective speed improvements, and this brings me to the gist of this post. An article titled Parsing logs 230x faster with Rust by André Arko (lead developer of Ruby's Bundler) caught my attention.
I've been aware of Rust's speed since... well, that and its advantages around memory management are what everyone talks about when they talk about Rust :-D
I've since switched to two Rust-based tools that I use every day on the command line: bat instead of cat and especially ripgrep instead of ack. The speed improvement is noticeable with the naked eye (thanks @dmfay for the tip)!
Back to the article. Arko wanted to query Bundler's treasure trove of 500 GB of logs per day to extract useful information about the community. Each log file contains millions of events in JSON (BTW: use structured logs if you can, JSON or key-value, you'll thank me later). Currently those files are sitting compressed in an S3 bucket for a few dollars per month.
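To show why structured logs pay off, here is a minimal Python sketch that counts events by one field when each log line is a JSON object. It is not Arko's code: the function name `count_by_field`, the field names and the sample events are all made up for illustration, and I'm assuming one JSON event per line.

```python
import json
from collections import Counter


def count_by_field(lines, field):
    """Count how often each value of `field` appears across JSON log lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the whole run
        if isinstance(event, dict) and field in event:
            counts[event[field]] += 1
    return counts


# Hypothetical sample events; the field names are assumptions, not Bundler's schema.
sample = [
    '{"user_agent": "bundler/1.16.2", "status": 200}',
    '{"user_agent": "bundler/1.16.2", "status": 304}',
    '{"user_agent": "bundler/1.15.0", "status": 200}',
    'not json at all',
]

print(count_by_field(sample, "user_agent"))
```

With unstructured log lines the same question would need a hand-written regular expression per log format, which is where most of the pain usually comes from.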
Hosted logging solutions were too expensive so he tried to see if he could cook something up.
The first attempt was in Ruby and it took an insane 16 hours for a day's worth of data. Nope.
The second attempt was in Python using AWS Glue and the full power of Amazon's servers. He went down to 3 hours, with an average of 36 minutes per log file (out of 500), using 100 parallel workers for 1000 dollars per month. Nope.
The third attempt was in Rust. He initially went down to 3 minutes per file, then to 60 seconds per file. After fiddling with it more and receiving feedback from readers, he managed to parse a single file in 8 seconds (!!).
The fourth attempt was in Rust again and he used parallelization. It was 3.3x faster than the sequential attempt. That's how he got to the 230x multiplication factor in the title.
If you read closely you'll notice the following:
- the first attempt probably shouldn't be mentioned in the post because it collects less data than the others (and we don't know how much less)
- at 60 seconds per file, the early Rust version amounts to 8.33 hours for the 500 files if run sequentially, which per file is more than 30 times faster than the experiment with Python and Glue
- the last "sequential" experiment in Rust amounts to a little more than 1 hour for the entire set of 500 GB which is a huge speedup
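The arithmetic behind those numbers can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it uses the per-file timings quoted in this post, assumes all 500 files take the average time, and the helper name `sequential_hours` is mine.

```python
FILES_PER_DAY = 500  # from the post: a day of logs is split across 500 files


def sequential_hours(seconds_per_file, files=FILES_PER_DAY):
    """Hours needed to process all files one after another."""
    return seconds_per_file * files / 3600


# Per-file timings as quoted in this post, converted to seconds.
per_file_seconds = {
    "Python on Glue (36 min/file)": 36 * 60,
    "Rust, first version (3 min/file)": 3 * 60,
    "Rust, improved (60 s/file)": 60,
    "Rust, final sequential (8 s/file)": 8,
}

for label, seconds in per_file_seconds.items():
    print(f"{label}: {sequential_hours(seconds):.2f} hours")

# Per-file ratio between Python on Glue and the 60 s/file Rust version.
print(per_file_seconds["Python on Glue (36 min/file)"] / 60)
```

Note that 36 minutes per file over 500 files is 300 hours of work, which spread across 100 workers gives the 3 hours reported for the Glue run.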
The last thing André Arko talks about is how he managed to deploy the Rust script so that it can work on the production logs stored on AWS. This part made me laugh:
I discovered rust-aws-lambda, a crate that lets your Rust program run on AWS Lambda by pretending to be a Go binary
Another wonder of distributing an app as a binary :D
On AWS Lambda the speedup he got was 78 times the initial Python example, not bad!
He did some calculations and it was safely in the free tier for AWS Lambda.
So he went from $1000 a month to $0 a month by rewriting a script in Rust.
I checked the repository of the script and people are already suggesting ways to make it even faster 😂
- Performance can save you a lot of money
- Knowing (or being willing to learn) more than one language is a good idea
- Rust is definitely worth looking at for this kind of parsing
- Sometimes better is better than good enough