1 Billion Row Challenge
A couple of months ago, I came across this challenge, which involves reading and processing a file with one billion rows. Each row reports a city’s temperature measurement. The expected output is a list of cities sorted alphabetically, showing the minimum, maximum, and average temperature.
The required logic isn’t rocket science, but the real challenge lies in processing the file as quickly as possible, and one billion rows, even if it doesn’t sound like much, is a lot!
Intrigued by the challenge, I decided to try it.
Hardware and baseline
I’ll use a MacBook Pro 2021 with an M1 Max processor and 32 GB of RAM. The baseline implementation runs in 184 seconds.
As a programming language, I’ll use Rust 🦀. I’ll measure performance using `std::time::Instant`, taking three measurements and keeping the median as the final result for each step.
First version (170 seconds)
The ideas that came to me for the first version are:
- We need ordering; we can use `BTreeMap`.
- The file is huge, so we don’t load it all into memory. We can use an iterator on `BufReader`.
- To keep things simple, we can work with `String` types.
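These ideas can be sketched roughly like this; the `Stats` struct, the `process` helper, and the inline sample are illustrative, not the exact code from the post:

```rust
use std::collections::BTreeMap;
use std::io::{BufRead, BufReader, Cursor};

// Per-city accumulator: min, max, sum, count.
struct Stats {
    min: f64,
    max: f64,
    sum: f64,
    count: u64,
}

fn process(reader: impl BufRead) -> BTreeMap<String, Stats> {
    let mut map: BTreeMap<String, Stats> = BTreeMap::new();
    for line in reader.lines() {
        let line = line.unwrap();
        // Each row looks like "CityName;12.3".
        let (city, value) = line.split_once(';').unwrap();
        let value: f64 = value.parse().unwrap();
        let e = map.entry(city.to_string()).or_insert(Stats {
            min: f64::MAX,
            max: f64::MIN,
            sum: 0.0,
            count: 0,
        });
        e.min = e.min.min(value);
        e.max = e.max.max(value);
        e.sum += value;
        e.count += 1;
    }
    map
}

fn main() {
    // A tiny stand-in for the one-billion-row file.
    let sample = "Rome;21.5\nOslo;-3.0\nRome;24.5\n";
    for (city, s) in process(BufReader::new(Cursor::new(sample))) {
        println!("{}: min={} max={} avg={}", city, s.min, s.max, s.sum / s.count as f64);
    }
}
```

Since `BTreeMap` keeps keys ordered, the final iteration already yields cities alphabetically.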
The first version executes in 170 seconds. It’s slow, but we’re already under the baseline.
Let’s analyze where we spend most of the time (you can view the original flame graph here):
The `Lines` iterator seems predominant. Since we have enough RAM, let’s try loading the whole file into memory as a `String`.
Read the file in memory (167 seconds)
With this second version, we notice a slight improvement, but we can do better.
For example, the documentation of `BufReader` says:

`BufReader<R>` can improve the speed of programs that make small and repeated read calls to the same file or network socket. It does not help when reading very large amounts at once, or reading just one or a few times.
Given this indication and the new flame graph, we can try removing `BufReader`, which doesn’t seem helpful in our case.
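A minimal sketch of this state, with the whole file read into a `String` up front and no `BufReader` in the loop; the temp-file setup only stands in for the real one-billion-row input:

```rust
use std::fs;

// Read the whole file into one String, then iterate its lines directly.
// Here we just compute the global minimum to keep the sketch short.
fn min_temperature(path: &str) -> f64 {
    let contents = fs::read_to_string(path).unwrap();
    contents
        .lines()
        .map(|line| line.split_once(';').unwrap().1.parse::<f64>().unwrap())
        .fold(f64::MAX, f64::min)
}

fn main() {
    // Hypothetical sample file standing in for the 1B-row input.
    let path = std::env::temp_dir().join("obrc_sample.txt");
    fs::write(&path, "Rome;21.5\nOslo;-3.0\n").unwrap();
    println!("{}", min_temperature(path.to_str().unwrap()));
}
```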
Drop `BufReader` (166 seconds)
Mmm… It doesn’t seem that it helped much (you can find the diff here). Let’s take a look at the new flamegraph:
It appears that iterator operations take ~10% of CPU time, while `String` conversions and operations take another ~8%.
First, we can avoid using `String` to prevent unnecessary memory allocations, working with bytes directly where possible. Let’s also add some unsafe Rust code to skip checking that what we’re reading is valid UTF-8; we know it’s valid because we know the input.
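A sketch of the unsafe shortcut; `city_name` is an illustrative helper, not from the original code:

```rust
// Interpret raw bytes as &str without the UTF-8 validity check.
// This is only sound because we trust the challenge input to be valid UTF-8;
// on arbitrary input it would be undefined behavior.
fn city_name(bytes: &[u8]) -> &str {
    unsafe { std::str::from_utf8_unchecked(bytes) }
}

fn main() {
    let line: &[u8] = b"Rome;21.5";
    // Find the separator manually instead of going through String APIs.
    let sep = line.iter().position(|&b| b == b';').unwrap();
    println!("{}", city_name(&line[..sep]));
}
```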
Drop `String` and don’t check UTF-8 (103 seconds)
Now we’re talking, that’s a big improvement! (commit)
Let’s take a look at the new flamegraph:
Again, we observe that iterators waste CPU cycles. In particular, in the `process_line` function, we split the line and accumulate the result in a vector that we discard immediately afterward.
Let’s change the approach and use the `memchr` crate to split the line, then use the obtained indices to extract what we need (we know the structure of the line, so that’s easy!).
Similarly, we also waste resources when parsing numbers: we create a `String` from bytes only to discard it immediately after converting it to a float. Let’s try the `fast_float` crate to parse the number directly from bytes.
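A dependency-free sketch of the same idea: `iter().position()` stands in for `memchr::memchr`, and an unchecked `&str` view plus `parse` stands in for `fast_float`; the crates do the same job faster, and `parse_line` is an illustrative helper:

```rust
// Split "City;12.3" at the first b';' and parse the value straight from
// the byte slice, with no intermediate String allocation.
fn parse_line(line: &[u8]) -> (&[u8], f64) {
    let sep = line.iter().position(|&b| b == b';').unwrap();
    // Safe only because the challenge input is known-valid UTF-8.
    let value = unsafe { std::str::from_utf8_unchecked(&line[sep + 1..]) }
        .parse::<f64>()
        .unwrap();
    (&line[..sep], value)
}

fn main() {
    let (city, v) = parse_line(b"Rome;21.5");
    println!("{} {}", String::from_utf8_lossy(city), v);
}
```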
Use `memchr` and `fast_float` to avoid useless conversions (77 seconds)
That’s another 20% improvement! (commit)
We’re making progress with this data-driven approach. Again, let’s take a look at the flamegraph:
A noticeable issue is `read_to_end`, which accounts for almost 4% of the time. I won’t touch it for now, but I might revisit the iterator approach later.
However, one thing that comes to mind is map performance (our accumulator). To complete the challenge, we need alphabetical order, but `BTreeMap` is slower than `HashMap`.
Indeed, the documentation shows that insert complexity is higher for `BTreeMap` than for `HashMap`: O(log(n)) vs. O(1). Over one billion elements, even log(n) isn’t free!
We can try using a `HashMap` for accumulation and then copy it into a `BTreeMap` at the end for ordering.
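A sketch of the two-phase approach; the `summarize` helper and the `(min, max, sum, count)` tuple layout are illustrative:

```rust
use std::collections::{BTreeMap, HashMap};

// Per-city (min, max, sum, count).
type Stats = (f64, f64, f64, u64);

// Accumulate in a HashMap (O(1) expected inserts on the hot path),
// then sort once at the end by moving everything into a BTreeMap.
fn summarize<'a>(rows: &[(&'a str, f64)]) -> BTreeMap<&'a str, Stats> {
    let mut acc: HashMap<&str, Stats> = HashMap::new();
    for &(city, t) in rows {
        let e = acc.entry(city).or_insert((f64::MAX, f64::MIN, 0.0, 0));
        e.0 = e.0.min(t);
        e.1 = e.1.max(t);
        e.2 += t;
        e.3 += 1;
    }
    // One pass over the distinct cities (a few hundred), not one billion rows.
    acc.into_iter().collect()
}

fn main() {
    let rows = [("Rome", 21.5), ("Oslo", -3.0), ("Rome", 24.5)];
    for (city, (min, max, sum, count)) in summarize(&rows) {
        println!("{}: {}/{}/{}", city, min, sum / count as f64, max);
    }
}
```

The key observation: the ordering cost moves from every insert to a single final pass over the distinct keys.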
Use `HashMap`, order later (42.5 seconds)
Amazing, more than 30 seconds less! Copying at the end is much faster than maintaining an ordered collection! (commit)
As always, let’s go back to our flamegraph:
Having changed our approach to data accumulation, we now have new information and new ideas to improve even more.
The hashing part is an issue. The docs say that `HashMap` uses an algorithm resistant to HashDoS attacks; it’s more secure, but also slower. We don’t need those guarantees.
After searching the web, I found the rustc_hash
crate, which suits our needs. Quoting the docs:
A speedy, non-cryptographic hashing algorithm
In addition, it provides a map implementation that might do the job: `FxHashMap`. Let’s try it!
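As a std-only illustration of what swapping the hasher looks like, here is a minimal FNV-1a hasher plugged into `HashMap` via `BuildHasherDefault`; it is in the same spirit as FxHash (fast, non-cryptographic) but not the identical function, and `FastMap` is a hypothetical alias:

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Minimal FNV-1a: fast and non-cryptographic, with no HashDoS resistance.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100_0000_01b3); // FNV prime
        }
    }
}

// Same HashMap, different hasher: swapping it is just a type parameter.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn main() {
    let mut m: FastMap<&str, u32> = FastMap::default();
    *m.entry("Rome").or_insert(0) += 1;
    *m.entry("Rome").or_insert(0) += 1;
    println!("{:?}", m.get("Rome"));
}
```

`FxHashMap` is exactly this pattern with `rustc_hash`’s hasher filled in, so it’s a drop-in replacement in the accumulator.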
Use `FxHashMap` (33 seconds)
That’s another good improvement! (commit)
Again, the new flamegraph is our best friend:
We still spend time mainly in `read_to_end` and hashing. It would be great to try `nohash-hasher`, but making it support `&str` or `&[u8]` looks long and tedious.
Reading the file takes around 3 seconds, and while we’re loading it, we’re doing nothing meaningful. It would be nice to start working on the entries in the meantime, but that requires spawning another thread and sending entries over for processing. It doesn’t sound straightforward, but it’s not impossible either. Let’s keep multithreading for later.
Apart from these observations, nothing stands out 🤔
Looking at the code, we’re still using `split` to process the lines. Since switching to `memchr` for splitting a single line brought benefits, we can try the same approach with `memchr_iter` to split the buffer into lines.
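A dependency-free sketch of the idea: walk the buffer once and hand each line out as a byte slice between newline positions (`memchr_iter` performs the same newline search, but SIMD-accelerated); `for_each_line` is an illustrative helper:

```rust
// Yield each line of `buf` as a &[u8] slice, without allocating.
fn for_each_line(buf: &[u8], mut f: impl FnMut(&[u8])) {
    let mut start = 0;
    for (i, &b) in buf.iter().enumerate() {
        if b == b'\n' {
            f(&buf[start..i]);
            start = i + 1;
        }
    }
    if start < buf.len() {
        f(&buf[start..]); // trailing line without a final newline
    }
}

fn main() {
    let mut count = 0;
    for_each_line(b"Rome;21.5\nOslo;-3.0\n", |_| count += 1);
    println!("{}", count);
}
```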
Use `memchr_iter` (30.5 seconds)
Good! Around 8% less! (commit)
The flame graph shows nothing different, as expected. But since reading stood out as a big problem for many iterations, we can’t ignore it anymore, so let’s proceed in that direction.
Searching the web once again, I found the `memmap2` crate. It memory-maps a file, letting us access its contents as a plain byte slice, which is an efficient way to work with large files without explicit read calls.
Sounds promising!
Use `Mmap` (27.2 seconds)
We nibbled away another 10%! (commit)
Furthermore, if we look at our flamegraph again, we notice it’s flat.
The hashing process still accounts for nearly 7% of the total time, but we’re satisfied with the current progress 😄
Now we can go multithreaded! I’ll divide the file into chunks and assign each to a separate thread. Each thread runs the same logic as before, and once all threads finish, we merge the results into a `BTreeMap`.
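A std-only sketch of the chunking; the trick is snapping each chunk boundary forward to the next newline so no line is split across threads. For brevity each thread here just counts lines, where the real code would run the full accumulation and merge the per-thread maps; `chunk_bounds` is an illustrative helper:

```rust
use std::thread;

// Split `buf` into roughly `parts` equal chunks, moving each boundary
// forward to just past the next b'\n' so every chunk holds whole lines.
fn chunk_bounds(buf: &[u8], parts: usize) -> Vec<(usize, usize)> {
    let mut bounds = Vec::new();
    let mut start = 0;
    for i in 1..=parts {
        let mut end = (buf.len() * i) / parts;
        while end < buf.len() && buf[end] != b'\n' {
            end += 1;
        }
        if end < buf.len() {
            end += 1; // include the newline itself
        }
        if start < end {
            bounds.push((start, end));
            start = end;
        }
    }
    bounds
}

fn main() {
    let buf = b"Rome;21.5\nOslo;-3.0\nRome;24.5\nOslo;1.0\n";
    let bounds = chunk_bounds(buf, 2);

    // Scoped threads can borrow `buf` directly; each processes its chunk.
    let total: usize = thread::scope(|s| {
        bounds
            .iter()
            .map(|&(a, b)| s.spawn(move || buf[a..b].iter().filter(|&&c| c == b'\n').count()))
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum()
    });
    println!("{}", total);
}
```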
Go multithreading (3.2 seconds)
BOOM! 💥 (commit)
With this result, we can conclude the challenge! We improved on the first version by 53 times. There are undoubtedly other possible optimizations, but I’m happy with this result.