1 Billion Row Challenge

A couple of months ago, I came across this challenge, which involves reading and processing a file with one billion rows. Each row reports a city’s temperature measurement. The expected output is a list of cities sorted alphabetically, showing the minimum, maximum, and average temperature.

The required logic isn’t rocket science, but the real challenge lies in processing the file as quickly as possible, and one billion rows is far more than it sounds!

Intrigued by the challenge, I decided to try it.

Hardware and baseline

I’ll use a MacBook Pro 2021 with an M1 Max processor and 32 GB of RAM. The baseline is 184 seconds.

As a programming language, I’ll use Rust 🦀. I’ll measure performance using std::time::Instant, taking three measurements and keeping the median as the final result for each step.
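
For context, each measurement is as simple as wrapping the run in Instant (a minimal sketch; the actual solution call is elided):

```rust
use std::time::Instant;

fn main() {
    let start = Instant::now();
    // ... run the challenge solution here ...
    println!("elapsed: {:?}", start.elapsed());
}
```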

First version (170 seconds)

The ideas that came to me for the first version are (a rough sketch follows the list):

  • We need alphabetical order; we can use a BTreeMap.
  • The file is huge, so we don’t load it all into memory. We can use the lines iterator on a BufReader.
  • To make things easy, we can work with String types.
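
Here’s a minimal sketch of what that first version looks like (the file name, the Stats struct, and the exact output format are illustrative; the real code is in the commits linked throughout):

```rust
use std::collections::BTreeMap;
use std::fs::File;
use std::io::{BufRead, BufReader};

// Running statistics per city; field names are illustrative.
struct Stats { min: f64, max: f64, sum: f64, count: u64 }

fn main() -> std::io::Result<()> {
    let reader = BufReader::new(File::open("measurements.txt")?);
    let mut cities: BTreeMap<String, Stats> = BTreeMap::new();

    for line in reader.lines() {
        let line = line?;
        // Each row looks like "CityName;12.3".
        let (city, value) = line.split_once(';').expect("malformed row");
        let value: f64 = value.parse().expect("malformed temperature");
        let entry = cities.entry(city.to_string()).or_insert(Stats {
            min: f64::INFINITY, max: f64::NEG_INFINITY, sum: 0.0, count: 0,
        });
        entry.min = entry.min.min(value);
        entry.max = entry.max.max(value);
        entry.sum += value;
        entry.count += 1;
    }

    // The BTreeMap already iterates in alphabetical order.
    for (city, s) in &cities {
        println!("{city}: {:.1}/{:.1}/{:.1}", s.min, s.sum / s.count as f64, s.max);
    }
    Ok(())
}
```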

The first version executes in 170 seconds. It’s slow, but we’re already under the baseline.

Let’s analyze where we spend most of the time (you can view the original flame graph here).

The lines iterator seems predominant. Since we have enough RAM, let’s try loading the whole file into memory as a String.
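
A possible shape for this change, still keeping BufReader around for the moment (same assumed file name):

```rust
use std::fs::File;
use std::io::{BufReader, Read};

fn main() -> std::io::Result<()> {
    // One big read into a String instead of many small line-by-line reads.
    let mut contents = String::new();
    BufReader::new(File::open("measurements.txt")?).read_to_string(&mut contents)?;
    for line in contents.lines() {
        // ... same per-line processing as before ...
    }
    Ok(())
}
```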

Read the file in memory (167 seconds)

With this second version, we notice a slight improvement, but we can do better.

For example, the documentation of BufReader says:

BufReader<R> can improve the speed of programs that make small and repeated read calls to the same file or network socket. It does not help when reading very large amounts at once, or reading just one or a few times.

Given this indication and the new flame graph, we can try to remove BufReader, which doesn’t seem helpful in our case.
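
The change can be as small as reading directly from std::fs (a sketch under the same assumptions):

```rust
use std::fs;

fn load() -> std::io::Result<String> {
    // A single large read doesn't benefit from buffering, so read directly.
    fs::read_to_string("measurements.txt")
}
```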

Drop BufReader (166 seconds)

Hmm… it doesn’t seem to have helped much (you can find the diff here). Let’s take a look at the new flame graph.

It appears that iterator operations take about 10% of CPU time, while String conversions and operations take another ~8%.

First, we can avoid using String to prevent unnecessary memory allocations, and we can work with bytes directly where possible. Let’s also add some unsafe Rust code and skip checking that what we’re reading is valid UTF-8; we know it’s valid because we know the input.
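
A sketch of both ideas, assuming the same input file: read raw bytes with read_to_end, and only turn them into &str through the unchecked conversion:

```rust
use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    // Read the raw bytes; no UTF-8 validation happens on the way in.
    let mut buf = Vec::new();
    File::open("measurements.txt")?.read_to_end(&mut buf)?;

    // SAFETY: the challenge input is known to be valid UTF-8, so we skip
    // the validation that str::from_utf8 would otherwise perform.
    let text: &str = unsafe { std::str::from_utf8_unchecked(&buf) };
    // ... process `text` (or `buf` directly) as before ...
    Ok(())
}
```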

Drop String and don’t check UTF-8 (103 seconds)

Now we’re talking, that’s a big improvement! (commit)

Let’s take a look at the new flame graph.

Again, we observe that iterators waste CPU cycles. In particular, in the process_line function, we split the line and accumulate the result in a vector that we discard immediately afterward.

Let’s try to change the approach and use the memchr crate to split the line and then use the obtained indices to extract what we need (we know the structure of the line, so that’s easy!).

Similarly, we also waste resources when parsing numbers: we create a String from the bytes only to discard it immediately after converting it to a float. Let’s try using the fast_float crate to parse the number directly from bytes.
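
A sketch of how process_line might look with these two crates (error handling simplified to expect):

```rust
use memchr::memchr;

/// One row looks like b"Hamburg;12.0": find the separator with memchr and
/// parse the temperature straight from the bytes with fast_float.
fn process_line(line: &[u8]) -> (&[u8], f64) {
    let sep = memchr(b';', line).expect("malformed row");
    let city = &line[..sep];
    let value: f64 = fast_float::parse(&line[sep + 1..]).expect("malformed temperature");
    (city, value)
}
```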

Use memchr and fast_float to avoid useless conversions (77 seconds)

That’s another 20% improvement! (commit)

We’re making progress with this data-driven approach. Again, let’s take a look at the flame graph.

A noticeable issue is read_to_end, which accounts for almost 4% of the time. I wouldn’t touch it for now, but I might try the iterator approach again later.

However, one thing that comes to mind is related to the performance of our map (the accumulator). To complete the challenge, we need alphabetical order, but a BTreeMap is slower than a HashMap.

Indeed, from the documentation, we can see that the time complexity of inserts is higher for BTreeMap than for HashMap: O(log(n)) vs. O(1). Over a billion inserts, even log(n) isn’t free!

We can try using a HashMap for accumulating and then copying it into a BTreeMap for ordering.
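
A minimal sketch of the idea, with f64 standing in for the real per-city stats:

```rust
use std::collections::{BTreeMap, HashMap};

fn main() {
    // Accumulate with O(1) inserts while scanning the rows.
    let mut acc: HashMap<String, f64> = HashMap::new();
    acc.insert("Hamburg".to_string(), 12.0);
    acc.insert("Bulawayo".to_string(), 8.9);

    // Pay the ordering cost once at the end: the map holds one entry per
    // city, not one per row, so this copy is cheap.
    let sorted: BTreeMap<String, f64> = acc.into_iter().collect();
    for (city, v) in &sorted {
        println!("{city}: {v}");
    }
}
```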

Use HashMap, order later (42.5 seconds)

Amazing, more than 30 seconds less! Copying at the end is much faster than maintaining an ordered collection! (commit)

As always, let’s go back to our flame graph.

Having changed our approach to data accumulation, we now have new information and new ideas to improve even more.

The hashing part is an issue. The docs say that HashMap uses a hashing algorithm resistant to HashDoS attacks; it’s more secure, but also slower. We don’t need those guarantees here.

After searching the web, I found the rustc_hash crate, which suits our needs. Quoting the docs:

A speedy, non-cryptographic hashing algorithm

In addition, it provides a map implementation that might do the job: FxHashMap
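
The swap is nearly a one-liner (a sketch; the key and value types stand in for the real ones):

```rust
use rustc_hash::FxHashMap;

fn main() {
    // Drop-in replacement for std HashMap, built with the fast FxHash
    // hasher instead of the DoS-resistant default (SipHash).
    let mut acc: FxHashMap<&str, f64> = FxHashMap::default();
    acc.insert("Hamburg", 12.0);
    println!("{:?}", acc.get("Hamburg"));
}
```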

Let’s try it!

Use FxHashMap (33 seconds)

That’s another good improvement! (commit)

Again, the new flame graph is our best friend.

We still spend time mainly in read_to_end and hashing. It would be great to try nohash-hasher, but making it support &str or &[u8] appears long and tedious.

Reading the file takes around 3 seconds, and while we’re loading it, we’re doing nothing meaningful. It would be nice to start working on the entries in the meantime, but that requires spawning another thread and sending the entries over for processing. It doesn’t sound straightforward, but it’s not impossible either. Let’s keep multithreading for later.

Apart from these observations, nothing stands out 🤔

Looking at the code, we’re still using split to process the lines. Since switching to memchr to split a single line brought some benefits, we can try the same approach with memchr_iter to split the file into lines.
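
A sketch of the line-splitting loop this enables (the helper name is mine):

```rust
use memchr::memchr_iter;

/// Walk the whole file as bytes, handing each line to `f` by jumping
/// from one '\n' position to the next.
fn for_each_line(data: &[u8], mut f: impl FnMut(&[u8])) {
    let mut start = 0;
    for nl in memchr_iter(b'\n', data) {
        f(&data[start..nl]);
        start = nl + 1;
    }
    if start < data.len() {
        // The file may not end with a trailing newline.
        f(&data[start..]);
    }
}
```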

Use memchr_iter (30.5 seconds)

Good! Around 8% less! (commit)

The flame graph shows nothing different, as expected. But reading has stood out as a big problem for many iterations now; we can’t ignore it anymore, so let’s proceed in that direction.

Searching the web once again, I found the memmap2 crate. It lets us map a file directly into memory, providing a more efficient way to access large files without explicit I/O operations.
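
A minimal sketch of how the crate is used (assuming the same file name):

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("measurements.txt")?;
    // SAFETY: we assume nothing truncates or mutates the file while mapped.
    let mmap = unsafe { Mmap::map(&file)? };
    // The mapping derefs to &[u8]; pages are faulted in lazily by the OS,
    // so there's no upfront read_to_end pass.
    let data: &[u8] = &mmap;
    println!("mapped {} bytes", data.len());
    Ok(())
}
```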

Sounds promising!

Use Mmap (27.2 seconds)

We nibbled away another 10%! (commit)

Furthermore, if we look at our flame graph again, we notice it’s flat.

The hashing process still accounts for nearly 7% of the total time, but we’re satisfied with the current progress 😄

Now, we can go multithreaded! I’ll divide the file into chunks and assign each one to a separate thread. Each thread will execute the same logic as before, and once all threads are finished, we’ll merge the results into a BTreeMap.
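
Here’s a simplified sketch of that plan using std::thread::scope; process_chunk stands in for the single-threaded logic from the previous steps, and the boundary handling just nudges each cut forward to the next newline so no row is split across two threads:

```rust
use std::collections::BTreeMap;
use std::thread;

// Illustrative per-city stats; the real code lives in the linked commit.
#[derive(Clone, Copy)]
struct Stats { min: f64, max: f64, sum: f64, count: u64 }

fn merge(a: &mut Stats, b: &Stats) {
    a.min = a.min.min(b.min);
    a.max = a.max.max(b.max);
    a.sum += b.sum;
    a.count += b.count;
}

// Stand-in for the single-threaded pipeline built in the previous steps.
fn process_chunk(_chunk: &[u8]) -> BTreeMap<Vec<u8>, Stats> {
    todo!()
}

fn run(data: &[u8], n_threads: usize) -> BTreeMap<Vec<u8>, Stats> {
    // Cut the file into roughly equal chunks, moving each boundary
    // forward to the next '\n' so every chunk holds whole rows.
    let mut bounds = vec![0];
    for i in 1..n_threads {
        let mut pos = data.len() * i / n_threads;
        while pos < data.len() && data[pos] != b'\n' {
            pos += 1;
        }
        bounds.push((pos + 1).min(data.len()));
    }
    bounds.push(data.len());

    thread::scope(|s| {
        let handles: Vec<_> = (0..n_threads)
            .map(|i| {
                let chunk = &data[bounds[i]..bounds[i + 1]];
                s.spawn(move || process_chunk(chunk))
            })
            .collect();

        // Merge the per-thread maps into one alphabetically ordered result.
        let mut result = BTreeMap::new();
        for h in handles {
            for (city, stats) in h.join().unwrap() {
                result
                    .entry(city)
                    .and_modify(|acc| merge(acc, &stats))
                    .or_insert(stats);
            }
        }
        result
    })
}
```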

Go multithreaded (3.2 seconds)

BOOM! 💥 (commit)

With this result, we can conclude the challenge! We ended up 53 times faster than the initial version. There are undoubtedly other possible optimizations, but I’m happy with this result.