
I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. Among other things, this allows for a message send/receive fastpath that requires only two atomic swaps.
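
To make that claim concrete, here is a rough sketch of what a two-swap fastpath can look like. This is written against today’s std::sync::atomic rather than the actual pipes runtime, and it omits the blocking/wakeup path entirely; it only illustrates the shape of the idea.

    use std::ptr;
    use std::sync::atomic::{AtomicPtr, Ordering};

    // Conceptual sketch only: a one-shot slot where the uncontended
    // fastpath is exactly two atomic swaps. The real runtime would also
    // handle blocking and waking the receiver, which is omitted here.
    struct OneShot<T> {
        slot: AtomicPtr<T>,
    }

    impl<T> OneShot<T> {
        fn new() -> OneShot<T> {
            OneShot { slot: AtomicPtr::new(ptr::null_mut()) }
        }

        fn send(&self, value: T) {
            let payload = Box::into_raw(Box::new(value));
            // Swap #1: the sender publishes the payload.
            let prev = self.slot.swap(payload, Ordering::AcqRel);
            debug_assert!(prev.is_null(), "one-shot slot used twice");
        }

        fn try_recv(&self) -> Option<T> {
            // Swap #2: the receiver claims the payload, or sees that
            // nothing has arrived yet and would go block.
            let p = self.slot.swap(ptr::null_mut(), Ordering::AcqRel);
            if p.is_null() {
                None
            } else {
                // (A real implementation would also free an unreceived
                // payload on drop.)
                Some(unsafe { *Box::from_raw(p) })
            }
        }
    }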

Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a ring. Each task sends a message to its right-hand neighbor and receives a message from its left-hand neighbor. This repeats for a while. At the end, we divide the total time taken by the number of messages, which gives us roughly the fastest we can send and receive a message, modulo some task-spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:

Sent 1000000 messages in 0.227634 seconds
  4.39301e+06 messages / second
  0.227634 µs / message

This is about 17x faster!
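
For reference, the benchmark has roughly this shape. The sketch below uses modern std::thread and std::sync::mpsc channels rather than the Rust tasks and pipes discussed here, so the absolute numbers are not comparable; it just shows the ring structure being timed.

    use std::sync::mpsc::{channel, Receiver, Sender};
    use std::thread;
    use std::time::Instant;

    const TASKS: usize = 100;
    const ROUNDS: usize = 10_000;

    fn main() {
        // Channel i carries messages *into* task i.
        let (mut txs, rxs): (Vec<Sender<usize>>, Vec<Receiver<usize>>) =
            (0..TASKS).map(|_| channel()).unzip();
        // Rotate the senders so task i holds the sender for its
        // right-hand neighbor, (i + 1) % TASKS.
        txs.rotate_left(1);

        let start = Instant::now();
        let handles: Vec<_> = txs
            .into_iter()
            .zip(rxs)
            .map(|(tx, rx)| {
                thread::spawn(move || {
                    for n in 0..ROUNDS {
                        tx.send(n).unwrap(); // send to the right-hand neighbor
                        rx.recv().unwrap(); // receive from the left-hand neighbor
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }

        let sent = TASKS * ROUNDS;
        let secs = start.elapsed().as_secs_f64();
        println!("Sent {} messages in {} seconds", sent, secs);
        println!("  {} messages / second", sent as f64 / secs);
        println!("  {} µs / message", secs * 1e6 / sent as f64);
    }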

It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.

To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol: it spawns a task to compute a value, and that task sends a single message back to the spawner when it finishes. Porting this was a relatively small change, yet it yielded a measurable speedup. Here are the results.

Benchmark       Port/chan time (s)   Pipe time (s)   Improvement (%)
Graph500 PBFS   0.914772             0.777784        17.6
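
The future protocol itself is simple enough to sketch in a few lines. The following uses modern std threads and channels rather than core::future’s actual API; the names Future, spawn_future, and get are made up for illustration.

    use std::sync::mpsc::{channel, Receiver};
    use std::thread;

    // Sketch of the protocol described above: spawn a task that computes
    // a value and sends exactly one message back; block on that message
    // when the value is demanded.
    struct Future<T> {
        rx: Receiver<T>,
    }

    fn spawn_future<T, F>(f: F) -> Future<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = channel();
        thread::spawn(move || {
            // The single message back to the spawner.
            let _ = tx.send(f());
        });
        Future { rx }
    }

    impl<T> Future<T> {
        fn get(self) -> T {
            self.rx.recv().expect("task exited before sending")
        }
    }

    fn main() {
        let f = spawn_future(|| (1u64..=1_000).sum::<u64>());
        println!("{}", f.get());
    }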

The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (formerly the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes as well, and here are the results.

Benchmark               Port/chan time (s)   Pipe time (s)   Improvement (%)
Shootout K-Nucleotide   4.335                3.125           38.7

Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. The shootout-threadring benchmark actually got significantly slower when I moved it to pipes. Thread ring seems to mostly measure the time it takes to switch between tasks, since only one task should be runnable at any given time. My hypothesis is that because message passing got faster, the test now hammers the scheduler synchronization code harder, and the extra contention there causes the slowdown. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should help this benchmark as well.

Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.


I’ve continued tuning the Graph500 code I talked about yesterday. I still haven’t managed to achieve a speedup over the sequential version, but I’m now running about an order of magnitude faster than I was yesterday.

If you recall, the last profile from yesterday showed that we were spending an enormous amount of time waiting on a spinlock related to malloc and free. The problem was basically that the closure we were passing to par::mapi captured the adjacency lists for the whole graph. To guarantee safety, Rust was copying the adjacency lists into each task. Since these are immutable, however, we should have been able to share them in place.

It turns out Rust has a way to let you do this by dropping into the unsafe portions of the language. Now, instead of duplicating the closure for each task, we pass the spawned tasks an unsafe pointer to the closure, which they dereference in place, avoiding the copy. Instead of a 50-100x slowdown, we are now only at a 4-5x slowdown.
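
In today’s Rust you would normally reach for Arc to share immutable data across threads, but here is a sketch of the unsafe-pointer trick itself, with a plain vector standing in for the adjacency lists. The soundness argument is the same one the real code relies on: the data is immutable and outlives every task that reads it.

    use std::thread;

    fn main() {
        // Stand-in for the graph's adjacency lists: large and immutable.
        let adjacency: Vec<Vec<usize>> =
            vec![vec![1, 2], vec![0, 2], vec![0, 1]];

        // Smuggle the address past the 'static requirement on spawn.
        // Sound only because the data is never mutated and outlives
        // every task (we join them all before `adjacency` is dropped).
        let addr = &adjacency as *const Vec<Vec<usize>> as usize;

        let handles: Vec<_> = (0..4usize)
            .map(|worker| {
                thread::spawn(move || {
                    // Dereference in place; no per-task copy is made.
                    let graph = unsafe { &*(addr as *const Vec<Vec<usize>>) };
                    // ... this worker's share of the search would go here ...
                    graph.iter().map(|adj| adj.len()).sum::<usize>() + worker
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
    }

With the copies eliminated, the profile looks like this: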

[Figure: Graph500 profile showing very little time in the malloc spinlocks.]

The troublesome spinlock from before now accounts for only 3.7% of our time. The fourth function, at 5.3% of the time, is the main worker for the breadth first search; ideally, it would account for the majority of the time. The top two functions, however, appear to deal with allocating and zeroing memory.

I also tried using unsafe pointers to the input vectors, but this did not make a material difference in performance. I suspect the bzero and memmove time mostly comes from aggregating the results of the parallel tasks. The next obvious optimization is to pre-allocate the result space and update it in place.
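
Here is a sketch of what that optimization could look like, using modern scoped threads and a vector standing in for the real result space; chunks_mut hands each task a disjoint slice to fill in place.

    use std::thread;

    fn main() {
        const N: usize = 1 << 20;
        const TASKS: usize = 4;

        // Allocate (and zero) the result space exactly once up front.
        let mut results = vec![0u64; N];

        // Each scoped thread gets a disjoint mutable slice to fill in
        // place, so no per-task result vectors need to be concatenated
        // (and copied) afterwards.
        thread::scope(|s| {
            for (t, chunk) in results.chunks_mut(N / TASKS).enumerate() {
                s.spawn(move || {
                    for (i, slot) in chunk.iter_mut().enumerate() {
                        // Stand-in for this task's real per-node results.
                        *slot = (t * (N / TASKS) + i) as u64;
                    }
                });
            }
        });

        assert_eq!(results[N - 1], (N - 1) as u64);
    }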