I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. Among other things, this allows for a message send/receive fast path that requires only two atomic swaps.

Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a ring. Each task sends a message to its right-hand neighbor and receives a message from its left-hand neighbor. This repeats for a while. At the end, we look at the total time taken divided by the number of messages. This gives us roughly the fastest we can send and receive a message, modulo some task spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:

Sent 1000000 messages in 0.227634 seconds
  4.39301e+06 messages / second
  0.227634 µs / message

This is about 17x faster!
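For readers who want to see the shape of the ring benchmark described above, here is a minimal sketch. It is not the original benchmark code: it assumes present-day Rust and substitutes `std::sync::mpsc` channels and OS threads for pipes and tasks, and the names `run_ring`, `num_tasks`, and `rounds` are made up for illustration.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Instant;

/// Runs a ring of `num_tasks` threads. In every round each thread sends one
/// message to its right-hand neighbor and receives one from its left-hand
/// neighbor. Returns the total number of messages received.
/// (Hypothetical helper; std channels stand in for pipes.)
fn run_ring(num_tasks: usize, rounds: usize) -> usize {
    assert!(num_tasks > 0, "the ring needs at least one task");

    // Channel i delivers messages to task i.
    let mut senders: Vec<Sender<usize>> = Vec::with_capacity(num_tasks);
    let mut receivers: Vec<Receiver<usize>> = Vec::with_capacity(num_tasks);
    for _ in 0..num_tasks {
        let (tx, rx) = channel();
        senders.push(tx);
        receivers.push(rx);
    }
    // After rotating, task i holds the sender that feeds task (i + 1) % num_tasks.
    senders.rotate_left(1);

    let handles: Vec<_> = senders
        .into_iter()
        .zip(receivers)
        .map(|(to_right, from_left)| {
            thread::spawn(move || {
                let mut received = 0;
                for round in 0..rounds {
                    to_right.send(round).expect("right-hand neighbor hung up");
                    from_left.recv().expect("left-hand neighbor hung up");
                    received += 1;
                }
                received
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("task panicked"))
        .sum()
}

fn main() {
    let (num_tasks, rounds) = (10, 1_000);
    let start = Instant::now();
    let total = run_ring(num_tasks, rounds);
    let elapsed = start.elapsed().as_secs_f64();
    // Same three figures as the benchmark output: count, throughput, latency.
    println!("Sent {} messages in {} seconds", total, elapsed);
    println!("  {} messages / second", total as f64 / elapsed);
    println!("  {} µs / message", elapsed * 1e6 / total as f64);
}
```

Timings from this sketch are not comparable to the numbers in this post, since it uses OS threads and a different channel implementation.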

It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.

To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol; it just spawns a task to compute something, which then sends a single message back to the spawner. Porting this was a relatively small change, yet it got measurable speedups. Here are the results.

Benchmark        Port/chan time (s)    Pipe time (s)    Improvement (%)
Graph500 PBFS    0.914772              0.777784         17.6%
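The one-message future protocol described above (spawn a task, have it send a single result back to the spawner) can be sketched roughly as follows. This is an illustration under assumptions, not the code in core::future or std::par: it uses present-day Rust with `std::sync::mpsc` and OS threads, and `TaskFuture`, `spawn_future`, and `par_map` are hypothetical names.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical stand-in for a future: the receiving end of a protocol
/// that carries exactly one message, the computed result.
struct TaskFuture<T> {
    result: Receiver<T>,
}

/// Spawns a thread to run `compute` and returns a handle to its result.
fn spawn_future<T, F>(compute: F) -> TaskFuture<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = channel();
    thread::spawn(move || {
        // The single message of the protocol. A send error only means the
        // spawner dropped the future, so it is safe to ignore.
        let _ = tx.send(compute());
    });
    TaskFuture { result: rx }
}

impl<T> TaskFuture<T> {
    /// Blocks until the worker sends its one message.
    fn get(self) -> T {
        self.result
            .recv()
            .expect("worker exited without sending a result")
    }
}

/// A parallel map in the spirit of the one described in the text:
/// one future per element, collected in the original order.
fn par_map<T, U, F>(items: Vec<T>, f: F) -> Vec<U>
where
    T: Send + 'static,
    U: Send + 'static,
    F: Fn(T) -> U + Send + Clone + 'static,
{
    let futures: Vec<TaskFuture<U>> = items
        .into_iter()
        .map(|item| {
            let f = f.clone();
            spawn_future(move || f(item))
        })
        .collect();
    futures.into_iter().map(TaskFuture::get).collect()
}

fn main() {
    let doubled = par_map(vec![1, 2, 3], |x| x * 2);
    println!("{:?}", doubled);
}
```

Because each future exchanges exactly one message between exactly two parties, it fits the 1:1 nature of pipes well, which is presumably why the port was a small change.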

The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (i.e. the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes, and here are the results.

Benchmark                Port/chan time (s)    Pipe time (s)    Improvement (%)
Shootout K-Nucleotide    4.335                 3.125            38.7%

Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. In the case of the shootout-threadring benchmark, it actually got significantly slower when I moved to pipes. The thread ring benchmark seems to mostly be measuring the time to switch between tasks, as only one should be runnable at any given time. My hypothesis is that because message passing got faster, this test now hammers the scheduler synchronization code harder, leading to more slowdown due to contention. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should improve this benchmark as well.

Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.

Comments
  1. Maybe you could avoid the atomic operations too, or at least defer them. Just write the messages for all your sends (assuming they’re still “atomic” types – you don’t want half a value to show up on the other side, you just don’t care that it shows up *now*), but right before you receive you insert the appropriate memory barrier so other CPUs can see the preceding sends. This allows you to do work before incurring the memory barrier cost.

    Now, if the receiver still does set a global flag (“I’m blocked on end point” or “Got the message”) then the sender can check to see what the receiver has done with his preceding messages (before blocking on a receive). If a message has been picked up then there’s no need to insert the memory barrier. If the other side is blocked on the opposite endpoint for a message, then you can just switch to that task and start executing (no need for a global memory barrier since you’ll be running on the same CPU thread). You only need to flush memory if the other thread is not blocked and hasn’t picked up the message (i.e. the message got through the cache “on its own” without manual cache operations).
