Previously, I talked about a couple of ways to improve the message passing performance in Rust. The obvious bottleneck was that sending and receiving both involved taking a lock that was shared between all tasks. In order to remove this lock, I set out to write a new port/channel system that relied much less on the runtime library.
In order to do this, we first needed a variant of the atomic reference counter that allows us to share mutable data. We accomplish this by adding a mutex and condition variable. The mutex is your standard pthread mutex, while we’ve implemented our own condition variable that the Rust scheduler is aware of. Because the lock is a plain pthread mutex, this kind of exclusive access is unsafe and should be used with care; if you’re not careful, you could deadlock Rust. Fortunately, the API for Rust’s low-level locks and condition variables makes it harder to accidentally hold a lock for an unbounded amount of time.
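To make the idea concrete, here is a minimal sketch of a reference-counted box guarded by a mutex and condition variable. It is written against today's standard library (`Arc`, `Mutex`, `Condvar`) rather than the runtime primitives described above, and the names `Exclusive`, `with`, `wait_until`, and `count_with_threads` are made up for illustration. The closure-taking `with` shows one way an API can make it hard to hold a lock for too long: the lock is released as soon as the closure returns.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for the shared-mutable reference-counted box:
/// an Arc around the data's mutex and a condition variable.
struct Exclusive<T> {
    inner: Arc<(Mutex<T>, Condvar)>,
}

// Implemented by hand so that cloning the handle does not require T: Clone.
impl<T> Clone for Exclusive<T> {
    fn clone(&self) -> Self {
        Exclusive { inner: Arc::clone(&self.inner) }
    }
}

impl<T> Exclusive<T> {
    fn new(value: T) -> Self {
        Exclusive { inner: Arc::new((Mutex::new(value), Condvar::new())) }
    }

    /// Run `f` with the lock held. The lock is released when `f` returns,
    /// and any waiters are woken so they can re-check their condition.
    fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        let (lock, cond) = &*self.inner;
        let mut guard = lock.lock().unwrap();
        let result = f(&mut *guard);
        drop(guard);
        cond.notify_all();
        result
    }

    /// Block until `ready` holds for the data, then run `f` with the lock held.
    fn wait_until<R>(&self, ready: impl Fn(&T) -> bool, f: impl FnOnce(&mut T) -> R) -> R {
        let (lock, cond) = &*self.inner;
        let mut guard = cond
            .wait_while(lock.lock().unwrap(), |value| !ready(&*value))
            .unwrap();
        f(&mut *guard)
    }
}

/// Several threads bump a shared counter; the caller waits for the final value.
fn count_with_threads(num_threads: usize, increments: usize) -> usize {
    let counter = Exclusive::new(0usize);
    let handles: Vec<_> = (0..num_threads)
        .map(|_| {
            let counter = counter.clone();
            thread::spawn(move || {
                for _ in 0..increments {
                    counter.with(|n| *n += 1);
                }
            })
        })
        .collect();
    let total = num_threads * increments;
    let seen = counter.wait_until(|n| *n == total, |n| *n);
    for handle in handles {
        handle.join().unwrap();
    }
    seen
}
```

Calling `count_with_threads(4, 1000)` from `main` exercises both the locked update path and the condition-variable wait.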
Once we have this, ports and channels simply become a locked atomically reference-counted queue. There’s no more global lock, and things are far simpler because we only need the Rust runtime support for providing mutual exclusion and condition variable signalling. This part didn’t take long to implement, and I was eager to try it out. I spent a little while writing up a new benchmark that would really show the benefit of avoiding the global lock, and when I went to run it, things crashed.
It turns out I had discovered a bug whereby vector addition could copy things that weren’t copyable. I saw two approaches to fixing this: fixing trans (the Rust-to-LLVM translation pass), or moving vector addition out of trans and into the library. The idea behind the second option is that vector addition would then go through the existing and well-tested function call path, rather than a special vector addition codegen path. This seemed like the best option overall, since the first option felt more like a band-aid over the root cause, which is that we have too much functionality duplicated in subtly different ways.
Thus, I set out to move vector addition to libcore. This exposed some subtle semantic issues around const vectors, but these were mostly not too painful to work out. My first working versions were too slow to finish building the compiler in a reasonable amount of time. I was thinking we’d end up taking a 20% to 2x performance hit by making this change, but thanks to some heroic optimization help from Niko, we got the performance to the point where in some cases the new code even performs ever so slightly better than the old code. In the course of making these changes, I also discovered another bug that led to us leaking memory, and in the course of fixing that, I discovered a way to make Rust segfault. Ah, the life of a compiler writer.
At any rate, we have fixes to all of these bugs in the works, and things are working well enough to run a benchmark. This test creates a ring of tasks, each of which sends a message to its neighbor on one side and receives from its neighbor on the other side. Here are the numbers for the old messaging system.
Sent 1000000 messages in 3.88114 seconds
257656 messages / second
3.88114 μs / message
And here are the numbers for the new system.
Sent 1000000 messages in 1.87881 seconds
532253 messages / second
1.87881 μs / message
As you can see, we’re about 2.1x faster than we were before (532253 versus 257656 messages per second).
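For readers who want to picture the benchmark, here is a rough sketch of the ring, with each channel modeled as a reference-counted queue behind its own lock and condition variable. This is not the code that produced the numbers above: it uses OS threads and today's standard library instead of Rust tasks and the runtime's primitives, and `Chan`, `send`, `recv`, and `run_ring` are hypothetical names chosen for the example.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical channel: a reference-counted queue with its own lock,
/// so no lock is shared between unrelated senders and receivers.
struct Chan<T> {
    shared: Arc<(Mutex<VecDeque<T>>, Condvar)>,
}

// Implemented by hand so that cloning the handle does not require T: Clone.
impl<T> Clone for Chan<T> {
    fn clone(&self) -> Self {
        Chan { shared: Arc::clone(&self.shared) }
    }
}

impl<T> Chan<T> {
    fn new() -> Self {
        Chan { shared: Arc::new((Mutex::new(VecDeque::new()), Condvar::new())) }
    }

    /// Enqueue a message and wake the receiver if it is waiting.
    fn send(&self, msg: T) {
        let (lock, cond) = &*self.shared;
        lock.lock().unwrap().push_back(msg);
        cond.notify_one();
    }

    /// Block until a message is available, then dequeue it.
    fn recv(&self) -> T {
        let (lock, cond) = &*self.shared;
        let mut queue = cond
            .wait_while(lock.lock().unwrap(), |q| q.is_empty())
            .unwrap();
        queue.pop_front().expect("queue is non-empty after waiting")
    }
}

/// Build a ring of workers: worker i sends to worker (i + 1) % n and
/// receives from worker i - 1. Returns the total number of messages received.
fn run_ring(num_tasks: usize, msgs_per_task: usize) -> usize {
    let chans: Vec<Chan<usize>> = (0..num_tasks).map(|_| Chan::new()).collect();
    let handles: Vec<_> = (0..num_tasks)
        .map(|i| {
            let inbox = chans[i].clone();
            let outbox = chans[(i + 1) % num_tasks].clone();
            thread::spawn(move || {
                let mut received: usize = 0;
                for n in 0..msgs_per_task {
                    outbox.send(n); // never blocks: the queue is unbounded
                    inbox.recv();
                    received += 1;
                }
                received
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Wrapping a call such as `run_ring(4, 250_000)` in `std::time::Instant::now()` and `elapsed()` would give a messages-per-second figure, though the absolute numbers will not be comparable to the ones reported here.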
This new system doesn’t yet have the exact same features as the old system, and it needs some help with ergonomics. My next task will be to work on adding these missing things and hopefully be able to incrementally replace the old code with the new, faster version.
The code for this post isn’t yet in the main Rust tree, but it should be landing soon as we squash the bugs we found while building this new message passing system.