I’m happy to report that my parallel breadth first search program in Rust gets a 2.8 to 3x speedup now on my quad core MacBook Pro. While not in the ideal 4-8x range, I’m pretty satisfied with a 3x speedup. As you recall from the last post, the majority of the time was still spent doing memory allocation and copies. I tried several things to finally get a speedup, only one of which had much noticeable effect:
- Allocate and update the result storage in place
- Reuse tasks
- Share more with unsafe pointers
Allocate and update in place
Allocation in place was my hypothesis from the last post. Basically, you allocate a big mutable vector for the results ahead of time. Then, each of the worker tasks write their results directly into the finally location, rather than incurring a bunch of memory copy costs to aggregate the results as a second step. Part of the reason why this seems like a good idea comes from plot below of CPU usage over time.
The benchmark searches for 64 randomly selected keys. This graph shows about one and a half of the searches. Each search consists of a parallel step to update all the vertex colors (from white to gray to black), and then a synchronization step to aggregation the results together. This repeats until no more nodes are gray. Traversing the whole graph should take about the diameter of the graph, which is usually about 4. This matches what we see on the graph. There are a bunch of spikes all the cores are busy, followed by some sequential synchronization time. Each search actually consists of 8 spikes because there is a parallel step to test if we’re done and a parallel step to do the actual work.
The idea was that the sequential time between each spike came from appending all the intermediate result vectors together. By allocating in place, we should have been able to avoid this cost. Unfortunately, I did not see a noticeable performance improvement by doing this.
My next idea was that maybe we were spending too much time in task creation and destruction. To test this, I wrote a task pool, which would use the same tasks to process multiple work items. By design, Rust’s task primitives are very cheap, so this shouldn’t be necessary in practice. Once again, this transformation did not noticeably affect performance. This is actually good news, because it means Rust delivers on its goal of making spawning tasks cheap enough that developers shouldn’t worry about spawning too many.
Share more with unsafe pointers
After two failed attempts, Patrick Walton and I took a more in depth look at the profiles. We discovered that compiling with the
-g flag makes Instruments able to show us a lot more useful information about where we are spending our time. Patrick’s diagnosis was that we were spending too much time allocating and copying, and he advised I try to track down the expensive copies. The profiles showed much of the time was spent in the beginning and end of the parallel block, meaning the problems were probably do to copying and freeing things in the closure. There are two fairly large structures that this code used: the edge list and the color vector. I replaced these with unsafe pointers to ensure they were actually shared rather than copied. Finally, I saw a parallel speedup! The edge list was a couple hundred megabytes, so this explains why copying it was rather expensive.
My experience again shows several ways we can improve Rust. The most obvious is that we need a way to share data between tasks; copying is just too expensive. For the most part, this isn’t a problem since we would only be sharing read-only data. The trickiness comes in when we have to decide when to reclaim the shared data. One way of doing this is to use the patient parent pattern. A parent can spawn several parallel tasks which each get read-only access to all of the parent’s data. The parent suspends execution until all of the children complete, at which point the parent would free the data through the usual means. Here the children do not worry about freeing these structures at all.
Another approach is to use atomic reference counting for data that is shared between tasks. This gives more flexibility, because the lifetimes of tasks are not strictly bounded by their parents. It does mean we pay the cost of atomic reference counting, but this will hopefully not be bad since we only have to update reference counts when data structures cross task boundaries.
Finally, there are some smaller changes that would be nice. I spent a while worrying about time spent appending vectors. It might be worth adding a tree-vector to the standard library, which gives logarithmic time element lookup but constant time appending. Since the root cause of my performance problems was a large implicit copy, it’d be nice to have the compiler warn on potentially large copies, or maybe even make it an error where the programmer must explicitly copy something if this is really desired. As another optimization, it might be worth keeping tasks around once they die and reusing them to hopefully reduce the overhead of creating tasks even further. I have a page about this in the Rust wiki, but we’ll need more evidence that this is a good idea before going for it.