Posts Tagged ‘graph processing’

I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. This allows, among other things, for message send/receive fastpath that requires only two atomic swaps.

Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a while. Each task sends a message to their right-hand neighbor, and receives a message from the left-hand neighbor. This repeats for a while. At the end, we look at the total time taken divided by the number of messages. This gives us roughly the fastest we can send and receive a message, modulo some task spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:

Sent 1000000 messages in 0.227634 seconds
  4.39301e+06 messages / second
  0.227634 µs / message

This is about 17x faster!

It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.

To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol; it just spawns a task to compute something, which then sends a single message back to the spawner. Porting this was a relatively small change, yet it got measurable speedups. Here are the results.

Benchmark Port/chan time (s) Pipe time (s) Improvement (%)
Graph500 PBFS 0.914772 0.777784 17.6%

The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (i.e. the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes, and there are the results.

Benchmark Port/chan time (s) Pipe time (s) Improvement (%)
Shootout K-Nucleotide 4.335 3.125 38.7%

Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. In the case of the shootout-threadring benchmark, it actually got significantly slower when I moved to pipes. The thread ring benchmark seems to mostly be measuring the time to switch between tasks, as only one should be runnable at any given time. My hypothesis is that because message passing got faster, this test now hammers the scheduler synchronization code harder, leading to more slowdown due to contention. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should improve this benchmark as well.

Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.

In my last post, we saw how to get a parallel speedup on a breadth first search in Rust. One major flaw with this was that we had to use unsafe pointers all over the place to prevent from copying large data structures around. Given that Rust’s slogan is “a safe, concurrent, practical language,” it would be terribly sad to sacrifice safety in order to get concurrency. Fortunately, the tricks we were doing before were mostly safe. A human reviewer could see without too much trouble that the data we were sharing was immutable (so we wouldn’t have to worry about race conditions), and that we were careful to manage task and data lifetimes so we wouldn’t have a use-after-free error. Now the goal is to teach the compiler to verify the safety of these patterns on its own.

There are two safety properties we are concerned with:

  1. Only immutable (i.e. constant) data can be shared.
  2. Shared data must remain live as long as any task might access it.

For point (1), we have solved this by introducing a const kind. For the purposes of Rust, constant types are basically those which do not contain the mut annotation anywhere in them. Getting the details of this just right takes some care, and we will probably have to tweak the rules as time goes on. One current weakness is that only item functions (i.e. those without closures) count as const. Many Rust types already end up being const, which probably falls out of Rust’s immutable by default philosophy.

For (2), we’ve added a new module to the library called arc. This stands for atomic reference counting. This is a somewhat unfortunate name, as it is so similar to the more common automatic reference counting. We’d be happy to entertain suggestions for better names. The idea behind ARC is that it wraps over some piece of constant data. The ARC is non-copyable, but it can be sent (this required another tweak to Rust’s kind system to allow sendable resources). When you want to access the data the ARC is managing, you can ask it for a reference. Importantly, ARC provides a clone operation, which increments the reference count and returns another ARC pointing to the same data. When the ARC object goes out of scope, it decrements the reference count and potentially frees the data it is wrapping. ARCs can be safely shared between tasks because all reference counting is done using atomic operations. Additionally, Rust’s region system ensures that the ARC object outlives any references to the data inside of the ARC.

We were able to implement ARC entirely in the Rust Standard Library, with a little bit of runtime support for atomic operatons. ARC itself required some unsafe code in its implementation, but this fits with the “practical” part of Rust. We can implement limited parts of the program using unsafe primitives, and bless them as safe through manual inspection. Users can then build safe programs from the components in the standard library. Indeed, the actual parallel breadth first code no longer needs the unsafe keyword.

Along the way, Niko and I were also able to solve some annoying compiler bugs that make our standard library much more functional. The BFS code is also in the main Rust tree now, and later today I’m hoping to land a few more changes to the libraries to make building other parallel programs easier.

I’m happy to report that my parallel breadth first search program in Rust gets a 2.8 to 3x speedup now on my quad core MacBook Pro. While not in the ideal 4-8x range, I’m pretty satisfied with a 3x speedup. As you recall from the last post, the majority of the time was still spent doing memory allocation and copies. I tried several things to finally get a speedup, only one of which had much noticeable effect:

  1. Allocate and update the result storage in place
  2. Reuse tasks
  3. Share more with unsafe pointers

Allocate and update in place

Allocation in place was my hypothesis from the last post. Basically, you allocate a big mutable vector for the results ahead of time. Then, each of the worker tasks write their results directly into the finally location, rather than incurring a bunch of memory copy costs to aggregate the results as a second step. Part of the reason why this seems like a good idea comes from plot below of CPU usage over time.

Profile showing periodic CPU spikes

The benchmark searches for 64 randomly selected keys. This graph shows about one and a half of the searches. Each search consists of a parallel step to update all the vertex colors (from white to gray to black), and then a synchronization step to aggregation the results together. This repeats until no more nodes are gray. Traversing the whole graph should take about the diameter of the graph, which is usually about 4. This matches what we see on the graph. There are a bunch of spikes all the cores are busy, followed by some sequential synchronization time. Each search actually consists of 8 spikes because there is a parallel step to test if we’re done and a parallel step to do the actual work.

The idea was that the sequential time between each spike came from appending all the intermediate result vectors together. By allocating in place, we should have been able to avoid this cost. Unfortunately, I did not see a noticeable performance improvement by doing this.

Reusing tasks

My next idea was that maybe we were spending too much time in task creation and destruction. To test this, I wrote a task pool, which would use the same tasks to process multiple work items. By design, Rust’s task primitives are very cheap, so this shouldn’t be necessary in practice. Once again, this transformation did not noticeably affect performance. This is actually good news, because it means Rust delivers on its goal of making spawning tasks cheap enough that developers shouldn’t worry about spawning too many.

Share more with unsafe pointers

After two failed attempts, Patrick Walton and I took a more in depth look at the profiles. We discovered that compiling with the -g flag makes Instruments able to show us a lot more useful information about where we are spending our time. Patrick’s diagnosis was that we were spending too much time allocating and copying, and he advised I try to track down the expensive copies. The profiles showed much of the time was spent in the beginning and end of the parallel block, meaning the problems were probably do to copying and freeing things in the closure. There are two fairly large structures that this code used: the edge list and the color vector. I replaced these with unsafe pointers to ensure they were actually shared rather than copied. Finally, I saw a parallel speedup! The edge list was a couple hundred megabytes, so this explains why copying it was rather expensive.

Future directions

My experience again shows several ways we can improve Rust. The most obvious is that we need a way to share data between tasks; copying is just too expensive. For the most part, this isn’t a problem since we would only be sharing read-only data. The trickiness comes in when we have to decide when to reclaim the shared data. One way of doing this is to use the patient parent pattern. A parent can spawn several parallel tasks which each get read-only access to all of the parent’s data. The parent suspends execution until all of the children complete, at which point the parent would free the data through the usual means. Here the children do not worry about freeing these structures at all.

Another approach is to use atomic reference counting for data that is shared between tasks. This gives more flexibility, because the lifetimes of tasks are not strictly bounded by their parents. It does mean we pay the cost of atomic reference counting, but this will hopefully not be bad since we only have to update reference counts when data structures cross task boundaries.

Finally, there are some smaller changes that would be nice. I spent a while worrying about time spent appending vectors. It might be worth adding a tree-vector to the standard library, which gives logarithmic time element lookup but constant time appending. Since the root cause of my performance problems was a large implicit copy, it’d be nice to have the compiler warn on potentially large copies, or maybe even make it an error where the programmer must explicitly copy something if this is really desired. As another optimization, it might be worth keeping tasks around once they die and reusing them to hopefully reduce the overhead of creating tasks even further. I have a page about this in the Rust wiki, but we’ll need more evidence that this is a good idea before going for it.

One of the main applications for Rust is the Servo Parallel Browser Project. This means Rust must fundamentally be a parallel language. Indeed, the language already has Erlang-inspired lightweight tasks, and communication using ports and channels. Unfortunately, most of the code that uses these features are tiny microbenchmarks. While these are useful for making sure the low-level mechanics are efficient, they do not provide as much insight into how Rust feels and performs overall as a parallel language. To this end, I decided to begin my summer internship by implementing most of the Graph500 Breadth First Search Benchmark.

I started out by writing a purely sequential version of the benchmark that handled generating the edge list, building a graph structure from the edge list, generating a BFS tree for several randomly chosen keys, and finally doing some validation to give us some idea that the algorithm might be correct. For my graph data structure, I used an adjacency list; each vertex keeps a list of all the vertices it is connected to. I used the basic search algorithm, which you can find on the Wikipedia page for breadth first search.

The initial results seemed fast to me (although I’m not a graph performance expert, so I don’t know how my results compare to other implementations). Once I got to validation, however, things got slow. The algorithm I used for Step 5 of the validation (ensuring that edges in the BFS tree exist in the original graph) is O(N * E), where N is the number of vertices and E is the number of edges in the graph. While the right thing to do would have been to find a more efficient algorithm, I instead decided to throw some parallelism at the validation code instead.

The outer loop of the code in question is just mapping a function over a vector, so I wrote a parallel map. The simplest way is to simply spawn a task for each element in the input vector, but this could lead to the task scheduling overhead destroying any speedup you might get from parallelism. Instead, my parallel map function is controlled by a maximum number of tasks to spawn, and a minimum number of items to process per task. This leads towards coarser granularity for parallelism, which seems like a better fit for multiprocessors.

Sadly, once all this was done, I only had about a 1.4x speedup on my quad-core hyperthreaded machine. This was basically an embarrassingly parallel problem, so I expected a speedup in the 4-8x range. Patrick Walton suggested I look at this using, and here’s what we found:

Graph500 Benchmark Profile in

Around 40% of our time was spent in stack-growth related functions. The first step, then, was to use a larger minimum stack size. Once I ran the test using a 2MB stack size, the stack growth function disappeared from the profile and I was able to get about a 3.4x speedup over using the sequential map. This seemed reasonable to me, so I moved on to parallelizing the actual search algorithm.

Again, not being a graph processing expert, I wasn’t sure what algorithm to use. Most of my web searching led to more complicated algorithms that are highly tuned for certain architectures. I was looking for something simple. Eventually, I found this page which helped clarify things and inspired the approach I ended up taking. My approach involved keeping a color vector, where each vertex is colored either white, gray or black. White nodes are those that we haven’t seen before, gray are seen but not yet explored, and black nodes are the ones that are finished. We start by setting the root of the search to gray, and then keep producing a new color vector until all of the gray nodes have disappeared. In psuedocode, the algorithm works like this:

Initialize color vector
While there are gray nodes
    For each vertex v
        If v is white then
            If any neighbors are gray then
                Turn gray
                Stay white
        Else if v is gray then
            Turn black
        Else if v is black then
            Stay black

Since none of the vertices depend on each other, we can easily run the per-vertex loop in parallel. Unfortunately, doing this led to a 50-100x slowdown. Clearly this is not what we wanted! To try to figure out what went wrong, I once again turned to Instruments, and saw this:

Parallel Graph500 profile showing an inordinate amount of time in a spinlock.

Here we see that we’re spending an absurd amount of time in a spinlock. Drilling down a little further shows that this seems to have to do with allocating and freeing in Rust’s exchange heap. This is likely due to the way Rust restricts sharing between tasks. Each of the parallel blocks needs to access the graph edge list, which Rust must copy into each task. Since this is a fairly large vector of vectors, and vectors are allocated in the exchange heap, this leads to a lot of contention.

So, what next? One approach would be to rewrite the parallel BFS algorithm to avoid these pitfalls. Using a more traditional distributed memory approach, with longer running tasks that are each responsible for a portion of the color vector, could avoid some of these problems. Ideally we could use the code as written and realize that since the edge lists are immutable, Rust can simply share them between all tasks. Niko has some ideas along these lines.

Besides these parallel performance opportunities, there are a few other places where Rust could improve. I had to write my own queue because the standard library’s deque is unusable due to a compiler bug. This bug should be fixed soon though. This bug also meant I had to include my new parallel module in my benchmark code rather than adding it to the standard library. Second, even in the more optimized versions of the validation code, we end up spending a lot of time in compare glue. The reason is that the edge list is a vector of tuples of two uints, and we spend most of our time testing for membership. For simple scalar types, these comparisons are very fast. For tuples, however, Rust falls back on a slower shape interpreter. Ideally, for simple tuples, Rust or LLVM would generate optimized comparison code.

Overall, this has been a fun experience, and has given a lot of opportunities to improve Rust!

The code for this benchmark is currently available on a branch in my Rust fork.