Archive for May, 2012

In my last post, we saw how to get a parallel speedup on a breadth first search in Rust. One major flaw with this was that we had to use unsafe pointers all over the place to avoid copying large data structures around. Given that Rust’s slogan is “a safe, concurrent, practical language,” it would be terribly sad to sacrifice safety in order to get concurrency. Fortunately, the tricks we were doing before were mostly safe. A human reviewer could see without too much trouble that the data we were sharing was immutable (so we wouldn’t have to worry about race conditions), and that we were careful to manage task and data lifetimes so we wouldn’t have a use-after-free error. Now the goal is to teach the compiler to verify the safety of these patterns on its own.

There are two safety properties we are concerned with:

  1. Only immutable (i.e. constant) data can be shared.
  2. Shared data must remain live as long as any task might access it.

We solved point (1) by introducing a const kind. For the purposes of Rust, constant types are basically those that do not contain the mut annotation anywhere in them. Getting the details of this just right takes some care, and we will probably have to tweak the rules as time goes on. One current weakness is that only item functions (i.e. functions that are not closures) count as const. Many Rust types already end up being const, which probably falls out of Rust’s immutable-by-default philosophy.
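
The flavor of the rule is roughly the sketch below, written in present-day Rust, where the Sync auto trait plays a loosely similar role to the const kind (it rules out types with non-thread-safe interior mutability such as Cell); the helper function here is made up purely for illustration.

    use std::cell::Cell;

    // Hypothetical helper: only data that is safe to share between tasks may
    // be passed in. `Sync` stands in (loosely) for the `const` kind.
    fn share_between_tasks<T: Sync>(_data: &T) {}

    fn main() {
        let immutable = vec![1u64, 2, 3];
        share_between_tasks(&immutable); // ok: no mutability anywhere inside

        let cell = Cell::new(5u64);
        let _ = cell.get();
        // share_between_tasks(&cell); // rejected: Cell has interior mutability
    }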

For (2), we’ve added a new module to the library called arc. This stands for atomic reference counting. This is a somewhat unfortunate name, as it is so similar to the more common automatic reference counting. We’d be happy to entertain suggestions for better names. The idea behind ARC is that it wraps a piece of constant data. The ARC is non-copyable, but it can be sent (this required another tweak to Rust’s kind system to allow sendable resources). When you want to access the data the ARC is managing, you can ask it for a reference. Importantly, ARC provides a clone operation, which increments the reference count and returns another ARC pointing to the same data. When the ARC object goes out of scope, it decrements the reference count and potentially frees the data it is wrapping. ARCs can be safely shared between tasks because all reference counting is done using atomic operations. Additionally, Rust’s region system ensures that the ARC object outlives any references to the data inside the ARC.
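
The usage pattern looks roughly like the sketch below, written against present-day std::sync::Arc (the descendant of this module); the 2012 API and syntax differed, so treat the details as illustrative.

    use std::sync::Arc;
    use std::thread;

    fn main() {
        // Wrap a large, immutable structure in an atomically reference-counted box.
        let edges: Arc<Vec<(u64, u64)>> = Arc::new(vec![(0, 1), (1, 2), (2, 0)]);

        let handles: Vec<_> = (0..4)
            .map(|id| {
                // clone() bumps the reference count; the data itself is not copied.
                let edges = Arc::clone(&edges);
                thread::spawn(move || {
                    // Each task reads the shared data in place.
                    println!("task {} sees {} edges", id, edges.len());
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
        // When the last reference goes out of scope, the vector is freed.
    }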

We were able to implement ARC entirely in the Rust Standard Library, with a little bit of runtime support for atomic operations. ARC itself required some unsafe code in its implementation, but this fits with the “practical” part of Rust. We can implement limited parts of the program using unsafe primitives, and bless them as safe through manual inspection. Users can then build safe programs from the components in the standard library. Indeed, the actual parallel breadth first search code no longer needs the unsafe keyword.

Along the way, Niko and I were also able to fix some annoying compiler bugs, which makes our standard library much more functional. The BFS code is also in the main Rust tree now, and later today I’m hoping to land a few more changes to the libraries to make building other parallel programs easier.

I’m happy to report that my parallel breadth first search program in Rust now gets a 2.8 to 3x speedup on my quad core MacBook Pro. While not in the ideal 4-8x range, I’m pretty satisfied with a 3x speedup. As you recall from the last post, the majority of the time was still spent doing memory allocation and copies. I tried several things to finally get a speedup, only one of which had a noticeable effect:

  1. Allocate and update the result storage in place
  2. Reuse tasks
  3. Share more with unsafe pointers

Allocate and update in place

Allocation in place was my hypothesis from the last post. Basically, you allocate a big mutable vector for the results ahead of time. Then, each of the worker tasks writes its results directly into the final location, rather than incurring a bunch of memory copy costs to aggregate the results as a second step. Part of the reason why this seems like a good idea comes from the plot below of CPU usage over time.

Profile showing periodic CPU spikes

The benchmark searches for 64 randomly selected keys. This graph shows about one and a half of the searches. Each search consists of a parallel step to update all the vertex colors (from white to gray to black), and then a synchronization step to aggregate the results. This repeats until no more nodes are gray. Traversing the whole graph should take a number of steps roughly equal to the diameter of the graph, which is usually about 4. This matches what we see on the graph. There are a bunch of spikes where all the cores are busy, followed by some sequential synchronization time. Each search actually consists of 8 spikes because there is a parallel step to test if we’re done and a parallel step to do the actual work.

The idea was that the sequential time between each spike came from appending all the intermediate result vectors together. By allocating in place, we should have been able to avoid this cost. Unfortunately, I did not see a noticeable performance improvement by doing this.
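
For reference, the allocate-in-place scheme is roughly the sketch below, in present-day Rust, with scoped threads and disjoint mutable chunks standing in for the 2012 task primitives; process is a made-up stand-in for the per-vertex work.

    use std::thread;

    // Made-up stand-in for the per-vertex work.
    fn process(v: usize) -> u64 {
        (v as u64) * 2
    }

    fn main() {
        let n = 1_000_000;
        let num_tasks = 4;
        // Allocate the result storage once, up front.
        let mut results = vec![0u64; n];
        let chunk = (n + num_tasks - 1) / num_tasks;

        thread::scope(|s| {
            // Hand each worker a disjoint slice of the final storage; results
            // are written in place, so no aggregation pass is needed afterwards.
            for (i, out) in results.chunks_mut(chunk).enumerate() {
                s.spawn(move || {
                    for (j, slot) in out.iter_mut().enumerate() {
                        *slot = process(i * chunk + j);
                    }
                });
            }
        });
    }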

Reusing tasks

My next idea was that maybe we were spending too much time in task creation and destruction. To test this, I wrote a task pool, which would use the same tasks to process multiple work items. By design, Rust’s task primitives are very cheap, so this shouldn’t be necessary in practice. Once again, this transformation did not noticeably affect performance. This is actually good news, because it means Rust delivers on its goal of making spawning tasks cheap enough that developers shouldn’t worry about spawning too many.
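
The shape of the pool is roughly the sketch below. The pool I actually wrote was built on Rust tasks; this version uses present-day threads, a channel, and a mutex instead, so treat it as illustrative.

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    fn main() {
        // Work items are boxed closures sent over a shared channel.
        let (tx, rx) = mpsc::channel::<Box<dyn FnOnce() + Send>>();
        let rx = Arc::new(Mutex::new(rx));

        // A fixed set of workers pulls items off the channel until it closes.
        let workers: Vec<_> = (0..4)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Take one job; the lock is released before the job runs.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // channel closed: no more work
                    }
                })
            })
            .collect();

        for i in 0..16 {
            tx.send(Box::new(move || println!("work item {}", i))).unwrap();
        }
        drop(tx); // close the channel so the workers can exit

        for w in workers {
            w.join().unwrap();
        }
    }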

Share more with unsafe pointers

After two failed attempts, Patrick Walton and I took a more in-depth look at the profiles. We discovered that compiling with the -g flag lets Instruments show us a lot more useful information about where we are spending our time. Patrick’s diagnosis was that we were spending too much time allocating and copying, and he advised me to try to track down the expensive copies. The profiles showed much of the time was spent at the beginning and end of the parallel block, meaning the problems were probably due to copying and freeing things in the closure. There are two fairly large structures that this code used: the edge list and the color vector. I replaced these with unsafe pointers to ensure they were actually shared rather than copied. Finally, I saw a parallel speedup! The edge list was a couple hundred megabytes, so this explains why copying it was rather expensive.
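
For the curious, the sharing trick looks roughly like the sketch below, in present-day Rust; the wrapper type and the manual Send impl are illustrative, and the justification is exactly the by-hand reasoning described earlier: the data is never mutated, and it outlives every task that reads it.

    use std::thread;

    // Raw-pointer wrapper so the pointer may cross task boundaries.
    struct SharedEdges(*const Vec<Vec<usize>>);
    // Safety (argued by hand): the data is read-only and the parent keeps it
    // alive until every task has been joined.
    unsafe impl Send for SharedEdges {}

    fn main() {
        let edges: Vec<Vec<usize>> = vec![vec![1, 2], vec![0], vec![0]];
        let ptr = SharedEdges(&edges as *const _);

        let handles: Vec<_> = (0..2)
            .map(|id| {
                let ptr = SharedEdges(ptr.0);
                thread::spawn(move || {
                    // Dereference in place instead of copying the whole structure.
                    let edges: &Vec<Vec<usize>> = unsafe { &*ptr.0 };
                    println!("task {}: vertex 0 has {} neighbors", id, edges[0].len());
                })
            })
            .collect();

        for h in handles {
            h.join().unwrap();
        }
        // `edges` is freed only here, after every task is done with it.
    }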

Future directions

My experience again shows several ways we can improve Rust. The most obvious is that we need a way to share data between tasks; copying is just too expensive. For the most part, this isn’t a problem since we would only be sharing read-only data. The trickiness comes in when we have to decide when to reclaim the shared data. One way of doing this is to use the patient parent pattern. A parent can spawn several parallel tasks which each get read-only access to all of the parent’s data. The parent suspends execution until all of the children complete, at which point the parent would free the data through the usual means. Here the children do not worry about freeing these structures at all.
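
Present-day Rust’s scoped threads capture this patient parent pattern directly; a minimal sketch with illustrative data:

    use std::thread;

    fn main() {
        // The parent owns the data; the children only borrow it read-only.
        let edges: Vec<Vec<usize>> = vec![vec![1, 2], vec![0, 2], vec![0, 1]];

        thread::scope(|s| {
            for id in 0..3 {
                let edges = &edges;
                s.spawn(move || {
                    println!("task {} sees {} vertices", id, edges.len());
                });
            }
            // The scope does not end until every child has finished...
        });
        // ...so the parent can free `edges` afterwards through the usual means.
    }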

Another approach is to use atomic reference counting for data that is shared between tasks. This gives more flexibility, because the lifetimes of tasks are not strictly bounded by their parents. It does mean we pay the cost of atomic reference counting, but this will hopefully not be bad since we only have to update reference counts when data structures cross task boundaries.

Finally, there are some smaller changes that would be nice. I spent a while worrying about time spent appending vectors. It might be worth adding a tree-vector to the standard library, which gives logarithmic time element lookup but constant time appending. Since the root cause of my performance problems was a large implicit copy, it’d be nice to have the compiler warn on potentially large copies, or maybe even make it an error where the programmer must explicitly copy something if this is really desired. As another optimization, it might be worth keeping tasks around once they die and reusing them to hopefully reduce the overhead of creating tasks even further. I have a page about this in the Rust wiki, but we’ll need more evidence that this is a good idea before going for it.

I’ve continued tuning the Graph500 code I talked about yesterday. I still haven’t managed to achieve a speedup over the sequential version, but I’m now running about an order of magnitude faster than I was yesterday.

If you recall, the last profile from yesterday showed that we were spending an enormous amount of time waiting on a spinlock related to malloc and free. The problem was basically that the closure we were passing to par::mapi included the adjacency lists for the whole graph. In order to guarantee safety, Rust was copying the adjacency lists into each task. Since these are immutable, however, we should have been able to share them in place.

It turns out Rust has a way to let you do this by dropping into the unsafe portions of the language. Now, instead of duplicating everything the closure captures, the closure we pass to the spawned tasks just carries an unsafe pointer to the adjacency lists. The tasks dereference it in place, and we avoid the copy. Now, instead of having a 50-100x slowdown, we’re only at a 4-5x slowdown. The profile looks like this:

Graph500 profile showing very little time in the malloc spinlocks.

The troublesome spinlock from before only accounts for 3.7% of our time. The 4th function, which accounts for 5.3% of the time, is the main worker for the breadth first search. Ideally, this would account for the majority of the time. The top two functions, however, appear to have to do with allocating and zeroing memory.

I also tried using unsafe pointers to the input vectors, but this did not make a material difference in performance. I suspect the bzero and memmove time mostly comes from aggregating the results from each of the parallel tasks. The next obvious optimization is to pre-allocate the result space and update it in place.

One of the main applications for Rust is the Servo Parallel Browser Project. This means Rust must fundamentally be a parallel language. Indeed, the language already has Erlang-inspired lightweight tasks, and communication using ports and channels. Unfortunately, most of the code that uses these features consists of tiny microbenchmarks. While these are useful for making sure the low-level mechanics are efficient, they do not provide as much insight into how Rust feels and performs overall as a parallel language. To this end, I decided to begin my summer internship by implementing most of the Graph500 Breadth First Search Benchmark.

I started out by writing a purely sequential version of the benchmark that handled generating the edge list, building a graph structure from the edge list, generating a BFS tree for several randomly chosen keys, and finally doing some validation to give us some idea that the algorithm might be correct. For my graph data structure, I used an adjacency list; each vertex keeps a list of all the vertices it is connected to. I used the basic search algorithm, which you can find on the Wikipedia page for breadth first search.
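
Concretely, the representation is roughly the following (a sketch in present-day Rust syntax; the type names are illustrative):

    // The graph is an adjacency list: one Vec of neighbor indices per vertex.
    type EdgeList = Vec<(usize, usize)>;
    type Graph = Vec<Vec<usize>>;

    fn build_graph(n_vertices: usize, edges: &EdgeList) -> Graph {
        let mut graph: Graph = vec![Vec::new(); n_vertices];
        for &(u, v) in edges {
            // Undirected graph: record each edge in both directions.
            graph[u].push(v);
            graph[v].push(u);
        }
        graph
    }

    fn main() {
        let edges: EdgeList = vec![(0, 1), (1, 2), (0, 2)];
        let graph = build_graph(3, &edges);
        assert_eq!(graph[0], vec![1, 2]);
    }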

The initial results seemed fast to me (although I’m not a graph performance expert, so I don’t know how my results compare to other implementations). Once I got to validation, however, things got slow. The algorithm I used for Step 5 of the validation (ensuring that edges in the BFS tree exist in the original graph) is O(N * E), where N is the number of vertices and E is the number of edges in the graph. While the right thing to do would have been to find a more efficient algorithm, I decided to throw some parallelism at the validation code instead.

The outer loop of the code in question is just mapping a function over a vector, so I wrote a parallel map. The simplest approach is to spawn a task for each element in the input vector, but then the task scheduling overhead could destroy any speedup you might get from parallelism. Instead, my parallel map function is controlled by a maximum number of tasks to spawn and a minimum number of items to process per task. This leads to coarser-grained parallelism, which seems like a better fit for multiprocessors.
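
A sketch of that parallel map appears below, in present-day Rust with scoped threads; the signature and names are illustrative (the real function was written against the 2012 task API), but it shows the chunking controlled by a maximum task count and a minimum granularity.

    use std::thread;

    // Spawn at most `max_tasks` workers and give each at least
    // `min_granularity` items, so scheduling overhead stays small relative
    // to the useful work. Squaring stands in for the real per-element function.
    fn parallel_map(input: &[u64], max_tasks: usize, min_granularity: usize) -> Vec<u64> {
        let chunk_size = (input.len() / max_tasks + 1).max(min_granularity);
        thread::scope(|s| {
            let handles: Vec<_> = input
                .chunks(chunk_size)
                .map(|items| {
                    s.spawn(move || items.iter().map(|&x| x * x).collect::<Vec<u64>>())
                })
                .collect();
            // Join in order and append the partial results.
            handles
                .into_iter()
                .flat_map(|h| h.join().unwrap())
                .collect()
        })
    }

    fn main() {
        let input: Vec<u64> = (0..1000).collect();
        let squares = parallel_map(&input, 8, 64);
        assert_eq!(squares[10], 100);
    }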

Sadly, once all this was done, I only had about a 1.4x speedup on my quad-core hyperthreaded machine. This was basically an embarrassingly parallel problem, so I expected a speedup in the 4-8x range. Patrick Walton suggested I look at this using Instruments.app, and here’s what we found:

Graph500 Benchmark Profile in Instruments.app

Around 40% of our time was spent in stack-growth related functions. The first step, then, was to use a larger minimum stack size. Once I ran the test using a 2MB stack size, the stack growth function disappeared from the profile and I was able to get about a 3.4x speedup over using the sequential map. This seemed reasonable to me, so I moved on to parallelizing the actual search algorithm.
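
Requesting a bigger stack is a one-liner; roughly, in present-day Rust (where threads get a fixed-size stack, unlike the growable stacks of 2012-era tasks):

    use std::thread;

    fn main() {
        // Spawn a worker with a 2 MB stack so it never needs to grow.
        let handle = thread::Builder::new()
            .stack_size(2 * 1024 * 1024)
            .spawn(|| {
                // ... the parallel work would go here ...
            })
            .unwrap();
        handle.join().unwrap();
    }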

Again, not being a graph processing expert, I wasn’t sure what algorithm to use. Most of my web searching led to more complicated algorithms that are highly tuned for certain architectures. I was looking for something simple. Eventually, I found this page, which helped clarify things and inspired the approach I ended up taking. My approach involved keeping a color vector, where each vertex is colored either white, gray or black. White nodes are those that we haven’t seen before, gray nodes are those we have seen but not yet explored, and black nodes are the ones that are finished. We start by setting the root of the search to gray, and then keep producing a new color vector until all of the gray nodes have disappeared. In pseudocode, the algorithm works like this:

Initialize color vector
While there are gray nodes
    For each vertex v
        If v is white then
            If any neighbors are gray then
                Turn gray
            Else
                Stay white
        Else if v is gray then
            Turn black
        Else if v is black then
            Stay black
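
A direct transcription of this step into Rust looks roughly like the sketch below (present-day syntax; the Color enum, graph representation, and driver loop are illustrative):

    #[derive(Clone, Copy, PartialEq)]
    enum Color {
        White, // not yet seen
        Gray,  // seen but not yet explored
        Black, // finished
    }

    // One step of the coloring algorithm: produce a new color vector from the
    // old one. `graph[v]` is the list of v's neighbors.
    fn color_step(graph: &[Vec<usize>], colors: &[Color]) -> Vec<Color> {
        (0..graph.len())
            .map(|v| match colors[v] {
                Color::White => {
                    if graph[v].iter().any(|&u| colors[u] == Color::Gray) {
                        Color::Gray
                    } else {
                        Color::White
                    }
                }
                Color::Gray => Color::Black,
                Color::Black => Color::Black,
            })
            .collect()
    }

    fn main() {
        // Tiny example: the path graph 0 - 1 - 2, with the search rooted at 0.
        let graph = vec![vec![1], vec![0, 2], vec![1]];
        let mut colors = vec![Color::Gray, Color::White, Color::White];
        while colors.iter().any(|&c| c == Color::Gray) {
            colors = color_step(&graph, &colors);
        }
    }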

Since none of the vertices depend on each other, we can easily run the per-vertex loop in parallel. Unfortunately, doing this led to a 50-100x slowdown. Clearly this is not what we wanted! To try to figure out what went wrong, I once again turned to Instruments, and saw this:

Parallel Graph500 profile showing an inordinate amount of time in a spinlock.

Here we see that we’re spending an absurd amount of time in a spinlock. Drilling down a little further shows that this seems to have to do with allocating and freeing in Rust’s exchange heap. This is likely due to the way Rust restricts sharing between tasks. Each of the parallel blocks needs to access the graph edge list, which Rust must copy into each task. Since this is a fairly large vector of vectors, and vectors are allocated in the exchange heap, this leads to a lot of contention.

So, what next? One approach would be to rewrite the parallel BFS algorithm to avoid these pitfalls. Using a more traditional distributed memory approach, with longer running tasks that are each responsible for a portion of the color vector, could avoid some of these problems. Ideally we could use the code as written and realize that since the edge lists are immutable, Rust can simply share them between all tasks. Niko has some ideas along these lines.

Besides these parallel performance opportunities, there are a few other places where Rust could improve. I had to write my own queue because the standard library’s deque is unusable due to a compiler bug. This bug should be fixed soon though. This bug also meant I had to include my new parallel module in my benchmark code rather than adding it to the standard library. Second, even in the more optimized versions of the validation code, we end up spending a lot of time in compare glue. The reason is that the edge list is a vector of tuples of two uints, and we spend most of our time testing for membership. For simple scalar types, these comparisons are very fast. For tuples, however, Rust falls back on a slower shape interpreter. Ideally, for simple tuples, Rust or LLVM would generate optimized comparison code.

Overall, this has been a fun experience, and has given a lot of opportunities to improve Rust!

The code for this benchmark is currently available on a branch in my Rust fork.

Today I started my second internship at Mozilla. I’m pretty excited to be back. Pretty much the whole Rust team from last year is back, and we have a few new full time employees working on Rust this year too. It’s been pretty exciting to see how the language has changed over the last year too. I haven’t done much yet, other than change some comments and find a bug, but it should be a pretty exciting summer. Hopefully I can make posting here a more regular ritual as well.

Oh yeah, the friendly Mozilla dinosaur has a sparkly saddle now!

Over the last semester, Will Byrd, myself, and others have been doing various Atari hacks. These have included things like building a ROM dumper, writing custom ROMs, breadboarding an Atari, and the hack I’m going to talk about here: using the Wii Nunchuk as an Atari controller.

The Atari has two types of controllers. One is a joystick, and the other is a paddle. Both have extremely simple interfaces. The joystick is basically five switches: one for each direction, plus one for the button. The paddle also has a button, along with an analog twist control. The twist control is actually just a 1 MΩ potentiometer. The Atari controller port is a standard DB-9 connector, like the ones used for PC serial ports. The joystick uses a total of 7 wires (five for the controls, one for ground, and one for +5 volts), so only one joystick can connect to each DB-9 connector. The paddle, on the other hand, only needs one wire for the button, one for ground, one for +5 volts, and one for the potentiometer output. Because fewer signals are needed, the Atari designers allowed for two paddles per port, which allows for up to four players in a single game. We found the pinout from pinouts.ru very helpful in figuring all this out.

Emulating buttons (and the joystick, which is really just four buttons) with the Arduino is fairly simple. In order to press the button, we need to do the equivalent of shorting the joystick wire to ground. In the Arduino, this is the same as doing a digitalWrite of LOW on the corresponding pin.

At first glance, emulating the knob on the paddle shouldn’t be too tricky either. After all, it’s just an analog control, so we can use the analogWrite command, right? Sadly, this is not the case. The Arduino’s PWM pins are still digital; they just turn on and off very quickly, and the percentage of the time that the pins are high is how we represent analog values. The Atari paddle uses a potentiometer to vary the voltage instead. The Arduino has no way to do this natively, so we must use something called a digital potentiometer. Digital potentiometers are like regular potentiometers, except you can control them programmatically. We selected an Analog Devices AD5242 digital potentiometer, which is controlled using I2C. I found deciphering the protocol from the datasheet a little tricky, but I’ve posted the code on my Github.

To interface with the Wii Nunchuk, we used the WiiChuck adapter, which also includes sample code. The Nunchuk also communicates using I2C, so we can wire the WiiChuck adapter and the AD5242 on the same I2C bus. This took a slight modification to the WiiChuck code to make it work with other devices at the same time.

After some fiddling around with all of this, we finally got it working. Here’s a video: