Posts Tagged ‘parallel’

Lately, the focus of my research has been on a new programming language called Harlan (that link is for IU students only, sorry), which is a high level language for GPU programming. One important task in this project has been forming a reasonable mental model of how GPUs actually work. I’ve recently come to the conclusion that the model exposed as part of CUDA and OpenCL makes it almost impossible to form a clear picture of what is actually happening in the hardware.

The CUDA documentation gives the impression that an NVIDIA GPU is a mystical processor that is capable of running thousands of threads at once. This leads to a unique programming model. Suppose you want to add two vectors of 10,000 elements together. All you have to do on a GPU is spawn 10,000 threads, and each thread adds one element of the vector. If we wanted each thread to run the function add_vector, we could simply do this:

int block_size = ???;
int num_blocks = ???;
add_vector<<<num_blocks, block_size>>>(n, x, y, z);

This code adds vectors x and y, each of length n, and stores the result in z. Of course, we have already run into some complications. What should block_size and num_blocks be? CUDA partitions all of your threads into a grid of blocks, and each block has a certain number of threads. You can have basically as many blocks as you want, but for some reason the block size (or number of threads per block) cannot be more than 1024.
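The snippet above deliberately leaves both values open. As a rough illustration of the arithmetic involved, here is a minimal sketch in Rust of one common convention: pick a block size, then round the number of blocks up so that every element gets a thread. The function names are my own for this sketch and are not part of CUDA.

```rust
/// Number of blocks needed so that `num_blocks * block_size >= n`.
/// This is plain ceiling division; `block_size` must be non-zero.
fn blocks_for(n: usize, block_size: usize) -> usize {
    assert!(block_size > 0, "block_size must be non-zero");
    (n + block_size - 1) / block_size
}

/// Global index of a thread, mirroring the usual CUDA expression
/// `blockIdx.x * blockDim.x + threadIdx.x`.
fn global_index(block: usize, block_size: usize, thread: usize) -> usize {
    block * block_size + thread
}

/// Number of launched threads that fall past the end of the data.
/// A kernel usually guards against these with a bounds check on the index.
fn idle_threads(n: usize, block_size: usize) -> usize {
    blocks_for(n, block_size) * block_size - n
}
```

With 10,000 elements and a block size of 256, `blocks_for` gives 40 blocks, and `idle_threads` reports 240 threads that have nothing to do.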

What quickly becomes clear is that these so-called thousands of cores that your GPU has are not the same as cores on a CPU. For example, we hear about how at least some of these cores execute in lock step, meaning they must execute the exact same instructions at the same time. Not all threads do though, because you can synchronize threads within a block using the __syncthreads() function. Besides grouping threads into blocks, some kernels also make use of the fact that threads are further subdivided into warps of up to 32 threads. The question is, how do these concepts map onto hardware?

A look at the Fermi Architecture Whitepaper shows that NVIDIA’s Fermi processors are made up of some number of Streaming Multiprocessors (SMs), which each have 32 CUDA cores. The Wikipedia page shows that different boards within the Fermi series have a different number of SMs. One of the new features of the Fermi architecture is the GigaThread™ Thread Scheduler, which apparently provides 10x faster context switching. At SC11, I heard one NVIDIA employee claim that context switching was free.

GPUs are Vector Processors

To me, rather than thinking of GPUs in terms of grids and blocks and warps, it’s best to think of them as vector processors. Vector processors are CPUs that are designed around Single Instruction, Multiple Data (SIMD) instructions. Typical desktop CPUs contain SIMD extensions, such as Intel’s AVX instructions, which allow them to perform some vector operations efficiently, but their focus is still on low latency execution of scalar code. By contrast, vector processors expect most of the computation to be expressed in terms of vector operations, and are optimized to perform these operations as quickly as possible, perhaps even at the expense of scalar performance.

Under this model, each SM on an NVIDIA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on vector registers as a warp. They appear to be 32 threads because each instruction operates on 32 lanes at once, while the threads must proceed in lock step because they are actually a single stream of instructions.

Now, how do CUDA blocks fit with this view? These blocks seem to correspond to a set of warps executing on a single SM. Although an SM is a single core, it can run multiple threads through simultaneous multithreading, or HyperThreading as Intel calls it. Under HyperThreading, two threads can be assigned to a single processor core. The processor then multiplexes resources between the two threads. For example, if one thread is blocked on a memory operation, the CPU can execute instructions from the other thread while the first one waits on memory. Switching between these threads is basically free; it’s just a matter of assigning ready work to available hardware resources. In terms of CUDA blocks, if we divide the maximum number of threads per block (1024) by the number of threads per warp (32), we end up with 32. This suggests that each SM is able to keep around 32 thread (or warp) contexts, and swap between them easily as execution units and data become available.

In summary, we can think of a Fermi GPU as a multicore processor, where each core does 32-way HyperThreading and supports 32-wide vector instructions.
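To make this mental model concrete, here is a small Rust sketch that treats a warp as a single 32-lane vector instruction with a lane mask. It is only a model of the idea in this post, not a description of how Fermi actually executes code, and all of the names are invented for illustration.

```rust
const WARP_WIDTH: usize = 32; // lanes per vector instruction (one warp)
const MAX_BLOCK_THREADS: usize = 1024; // CUDA limit on threads per block

/// Warp contexts implied by a full block: 1024 / 32 = 32.
fn warp_contexts_per_block() -> usize {
    MAX_BLOCK_THREADS / WARP_WIDTH
}

/// One "vector instruction": add up to 32 lanes starting at `base`.
/// Lanes that fall past the end of the data are masked off.
fn warp_add(x: &[f32], y: &[f32], z: &mut [f32], base: usize) {
    let n = x.len().min(y.len()).min(z.len());
    for lane in 0..WARP_WIDTH {
        let i = base + lane;
        if i < n {
            // Lane mask: only active lanes write a result.
            z[i] = x[i] + y[i];
        }
    }
}

/// Add two vectors by issuing one warp-wide instruction per 32 elements.
/// Returns the number of warp instructions issued.
fn vector_add(x: &[f32], y: &[f32], z: &mut [f32]) -> usize {
    let n = x.len().min(y.len()).min(z.len());
    let mut issued = 0;
    let mut base = 0;
    while base < n {
        warp_add(x, y, z, base);
        issued += 1;
        base += WARP_WIDTH;
    }
    issued
}
```

Adding two 100-element vectors takes four warp instructions under this model, with the last one running only 4 of its 32 lanes.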

In order to really verify that this is the case, it would be helpful to see the actual Fermi instruction set. NVIDIA is very secretive about this, instead only publishing a virtual instruction set, PTX. This is understandable, as it means NVIDIA does not have to maintain backwards compatibility between GPU revisions. However, AMD does provide documentation for the actual instruction set for their GPUs. After briefly perusing their latest documentation, it seems that their instruction set is compatible with the idea of GPUs as vector processors.

The benchmarks in my last post had one thing in common: all communication was one sender to one receiver. It’s surprising how often this is sufficient, but sooner or later we are going to need a way to have multiple tasks sending to the same receiver. I’ve been experimenting with two ways of doing many senders to one receiver, and I now have some results to show.

The pipes library includes a select operation. This lets you listen on several receive endpoints simultaneously. Unfortunately, the single-use nature of endpoints makes select a little clunky to use. To help alleviate this, I added a port_set to the library. Port sets allow you to easily treat several receive endpoints as a unit. This allows send to still be very fast, but receive is a little bit slower due to the overhead of setting up and tearing down the select operation. The current implementation for select is O(n) in the number of endpoints, so this works well for small numbers of tasks, but breaks down as things get bigger.
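As a rough picture of what a port set does, here is a sketch written against today's `std::sync::mpsc` rather than the pipes library. The `PortSet` type and its methods are hypothetical stand-ins; the real implementation blocks in select, whereas this one only polls. It does show the linear scan over endpoints that makes receive O(n).

```rust
use std::sync::mpsc::{channel, Receiver, Sender, TryRecvError};

/// Hypothetical stand-in for a port set: several receive endpoints
/// treated as a unit.
struct PortSet<T> {
    ports: Vec<Receiver<T>>,
}

impl<T> PortSet<T> {
    fn new() -> Self {
        PortSet { ports: Vec::new() }
    }

    /// Create a new channel, keep the receiving end, return the sender.
    fn add(&mut self) -> Sender<T> {
        let (tx, rx) = channel();
        self.ports.push(rx);
        tx
    }

    /// Scan every port once: O(n) in the number of endpoints.
    /// Ports that are closed and drained are removed from the set.
    fn try_recv(&mut self) -> Option<T> {
        let mut i = 0;
        while i < self.ports.len() {
            match self.ports[i].try_recv() {
                Ok(msg) => return Some(msg),
                Err(TryRecvError::Empty) => i += 1,
                Err(TryRecvError::Disconnected) => {
                    self.ports.swap_remove(i);
                }
            }
        }
        None
    }

    /// True once every sender has gone away and all messages were read.
    fn is_empty(&self) -> bool {
        self.ports.is_empty()
    }
}
```

Each sender keeps its own uncontended channel, which is why the sending side stays cheap in this design.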

The other option is to slow down the sending end, using something I call a shared_chan. This is a send endpoint wrapped in an exclusive ARC. Now all the senders have to contend with each other, but the receive side is exactly as cheap as before. For cases where you have a lot of senders that send messages relatively infrequently, this will likely outperform the port_set approach, at least until select is faster.
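The shape of a shared_chan can be approximated in current Rust as a send endpoint behind a mutex inside an `Arc`. This is only a sketch under that assumption: `SharedChan`, `shared_chan` and `fan_in` are names I made up for the example, and modern `Sender` values can simply be cloned, so the lock is there purely to mirror the design described above.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for `shared_chan`: one send endpoint behind a
/// lock, shared by atomic reference counting.
type SharedChan<T> = Arc<Mutex<Sender<T>>>;

fn shared_chan<T>() -> (SharedChan<T>, Receiver<T>) {
    let (tx, rx) = channel();
    (Arc::new(Mutex::new(tx)), rx)
}

/// Spawn `senders` threads that each send `per_sender` distinct numbers
/// through the same locked endpoint. Returns the sum the receiver saw.
fn fan_in(senders: u64, per_sender: u64) -> u64 {
    let (chan, port) = shared_chan::<u64>();
    let handles: Vec<_> = (0..senders)
        .map(|id| {
            let chan = Arc::clone(&chan);
            thread::spawn(move || {
                for k in 0..per_sender {
                    // Senders contend on the lock; the receiver never does.
                    chan.lock().unwrap().send(id * per_sender + k).unwrap();
                }
            })
        })
        .collect();
    // Drop our own handle so the channel closes once the senders finish.
    drop(chan);
    for h in handles {
        h.join().unwrap();
    }
    // All senders are gone, so this iterator ends after the last message.
    port.iter().sum()
}
```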

Both of these are sufficient to run the msgsend benchmark that I talked about at the beginning of all of this. Here are the results, combined with the previous numbers.

Language | Messages per second | Comparison
Rust port_set | 881,578 | 232.8%
Scala | 378,740 | 100.0%
Rust port/chan (updated) | 227,020 | 59.9%
Rust shared_chan | 173,436 | 45.8%
Erlang (Bare) | 78,670 | 20.8%
Erlang (OTP) | 76,405 | 20.2%

The most obvious thing is that the port_set version is over twice as fast as Scala, the previous winner. I also re-ran the port/chan version for comparison, and it got a little bit faster. There has been quite a bit of churn in Rust recently, so it’s quite possible that some of those changes showed up here as better performance.

Writing the port_set version proved the most interesting to me. Relying on select ended up relaxing some of the ordering guarantees. Previously, if we had Task A send a message to Task C and then send a message to Task B, and then had Task B wait to receive a message from Task A and then send a message to Task C, we could count on Task C seeing Task A’s message before seeing Task B’s message. With the port_set, this is no longer true, although we still preserve the order of messages sent by a single task. An easy way to work around this, however, was to rely on the closure reporting ability of pipes. The server could tell when a worker would no longer send any more messages because it would detect when the worker closed its end of the pipe.
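The two properties relied on here, per-sender ordering and closure detection, can be illustrated with current Rust channels. This is a sketch using `std::sync::mpsc`, not the pipes API, and the function names are made up for the example. The server loop ends when every worker has dropped its sender, which plays the role of the closure report.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Spawn `workers` threads that each send `0..per_worker`, tagged with
/// their id. Returns all messages in the order the server received them.
fn collect_until_closed(workers: usize, per_worker: usize) -> Vec<(usize, usize)> {
    let (tx, rx) = channel();
    for id in 0..workers {
        let tx = tx.clone();
        thread::spawn(move || {
            for seq in 0..per_worker {
                tx.send((id, seq)).unwrap();
            }
            // `tx` is dropped here, which closes this worker's end.
        });
    }
    // The server keeps no sender of its own.
    drop(tx);
    // Iteration stops once every sender is gone and the queue is drained.
    rx.iter().collect()
}

/// Check that each worker's messages arrived in the order it sent them,
/// regardless of how different workers were interleaved.
fn per_sender_order_holds(log: &[(usize, usize)], workers: usize) -> bool {
    let mut next = vec![0usize; workers];
    log.iter().all(|&(id, seq)| {
        let ok = next[id] == seq;
        next[id] += 1;
        ok
    })
}
```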

I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. This allows, among other things, for a message send/receive fast path that requires only two atomic swaps.

Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a ring. Each task sends a message to its right-hand neighbor, and receives a message from its left-hand neighbor. This repeats for a while. At the end, we look at the total time taken divided by the number of messages. This gives us roughly the fastest we can send and receive a message, modulo some task spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:

Sent 1000000 messages in 0.227634 seconds
  4.39301e+06 messages / second
  0.227634 µs / message

This is about 17x faster!
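For readers who want to see the shape of the ring benchmark, here is a minimal sketch in current Rust. It uses OS threads and `std::sync::mpsc` channels instead of Rust tasks and pipes, so its timings are not comparable with the numbers above. The function name and return value are choices made for this example.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;
use std::time::Instant;

/// Run a ring of `tasks` threads. Each one sends `rounds` messages to its
/// right-hand neighbor and receives `rounds` from its left-hand neighbor.
/// Returns (total messages received, elapsed seconds).
fn message_ring(tasks: usize, rounds: usize) -> (usize, f64) {
    assert!(tasks > 0, "the ring needs at least one task");
    // Channel i carries messages from task i to task (i + 1) % tasks.
    let (senders, receivers): (Vec<Sender<usize>>, Vec<Receiver<usize>>) =
        (0..tasks).map(|_| channel()).unzip();
    let mut receivers: Vec<Option<Receiver<usize>>> =
        receivers.into_iter().map(Some).collect();

    let start = Instant::now();
    let handles: Vec<_> = senders
        .into_iter()
        .enumerate()
        .map(|(i, to_right)| {
            // Task i listens on the channel owned by its left neighbor.
            let from_left = receivers[(i + tasks - 1) % tasks].take().unwrap();
            thread::spawn(move || {
                let mut received = 0usize;
                for round in 0..rounds {
                    to_right.send(round).unwrap();
                    from_left.recv().unwrap();
                    received += 1;
                }
                received
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (total, start.elapsed().as_secs_f64())
}
```

Dividing the elapsed time by the message count gives the per-message figure reported in the benchmark output.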

It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.

To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol; it just spawns a task to compute something, which then sends a single message back to the spawner. Porting this was a relatively small change, yet it got measurable speedups. Here are the results.

Benchmark | Port/chan time (s) | Pipe time (s) | Improvement (%)
Graph500 PBFS | 0.914772 | 0.777784 | 17.6%
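The future protocol described above (spawn a task, get exactly one message back) is simple enough to sketch in current Rust. This is not the code from core::future or std::par; `TaskFuture` and `par_map` are hypothetical names, and the sketch spawns one OS thread per element, which is only reasonable for illustration.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical future: a task that will send exactly one message back.
struct TaskFuture<T> {
    port: Receiver<T>,
}

impl<T: Send + 'static> TaskFuture<T> {
    /// Spawn a thread to compute `f` and send the result to the spawner.
    fn spawn<F>(f: F) -> TaskFuture<T>
    where
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, port) = channel();
        thread::spawn(move || {
            // Ignore the error if the spawner dropped the future early.
            let _ = tx.send(f());
        });
        TaskFuture { port }
    }

    /// Block until the single result message arrives.
    fn get(self) -> T {
        self.port
            .recv()
            .expect("worker exited without sending a result")
    }
}

/// Parallel map built on the future above: one task per element,
/// with results collected in input order.
fn par_map<A, B, F>(items: Vec<A>, f: F) -> Vec<B>
where
    A: Send + 'static,
    B: Send + 'static,
    F: Fn(A) -> B + Send + Clone + 'static,
{
    let futures: Vec<TaskFuture<B>> = items
        .into_iter()
        .map(|item| {
            let f = f.clone();
            TaskFuture::spawn(move || f(item))
        })
        .collect();
    futures.into_iter().map(TaskFuture::get).collect()
}
```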

The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (i.e. the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes, and here are the results.

Benchmark | Port/chan time (s) | Pipe time (s) | Improvement (%)
Shootout K-Nucleotide | 4.335 | 3.125 | 38.7%

Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. In the case of the shootout-threadring benchmark, it actually got significantly slower when I moved to pipes. The thread ring benchmark seems to mostly be measuring the time to switch between tasks, as only one should be runnable at any given time. My hypothesis is that because message passing got faster, this test now hammers the scheduler synchronization code harder, leading to more slowdown due to contention. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should improve this benchmark as well.

Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.

About a month ago, I posted that I was going to be working on improving Rust’s message passing performance. I quickly threw together a prototype of a new communication system based on a shared queue protected by a mutex. This was about twice as fast as the existing system, because it removed the global mutex from the messaging paths. This prototype hurt expressiveness somewhat, and still it seemed we could do a lot better.

Rust has some extremely powerful features in its type system. The fact that it can deal with concepts like uniqueness, initialization status, copyability, and other traits means we can encode some very powerful invariants. Thus, I took some inspiration from the Singularity OS and set out to see if I could encode something like channel contracts in Rust. The result is a proposal for a feature I’m calling pipes.

The way pipes work is that when you create a pipe you get two endpoints that are forever entangled together. One endpoint can send one message, and the other endpoint can receive that one message. Sending and receiving destroys the endpoint, but the operation also produces a new endpoint to continue the communication. Endpoints have a state associated with them, which specifies which messages can be sent or received. This information is encoded in the type system, so Rust can statically guarantee that no task will send a message that is not legal in the given state. Pipes are not copyable; they are always for 1:1 communication. However, endpoints can be sent between tasks.

Critical to pipes is the associated protocol specification. Protocols have two views: the client and the server. Protocols are always written from the perspective of the client. This decision was arbitrary, but in general it makes sense to only write down one side of the protocol. The other perspective is generated by reversing the direction of all the messages. Here’s an example of what I’m envisioning for a protocol specification.

proto! bank {
    login:send {
        login(username, password) -> login_response
    }

    login_response:recv {
        ok -> connected,
        invalid -> login
    }

    connected:send {
        deposit(money) -> connected,
        withdrawal(amount) -> withdrawal_response
    }

    withdrawal_response:recv {
        money(money) -> connected,
        insufficient_funds -> connected
    }
}

This describes the protocol you might use in an online banking situation. The protocol has four states (login, login_response, connected and withdrawal_response), each one annotated with whether the client is allowed to send or receive in that state. In this case, a client would start out in the login state, where the client can attempt to log in with a username and password. After sending a login message, the protocol enters the login_response state, where the server informs the client that either the login succeeded (in which case the protocol transitions to the connected state), or the login failed, in which case the protocol returns to the login state and the client can retry.

From the connected state, the client can try to deposit or withdraw money. We assume that depositing money never fails, so sending a deposit message results in the protocol staying in the connected state. On the other hand, withdrawal can fail, for example, if the account does not have enough money. To model this, sending a withdrawal message results in the protocol going to the withdrawal_response state. Here, the client waits to either receive the requested money, or for a message saying there was not enough money in the account. In both cases, we end up back in the connected state.

Below is a code example showing how a client might use this protocol.

fn bank_client(+bank: bank::client::login) {
    import bank::*;

    let bank = client::login(bank, "theincredibleholk", "1234");
    let bank = alt recv(bank) {
      some(ok(connected)) {
        #move(connected)
      }
      some(invalid(_)) { fail "login unsuccessful" }
      none { fail "bank closed the connection" }
    };

    let bank = client::deposit(bank, 100.00);
    let bank = client::withdrawal(bank, 50.00);
    alt recv(bank) {
      some(money(m, _)) {
        io::println("Yay! I got money!");
      }
      some(insufficient_funds(_)) {
        fail "someone stole my money"
      }
      none {
        fail "bank closed the connection"
      }
    }
}
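The client above is written in the Rust of the time. As a rough approximation in current Rust, the same state machine can be hand-written with one type per protocol state, where each operation consumes the endpoint and returns the next state. Everything below (type names, the toy server, the password check) is an assumption made for this sketch; it is not output of the proto! macro.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Wire messages, one enum per direction.
enum ToServer { Login(String, String), Deposit(u64), Withdrawal(u64) }
enum ToClient { LoginOk, LoginInvalid, Money(u64), InsufficientFunds }

// The raw connection; each state type below wraps it.
struct Conn { tx: Sender<ToServer>, rx: Receiver<ToClient> }

// One type per client-side protocol state.
struct Login(Conn);
struct LoginResponse(Conn);
struct Connected(Conn);
struct WithdrawalResponse(Conn);

impl Login {
    // login:send { login(username, password) -> login_response }
    fn login(self, user: &str, pass: &str) -> LoginResponse {
        self.0.tx.send(ToServer::Login(user.into(), pass.into())).unwrap();
        LoginResponse(self.0)
    }
}

impl LoginResponse {
    // login_response:recv { ok -> connected, invalid -> login }
    fn recv(self) -> Result<Connected, Login> {
        let reply = self.0.rx.recv().unwrap();
        match reply {
            ToClient::LoginOk => Ok(Connected(self.0)),
            _ => Err(Login(self.0)),
        }
    }
}

impl Connected {
    // deposit(money) -> connected
    fn deposit(self, amount: u64) -> Connected {
        self.0.tx.send(ToServer::Deposit(amount)).unwrap();
        self
    }
    // withdrawal(amount) -> withdrawal_response
    fn withdrawal(self, amount: u64) -> WithdrawalResponse {
        self.0.tx.send(ToServer::Withdrawal(amount)).unwrap();
        WithdrawalResponse(self.0)
    }
}

impl WithdrawalResponse {
    // money(money) -> connected, insufficient_funds -> connected
    fn recv(self) -> (Option<u64>, Connected) {
        let reply = self.0.rx.recv().unwrap();
        let got = match reply { ToClient::Money(m) => Some(m), _ => None };
        (got, Connected(self.0))
    }
}

/// A toy server thread so the client has something to talk to.
fn start_bank() -> Login {
    let (to_server, server_rx) = channel();
    let (to_client, client_rx) = channel();
    thread::spawn(move || {
        let mut balance = 0u64;
        for msg in server_rx {
            let reply = match msg {
                ToServer::Login(_, pass) if pass == "1234" => Some(ToClient::LoginOk),
                ToServer::Login(..) => Some(ToClient::LoginInvalid),
                ToServer::Deposit(n) => { balance += n; None }
                ToServer::Withdrawal(n) if n <= balance => { balance -= n; Some(ToClient::Money(n)) }
                ToServer::Withdrawal(_) => Some(ToClient::InsufficientFunds),
            };
            if let Some(r) = reply {
                if to_client.send(r).is_err() { break; }
            }
        }
    });
    Login(Conn { tx: to_server, rx: client_rx })
}

/// Walk the protocol once: log in, deposit 100, then withdraw `amount`.
fn deposit_then_withdraw(pass: &str, amount: u64) -> Option<u64> {
    let bank = start_bank().login("theincredibleholk", pass);
    let bank = bank.recv().ok()?;
    let (got, _bank) = bank.deposit(100).withdrawal(amount).recv();
    got
}
```

Because `deposit` only exists on `Connected`, calling it on a `Login` value is a compile error, which is the kind of static guarantee the protocol specification is meant to provide.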

All of the code in this post works on the latest Rust compiler as of this morning. I’ve also started transitioning some of our benchmarks to the new pipe system, and the results have been impressive. I’ll have a post diving into the performance of pipes soon.

Previously, I talked about a couple of ways to improve the message passing performance in Rust. The obvious bottleneck was that sending and receiving both involved taking a lock that was shared between all tasks. In order to remove this lock, I set out to write a new port/channel system that relied much less on the runtime library.

In order to do this, we first needed a variant of the atomic reference counter that allows us to share mutable data. We accomplish this by adding a mutex and condition variable. The mutex is your standard pthread mutex, while we’ve implemented our own condition variable that the Rust scheduler is aware of. Because we’re using a standard pthread mutex, using this exclusive access is unsafe and should be used with care; if you’re not careful, you could deadlock Rust. Fortunately, the API for Rust’s low-level locks and condition variables makes it harder to accidentally hold a lock for unbounded amounts of time.

Once we have this, ports and channels simply become a locked atomically reference-counted queue. There’s no more global lock, and things are far simpler because we only need the Rust runtime support for providing mutual exclusion and condition variable signalling. This part didn’t take long to implement, and I was eager to try it out. I spent a little while writing up a new benchmark that would really show the benefit of avoiding the global lock, and when I went to run it, things crashed.
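Here is a minimal sketch of that idea in current Rust: a queue behind a mutex, a condition variable to wake the receiver, and an `Arc` to share the whole thing. It uses the standard library's `Mutex` and `Condvar` rather than the scheduler-aware primitives described above, and the `Chan`, `Port` and `stream` names are chosen for the example only.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Shared state: a queue guarded by a mutex, plus a condition variable
/// used to wake a blocked receiver.
struct Shared<T> {
    queue: Mutex<VecDeque<T>>,
    ready: Condvar,
}

struct Chan<T>(Arc<Shared<T>>);
struct Port<T>(Arc<Shared<T>>);

/// Create a connected sender and receiver. No global lock is involved.
fn stream<T>() -> (Chan<T>, Port<T>) {
    let shared = Arc::new(Shared {
        queue: Mutex::new(VecDeque::new()),
        ready: Condvar::new(),
    });
    (Chan(Arc::clone(&shared)), Port(shared))
}

impl<T> Chan<T> {
    fn send(&self, msg: T) {
        // The lock guard is released at the end of this statement.
        self.0.queue.lock().unwrap().push_back(msg);
        self.0.ready.notify_one();
    }
}

impl<T> Port<T> {
    /// Block until a message is available.
    fn recv(&self) -> T {
        let mut queue = self.0.queue.lock().unwrap();
        loop {
            if let Some(msg) = queue.pop_front() {
                return msg;
            }
            // Releases the lock while waiting and reacquires it on wakeup.
            queue = self.0.ready.wait(queue).unwrap();
        }
    }
}

/// Send 1..=count from a producer thread and sum what arrives.
fn send_and_sum(count: u64) -> u64 {
    let (chan, port) = stream();
    let producer = thread::spawn(move || {
        for i in 1..=count {
            chan.send(i);
        }
    });
    let total: u64 = (0..count).map(|_| port.recv()).sum();
    producer.join().unwrap();
    total
}
```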

It turns out I had discovered a bug whereby vector addition could copy things that weren’t copyable. I saw two approaches to fixing this: fixing trans (the Rust to LLVM translation pass), or moving vector addition out of trans and into the library. The idea behind the second option is that if we did this, vector addition would be going through the existing and well-tested function call path, rather than a special vector addition codegen path. This seemed like the best option overall, since the first option felt more like a band-aid over the root cause, which is that we have too much functionality duplicated in subtly different ways.

Thus, I set out to move vector addition to libcore. This exposed some subtle semantics issues around const vectors, but these were mostly not too painful to work out. My first working versions were too slow to finish building the compiler in a reasonable amount of time. I was thinking we’d end up taking a 20% to 2x performance hit by doing this change, but thanks to some heroic optimization help from Niko, we got the performance to the point where in some cases the new code even performs ever so slightly better than the old code. In the course of doing these changes, I also discovered another bug that led to us leaking memory, and in the course of fixing that, I discovered a way to make Rust segfault. Ah, the life of a compiler writer.

At any rate, we have fixes to all of these bugs in the works, and things are working well enough to run a benchmark. This test creates a ring of tasks, each of which sends a message to its neighbor on one side and receives from the other side. Here are the numbers for the old messaging system.

Sent 1000000 messages in 3.88114 seconds
  257656 messages / second
  3.88114 μs / message

And here are the numbers for the new system.

Sent 1000000 messages in 1.87881 seconds
  532253 messages / second
  1.87881 μs / message

As you can see, we’re about 2x faster than we were before.

This new system doesn’t yet have the exact same features as the old system, and it needs some help with ergonomics. My next task will be to work on adding these missing things and hopefully be able to incrementally replace the old code with the new, faster version.

The code for this post isn’t yet in the main Rust tree, but it should be landing soon as we squash the bugs we found in the process of doing this new message passing system.

In my last post, we saw how to get a parallel speedup on a breadth first search in Rust. One major flaw with this was that we had to use unsafe pointers all over the place to avoid copying large data structures around. Given that Rust’s slogan is “a safe, concurrent, practical language,” it would be terribly sad to sacrifice safety in order to get concurrency. Fortunately, the tricks we were doing before were mostly safe. A human reviewer could see without too much trouble that the data we were sharing was immutable (so we wouldn’t have to worry about race conditions), and that we were careful to manage task and data lifetimes so we wouldn’t have a use-after-free error. Now the goal is to teach the compiler to verify the safety of these patterns on its own.

There are two safety properties we are concerned with:

  1. Only immutable (i.e. constant) data can be shared.
  2. Shared data must remain live as long as any task might access it.

For point (1), we have solved this by introducing a const kind. For the purposes of Rust, constant types are basically those which do not contain the mut annotation anywhere in them. Getting the details of this just right takes some care, and we will probably have to tweak the rules as time goes on. One current weakness is that only item functions (i.e. those without closures) count as const. Many Rust types already end up being const, which probably falls out of Rust’s immutable by default philosophy.

For (2), we’ve added a new module to the library called arc. This stands for atomic reference counting. This is a somewhat unfortunate name, as it is so similar to the more common automatic reference counting. We’d be happy to entertain suggestions for better names. The idea behind ARC is that it wraps over some piece of constant data. The ARC is non-copyable, but it can be sent (this required another tweak to Rust’s kind system to allow sendable resources). When you want to access the data the ARC is managing, you can ask it for a reference. Importantly, ARC provides a clone operation, which increments the reference count and returns another ARC pointing to the same data. When the ARC object goes out of scope, it decrements the reference count and potentially frees the data it is wrapping. ARCs can be safely shared between tasks because all reference counting is done using atomic operations. Additionally, Rust’s region system ensures that the ARC object outlives any references to the data inside of the ARC.
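The same idea survives in current Rust as `std::sync::Arc`, so the usage pattern can be shown with a short sketch. The function below is an illustration written for this post's scenario (sharing an edge list between workers), not code from the original arc module.

```rust
use std::sync::Arc;
use std::thread;

/// Share an immutable edge list across `workers` threads without copying
/// it. Returns (total edges counted, strong reference count afterwards).
fn count_edges_shared(edges: Vec<(usize, usize)>, workers: usize) -> (usize, usize) {
    assert!(workers > 0, "need at least one worker");
    let edges = Arc::new(edges);
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            // Cloning bumps the atomic count; the edge data is not copied.
            let edges = Arc::clone(&edges);
            thread::spawn(move || {
                // Each worker counts the edges whose source vertex it owns.
                edges.iter().filter(|(src, _)| src % workers == w).count()
            })
        })
        .collect();
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    // Every worker has dropped its clone by now, so only ours remains.
    (total, Arc::strong_count(&edges))
}
```

The data is freed when the last `Arc` goes out of scope, whichever task that happens to be in.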

We were able to implement ARC entirely in the Rust Standard Library, with a little bit of runtime support for atomic operations. ARC itself required some unsafe code in its implementation, but this fits with the “practical” part of Rust. We can implement limited parts of the program using unsafe primitives, and bless them as safe through manual inspection. Users can then build safe programs from the components in the standard library. Indeed, the actual parallel breadth first code no longer needs the unsafe keyword.

Along the way, Niko and I were also able to solve some annoying compiler bugs that make our standard library much more functional. The BFS code is also in the main Rust tree now, and later today I’m hoping to land a few more changes to the libraries to make building other parallel programs easier.

I’m happy to report that my parallel breadth first search program in Rust gets a 2.8 to 3x speedup now on my quad core MacBook Pro. While not in the ideal 4-8x range, I’m pretty satisfied with a 3x speedup. As you recall from the last post, the majority of the time was still spent doing memory allocation and copies. I tried several things to finally get a speedup, only one of which had much noticeable effect:

  1. Allocate and update the result storage in place
  2. Reuse tasks
  3. Share more with unsafe pointers

Allocate and update in place

Allocation in place was my hypothesis from the last post. Basically, you allocate a big mutable vector for the results ahead of time. Then, each of the worker tasks writes its results directly into the final location, rather than incurring a bunch of memory copy costs to aggregate the results as a second step. Part of the reason why this seems like a good idea comes from the plot below of CPU usage over time.

[Figure: profile showing periodic CPU spikes]

The benchmark searches for 64 randomly selected keys. This graph shows about one and a half of the searches. Each search consists of a parallel step to update all the vertex colors (from white to gray to black), and then a synchronization step to aggregate the results together. This repeats until no more nodes are gray. Traversing the whole graph should take about as many steps as the diameter of the graph, which is usually about 4. This matches what we see on the graph. There are a bunch of spikes where all the cores are busy, followed by some sequential synchronization time. Each search actually consists of 8 spikes because there is a parallel step to test if we’re done and a parallel step to do the actual work.
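The white/gray/black stepping described above can be sketched sequentially in a few lines of current Rust. This is a simplified model of the algorithm's structure, not the benchmark code: it assumes an undirected graph stored as a symmetric adjacency list, and the names are invented for the example. Each step computes every vertex's next color from the current colors only, which is what makes the step parallelizable.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Color {
    White, // not yet reached
    Gray,  // on the current frontier
    Black, // finished
}

/// Level-synchronous BFS over a symmetric adjacency list.
/// Returns the final colors and the number of steps taken.
fn bfs_levels(adj: &[Vec<usize>], root: usize) -> (Vec<Color>, usize) {
    let mut colors = vec![Color::White; adj.len()];
    colors[root] = Color::Gray;
    let mut steps = 0;
    // "Are we done?" test: stop when no vertex is gray.
    while colors.iter().any(|&c| c == Color::Gray) {
        // Work step: each vertex's next color depends only on the current
        // colors, so every vertex could be handled by a separate task.
        let next: Vec<Color> = (0..adj.len())
            .map(|v| match colors[v] {
                Color::Gray | Color::Black => Color::Black,
                Color::White => {
                    if adj[v].iter().any(|&u| colors[u] == Color::Gray) {
                        Color::Gray
                    } else {
                        Color::White
                    }
                }
            })
            .collect();
        // Synchronization: publish the new colors for the next step.
        colors = next;
        steps += 1;
    }
    (colors, steps)
}
```

On a four-vertex path graph searched from one end, this takes four steps, one per level.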

The idea was that the sequential time between each spike came from appending all the intermediate result vectors together. By allocating in place, we should have been able to avoid this cost. Unfortunately, I did not see a noticeable performance improvement by doing this.
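For reference, this is roughly what writing results in place looks like in current Rust, where the borrow checker can prove the workers touch disjoint chunks. It is a sketch using `std::thread::scope` and `chunks_mut`, neither of which existed when this experiment was run, and `fill_in_place` is a made-up name.

```rust
use std::thread;

/// Fill `out[i] = f(i)` in parallel. Each worker writes directly into its
/// own disjoint chunk of the preallocated output, so there is no append
/// or merge step afterwards.
fn fill_in_place<F>(out: &mut [u64], workers: usize, f: F)
where
    F: Fn(usize) -> u64 + Sync,
{
    if out.is_empty() {
        return;
    }
    let workers = workers.max(1);
    // Ceiling division so that at most `workers` chunks are produced.
    let chunk = (out.len() + workers - 1) / workers;
    let f = &f;
    thread::scope(|s| {
        for (c, slice) in out.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                for (j, slot) in slice.iter_mut().enumerate() {
                    *slot = f(c * chunk + j);
                }
            });
        }
        // The scope waits for every worker before returning.
    });
}
```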

Reusing tasks

My next idea was that maybe we were spending too much time in task creation and destruction. To test this, I wrote a task pool, which would use the same tasks to process multiple work items. By design, Rust’s task primitives are very cheap, so this shouldn’t be necessary in practice. Once again, this transformation did not noticeably affect performance. This is actually good news, because it means Rust delivers on its goal of making spawning tasks cheap enough that developers shouldn’t worry about spawning too many.
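A task pool of the kind described here can be sketched in current Rust as a fixed set of worker threads pulling boxed closures from one shared queue. This is an illustrative reconstruction, not the pool I wrote for the experiment; `TaskPool`, `execute`, `join` and `run_jobs` are names chosen for the sketch.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical task pool: a fixed set of workers sharing one job queue.
struct TaskPool {
    jobs: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl TaskPool {
    fn new(size: usize) -> TaskPool {
        let (tx, rx) = channel::<Job>();
        let rx = Arc::new(Mutex::new(rx));
        let workers = (0..size)
            .map(|_| {
                let rx = Arc::clone(&rx);
                thread::spawn(move || loop {
                    // Hold the lock only while dequeuing, not while running.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(job) => job(),
                        Err(_) => break, // queue closed: shut down
                    }
                })
            })
            .collect();
        TaskPool { jobs: Some(tx), workers }
    }

    /// Queue a work item for whichever worker becomes free first.
    fn execute<F: FnOnce() + Send + 'static>(&self, f: F) {
        self.jobs.as_ref().unwrap().send(Box::new(f)).unwrap();
    }

    /// Close the queue and wait for the workers to drain it.
    fn join(mut self) {
        drop(self.jobs.take());
        for w in self.workers.drain(..) {
            w.join().unwrap();
        }
    }
}

/// Run `jobs` trivial work items on `size` reused workers and count them.
fn run_jobs(size: usize, jobs: usize) -> usize {
    let done = Arc::new(AtomicUsize::new(0));
    let pool = TaskPool::new(size);
    for _ in 0..jobs {
        let done = Arc::clone(&done);
        pool.execute(move || {
            done.fetch_add(1, Ordering::SeqCst);
        });
    }
    pool.join();
    done.load(Ordering::SeqCst)
}
```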

Share more with unsafe pointers

After two failed attempts, Patrick Walton and I took a more in depth look at the profiles. We discovered that compiling with the -g flag makes Instruments able to show us a lot more useful information about where we are spending our time. Patrick’s diagnosis was that we were spending too much time allocating and copying, and he advised I try to track down the expensive copies. The profiles showed much of the time was spent in the beginning and end of the parallel block, meaning the problems were probably due to copying and freeing things in the closure. There are two fairly large structures that this code used: the edge list and the color vector. I replaced these with unsafe pointers to ensure they were actually shared rather than copied. Finally, I saw a parallel speedup! The edge list was a couple hundred megabytes, so this explains why copying it was rather expensive.

Future directions

My experience again shows several ways we can improve Rust. The most obvious is that we need a way to share data between tasks; copying is just too expensive. For the most part, this isn’t a problem since we would only be sharing read-only data. The trickiness comes in when we have to decide when to reclaim the shared data. One way of doing this is to use the patient parent pattern. A parent can spawn several parallel tasks which each get read-only access to all of the parent’s data. The parent suspends execution until all of the children complete, at which point the parent would free the data through the usual means. Here the children do not worry about freeing these structures at all.
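In current Rust, the patient parent pattern corresponds closely to scoped threads: the parent lends read-only data to its children and cannot proceed until they have all finished. The sketch below uses `std::thread::scope`, which postdates this post, and the function name is invented for the example.

```rust
use std::thread;

/// The parent lends `data` to up to `children` scoped threads and waits
/// for all of them, so the borrow can never outlive the data.
fn patient_parent_sum(data: &[u64], children: usize) -> u64 {
    let children = children.max(1);
    // Ceiling division, kept at least 1 because `chunks(0)` would panic.
    let chunk = ((data.len() + children - 1) / children).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().sum::<u64>()))
            .collect();
        // Children only read `data`; none of them is responsible for
        // freeing it. The parent does that later through the usual means.
        handles.into_iter().map(|h| h.join().unwrap()).sum::<u64>()
    })
}
```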

Another approach is to use atomic reference counting for data that is shared between tasks. This gives more flexibility, because the lifetimes of tasks are not strictly bounded by their parents. It does mean we pay the cost of atomic reference counting, but this will hopefully not be bad since we only have to update reference counts when data structures cross task boundaries.

Finally, there are some smaller changes that would be nice. I spent a while worrying about time spent appending vectors. It might be worth adding a tree-vector to the standard library, which gives logarithmic time element lookup but constant time appending. Since the root cause of my performance problems was a large implicit copy, it’d be nice to have the compiler warn on potentially large copies, or maybe even make it an error where the programmer must explicitly copy something if this is really desired. As another optimization, it might be worth keeping tasks around once they die and reusing them to hopefully reduce the overhead of creating tasks even further. I have a page about this in the Rust wiki, but we’ll need more evidence that this is a good idea before going for it.