I’ve decided to move my blog to a new home:
http://blog.theincredibleholk.org/
I’ll keep the existing posts in their current location, but all my new posts will be at the new location. Hope to see you there!
Lately, the focus of my research has been on a new programming language called Harlan (that link is for IU students only, sorry), which is a high-level language for GPU programming. One important task in this project has been forming a reasonable mental model of how GPUs actually work. I’ve recently come to the conclusion that the models exposed by CUDA and OpenCL make it almost impossible to form a clear picture of what is actually happening in the hardware.
The CUDA documentation gives the impression that an NVIDIA GPU is a mystical processor that is capable of running thousands of threads at once. This leads to a unique programming model. Suppose you want to add two vectors of 10,000 elements together. All you have to do on a GPU is spawn 10,000 threads, and each thread adds one element of the vector. If we wanted each thread to run the function add_vector, we could simply do this:
int block_size = ???;
int num_blocks = ???;
add_vector<<<num_blocks, block_size>>>(n, x, y, z);
This code adds vectors x and y, each of length n, and stores the result in z. Of course, we have already run into some complications. What should block_size and num_blocks be? CUDA partitions all of your threads into a grid of blocks, and each block has a certain number of threads. You can have basically as many blocks as you want, but for some reason the block size (or number of threads per block) cannot be more than 1024.
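To make this concrete, here is a minimal sketch of what the kernel and a launch configuration might look like. The kernel body, the choice of 256 threads per block, and the bounds check are my own illustrative assumptions rather than anything from the snippet above; the check is needed because the grid is rounded up to a whole number of blocks and may contain a few extra threads.

__global__ void add_vector(int n, float *x, float *y, float *z) {
    // Each thread handles one element; blockIdx/blockDim/threadIdx
    // together identify which element this thread is responsible for.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // the last block may contain threads past the end of the vectors
        z[i] = x[i] + y[i];
}

int block_size = 256;                                // must be no more than 1024
int num_blocks = (n + block_size - 1) / block_size;  // round up so every element is covered
add_vector<<<num_blocks, block_size>>>(n, x, y, z);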
What quickly becomes clear is that these so-called thousands of cores that your GPU has are not the same as cores on a CPU. For example, we hear about how at least some of these cores execute in lock step, meaning they must execute the exact same instructions at the same time. Not all threads do, though, because you can synchronize threads within a block using the __syncthreads() function. Besides grouping threads into blocks, some kernels also make use of the fact that threads are further subdivided into warps of up to 32 threads. The question is, how do these concepts map onto hardware?
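Before turning to the hardware question, here is a rough sketch of how __syncthreads() is typically used: a block-level sum reduction, where threads in one block cooperate through shared memory and the barrier keeps any thread from reading a slot before the thread that owns it has written it. The kernel name, the scratch array, and the fixed 256-thread block are my own illustrative choices, not something taken from the CUDA documentation.

__global__ void block_sum(const float *in, float *out, int n) {
    __shared__ float scratch[256];    // one slot per thread; assumes a 256-thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                  // every thread must write its slot before anyone reads

    // Tree reduction: at each step, the first half of the threads add in the second half's values.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();              // all threads must reach the barrier, even the idle ones
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = scratch[0]; // one partial sum per block
}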
A look at the Fermi Architecture Whitepaper shows that NVIDIA’s Fermi processors are made up of some number of Streaming Multiprocessors (SMs), which each have 32 CUDA cores. The Wikipedia page shows that different boards within the Fermi series have a different number of SMs. One of the new features of the Fermi architecture is the GigaThread™ Thread Scheduler, which apparently provides 10x faster context switching. At SC11, I heard one NVIDIA employee claim that context switching was free.
To me, rather than thinking of GPUs in terms of grids and blocks and warps, it’s best to think of them as vector processors. Vector processors are CPUs that are designed around Single Instruction, Multiple Data (SIMD) instructions. Typical desktop CPUs contain SIMD extensions, such as Intel’s AVX instructions, which allow them to perform some vector operations efficiently, but their focus is still on low latency execution of scalar code. By contrast, vector processors expect most of the computation to be expressed in terms of vector operations, and are optimized to perform these operations as quickly as possible, perhaps even at the expense of scalar performance.
Under this model, each SM on an NVIDIA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on one of these vector registers as a warp. A warp appears to be 32 threads because each instruction operates on 32 lanes at once, and the threads must proceed in lock step because they are actually a single stream of instructions.
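One way to see this in code (a hedged sketch; the kernel and its output array are mine, not NVIDIA's) is that a thread's position within a block already encodes which 32-wide register and which lane it occupies, and that a branch inside a warp does not let the lanes run different code independently: the single instruction stream issues both sides, with the inactive lanes masked off.

__global__ void show_warp_layout(int *out) {
    int warp = threadIdx.x / 32;   // which 32-wide vector register within the block
    int lane = threadIdx.x % 32;   // which lane of that register this "thread" occupies
    // The whole warp executes one instruction stream: both branches below are issued
    // to all 32 lanes, and lanes on the untaken side are simply masked off.
    if (lane < 16)
        out[blockIdx.x * blockDim.x + threadIdx.x] = warp * 100 + lane;
    else
        out[blockIdx.x * blockDim.x + threadIdx.x] = -(warp * 100 + lane);
}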
Now, how do CUDA blocks fit with this view? These blocks seem to correspond to a set of warps executing on a single SM. Although an SM is a single core, it can run multiple threads through simultaneous multithreading, or HyperThreading as Intel calls it. Under HyperThreading, two threads can be assigned to a single processor core. The processor then multiplexes resources between the two threads. For example, if one thread is blocked on a memory operation, the CPU can execute instructions from the other thread while the first one waits on memory. Switching between these threads is basically free; it’s just a matter of assigning ready work to available hardware resources. In terms of CUDA blocks, if we divide the maximum number of threads per block (1024) by the number of threads per warp (32), we end up with 32. This suggests that each SM is able to keep around 32 thread (or warp) contexts, and swap between them easily as execution units and data become available.
In summary, we can think of a Fermi GPU as a multicore processor, where each core does 32-way HyperThreading and supports 32-wide vector instructions.
In order to really verify that this is the case, it would be helpful to see the actual Fermi instruction set. NVIDIA is very secretive about this, instead only publishing a virtual instruction set, PTX. This is understandable, as it means NVIDIA does not have to maintain backwards compatibility between GPU revisions. However, AMD does provide documentation for the actual instruction set for their GPUs. After briefly perusing their latest documentation, it seems that their instruction set is compatible with the idea of GPUs as vector processors.
Apparently someone recently guessed my super-ingenious password and started telling the world that my opinions were worth money. Thanks to Erick Tryzelaar for pointing this out to me. I’ve updated my password and hopefully I’ll be safe for a while.
For the last week or so, I’ve been mostly fighting with the performance of the Rust compiler. The rustc performance graph from the RustBot page tells a lot of the story:
The wildly varying performance is largely due to a change I made to move the code for addition on vectors out of the compiler and into the core library. The compiler used to generate very optimized vector code. For example, when doing v += [x] (that is, appending a single element to a vector in place), the compiler would avoid allocating a temporary vector for the [x]. In effect, the compiler generated what amounts to an inline call to vec::push.
My work involved adding an overloaded addition operator to Rust’s core library, and changing the compiler to use that version instead of its generated version. The landing of this change corresponds to about build number 70 or so on the graph, where there is a small but significant jump in the time rustc took. Due to the way we handle overloaded operators, the compiler must insert many more vector copies than it did before. It seemed too good to be true that I would only lose a small amount of performance.
And it was too good to be true. It turns out that, through a bunch of rebasing, I had managed to undo a small switch in is_binopable that controlled whether my new code ever got called. In the meantime, Sully has been doing a lot of work on the way Rust handles vectors, and he landed a change around build 90 in the performance graph above that ended up causing Rust to use the library version of vector append instead. This was more like the performance decrease I was expecting.
In order to fix this, I had to track down all the places in the compiler where we were using the addition operator on vectors, and replace them with things like vec::push, vec::push_all, vec::append, and vec::append_one. This was tedious work, and I did it in stages. Each spot where the execution time improves in the graph is where I landed a couple more of these changes. Once this was finally done, rustc’s performance ended up slightly better than it was before. I believe this is because the library versions of these functions are able to remove some copies that rustc could not remove before. Sadly, the code is much uglier than it was before. We are discussing the ability to specify modes on the self argument, which would allow the compiler to once again avoid unnecessary copies when using the plus operator. This should allow us to make the plus operator just as fast as the underlying library functions.
The moral of the story is that copies can kill performance. Rust often warns when it is inserting copies you may not have expected. Pay attention to these warnings and you will write much faster code.