Are GPUs just vector processors?

Posted: October 26, 2012 in Research

Lately, the focus of my research has been on a new programming language called Harlan (that link is for IU students only, sorry), which is a high-level language for GPU programming. One important task in this project has been forming a reasonable mental model of how GPUs actually work. I’ve recently come to the conclusion that the model exposed by CUDA and OpenCL makes it almost impossible to form a clear picture of what is actually happening in the hardware.

The CUDA documentation gives the impression that an NVIDIA GPU is a mystical processor that is capable of running thousands of threads at once. This leads to a unique programming model. Suppose you want to add two vectors of 10,000 elements together. All you have to do on a GPU is spawn 10,000 threads, and each thread adds one element of the vector. If we wanted each thread to run the function add_vector, we could simply do this:

int block_size = ???;
int num_blocks = ???;
add_vector<<<num_blocks, block_size>>>(n, x, y, z);

This code adds vectors x and y, each of length n, and stores the result in z. Of course, we have already run into some complications. What should block_size and num_blocks be? CUDA partitions all of your threads into a grid of blocks, and each block has a certain number of threads. You can have basically as many blocks as you want, but for some reason the block size (or number of threads per block) cannot be more than 1024.
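
For concreteness, here is a minimal sketch of what the kernel and a typical launch might look like. This is my own illustration rather than anything from Harlan; the block size of 256 is just a common convention, not something the hardware requires.

__global__ void add_vector(int n, float *x, float *y, float *z) {
    // Each thread computes a global index and handles a single element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard against the last, partially filled block
        z[i] = x[i] + y[i];
}

// A common convention: pick a fixed block size and round the grid up to cover n.
int block_size = 256;
int num_blocks = (n + block_size - 1) / block_size;
add_vector<<<num_blocks, block_size>>>(n, x, y, z);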

What quickly becomes clear is that the so-called thousands of cores in your GPU are not the same as cores on a CPU. For example, we hear that at least some of these cores execute in lock step, meaning they must execute the exact same instructions at the same time. Not all threads do, though; if they did, there would be no need for the __syncthreads() function that synchronizes the threads within a block. Besides grouping threads into blocks, some kernels also make use of the fact that threads are further subdivided into warps of up to 32 threads. The question is, how do these concepts map onto hardware?
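
Before getting to the hardware, here is a hypothetical sketch (again mine, not Harlan's) of the kind of code that leans on that barrier: each thread stages a value in shared memory, and __syncthreads() guarantees every write has finished before any thread reads a neighbor's slot.

__global__ void shift_left(int n, float *in, float *out) {
    __shared__ float buf[256];       // assumes a block size of 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[threadIdx.x] = in[i];    // each thread writes its own slot
    __syncthreads();                 // every thread in the block waits here
    if (i < n && threadIdx.x + 1 < blockDim.x)
        out[i] = buf[threadIdx.x + 1];  // safe: the neighbor's write is now visible
}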

A look at the Fermi Architecture Whitepaper shows that NVIDIA’s Fermi processors are made up of some number of Streaming Multiprocessors (SMs), which each have 32 CUDA cores. The Wikipedia page shows that different boards within the Fermi series have a different number of SMs. One of the new features of the Fermi architecture is the GigaThread™ Thread Scheduler, which apparently provides 10x faster context switching. At SC11, I heard one NVIDIA employee claim that context switching was free.

GPUs are Vector Processors

To me, rather than thinking of GPUs in terms of grids and blocks and warps, it’s best to think of them as vector processors. Vector processors are CPUs that are designed around Single Instruction, Multiple Data (SIMD) instructions. Typical desktop CPUs contain SIMD extensions, such as Intel’s AVX instructions, which allow them to perform some vector operations efficiently, but their focus is still on low latency execution of scalar code. By contrast, vector processors expect most of the computation to be expressed in terms of vector operations, and are optimized to perform these operations as quickly as possible, perhaps even at the expense of scalar performance.
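
To make the contrast concrete, here is a rough host-side sketch of the same vector add written with Intel's AVX intrinsics (my illustration, not code from Harlan): each vector instruction operates on 8 floats at once, with a scalar loop left over for the tail.

#include <immintrin.h>

void add_vector_avx(int n, const float *x, const float *y, float *z) {
    int i = 0;
    // Vector loop: one AVX add handles 8 floats per iteration.
    for (; i + 8 <= n; i += 8) {
        __m256 a = _mm256_loadu_ps(x + i);
        __m256 b = _mm256_loadu_ps(y + i);
        _mm256_storeu_ps(z + i, _mm256_add_ps(a, b));
    }
    // Scalar cleanup for any leftover elements.
    for (; i < n; i++)
        z[i] = x[i] + y[i];
}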

Under this model, each SM on an NVIDIA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on these vector registers as a warp. A warp appears to be 32 threads because each instruction operates on 32 lanes at once, and the threads must proceed in lock step because they are really a single stream of instructions.
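
Under this reading (again my own sketch, not anything in the CUDA documentation), a thread's index within a one-dimensional block decomposes naturally into a warp number and a lane within that warp's 32-wide registers:

// Inside a kernel, with a one-dimensional block:
int warp = threadIdx.x / 32;   // which instruction stream this thread belongs to
int lane = threadIdx.x % 32;   // which lane of that warp's vector registers it occupies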

Now, how do CUDA blocks fit with this view? These blocks seem to correspond to a set of warps executing on a single SM. Although an SM is a single core, it can run multiple threads through simultaneous multithreading, or HyperThreading as Intel calls it. Under HyperThreading, two threads can be assigned to a single processor core. The processor then multiplexes resources between the two threads. For example, if one thread is blocked on a memory operation, the CPU can execute instructions from the other thread while the first one waits on memory. Switching between these threads is basically free; it’s just a matter of assigning ready work to available hardware resources. In terms of CUDA blocks, if we divide the maximum number of threads per block (1024) by the number of threads per warp (32), we end up with 32. This suggests that each SM can hold 32 thread (or warp) contexts and swap between them cheaply as execution units and data become available.

In summary, we can think of a Fermi GPU as a multicore processor, where each core does 32-way HyperThreading and supports 32-wide vector instructions.

In order to really verify that this is the case, it would be helpful to see the actual Fermi instruction set. NVIDIA is very secretive about this, publishing only a virtual instruction set, PTX. This is understandable, as it means NVIDIA does not have to maintain backwards compatibility between GPU revisions. However, AMD does provide documentation for the actual instruction set of their GPUs. After briefly perusing their latest documentation, I’d say their instruction set is consistent with the idea of GPUs as vector processors.

Comments
  1. gasche says:

    Your post is very interesting, but frankly it sucks to post a link to source code behind a firewall (like people giving references to research articles behind a paywall). On one hand, you’re free to develop whatever you want as closed source, but then it’s not necessary to link to a non-public implementation on your public blog. On the other, if any student of your university can access this, it’s essentially just as if it were public, so I don’t think there is really a valid reason for not making it easy for anyone to access.

    (I’ve discussed with some researchers why they wouldn’t agree to publish their prototype implementations on a public hosting site. The answers vary between “people could steal my great ideas”, which might be the case in some fields but is generally not well founded and is rather an excuse for the second reason, and “I don’t want everyone to see my borked commit messages and the shameful stupid bugs I’m fixing”, which is misplaced pride (nobody cares about that) and why Matt Might invented the CRAPL license: http://matt.might.net/articles/crapl/ . Of course, the decision to publish might not be in your power (but again, this is sometimes a convenient excuse), but then again, you don’t need to link to it in your public posts.)

    • Eric Holk says:

      Thanks for the criticism. In this case the decision to release the code isn’t up to me. I do intend to release the code as soon as I can though. The version I linked to is accessible to Indiana University students, which as you point out is a large number of people. I felt like it was best to share the code with those I could even if I can’t make it truly public.

  2. While (modern) GPUs are in fact very similar to large SIMD engines, they are different as well. When a GPU SM engine executes a single stream of instructions on 32 parallel data streams, it can predicate the execution of each individual instruction independently for each data stream. So it’s possible to have conditionals in your instruction stream that are true for some of the parallel data streams and false for others. The instructions that happen to operate on stream data where the condition is false are simply replaced by NOPs for that stream. Most SIMD implementations that I’m familiar with don’t have this capability: an instruction will unconditionally execute on all parallel data streams.

    Another way of looking at it is that your view of GPUs as vector processors is a special case of what a GPU is capable of: the case when all threads in a warp happen to go through the same conditional branches.

  3. curdog says:

    I’m sorry but your post is meaningless without code… fork it publicly, for the sake of others. It sounds interesting… but there’s no code to back anything up. Why bother talking about something if no one can see it? Reminds me of something…

    Hey I got these really cool golden tablets with writing that I can only understand with this magic decoder, but you can’t see them or the decoder. Interested?

    • Eric Holk says:

      I apologize for not sharing the Harlan code in this post. However, the point of writing this wasn’t to talk about how great Harlan is, but to present a way of thinking about GPU architectures that is different from what’s typically taught. In my opinion, viewing GPUs as multicore, multithreaded vector processors is a better match for the hardware, which can help programmers reason about how best to structure their code.

      I did provide links to documentation on NVIDIA’s Fermi architecture and AMD’s GPU instruction set, which should provide enough evidence for you to evaluate the claims in my post.

      Thanks for reading, and thanks for your comment. I’m looking forward to sharing more about GPU programming and some of the programming language work I’m doing on this front.
