New Location

Posted: November 30, 2012 in Uncategorized

I’ve decided to move my blog to a new home:

http://blog.theincredibleholk.org/

I’ll keep the existing posts in their current location, but all my new posts will be at the new location. Hope to see you there!

Lately, the focus of my research has been on a new programming language called Harlan (that link is for IU students only, sorry), which is a high level language for GPU programming. One important task in this project has been forming a reasonable mental model of how GPUs actually work. I’ve recently come to the conclusion that the model exposed as part of CUDA and OpenCL make it almost impossible to form a clear picture of what is actually happening in the hardware.

The CUDA documentation gives the impression that an NVIDIA GPU is a mystical processor that is capable of running thousands of threads at once. This leads to a unique programming model. Suppose you want to add two vectors of 10,000 elements together. All you have to do on a GPU is spawn 10,000 threads, and each thread adds one element of the vector. If we wanted each thread to run the function add_vector, we could simply do this:

int block_size = ???;
int num_blocks = ???;
add_vector<<<num_blocks, block_size>>>(n, x, y, z);

This code adds vectors x and y, each of length n, and stores the result in z. Of course, we have already run into some complications. What should block_size and num_blocks be? CUDA partitions all of your threads into a grid of blocks, and each block has a certain number of threads. You can have basically as many blocks as you want, but for some reason the block size (or number of threads per block) cannot be more than 1024.

What quickly becomes clear is that these so-called thousands of cores that your GPU has are not the same as cores on a CPU. For example, we hear about how at least some of these cores execute in lock step, meaning they must execute the exact same instructions at the same time. Not all threads do though, because you can synchronize threads within a block using the __syncthreads() function. Besides grouping threads into blocks, some kernels also make use of the fact that threads are further subdivided into warps of up to 32 threads. The question is, how do these concepts map onto hardware?

A look at the Fermi Architecture Whitepaper shows that NVIDIA’s Fermi processors are made up of some number of Streaming Multiprocessors (SMs), which each have 32 CUDA cores. The Wikipedia page shows that different boards within the Fermi series have a different number of SMs. One of the new features of the Fermi architecture is the GigaThread™ Thread Scheduler, which apparently provides 10x faster context switching. At SC11, I heard one NVIDIA employee claim that context switching was free.

GPUs are Vector Processors

To me, rather than thinking of GPUs in terms of grids and blocks and warps, it’s best to think of them as vector processors. Vector processors are CPUs that are designed around Single Instruction, Multiple Data (SIMD) instructions. Typical desktop CPUs contain SIMD extensions, such as Intel’s AVX instructions, which allow them to perform some vector operations efficiently, but their focus is still on low latency execution of scalar code. By contrast, vector processors expect most of the computation to be expressed in terms of vector operations, and are optimized to perform these operations as quickly as possible, perhaps even at the expense of scalar performance.

Under this model, each SM on an NIVIDA GPU corresponds to a more traditional CPU core. These SMs would contain some number of 32-wide vector registers. It seems that CUDA exposes operations on vector registers as a warp. They appear to be 32 threads because each instruction on 32 lanes at once, while the threads must proceed in lock step because they are actually a single stream of instructions.

Now, how do CUDA blocks fit with this view? These blocks seem to correspond to a set of warps executing on a single SM. Although an SM is a single core, it can run multiple threads through simultaneous multithreading, or HyperThreading as Intel calls it. Under HyperThreading, two threads can be assigned to a single processor core. The processor then multiplexes resources between the two threads. For example, if one thread is blocked on a memory operation, the CPU can execute instructions from the other thread while the first one waits on memory. Switching between these threads is basically free; it’s just a matter of assigning ready work to available hardware resources. In terms of CUDA blocks, if we divide the maximum number of threads per block (1024) by the number of threads per warp (32), we end up with 32. This suggests that each SM is able to keep around 32 thread (or warp) contexts, and swap between them easily as execution units and data become available.

In summary, we can think of a Fermi GPU as a multicore processor, where each core does 32-way HyperThreading and supports 32-wide vector instructions.

In order to really verify that this is the case, it would be helpful to see the actual Fermi instruction set. NVIDIA is very secretive about this, instead only publishing a virtual instruction set, PTX. This is understandable, as it means NVIDIA does not have to maintain backwards compatibility between GPU revisions. However, AMD does provide documentation for the actual instruction set for their GPUs. After briefly perusing their latest documentation, it seems that their instruction set is compatible with the idea of GPUs as vector processors.

Apparently someone recently guessed my super-ingenious password and starting telling the world that my opinions were worth money. Thanks for Erick Tryzelaar for pointing this out to me. I’ve updated my password and hopefully I’ll be safe for a while.

A lot of people have been asking me how to use protocols in Rust lately, so I thought I’d write up a little tutorial. Custom protocols are how to get the biggest benefits from Rust’s communication system, as this is how you get the biggest safety guarantees and expose the most opportunities for optimization. It’s also more labor intensive, so some of the other library features such as streams or port sets might be a better starting point. This post, however, introduces how to write your own protocols.

Protocols are created using the proto! syntax extension. Each protocol has a collection of states, and each state in a protocol has a collection of messages that are allowed. Every protocol must have at least states, though states needn’t have any messages. Thus, we will start with the simplest possible protocol definition:

proto! simple_proto {
    StartState:send { }
}

This creates a protocol called simple_proto, with a single state called StartState. Protocols always start in the first state listed. There is one other decoration, :send. This indicates that StartState is a send state. Protocols are always written from the client perspective, so a send state means it is a state in which the client may send a message. The protocol compiler will generate a dual state for the server, meaning there will be a version of StartState that is expecting to receive a message.

Given this specification, the protocol compiler will create a module called simple_proto, that includes types and functions for communicating according to this protocol. One generated function is called init, which is used to create a pipe in the start state. We can start the protocol like this:

let (server, client) = simple_proto::init();

This creates a pair of endpoints. The client endpoint will have the type simple_proto::client::StartStart, meaning it is in the client’s version of StartState. The server endpoint, on the other hand, has type simple_proto::server::StartState, which means it is in the server’s version of the StartState. The difference means that client is expecting to send a message, while server is expecting to receive a message.

We can’t really do anything further with this protocol, since there are no messages defined. Strictly speaking, the server could try to receive, but it would block forever since there is no way for the sender to send a message. Let’s fix this by adding a message.

proto! simple_proto {
    StartState:send {
        SayHello -> StartState
    }
}

Messages have the form Name(arguments) -> NextState. In this case, we added a message called SayHello, which carries no data with it. After sending a SayHello message, the protocol transitions to (or stays in, really) the StartState. We can now write some functions that communicate with each other. Here’s an example client program:

fn client(+channel: simple_proto::client::StartState) {
    import simple_proto::client;

    client::SayHello(channel);
}

Receive, by itself, is a little trickier. I recommend using the select macro instead. For now, you can get the select macro here. Once macro import is working, the select macro will be included in the standard library. For now, to use the select macro, save it in a file called select-macro.rs and add the following lines near the top of your program.

fn macros() {
    include!("select-macro.rs");
}

Once you’ve done this, you can write the server as follows.

fn server(+channel: simple_proto::server::StartState) {
    import simple_proto::SayHello;

    select! {
        channel => {
            SayHello -> _channel {
                io::println("Client says hello!");
            }
        }
    }
}

Select allows you to provide a set of endpoints to listen for messages on, followed by actions to take depending on which one receives a message. In this case, we only have one endpoint, called channel, which is in the server StartState state. After the =>, there is a block describing message patterns and code to execute if the pattern is matched. In this case, we only have one pattern, SayHello -> _channel. This mirrors the definition of the message in the protocol specification. It says “if we receive a SayHello message, bind an endpoint in the next protocol state to _channel and execute the code in the following block.” We use _channel for the next state because in this case we are not planning on sending or receiving any more messages.

Now let’s make this protocol a little more interesting by adding a new state and a message that carries data. We will do this by letting the client ask the server’s name and wait for a reply. The new protocol looks like this:

proto! simple_proto {
    StartState:send {
        SayHello -> StartState,
        WhatsYourName -> GettingName
    }

    GettingName:recv {
        MyNameIs(~str) -> StartState
    }
}

We’ve added a new message to StartState, which the client uses to ask the server’s name. After sending this, the protocol transitions to the GettingName state, where the client will wait to receive the MyNameIs message from the server. At this point, the protocol moves back to the StartState, and we can do it all over again. We’ve added an argument to MyNameIs, which means this message carries a string with it. Our client code now looks like this:

fn client(+channel: simple_proto::client::StartState) {
    import simple_proto::client;
    import simple_proto::MyNameIs;

    let channel = client::SayHello(channel);
    let channel = client::WhatsYourName(channel);
    select! {
        channel => {
            MyNameIs(name) -> _channel {
                io::println(fmt!("The server is named %s", *name));
            }
        }
    }
}

At a high level, this code says hello to the server, then asks for it’s name, then waits for the response and reports the server’s name to the user. It probably looks a little add that every line starts with let channel = .... This is because endpoints are single use. Any time you send or receive a message on an endpoint, the endpoint is consumed. Fortunately, all the send and receive functions return a new endpoint that you can use to continue the protocol.

The use of select! here is similar to how it was in the previous server example, except that we’ve added name to the MyNameIs pattern. This matches the ~str parameter in the protocol specification, and it binds the string sent by the server to name, so that we can print it out in the handler code.

For the new server, we need to add another clause to the message patterns:

fn server(+channel: simple_proto::server::StartState) {
    import simple_proto::{SayHello, WhatsYourName};
    import simple_proto::server::MyNameIs;

    select! {
        channel => {
            SayHello -> _channel {
                io::println("Client says hello!");
            },
            WhatsYourName -> channel {
                MyNameIs(channel, ~"Bob");
            }
        }
    }
}

In this case, if we receive a WhatsYourName message, we send a MyNameIs message on the new endpoint (called channel), which contains the string ~"Bob", which is what this server has decided to call itself. The client will eventually receive this string and show it to the user.

This covers the basic definition and usage of protocols. There are several other features, however. This includes polymorphic states and terminal states. Polymorphic states allow you to create protocols that work for different types. One common example is the stream protocol, which lets you send a whole bunch of messages of a given type:

proto! stream {
    Stream:send<T:send> {
        Send(T) -> Stream<T>
    }
}

We can add as many type parameters as we want to each of the states, with arbitrary bounds as well. You’ll probably want all your data types to be send-bounded though. Then, each time a message transitions to a polymorphic state, it must provide type parameters. You can see this on the right had side of the Send message.

Sometimes, we want to have a message that ends the protocol. For example, in our previous example, we might want a GoodBye message. One way to do this is to make a state with no messages, and step to that:

proto! simple_proto {
    StartState:send {
        GoodBye -> Done,
    }

    Done { }
}

However, this is a little verbose, and it also hides that fact that you really intended the protocol to end there. Thus, there is a special form that indicates sending a message ends the protocol. We write it like this:

proto! simple_proto {
    StartState:send {
        GoodBye -> !
    }
}

Stepping to ! represents closing the protocol and is analogous to how a function that returns ! actually never returns. When a message steps to !, neither the corresponding send function nor the receive function will return a new endpoint, meaning there is no way you could even attempt to send messages on this connection.

I hope this helps to get started with protocols in Rust. There are a few other features, but this covers the basics. Please feel free to ask questions!

The benchmarks in my last post had one thing in common: all communication was one sender to one receiver. It’s surprising how often this is sufficient, but sooner or later we are going to need a way to have multiple tasks sending to the same receiver. I’ve been experimenting with two ways of doing many senders to different receivers, and I now have some results to show.

The pipes library includes a select operation. This lets you listen on several receive endpoints simultaneously. Unfortunately, the single-use nature of endpoints makes select a little clunky to use. To help alleviate this, I added a port_set to the library. Port sets allow you to easily treat several receive endpoints as a unit. This allows send to still be very fast, but receive is a little bit slower due to the overhead setting up and tearing down the select operation. The current implementation for select is O(n) in the number of endpoints, so this works well for small numbers of tasks, but breaks down as things get bigger.

The other option is to slow down the sending end, using something I call a shared_chan. This is a send endpoint wrapped in an exclusive ARC. Now all the senders have to contend with each other, but the receive side is exactly as cheap as before. For cases where you have a lot of senders that send messages relatively infrequently, this will likely outperform the port_set approach, at least until select is faster.

Both of these are sufficient to run the msgsend benchmark that I talked about at the beginning of all of this. Here are the results, combined with the previous numbers.

Language Messages per second Comparison
Rust port_set 881,578 232.8%
Scala 378,740 100.0%
Rust port/chan (updated) 227,020 59.9%
Rust shared_chan 173,436 45.8%
Erlang (Bare) 78,670 20.8%
Erlang (OTP) 76,405 20.2%

The most obvious thing is that the port_set version is over twice as fast as Scala, the previous winner. I also re-ran the port/chan version for comparison, and it got a little bit faster. There has been quite a bit of churn in Rust recently, so it’s quite possible that these showed up here as better performance.

Writing the port_set version proved the most interesting to me. Relying on select ended up relaxing some of the ordering guarantees. Previously if we had Task A send a message to Task C and then send a message to Task B, and then have Task B wait to receive message to from Task A and then send a message to Task C, we could count on Task C seeing Task A’s message before seeing Task B’s message. With the port_set, this is no longer true, although we still preserve the order in messages sent by a single task. An easy way to work around this, however, was to rely on pipe’s closure reporting ability. The server could tell when a worker would no longer send any more messages because it would detect when the worker closed its end of the pipe.

I hinted in my last post that pipes in Rust have very good performance. This falls out of the fact that the protocol specifications provide very strong static guarantees about what sorts of things can happen at runtime. This allows, among other things, for message send/receive fastpath that requires only two atomic swaps.

Let’s start with the message ring benchmark. I posted results from this earlier. This benchmark spins up a bunch of tasks that arrange themselves in a while. Each task sends a message to their right-hand neighbor, and receives a message from the left-hand neighbor. This repeats for a while. At the end, we look at the total time taken divided by the number of messages. This gives us roughly the fastest we can send and receive a message, modulo some task spawning overhead. The existing port/chan system was able to send about 250,000 messages per second, or one message every 3.9 µs. Here are the results for pipes:

Sent 1000000 messages in 0.227634 seconds
  4.39301e+06 messages / second
  0.227634 µs / message

This is about 17x faster!

It would be a bit dishonest to stop here, however. I wrote this benchmark specifically to make any new implementation really shine. The question is whether faster message passing makes a difference on bigger programs.

To test this, I started by updating the Graph500 Parallel Breadth First Search benchmark. This code gets its parallelism from std::par::map, which in turn is built on core::future. Future has a very simple parallel protocol; it just spawns a task to compute something, which then sends a single message back to the spawner. Porting this was a relatively small change, yet it got measurable speedups. Here are the results.

Benchmark Port/chan time (s) Pipe time (s) Improvement (%)
Graph500 PBFS 0.914772 0.777784 17.6%

The Rust benchmark suite also includes several benchmarks from the Computer Language Benchmarks Game (i.e. the Programming Language Shootout). Some of these, such as k-nucleotide, use Rust’s parallelism features. I went ahead and ported this benchmark over to use pipes, and there are the results.

Benchmark Port/chan time (s) Pipe time (s) Improvement (%)
Shootout K-Nucleotide 4.335 3.125 38.7%

Not too shabby. I’ve been working on porting other benchmarks as well. Some are more difficult because they do not fit the 1:1 nature of pipes very well. In the case of the shootout-threadring benchmark, it actually got significantly slower when I moved to pipes. The thread ring benchmark seems to mostly be measuring the time to switch between tasks, as only one should be runnable at any given time. My hypothesis is that because message passing got faster, this test now hammers the scheduler synchronization code harder, leading to more slowdown due to contention. We’ll need more testing to know for sure. At any rate, scheduler improvements (such as work stealing, which Ben Blum will be working on) should improve this benchmark as well.

Other than that, I’ve been working on rewriting more Rust code to see how it works with pipes versus ports and chans. It has been particularly informative to try to transition parts of Servo over to using pipes.

About a month ago, I posted that I was going to be working on improving Rust’s message passing performance. I quickly threw together a prototype of a new communication system based on a shared queue protected by a mutex. This was about twice as fast as the existing system, because it removed the global mutex from the messaging paths. This prototype hurt expressiveness somewhat, and still it seemed we could do a lot better.

Rust has some extremely powerful features in its type system. The fact that it can deal with concepts like uniqueness, initialization status, copyability, and other traits mean we can encode some very powerful invariants. Thus, I took some inspiration from the Singularity OS and set out to see if I could encode something like channel contracts in Rust. The result is a proposal for a feature I’m calling pipes.

The way pipes work is that when you create a pipe you get two endpoints that are forever entangled together. One endpoint can send one message, and the other endpoint can receive that one message. Sending and receiving destroys the endpoint, but the operation also produces a new endpoint to continue the communication. Endpoints have a state associated with them, which specifies which messages can be sent or received. This information is encoding in the type system, so Rust can statically guarantee that no task will send a message that is not legal in the given state. Pipes are not copyable; they are always for 1:1 communication. However, endpoints can be sent between tasks.

Critical to pipes are the associated protocol specification. Protocols have two views: the client and the server. Protocols are always written from the perspective of the client. This decision was arbitrary, but in general it makes sense to only write down one side of the protocol. The other perspective is generated by reversing the direction of all the messages. Here’s an example of what I’m envisioning for a protocol specification.

proto! bank {
    login:send {
        login(username, password) -> login_response
    }

    login_response:recv {
        ok -> connected,
        invalid -> login
    }

    connected:send {
        deposit(money) -> connected,
        withdrawal(amount) -> withdrawal_response
    }

    withdrawal_response:recv {
        money(money) -> connected,
        insufficient_funds -> connected
    }
}

This describes the protocol you might use in an online banking situation. The protocol has four states (login, login_response, connected and withdrawal_response), each one annotated with whether the sender is allowed to send or receive in that state. In this case, a client would start out in the login state, where the client can attempt to login with a username and password. After sending a login message, the protocol enters the login_response state, where the server informs the client that either the login succeeded (in which case the protocol transitions to the connected state), or the login failed, in which case the protocol returns to the login state and the client can retry.

From the connected state, the client can try to deposit or withdrawal money. We assume that depositing money never fails, so sending a deposit message results in the protocol staying in the connected state. On the other hand, withdrawal can fail, for example, if the account does not have enough money. To model this, sending a withdrawal message results in the protocol going to the withdrawal_response state. Here, the client waits to either receive the requested money, or for a message saying there was not enough money in the account. In both cases, we end up back in the connected state.

Below is a code example showing how a client might use this protocol.

fn bank_client(+bank: bank::client::login) {
    import bank::*;

    let bank = client::login(bank, "theincredibleholk", "1234");
    let bank = alt recv(bank) {
      some(ok(connected)) {
        #move(connected)
      }
      some(invalid(_)) { fail "login unsuccessful" }
      none { fail "bank closed the connection" }
    };

    let bank = client::deposit(bank, 100.00);
    let bank = client::withdrawal(bank, 50.00);
    alt recv(bank) {
      some(money(m, _)) {
        io::println("Yay! I got money!");
      }
      some(insufficient_funds(_)) {
        fail "someone stole my money"
      }
      none {
        fail "bank closed the connection"
      }
    }
}

All of this code in this posts works on the latest Rust compiler as of this morning. I’ve also started transitioning some of our benchmarks to the new pipe system, and the results have been impressive. I’ll have a post diving into the performance of pipes soon.

For the last week or so, I’ve been mostly fighting with the performance of the Rust compiler. The rustc performance graph from the RustBot page tells a lot of the story:

Performance graph for rustc

The wildly varying performance is largely due to a change I made to move the code for addition on vectors out of the compiler and into the core library. The compiler used to generate very optimized vector code. For example, when doing v += [x] (that is, appending a single element to a vector in place), the compiler would avoid allocating a temporary vector for the [x]. In effect, the compiler generated what amounts to an inline call to vec::push.

My work involved adding an overloaded addition operator to Rust’s core library, and changing the compiler to use that version instead of its generated version. The landing of this change corresponds to about build number 70 or so on the graph, where there is a small but significant jump in the time rustc took. Due to the way we handle overloaded operators, the compiler must insert many more vector copies than it did before. It seemed too good to be true that I would only lose a small amount of performance.

And it was too good to be true. It turns out through a bunch of rebasing, I had managed to undo a small switch in is_binopable that controlled whether my new code ever got called. In the meantime, Sully has been doing a lot of work on the way Rust handles vectors, and he landed a change around build 90 in the performance graph above that ended up causing Rust to use the library version of vector append instead. This was more like the performance decrease I was expecting.

In order to fix this, I had to track down all the places in the compiler where we were using the addition operator on vectors, and replace them with things like vec::push, vec::push_all, vec::append and vec::append_one. This was tedious work, and I did it in stages. Each spot where the execution time improves in the graph is where I landed a couple more of these changes. Once this was finally done, we can see that rustc’s performance was even slightly better than it was before. I believe this is because the library versions of these functions are able to remove some copies that rustc could not remove before. Sadly, the code is much uglier than it was before. We are discussing the ability to specify modes on the self argument, which will allow the compiler to again avoid unnecessary copies when using the plus operator. This should allow us to make using the plus operator just as fast as the underlying library functions.

The moral of the story is that copies can kill performance. Rust warns often when it is inserting copies you may not have expected. Pay attention to these warnings and you will write much faster code.

Previously, I talked about a couple of ways to improve the message passing performance in Rust. The obvious bottleneck was that sending and receiving both involved taking a lock that was shared between all tasks. In order to remove this lock, I set out to write a new port/channel system that relied much less on the runtime library.

In order to do this, we first needed a variant of the atomic reference counter that allows us to share mutable data. We accomplish this by adding a mutex and condition variable. The mutex is your standard pthread mutex, while we’ve implemented our own condition variable that the Rust scheduler is aware of. Because we’re using a standard pthread mutex, using this exclusive access is unsafe and should be used with care; if you’re not careful, you could deadlock Rust. Fortunately, the API for Rust’s low-level locks and condition variables makes it harder to accidentally hold a lock for unbounded amounts of time.

Once we have this, ports and channels simply become a locked atomically reference-counted queue. There’s no more global lock, and things are far simpler because we only need the Rust runtime support for providing mutual exclusion and condition variable signalling. This part didn’t take long to implement, and I was eager to try it out. I spent a little while writing up a new benchmark that would really show the benefit of avoiding the global lock, and when I went to run it, things crashed.

It turns out, I had discovered a bug whereby vector addition could copy things that weren’t copyable. I saw two approaches to fixing this: fixing trans (the Rust to LLVM translation pass), or moving vector addition out of trans and into the library. The idea behind the second option is that if we did this, vector addition would be going through the existing and well-tested function call path, rather than a special vector addition codegen path. This seemed like the best option overall, since the first option felt more like a band aid for the root cause that we have too much functionality that is duplicated in subtly different ways.

Thus, I set out to move vector addition to libcore. This exposed some subtle semantics issues around const vectors, but these were mostly not too painful to work out. My firsts working versions were too slow to finish building the compiler in a reasonable amount of time. I was thinking we’d end up taking a 20% to 2x performance hit by doing this change, but thanks to some heroic optimization help from Niko, we got the performance to the point where in some cases the new code even performs ever so slightly better than the old code. In the course of doing these changes, I also discovered another bug, that led to us leaking memory, and in the course of fixing that, I discovered a way to make Rust segfault. Ah, the life of a compiler writer.

At any rate, we have fixes to all of these bugs in the works, and things are working well enough to run a benchmark. This test creates a ring of tasks, who each send a message to their neighbor on one side and receive from the other side. Here are the numbers for the old messaging system.

Sent 1000000 messages in 3.88114 seconds
  257656 messages / second
  3.88114 μs / message

And here are the numbers for the new system.

Sent 1000000 messages in 1.87881 seconds
  532253 messages / second
  1.87881 μs / message

As you can see, we’re about 1.9x faster than we were before.

This new system doesn’t yet have the exact same features as the old system, and it needs some help with ergonomics. My next task will be to work on adding these missing things and hopefully be able to incrementally replace the old code with the new, faster version.

The code for this post isn’t yet in the main Rust tree, but it should be landing soon as we squash the bugs we found in the process of doing this new message passing system.

Today we’re going to take a brief look at message passing performance in Rust. Rust is already pretty quick on this front, but there are some really obvious possibilities for making it even faster. It looks like I’ll be spending most of the summer working on this, so I thought it’d be good to start out with some baseline performance numbers. Rust’s testsuite has a benchmark called msgsend.rs, which is a microbenchmark for message passing performance. It is based on a benchmark from Paul Keeble that original compared Erlang and Scala. Here’s a table summarizing the results:

Language Messages per second Comparison
Scala 378,740 100.0%
Rust 175,373 46.3%
Erlang (Bare) 78,670 20.8%
Erlang (OTP) 76,405 20.2%

These numbers were generated on an Early 2011 MacBook Pro 2.3 GHz Intel Core i7 with 8GB RAM and Lion 10.7.4. Don’t read too much into these numbers; they are incredibly unscientific. The only real use for them will be to see if running the same benchmark for Rust in a few weeks yields a larger number. It’s also worth noting that my results disagree with the original author, who saw Erlang performing about 6 times faster than Scala. I suspect that Erlang’s message passing may not be as efficient on Mac OS X, but I do not know what system the tests were originally run on.

So, where do we go from here? I mentioned that there are some obvious first steps. The most obvious is that right now sending a message involves taking a global lock. In effect, this means we can only send one message in a time. Suppose we had four tasks, A, B, C and D, where A was sending to B and C to D. Intuitively, these two sends should be completely independent of each other, but this is not currently the case in Rust. This is one thing I hope to change.

It would be nice if we could do at least some of the communication without any locks at all, using a lock-free data structure. This is worth experimenting with, but we’ll have to wait for the benchmarks to show whether this is the best idea.

Somewhat orthogonal to message passing performance, I hope to be able to implement more of the communication system in Rust itself. When the communication system was first written, we had to put almost all of it in the runtime library that was written in C++. Rust is significantly more powerful now, and it seems like it would not take too much more work to write the communication system almost entirely in Rust.