Valentin Popov 1b6a04ca55
Initial vendor packages
Signed-off-by: Valentin Popov <valentin@popov.link>
2024-01-08 01:21:28 +04:00

9.0 KiB

Rayon FAQ

This file is for general questions that don't fit into the README or crate docs.

How many threads will Rayon spawn?

By default, Rayon uses the same number of threads as the number of CPUs available. Note that on systems with hyperthreading enabled this equals the number of logical cores and not the physical ones.

If you want to alter the number of threads spawned, you can set the environmental variable RAYON_NUM_THREADS to the desired number of threads or use the ThreadPoolBuilder::build_global function method.

How does Rayon balance work between threads?

Behind the scenes, Rayon uses a technique called work stealing to try and dynamically ascertain how much parallelism is available and exploit it. The idea is very simple: we always have a pool of worker threads available, waiting for some work to do. When you call join the first time, we shift over into that pool of threads. But if you call join(a, b) from a worker thread W, then W will place b into its work queue, advertising that this is work that other worker threads might help out with. W will then start executing a.

While W is busy with a, other threads might come along and take b from its queue. That is called stealing b. Once a is done, W checks whether b was stolen by another thread and, if not, executes b itself. If W runs out of jobs in its own queue, it will look through the other threads' queues and try to steal work from them.

This technique is not new. It was first introduced by the Cilk project, done at MIT in the late nineties. The name Rayon is an homage to that work.

What should I do if I use Rc, Cell, RefCell or other non-Send-and-Sync types?

There are a number of non-threadsafe types in the Rust standard library, and if your code is using them, you will not be able to combine it with Rayon. Similarly, even if you don't have such types, but you try to have multiple closures mutating the same state, you will get compilation errors; for example, this function won't work, because both closures access slice:

/// Increment all values in slice.
fn increment_all(slice: &mut [i32]) {
    rayon::join(|| process(slice), || process(slice));
}

The correct way to resolve such errors will depend on the case. Some cases are easy: for example, uses of [Rc] can typically be replaced with [Arc], which is basically equivalent, but thread-safe.

Code that uses Cell or RefCell, however, can be somewhat more complicated. If you can refactor your code to avoid those types, that is often the best way forward, but otherwise, you can try to replace those types with their threadsafe equivalents:

  • Cell -- replacement: AtomicUsize, AtomicBool, etc
  • RefCell -- replacement: RwLock, or perhaps Mutex

However, you have to be wary! The parallel versions of these types have different atomicity guarantees. For example, with a Cell, you can increment a counter like so:

let value = counter.get();
counter.set(value + 1);

But when you use the equivalent AtomicUsize methods, you are actually introducing a potential race condition (not a data race, technically, but it can be an awfully fine distinction):

let value = tscounter.load(Ordering::SeqCst);
tscounter.store(value + 1, Ordering::SeqCst);

You can already see that the AtomicUsize API is a bit more complex, as it requires you to specify an ordering. (I won't go into the details on ordering here, but suffice to say that if you don't know what an ordering is, and probably even if you do, you should use Ordering::SeqCst.) The danger in this parallel version of the counter is that other threads might be running at the same time and they could cause our counter to get out of sync. For example, if we have two threads, then they might both execute the "load" before either has a chance to execute the "store":

Thread 1                                          Thread 2
let value = tscounter.load(Ordering::SeqCst);
// value = X                                      let value = tscounter.load(Ordering::SeqCst);
                                                  // value = X
tscounter.store(value+1);                         tscounter.store(value+1);
// tscounter = X+1                                // tscounter = X+1

Now even though we've had two increments, we'll only increase the counter by one! Even though we've got no data race, this is still probably not the result we wanted. The problem here is that the Cell API doesn't make clear the scope of a "transaction" -- that is, the set of reads/writes that should occur atomically. In this case, we probably wanted the get/set to occur together.

In fact, when using the Atomic types, you very rarely want a plain load or plain store. You probably want the more complex operations. A counter, for example, would use fetch_add to atomically load and increment the value in one step. Compare-and-swap is another popular building block.

A similar problem can arise when converting RefCell to RwLock, but it is somewhat less likely, because the RefCell API does in fact have a notion of a transaction: the scope of the handle returned by borrow or borrow_mut. So if you convert each call to borrow to read (and borrow_mut to write), things will mostly work fine in a parallel setting, but there can still be changes in behavior. Consider using a handle: RefCell<Vec<i32>> like:

let len = handle.borrow().len();
for i in 0 .. len {
    let data = handle.borrow()[i];
    println!("{}", data);
}

In sequential code, we know that this loop is safe. But if we convert this to parallel code with an RwLock, we do not: this is because another thread could come along and do handle.write().unwrap().pop(), and thus change the length of the vector. In fact, even in sequential code, using very small borrow sections like this is an anti-pattern: you ought to be enclosing the entire transaction together, like so:

let vec = handle.borrow();
let len = vec.len();
for i in 0 .. len {
    let data = vec[i];
    println!("{}", data);
}

Or, even better, using an iterator instead of indexing:

let vec = handle.borrow();
for data in vec {
    println!("{}", data);
}

There are several reasons to prefer one borrow over many. The most obvious is that it is more efficient, since each borrow has to perform some safety checks. But it's also more reliable: suppose we modified the loop above to not just print things out, but also call into a helper function:

let vec = handle.borrow();
for data in vec {
    helper(...);
}

And now suppose, independently, this helper fn evolved and had to pop something off of the vector:

fn helper(...) {
    handle.borrow_mut().pop();
}

Under the old model, where we did lots of small borrows, this would yield precisely the same error that we saw in parallel land using an RwLock: the length would be out of sync and our indexing would fail (note that in neither case would there be an actual data race and hence there would never be undefined behavior). But now that we use a single borrow, we'll see a borrow error instead, which is much easier to diagnose, since it occurs at the point of the borrow_mut, rather than downstream. Similarly, if we move to an RwLock, we'll find that the code either deadlocks (if the write is on the same thread as the read) or, if the write is on another thread, works just fine. Both of these are preferable to random failures in my experience.

But wait, isn't Rust supposed to free me from this kind of thinking?

You might think that Rust is supposed to mean that you don't have to think about atomicity at all. In fact, if you avoid interior mutability (Cell and RefCell in a sequential setting, or AtomicUsize, RwLock, Mutex, et al. in parallel code), then this is true: the type system will basically guarantee that you don't have to think about atomicity at all. But often there are times when you WANT threads to interleave in the ways I showed above.

Consider for example when you are conducting a search in parallel, say to find the shortest route. To avoid fruitless search, you might want to keep a cell with the shortest route you've found thus far. This way, when you are searching down some path that's already longer than this shortest route, you can just stop and avoid wasted effort. In sequential land, you might model this "best result" as a shared value like Rc<Cell<usize>> (here the usize represents the length of best path found so far); in parallel land, you'd use a Arc<AtomicUsize>. Now we can make our search function look like:

fn search(path: &Path, cost_so_far: usize, best_cost: &Arc<AtomicUsize>) {
    if cost_so_far >= best_cost.load(Ordering::SeqCst) {
        return;
    }
    ...
    best_cost.store(...);
}

Now in this case, we really WANT to see results from other threads interjected into our execution!