nwf
7 min read

Atomics and Memory Ordering

Understanding Atomics and Memory Ordering

I’ve always had trouble clearly understanding how atomic operations and memory ordering work. Many of the resources I looked at explain these concepts poorly, which made them even harder to grasp. Since I rarely worked with these things directly, that wasn’t a big issue for me.

However, on one of my recent projects, I had to make use of them to ensure correct synchronization of memory accesses between threads. As a result, I finally attempted to understand how atomics and memory ordering work.

Before we discuss atomics and memory ordering, we need to understand how the underlying hardware handles loads and stores. This part is really important because many articles on this topic skip it and only discuss the effects of atomic operations.

Understanding the Hardware

When the CPU loads a value from memory, it doesn’t fetch only that value; it fetches the entire cache line that the value lives in.

This significantly speeds up subsequent accesses, but it also causes problems when multiple cores read and write that data. One core can store a new value in its cache and delay writing it back to RAM until it evicts that cache line (a write-back cache). As a result, another core can read a stale value from RAM.

To address this issue, caches of different cores are kept in sync using a cache coherence protocol.

MESI Cache Coherence Protocol

The MESI cache coherence protocol defines the states that a cache line can be in. Intel uses a variation of this protocol called MESIF, and AMD uses MOESI. However, their hardware implementations may differ significantly.

MESI is an acronym for the four states a cache line can be in: Modified, Exclusive, Shared, and Invalid.

Let’s go over what each of them means:

  • Modified: The cache line has been modified by this core, and this core has exclusive access to it. The data has not yet been written back to RAM, so before evicting this cache line, the core needs to write it back to memory.
  • Exclusive: The core has exclusive access to this cache line but has not yet modified it.
  • Shared: The cache line is shared with other cores. Cores can only read the data in this cache line.
  • Invalid: The cache line is empty/unused.

To show how this protocol might work, let’s take a look at an example.

Say we have a system with two cores. The following sequence of events might occur when both access the same cache line.

int A = 0; // shared variable; both cores touch the cache line containing A

void core1(){
  int x = A; // core 1 reads A
}

void core2(){
  A = 1; // core 2 writes A
}

As we can see, core 1 reads the value in cache line A, while core 2 writes to it.

Assuming that cache line A is initially invalid in both cores, the cache coherence protocol would work like this:

  • Core 1: cache line A is Invalid; fetch it from memory and mark its state as Shared.
  • Core 2: cache line A is Invalid; fetch the cache line and send an invalidate message to the other caches. The state of the line becomes Exclusive.
  • Core 1: invalidates cache line A and sends an invalidate acknowledgement.
  • Core 2: receives the cache line and the invalidate acknowledgements from the other cores, then writes its data to the line. The state of the line becomes Modified.
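
To make these transitions concrete, here is a minimal C++ sketch that just steps the two cores’ copies of the line through the states described above (my own illustration, not how a coherence controller is actually implemented):

#include <cstdio>

// The four MESI states a cache line can be in.
enum class State { Modified, Exclusive, Shared, Invalid };

const char* name(State s) {
  switch (s) {
    case State::Modified:  return "Modified";
    case State::Exclusive: return "Exclusive";
    case State::Shared:    return "Shared";
    case State::Invalid:   return "Invalid";
  }
  return "?";
}

int main() {
  State core1 = State::Invalid, core2 = State::Invalid; // both start Invalid

  core1 = State::Shared;    // core 1 reads A: fetch the line, mark it Shared
  core1 = State::Invalid;   // core 1 acks core 2's invalidate message
  core2 = State::Exclusive; // core 2 fetched the line with exclusive access
  core2 = State::Modified;  // core 2 performs the store, dirtying the line

  printf("core1: %s, core2: %s\n", name(core1), name(core2));
}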

In this example, core 2 must wait for the invalidate acknowledgements from all cores, as well as for the cache line itself to arrive, before it can write its data. This is especially a problem if another core is busy reading or writing that cache line; core 2 would simply block. That would be a big performance bottleneck, and it is not how the hardware actually works.

In reality, each core maintains a store buffer of the writes it has issued. This has a few benefits: it allows speculative execution of stores without actually committing them to the cache or memory, and it allows writes to be issued to a cache line that the core doesn’t yet have exclusive access to, without stalling while the line is fetched and invalidated in the other caches.
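
A visible side effect of store buffering is the classic store-buffer litmus test: each core’s store may still sit in its buffer when the other core loads. The C++ sketch below (my own example, using relaxed atomics so no extra ordering is imposed) permits the surprising outcome r1 == 0 && r2 == 0:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

int main() {
  std::thread t1([] {
    x.store(1, std::memory_order_relaxed); // may linger in the store buffer
    r1 = y.load(std::memory_order_relaxed);
  });
  std::thread t2([] {
    y.store(1, std::memory_order_relaxed); // may linger in the store buffer
    r2 = x.load(std::memory_order_relaxed);
  });
  t1.join();
  t2.join();
  printf("r1=%d r2=%d\n", r1, r2); // r1=0 r2=0 is a permitted outcome
}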

Atomic Operations

Atomic operations perform the operation you specify (incr, store, cmpxchg, etc.); the important difference is that they ensure no other core can read or write the cache line while the operation is in progress. On x86, you can use the lock prefix on certain instructions to make them atomic.

For example, when executing this atomic read-modify-write instruction (a fetch-add), the following would happen, assuming the cache line containing the value is initially invalid:

lock add qword ptr [rcx], rax
  1. The cache line containing the address in rcx is invalid; fetch it and obtain exclusive access, invalidating it in the other cores’ caches.
  2. Read the qword at the address in rcx.
  3. Add the value in rax to it.
  4. Store the result back to the address.
  5. Mark the cache line’s state as Modified.

While the CPU is executing this atomic instruction, it holds off on responding to cache coherence requests (such as reads, invalidates, etc.), ensuring that no other core can access the cache line.
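
For reference, the same read-modify-write can be expressed portably. Here is a minimal C++ sketch using std::atomic::fetch_add, which compilers typically lower to a lock-prefixed instruction on x86:

#include <atomic>
#include <cstdio>

std::atomic<long> counter{0};

int main() {
  counter.fetch_add(5);            // atomically: load, add 5, store back
  printf("%ld\n", counter.load()); // prints 5
}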

Memory Ordering

Atomic operations only guarantee that an operation will be observed as either fully complete or not started at all; it cannot be observed in an in-between state. They do not control how memory accesses around the atomic operation are ordered.

For example, take a look at this code:

bool can_read = false;
int x = 0;

// core 1
x = 1;
atomicStore(can_read, true);

// core 2
if(can_read){
  assert(x == 1);
}

The above code can fail because the atomic store to can_read can be reordered to before the store to x. As a result, core 2 can observe can_read as true while x is still 0.
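
To fix it, the store to can_read needs release ordering and the load needs acquire ordering. Here is a sketch of the corrected pattern in C++ with std::atomic (the names mirror the pseudocode above):

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> can_read{false};
int x = 0;

int main() {
  std::thread t1([] {
    x = 1;
    can_read.store(true, std::memory_order_release); // x = 1 is visible before the flag
  });
  std::thread t2([] {
    if (can_read.load(std::memory_order_acquire)) // sees all writes before the release
      assert(x == 1); // can no longer fail
  });
  t1.join();
  t2.join();
}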

These are the different memory orderings that most compilers support:

  • unordered: no ordering is guaranteed, only the atomicity of the operation.
  • monotonic: similar to unordered, but there is a total order for the stores to a given address that all threads observe. All threads see writes to a given address in the same order.
  • acquire: ensures that memory accesses after this operation cannot be reordered to before it. Paired with a release store, it makes the releasing thread’s earlier writes visible.
  • release: ensures that memory accesses before this operation cannot be reordered to after it.
  • acq_rel: for read-modify-write operations, provides both an acquire barrier for the load and a release barrier for the store.
  • seq_cst: acts as release for stores and acquire for loads, and additionally guarantees a single total order over all seq_cst operations that every thread observes. The LLVM docs explain this as follows:

    SequentiallyConsistent loads minimally require the same barriers as Acquire operations and SequentiallyConsistent stores require Release barriers. Additionally, the code generator must enforce ordering between SequentiallyConsistent stores followed by SequentiallyConsistent loads. This is usually done by emitting either a full fence before the loads or a full fence after the stores; which is preferred varies by architecture.

In essence, the memory order restricts how both the compiler and the CPU may reorder memory accesses around an atomic operation.
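
As a rough correspondence, here is how these orderings map onto C++’s std::memory_order (my own mapping; unordered has no direct C++ equivalent, and monotonic corresponds to relaxed):

#include <atomic>

std::atomic<int> a{0};

void examples() {
  a.store(1, std::memory_order_relaxed);     // monotonic
  int v = a.load(std::memory_order_acquire); // acquire
  a.store(v + 1, std::memory_order_release); // release
  a.fetch_add(1, std::memory_order_acq_rel); // acq_rel
  a.store(2, std::memory_order_seq_cst);     // seq_cst (the default)
}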

Extra Resources: