Friday, 22 July 2011

Dissecting the Disruptor: Why it's so fast (part two) - Magic cache line padding

We mention the phrase Mechanical Sympathy quite a lot, in fact it's even Martin's blog title.  It's about understanding how the underlying hardware operates and programming in a way that works with that, not against it.

We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post.  Since this lends itself to pretty pictures, it's the next thing I thought I would tackle.

Comp Sci 101
One of the things I love about working at LMAX is all that stuff I learnt at university and in my A Level Computing actually means something.  So often as a developer you can get away with not understanding the CPU, data structures or Big O notation - I spent 10 years of my career forgetting all that.  But it turns out that if you do know about these things, and you apply that knowledge, you can come up with some very clever, very fast code.

So, a refresher for those of us who studied this at school, and an intro for those who didn't.  Beware - this post contains massive over-simplifications.

The CPU is the heart of your machine and the thing that ultimately has to do all the operations, executing your program.  Main memory (RAM) is where your data (including the lines of your program) lives.  We're going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.

The CPU has several layers of cache between it and main memory, because even accessing main memory is too slow.  If you're doing the same operation on a piece of data multiple times, it makes sense to load this into a place very close to the CPU when it's performing the operation (think a loop counter - you don't want to be going off to main memory to fetch this to increment it every time you loop around).


The closer the cache is to the CPU, the faster it is and the smaller it is.  L1 cache is small and very fast, and right next to the core that uses it.  L2 is bigger and slower, and still only used by a single core.  L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket.  Finally you have main memory, which is shared across all cores and all sockets.

When the CPU is performing an operation, it's first going to look in L1 for the data it needs, then L2, then L3, and finally if it's not in any of the caches the data needs to be fetched all the way from main memory.  The further it has to go, the longer the operation will take.  So if you're doing something very frequently, you want to make sure that data is in L1 cache.

Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:

Latency from CPU to...Approx. number of
CPU cycles
Approx. time
in nanoseconds
Main memory~60-80ns
QPI transit
(between sockets, not drawn)
~20ns
L3 cache~40-45 cycles, ~15ns
L2 cache~10 cycles, ~3ns
L1 cache~3-4 cycles,~1ns
Register1 cycle

If you're aiming for an end-to-end latency of something like 10 milliseconds, an 80 nanosecond trip to main memory to get some missing data is going to take a serious chunk of that.

Cache lines
Now the interesting thing to note is that it's not individual items that get stored in the cache - i.e. it's not a single variable, a single pointer.  The cache is made up of cache lines, typically 64 bytes, and it effectively references a location in main memory.  A Java long is 8 bytes, so in a single cache line you could have 8 long variables.


(I'm going to ignore the multiple cache-levels for simplicity)

This is brilliant if you're accessing an array of longs - when one value from the array gets loaded into the cache, you get up to 7 more for free.  So you can walk that array very quickly.  In fact, you can iterate over any data structure that is allocated to contiguous blocks in memory very quickly.  I made a passing reference to this in the very first post about the ring buffer, and it explains why we use an array for it.

So if items in your data structure aren't sat next to each other in memory (linked lists, I'm looking at you) you don't get the advantage of freebie cache loading.  You could be getting a cache miss for every item in that data structure.

However, there is a drawback to all this free loading.  Imagine your long isn't part of an array.  Imagine it's just a single variable.  Let's call it head, for no real reason.  Then imagine you have another variable in your class right next to it.  Let's arbitrarily call it tail.  Now, when you load head into your cache, you get tail for free.


Which sounds fine.  Until you realise that tail is being written to by your producer, and head is being written to by your consumer.  These two variables aren't actually closely associated, and in fact are going to be used by two different threads that might be running on two different cores.


Imagine your consumer updates the value of head.  The cache value is updated, the value in memory is updated, and any other cache lines that contain head are invalidated because other caches will not have the shiny new value.  And remember that we deal with the level of the whole line, we can't just mark head as being invalid.


Now if some process running on the other core just wants to read the value of tail, the whole cache line needs to be re-read from main memory.  So a thread which is nothing to do with your consumer is reading a value which is nothing to do with head, and it's slowed down by a cache miss.

Of course this is even worse if two separate threads are writing to the two different values. Both cores are going to be invalidating the cache line on the other core and having to re-read it every time the other thread has written to it. You've basically got write-contention between the two threads even though they're writing to two different variables.

This is called false sharing, because every time you access head you get tail too, and every time you access tail, you get head as well.  All this is happening under the covers, and no compiler warning is going to tell you that you just wrote code that's going to be very inefficient for concurrent access.

Our solution - magic cache line padding
You'll see that the Disruptor eliminates this problem, at least for architecture that has a cache size of 64 bytes or less, by adding padding to ensure the ring buffer's sequence number is never in a cache line with anything else.

public long p1, p2, p3, p4, p5, p6, p7; // cache line padding
    private volatile long cursor = INITIAL_CURSOR_VALUE;
    public long p8, p9, p10, p11, p12, p13, p14; // cache line padding

So there's no false sharing, no unintended contention with any other variables, no needless cache misses.

It's worth doing this on your Entry classes too - if you have different consumers writing to different fields, you're going to need to make sure there's no false sharing between each of the fields.


EDIT: Martin wrote a more technically correct and detailed post about false sharing, and posted performance results too.

Saturday, 16 July 2011

Dissecting the Disruptor: Why it's so fast (part one) - Locks Are Bad

Martin Fowler has written a really good article describing not only the Disruptor, but also how it fits into the architecture at LMAX.  This gives some of the context that has been missing so far, but the most frequently asked question is still "What is the Disruptor?".

I'm working up to answering that.  I'm currently on question number two: "Why is it so fast?".

These questions do go hand in hand, however, because I can't talk about why it's fast without saying what it does, and I can't talk about what it is without saying why it is that way.

So I'm trapped in a circular dependency.  A circular dependency of blogging.

To break the dependency, I'm going to answer question one with the simplest answer, and with any luck I'll come back to it in a later post if it still needs explanation: the Disruptor is a way to pass information between threads.

As a developer, already my alarm bells are going off because the word "thread" was just mentioned, which means this is about concurrency, and Concurrency Is Hard.

Concurrency 101


Imagine two threads are trying to change the same value.

Case One: Thread 1 gets there first:
  1. The value changes to "blah"
  2. Then the value changes to "blahy" when Thread 2 gets there.
Case Two: Thread 2 gets there first:
  1. The value changes to "fluffy"
  2. Then the value changes to "blah" when Thread 1 gets there.
Case Three: Thread 1 interrupts Thread 2:
  1. Thread 2 gets the value "fluff" and stores it as myValue
  2. Thread 1 goes in and updates value to "blah"
  3. Then Thread 2 wakes up and sets the value to "fluffy".
Case Three is probably the only one which is definitely wrong, unless you think the naive approach to wiki editing is OK (Google Code Wiki, I'm looking at you...).  In the other two cases it's all about intentions and predictability.  Thread 2 might not care what's in value, the intention might be to append "y" to whatever is in there regardless.  In this circumstance, cases one and two are both correct.

But if Thread 2 only wanted to change "fluff" to "fluffy", then both cases two and three are incorrect.
Assuming that Thread 2 wants to set the value to "fluffy", there are some different approaches to solving the problem.

Approach One: Pessimistic locking

(Does the "No Entry" sign make sense to people who don't drive in Britain?)

The terms pessimistic and optimistic locking seem to be more commonly used when talking about database reads and writes, but the principal applies to getting a lock on an object.

Thread 2 grabs a lock on Entry as soon as it knows it needs it and stops anything from setting it. Then it does its thing, sets the value, and lets everything else carry on.

You can imagine this gets quite expensive, with threads hanging around all over the place trying to get hold of objects and being blocked.  The more threads you have, the more chance that things are going to grind to a halt.

Approach Two: Optimistic locking


In this case Thread 2 will only lock Entry when it needs to write to it.  In order to make this work, it needs to check if Entry has changed since it first looked at it.  If Thread 1 came in and changed the value to "blah" after Thread 2 had read the value, Thread 2 couldn't write "fluffy" to the Entry and trample all over the change from Thread 1.  Thread 2 could either re-try (go back, read the value, and append "y" onto the end of the new value), which you would do if Thread 2 didn't care what the value it was changing was; or it could throw an exception or return some sort of failed update flag if it was expecting to change "fluff" to "fluffy".  An example of this latter case might be if you have two users trying to update a Wiki page, and you tell the user on the other end of Thread 2 they'll need to load the new changes from Thread 1 and then reapply their changes.

Potential Problem: Deadlock
Locking can lead to all sorts of issues, for example deadlock.  Imagine two threads that need access to two resources to do whatever they need to do:


If you've used an over-zealous locking technique, both threads are going to sit there forever waiting for the other one to release its lock on the resource.  That's when you reboot Windows your computer.

Definite Problem: Locks are sloooow...
The thing about locks is that they need the operating system to arbitrate the argument.  The threads are like siblings squabbling over a toy, and the OS kernel is the parent that decides which one gets it. It's like when you run to your Dad to tell him your sister has nicked the Transformer when you wanted to play with it - he's got bigger things to worry about than you two fighting, and he might finish off loading the dishwasher and putting on the laundry before settling the argument.  If you draw attention to yourself with a lock, not only does it take time to get the operating system to arbitrate, the OS might decide the CPU has better things to do than servicing your thread.

The Disruptor paper talks about an experiment we did.  The test calls a function incrementing a 64-bit counter in a loop 500 million times.  For a single thread with no locking, the test takes 300ms.  If you add a lock (and this is for a single thread, no contention, and no additional complexity other than the lock) the test takes 10,000ms.  That's, like, two orders of magnitude slower.  Even more astounding, if you add a second thread (which logic suggests should take maybe half the time of the single thread with a lock) it takes 224,000ms.  Incrementing a counter 500 million times takes nearly a thousand times longer when you split it over two threads instead of running it on one with no lock.

Concurrency Is Hard and Locks Are Bad
I'm just touching the surface of the problem, and obviously I'm using very simple examples.  But the point is, if your code is meant to work in a multi-threaded environment, your job as a developer just got a lot more difficult:
  • Naive code can have unintended consequences.  Case Three above is an example of how things can go horribly wrong if you don't realise you have multiple threads accessing and writing to the same data.
  • Selfish code is going to slow your system down.  Using locks to protect your code from the problem in Case Three can lead to things like deadlock or simply poor performance.
This is why many organisations have some sort of concurrency problems in their interview process (certainly for Java interviews).  Unfortunately it's very easy to learn how to answer the questions without really understanding the problem, or possible solutions to it.

How does the Disruptor address these issues?
For a start, it doesn't use locks.  At all.

Instead, where we need to make sure that operations are thread-safe (specifically, updating the next available sequence number in the case of multiple producers), we use a CAS (Compare And Swap/Set) operation.  This is a CPU-level instruction, and in my mind it works a bit like optimistic locking - the CPU goes to update a value, but if the value it's changing it from is not the one it expects, the operation fails because clearly something else got in there first.


Note this could be two different cores rather than two separate CPUs.

CAS operations are much cheaper than locks because they don't involve the operating system, they go straight to the CPU.  But they're not cost-free - in the experiment I mentioned above, where a lock-free thread takes 300ms and a thread with a lock takes 10,000ms, a single thread using CAS takes 5,700ms.  So it takes less time than using a lock, but more time than a single thread that doesn't worry about contention at all.

Back to the Disruptor - I talked about the ClaimStrategy when I went over the producers.  In the code you'll see two strategies, a SingleThreadedStrategy and a MultiThreadedStrategy.  You could argue, why not just use the multi-threaded one with only a single producer?  Surely it can handle that case?  And it can.  But the multi-threaded one uses an AtomicLong (Java's way of providing CAS operations), and the single-threaded one uses a simple long with no locks and no CAS.  This means the single-threaded claim strategy is as fast as possible, given that it knows there is only one producer and therefore no contention on the sequence number.

I know what you're thinking: turning one single number into an AtomicLong can't possibly have been the only thing that is the secret to the Disruptor's speed. And of course, it's not - otherwise this wouldn't be called "Why it's so fast (part one)".

But this is an important point - there's only one place in the code where multiple threads might be trying to update the same value.  Only one place in the whole of this complicated data-structure-slash-framework.  And that's the secret.  Remember everything has its own sequence number?  If you only have one producer then every sequence number in the system is only ever written to by one thread. That means there is no contention.  No need for locks.  No need even for CAS.  The only sequence number that is ever written to by more than one thread is the one on the ClaimStrategy if there is more than one producer.

This is also why each variable in the Entry can only be written to by one consumer.  It ensures there's no write contention, therefore no need for locks or CAS.

Back to why queues aren't up to the job
So you start to see why queues, which may implemented as a ring buffer under the covers, still can't match the performance of the Disruptor.  The queue, and the basic ring buffer, only has two pointers - one to the front of the queue and one to the end:


If more than one producer wants to place something on the queue, the tail pointer will be a point of contention as more than one thing wants to write to it.  If there's more than one consumer, then the head pointer is contended, because this is not just a read operation but a write, as the pointer is updated when the item is consumed from the queue.

But wait, I hear you cry foul!  Because we already knew this, so queues are usually single producer and single consumer (or at least they are in all the queue comparisons in our performance tests).

There's another thing to bear in mind with queues/buffers.  The whole point is to provide a place for things to hang out between producers and consumers, to help buffer bursts of messages from one to the other.  This means the buffer is usually full (the producer is out-pacing the consumer) or empty (the consumer is out-pacing the producer).  It's rare that the producer and consumer will be so evenly-matched that the buffer has items in it but the producers and consumers are keeping pace with each other.

So this is how things really look.  An empty queue:



...and a full queue:

The queue needs a size so that it can tell the difference between empty and full.  Or, if it doesn't, it might determine that based on the contents of that entry, in which case reading an entry will require a write to erase it or mark it as consumed.

Whichever implementation is chosen, there's quite a bit of contention around the tail, head and size variables, or the entry itself if a consume operation also includes a write to remove it.

On top of this, these three variables are often in the same cache line, leading to false sharing.  So, not only do you have to worry about the producer and the consumer both causing a write to the size variable (or the entry), updating the tail pointer could lead to a cache-miss when the head pointer is updated because they're sat in the same place.  I'm going to duck out of going into that in detail because this post is quite long enough as it is.

So this is what we mean when we talk about "Teasing Apart the Concerns" or a queue's "conflated concerns".  By giving everything its own sequence number and by allowing only one consumer to write to each variable in the Entry, the only case the Disruptor needs to manage contention is where more than one producer is writing to the ring buffer.

In summary
The Disruptor a number of advantages over traditional approaches:
  1. No contention = no locks = it's very fast.
  2. Having everything track its own sequence number allows multiple producers and multiple consumers to use the same data structure.
  3. Tracking sequence numbers at each individual place (ring buffer, claim strategy, producers and consumers), plus the magic cache line padding, means no false sharing and no unexpected contention.
EDIT: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.

Sunday, 10 July 2011

Dissecting the Disruptor: Wiring up the dependencies

So now I've covered the ring buffer itself, reading from it and writing to it.

Logically the next thing to do is to wire everything up together.

I talked about multiple producers - they have the producer barrier to keep them in order and under control.  I've talked about consumers in a simple situation.  Multiple consumers can get a little more involved.  We've done some clever stuff to allow the consumers to be dependent on each other and the ring buffer.  Like a lot of applications, we have a pipeline of things that need to happen before we can actually get on with the business logic - for example, we need to make sure the messages have been journalled to disk before we can do anything.

The Disruptor paper and the performance tests cover some basic configurations that you might want. I'm going to go over the most interesting one, mostly because I needed the practice with the graphics tablet.

Diamond configuration
DiamondPath1P3CPerfTest illustrates a configuration which is not too uncommon - a single producer with three consumers.  The tricky point being that the third consumer is dependent upon the previous two consumers to finish before it can do anything.


Consumer three might be your business logic, consumer one could be backing up the data received, and consumer two may be preparing the data or something.

Diamond configuration using queues
In a SEDA-style architecture, each stage will be separated by a queue:


(Why does queue have to have so many "e"s?  It's the letter I have the most trouble with in these drawings).

You might get an inkling of the problem here: for a message to get from P1 to C3 it has to travel through four whole queues, each queue taking its cost in terms of putting the message on the queue and taking it off again.

Diamond configuration using the Disruptor
In the Disruptor world, it's all managed on a single ring buffer:

It does look more complicated.  But the ring buffer remains the single point of contact between all the players, and the interactions are all based on the barriers checking the sequence numbers of the things it's dependent upon.

The producer side is fairly simple, it's the single producer model described in my last post. Interestingly, the producer barrier doesn't have to care about all the consumers.  It only cares about consumer three, because if consumer three has finished with an item in the ring buffer the other two will already have processed it.  So if C3 has moved on, that slot in the ring buffer is available.

To manage the dependencies between the consumers you need two consumer barriers.  The first just talks to the ring buffer and consumers one and two ask it for the next available item.  The second consumer barrier knows about consumers one and two, and it will return the lowest sequence number processed by both consumers.

How consumer dependencies work in the Disruptor
Hmm.  I can see I'm going to need an example.
We're joining the party halfway through the story: the producer has filled the ring buffer up to sequence number 22; consumer one has read and processed everything up to 21; consumer two has processed everything up to sequence 18; consumer three, which is dependent upon the other consumers, has only made it as far as 15.

The producer can't write anything more to the ring buffer because sequence 15 is taking up the slot where we'd want to put sequence 23.
(I'm sorry, I really did try to find an alternative to red and green, but everything else was just as ambiguous).

The first consumer barrier lets consumers one and two know they can grab anything up to sequence 22, the highest sequence number in the ring buffer.  The second consumer barrier checks the ring buffer sequence, but it also checks the sequences on the other two consumers and returns the lowest value.  So consumer three is told it can get anything up to sequence 18 from the ring buffer.

Note that the consumers are still reading the entries directly from the ring buffer - consumers one and two are not taking the entries off the ring buffer and then passing them on to consumer three.  Instead, the second consumer barrier is letting consumer three know which entry in the ring buffer it's safe to process.

This raises a question - if everything comes directly off the ring buffer, how is consumer three going to find out about anything the first two consumers have done?  If all consumer three cares about is that the earlier consumers have done their job (e.g. replicating the data to somewhere else) then everything's fine - when consumer three is told the job is done, it's happy.  If, however, consumer three needs the results of an earlier consumer's processing, where does it get that from?

Modifying entries
The secret is to write them to the ring buffer Entry itself.  This way, when consumer three grabs the entry off the ring buffer, it will have been populated with all the information consumer three needs to do the job.  The really important part of this is that for each field on the Entry only one consumer is allowed to write to it.  This prevents any write-contention which will slow the whole thing down.


You can see this in DiamondPath1P3CPerfTest - FizzBuzzEntry has two fields as well as the value: fizz and buzz.  If the consumer is a Fizz consumer, it writes to fizz.  If it's a Buzz consumer, it writes to buzz.  The third consumer, FizzBuzz, will read both of these fields but not write to either, since reading is fine and won't cause contention.

Some actual Java code
All this looks more complicated than the queue implementation.  And yes, it does involve a bit more coordination.  But this is hidden from the consumers and producers, they just talk to the barriers.  The trick is in the configuration.  The diamond graph in the example above would be created using something like the following:

ConsumerBarrier consumerBarrier1 = ringBuffer.createConsumerBarrier();

BatchConsumer consumer1 = new BatchConsumer(consumerBarrier1, handler1);
BatchConsumer consumer2 = new BatchConsumer(consumerBarrier1, handler2);

ConsumerBarrier consumerBarrier2 = 
    ringBuffer.createConsumerBarrier(consumer1, consumer2);

BatchConsumer consumer3 = new BatchConsumer(consumerBarrier2, handler3);

ProducerBarrier producerBarrier = 
    ringBuffer.createProducerBarrier(consumer3);

In summary
So there you have it - how to wire up the Disruptor with multiple consumers that are dependent on each other.  The key points:
  • Use multiple consumer barriers to manage dependencies between consumers.
  • Have the producer barrier watch the last consumer in the graph.
  • Allow only one consumer to write to an individual field in an Entry.
EDIT: Adrian has written a nice DSL to make wiring up the Disruptor much easier.

EDIT 2: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.  Also Adrian's DSL is now part of the main Disruptor code base.

    Monday, 4 July 2011

    Dissecting the Disruptor: Writing to the ring buffer

    This is the missing piece in the end-to-end view of the Disruptor.  Brace yourselves, it's quite long.  But I decided to keep it in a single blog so you could have the context in one place.

    The important areas are: not wrapping the ring; informing the consumers; batching for producers; and how multiple producers work.

    ProducerBarriers
    The Disruptor code has interfaces and helper classes for the Consumers, but there's no interface for your producer, the thing that writes to the ring buffer.  That's because nothing else needs to access your producer, only you need to know about it.  However, like the consuming side, a ProducerBarrier is created by the ring buffer and your producer will use this to write to it.

    Writing to the ring buffer involves a two-phase commit.  First, your producer has to claim the next slot on the buffer.  Then, when the producer has finished writing to the slot, it will call commit on the ProducerBarrier.

    So let's look at the first bit.  It sounds easy - "get me the next slot on the ring buffer".  Well, from your producer's point of view it is easy.  You simply call nextEntry() on the ProducerBarrier.  This will return you an Entry object which is basically the next slot in the ring buffer.

    The ProducerBarrier makes sure the ring buffer doesn't wrap
    Under the covers, the ProducerBarrier is doing all the negotiation to figure out what the next slot is, and if you're allowed to write to it yet.


    (I'm not convinced the shiny new graphics tablet is helping the clarity of my pictures, but it's fun to use).

    For this illustration, we're going to assume there's only one producer writing to the ring buffer.  We will deal with the intricacies of multiple producers later.

    The ConsumerTrackingProducerBarrier has a list of all the Consumers that are accessing the ring buffer.  Now to me this seemed a bit odd - I wouldn't expect the ProducerBarrier to know anything about the consuming side. But wait, there is a reason.  Because we don't want the "conflation of concerns" a queue has (it has to track the head and tail which are sometimes the same point), our consumers are responsible for knowing which sequence number they're up to, not the ring buffer.  So, if we want to make sure we don't wrap the buffer, we need to check where the consumers have got to.

    In the diagram above, one Consumer is happily at the same point as the highest sequence number (12, highlighted in red/pink). The second Consumer is a bit behind - maybe it's doing I/O operations or something - and it's at sequence number 3.  Therefore consumer 2 has the whole length of the buffer to go before it catches up with consumer 1.

    The producer wants to write to the slot on the ring buffer currently occupied by sequence 3, because this slot is the one after the current ring buffer cursor.  But the ProducerBarrier knows it can't write here because a Consumer is using it.  So the ProducerBarrier sits and spins, waiting, until the consumers move on.

    Claiming the next slot
    Now imagine consumer 2 has finished that batch of entries, and moves its sequence number on. Maybe it got as far as sequence 9 (in real life I expect it will make it as far as 12 because of the way consumer batching works, but that doesn't make the example as interesting).


    The diagram above shows what happens when consumer 2 updates to sequence number 9.  I've slimmed down the ConsumerBarrier in this picture because it takes no active part in this scene.

    The ProducerBarrier sees that the next slot, the one that had sequence number 3, is now available.  It grabs the Entry that sits in this slot (I've not talked specifically about the Entry class, but it's basically a bucket for stuff you want to put into the ring buffer slot which has a sequence number), sets the sequence number on the Entry to the next sequence number (13) and returns this entry to your producer.  The producer can then write whatever value it wants into this Entry.

    Committing the new value
    The second phase of the two-stage commit is, well, the commit.


    The green represents our newly updated Entry with sequence 13 - yeah, I'm sorry, I'm red-green colour-blind too.  But other colours were even more rubbish.

    When the producer has finished writing stuff into the entry it tells the ProducerBarrier to commit it.

    The ProducerBarrier waits for the ring buffer cursor to catch up to where we are (for a single producer this will always be a bit pointless - e.g. we know the cursor is already at 12, nothing else is writing to the ring buffer).  Then the ProducerBarrier updates the ring buffer cursor to the sequence number on the updated Entry - 13 in our case.  Next, the ProducerBarrier lets the consumers know there's something new in the buffer.  It does this by poking the WaitStrategy on the ConsumerBarrier - "Oi, wake up! Something happened!" (note - different WaitStrategy implementations deal with this in different ways, depending upon whether it's blocking or not).

    Now consumer 1 can get entry 13, consumer 2 can get everything up to and including 13, and they all live happily ever after.

    ProducerBarrier batching
    Interestingly the disruptor can batch on the producer side as well as on the Consumer side.  Remember when consumer 2 finally got with the programme and found itself at sequence 9?  There is a very cunning thing the ProducerBarrier can do here - it knows the size of the buffer, and it knows where the slowest Consumer is.  So it can figure out which slots are now available.


    If the ProducerBarrier knows the ring buffer cursor is at 12, and the slowest Consumer is at 9, it can let producers write to slots 3, 4, 5, 6, 7 and 8 before it needs to check where the consumers are.

    Multiple producers
    You thought I was done, but there's more.

    I slightly lied in some of the above drawings.  I implied that the sequence number the ProducerBarrier deals with comes directly from the ring buffer's cursor.  However, if you look at the code you'll see that it uses the ClaimStrategy to get this.  I skipped this to simplify the diagrams, it's not so important in the single-producer case.

    With multiple producers, you need yet another thing tracking a sequence number.  This is the sequence that is available for writing to.  Note that this is not the same as ring-buffer-cursor-plus-one - if you have more than one producer writing to the buffer, it's possible there are entries in the process of being written that haven't been committed yet.

    Let's revisit claiming a slot.  Each producer asks the ClaimStrategy for the next available slot.  Producer 1 gets sequence 13, like in the single producer case above.  Producer 2 gets sequence 14, even though the ring buffer cursor is still only pointing to 12, because the ClaimSequence is dishing out the numbers and has been keeping track of what's been allocated.

    So each producer has its own slot with a shiny new sequence number.

    I'm going colour producer 1 and its slot in green, and producer 2 and its slot in a suspiciously pink-looking purple.

    Now imaging producer 1 is away with the fairies, and hasn't got around to committing for whatever reason.  Producer 2 is ready to commit, and asks the ProducerBarrier to do so.

    As we saw in the earlier commit diagram, the ProducerBarrier is only going to commit when the ring buffer cursor reaches the slot behind the one it wants to commit into.  In this case, the cursor needs to reach 13 so that we can commit 14.  But we can't, because producer 1 is staring at something shiny and hasn't committed yet.  So the ClaimStrategy sits there spinning until the ring buffer cursor gets to where it should be.

    Now producer 1 wakes up from its coma and asks to commit entry 13 (green arrows are sparked by the request from producer 1).  The ProducerBarrier tells the ClaimStrategy to wait for the ring buffer cursor to get to 12, which it already had of course.  So the ring buffer cursor is incremented to 13, and the ProducerBarrier pokes the WaitStrategy to let everything know the ring buffer was updated.  Now the ProducerBarrier can finish the request from producer 2, increment the ring buffer cursor to 14, and let everyone know that we're done.

    You'll see that the ring buffer retains the ordering implied by the order of the initial nextEntry() calls, even if the producers finish writing at different times.  It also means that if a producer is causing a pause in writing to the ring buffer, when it unblocks any other pending commits can happen immediately.

    Phew.  And I managed to describe all that without mentioning a memory barrier once.


    EDIT: The most recent version of the RingBuffer hides away the Producer Barrier.  If you can't see a ProducerBarrier in the code you're looking at, then assume where I say "producer barrier" I mean "ring buffer"

    EDIT 2: Note that version 2.0 of the Disruptor uses different names to the ones in this article.  Please see my summary of the changes if you are confused about class names.