I've spent a bunch of time the past few weeks learning about lock-free and wait-free algorithms for super fast multi-threaded data structure access.
I've had two motivations for this: first, I wanted to vet the new SDL atomic API, and second, I wanted to see whether I could improve the performance of the SDL event system.
In the course of learning about lockless programming, I've gained a healthy respect for the people who do it correctly. It's incredibly tricky to get right, and even reviewed and published papers can have subtle bugs in them. There are many subtle ways multi-threaded lockless code can fail, especially on the Xbox 360, where data ordering between CPUs isn't guaranteed at all without expensive sync instructions.
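To make the ordering problem concrete, here's a minimal sketch of the classic publish/consume pattern, using the release/acquire barrier calls from the SDL atomic API (the exact macro names may differ depending on your SDL version, and the variable names here are made up for illustration). Without the barriers, a weakly ordered CPU is free to let the consumer see the flag before it sees the data the flag is supposed to guard.

#include "SDL_atomic.h"

static int payload;     /* plain data being handed from one core to another */
static int ready;       /* flag that publishes the payload */

/* Producer core: fill in the payload, then raise the flag. */
static void Publish(void)
{
    payload = 42;
    SDL_MemoryBarrierRelease();   /* everything written above must be visible before 'ready' */
    ready = 1;
}

/* Consumer core: check the flag, then read the payload. */
static int TryConsume(int *out)
{
    if (ready) {
        SDL_MemoryBarrierAcquire();   /* nothing below may be read before 'ready' was observed */
        *out = payload;               /* guaranteed to see 42 */
        return 1;
    }
    return 0;   /* not published yet */
}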
All of this has made me wonder if it's a good idea to make a cross-platform atomic API available in the first place.
But that said, let's first see what the benefits might be...
I wrote a lock-free FIFO that modeled the behavior of the SDL event queue, and created a test that could run both lock-free and using a mutex, to simulate the way the current SDL event queue is handled. I then added a watcher thread and single spinlock to the lock-free version to simulate a thread periodically coming in and having to do heavy duty manipulation of the queue.
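Here's a rough sketch of the kind of queue I mean: a fixed-size ring of slots where each slot carries its own sequence counter, writers claim a slot with a compare-and-swap on one index, and readers do the same on the other. This is essentially the bounded multi-producer/multi-consumer queue described at 1024cores.net, written against the SDL atomic calls (SDL_AtomicGet, SDL_AtomicSet, SDL_AtomicCAS); it's illustrative rather than the exact code from my test, and the watcher thread and spinlock are left out.

#include "SDL_atomic.h"

#define MAX_ENTRIES 256   /* must be a power of two; matches the 256-event cap in the test */

typedef struct {
    SDL_atomic_t sequence;   /* per-slot sequence counter */
    int event;               /* payload; a real queue would hold an SDL_Event */
} QueueEntry;

typedef struct {
    QueueEntry entries[MAX_ENTRIES];
    SDL_atomic_t enqueue_pos;   /* next position writers will claim */
    SDL_atomic_t dequeue_pos;   /* next position readers will claim */
} LockFreeFIFO;

static void InitFIFO(LockFreeFIFO *fifo)
{
    int i;
    for (i = 0; i < MAX_ENTRIES; ++i) {
        SDL_AtomicSet(&fifo->entries[i].sequence, i);
    }
    SDL_AtomicSet(&fifo->enqueue_pos, 0);
    SDL_AtomicSet(&fifo->dequeue_pos, 0);
}

/* Returns SDL_FALSE if the queue is full; the caller counts a wait and retries. */
static SDL_bool EnqueueEvent(LockFreeFIFO *fifo, int event)
{
    QueueEntry *entry;
    int pos, delta;

    pos = SDL_AtomicGet(&fifo->enqueue_pos);
    for (;;) {
        entry = &fifo->entries[pos & (MAX_ENTRIES - 1)];
        delta = SDL_AtomicGet(&entry->sequence) - pos;
        if (delta == 0) {
            /* Slot is free for this position; claim it by advancing enqueue_pos. */
            if (SDL_AtomicCAS(&fifo->enqueue_pos, pos, pos + 1)) {
                break;
            }
        } else if (delta < 0) {
            return SDL_FALSE;   /* queue is full */
        }
        pos = SDL_AtomicGet(&fifo->enqueue_pos);   /* lost a race; reload and retry */
    }
    entry->event = event;
    SDL_AtomicSet(&entry->sequence, pos + 1);      /* publish the entry to readers */
    return SDL_TRUE;
}

/* Returns SDL_FALSE if the queue is empty; the caller counts a wait and retries. */
static SDL_bool DequeueEvent(LockFreeFIFO *fifo, int *event)
{
    QueueEntry *entry;
    int pos, delta;

    pos = SDL_AtomicGet(&fifo->dequeue_pos);
    for (;;) {
        entry = &fifo->entries[pos & (MAX_ENTRIES - 1)];
        delta = SDL_AtomicGet(&entry->sequence) - (pos + 1);
        if (delta == 0) {
            /* Entry is published for this position; claim it by advancing dequeue_pos. */
            if (SDL_AtomicCAS(&fifo->dequeue_pos, pos, pos + 1)) {
                break;
            }
        } else if (delta < 0) {
            return SDL_FALSE;   /* queue is empty */
        }
        pos = SDL_AtomicGet(&fifo->dequeue_pos);   /* lost a race; reload and retry */
    }
    *event = entry->event;
    SDL_AtomicSet(&entry->sequence, pos + MAX_ENTRIES);   /* recycle the slot for writers */
    return SDL_TRUE;
}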
The test took 4 writer threads and had each of them put 1 MILLION events on the queue, and then created 4 reader threads to pull them off and process them. The queue was limited to a maximum of 256 events. I defined wait states for writers if they tried to add an event and the queue was full, and defined wait states for readers if they tried to get an event and the queue was empty.
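And here's roughly what each test thread does, building on the queue sketch above. Again, this is illustrative rather than the exact harness (thread creation, joining, and timing are left out, and the real test may spin or yield instead of sleeping), but the wait counters are the ones that show up in the results below.

#include "SDL.h"   /* for SDL_Delay() */

#define NUM_WRITERS       4
#define NUM_READERS       4
#define EVENTS_PER_WRITER 1000000

typedef struct {
    LockFreeFIFO *fifo;   /* the queue from the sketch above */
    int count;            /* events written or read */
    int waits;            /* times the thread found the queue full or empty */
} ThreadData;

static SDL_atomic_t writers_done;   /* incremented by each writer as it finishes */

/* Each writer pushes EVENTS_PER_WRITER events, counting a wait whenever the queue is full. */
static int WriterThread(void *arg)
{
    ThreadData *data = (ThreadData *)arg;
    int i;
    for (i = 0; i < EVENTS_PER_WRITER; ++i) {
        while (!EnqueueEvent(data->fifo, i)) {
            ++data->waits;
            SDL_Delay(1);   /* simple backoff while the queue is full */
        }
        ++data->count;
    }
    SDL_AtomicAdd(&writers_done, 1);
    return 0;
}

/* Each reader drains events until every writer has finished and the queue is empty. */
static int ReaderThread(void *arg)
{
    ThreadData *data = (ThreadData *)arg;
    int event;
    for (;;) {
        if (DequeueEvent(data->fifo, &event)) {
            ++data->count;
        } else if (SDL_AtomicGet(&writers_done) == NUM_WRITERS) {
            break;          /* no more events are coming */
        } else {
            ++data->waits;
            SDL_Delay(1);   /* simple backoff while the queue is empty */
        }
    }
    return 0;
}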
I ran the test on a 4-core Mac Pro (with 8 hardware threads), on a mostly idle system.
The results were astonishing!
FIFO test---------------------------------------
Mode: Mutex
Finished in 37.097000 sec
Writer 0 wrote 1000000 events, had 0 waits
Writer 1 wrote 1000000 events, had 0 waits
Writer 2 wrote 1000000 events, had 0 waits
Writer 3 wrote 1000000 events, had 0 waits
Reader 0 read 999998 events, had 50 waits
Reader 1 read 1000003 events, had 23 waits
Reader 2 read 999998 events, had 13 waits
Reader 3 read 1000001 events, had 7 waits
FIFO test---------------------------------------
Mode: LockFree
Finished in 0.688000 sec
Writer 0 wrote 1000000 events, had 5308 waits
Writer 1 wrote 1000000 events, had 5263 waits
Writer 2 wrote 1000000 events, had 5348 waits
Writer 3 wrote 1000000 events, had 5314 waits
Reader 0 read 1023604 events, had 7838 waits
Reader 1 read 991976 events, had 7688 waits
Reader 2 read 981770 events, had 7700 waits
Reader 3 read 1002650 events, had 7893 waits
As you can see, the lock-free version was over 50 times faster than the mutex version!
On the other hand, the mutex-protected queue still processed an event roughly every 9 microseconds (37 seconds / 4 million events ≈ 9 µs per event), which is well within the design specs for the SDL event queue. :)
In conclusion, lock-free algorithms are a huge benefit for code that needs extremely high performance and has to scale well across many processors. Likely applications include high-transaction-rate network servers, databases, and massively parallel data processing.
Resources:
http://msdn.microsoft.com/en-us/library/ee418650%28v=vs.85%29.aspx
http://www.1024cores.net/home/lock-free-algorithms
http://www.drdobbs.com/high-performance-computing/212201163
Oh, and if you're not afraid yet, think about what might happen if you're managing dynamically allocated objects with an atomic reference count... (there's a sketch of the problem after these links)
http://www.1024cores.net/home/lock-free-algorithms/object-life-time-management/differential-reference-counting
http://www.drdobbs.com/architecture-and-design/184401888
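To spell out why that's scary: taking a reference through a shared pointer is two separate steps, loading the pointer and bumping the count, and another thread can free the object in the gap between them. Here's a minimal sketch of the trap (made-up names, SDL-style atomic calls, and it assumes a single owner doing the replacement); this shows the problem, not a solution.

#include <stdlib.h>
#include "SDL_atomic.h"

typedef struct {
    SDL_atomic_t refcount;
    int payload;
} Object;

/* A pointer shared between threads. */
static Object *shared_object;

/* Reader thread: grab the current object and pin it with a reference. */
static Object *AcquireObject(void)
{
    Object *obj = (Object *)SDL_AtomicGetPtr((void **)&shared_object);  /* step 1: load the pointer */
    /* If the owner drops its reference right here, the object can already
       be freed, and the increment below scribbles on freed memory. */
    SDL_AtomicIncRef(&obj->refcount);                                   /* step 2: too late */
    return obj;
}

/* Owner thread: swap in a replacement and drop the reference to the old object. */
static void ReplaceObject(Object *new_object)
{
    Object *old = (Object *)SDL_AtomicGetPtr((void **)&shared_object);
    SDL_AtomicSetPtr((void **)&shared_object, new_object);
    if (SDL_AtomicDecRef(&old->refcount)) {   /* count hits zero... */
        free(old);                            /* ...and the object is gone */
    }
}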