Wednesday, January 26, 2011

Is Lock-Free worth it?

I've spent a bunch of time the past few weeks learning about lock-free and wait-free algorithms for super fast multi-threaded data structure access.

I've had two motivations for this.  First, I wanted to vet the new SDL atomic API, and second I wanted to potentially increase the performance of the SDL event system.

In the course of learning about lockless programming, I've gained a healthy respect for people who are doing it correctly.  It's incredibly tricky to get right, and even reviewed and published papers can have subtle bugs in them.  There are a lot of subtle issues and ways multi-threaded lockless code can fail, especially on the XBox 360 where data ordering between CPUs is not guaranteed at all without expensive sync instructions.

All of this has made me wonder if it's a good idea to make a cross-platform atomic API available in the first place.

But that said, let's first see what the benefits might be...

I wrote a lock-free FIFO that modeled the behavior of the SDL event queue, and created a test that could run both lock-free and using a mutex, to simulate the way the current SDL event queue is handled.  I then added a watcher thread and single spinlock to the lock-free version to simulate a thread periodically coming in and having to do heavy duty manipulation of the queue.

The test took 4 writer threads and had each of them put 1 MILLION events on the queue, and then created 4 reader threads to pull them off and process them.  The queue was limited to a maximum of 256 events.  I defined wait states for writers if they tried to add an event and the queue was full, and defined wait states for readers if they tried to get an event and the queue was empty.

I ran the test on a 4-core Mac Pro (with 8 hardware threads), on a mostly idle system.

The results were astonishing!

FIFO test---------------------------------------

Mode: Mutex
Finished in 37.097000 sec

Writer 0 wrote 1000000 events, had 0 waits
Writer 1 wrote 1000000 events, had 0 waits
Writer 2 wrote 1000000 events, had 0 waits
Writer 3 wrote 1000000 events, had 0 waits

Reader 0 read 999998 events, had 50 waits
Reader 1 read 1000003 events, had 23 waits
Reader 2 read 999998 events, had 13 waits
Reader 3 read 1000001 events, had 7 waits

FIFO test---------------------------------------

Mode: LockFree
Finished in 0.688000 sec

Writer 0 wrote 1000000 events, had 5308 waits
Writer 1 wrote 1000000 events, had 5263 waits
Writer 2 wrote 1000000 events, had 5348 waits
Writer 3 wrote 1000000 events, had 5314 waits

Reader 0 read 1023604 events, had 7838 waits
Reader 1 read 991976 events, had 7688 waits
Reader 2 read 981770 events, had 7700 waits
Reader 3 read 1002650 events, had 7893 waits

As you can see, the lock-free version was over 50 times faster than the mutex version!

On the other hand, the mutex protected queue was able to process an event in 9 microseconds, which is well within the design specs for the SDL event queue. :)

In conclusion, lock-free algorithms are a huge benefit for code that needs to be extremely high performance and scale well across many processors.  Applications of this might be fast transaction network servers, databases, or massively parallel data processing.

Resources:
http://msdn.microsoft.com/en-us/library/ee418650%28v=vs.85%29.aspx
http://www.1024cores.net/home/lock-free-algorithms
http://www.drdobbs.com/high-performance-computing/212201163

Oh, and if you're not afraid yet, think about might what happen if you're managing dynamically allocated objects with an atomic reference count...
http://www.1024cores.net/home/lock-free-algorithms/object-life-time-management/differential-reference-counting
http://www.drdobbs.com/architecture-and-design/184401888

7 comments:

  1. Could you by any chance post the source code for the test?

    ReplyDelete
  2. The latency difference between the two readers using Mutex is quite large.

    Mutex:
    1023604 - 981770 == 41834

    The difference with the LockFree version:
    1000003 - 999998 == 5

    In real time apps like games, variable latency is not the nicest thing. The latency improvements are a much nicer part of the gains you get compared to overall speed.

    ReplyDelete
  3. The source code is available in test/testatomic.c in the latest SDL 1.3 snapshot:
    http://www.libsdl.org/tmp/SDL-1.3.zip

    ReplyDelete
  4. I think you have the latency difference backwards. I think the results are just showing that on my system, the thread scheduling for mutexes is much more fair than the lock-free version.

    This makes sense, if you think about it. With a mutex the operating system is doing the scheduling, but with the lock-free version the hardware threads are running independently and have more variability in wait states.

    ReplyDelete
  5. Frankly, it's a feature that's rarely vital to the operation of a game or multimedia software.
    An MMORPG server designed to handle thousands of players would probably benefit greatly, but even then it can be written (with a great deal more effort, probably) to not require their rapidness for anything.

    ReplyDelete
  6. G-Wan, a 200 KB HTTP server with ANSI C scripts, has a wait-free DB layer used for its key value store.

    Might be worth giving it a look.

    ReplyDelete
  7. As you stated in your blog "Is Lock-Free worth it?" I think that there are many ways for this.It is a good idea to make a cross-platform atomic API available in the first place.This lock is very easy to do once you know the proper method you are right that It is very tricky to get right, and even reviewed. So I liked the way you wrote this.

    ReplyDelete