Saturday, July 13, 2013

Lock Free - Part Deux

It's been a while since my last post and I've spent the last year working on lots of cool stuff at Valve. I worked on the item importer for Team Fortress 2, helped with the Linux Steam launch, and am currently working on very cool unannounced tech, which I'm looking forward to sharing with people soon.

In the meantime, I'm lucky enough to be sitting across from Bruce Dawson, who is one of the most amazing people I know when it comes to crash analysis, performance analysis, and "what the heck is my system doing?"  His blog is full of all sorts of interesting things, and you can check it out at http://randomascii.wordpress.com/. It also turns out that he was the author of the GDC talk I saw a few years ago that got me interested in (and afraid of) lockless programming. It's a small world! :)

So occasionally I like to break out of what I'm doing day to day and catch up on SDL work, which always has different and interesting challenges.

This week I went and fixed a bug in SDL regarding OpenGL context caching. An OpenGL context is made "current" on a given thread, and each thread can have a different context active. SDL was caching the current context in a single global variable, which is wrong as soon as more than one thread has a context current.  What I really needed here was thread-local storage.
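
Concretely, the broken caching looked something like this (a simplified sketch, not SDL's actual code; SDL_GL_MakeCurrentImpl stands in for the real driver call):

static SDL_GLContext SDL_CurrentContext;  /* one cache shared by every thread */

int SDL_GL_MakeCurrent(SDL_Window *window, SDL_GLContext context)
{
    if (context == SDL_CurrentContext) {
        /* Wrong: a different thread may have made another context
           current since this thread last went through here. */
        return 0;
    }
    SDL_CurrentContext = context;
    return SDL_GL_MakeCurrentImpl(window, context);
}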

No problem, I'll just add it to SDL!

It turns out that each supported OS has the concept of thread-local storage, and all I need is a single slot of OS thread-local storage to implement a full set of SDL-based thread-local storage. However, by its very nature, multiple threads may use it simultaneously, and I needed an efficient way to make sure only a single thread allocates the underlying OS slot the first time through.
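
As an aside, here's roughly how that layering works (a sketch, not SDL's exact code): each thread keeps one heap-allocated table reachable from the OS slot, and SDL slot IDs are just indexes into that table.

typedef struct {
    unsigned int limit;     /* number of slots in this thread's table */
    void *values[1];        /* variable-length array of slot values   */
} SDL_TLSData;

extern SDL_TLSData *SDL_SYS_GetTLSData(void);   /* defined below */

void *SDL_TLSGet(unsigned int id)
{
    SDL_TLSData *data = SDL_SYS_GetTLSData();   /* this thread's table */
    if (!data || id == 0 || id > data->limit) {
        return NULL;
    }
    return data->values[id - 1];
}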

No problem, SDL already has spinlocks, which are a good candidate for quick one-time initialization like this:
static DWORD thread_local_storage = TLS_OUT_OF_INDEXES;

SDL_TLSData *
SDL_SYS_GetTLSData(void)
{
    if (thread_local_storage == TLS_OUT_OF_INDEXES) {
        static SDL_SpinLock lock;
        SDL_AtomicLock(&lock);
        /* Re-check inside the lock: another thread may have won the
           race and already allocated the slot. */
        if (thread_local_storage == TLS_OUT_OF_INDEXES) {
            thread_local_storage = TlsAlloc();
        }
        SDL_AtomicUnlock(&lock);
    }
    return (SDL_TLSData *)TlsGetValue(thread_local_storage);
}
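
(For context, a spinlock like SDL_AtomicLock boils down to an atomic test-and-set loop; this sketch uses the GCC/Clang builtins rather than SDL's per-platform backends.)

typedef int SpinLockSketch;

static void LockSketch(SpinLockSketch *lock)
{
    /* Atomically store 1 and get the previous value back; we own the
       lock when the previous value was 0. */
    while (__sync_lock_test_and_set(lock, 1)) {
        /* spin; a real implementation would yield the CPU here */
    }
}

static void UnlockSketch(SpinLockSketch *lock)
{
    __sync_lock_release(lock);   /* store 0 with release semantics */
}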

I had Bruce come over and look at it since I knew he was well versed in this sort of thing, and he pointed out that even though I'm using a lock, it's really a form of lockless programming since I don't unconditionally acquire a lock before testing state.

I thought I was pretty safe, since I was double-checking the condition after acquiring the lock: if multiple threads tried to acquire the lock simultaneously, only one would actually do the initialization. The assignment is an atomic operation, so if another thread passed the if conditional it would see a valid value in thread_local_storage and would be fine.

Bruce pointed out that at the machine level it's a giant soup of memory and instructions, and memory accesses inside TlsAlloc() may be reordered relative to the assignment of the return value.  As compilers get better at inlining during link-time optimization, there may not even be a function call and return value to act as a boundary. This is the classic problem with double-checked locking, which is exactly what I'm doing here.
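
The same hazard is easier to see in the classic singleton shape of double-checked locking (hypothetical Widget names, deliberately broken in the same way):

#include <stdlib.h>
#include "SDL_atomic.h"

typedef struct { int value; } Widget;

static Widget *g_widget;        /* NULL until initialized */
static SDL_SpinLock g_lock;

Widget *GetWidget(void)
{
    if (!g_widget) {                      /* unsynchronized read */
        SDL_AtomicLock(&g_lock);
        if (!g_widget) {
            Widget *w = (Widget *)malloc(sizeof(*w));
            w->value = 42;                /* this initializing store...      */
            g_widget = w;                 /* ...may become visible AFTER the */
        }                                 /* pointer itself does             */
        SDL_AtomicUnlock(&g_lock);
    }
    return g_widget;     /* another thread may see a non-NULL pointer
                            before it sees w->value == 42 */
}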

What I really want is to guarantee that all the initialization writes done inside TlsAlloc() reach memory before the write to thread_local_storage "publishes" them. And conversely, I want any other thread to be sure it can see all the data associated with thread_local_storage before it starts using it.

The right tool here is memory barriers. On x86 and x64 architectures, acquire and release barriers only need to stop the compiler from reordering memory accesses, since the hardware itself is strongly ordered (other CPUs see writes in the order they occur). On ARM and PowerPC CPUs, memory barriers are actual hardware instructions that force synchronization. Jeff Preshing has a great blog that explains how to think about them, along with lots of other great information about lockless programming:
http://preshing.com/20120913/acquire-and-release-semantics

Essentially, the initializing thread performs a bunch of work and then sets a flag variable with a write-release. All other threads check the flag variable with a read-acquire, and only then start using the data the flag signals as ready.
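
Here's that pattern written with C11 atomics, where the release and acquire are attached directly to the flag accesses:

#include <stdatomic.h>

static int payload;                /* the data being published           */
static atomic_int ready;           /* the flag variable; 0 = not yet set */

/* Initializing thread: do the work, then publish with a write-release. */
void Publish(void)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Any other thread: read-acquire the flag, then it's safe to use the data. */
int Consume(void)
{
    if (atomic_load_explicit(&ready, memory_order_acquire)) {
        return payload;    /* guaranteed to see 42 */
    }
    return -1;             /* not published yet */
}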

Applying this here, we end up with:
static DWORD thread_local_storage = TLS_OUT_OF_INDEXES;

SDL_TLSData *
SDL_SYS_GetTLSData(void)
{
    if (thread_local_storage == TLS_OUT_OF_INDEXES) {
        static SDL_SpinLock lock;
        SDL_AtomicLock(&lock);
        if (thread_local_storage == TLS_OUT_OF_INDEXES) {
            DWORD storage = TlsAlloc();
            /* Make sure everything TlsAlloc() wrote is visible
               before the write that publishes the slot. */
            SDL_MemoryBarrierRelease();
            thread_local_storage = storage;
        }
        SDL_AtomicUnlock(&lock);
    }
    /* Pairs with the release above: don't touch anything associated
       with the slot until we can see its initialization. */
    SDL_MemoryBarrierAcquire();
    return (SDL_TLSData *)TlsGetValue(thread_local_storage);
}

It's tempting to simply add the word "volatile" to the declaration of thread_local_storage, but that's not exactly what we want. Visual Studio has implemented volatile as having the correct memory barrier semantics when using /volatile:ms, but other compilers simply guarantee that they won't reorder reads and writes around it, and the CPU is completely free to ignore that. In fact, it may be safer to define a new keyword that indicates that multiple threads access a variable but doesn't have volatile semantics, and always use memory barriers when appropriate.
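
For the curious, here's roughly how barriers like these can be defined with GCC-style inline assembly; this is a sketch, and SDL's real definitions live in SDL_atomic.h and cover more platforms and compilers:

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
/* x86/x64: strongly ordered hardware, so a compiler-only barrier suffices */
#define MemoryBarrierReleaseSketch() __asm__ __volatile__("" : : : "memory")
#define MemoryBarrierAcquireSketch() __asm__ __volatile__("" : : : "memory")
#elif defined(__GNUC__) && defined(__ARM_ARCH_7A__)
/* ARMv7: an actual data memory barrier instruction is required */
#define MemoryBarrierReleaseSketch() __asm__ __volatile__("dmb ish" : : : "memory")
#define MemoryBarrierAcquireSketch() __asm__ __volatile__("dmb ish" : : : "memory")
#endif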

The final code in SDL has error handling and a fallback to a slower generic implementation in case the process is out of thread-local storage slots, but I believe this is a correct and complete example of thread-safe initialization.

Cheers!