Friday, February 25, 2011

Business Phone and FAX

For business purposes I need a contact phone and FAX number, but I don't want to have the cost and equipment requirements of a separate phone line.  Given that I'm creating a virtual office, I figured there must be inexpensive and flexible ways of setting up a business phone system.

Google to the rescue, again!

Actually there are lots and lots of options available with different pricing schemes and features, but given that I'm using Google for business e-mail and document sharing, and Google Voice is free, it's a natural fit:
http://www.google.com/voice

It's a little awkward for outgoing calls: you enter the number you want to call and which of your phones to call from, and then it calls both and connects the two. But it's freakin' awesome as an incoming voice message system.  You can route your virtual number to any phone you want, and if a call goes to voice-mail, Google will take a message, transcribe it to text, and send it to you via SMS and e-mail.

So, now that voice is taken care of, I need a solution for FAX. Again, there are lots of options available, but given that I plan to use e-mail for most outbound communication and FAX only for receiving signed license documents, OneSuite turns out to be a good option:
http://www.onesuite.com/

You have to create an account and "charge" it with $10, but once you do that, you can activate a FAX number for unlimited incoming transmissions for $1/month.  If you want outgoing FAX capability, or want the incoming FAX messages to go to multiple e-mail addresses, you can upgrade to $2.95/month + 2.5c/page for outgoing transmissions.

Business phone and FAX solution implemented.  Total cost?  $1/month.  Pure win?  Priceless.

Problem Solving in PowerPC Assembly

When you're working with other people's code, you don't always have the source, and sometimes this means you have to dive into the assembly code to figure out what's going on.

If you're lucky, the problem will be in a small function with well-defined inputs and outputs, and you can see what's going on...

In this case someone reported a bug where if they created two different streaming textures, the program would crash deep inside glTexSubImage2D() in a routine called gleCopy().  They helpfully sent a small test program, and sure enough, it crashed on my iMac.

So, not having any leads, I started stepping through the assembly code:

...
0x0424a3fc <gleCopy+88>:        mtctr   r4
0x0424a400 <gleCopy+92>:        rlwinm  r0,r9,2,0,29
0x0424a404 <gleCopy+96>:        addi    r9,r9,1
0x0424a408 <gleCopy+100>:       lwzx    r2,r11,r0
0x0424a40c <gleCopy+104>:       stwx    r2,r3,r0
0x0424a410 <gleCopy+108>:       bdnz+   0x424a400 <gleCopy+92>
...
0x0424a464 <gleCopy+192>:       add     r11,r11,r7
0x0424a46c <gleCopy+200>:       add     r3,r3,r8
... 

This is a tight copy loop: the loop count is loaded from r4 into the count register (mtctr), and each word is loaded at an offset from the source pointer in r11 (lwzx) and stored at the same offset from the destination pointer in r3 (stwx). At the end of each row, r11 is incremented by r7 and r3 is incremented by r8.

In C this might look like:
while (rows_to_copy--) {
    for (i = 0; i < count; ++i) {
        dst[i] = src[i];
    }
    src += src_pitch;
    dst += dst_pitch;
}

When I printed out the registers, I noticed an interesting thing. The source and destination pointers started out the same, but the source and destination pitches were completely different!

To understand why the source and destination pointers were the same, I looked at the code that creates streaming textures and saw that I was using an Apple extension to have OpenGL use application memory instead of internally allocated memory.  Since I use that same pointer when updating the streaming textures, it makes sense that if the system thought copying needed to be done, it would copy from the pointer I passed in to the pointer I told it to use for data storage.
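
For context, the client-storage setup (GL_APPLE_client_storage) looks roughly like this.  This is a sketch from memory rather than the exact project source, and the texture target and format may differ:

/* Sketch: ask OpenGL to use our application memory as the texture
   backing store instead of copying it into driver-owned memory. */
glPixelStorei(GL_UNPACK_CLIENT_STORAGE_APPLE, GL_TRUE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_STORAGE_HINT_APPLE,
                GL_STORAGE_SHARED_APPLE);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
             GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, my_pixels);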

But why would it think it needed to do copying?  I looked at the original web page describing how to optimize Mac OS X texture upload and noticed a note saying that only textures whose row width is a multiple of 32 bytes bypass the copying step.  I figured it shouldn't crash even without an aligned width, but what the heck.  I resized the textures, and as expected, the program still crashed.

So what would cause the pitches to be different?  Well, in a normal texture upload, the pitch is controlled by the GL_UNPACK_ROW_LENGTH attribute.  In a flash, I realized that's probably what the extension uses to determine the original pitch of the texture.  Sure enough, I was missing a call to set that attribute when creating the texture.  Looking more closely at the values in the assembly, the pitch being used for the destination was indeed the pitch value set for the first texture upload.  And of course, if the pitch values don't match in a pixel copy operation, you have to copy row by row, just like the assembly code above was doing.

Adding a call to set GL_UNPACK_ROW_LENGTH in texture creation fixed the problems!
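
For reference, the fix is basically a one-liner at texture creation time.  The variable names here are illustrative; note that GL_UNPACK_ROW_LENGTH is specified in pixels, not bytes:

glPixelStorei(GL_UNPACK_ROW_LENGTH, pitch_in_bytes / bytes_per_pixel);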

So, even though the bug was in my code, and a pretty obvious one in retrospect, it was really helpful to be able to look at the assembly and understand WHY my code was wrong.

Cheers!

Thursday, February 24, 2011

Evolution of a Logo

This past month I've been working with a friend of mine, Shawn Bellah, on rebranding Galaxy Gameworks in preparation for GDC.  This is kind of a big deal, because it's the birth of the company in the public eye, and I want to make sure it's as fun and professional as possible.

One of the important pieces of this process is the logo.  The logo should be distinctive, simple, and say something awesome about your company.

We went through a very iterative process during logo development.  We started with a bunch of brainstorming sketches Shawn had done, exploring different ideas for the logo.  We liked the double-G play in the company name, and geeked out on the space theme.  We looked at lots of pictures of galaxies and nebulae and tried a few different concepts: interlocking G's where the top of each G was an arm of a spiral galaxy, a G as the center of a galaxy seen edge-on, a G made of welded steel plates surrounding a galactic center, and so on.

But I kept coming back to the first sketch that he had done to warm up, which was a solar system with the planets aligned to form the cross-bar of the G.  I felt a connection there, and I wanted to go explore the planets, and above all it just looked interesting.

So he played around with that idea, creating a galaxy suggestive of the letter G with some larger stellar objects to create interest:

[image: galaxy sketches with larger stellar objects]

We settled on one of the variants with a nice angle, the planets in alignment, and a glow in the middle:

[image: the selected galaxy variant]

That looked nice, but wasn't suggestive enough of the G, and was a little too complex for our purposes.  So Shawn removed the planets and the tail, and thickened up the lines considerably:

[image: the simplified, thicker double-G mark]

This was simple and showed off the two G's nicely, but I didn't really like the bulge on the outer G, so I used my crude art skills to trim it to show what I was thinking, and Shawn came back with half a dozen variants on what I was showing him.

At that point, both Lauren and I had been staring at it for too long to have any reasonable fresh opinions, so I showed the options to some friends who were artists, and they unanimously picked one of them.

So, a little more polish and we have our final logo!

[image: the final Galaxy Gameworks logo]

It's clean, simple, and has the double G with a galaxy motif.  Perfect!

Thursday, February 17, 2011

Ninja hacking on the iPhone

I'm tracking down a crash in SDL on the iPhone, and the path is not yet clear to me, but I thought some people would enjoy the view along the way.

The crash itself happens only on the real phone, not on the simulator, and it's a crash initializing an SDL_uikitopenglview, which is a view deriving from SDL_uikitview, which in turn derives from UIView.

The callstack for the crash looks like this:
_class_getMeta ()
_class_isInitialized ()
_class_initialize ()
...
objc_msgSend_uncached ()
UIKit_GL_CreateContext () at SDL_uikitopengles.m:146

Of course everything past my code is ARM assembly, which makes it a little tricky to debug.  Luckily Apple has published the source to their Objective C runtime, so I can disassemble the functions using gdb and follow along:
http://www.opensource.apple.com/source/objc4/objc4-437.1

First, there are a couple of useful things to know if you're poking around at this level:

The ARM calling convention passes the first four parameters to a function in registers r0 through r3, in left-to-right order. The return value of the function is also passed back through r0.

The Xcode debugging window has a nice interface with the code right there along with the local variables and registers. On the far right is a button to bring up the gdb console where you can do some pretty advanced things.

gdb quick reference:
b <name> - set a breakpoint at the beginning of the named function
s - go to the next line of code, stepping into function calls
n - go to the next line of code, skipping over function calls
si - go to the next assembly instruction
fin - run until the function returns
c - continue running until the next breakpoint
p <var> - print the value of a variable or register (e.g. $r0, $r1, etc.)
x <address> - examine memory at an address (gdb also shows the symbol associated with it)
display <var> - print the value of a variable or register after each command 
list - list the code around the current execution

Most of these we don't need since the Xcode UI is pretty nice, but a really handy one is 'si', since that will let us step into the assembly and then use the UI to continue tracing the execution.

So first, I set a breakpoint at the line that crashes:
view = [[SDL_uikitopenglview alloc]

Then, I bring up the gdb console and use the 'si' command a few times until I get into assembly, just to see what things look like:

I'm curious what the first parameter to objc_msgSend() is, so I use 'x $r0' and it shows that it's "OBJC_CLASS_$_SDL_uikitopenglview", which is the Objective C class definition for my custom view.

Then I use the 'b' command to set a breakpoint in the _class_initialize() function, and bring up the code so I can follow along with the assembly.  When the breakpoint hits, I step into the first instruction in the function, a call to _class_getNonMetaClass(). I double check r0, and it's still my view class definition, but on return from the function, it's been set to 0!

The code that was executed is this:
static class_t *getNonMetaClass(class_t *cls)
{
    rwlock_assert_locked(&runtimeLock);
    if (isMetaClass(cls)) {
        cls = NXMapGet(uninitializedClasses(), cls);
    }
    return cls;
}
which means that somehow the class for my view didn't get into the map of classes that my program has loaded.

I did a little googling and found that Apple has a set of APIs for managing and interacting with the Objective C classes, and so I wrote a function to print them out and look for anything with SDL in it:
#include <stdio.h>
#include <stdlib.h>
#include <objc/runtime.h>

void print_classes()
{
    int i, numClasses;
    Class *classes;

    numClasses = objc_getClassList(NULL, 0);
    classes = malloc(sizeof(Class) * numClasses);
    numClasses = objc_getClassList(classes, numClasses);
    for (i = 0; i < numClasses; ++i) {
        const char *name = class_getName(classes[i]);
        if (SDL_strstr(name, "SDL_")) {
            printf("%s\n", name);  /* Yay, found it! */
        }
    }
    free(classes);
}
Sure enough, when I run it on the simulator I find the SDL view classes, and when I run it on the device they don't show up. If I use nm on the application binary in the app folder, I see the classes are there, in both the simulator and device binaries:
nm -m Happy | fgrep SDL_uikitopenglview
0008a287 (__TEXT,__text) non-external -[SDL_uikitopenglview context]
...
000cd06c (__DATA,__objc_data) external _OBJC_CLASS_$_SDL_uikitopenglview
000cd058 (__DATA,__objc_data) external _OBJC_METACLASS_$_SDL_uikitopenglview

So, at this point I know why the crash is happening, but I don't know why the classes aren't being loaded on the device, or how to fix it yet.

Update:  Eric Wing figured this out.  The problem is that the Objective C class definitions were in a static library and the linker wasn't bringing in all the code necessary to construct the classes.  The solution is to add -ObjC to the "Other Linker Flags" setting for your application.

Thanks Eric! :)

Saturday, February 12, 2011

MPEG acceleration with GLSL

Video decoding is something that people are always trying to find ways to accelerate.  Whether it's making HD video more HD or dynamically streaming video to textures in your game, we want it as fast and high quality as possible.

MPEG-based codecs have two time-consuming steps: the first is decoding each frame into a YUV colorspace image, and the second is converting the image from YUV to RGB.  There is lots of information available on MPEG stream decoding and YUV colorspaces, but here I'm going to focus on the YUV to RGB conversion.

To understand how to accelerate this process, we need to understand a little about the YUV format and how the conversion is done.

YV12 images consist of 3 planes: one Y plane sized WxH, and a U and a V plane, each sized W/2 x H/2.  Put simply, the Y plane contains the luminance, which can be used alone for grayscale, and the U and V planes contain the blue and red color-difference (chroma) components, one value for each 2x2 block of output pixels.
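
In memory the planes are packed one after another, so finding them is simple pointer arithmetic.  Here's a sketch assuming a tightly packed YV12 buffer (note that in YV12 the V plane comes before the U plane):

unsigned char *y = pixels;                /* W x H luminance */
unsigned char *v = y + w * h;             /* W/2 x H/2 red chroma */
unsigned char *u = v + (w / 2) * (h / 2); /* W/2 x H/2 blue chroma */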

The formula for converting from YUV to RGB is:
R = 1.164(Y - 16) + 1.596(V - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
B = 1.164(Y - 16)                  + 2.018(U - 128)

The basic idea for the shader is to create three different textures, one for each plane, and pull the Y, U, and V components from each texture and combine them using the above formula into the output RGB values.
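
On the application side, uploading the planes might look something like this.  This is a minimal sketch with filtering setup omitted, and GL_LUMINANCE assumed as the single-channel format:

GLuint tex[3];
glGenTextures(3, tex);
glPixelStorei(GL_UNPACK_ALIGNMENT, 1);  /* rows may not be 4-byte aligned */

/* Y plane at full resolution */
glBindTexture(GL_TEXTURE_2D, tex[0]);
glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, w, h, 0,
             GL_LUMINANCE, GL_UNSIGNED_BYTE, y);

/* U and V planes at half resolution in each dimension */
glBindTexture(GL_TEXTURE_2D, tex[1]);
glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, w / 2, h / 2, 0,
             GL_LUMINANCE, GL_UNSIGNED_BYTE, u);

glBindTexture(GL_TEXTURE_2D, tex[2]);
glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, w / 2, h / 2, 0,
             GL_LUMINANCE, GL_UNSIGNED_BYTE, v);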

The key to optimizing the shader is to recognize that GPU shader hardware is built for massively parallel operations, that many of the common operations used in 3D math run in a single cycle, and then to reduce the number of operations as much as possible.

Looking at the formula, it can be broken down into an offset for each of the YUV components, followed by a multiply and an add on each of them, which conveniently is how a dot product is defined.  One wrinkle: texture values arrive in the shader normalized to the 0..1 range, so the integer offsets 16 and 128 become approximately 16/256 = 0.0625 and 128/256 = 0.5.  So, I simply create constants for each of the operations and put it all together!

varying vec2 tcoord;
uniform sampler2D tex0; // Y 
uniform sampler2D tex1; // U 
uniform sampler2D tex2; // V 

// YUV offset 
const vec3 offset = vec3(-0.0625, -0.5, -0.5);

// RGB coefficients 
const vec3 Rcoeff = vec3(1.164,  0.000,  1.596);
const vec3 Gcoeff = vec3(1.164, -0.391, -0.813);
const vec3 Bcoeff = vec3(1.164,  2.018,  0.000);

void main()
{
    vec3 yuv, rgb;

    // Get the Y value
    yuv.x = texture2D(tex0, tcoord).r;

    // Get the U and V values (the chroma planes are half resolution;
    // varyings are read-only in fragment shaders, so copy first)
    vec2 uv_tcoord = tcoord * 0.5;
    yuv.y = texture2D(tex1, uv_tcoord).r;
    yuv.z = texture2D(tex2, uv_tcoord).r;

    // Do the color transform
    yuv += offset;
    rgb.r = dot(yuv, Rcoeff);
    rgb.g = dot(yuv, Gcoeff);
    rgb.b = dot(yuv, Bcoeff);

    // That was easy. :)
    gl_FragColor = vec4(rgb, 1.0);
}

Now the test!

I used a 1024x1050 image, converted it to YV12 and then repeatedly updated a streaming texture and displayed it on the screen.  I ran this test on a Mac Pro running Mac OS X using both MMX optimized software color conversion and OpenGL GLSL color conversion.

The code is available here:
http://www.libsdl.org/blog/files/yuvspeedtest.c

SDL_RENDER_DRIVER=software ./yuvspeedtest ~/bluemarble2k_big.bmp
Using software rendering
26.83 frames per second

SDL_RENDER_DRIVER=opengl ./yuvspeedtest ~/bluemarble2k_big.bmp
Using opengl rendering
1040.53 frames per second

Using hardware shader acceleration got nearly a 40x speedup!

Forty times?! That's right...FORTY! OMG!!!



Friday, February 11, 2011

Streaming textures with SDL 1.3

I was recently asked how to use streaming textures with SDL 1.3, and while it's very simple, I didn't actually find any documentation on how to do it, so here it is!

First, why would you use a streaming texture?

Static textures are designed for sprites and backgrounds and other images that don't change much.  You update them with pixels using SDL_UpdateTexture().

Streaming textures are designed for things that update frequently, every few seconds, or every frame.  You can also update them with SDL_UpdateTexture(), but for optimal performance you lock them, write the pixels, and then unlock them.

Conceptually they're very simple to use:
  1. Call SDL_CreateTexture() with the SDL_TEXTUREACCESS_STREAMING access type
  2. Call SDL_LockTexture() to get raw access to the pixels
  3. Do any pixel manipulation you want
  4. Call SDL_UnlockTexture()
  5. Use the texture in rendering normally.
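A bare-bones version of that sequence might look like this sketch, assuming an ARGB8888 texture:

void *pixels;
int pitch;

if (SDL_LockTexture(texture, NULL, &pixels, &pitch) == 0) {
    int y;
    for (y = 0; y < height; ++y) {
        Uint32 *row = (Uint32 *)((Uint8 *)pixels + y * pitch);
        /* ... fill row[0..width-1] with ARGB pixels ... */
    }
    SDL_UnlockTexture(texture);
}
SDL_RenderCopy(renderer, texture, NULL, NULL);
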
You can specify any RGB/RGBA or YUV format you want, and SDL or the hardware drivers will do the conversion for you on the back end if the format isn't supported natively.  For the best speed you'll probably want to create the texture in the first format listed in the renderer info, although at the time of this writing SDL_PIXELFORMAT_ARGB8888 is the optimal format for all renderers.
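
For example, picking the renderer's preferred format might look like this (a minimal sketch against the SDL 1.3 API):

SDL_RendererInfo info;

if (SDL_GetRendererInfo(renderer, &info) == 0) {
    /* The first entry in texture_formats is the preferred one */
    texture = SDL_CreateTexture(renderer, info.texture_formats[0],
                                SDL_TEXTUREACCESS_STREAMING,
                                width, height);
}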

You might also want to create a surface from the texture pixels if you're doing old style blitting using other SDL surfaces.  You can do this by creating a surface with no pixel data and then filling the pixel and pitch info in later:

texture = SDL_CreateTexture(renderer,
                            SDL_PIXELFORMAT_ARGB8888,
                            SDL_TEXTUREACCESS_STREAMING,
                            width, height);

surface = SDL_CreateRGBSurfaceFrom(NULL,
                                   width, height,
                                   32, 0,
                                   0x00FF0000,
                                   0x0000FF00,
                                   0x000000FF,
                                   0xFF000000);

...
SDL_LockTexture(texture, NULL,
                &surface->pixels,
                &surface->pitch);
... draw to surface
SDL_UnlockTexture(texture);


I put together a very simple example based on the running moose by Mike Gorchak:
http://hg.libsdl.org/SDL/file/default/test/teststreaming.c


Enjoy!

Tuesday, February 8, 2011

Fun with shaders!

I just added GLSL shaders to the SDL OpenGL rendering implementation.

On my hardware this ends up being about a 200-400 FPS increase in testsprite2:

SDL_RENDER_OPENGL_SHADERS=0 ./testsprite2
4259.55 frames per second

SDL_RENDER_OPENGL_SHADERS=1 ./testsprite2
4552.88 frames per second

I also got a modest increase with testsprite, using the old SDL 1.2 API:

SDL_RENDER_OPENGL_SHADERS=0 ./testsprite
1329.16 frames per second

SDL_RENDER_OPENGL_SHADERS=1 ./testsprite
1354.20 frames per second

Woot! :)

I also noticed there's not a single example of using shaders with SDL that has full source code, so I added one:
http://hg.libsdl.org/SDL/file/default/test/testshader.c

Enjoy! :)

Saturday, February 5, 2011

Texture Streaming Performance

In my recent SDL 1.3 update I made it possible for the old SDL 1.2 API to be accelerated using texture streaming.

On my systems running Mac OS X and Linux, this more than doubled performance!

Mac OS X
  • SDL 1.2 testsprite: 514.22 FPS
  • SDL 1.3 testsprite texture streaming with OpenGL: 1259.62 FPS
  • SDL 1.3 testsprite2 (hardware accelerated):  3865.16 FPS
Linux
  • SDL 1.2 testsprite:  495.48 FPS
  • SDL 1.3 testsprite texture streaming with OpenGL:  1244.55 FPS
  • SDL 1.3 testsprite2 (hardware accelerated):  2556.85 FPS

On my system the Windows performance got worse!

Windows
  • SDL 1.2 testsprite using GDI: 1030.71 FPS
  • SDL 1.3 testsprite using GDI: 1077.81 FPS
  • SDL 1.3 testsprite texture streaming with OpenGL: 623.08 FPS
  • SDL 1.3 testsprite texture streaming with Direct3D: 233.97 FPS
  • SDL 1.3 testsprite2 (hardware accelerated with OpenGL): 3027.26 FPS
  • SDL 1.3 testsprite2 (hardware accelerated with Direct3D): 4259.48 FPS

Clearly the Windows GDI drivers are heavily optimized for 2D performance, but why is the Direct3D streaming performance so poor?

Here's what I'm doing for OpenGL:
  Texture format:  GL_RGBA8, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV
  Texture update:
    glBindTexture()
    glTexSubImage2D()
    glBegin(GL_TRIANGLE_STRIP)
    ...

I'm not doing anything fancy with pixel buffer objects; I'm just making sure that my data is in the optimal format for processing by the OpenGL drivers.
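
For comparison, a pixel buffer object upload path, which I'm deliberately not using here, might look something like this sketch (the buffer-orphaning trick and the sizes are assumptions, not measured code):

GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

/* Orphan the buffer each frame so the driver doesn't have to stall */
glBufferData(GL_PIXEL_UNPACK_BUFFER, pitch * height, NULL, GL_STREAM_DRAW);
void *dst = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(dst, pixels, pitch * height);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);

/* With a PBO bound, the data pointer is an offset into the buffer */
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);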

Here's what I'm doing for Direct3D:
    Device Setup:
    pparams.BackBufferCount = 1;
    pparams.SwapEffect = D3DSWAPEFFECT_DISCARD;
    pparams.PresentationInterval = D3DPRESENT_INTERVAL_IMMEDIATE;


    Texture create:
    device->CreateTexture(width, height, 1, D3DUSAGE_DYNAMIC, D3DFMT_A8R8G8B8, D3DPOOL_DEFAULT, &texture, NULL);

    Texture update:
    texture->LockRect(0, &locked, NULL, D3DLOCK_DISCARD);
    ... copy rows into locked.pBits, respecting locked.Pitch ...
    texture->UnlockRect(0);
    device->DrawPrimitiveUP(D3DPT_TRIANGLEFAN, 2, vertices, sizeof(*vertices));

    ...

For those who are curious, the full code can be found here:
http://www.libsdl.org/tmp/SDL-1.3.zip

So... does anyone know how to improve Direct3D texture streaming performance?

Thursday, February 3, 2011

New SDL API changes

I've been quiet the last week working on a massive restructuring of the SDL rendering API.  The result is a simpler, easier to use, and easier to port system.
http://forums.libsdl.org/viewtopic.php?t=6869&start=18

Whew!  Time to sleep! :)