Saturday, February 12, 2011

MPEG acceleration with GLSL

Video decoding is something that people are always trying to find ways to accelerate.  Whether it's making HD video more HD or dynamically streaming video to textures in your game, we want it as fast and high quality as possible.

MPEG-based codecs have two time-consuming steps: the first is decoding each frame into a YUV colorspace image, and the second is converting the image from YUV to RGB.  There is plenty of information available on MPEG stream decoding and YUV colorspaces, so here I'm going to focus on the YUV to RGB conversion.

To understand how to accelerate this process, we need to understand a little about the YUV format and how the conversion is done.

YV12 images consist of 3 planes: one Y plane sized WxH, and a U and V plane, each sized W/2 x H/2.  Put simply, the Y plane contains the luminance, which can be used alone for grayscale, and the U and V planes contain the blue-difference and red-difference chroma components, one value for each 2x2 block of output pixels.
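For a WxH frame this works out to W*H bytes of luma plus two quarter-size chroma planes, half the size of 24-bit RGB. A quick C sketch of the arithmetic (the function names are mine, not from any particular decoder; W and H assumed even):

```c
#include <stddef.h>

/* Plane sizes for a YV12 frame of dimensions w x h. */
static size_t yv12_y_size(size_t w, size_t h)      { return w * h; }
static size_t yv12_chroma_size(size_t w, size_t h) { return (w / 2) * (h / 2); }

static size_t yv12_frame_size(size_t w, size_t h)
{
    /* One full-size Y plane plus two quarter-size chroma planes:
       w * h * 3/2 bytes total. */
    return yv12_y_size(w, h) + 2 * yv12_chroma_size(w, h);
}
```

So the 1024x1050 frame used in the test below takes 1024 * 1050 * 3/2 = 1,612,800 bytes, versus 3,225,600 bytes as 24-bit RGB.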

The formula for converting from YUV to RGB is:
R = 1.164(Y - 16) + 1.596(V - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
B = 1.164(Y - 16)                  + 2.018(U - 128)
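In C, with clamping to the 8-bit range, the per-pixel conversion looks something like this (a plain sketch of the formula above, not taken from any particular decoder):

```c
/* Clamp to [0,255] and round to the nearest integer. */
static unsigned char clamp255(double x)
{
    if (x < 0.0)   return 0;
    if (x > 255.0) return 255;
    return (unsigned char)(x + 0.5);
}

/* Convert one 8-bit YUV pixel (studio-swing range) to RGB. */
static void yuv_to_rgb(unsigned char y, unsigned char u, unsigned char v,
                       unsigned char *r, unsigned char *g, unsigned char *b)
{
    double yf = 1.164 * (y - 16);
    *r = clamp255(yf + 1.596 * (v - 128));
    *g = clamp255(yf - 0.813 * (v - 128) - 0.391 * (u - 128));
    *b = clamp255(yf + 2.018 * (u - 128));
}
```

Note that Y nominally spans 16-235 and U/V span 16-240, which is why the offsets and the 1.164 scale factor are there.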

The basic idea for the shader is to create three different textures, one for each plane, and pull the Y, U, and V components from each texture and combine them using the above formula into the output RGB values.

The key to optimizing the shader is to recognize that GPU shader hardware is built for massively parallel operation, that many of the common operations in 3D math execute in a single cycle, and then to reduce the number of operations as much as possible.

Looking at the formula, it can be broken down into an offset for each of the YUV components, followed by a multiply and an add on each of them, which conveniently is exactly how a dot product is defined.  So I simply create constants for each of the operations and put it all together!
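As a sanity check, the same math can be done the shader's way on the CPU: normalize everything to [0,1], add one shared offset, then take one dot product per channel. This is a sketch using the shader's constants from below; note the offset uses 0.0625 = 16/256 as a convenient approximation of 16/255.

```c
#include <math.h>

static double dot3(const double a[3], const double b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* Shader-style conversion: normalized [0,1] inputs,
   one shared offset, one dot product per output channel. */
static void yuv_to_rgb_dot(double y, double u, double v, double rgb[3])
{
    const double offset[3] = { -0.0625, -0.5,   -0.5   };
    const double Rcoeff[3] = {  1.164,   0.000,  1.596 };
    const double Gcoeff[3] = {  1.164,  -0.391, -0.813 };
    const double Bcoeff[3] = {  1.164,   2.018,  0.000 };
    double yuv[3] = { y + offset[0], u + offset[1], v + offset[2] };

    rgb[0] = dot3(yuv, Rcoeff);
    rgb[1] = dot3(yuv, Gcoeff);
    rgb[2] = dot3(yuv, Bcoeff);
}
```

Feeding in white (Y = 235/255, U = V = 0.5) comes out within a fraction of a percent of (1, 1, 1), which is the GPU's clamped framebuffer value anyway.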

varying vec2 tcoord;
uniform sampler2D tex0; // Y 
uniform sampler2D tex1; // U 
uniform sampler2D tex2; // V 

// YUV offset 
const vec3 offset = vec3(-0.0625, -0.5, -0.5);

// RGB coefficients 
const vec3 Rcoeff = vec3(1.164,  0.000,  1.596);
const vec3 Gcoeff = vec3(1.164, -0.391, -0.813);
const vec3 Bcoeff = vec3(1.164,  2.018,  0.000);

void main()
{
    vec3 yuv, rgb;

    // Get the Y value
    yuv.x = texture2D(tex0, tcoord).r;

    // Get the U and V values (the chroma planes are half size,
    // so sample at half the coordinate; varyings are read-only
    // in the fragment shader, so use a local copy)
    vec2 chroma_coord = tcoord * 0.5;
    yuv.y = texture2D(tex1, chroma_coord).r;
    yuv.z = texture2D(tex2, chroma_coord).r;

    // Do the color transform
    yuv += offset;
    rgb.r = dot(yuv, Rcoeff);
    rgb.g = dot(yuv, Gcoeff);
    rgb.b = dot(yuv, Bcoeff);

    // That was easy. :)
    gl_FragColor = vec4(rgb, 1.0);
}

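For reference, here is the same per-pixel logic mirrored on the CPU for a whole frame, including the half-resolution chroma lookup that sampling at tcoord * 0.5 performs (a sketch assuming tightly packed planes with no row padding; real decoders often have padded strides):

```c
#include <stddef.h>

/* Clamp to [0,255] and round to the nearest integer. */
static unsigned char clamp_u8(double x)
{
    if (x < 0.0)   return 0;
    if (x > 255.0) return 255;
    return (unsigned char)(x + 0.5);
}

/* Reference YV12 -> RGB24 conversion (w and h assumed even).
   Each U/V value covers a 2x2 block of output pixels, which is
   exactly what sampling the chroma textures at half the texture
   coordinate does on the GPU. */
static void yv12_to_rgb24(const unsigned char *yp,
                          const unsigned char *up,
                          const unsigned char *vp,
                          size_t w, size_t h, unsigned char *out)
{
    for (size_t j = 0; j < h; ++j) {
        for (size_t i = 0; i < w; ++i) {
            double y = 1.164 * (yp[j * w + i] - 16);
            double u = up[(j / 2) * (w / 2) + i / 2] - 128.0;
            double v = vp[(j / 2) * (w / 2) + i / 2] - 128.0;
            unsigned char *px = out + (j * w + i) * 3;
            px[0] = clamp_u8(y + 1.596 * v);                  /* R */
            px[1] = clamp_u8(y - 0.391 * u - 0.813 * v);      /* G */
            px[2] = clamp_u8(y + 2.018 * u);                  /* B */
        }
    }
}
```

This is essentially what the MMX software path below does per frame, while the shader does it once per fragment, in parallel.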
Now the test!

I used a 1024x1050 image, converted it to YV12, and then repeatedly updated a streaming texture and displayed it on the screen.  I ran this test on a Mac Pro running Mac OS X, using both MMX optimized software color conversion and OpenGL GLSL color conversion.

The code is available here:

SDL_RENDER_DRIVER=software ./yuvspeedtest ~/bluemarble2k_big.bmp
Using software rendering
26.83 frames per second

SDL_RENDER_DRIVER=opengl ./yuvspeedtest ~/bluemarble2k_big.bmp
Using opengl rendering
1040.53 frames per second

Using hardware shader acceleration got almost a 40x speedup!

Forty times?! That's right...FORTY! OMG!!!



  1. The YUV offset is unused (const offset).

  2. Nope, this line uses it:
    yuv += offset;

  3. Oops, sorry for the noise, my browser's search feature has gone wrong.

  4. Did you test this with your SMPEG library or something else?

  5. I tested it with the YUV speed test listed above. SMPEG would be a good production test, but there's some multi-threaded re-architecting that needs to happen in that library before it's ready to use with this code.

  6. Great work Sam
    More blog posts

    does your name mean "sam lantiga or ken" in french? slouken.

    anyway you are a juggernaut.
    my theory that there are only 50 people in the universe is proven by the few replies here,

  7. Thanks!

    My name is Fresian actually, and means "of the land"

  8. Nice post! I was thinking about doing something similar to speed up JPEG processing for large images.

    I've noticed that you're only using the first 8 bits of each texture. Might this be a waste of CPU >> GPU bandwidth? How does performance improve further when you combine the UV planes as a .xy 16-bit texture access?

    Going further, I wonder if there's a clever way to combine Y with the UV planes as well? Currently you're making good use of the GPU interpolating your chroma subsampled planes.

    Perhaps it's as simple as first having the CPU place (decode) YUV into one 24bit W*H texture, where the UV planes fill only half of the texture (W/2*H/2). Then you might have two texture access in your fragment shader: i) sample Y at the current tcoord varying; ii) sample UV simultaneously at tcoord*0.5... that's 1/3 of the CPU >> GPU bandwidth required, one less texture access and two less texture units reserved in your fragment shader.


  9. Hi Fresian,

    I tried your shader script, but I get garbled colors. I think maybe I chose a wrong tcoord value.

    I can render an RGB565 image with a texture, and I used the same texCoords as with that one.

    Can you provide some OpenGL ES 2 sample code to use your shader?

    Thanks in advance.