MPEG-based codecs have two time-consuming steps: the first is decoding each frame into a YUV colorspace image, and the second is converting that image from YUV to RGB. There is plenty of information available on MPEG stream decoding and YUV colorspaces, so here I'm going to focus on the YUV to RGB conversion.
To understand how to accelerate this process, we need to understand a little about the YUV format and how the conversion is done.
YV12 images consist of 3 planes: one Y plane sized W x H, and U and V planes each sized W/2 x H/2. Put simply, the Y plane contains the luminance, which can be used alone for grayscale, and the U and V planes contain the blue-difference and red-difference chroma components, one value for each 2x2 block of output pixels.
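In memory, a YV12 frame is one contiguous buffer with the Y plane first, followed by V, then U. Here's a minimal sketch of addressing the three planes; the struct and function names are just for illustration, not from any real code:

#include <stdint.h>

typedef struct {
    const uint8_t *y;   /* W x H luminance samples */
    const uint8_t *v;   /* (W/2) x (H/2) red-difference samples */
    const uint8_t *u;   /* (W/2) x (H/2) blue-difference samples */
} YV12Planes;

/* YV12 stores the planes contiguously: Y first, then V, then U. */
static YV12Planes yv12_planes(const uint8_t *buffer, int w, int h)
{
    YV12Planes p;
    p.y = buffer;
    p.v = p.y + w * h;
    p.u = p.v + (w / 2) * (h / 2);
    return p;
}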
The formula for converting from YUV to RGB is:
R = 1.164(Y - 16) + 1.596(V - 128)
G = 1.164(Y - 16) - 0.813(V - 128) - 0.391(U - 128)
B = 1.164(Y - 16) + 2.018(U - 128)
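For reference, here is the same formula as a scalar sketch in plain C, with results clamped to the 0..255 range; this is just for clarity, and isn't code from the test program:

static unsigned char clamp255(float x)
{
    return (unsigned char)(x < 0.0f ? 0.0f : (x > 255.0f ? 255.0f : x));
}

static void yuv_to_rgb(unsigned char y, unsigned char u, unsigned char v,
                       unsigned char *r, unsigned char *g, unsigned char *b)
{
    float yf = 1.164f * (y - 16);
    *r = clamp255(yf + 1.596f * (v - 128));
    *g = clamp255(yf - 0.813f * (v - 128) - 0.391f * (u - 128));
    *b = clamp255(yf + 2.018f * (u - 128));
}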
The basic idea for the shader is to create three different textures, one for each plane, pull the Y, U, and V components from those textures, and combine them into the output RGB values using the formula above.
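The host-side setup might look something like this, assuming desktop OpenGL and single-channel GL_LUMINANCE textures. The chroma textures here are allocated at full size with the data placed in the top-left quadrant, which is what the tcoord * 0.5 scaling in the shader below assumes; the variable names (w, h, y_plane, u_plane, v_plane) are illustrative, not from the actual test program:

glPixelStorei(GL_UNPACK_ALIGNMENT, 1);   /* planes are tightly packed */

GLuint tex[3];
glGenTextures(3, tex);

/* Y plane: full resolution. */
glBindTexture(GL_TEXTURE_2D, tex[0]);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, w, h, 0,
             GL_LUMINANCE, GL_UNSIGNED_BYTE, y_plane);

/* U and V planes: W/2 x H/2 of data each, uploaded into the
   top-left corner of a full-size texture. */
const GLubyte *chroma[2] = { u_plane, v_plane };
for (int i = 0; i < 2; ++i) {
    glBindTexture(GL_TEXTURE_2D, tex[1 + i]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_LUMINANCE, w, h, 0,
                 GL_LUMINANCE, GL_UNSIGNED_BYTE, NULL);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w / 2, h / 2,
                    GL_LUMINANCE, GL_UNSIGNED_BYTE, chroma[i]);
}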
The key to optimizing the shader is to recognize that GPU shader hardware is built for massively parallel operation and that many of the common operations used in 3D math execute in a single cycle, and then to reduce the number of operations as much as possible.
Looking at the formula, it can be broken down into an offset for each of the YUV components, followed by a multiply and an add on each of them, which conveniently is exactly how a dot product is defined. So I simply create constants for each of the operations and put it all together! (Note that texture values arrive in the shader normalized to the 0..1 range, so the offsets of 16 and 128 map to roughly 16/256 = 0.0625 and 128/256 = 0.5.)
varying vec2 tcoord;

uniform sampler2D tex0; // Y
uniform sampler2D tex1; // U
uniform sampler2D tex2; // V

// YUV offset
const vec3 offset = vec3(-0.0625, -0.5, -0.5);

// RGB coefficients
const vec3 Rcoeff = vec3(1.164,  0.000,  1.596);
const vec3 Gcoeff = vec3(1.164, -0.391, -0.813);
const vec3 Bcoeff = vec3(1.164,  2.018,  0.000);

void main()
{
    vec3 yuv, rgb;

    // Get the Y value
    yuv.x = texture2D(tex0, tcoord).r;

    // Get the U and V values (varyings are read-only in a fragment
    // shader, so scale the coordinate into a local variable)
    vec2 uvcoord = tcoord * 0.5;
    yuv.y = texture2D(tex1, uvcoord).r;
    yuv.z = texture2D(tex2, uvcoord).r;

    // Do the color transform
    yuv += offset;
    rgb.r = dot(yuv, Rcoeff);
    rgb.g = dot(yuv, Gcoeff);
    rgb.b = dot(yuv, Bcoeff);

    // That was easy. :)
    gl_FragColor = vec4(rgb, 1.0);
}
Now the test!
I used a 1024x1050 image, converted it to YV12, and then repeatedly updated a streaming texture and displayed it on the screen. I ran this test on a Mac Pro running Mac OS X, using both MMX-optimized software color conversion and OpenGL GLSL color conversion.
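The core of the test is just an upload-and-draw loop. Sketched against the SDL2-style streaming texture API, it looks roughly like this; see the linked source below for the real thing:

/* Rough shape of the benchmark loop; renderer, yv12_pixels, w, h,
   and done are assumed to be set up elsewhere. */
SDL_Texture *texture = SDL_CreateTexture(renderer, SDL_PIXELFORMAT_YV12,
                                         SDL_TEXTUREACCESS_STREAMING, w, h);
Uint32 start = SDL_GetTicks();
int frames = 0;
while (!done) {
    /* Upload all three planes; for YV12 the pitch is the Y plane's
       bytes per row and the planes follow contiguously. */
    SDL_UpdateTexture(texture, NULL, yv12_pixels, w);
    SDL_RenderCopy(renderer, texture, NULL, NULL);
    SDL_RenderPresent(renderer);
    ++frames;
}
printf("%.2f frames per second\n",
       frames / ((SDL_GetTicks() - start) / 1000.0));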
The code is available here:
http://www.libsdl.org/blog/files/yuvspeedtest.c
SDL_RENDER_DRIVER=software ./yuvspeedtest ~/bluemarble2k_big.bmp
Using software rendering
26.83 frames per second
SDL_RENDER_DRIVER=opengl ./yuvspeedtest ~/bluemarble2k_big.bmp
Using opengl rendering
1040.53 frames per second
Using hardware shader acceleration got almost a 40x speedup!
Forty times?! That's right...FORTY! OMG!!!
Resources:
- YUV colorspace description: http://en.wikipedia.org/wiki/YCbCr
- YUV format descriptions: http://www.fourcc.org/yuv.php
- YUV colorspace conversion: http://www.fourcc.org/fccyvrgb.php
- MPEG video compression: http://www.bbc.co.uk/rd/pubs/papers/paper_14/paper_14.shtml
The YUV offset is unused (const offset).
Nope, this line uses it:
yuv += offset;
Oops, sorry for the noise, my browser's search feature has gone wrong.
Did you test this with your SMPEG library, or something else?
I tested it with the YUV speed test listed above. SMPEG would be a good production test, but there's some multi-threaded re-architecting that needs to happen in that library before it's ready to use with this code.
Great work Sam
More blog posts
does your name mean "sam lantinga or ken" in french? slouken.
anyway you are a juggernaut.
my theory that there are only 50 people in the universe is proven by the few replies here.
Thanks!
My name is Fresian actually, and means "of the land".
Nice post! I was thinking about doing something similar to speed up JPEG processing for large images.
I've noticed that you're only using the first 8 bits of each texture. Might this be a waste of CPU >> GPU bandwidth? How does performance improve further when you combine the UV planes as a .xy 16-bit texture access?
Going further, I wonder if there's a clever way to combine Y with the UV planes as well? Currently you're making good use of the GPU interpolating your chroma subsampled planes.
Perhaps it's as simple as first having the CPU place (decode) YUV into one 24-bit W*H texture, where the UV planes fill only half of the texture (W/2 * H/2). Then you might have two texture accesses in your fragment shader: i) sample Y at the current tcoord varying; ii) sample UV simultaneously at tcoord*0.5... that's 1/3 of the CPU >> GPU bandwidth required, one less texture access and two less texture units reserved in your fragment shader.
jpap
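(For illustration, the interleaved-chroma idea in the comment above amounts to a CPU-side repack like the following sketch, so the shader can fetch both chroma values with a single .xy texture access; the function and variable names are hypothetical.)

/* Repack separate U and V planes into one two-channel,
   NV12-style interleaved plane. */
static void interleave_uv(const unsigned char *u, const unsigned char *v,
                          unsigned char *uv, int chroma_w, int chroma_h)
{
    int i, n = chroma_w * chroma_h;
    for (i = 0; i < n; ++i) {
        uv[2 * i + 0] = u[i];   /* .x channel */
        uv[2 * i + 1] = v[i];   /* .y channel */
    }
}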
Hi Fresian,
I tried your shader, but I get garbled colors. I think maybe I chose a wrong tcoord value.
I can render an RGB565 image with a texture, and I used the same texCoords as with that one.
Can you provide some OpenGL ES 2 sample code to use your shader?
Thanks in advance.