Of Epic Proportions
In the process of doing some research on optimizing game engines I happened to come across this little gem of a presentation online and couldn't help but discuss it for my blog this week. This particular jewel I speak of comes in the form of a GDC Booth Presentation by Niklas Smedberg, Senior Engine Programmer at Epic Games. Although this presentation deals heavily with the graphical side of the implementation, Smedberg brings up several interesting points to make note of for general engine optimization.
A man and his machine
Niklas Smedberg is the Senior Engine Programmer at Epic Games and has worked on the Unreal Engine during his fifteen years in the industry. He has thirty years of programming experience and is one of the main developers responsible for Infinity Blade 2. He begins his talk by touching on the architecture of mobile GPUs, so let's start there.
Traditional, non-tiled GPUs are found in processors like the NVIDIA Tegra. Tile-based GPUs, like the Mali GPUs, are very different from desktop or console GPUs and are more commonly found on smartphones and tablets. These split the screen into tiles, each 16x16 pixels in size, with a whole tile fitting in the GPU registers. All drawcalls for one tile are processed and the final result is written out to RAM. This process is then repeated for each tile until the entire image is in RAM. The rendering process for ARM's Mali begins with Software, then moves on to the Command Buffer and Vertex Processing. Tiling is next, followed by Intermediate Data, Pixel Processing and finally the Frame Buffer.
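To get a feel for the tiling step, here is a minimal sketch in Python of how a screen divides into 16x16 tiles and which tiles a triangle's bounding box would land in. The 960x640 resolution and the function names are my own assumptions for illustration, not anything from the talk or from ARM's drivers.

```python
import math

TILE_SIZE = 16  # pixels per tile edge, as described above


def tile_grid(width, height, tile=TILE_SIZE):
    """Return (columns, rows, total) of tiles needed to cover the screen."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows, cols * rows


def tiles_touched(x0, y0, x1, y1, tile=TILE_SIZE):
    """Return the (column, row) tiles overlapped by an inclusive pixel bounding box."""
    return [
        (col, row)
        for row in range(y0 // tile, y1 // tile + 1)
        for col in range(x0 // tile, x1 // tile + 1)
    ]


if __name__ == "__main__":
    # Hypothetical 960x640 screen
    print(tile_grid(960, 640))
    # A small primitive spanning pixels (10, 10) to (40, 20)
    print(tiles_touched(10, 10, 40, 20))
```

Every tile in that list would need the primitive in its intermediate data, which is why large primitives touching many tiles cost more on this kind of hardware.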
The Mali GPU has a pipeline with arithmetic logic units (ALUs, which primarily take care of the mathematical needs) and a shader program which processes the graphics. There are 128 shader threads, and every clock cycle the system pushes one shader instruction into the pipeline before switching to the next thread. This is similar to hyperthreading on the CPU but on a much larger and more parallel scale. There are also 128 pipeline stages, which together return one result per clock cycle. This means that after 128 clock cycles the shader unit has wrapped around to the first shader thread and the cycle begins once more. The whole system works like cogs turning constantly in a very structured framework, which is what makes the GPU so efficient at processing.
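The round-robin behaviour described above can be modelled with a toy scheduler. This is only a sketch of the idea (one instruction issued per cycle, threads visited in order), with hypothetical function names; it says nothing about how the real hardware is built.

```python
NUM_THREADS = 128      # shader threads, as described above
PIPELINE_STAGES = 128  # pipeline depth, as described above


def issue_schedule(num_cycles, num_threads=NUM_THREADS):
    """Return the thread index that issues an instruction on each clock cycle."""
    return [cycle % num_threads for cycle in range(num_cycles)]


def cycles_between_issues(thread_id, schedule):
    """Return the gaps (in cycles) between successive issues of one thread."""
    slots = [cycle for cycle, thread in enumerate(schedule) if thread == thread_id]
    return [later - earlier for earlier, later in zip(slots, slots[1:])]


if __name__ == "__main__":
    schedule = issue_schedule(3 * NUM_THREADS)
    # Thread 0 issues again exactly when its previous instruction
    # has travelled through all the pipeline stages.
    print(cycles_between_issues(0, schedule), PIPELINE_STAGES)
```

Because the gap between a thread's instructions matches the pipeline depth, each thread's previous result is ready just as its next instruction is issued, so the pipeline never has to stall.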
The platform
On a power-limited handset (relative, of course, to a console or PC) memory access can be expensive, so keeping the render target on chip saves power by keeping everything contained within the GPU. It also means speed: there is no bandwidth cost for drawing or alpha-blending, depth and stencil testing are cheap (since it is all just read from the GPU), and MSAA (multisample anti-aliasing) is cheap too. In fact it is possible to see a cost of 0-5 ms for MSAA, which is terrific; however, you should be wary of buffer restores for colour or depth. At the end of the day, you simply need to resolve the final, smaller buffer to memory.
There have been developments in the Mali processor with the new T600, which features not only a unified shader unit but also OpenGL ES 3.0, among several other improvements over the Mali 400. With all this said, there are of course caveats with the architecture of tile-based GPUs, as with any piece of hardware. For one, if you ever leave the boundaries of a tile (say for the purpose of blurring across the screen) and switch render targets, it will cost you. This is because the GPU has to process all the tiles and then write back out to memory. Another issue is that there is no "free" hidden surface removal; you have to cull and sort drawcalls from front to back (this means big occluders first and the skybox last). The GPU also lacks texture compression for RGBA textures, forcing the use of either uncompressed RGBA or two compressed RGB textures. Also, even though the T600 supports OpenGL ES 3.0, there are no current mobile platforms (Android or iOS) that support it.
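As a rough illustration of the front-to-back rule, here is a small Python sketch that orders drawcalls by distance from the camera and forces the skybox to the end. The DrawCall class, its fields and the sample values are hypothetical; a real engine would also weigh occluder size, materials and state changes.

```python
from dataclasses import dataclass


@dataclass
class DrawCall:
    name: str
    view_depth: float       # distance from the camera (hypothetical units)
    is_skybox: bool = False


def sort_front_to_back(calls):
    """Order drawcalls nearest first, with any skybox always drawn last."""
    # False sorts before True, so non-skybox calls come first, then by depth.
    return sorted(calls, key=lambda call: (call.is_skybox, call.view_depth))


if __name__ == "__main__":
    frame = [
        DrawCall("skybox", 0.0, is_skybox=True),
        DrawCall("distant tower", 80.0),
        DrawCall("castle wall", 5.0),
        DrawCall("knight", 12.0),
    ]
    for call in sort_front_to_back(frame):
        print(call.name)
```

Drawing the nearest geometry first lets the depth test reject hidden pixels before any expensive pixel shading is spent on them.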
Now let's talk about managing the render buffer.
Rendering & Mobile Materials
It is important not to switch back and forth between render targets, as each target is a whole new scene. Switching slows the entire GPU down, so try not to issue a dependent drawcall right after you switch; that way the data can be sent out to RAM and used a little later. When you switch to another render target, make sure you clear everything, including the colour, depth and stencil buffers, to avoid a buffer restore (the clear essentially just sets some dirty bits in a register but helps a lot). It is also better to avoid a buffer resolve by using the discard extension (GL_EXT_discard_framebuffer). Finally, avoid unnecessary FBO combinations so that the driver doesn't think it needs to start resolving and restoring buffers.
The Unreal Engine materials are often very advanced. The programmers usually give full power to the artists, allowing the shader graphs and nodes to be as large as they want. In order to make this system more programmer-safe on the mobile platform, a fixed set of features was written for the artists to choose between, instead of having them write the shaders themselves. Earlier, there would be a tremendous number of different shaders for every object, which would take far too long to compile on a mobile device. So what the programmers did was pre-render the entire complicated material into a single texture and then re-texture the output colour. But they wanted to improve this system further, so they decided to pre-render components into separate textures, added mobile-specific settings and had feature support driven by the artists. On the programming side, they had one single hand-written ubershader for all materials driven by #ifdefs, which allows the artists to simply click a checkbox to enable that specific portion of the shader code.
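The checkbox-to-#ifdef idea can be sketched in a few lines of Python. This is only my guess at the general shape: the feature names (USE_SPECULAR, USE_RIM_LIGHT), the helper functions inside the shader body and build_shader_source itself are all made up for illustration and are not Epic's actual code.

```python
# Hypothetical ubershader body; each optional feature sits behind an #ifdef.
UBERSHADER_BODY = """\
void main() {
    vec4 colour = baseColour();
#ifdef USE_SPECULAR
    colour += specular();
#endif
#ifdef USE_RIM_LIGHT
    colour += rimLight();
#endif
    gl_FragColor = colour;
}
"""


def build_shader_source(enabled_features, body=UBERSHADER_BODY):
    """Prepend one #define per ticked checkbox to the shared ubershader body."""
    defines = "".join(f"#define {name}\n" for name in sorted(enabled_features))
    return defines + body


if __name__ == "__main__":
    # An artist has ticked only the specular checkbox for this material.
    print(build_shader_source({"USE_SPECULAR"}))
```

The appeal of this design is that there is only one shader to maintain and optimise by hand, while the artists still get a different compiled variant per material.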
God Rays
To give you an idea of the amount of optimization required to move a AAA title to the mobile platform, let's take a look at how Smedberg incorporated God Rays into Infinity Blade 2. To begin, he ported the code from Gears of War on the Xbox 360 to OpenGL ES 2.0. This worked, but of course it was very slow, which meant more mobile-specific optimization. That began with moving all the mathematical calculations from the pixel shader to the vertex shader. He then went on to pass the data down through interpolators, which he quickly ran out of, since most mobile platforms limit the number of interpolators to 8. So he split the radial pass into 4 draw calls, which meant 32 texture lookups in total (the equivalent of a 256-sample blur kernel). As a result, the mobile version runs faster and looks better than it did on the 360!
While the original shader code looked something like this:
[Image: original shader code]
The mobile shader code looked more like this:
[Image: mobile shader code]
Which turned out to be both prettier and faster!
Step by step, this is how the process works:
- On the first pass, the scene is downsampled and the pixels are identified; the RGB values hold the colour from the scene while the alpha value is the occlusion factor. This is resolved to a texture as an "unblurred source".
- The second pass takes 8 lookups from the unblurred source on an opaque draw call and then another 8 lookups on an additive draw call. This is then resolved to another texture as a "blurred source".
- The third, fourth and fifth passes are the same as the second, leading to the final result, which is also resolved to a texture.
- The sixth pass switches back to the normal scene, where they clear the final buffer, move to a new render target and create an opaque fullscreen quad which composites the God Ray blur and the background scene.
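One way to see how 32 lookups could match a 256-sample kernel is to chain blur passes so that each pass widens the result of the previous one. The Python sketch below is a simplified stand-in: it uses a 1D box blur rather than a radial one, and it assumes two chained passes of 16 lookups each (8 opaque plus 8 additive), which is my reading of the numbers rather than something the talk states outright.

```python
def box_pass(src, taps, stride):
    """Average `taps` samples spaced `stride` apart, starting at each index.

    Samples past the end of the list clamp to the last element.
    """
    last = len(src) - 1
    return [
        sum(src[min(i + k * stride, last)] for k in range(taps)) / taps
        for i in range(len(src))
    ]


def two_pass_blur(src, taps=16):
    """Blur with `taps` neighbouring samples, then `taps` samples spaced `taps` apart."""
    first = box_pass(src, taps, 1)
    return box_pass(first, taps, taps)


def direct_blur(src, width=256):
    """Reference: a single pass that reads `width` samples per output value."""
    return box_pass(src, width, 1)


if __name__ == "__main__":
    signal = [float((i * 37) % 101) for i in range(600)]
    cheap = two_pass_blur(signal)    # 16 + 16 = 32 lookups per output value
    costly = direct_blur(signal)     # 256 lookups per output value
    print(max(abs(a - b) for a, b in zip(cheap, costly)))
```

The second pass reads values that have already been averaged, so its 16 lookups each stand in for 16 source samples, which is where the 16 x 16 = 256 comes from.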
Shadows
Smedberg used several techniques for shadows in the game, one of the most common of which was once again ported from the Xbox 360. Using this fairly standard method, a shadow depth buffer is generated from the light's viewpoint and stored in a texture. You then stencil out the pixels in the scene that may be in shadow and compare the shadow depth with the scene depth. If a pixel is in shadow, darken it.
In essence, the character's depth is projected from the light view into a texture. Then it is reprojected into the current camera view so that a comparison can be made with the scene depth and the pixel modulated accordingly. The characters are then drawn on top of the shadows (which means no self-shadowing). It is possible to include the characters in the depth comparison, but high precision is required for the comparisons or else artifacts will show up easily.
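The depth comparison at the heart of this can be written as a tiny Python sketch. The bias and darkness values and the function names are assumptions of mine for illustration; the real thing runs per pixel in a shader with the depths fetched from textures.

```python
def shadow_factor(pixel_depth_from_light, shadow_map_depth, bias=0.005, darkness=0.5):
    """Return the colour multiplier for one pixel.

    pixel_depth_from_light: depth of the scene pixel as seen from the light.
    shadow_map_depth: depth stored in the shadow texture for that direction.
    bias: small hypothetical offset to avoid false shadows from rounding.
    """
    in_shadow = pixel_depth_from_light - bias > shadow_map_depth
    return darkness if in_shadow else 1.0


def modulate(colour, factor):
    """Darken an (r, g, b) colour by the shadow factor."""
    return tuple(channel * factor for channel in colour)


if __name__ == "__main__":
    # Something sits between the light and this pixel, so it gets darkened.
    print(modulate((1.0, 0.5, 0.0), shadow_factor(0.8, 0.5)))
    # Nothing is closer to the light than this pixel, so it stays lit.
    print(modulate((1.0, 0.5, 0.0), shadow_factor(0.5, 0.5)))
```

The bias term is where the precision concern shows up: if the two depths cannot be told apart reliably, lit surfaces start shadowing themselves.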
To optimize shadow rendering, a couple of tricks were used, including generating all shadow depth textures first in the frame in order to avoid render target switching. ARM's hardware can compare against the background depth for free, which could reduce the cost of shadows by at least eighty percent. Essentially, the background depth is just read from a register. This is also very useful for any depth-blend operation, which means particles can have soft interactions with the world geometry instead of clipping.
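To see why rendering the shadow depth textures up front helps, here is a small Python sketch that counts render target binds and, more importantly, how often a target is returned to after being left (which is when a buffer restore would be needed). The pass names and target names are hypothetical, and the counting is a simplification of what a driver really does.

```python
def count_switches_and_restores(passes):
    """Count render target binds and returns to an already-used target.

    passes: list of (pass_name, target_name) in submission order.
    """
    switches = 0
    restores = 0
    seen = set()
    current = None
    for _, target in passes:
        if target != current:
            switches += 1
            if target in seen:
                restores += 1  # coming back to a target we already drew into
            seen.add(target)
            current = target
    return switches, restores


if __name__ == "__main__":
    interleaved = [
        ("shadow 1", "shadow_rt_1"),
        ("scene part 1", "backbuffer"),
        ("shadow 2", "shadow_rt_2"),
        ("scene part 2", "backbuffer"),
    ]
    shadows_first = [
        ("shadow 1", "shadow_rt_1"),
        ("shadow 2", "shadow_rt_2"),
        ("scene part 1", "backbuffer"),
        ("scene part 2", "backbuffer"),
    ]
    print(count_switches_and_restores(interleaved))
    print(count_switches_and_restores(shadows_first))
```

With the shadows grouped at the start of the frame, the main scene is bound once and never has to be written out and read back in halfway through.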
In Conclusion
The Mali has a very streamlined hardware architecture and is also very data-driven, which makes it easier to program and use. The Mali 400 has 16-bit precision for any mathematical calculation in the pixel shader and has free access to the background depth value.
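To show what 16-bit precision means in practice, this Python sketch rounds values through a half-precision float using the standard library. I am assuming the IEEE 754 half-precision layout here purely for illustration; the exact format the Mali 400 uses in its pixel shader may differ.

```python
import struct


def to_half(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


if __name__ == "__main__":
    # Simple fractions survive, but most decimals pick up an error.
    print(to_half(0.5), to_half(0.1))
    # Two nearby depth values collapse into the same 16-bit number,
    # which is how depth comparisons end up producing artifacts.
    print(to_half(0.9990), to_half(0.9992))
```

This is why depth comparisons done in the pixel shader need care at this precision, and why reading the background depth straight from the hardware is so attractive.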
While a lot of this information is specific to graphics on the mobile platform, there are a lot of good lessons about optimization to be learnt from Niklas Smedberg who managed to bring a superior gaming experience to the mobile platform without surrendering performance for aesthetics.