Description

This sample demonstrates the NV_command_list extension by using it to render a basic scene. Texturing is performed via ARB_bindless_texture.

Screenshot

APIs Used

  • OpenGL 4.4
  • GL_NV_command_list
  • GL_ARB_bindless_texture
  • GL_NV_uniform_buffer_unified_memory
  • GL_ARB_shading_language_include
  • GL_NV_shadow_samplers_cube

Shared User Interface

The Graphics samples all share a common app framework and certain user interface elements, centered around the "Tweakbar" panel on the left side of the screen which lets you interactively control certain variables in each sample.

To show and hide the Tweakbar, simply click or touch the triangular button positioned in the top-left of the view.

Technical Details

The NV_command_list extension is built around bindless GPU pointers/handles, which allow rendering scenes with hundreds of thousands of draw calls while using very little CPU time:

  • Tokenized Rendering
    • Commands are encoded into binary data (tokens) instead of being issued as classic GL calls. This allows the GPU driver to efficiently iterate over a stream of many commands in a single sequence or in multiple sequences: glDrawCommandsStatesNV(tokenBuffer, offsets[], sizes[], states[], fbos[], count)
    • The tokens are stored in regular OpenGL buffers and can be reused across frames or manipulated by the GPU itself
    • In addition to draw calls, the tokens cover the most frequent state changes (VBO/IBO/UBO) and a few basic scalar changes (blend color, polygon offset, stencil ref, etc.)
    • As tokens only reference data (for example, a UBO), the content of that data is free to change. You can change vertex positions or matrices freely

The tokens are tightly packed structs, and the most common tokens are 16 bytes each. Below is the token definition for updating a UBO binding.

typedef struct
{
    GLuint   header;  // glGetCommandHeaderNV(GL_UNIFORM_ADDRESS_COMMAND_NV, sizeof(UniformAddressCommandNV))
    GLushort index;   // in GLSL: layout(binding=INDEX, commandBindableNV) uniform ...
    GLushort stage;   // glGetStageIndexNV(GL_VERTEX_SHADER)
    GLuint64 address; // glGetNamedBufferParameterui64vNV(buffer, GL_BUFFER_GPU_ADDRESS_NV, &address)
} UniformAddressCommandNV;
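As a rough illustration of how such tokens might be laid out in memory, the following C++ sketch packs UniformAddressCommandNV tokens into a CPU-side byte stream that could later be uploaded to a regular buffer. It uses stand-in typedefs instead of the GL headers. The header value kFakeUniformHeader, the helpers appendUniformAddress and buildPerObjectStream, and the binding index 1 are assumptions made for this example and are not taken from the sample. In a real application the header, stage and address would come from the driver queries named in the struct comments, and draw tokens would be interleaved with the bindings.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Stand-ins for the GL typedefs so the sketch compiles without GL headers.
using GLuint   = std::uint32_t;
using GLushort = std::uint16_t;
using GLuint64 = std::uint64_t;

// Same field layout as the token definition in the text.
struct UniformAddressCommandNV {
    GLuint   header;   // real code: value queried from the driver
    GLushort index;    // UBO binding index used by the shader
    GLushort stage;    // real code: stage index queried from the driver
    GLuint64 address;  // GPU address of the UBO range to bind
};
static_assert(sizeof(UniformAddressCommandNV) == 16, "token should stay 16 bytes");

// Hypothetical placeholder: a real header value must be obtained from the driver.
constexpr GLuint kFakeUniformHeader = 0x1u;

// Append one token to the CPU-side stream and return its byte offset.
std::size_t appendUniformAddress(std::vector<std::uint8_t>& stream,
                                 GLushort index, GLushort stage, GLuint64 address) {
    const UniformAddressCommandNV token{kFakeUniformHeader, index, stage, address};
    const std::size_t offset = stream.size();
    stream.resize(offset + sizeof(token));
    std::memcpy(stream.data() + offset, &token, sizeof(token));
    return offset;
}

// Emit one UBO rebind per object, each object owning `stride` bytes of a
// single large per-object buffer that starts at GPU address `uboBase`.
std::vector<std::uint8_t> buildPerObjectStream(GLuint64 uboBase, GLuint64 stride,
                                               std::size_t objectCount) {
    assert(stride > 0);  // every object needs its own range
    std::vector<std::uint8_t> stream;
    stream.reserve(objectCount * sizeof(UniformAddressCommandNV));
    for (std::size_t i = 0; i < objectCount; ++i) {
        appendUniformAddress(stream, /*index=*/1, /*stage=*/0, uboBase + i * stride);
    }
    return stream;
}
```

Because a token stores only an address, the data behind that address can be updated every frame without touching the stream itself.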
  • State Objects
    • Costly validation in the driver can often happen as late as draw-call time or at other unexpected times, potentially causing unstable frame rates. Monolithic state objects (common in other new graphics APIs) allow us to pre-validate the core rendering state (FBO, program, blending states, etc.) and reuse it
    • Full control over when validation happens via glStateCaptureNV(stateObject, primitiveBaseMode), which captures the current GL state setup
    • Very efficient state switching between different State Objects
  • Pre-compiled Command List Object
    • State Objects and client-side tokens can be pre-compiled into a special object
    • Allows further driver optimization (faster State Object transitions) at the cost of flexibility (changing State Objects requires rebuilding the command list object)

Sample Highlights

Depending on the availability of the extension, the sample allows switching between standard OpenGL, token-buffer and command-list-object modes to render the scene. Inside basic-nvcommandlist.cpp you will find the functions:

  • Sample::drawStandard()
    • The standard OpenGL approach renders the scene by calling the standard glDrawElements function for each object in the scene
  • Sample::drawTokenBuffer()
    • The token buffer approach renders the scene from a list of tokens (binary data) via the glDrawCommandsStatesNV function
  • Sample::drawTokenList()
    • The token list approach renders the scene from a pre-compiled command list via glCallCommandListNV
  • Sample::drawTokenEmulation()
    • The emulation layer gives a rough idea of how glDrawCommands* and glStateCaptureNV work internally. Emulation may also be useful as a permanent compatibility layer for driver/hardware combinations that do not run the extension natively
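To give a feel for what an emulation layer does conceptually, here is a minimal C++ sketch that walks a token stream and dispatches on each token's header. It is not the sample's emulation code. The token ids in TokenKind, the EmulatedState bookkeeping, and the push and replay helpers are hypothetical names chosen for this example. Where the sketch only updates counters, a real layer would issue the matching GL call.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical token ids; a native driver uses its own opaque header values.
enum TokenKind : std::uint32_t { kUniformAddress = 1, kDrawElements = 2 };

struct UniformAddressToken {
    std::uint32_t header;
    std::uint16_t index, stage;
    std::uint64_t address;
};
struct DrawElementsToken {
    std::uint32_t header, count, firstIndex, baseVertex;
};

// What the emulated "GL state" looks like after replaying a stream.
struct EmulatedState {
    std::uint64_t boundUbo = 0;      // last UBO address that was bound
    std::uint32_t drawCalls = 0;     // number of draw tokens executed
    std::uint64_t indicesDrawn = 0;  // sum of all index counts
};

// Append any token type to the byte stream.
template <typename T>
void push(std::vector<std::uint8_t>& stream, const T& token) {
    const std::size_t offset = stream.size();
    stream.resize(offset + sizeof(T));
    std::memcpy(stream.data() + offset, &token, sizeof(T));
}

// Walk the stream front to back, reading each header to decide how many
// bytes the token occupies and which action it stands for.
EmulatedState replay(const std::vector<std::uint8_t>& stream) {
    EmulatedState state;
    std::size_t cursor = 0;
    while (cursor + sizeof(std::uint32_t) <= stream.size()) {
        std::uint32_t header = 0;
        std::memcpy(&header, stream.data() + cursor, sizeof(header));
        switch (header) {
        case kUniformAddress: {
            UniformAddressToken t;
            assert(cursor + sizeof(t) <= stream.size());
            std::memcpy(&t, stream.data() + cursor, sizeof(t));
            state.boundUbo = t.address;  // a real layer would rebind the UBO range here
            cursor += sizeof(t);
            break;
        }
        case kDrawElements: {
            DrawElementsToken t;
            assert(cursor + sizeof(t) <= stream.size());
            std::memcpy(&t, stream.data() + cursor, sizeof(t));
            ++state.drawCalls;           // a real layer would issue an indexed draw here
            state.indicesDrawn += t.count;
            cursor += sizeof(t);
            break;
        }
        default:
            assert(false && "unknown token header");
            return state;
        }
    }
    return state;
}
```

The loop is the per-token CPU work that the native extension moves into the driver or onto the GPU.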

Performance

The sample renders 1024 objects. Each object has a sphere or box VBO/IBO pair and references a range within a big UBO that stores per-object data such as matrix, color and texture. Half of the objects in the scene use the geometry shader to transform primitives. Here are some preliminary example results for Timer Draw on a Win7-64, i7-860, Quadro K5000 system:

Draw mode             GPU time (microseconds)   CPU time (microseconds)
standard              850                       1750
nvcmdlist emulated    830                       1500
nvcmdlist buffer      775                       30
nvcmdlist list        775                       <1

These results show that with classic API usage the scene is CPU-bound: more time is spent on the CPU than on the graphics card.

The performance gained in the emulation approach comes from using bindless VBOs and UBOs.

The token-buffer technique is slightly slower on the CPU than the pre-compiled list because the 500 State Objects (each half of the scene's objects) still need to be checked every frame. Both nvcmdlist techniques essentially require only a single dispatch.
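To make the per-sequence cost of the token-buffer path more concrete, the following C++ sketch shows one way the offsets[], sizes[] and states[] arrays for glDrawCommandsStatesNV could be assembled on the CPU. It is an illustration under assumptions, not the sample's code: ObjectTokens, Sequences and buildSequences are hypothetical names, consecutive objects that share a State Object are assumed to be merged into one sequence, and the fbos[] array is left out for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-object record: which State Object the object uses and
// how many bytes of tokens it occupies in the token buffer.
struct ObjectTokens {
    std::uint32_t stateObject;
    std::uint32_t byteSize;
};

// Parallel arrays, one entry per sequence, in the order they would be passed.
struct Sequences {
    std::vector<std::intptr_t> offsets;  // byte offset of each sequence
    std::vector<std::int32_t>  sizes;    // byte size of each sequence
    std::vector<std::uint32_t> states;   // State Object used by each sequence
};

// Merge consecutive objects that share a State Object into one sequence.
Sequences buildSequences(const std::vector<ObjectTokens>& objects) {
    Sequences seq;
    std::intptr_t cursor = 0;
    for (const ObjectTokens& obj : objects) {
        assert(obj.byteSize > 0);
        if (!seq.states.empty() && seq.states.back() == obj.stateObject) {
            // Same state as the previous object: extend the current sequence.
            seq.sizes.back() += static_cast<std::int32_t>(obj.byteSize);
        } else {
            // State change: start a new sequence at the current offset.
            seq.offsets.push_back(cursor);
            seq.sizes.push_back(static_cast<std::int32_t>(obj.byteSize));
            seq.states.push_back(obj.stateObject);
        }
        cursor += static_cast<std::intptr_t>(obj.byteSize);
    }
    return seq;
}
```

In this sketch the number of array entries equals the number of State Object switches, which is the part of the work that still has to be handled each frame in the token-buffer path.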

The closest alternative for reaching this level of CPU efficiency would be MultiDrawIndirect combined with vertex divisor indexing, but that makes shaders more complex by adding an indirection parameter.
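For comparison, here is a small C++ sketch of the CPU side of that alternative. It fills an array of indirect draw commands in the layout OpenGL defines for indexed indirect draws, and stores the object index in baseInstance. A per-instance vertex attribute (divisor 1) can then deliver that index to the shader, which has to use it to look up its per-object data: this lookup is the extra indirection mentioned above. MeshRange and buildIndirectCommands are hypothetical names for this example, and the GL calls that upload and submit the commands are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Field layout of one indexed indirect draw command as defined by OpenGL.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // number of indices
    std::uint32_t instanceCount;  // 1: each object is drawn once
    std::uint32_t firstIndex;     // start of the mesh in the index buffer
    std::int32_t  baseVertex;     // start of the mesh in the vertex buffer
    std::uint32_t baseInstance;   // used here to carry the object index
};
static_assert(sizeof(DrawElementsIndirectCommand) == 20, "five tightly packed 32-bit fields");

// Hypothetical description of where an object's mesh lives in shared buffers.
struct MeshRange {
    std::uint32_t indexCount;
    std::uint32_t firstIndex;
    std::int32_t  baseVertex;
};

// One command per object; baseInstance identifies the object for the shader.
std::vector<DrawElementsIndirectCommand>
buildIndirectCommands(const std::vector<MeshRange>& meshes) {
    assert(meshes.size() <= 0xFFFFFFFFu);  // object index must fit in 32 bits
    std::vector<DrawElementsIndirectCommand> commands;
    commands.reserve(meshes.size());
    for (std::size_t i = 0; i < meshes.size(); ++i) {
        const MeshRange& m = meshes[i];
        commands.push_back({m.indexCount, 1u, m.firstIndex, m.baseVertex,
                            static_cast<std::uint32_t>(i)});
    }
    return commands;
}
```

Unlike the token approach, nothing in this command can rebind a UBO range per draw, so the shader must resolve per-object data from the index itself.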