August 2022
The graphics pipeline running on the GPU consists of multiple consecutive stages, some of which are programmable. The focus in this lecture is on those programmable stages and how to program them using GLSL shaders.
The basic pipeline consists of the vertex shader, rasterization and the pixel (fragment) shader; the geometry and tessellation stages discussed later slot in between the vertex shader and rasterization.
Additionally, compute shaders can perform arbitrary computation and can be used in various stages of the pipeline, or without using the graphics pipeline at all.
The vertex shader is executed once for every vertex in the scene. It can transform the position and attributes of these vertices, and can perform per-vertex calculations such as lighting (which can also be done per-pixel, as explained later). Another use case for moving vertices is animating objects in the scene.
Input vertex data is in the form of primitives, which can be points, lines, triangles and quads. Multiple triangles can be represented in different ways:
Triangle List: Each triangle consists of three points. Adjacent triangles mean some points are stored twice.
Triangle Strip: The stored points make up a triangle strip in a zig-zag pattern, from the bottom left towards the right. This way each additional triangle only requires one additional point; the previous two points implicitly belong to the new triangle.
Indexed Triangles: This performs deduplication of vertices by storing each point once and defining triangles by the indices of the points.
Input data is stored and uploaded in buffers, which can then be associated to (vertex-)shader inputs.
Uniform variables have the same value for each invocation of the shader (for every vertex/pixel). They are set by the CPU.
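A minimal vertex shader sketch illustrating buffer-fed inputs and a uniform (the attribute and uniform names are chosen for illustration):
#version 330 core
// Per-vertex attributes, sourced from the bound vertex buffers.
layout (location = 0) in vec3 inPosition;
layout (location = 1) in vec3 inColor;
// Uniform: identical for every invocation of this draw call, set from the CPU.
uniform mat4 modelViewProjection;
out vec3 vertexColor;
void main() {
    vertexColor = inColor;
    gl_Position = modelViewProjection * vec4(inPosition, 1.0);
}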
Coordinate transformations are an important task of the vertex shader. We express transformations using homogeneous coordinates to allow translations.
\[ \begin{bmatrix} p_x * s_x + t_x \\ p_y * s_y + t_y \\ p_z * s_z + t_z \\ 1 \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 & t_x \\ 0 & s_y & 0 & t_y \\ 0 & 0 & s_z & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix} \] (add rotations to taste.)
Care has to be taken with normals, as those can change if geometry is non-uniformly scaled. For a point \(p\) and normal \(n\) under the transformation \(M\): \[ p' = Mp \\ \] \[ n'=(M^{-1})^Tn \]
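As a sketch, the corresponding vertex-shader code could transform positions and normals like this (the matrix names are assumptions):
#version 330 core
layout (location = 0) in vec3 inPosition;
layout (location = 1) in vec3 inNormal;
uniform mat4 model;          // M: object-to-world transform
uniform mat4 viewProjection; // camera and projection transform
out vec3 worldNormal;
void main() {
    // p' = M p, using homogeneous coordinates with w = 1 for points
    vec4 worldPos = model * vec4(inPosition, 1.0);
    // n' = (M^-1)^T n, required if M contains non-uniform scaling
    worldNormal = mat3(transpose(inverse(model))) * inNormal;
    gl_Position = viewProjection * worldPos;
}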
A similar transformation matrix can be constructed for a change of basis, projecting a point onto new axes. Combined with a translation (changing the origin), this performs a change of coordinate system.
For correct perspective handling, the camera frustum is transformed into a cube centered at \((0,0,0)\) with side length 2.
TODO: World view transformation
Sprites are polygons that are always rotated towards the camera. This is a cheap to render alternative to a complete model. Implementing sprites can be done by storing the central point as well as displacement vectors for the points making up the polygon. In the shader, these displacements are applied such that the polygon is facing the camera. Multiple variants of the sprite can be stored which show the object from different sides.
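A sketch of such a sprite vertex shader, assuming the center position and a per-corner displacement are provided as attributes; the displacement is applied in view space so the quad always faces the camera:
#version 330 core
layout (location = 0) in vec3 spriteCenter;   // shared by all four corners of the quad
layout (location = 1) in vec2 cornerOffset;   // e.g. (+-0.5, +-0.5) scaled by the sprite size
uniform mat4 view;
uniform mat4 projection;
void main() {
    // Transform only the center into view space, then displace within the
    // view plane so the resulting quad is always oriented towards the camera.
    vec4 viewPos = view * vec4(spriteCenter, 1.0);
    viewPos.xy += cornerOffset;
    gl_Position = projection * viewPos;
}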
The vertex shader can be used to show animations, by providing two vertex positions for keyframes before and after the current time, and configuring the vertex shader to interpolate between those and correctly move the vertex. This can also be used to store multiple model configurations (such as facial expressions) as offsets from the base model and morphing continuously between them.
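A minimal sketch of such keyframe interpolation in the vertex shader, assuming the blend factor between the two keyframes is provided as a uniform:
#version 330 core
layout (location = 0) in vec3 positionKeyA; // vertex position at the previous keyframe
layout (location = 1) in vec3 positionKeyB; // vertex position at the next keyframe
uniform float blend;                        // 0..1, derived from the current time
uniform mat4 modelViewProjection;
void main() {
    // Linear interpolation between the two keyframe positions.
    vec3 pos = mix(positionKeyA, positionKeyB, blend);
    gl_Position = modelViewProjection * vec4(pos, 1.0);
}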
A common task is animating or moving some sort of skeleton with attached geometry. Each bone contains a transformation matrix relative to its parent. Those transformations can now be animated, and vertices which are associated to a specific bone can be moved accordingly.
Problems arise for vertices near multiple bones. Rigidly associating those with a single bone results in intersecting or stretched geometry. This can be solved by assigning a continuous weight instead of a rigid assignment for associating vertices with bones.
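A sketch of this weighted (linear blend) skinning with up to four bones per vertex; the bone indices and weights are assumed to be vertex attributes and the bone matrices a uniform array:
#version 330 core
layout (location = 0) in vec3 inPosition;
layout (location = 1) in ivec4 boneIndices; // up to four influencing bones per vertex
layout (location = 2) in vec4 boneWeights;  // corresponding weights, summing to 1
uniform mat4 boneMatrices[32];              // animated transform per bone (illustrative size)
uniform mat4 modelViewProjection;
void main() {
    // Blend the bone transforms using the per-vertex weights.
    mat4 skin = boneWeights.x * boneMatrices[boneIndices.x]
              + boneWeights.y * boneMatrices[boneIndices.y]
              + boneWeights.z * boneMatrices[boneIndices.z]
              + boneWeights.w * boneMatrices[boneIndices.w];
    gl_Position = modelViewProjection * (skin * vec4(inPosition, 1.0));
}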
The pixel shader (fragment shader) runs after rasterization, for every pixel. Its output is the desired color of the pixel, the inputs are the outputs of the previous stages.
Rasterization happens in screen space (the 2×2×2 cube from the projection step). The first step in rasterization is determining which triangles the pixel lies inside. The pixel's location inside the triangle is then computed in barycentric coordinates. The depth is also computed; it may be beneficial, however, to map the depth non-linearly in order to achieve more precision for objects close to the camera. The depth may be overridden later in the pixel shader.
The Z-buffer stores the depth value of each pixel. It is used to determine which objects are visible or hidden by other geometry. The GPU may perform an early depth test, saving pixel shader invocations for pixels which are already known not to be visible.
At each point on a surface, light gets refracted (entering the surface, changing its direction) and reflected. The roughness of a surface changes how much the direction of reflected light varies. Surface materials are roughly grouped into metals, which only reflect light, and non-metals, which involve refraction and subsurface scattering. To approximate scattering, we consider two types of reflected light: specular and diffuse light. The specular light is reflected almost entirely around the mirror direction of the incoming light, with a spread that depends on the surface roughness. The diffuse light exits in a more uniform distribution of directions.
The material properties are defined by the Bidirectional Reflectance Distribution Function (BRDF) \(f(l,v)\), which takes the input and output directions and models how much light is reflected in the specified output direction.
In the following, simple BRDFs are presented.
The Lambertian model simulates a perfectly diffuse surface: light is reflected equally in each direction. It models the amount of reflected light depending on the incident angle.
\[ I_d = (N \cdot L) * C_d * k_d \] With surface normal \(N\), light direction \(L\), light color \(C_d\) and object color \(k_d\).
Light attenuation is used to simulate decrease in light intensity depending on distance \(d\) to the source:
\[ L_p = \frac{L_i}{At_c + At_l * d + At_e * d^2} \] with constant component \(At_c\), linear component \(At_l\) and quadratic component \(At_e\).
The Phong reflectance model adds ambient and specular lighting. The ambient light is simulated to be constant: \[ I_a = C_a * k_a \]
The diffuse light is simulated using the Lambertian model: \[ I_d = (N \cdot L) * C_d * k_d \]
The specular light is introduced based on the dot product between the view vector \(V\) and the reflect vector (direction of reflected light) \(R\): \[ I_s = (V \cdot R) ^ \alpha * C_s * k_s \] The parameter \(\alpha\) controls the roughness of the object.
A problem occurs if the angle between \(R\) and \(V\) is greater than 90°, leading to a negative value for \(V \cdot R\). This is solved by the Blinn-Phong model.
Here, the specular term is calculated from the surface normal and the newly introduced \(H\) vector instead of the view and reflect vectors. The \(H\) vector (halfway vector) is the normalized sum of the view and light vectors. This avoids the negative dot products of the Phong model: \[ I_s = (N \cdot H) ^ \alpha * C_s * k_s \]
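Putting the ambient, diffuse and specular terms together, a Blinn-Phong fragment shader could look roughly like this sketch (all light and material parameters, including the attenuation coefficients, are assumed to be uniforms with illustrative names):
#version 330 core
in vec3 worldPos;
in vec3 worldNormal;
uniform vec3 cameraPos;
uniform vec3 lightPos;
uniform vec3 lightColor;        // C_d and C_s combined
uniform vec3 ambient;           // C_a * k_a combined
uniform vec3 materialDiffuse;   // k_d
uniform vec3 materialSpecular;  // k_s
uniform float shininess;        // alpha
uniform vec3 attenuation;       // (At_c, At_l, At_e)
out vec4 fragColor;
void main() {
    vec3 N = normalize(worldNormal);
    vec3 L = normalize(lightPos - worldPos);
    vec3 V = normalize(cameraPos - worldPos);
    vec3 H = normalize(L + V);  // halfway vector of the Blinn-Phong model
    float d = length(lightPos - worldPos);
    float att = 1.0 / (attenuation.x + attenuation.y * d + attenuation.z * d * d);
    vec3 diffuse  = max(dot(N, L), 0.0) * lightColor * materialDiffuse;
    vec3 specular = pow(max(dot(N, H), 0.0), shininess) * lightColor * materialSpecular;
    fragColor = vec4(ambient + att * (diffuse + specular), 1.0);
}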
The final lighting is dependent on where lighting is computed, and how normals are computed. Using normals per vertex, instead of per triangle, allows for interpolation and smoother lighting. Computing lighting per pixel (interpolating normals), instead of per vertex (interpolating lighting), produces better results.
Instead of using a constant material color when calculating lighting, we can query a texture. The texture is mapped to a coordinate space from \((0,0)\) to \((1,1)\) (\((s,t)\) texture coordinates). Each vertex now has an additional \((s,t)\) texture coordinate.
Aliasing artifacts can occur when textured objects appear on different scales in the scene. This is solved by storing the texture in multiple (correctly down-scaled) sizes, and using a smaller texture for far away objects.
Since a square screen pixel may not map nicely to a single texture pixel, OpenGL provides anisotropic filtering, which queries multiple texture pixels to improve this behavior.
Specular lighting should also be affected by the texture. We introduce a separate texture that provides the specular color. This can also be a one-channel texture just specifying how much light should be reflected.
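A sketch of the corresponding texture lookups in the fragment shader (sampler and variable names are illustrative):
#version 330 core
in vec2 texCoord;              // interpolated (s, t) coordinates
uniform sampler2D diffuseMap;  // base material color
uniform sampler2D specularMap; // per-texel specular intensity (single channel)
out vec4 fragColor;
void main() {
    vec3 k_d = texture(diffuseMap, texCoord).rgb;   // replaces the uniform diffuse color
    float k_s = texture(specularMap, texCoord).r;   // how strongly this texel reflects
    // These values would replace materialDiffuse and materialSpecular in the
    // lighting computation sketched above; here they are only written out directly.
    fragColor = vec4(k_d, k_s);
}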
Photo textures already contain illumination information for a specific condition. Instead, textures should be used which only contain the material properties. For this to look good however, we need to simulate the lighting (such as small shadows) which previously was present (but static) in the texture. This is done by storing geometric information in the texture, such as the normal map in the next section.
An additional texture is created containing the normal direction at each texel position. This allows light computation as if a more detailed surface structure was present, without actually modifying the geometry, by querying the normal map texture in the pixel shader.
Care has to be taken to transform the normal from texture space into the coordinate system in which the light is defined (world space). This requires defining a new coordinate frame for each surface on which a texture is applied. The vertex data can be appended by a tangent and binormal vector defining the directions of the texture \(s\) and \(t\) axes.
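A sketch of the per-pixel part of normal mapping, assuming the vertex shader passes the interpolated tangent, binormal and normal (the TBN frame) in world space:
#version 330 core
in vec3 tangent;    // direction of the texture s axis in world space
in vec3 binormal;   // direction of the texture t axis in world space
in vec3 normal;     // interpolated surface normal in world space
in vec2 texCoord;
uniform sampler2D normalMap;
out vec4 fragColor;
void main() {
    // The normal map stores directions in [0,1]^3; remap to [-1,1]^3.
    vec3 n_tex = texture(normalMap, texCoord).rgb * 2.0 - 1.0;
    // Transform the normal from texture (tangent) space into world space.
    mat3 TBN = mat3(normalize(tangent), normalize(binormal), normalize(normal));
    vec3 n_world = normalize(TBN * n_tex);
    // n_world would now be used in the lighting computation; visualize it here.
    fragColor = vec4(n_world * 0.5 + 0.5, 1.0);
}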
To render textures with more detail, including occlusions within the texture, more effort has to be made. One option is displacement mapping: the idea is to use a highly tessellated surface and a height map (displacement texture) to displace the vertices in the vertex shader. This works, but does not scale since it requires generating a large number of additional vertices.
An alternative technique, not requiring additional vertices, is Parallax Mapping. This technique assumes a locally constant surface height, and chooses a different texel to use in the pixel shader depending on the angle between the surface and view vector. If the height at the original texel \((u,v)\) is \(h\), the view vector intersects the point \(h\) above the new texel \((u',v')\).
This technique produces suboptimal results especially for shallow viewing angles. A different implementation is less accurate but produces fewer artifacts: Here, the texel is always displaced by the height at the original texel. This may not be a very accurate approximation, but is free of the artifacts at shallow viewing angles.
A more accurate approach computes the true intersection between the view ray and the surface defined in the texture. Compared to the previous techniques, this correctly renders a texture-based height map, but needs iteration to find the true texel coordinate.
Implementing the intersection search using binary search requires few iterations, but artifacts occur when the view ray intersects the surface multiple times, since the initial step size is large. Performance can also suffer because the loop is harder to unroll than with a linear search. Linear search additionally allows texture loads for consecutive accesses to be issued in parallel.
A combined version performs an initial linear search until the first intersection is found, and then refines the intersection in the last (small) interval using binary search with few steps.
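A rough sketch of this combined search in the fragment shader; the tangent-space view direction, the height-map layout (0 at the polygon surface, 1 at the deepest point) and all names are assumptions:
#version 330 core
in vec2 texCoord;
in vec3 viewDirTangent;        // view direction in tangent space (assumed input)
uniform sampler2D heightMap;   // depth below the polygon: 0 = surface, 1 = deepest
uniform sampler2D diffuseMap;
uniform float heightScale;
out vec4 fragColor;
void main() {
    vec3 v = normalize(viewDirTangent);
    // Texture-space offset the ray accumulates if it reaches the maximum depth.
    vec2 maxOffset = -v.xy / v.z * heightScale;
    const int linearSteps = 16;
    float stepSize = 1.0 / float(linearSteps);
    float rayDepth = 0.0;
    vec2 uv = texCoord;
    // Linear search: advance until the ray is below the stored height field.
    for (int i = 0; i < linearSteps; ++i) {
        if (rayDepth >= textureLod(heightMap, uv, 0.0).r)
            break;
        rayDepth += stepSize;
        uv = texCoord + rayDepth * maxOffset;
    }
    // Binary search: refine the intersection inside the last linear interval.
    float lo = max(rayDepth - stepSize, 0.0);
    float hi = rayDepth;
    for (int i = 0; i < 5; ++i) {
        float mid = 0.5 * (lo + hi);
        uv = texCoord + mid * maxOffset;
        if (mid >= textureLod(heightMap, uv, 0.0).r) hi = mid; else lo = mid;
    }
    fragColor = texture(diffuseMap, uv);
}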
A frame buffer is a piece of memory on the GPU into which the scene is rendered. We can create custom frame buffers and render for example into a texture that will be used later.
This presents a method to render shadows. A camera is placed at the location of the light. The distance from the scene to this camera is then calculated and stored in a texture. When rendering the image later, the distance from a point in the scene to the light can be compared to the value stored in the texture. If the stored distance is shorter than the calculated distance, that implies that some geometry is between the point and the light, and thus the point is in shadow.
To prevent aliasing, hide the limited shadow-map resolution and create a soft shadow effect, multiple samples can be taken.
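A sketch of the basic shadow test in the fragment shader, assuming the fragment position transformed into the light's clip space is passed in and the depth texture rendered from the light is bound as shadowMap:
#version 330 core
in vec4 lightSpacePos;         // fragment position in the light's clip space
uniform sampler2D shadowMap;   // depth rendered from the light's point of view
uniform vec3 litColor;
uniform vec3 shadowColor;
out vec4 fragColor;
void main() {
    // Perspective divide and remap from [-1,1] to the [0,1] texture/depth range.
    vec3 proj = lightSpacePos.xyz / lightSpacePos.w;
    proj = proj * 0.5 + 0.5;
    float storedDepth = texture(shadowMap, proj.xy).r;
    // A small bias avoids self-shadowing artifacts ("shadow acne").
    float bias = 0.005;
    bool inShadow = proj.z - bias > storedDepth;
    fragColor = vec4(inShadow ? shadowColor : litColor, 1.0);
}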
Frame buffers can be used to apply post processing effects: First render onto a texture, then apply the texture onto a quad and render it using the post-processing effects.
Using g-buffers, many aspects of the rendering pipeline such as lighting information and world-space position can be saved to multiple buffers. This can be the basis for more complex effects.
The deferred shading method for example computes the illumination from the values in the g-buffer, which also contains diffuse and specular color. This means that those calculations only need to be done for parts of the scene that are actually visible! It also makes it possible to compute the lighting with multiple lights. Deferred shading is fast for a large number of lights. The number of materials however is limited, since all materials are processed with the same shader. Another disadvantage is that the lighting calculations are performed for the entire scene even if the light only illuminates a small part.
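As a sketch, the geometry pass of deferred shading might write surface attributes into multiple render targets like this (the g-buffer layout and all names are assumptions):
#version 330 core
// Geometry pass: write surface attributes into the g-buffer instead of shading.
in vec3 worldPos;
in vec3 worldNormal;
in vec2 texCoord;
uniform sampler2D diffuseMap;
uniform sampler2D specularMap;
layout (location = 0) out vec4 gPosition;  // world-space position
layout (location = 1) out vec4 gNormal;    // world-space normal
layout (location = 2) out vec4 gAlbedo;    // diffuse color + specular intensity
void main() {
    gPosition = vec4(worldPos, 1.0);
    gNormal   = vec4(normalize(worldNormal), 0.0);
    gAlbedo   = vec4(texture(diffuseMap, texCoord).rgb,
                     texture(specularMap, texCoord).r);
}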
The ambient component in the lighting calculations simulates indirect illumination. An approximation for indirect illumination is ambient occlusion: A point that can not receive light from all directions will be darker.
Ambient occlusion can be calculated in screen space (SSAO) using the g-buffers, which saves computational resources: pixels around the current pixel are sampled to approximate the amount of occlusion.
An improvement can be made by only selecting sample pixels in the hemisphere defined by the surface normal.
An even more accurate result can be achieved by approximating the angle of the largest cone coming from the current position, instead of sampling.
SSAO does have some drawbacks, especially since no information is present for parts of the scene that are occluded: An object in the background may be wrongly occluded by foreground objects, even if they are far away. Removing the influence of the foreground object does not produce more information about occlusion by geometry between the object and background.
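A rough sketch of the screen-space sampling, assuming a g-buffer with view-space positions and normals and a uniform array of sample offsets (all names are illustrative):
#version 330 core
in vec2 texCoord;
uniform sampler2D gPosition;   // view-space positions from the g-buffer
uniform sampler2D gNormal;     // view-space normals from the g-buffer
uniform vec3 samples[16];      // random offsets inside the unit sphere
uniform mat4 projection;
uniform float radius;
out float occlusion;
void main() {
    vec3 pos = texture(gPosition, texCoord).xyz;
    vec3 n = normalize(texture(gNormal, texCoord).xyz);
    float occluded = 0.0;
    for (int i = 0; i < 16; ++i) {
        // Flip samples into the hemisphere around the surface normal.
        vec3 s = samples[i];
        if (dot(s, n) < 0.0) s = -s;
        vec3 samplePos = pos + radius * s;
        // Project the sample to find the corresponding g-buffer texel.
        vec4 clip = projection * vec4(samplePos, 1.0);
        vec2 uv = clip.xy / clip.w * 0.5 + 0.5;
        float sceneDepth = texture(gPosition, uv).z;
        // If the stored geometry is in front of the sample, it occludes the point.
        if (sceneDepth > samplePos.z + 0.025) occluded += 1.0;
    }
    occlusion = 1.0 - occluded / 16.0;
}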
TODO: SDFs
The geometry shader is located between the vertex- and pixel shader in our pipeline. It is executed once for every geometric primitive.
Primitives for use in the geometry shader are points, lines and triangles, as well as lines and triangles with adjacency information. The input is defined in the shader as
layout (triangles) in;
The output of the geometry stage can be points, line strips or triangle strips, and is defined in the shader as
layout (triangle_strip, max_vertices = 3) out;
The max_vertices parameter is required to specify the maximum number of vertices that the shader may output.
The geometry shader receives inputs just as any other shader, but since it processes a primitive instead of a single vertex, the inputs will be arrays containing each input for each of the vertices making up the primitive.
The usual gl_Position output from the vertex shader is available at gl_in[i].gl_Position. Additional inputs are available to receive an ID of the current primitive and shader invocation. Outputs do not become arrays; we instead call EmitVertex() to emit the output variables for each vertex, one after another. Once all vertices of a primitive have been output, we call EndPrimitive().
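A minimal pass-through geometry shader illustrating these built-ins; it assumes the vertex shader writes a normal output and simply forwards each input triangle unchanged:
#version 330 core
layout (triangles) in;
layout (triangle_strip, max_vertices = 3) out;
in vec3 normal[];      // per-vertex inputs from the vertex shader arrive as arrays
out vec3 fragNormal;
void main() {
    for (int i = 0; i < 3; ++i) {
        gl_Position = gl_in[i].gl_Position;
        fragNormal = normal[i];
        EmitVertex();  // emit the current set of output variables as one vertex
    }
    EndPrimitive();    // finish the output triangle strip
}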
The geometry shader can also generate new geometry. This is explained here with an example in which a steep pyramid shall be generated in the middle of each triangle, visualizing its normal.
This requires generating four output triangles from one input triangle primitive (the original triangle as well as the three sides of the pyramid). To accomplish this, the max_vertices parameter is increased to 12. The triangles are output individually (not as a triangle strip), and EndPrimitive() is thus called four times in the shader.
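A sketch of such a shader, assuming the vertex shader passes untransformed positions through and omitting any lighting-related outputs; the apex height of 0.5 is an arbitrary illustrative value:
#version 330 core
layout (triangles) in;
layout (triangle_strip, max_vertices = 12) out;
uniform mat4 viewProjection;
void emitTriangle(vec3 a, vec3 b, vec3 c) {
    gl_Position = viewProjection * vec4(a, 1.0); EmitVertex();
    gl_Position = viewProjection * vec4(b, 1.0); EmitVertex();
    gl_Position = viewProjection * vec4(c, 1.0); EmitVertex();
    EndPrimitive();
}
void main() {
    // Assumes the vertex shader forwarded untransformed world-space positions.
    vec3 p0 = gl_in[0].gl_Position.xyz;
    vec3 p1 = gl_in[1].gl_Position.xyz;
    vec3 p2 = gl_in[2].gl_Position.xyz;
    vec3 center = (p0 + p1 + p2) / 3.0;
    vec3 n = normalize(cross(p1 - p0, p2 - p0));
    vec3 apex = center + 0.5 * n;   // pyramid tip along the triangle normal
    emitTriangle(p0, p1, p2);       // original triangle
    emitTriangle(p0, p1, apex);     // three sides of the pyramid
    emitTriangle(p1, p2, apex);
    emitTriangle(p2, p0, apex);
}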
Another example given is the billboard example for molecule visualization, with the difference being that the vertices for the displaced billboard corners are not generated on the CPU, but created in the geometry shader.
An example for the adjacency feature of the input is given by the application of creating borders around a model: The input data now consists of six vertices per triangle, containing not only the current triangle but also the remaining vertices of the three triangles that share an edge with the current triangle.
A way of generating shadows is to produce geometry representing the projection of our objects based on the light position (the shadow volume). When rendering, we can check if a point is inside a shadow volume or not.
Using triangle adjacency, the contour edges of an object with respect to the light (instead of the camera, like in the “borders” example) can be found. These edges are then used to create a side of the shadow volume.
The process of rendering shadows using shadow volumes is typically as follows: render the scene, then render the shadow volumes into a stencil buffer, incrementing the stencil for front faces and decrementing it for back faces; points with a non-zero stencil value lie inside a shadow volume and are therefore in shadow.
Shadow volumes produce higher resolution (sharper) shadows than shadow-mapping, at the cost of more expensive computation.
Another feature of the geometry shader is rendering the same object to multiple images in one draw call.
The geometry shader can calculate the projection of each object into multiple camera configurations. The pixel shader is then executed for each of those images. An example where this is needed is a cube map, where the scene is rendered into the six faces of a cube.
In the geometry shader, before EmitVertex() we can update gl_Layer to select which layer of the current texture we want to render to. It is also not required to loop over layers in the shader; instead, instancing can be used to execute the shader more than once for every primitive. The number of invocations is specified in the shader:
layout (invocations = 16, points) in;
Tessellation produces a more detailed model from a model with fewer triangles. The tessellation shaders allow us to subdivide triangles based on the distance to the camera or other parameters.
In order to perform this tessellation, not one but two additional stages are introduced between the vertex- and geometry shader: The Tessellation Control Shader and the Tessellation Evaluation Shader.
The Tessellation Control Shader (TCS) is executed once for each vertex of the input primitives. Its goal is to determine the tessellation level, i.e. how much the triangle should be subdivided.
The Tessellation Evaluation Shader (TES) is executed once for each new vertex that was created by the subdivision. Its goal is to move the new vertex to the correct position.
Actually generating the new primitives and vertices is done by the GPU hardware, we can only configure that step using the control shader.
Multiple subdivision patterns are available: triangles, quads and isolines.
The goal of the TCS is to generate control points (inputs into subdivision: three for triangles, four for quads) and to define the tessellation levels.
This shader is executed once for each output control point, but all executions for the same patch are connected: They share the same output and can use synchronization.
For triangles, there is an inner tessellation level and three outer tessellation levels. The inner tessellation level is applied first and controls how many inner triangles are created (by dividing all outer lines into \(n\) segments, with inner tessellation level \(n\)). Then, the outer tessellation levels specify how many segments each of the original outer lines should actually end up with.
The outer tessellation level is needed to correctly process adjacent triangles, which may have different inner tessellation levels.
Usually, tessellation levels are dynamically calculated based on how far the object is from the camera!
Tessellation of quads can be similarly controlled, although now there are four outer tessellation levels and two inner tessellation levels (one for each axis).
Isolines require just two outer tessellation levels.
Some inputs are again provided in gl_in[], providing for example the vertex positions. gl_InvocationID provides the ID of the vertex inside the current primitive (indexing gl_in), and gl_PrimitiveID identifies the primitive. gl_PatchVerticesIn contains the number of vertices in the current patch.
Custom shader inputs can be used as usual, in array form like in the geometry shader.
gl_out[i].gl_Position contains the vertex position as usual; the tessellation levels described above have to be written to gl_TessLevelOuter[4] and gl_TessLevelInner[2].
Custom TCS outputs (the inputs to the TES) are per control point by default, but can be specified to be per patch using
patch out vec3 customVar;
TCS instantiations can read outputs from other instantiations belonging to the same patch. This requires synchronization, which is provided by the barrier() function, ensuring that all executions have written their outputs.
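A minimal TCS sketch for triangles with constant tessellation levels (in practice the levels would be computed, for example from the distance to the camera):
#version 400 core
layout (vertices = 3) out;   // three control points per output patch
void main() {
    // Pass the control point of this invocation through unchanged.
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;
    // The tessellation levels only need to be written once per patch.
    if (gl_InvocationID == 0) {
        gl_TessLevelInner[0] = 4.0;
        gl_TessLevelOuter[0] = 4.0;
        gl_TessLevelOuter[1] = 4.0;
        gl_TessLevelOuter[2] = 4.0;
    }
}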
The TES is executed for each new vertex. It receives the barycentric coordinates of the vertex in the original triangle as input and interpolates the new position for the vertex: gl_in[i].gl_Position contains the positions of the patch control points, gl_TessCoord contains the barycentric coordinates of the current vertex. The only required output is the position gl_Position of the vertex.
The order and spacing mode of the tessellation are specified:
layout (triangles, equal_spacing, ccw) in;
Multiple spacing modes such as equal_spacing, fractional_even_spacing and fractional_odd_spacing are available. The barycentric coordinates provide the vertex position as a weighted sum of the control point positions.
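A matching minimal TES for triangles, interpolating the position of each generated vertex from the three control points using the barycentric coordinates:
#version 400 core
layout (triangles, equal_spacing, ccw) in;
uniform mat4 modelViewProjection;
void main() {
    // gl_TessCoord holds the barycentric coordinates of the generated vertex.
    vec4 p = gl_TessCoord.x * gl_in[0].gl_Position
           + gl_TessCoord.y * gl_in[1].gl_Position
           + gl_TessCoord.z * gl_in[2].gl_Position;
    gl_Position = modelViewProjection * p;
}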
For quads and isolines, the coordinates are given as \(uv\) coordinates inside the original quad.
Using an isolines tessellation with the first tessellation level set to 1, a bezier curve can be rendered. The second tessellation level specifies the number of segments in the curve.
The TES calculates the position of each vertex by evaluating the bezier curve along the second axis of the quad (the one with varying tessellation level).
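A sketch of such a TES, assuming the patch consists of four control points defining a cubic bezier curve:
#version 400 core
layout (isolines, equal_spacing) in;
void main() {
    // gl_TessCoord.x runs from 0 to 1 along the tessellated line.
    float t = gl_TessCoord.x;
    float u = 1.0 - t;
    // Cubic bezier basis (Bernstein polynomials) over the four control points.
    gl_Position = u * u * u * gl_in[0].gl_Position
                + 3.0 * u * u * t * gl_in[1].gl_Position
                + 3.0 * u * t * t * gl_in[2].gl_Position
                + t * t * t * gl_in[3].gl_Position;
}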
This can be extended to bezier surfaces using quad tessellation.
GPUs have progressed from the first fixed-function architectures, to programmable shader stages and now unified shader execution units. Today’s GPUs can execute many shader instantiations and programs in parallel, independently of the shader stage. This is also what opened the door to a wide variety of general purpose GPU computing (GPGPU).
In order to leverage the data-parallel nature of the graphics processing pipeline, large scale SIMT architectures are used. Processors consist of multiple cores with a shared fetch/decode stage and execution context. Multiple SIMT processors are combined into larger complexes, which may be repeated multiple times. The GTX 1080 for example contains 20 “Streaming Multiprocessors”, each of which contains four 32-wide SIMT processors, resulting in a total of 2560 cores in the GPU.
Threads are executed in warps of 32 threads each. This matches the size of one SIMT processor.
Since all threads on the same SIMT processor execute the same instructions, special attention has to be paid to conditional branching.
One method is predication: both sides of the branch are executed by every thread, and each thread then selects one of the results based on the predicate. This is only viable if the branch is small, otherwise the performance cost of executing both branches is too high.
An alternative approach allows the threads to vote on which branch to execute. If all threads vote for the same branch, the other branch is not executed. Thus, care has to be taken to avoid big branches that diverge within the same warp.
Accessing memory introduces latency, for example on the order of 100 cycles. The bandwidth between the processor and memory is also limited. Stalling the processor while waiting for memory worsens utilization.
GPUs implement a memory hierarchy similar to CPUs:
The memory closest to the cores are registers, shared memory and L1 cache. For example, the GTX 1080 features
The cache works without explicit user control and exploits spatial and temporal locality of data accesses. A typical cache line size is 32 bytes, and transfers between memory and cache happen only for entire cache lines. Efficient usage of the cache and proper alignment of data to cache lines are imperative for optimal performance; the benefit usually originates from reduced memory bandwidth requirements, which is often the bottleneck for processing speed.
The GPU hides memory latency by oversubscribing the SM with warps, and executing other warps while some wait for memory. On the GTX 1080, each SM can be assigned 64 active warps, while it’s only able to execute four warps in parallel.
Switching between warps is fast, and is not comparable to a context switch on the CPU. This does however mean that the number of warps per SM (GPU occupancy) is also limited by the total register usage.
Shared memory is fast memory close to the cores, with the restriction that it is local to one streaming multiprocessor. The usual use case is to explicitly move a hot region of memory to shared memory before computation.
TODO: Compute Shader
TODO: CUDA