August 2022
The graphics pipeline running on the GPU consists of multiple consecutive stages, some of which are programmable. The focus in this lecture is on those programmable stages and how to program them using GLSL shaders.
The basic pipeline consists of the vertex shader, rasterization and the pixel (fragment) shader; the geometry and tessellation stages discussed later slot in between the vertex shader and rasterization.
Additionally, compute shaders can perform arbitrary computation and can be used in various stages of the pipeline, or without using the graphics pipeline at all.
The vertex shader is executed once for every vertex in the scene. It can transform the position and attributes of these vertices, and can perform per-vertex calculations such as lighting (which can also be done per-pixel, as explained later). Another use case for moving vertices is animating objects in the scene.
Input vertex data is in the form of primitives, which can be points, lines, triangles and quads. Multiple triangles can be represented in different ways:
Triangle List: Each triangle consists of three points. Adjacent triangles mean some points are stored twice.
Triangle Strip: The stored points make up a triangle strip in a zig-zag pattern, from the bottom left towards the right. This way each additional triangle only requires one additional point; the previous two points implicitly belong to the new triangle.
Indexed Triangles: This performs deduplication of vertices by storing each point once and defining triangles by the indices of the points.
Input data is stored and uploaded in buffers, which can then be associated to (vertex-)shader inputs.
Uniform variables have the same value for each invocation of the shader (for every vertex/pixel). They are set by the CPU.
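A minimal vertex shader sketch illustrating buffer-fed inputs and a uniform (the attribute and uniform names are chosen for illustration):
#version 330 core
// Per-vertex attributes, sourced from the bound vertex buffers.
layout (location = 0) in vec3 inPosition;
layout (location = 1) in vec3 inColor;
// Uniform: identical for every invocation of this draw call, set from the CPU.
uniform mat4 modelViewProjection;
out vec3 vertexColor;
void main() {
    vertexColor = inColor;
    gl_Position = modelViewProjection * vec4(inPosition, 1.0);
}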
Coordinate transformations are an important task of the vertex shader. We express transformations using homogeneous coordinates to allow translations.
\[ \begin{bmatrix} p_x * s_x + t_x \\ p_y * s_y + t_y \\ p_z * s_z + t_z \\ 1 \end{bmatrix} = \begin{bmatrix} s_x & 0 & 0 & t_x \\ 0 & s_y & 0 & t_y \\ 0 & 0 & s_z & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} p_x \\ p_y \\ p_z \\ 1 \end{bmatrix} \] (add rotations to taste.)
Care has to be taken with normals, as those can change if geometry is non-uniformly scaled. For a point \(p\) and normal \(n\) under the transformation \(M\): \[ p' = Mp \\ \] \[ n'=(M^{-1})^Tn \]
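As a sketch, the corresponding vertex-shader code could transform positions and normals like this (the matrix names are assumptions):
#version 330 core
layout (location = 0) in vec3 inPosition;
layout (location = 1) in vec3 inNormal;
uniform mat4 model;          // M: object-to-world transform
uniform mat4 viewProjection; // camera and projection transform
out vec3 worldNormal;
void main() {
    // p' = M p, using homogeneous coordinates with w = 1 for points
    vec4 worldPos = model * vec4(inPosition, 1.0);
    // n' = (M^-1)^T n, required if M contains non-uniform scaling
    worldNormal = mat3(transpose(inverse(model))) * inNormal;
    gl_Position = viewProjection * worldPos;
}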
A similar transformation matrix can be constructed for a change of basis, projecting a point onto new axes. Combined with a translation (changing the origin), this performs a change of coordinate system.
For correct perspective handling, the camera frustum is transformed into a cube centered at \((0,0,0)\) with side length 2.
TODO: World view transformation
Sprites are polygons that are always rotated towards the camera. This is a cheap to render alternative to a complete model. Implementing sprites can be done by storing the central point as well as displacement vectors for the points making up the polygon. In the shader, these displacements are applied such that the polygon is facing the camera. Multiple variants of the sprite can be stored which show the object from different sides.
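A sketch of such a sprite vertex shader, assuming the center position and a per-corner displacement are provided as attributes; the displacement is applied in view space so the quad always faces the camera:
#version 330 core
layout (location = 0) in vec3 spriteCenter;   // shared by all four corners of the quad
layout (location = 1) in vec2 cornerOffset;   // e.g. (+-0.5, +-0.5) scaled by the sprite size
uniform mat4 view;
uniform mat4 projection;
void main() {
    // Transform only the center into view space, then displace within the
    // view plane so the resulting quad is always oriented towards the camera.
    vec4 viewPos = view * vec4(spriteCenter, 1.0);
    viewPos.xy += cornerOffset;
    gl_Position = projection * viewPos;
}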
The vertex shader can be used to show animations, by providing two vertex positions for keyframes before and after the current time, and configuring the vertex shader to interpolate between those and correctly move the vertex. This can also be used to store multiple model configurations (such as facial expressions) as offsets from the base model and morphing continuously between them.
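A minimal sketch of such keyframe interpolation in the vertex shader, assuming the blend factor between the two keyframes is provided as a uniform:
#version 330 core
layout (location = 0) in vec3 positionKeyA; // vertex position at the previous keyframe
layout (location = 1) in vec3 positionKeyB; // vertex position at the next keyframe
uniform float blend;                        // 0..1, derived from the current time
uniform mat4 modelViewProjection;
void main() {
    // Linear interpolation between the two keyframe positions.
    vec3 pos = mix(positionKeyA, positionKeyB, blend);
    gl_Position = modelViewProjection * vec4(pos, 1.0);
}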
A common task is animating or moving some sort of skeleton with attached geometry. Each bone contains a transformation matrix relative to its parent. Those transformations can now be animated, and vertices which are associated to a specific bone can be moved accordingly.
Problems arise for vertices near multiple bones. Rigidly associating those with a single bone results in intersecting or stretched geometry. This can be solved by assigning a continuous weight instead of a rigid assignment for associating vertices with bones.
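A sketch of this weighted (linear blend) skinning with up to four bones per vertex; the bone indices and weights are assumed to be vertex attributes and the bone matrices a uniform array:
#version 330 core
layout (location = 0) in vec3 inPosition;
layout (location = 1) in ivec4 boneIndices; // up to four influencing bones per vertex
layout (location = 2) in vec4 boneWeights;  // corresponding weights, summing to 1
uniform mat4 boneMatrices[32];              // animated transform per bone (illustrative size)
uniform mat4 modelViewProjection;
void main() {
    // Blend the bone transforms using the per-vertex weights.
    mat4 skin = boneWeights.x * boneMatrices[boneIndices.x]
              + boneWeights.y * boneMatrices[boneIndices.y]
              + boneWeights.z * boneMatrices[boneIndices.z]
              + boneWeights.w * boneMatrices[boneIndices.w];
    gl_Position = modelViewProjection * (skin * vec4(inPosition, 1.0));
}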
The pixel shader (fragment shader) runs after rasterization, for every pixel. Its output is the desired color of the pixel, the inputs are the outputs of the previous stages.
Rasterization happens in screen space (the 2×2×2 cube from the projection step). The first step in rasterization is determining which triangles the pixel lies inside. The pixel's location inside the triangle is then computed in barycentric coordinates. The depth is also computed; it may be beneficial, however, to map the depth non-linearly in order to achieve more precision for objects close to the camera. The depth may be overridden later in the pixel shader.
The Z-buffer stores the depth value of each pixel. It is used to determine which objects are visible or hidden by other geometry. The GPU may perform an early depth test, saving pixel shader invocations for pixels which are already known not to be visible.
At each point on a surface, light gets refracted (entering the surface, changing its direction) and reflected. The roughness of a surface changes how much the direction of reflected light varies. Surface materials are roughly grouped into metals, which only reflect light, and non-metals, which involve refraction and subsurface scattering. To approximate scattering, we consider two types of reflected light: specular and diffuse light. The specular light is reflected almost entirely around the mirror direction of the incoming light, with a spread that depends on the surface roughness. The diffuse light exits in a more uniform distribution of directions.
The material properties are defined by the Bidirectional Reflectance Distribution Function (BRDF) \(f(l,v)\), which takes the input and output directions and models how much light is reflected in the specified output direction.
In the following, simple BRDFs are presented.
The Lambertian model simulates a perfectly diffuse surface: light is reflected equally in each direction. It models the amount of reflected light depending on the incident angle.
\[ I_d = (N \cdot L) * C_d * k_d \] With surface normal \(N\), light direction \(L\), light color \(C_d\) and object color \(k_d\).
Light attenuation is used to simulate decrease in light intensity depending on distance \(d\) to the source:
\[ L_p = \frac{L_i}{At_c + At_l * d + At_e * d^2} \] with constant component \(At_c\), linear component \(At_l\) and quadratic component \(At_e\).
The Phong reflectance model adds ambient and specular lighting. The ambient light is simulated to be constant: \[ I_a = C_a * k_a \]
The diffuse light is simulated using the Lambertian model: \[ I_d = (N \cdot L) * C_d * k_d \]
The specular light is introduced based on the dot product between the view vector \(V\) and the reflect vector (direction of reflected light) \(R\): \[ I_s = (V \cdot R) ^ \alpha * C_s * k_s \] The parameter \(\alpha\) controls the roughness of the object.
A problem occurs if the angle between \(R\) and \(V\) is greater than 90°, leading to a negative value for \(V \cdot R\). This is solved by the Blinn-Phong model.
Here, the specular term is calculated from the surface normal and the newly introduced \(H\) vector instead of the view and reflect vectors. The \(H\) vector (halfway vector) is the normalized sum of the view and light vectors. This avoids the negative dot products of the Phong model: \[ I_s = (N \cdot H) ^ \alpha * C_s * k_s \]
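Putting the ambient, diffuse and specular terms together, a Blinn-Phong fragment shader could look roughly like this sketch (all light and material parameters, including the attenuation coefficients, are assumed to be uniforms with illustrative names):
#version 330 core
in vec3 worldPos;
in vec3 worldNormal;
uniform vec3 cameraPos;
uniform vec3 lightPos;
uniform vec3 lightColor;        // C_d and C_s combined
uniform vec3 ambient;           // C_a * k_a combined
uniform vec3 materialDiffuse;   // k_d
uniform vec3 materialSpecular;  // k_s
uniform float shininess;        // alpha
uniform vec3 attenuation;       // (At_c, At_l, At_e)
out vec4 fragColor;
void main() {
    vec3 N = normalize(worldNormal);
    vec3 L = normalize(lightPos - worldPos);
    vec3 V = normalize(cameraPos - worldPos);
    vec3 H = normalize(L + V);  // halfway vector of the Blinn-Phong model
    float d = length(lightPos - worldPos);
    float att = 1.0 / (attenuation.x + attenuation.y * d + attenuation.z * d * d);
    vec3 diffuse  = max(dot(N, L), 0.0) * lightColor * materialDiffuse;
    vec3 specular = pow(max(dot(N, H), 0.0), shininess) * lightColor * materialSpecular;
    fragColor = vec4(ambient + att * (diffuse + specular), 1.0);
}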
The final lighting is dependent on where lighting is computed, and how normals are computed. Using normals per vertex, instead of per triangle, allows for interpolation and smoother lighting. Computing lighting per pixel (interpolating normals), instead of per vertex (interpolating lighting), produces better results.
Instead of using a constant material color when calculating lighting, we can query a texture. The texture is mapped to a coordinate space from \((0,0)\) to \((1,1)\) (\((s,t)\) texture coordinates). Each vertex now has an additional \((s,t)\) texture coordinate.
Aliasing artifacts can occur when textured objects appear on different scales in the scene. This is solved by storing the texture in multiple (correctly down-scaled) sizes, and using a smaller texture for far away objects.
Since a square screen pixel may not map nicely to a single texture pixel, OpenGL provides anisotropic filtering, which queries multiple texture pixels to improve this behavior.
Specular lighting should also be affected by the texture. We introduce a separate texture that provides the specular color. This can also be a one-channel texture just specifying how much light should be reflected.
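A sketch of the corresponding texture lookups in the fragment shader (sampler and variable names are illustrative):
#version 330 core
in vec2 texCoord;              // interpolated (s, t) coordinates
uniform sampler2D diffuseMap;  // base material color
uniform sampler2D specularMap; // per-texel specular intensity (single channel)
out vec4 fragColor;
void main() {
    vec3 k_d = texture(diffuseMap, texCoord).rgb;   // replaces the uniform diffuse color
    float k_s = texture(specularMap, texCoord).r;   // how strongly this texel reflects
    // These values would replace materialDiffuse and materialSpecular in the
    // lighting computation sketched above; here they are only written out directly.
    fragColor = vec4(k_d, k_s);
}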
Photo textures already contain illumination information for a specific condition. Instead, textures should be used which only contain the material properties. For this to look good however, we need to simulate the lighting (such as small shadows) which previously was present (but static) in the texture. This is done by storing geometric information in the texture, such as the normal map in the next section.
An additional texture is created containing the normal direction at each texel position. This allows light computation as if a more detailed surface structure was present, without actually modifying the geometry, by querying the normal map texture in the pixel shader.
Care has to be taken to transform the normal from texture space into the coordinate system in which the light is defined (world space). This requires defining a new coordinate frame for each surface on which a texture is applied. The vertex data can be appended by a tangent and binormal vector defining the directions of the texture \(s\) and \(t\) axes.
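A sketch of the per-pixel part of normal mapping, assuming the vertex shader passes the interpolated tangent, binormal and normal (the TBN frame) in world space:
#version 330 core
in vec3 tangent;    // direction of the texture s axis in world space
in vec3 binormal;   // direction of the texture t axis in world space
in vec3 normal;     // interpolated surface normal in world space
in vec2 texCoord;
uniform sampler2D normalMap;
out vec4 fragColor;
void main() {
    // The normal map stores directions in [0,1]^3; remap to [-1,1]^3.
    vec3 n_tex = texture(normalMap, texCoord).rgb * 2.0 - 1.0;
    // Transform the normal from texture (tangent) space into world space.
    mat3 TBN = mat3(normalize(tangent), normalize(binormal), normalize(normal));
    vec3 n_world = normalize(TBN * n_tex);
    // n_world would now be used in the lighting computation; visualize it here.
    fragColor = vec4(n_world * 0.5 + 0.5, 1.0);
}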
To render textures with more detail, including occlusions within the texture, more effort has to be made. One option is displacement mapping: the idea is to use a highly tessellated surface and a height map (displacement texture) to displace the vertices in the vertex shader. This works, but does not scale since it requires generating a large number of additional vertices.
An alternative technique, not requiring additional vertices, is Parallax Mapping. This technique assumes a locally constant surface height, and chooses a different texel to use in the pixel shader depending on the angle between the surface and view vector. If the height at the original texel \((u,v)\) is \(h\), the view vector intersects the point \(h\) above the new texel \((u',v')\).
This technique produces suboptimal results especially for shallow viewing angles. A different implementation is less accurate but produces fewer artifacts: Here, the texel is always displaced by the height at the original texel. This may not be a very accurate approximation, but is free of the artifacts at shallow viewing angles.
A more accurate approach computes the true intersection between the view ray and the surface defined in the texture. Compared to the previous techniques, this correctly renders a texture-based height map, but needs iteration to find the true texel coordinate.
Implementing the intersection search using binary search requires few iterations, but artifacts occur when the view ray intersects the surface multiple times, since the initial step size is large. Performance can also suffer because the loop is harder to unroll than with a linear search. Linear search additionally allows texture loads for consecutive accesses to be issued in parallel.
A combined version performs an initial linear search until the first intersection is found, and then refines the intersection in the last (small) interval using binary search with few steps.
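A rough sketch of this combined search in the fragment shader; the tangent-space view direction, the height-map layout (0 at the polygon surface, 1 at the deepest point) and all names are assumptions:
#version 330 core
in vec2 texCoord;
in vec3 viewDirTangent;        // view direction in tangent space (assumed input)
uniform sampler2D heightMap;   // depth below the polygon: 0 = surface, 1 = deepest
uniform sampler2D diffuseMap;
uniform float heightScale;
out vec4 fragColor;
void main() {
    vec3 v = normalize(viewDirTangent);
    // Texture-space offset the ray accumulates if it reaches the maximum depth.
    vec2 maxOffset = -v.xy / v.z * heightScale;
    const int linearSteps = 16;
    float stepSize = 1.0 / float(linearSteps);
    float rayDepth = 0.0;
    vec2 uv = texCoord;
    // Linear search: advance until the ray is below the stored height field.
    for (int i = 0; i < linearSteps; ++i) {
        if (rayDepth >= textureLod(heightMap, uv, 0.0).r)
            break;
        rayDepth += stepSize;
        uv = texCoord + rayDepth * maxOffset;
    }
    // Binary search: refine the intersection inside the last linear interval.
    float lo = max(rayDepth - stepSize, 0.0);
    float hi = rayDepth;
    for (int i = 0; i < 5; ++i) {
        float mid = 0.5 * (lo + hi);
        uv = texCoord + mid * maxOffset;
        if (mid >= textureLod(heightMap, uv, 0.0).r) hi = mid; else lo = mid;
    }
    fragColor = texture(diffuseMap, uv);
}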
A frame buffer is a piece of memory on the GPU into which the scene is rendered. We can create custom frame buffers and render for example into a texture that will be used later.
This presents a method to render shadows. A camera is placed at the location of the light. The distance from the scene to this camera is then calculated and stored in a texture. When rendering the image later, the distance from a point in the scene to the light can be compared to the value stored in the texture. If the stored distance is shorter than the calculated distance, that implies that some geometry is between the point and the light, and thus the point is in shadow.
To prevent aliasing, hide the limited shadow-map resolution and create a soft shadow effect, multiple samples can be taken.
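A sketch of the basic shadow test in the fragment shader, assuming the fragment position transformed into the light's clip space is passed in and the depth texture rendered from the light is bound as shadowMap:
#version 330 core
in vec4 lightSpacePos;         // fragment position in the light's clip space
uniform sampler2D shadowMap;   // depth rendered from the light's point of view
uniform vec3 litColor;
uniform vec3 shadowColor;
out vec4 fragColor;
void main() {
    // Perspective divide and remap from [-1,1] to the [0,1] texture/depth range.
    vec3 proj = lightSpacePos.xyz / lightSpacePos.w;
    proj = proj * 0.5 + 0.5;
    float storedDepth = texture(shadowMap, proj.xy).r;
    // A small bias avoids self-shadowing artifacts ("shadow acne").
    float bias = 0.005;
    bool inShadow = proj.z - bias > storedDepth;
    fragColor = vec4(inShadow ? shadowColor : litColor, 1.0);
}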
Frame buffers can be used to apply post processing effects: First render onto a texture, then apply the texture onto a quad and render it using the post-processing effects.
Using g-buffers, many aspects of the rendering pipeline such as lighting information and world-space position can be saved to multiple buffers. This can be the basis for more complex effects.
The deferred shading method for example computes the illumination from the values in the g-buffer, which also contains diffuse and specular color. This means that those calculations only need to be done for parts of the scene that are actually visible! It also makes it possible to compute the lighting with multiple lights. Deferred shading is fast for a large number of lights. The number of materials however is limited, since all materials are processed with the same shader. Another disadvantage is that the lighting calculations are performed for the entire scene even if the light only illuminates a small part.
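As a sketch, the geometry pass of deferred shading might write surface attributes into multiple render targets like this (the g-buffer layout and all names are assumptions):
#version 330 core
// Geometry pass: write surface attributes into the g-buffer instead of shading.
in vec3 worldPos;
in vec3 worldNormal;
in vec2 texCoord;
uniform sampler2D diffuseMap;
uniform sampler2D specularMap;
layout (location = 0) out vec4 gPosition;  // world-space position
layout (location = 1) out vec4 gNormal;    // world-space normal
layout (location = 2) out vec4 gAlbedo;    // diffuse color + specular intensity
void main() {
    gPosition = vec4(worldPos, 1.0);
    gNormal   = vec4(normalize(worldNormal), 0.0);
    gAlbedo   = vec4(texture(diffuseMap, texCoord).rgb,
                     texture(specularMap, texCoord).r);
}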
The ambient component in the lighting calculations simulates indirect illumination. An approximation for indirect illumination is ambient occlusion: A point that can not receive light from all directions will be darker.
Ambient occlusion can be calculated in screen space (SSAO) using the g-buffers, which saves computational resources: pixels around the current pixel are sampled to approximate the amount of occlusion.
An improvement can be made by only selecting sample pixels in the hemisphere defined by the surface normal.
An even more accurate result can be achieved by approximating the angle of the largest cone coming from the current position, instead of sampling.
SSAO does have some drawbacks, especially since no information is present for parts of the scene that are occluded: An object in the background may be wrongly occluded by foreground objects, even if they are far away. Removing the influence of the foreground object does not produce more information about occlusion by geometry between the object and background.
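A rough sketch of the screen-space sampling, assuming a g-buffer with view-space positions and normals and a uniform array of sample offsets (all names are illustrative):
#version 330 core
in vec2 texCoord;
uniform sampler2D gPosition;   // view-space positions from the g-buffer
uniform sampler2D gNormal;     // view-space normals from the g-buffer
uniform vec3 samples[16];      // random offsets inside the unit sphere
uniform mat4 projection;
uniform float radius;
out float occlusion;
void main() {
    vec3 pos = texture(gPosition, texCoord).xyz;
    vec3 n = normalize(texture(gNormal, texCoord).xyz);
    float occluded = 0.0;
    for (int i = 0; i < 16; ++i) {
        // Flip samples into the hemisphere around the surface normal.
        vec3 s = samples[i];
        if (dot(s, n) < 0.0) s = -s;
        vec3 samplePos = pos + radius * s;
        // Project the sample to find the corresponding g-buffer texel.
        vec4 clip = projection * vec4(samplePos, 1.0);
        vec2 uv = clip.xy / clip.w * 0.5 + 0.5;
        float sceneDepth = texture(gPosition, uv).z;
        // If the stored geometry is in front of the sample, it occludes the point.
        if (sceneDepth > samplePos.z + 0.025) occluded += 1.0;
    }
    occlusion = 1.0 - occluded / 16.0;
}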
TODO: SDFs
The geometry shader is located between the vertex- and pixel shader in our pipeline. It is executed once for every geometric primitive.
Primitives for use in the geometry shader are points, lines and triangles, as well as lines and triangles with adjacency information. The input is defined in the shader as
layout (triangles) in;
The output of the geometry stage can be points, line strips or triangle strips, and is defined in the shader as
layout (triangle_strip, max_vertices = 3) out;
The max_vertices parameter is required to specify the maximum number of vertices that the shader may output.
The geometry shader receives inputs just as any other shader, but since it processes a primitive instead of a single vertex, the inputs will be arrays containing each input for each of the vertices making up the primitive.
The usual gl_Position output from the vertex shader is available at gl_in[i].gl_Position. Additional inputs are available to receive an ID of the current primitive and shader invocation. Outputs do not become arrays; we instead call EmitVertex() to emit the output variables for each vertex, one after another. Once all vertices of a primitive have been output, we call EndPrimitive().
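A minimal pass-through geometry shader illustrating these built-ins; it assumes the vertex shader writes a normal output and simply forwards each input triangle unchanged:
#version 330 core
layout (triangles) in;
layout (triangle_strip, max_vertices = 3) out;
in vec3 normal[];      // per-vertex inputs from the vertex shader arrive as arrays
out vec3 fragNormal;
void main() {
    for (int i = 0; i < 3; ++i) {
        gl_Position = gl_in[i].gl_Position;
        fragNormal = normal[i];
        EmitVertex();  // emit the current set of output variables as one vertex
    }
    EndPrimitive();    // finish the output triangle strip
}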
The geometry shader can also generate new geometry. This is explained here with an example in which a steep pyramid shall be generated in the middle of each triangle, visualizing its normal.
This requires generating four output triangles from one input triangle primitive (the original triangle as well as the three sides of the pyramid). To accomplish this, the max_vertices parameter is increased to 12. The triangles are output individually (not as a triangle strip), and EndPrimitive() is thus called four times in the shader.
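A sketch of such a shader, assuming the vertex shader passes untransformed positions through and omitting any lighting-related outputs; the apex height of 0.5 is an arbitrary illustrative value:
#version 330 core
layout (triangles) in;
layout (triangle_strip, max_vertices = 12) out;
uniform mat4 viewProjection;
void emitTriangle(vec3 a, vec3 b, vec3 c) {
    gl_Position = viewProjection * vec4(a, 1.0); EmitVertex();
    gl_Position = viewProjection * vec4(b, 1.0); EmitVertex();
    gl_Position = viewProjection * vec4(c, 1.0); EmitVertex();
    EndPrimitive();
}
void main() {
    // Assumes the vertex shader forwarded untransformed world-space positions.
    vec3 p0 = gl_in[0].gl_Position.xyz;
    vec3 p1 = gl_in[1].gl_Position.xyz;
    vec3 p2 = gl_in[2].gl_Position.xyz;
    vec3 center = (p0 + p1 + p2) / 3.0;
    vec3 n = normalize(cross(p1 - p0, p2 - p0));
    vec3 apex = center + 0.5 * n;   // pyramid tip along the triangle normal
    emitTriangle(p0, p1, p2);       // original triangle
    emitTriangle(p0, p1, apex);     // three sides of the pyramid
    emitTriangle(p1, p2, apex);
    emitTriangle(p2, p0, apex);
}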
Another example given is the billboard example for molecule visualization, with the difference being that the vertices for the displaced billboard corners are not generated on the CPU, but created in the geometry shader.
An example for the adjacency feature of the input is given by the application of creating borders around a model: The input data now consists of six vertices per triangle, containing not only the current triangle but also the remaining vertices of the three triangles that share an edge with the current triangle.
A way of generating shadows is to produce geometry representing the projection of our objects based on the light position (the shadow volume). When rendering, we can check if a point is inside a shadow volume or not.
Using triangle adjacency, the contour edges of an object with respect to the light (instead of the camera, like in the “borders” example) can be found. These edges are then used to create a side of the shadow volume.
The process of rendering shadows using shadow volumes is typically as follows: render the scene, then render the shadow volumes into a stencil buffer, incrementing the stencil for front faces and decrementing it for back faces; points with a non-zero stencil value lie inside a shadow volume and are therefore in shadow.
Shadow volumes produce higher resolution (sharper) shadows than shadow-mapping, at the cost of more expensive computation.
Another feature of the geometry shader is rendering the same object to multiple images in one draw call.
The geometry shader can calculate the projection of each object into multiple camera configurations. The pixel shader is then executed for each of those images. An example where this is needed is a cube map, where the scene is rendered into the six faces of a cube.
In the geometry shader, before EmitVertex() we can update gl_Layer to select which layer of the current texture we want to render to. It is also not required to loop over layers in the shader; instead, instancing can be used to execute the shader more than once for every primitive. The number of invocations is specified in the shader:
layout (invocations = 16, points) in;
Tessellation produces a more detailed model from a model with fewer triangles. The tessellation shaders allow us to subdivide triangles based on the distance to the camera or other parameters.
In order to perform this tessellation, not one but two additional stages are introduced between the vertex- and geometry shader: The Tessellation Control Shader and the Tessellation Evaluation Shader.
The Tessellation Control Shader (TCS) is executed once for each vertex of the input primitives. Its goal is to determine the tessellation level, i.e. how much the triangle should be subdivided.
The Tessellation Evaluation Shader (TES) is executed once for each new vertex that was created by the subdivision. Its goal is to move the new vertex to the correct position.
Actually generating the new primitives and vertices is done by the GPU hardware, we can only configure that step using the control shader.
Multiple subdivision patterns are available: triangles, quads and isolines.
The goal of the TCS is to generate control points (inputs into subdivision: three for triangles, four for quads) and to define the tessellation levels.
This shader is executed once for each output control point, but all executions for the same patch are connected: They share the same output and can use synchronization.
For triangles, there is an inner tessellation level and three outer tessellation levels. The inner tessellation level is applied first and controls how many inner triangles are created (by dividing all outer lines into \(n\) segments, with inner tessellation level \(n\)). Then, the outer tessellation levels specify how many segments each of the original outer lines should actually end up with.
The outer tessellation level is needed to correctly process adjacent triangles, which may have different inner tessellation levels.
Usually, tessellation levels are dynamically calculated based on how far the object is from the camera!
Tessellation of quads can be similarly controlled, although now there are four outer tessellation levels and two inner tessellation levels (one for each axis).
Isolines require just two outer tessellation levels.
Some inputs are again provided in gl_in[], providing for example the vertex positions. gl_InvocationID provides the ID of the vertex inside the current primitive (indexing gl_in), and gl_PrimitiveID identifies the primitive. gl_PatchVerticesIn contains the number of vertices in the current patch.
Custom shader inputs can be used as usual, in array form like in the geometry shader.
gl_out[i].gl_Position contains the vertex position as usual; the tessellation levels described above have to be written to gl_TessLevelOuter[4] and gl_TessLevelInner[2].
Custom TCS outputs (the inputs to the TES) are per control point by default, but can be specified to be per patch using
patch out vec3 customVar;
TCS instantiations can read outputs from other instantiations belonging to the same patch. This requires synchronization, which is provided by the barrier() function, ensuring that all executions have written their outputs.
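A minimal TCS sketch for triangles with constant tessellation levels (in practice the levels would be computed, for example from the distance to the camera):
#version 400 core
layout (vertices = 3) out;   // three control points per output patch
void main() {
    // Pass the control point of this invocation through unchanged.
    gl_out[gl_InvocationID].gl_Position = gl_in[gl_InvocationID].gl_Position;
    // The tessellation levels only need to be written once per patch.
    if (gl_InvocationID == 0) {
        gl_TessLevelInner[0] = 4.0;
        gl_TessLevelOuter[0] = 4.0;
        gl_TessLevelOuter[1] = 4.0;
        gl_TessLevelOuter[2] = 4.0;
    }
}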
The TES is executed for each new vertex. It receives the barycentric coordinates of the vertex in the original triangle as input and interpolates the new position for the vertex: gl_in[i].gl_Position contains the positions of the patch control points, gl_TessCoord contains the barycentric coordinates of the current vertex. The only required output is the position gl_Position of the vertex.
The order and spacing mode of the tessellation are specified:
layout (triangles, equal_spacing, ccw) in;
Multiple spacing modes such as equal_spacing, fractional_even_spacing and fractional_odd_spacing are available. The barycentric coordinates provide the vertex position as a weighted sum of the control point positions.
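A matching minimal TES for triangles, interpolating the position of each generated vertex from the three control points using the barycentric coordinates:
#version 400 core
layout (triangles, equal_spacing, ccw) in;
uniform mat4 modelViewProjection;
void main() {
    // gl_TessCoord holds the barycentric coordinates of the generated vertex.
    vec4 p = gl_TessCoord.x * gl_in[0].gl_Position
           + gl_TessCoord.y * gl_in[1].gl_Position
           + gl_TessCoord.z * gl_in[2].gl_Position;
    gl_Position = modelViewProjection * p;
}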
For quads and isolines, the coordinates are given as \(uv\) coordinates inside the original quad.
Using an isolines tessellation with the first tessellation level set to 1, a bezier curve can be rendered. The second tessellation level specifies the number of segments in the curve.
The TES calculates the position of each vertex by evaluating the bezier curve along the second axis of the quad (the one with varying tessellation level).
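A sketch of such a TES, assuming the patch consists of four control points defining a cubic bezier curve:
#version 400 core
layout (isolines, equal_spacing) in;
void main() {
    // gl_TessCoord.x runs from 0 to 1 along the tessellated line.
    float t = gl_TessCoord.x;
    float u = 1.0 - t;
    // Cubic bezier basis (Bernstein polynomials) over the four control points.
    gl_Position = u * u * u * gl_in[0].gl_Position
                + 3.0 * u * u * t * gl_in[1].gl_Position
                + 3.0 * u * t * t * gl_in[2].gl_Position
                + t * t * t * gl_in[3].gl_Position;
}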
This can be extended to bezier surfaces using quad tessellation.
GPUs have progressed from the first fixed-function architectures, to programmable shader stages and now unified shader execution units. Today’s GPUs can execute many shader instantiations and programs in parallel, independently of the shader stage. This is also what opened the door to a wide variety of general purpose GPU computing (GPGPU).
In order to leverage the data-parallel nature of the graphics processing pipeline, large scale SIMT architectures are used. Processors consist of multiple cores with a shared fetch/decode stage and execution context. Multiple SIMT processors are combined into larger complexes, which may be repeated multiple times. The GTX 1080 for example contains 20 “Streaming Multiprocessors”, each of which contains four 32-wide SIMT processors, resulting in a total of 2560 cores in the GPU.
Threads are executed in warps of 32 threads each. This matches the size of one SIMT processor.
Since all threads on the same SIMT processor execute the same instructions, special attention has to be paid to conditional branching.
One method is predication: both sides of the branch are executed by every thread, and each thread then selects one of the results based on the predicate. This is only viable if the branch is small, otherwise the performance cost of executing both branches is too high.
An alternative approach allows the threads to vote on which branch to execute. If all threads vote for the same branch, the other branch is not executed. Thus, care has to be taken to avoid big branches that diverge within the same warp.
Accessing memory introduces latency, for example on the order of 100 cycles. The bandwidth between the processor and memory is also limited. Stalling the processor while waiting for memory worsens utilization.
GPUs implement a memory hierarchy similar to CPUs:
The memory closest to the cores are registers, shared memory and L1 cache. For example, the GTX 1080 features
The cache works without explicit user control and exploits spatial and temporal locality of data accesses. A typical cache line size is 32 bytes, and transfers between memory and cache happen only for entire cache lines. Efficient usage of the cache and proper alignment of data to cache lines are imperative for optimal performance; the benefit usually originates from reduced memory bandwidth requirements, which is often the bottleneck for processing speed.
The GPU hides memory latency by oversubscribing the SM with warps, and executing other warps while some wait for memory. On the GTX 1080, each SM can be assigned 64 active warps, while it’s only able to execute four warps in parallel.
Switching between warps is fast, and is not comparable to a context switch on the CPU. This does however mean that the number of warps per SM (GPU occupancy) is also limited by the total register usage.
Shared memory is fast memory close to the cores, with the restriction that it is local to one streaming multiprocessor. The usual use case is to explicitly move a hot region of memory to shared memory before computation.
TODO: Compute Shader
TODO: CUDA