unvirtual

Space partitioning and ray traversal with `cukd`

unvirtual — Fri, 23 Mar 2012 00:00:00 PDT

A first working version of my parallelized kd-tree implementation, cukd, is available on GitHub, so this is a good moment to analyze how it performs so far.

What it does

Given a triangle soup, cukd constructs a kd-tree on NVIDIA GPUs using the CUDA framework, combining empty space cutting, median splitting and the surface area heuristic (SAH). Once the tree is constructed, it takes a list of rays (their origins and directions) and traces them through the scene, searching for ray-triangle intersections and returning the indices of the foremost intersected triangles together with the traversal costs.

Here’s a visualisation of the ray traversal cost for the Stanford Dragon with roughly 200k triangles.

Red and green pixels represent triangle hits and no hits, respectively, and the brightness encodes the traversal cost.

How well it works

Quite to my surprise, the first version of cukd performs nicely. So far, the main focus of development has been on correctly implementing the algorithm detailed in “Real-Time KD-Tree Construction on Graphics Hardware” by Zhou et al., choosing the best interplay of algorithms like parallel reductions and scans, reducing memory allocations on and transfers to the GPU, and using structures of arrays wherever possible. Hardware-specific considerations such as memory alignment and the use of texture, shared and constant memory have been secondary (though they are very important); these changes can be incorporated quite easily in the future. Still, the performance is quite impressive, topping out at almost 90 million rays/sec for a small scene:

Building the kd-tree for the Stanford Dragon consisting of 300k triangles takes around 350ms on my NVIDIA GTX 760, which is not really real-time territory. Traversing this scene at 40 million rays/sec, however, pretty much is. Of course, there is more to raytracing than just ray-triangle intersections, but it’s a nice start. Here the rays are cast line by line into the scene. Switching to Morton order should improve cache coherence and therefore reduce the ray traversal time even further.
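To make the Morton order idea concrete, here is a minimal host-side sketch of how a pixel coordinate could be mapped to its position along a Z-order curve. The function names `spread_bits16` and `morton2d` are made up for this illustration and are not part of cukd; coordinates are assumed to fit into 16 bits.

```cpp
#include <cstdint>

// Spread the lower 16 bits of v so that bit i ends up at bit 2*i.
inline std::uint32_t spread_bits16(std::uint32_t v) {
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

// Morton (Z-order) index of pixel (x, y):
// x occupies the even bits, y the odd bits.
inline std::uint32_t morton2d(std::uint32_t x, std::uint32_t y) {
    return spread_bits16(x) | (spread_bits16(y) << 1);
}
```

Sorting the primary rays by `morton2d(x, y)` of their pixel keeps rays that are neighbours on screen close together in memory, which is the property one would hope to exploit for more coherent traversal.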

Looking closer at the graph, it’s obvious that there is a lot of overhead one can potentially get rid of. The tree creation time rises linearly and rather slowly: going from ~6k to 300k triangles, a factor of 50, only requires 8 times the computation time. The large ~60ms offset at ~6k triangles, however, indicates room for improvement.

The distribution of the traversal cost per ray (i.e. the number of steps through the tree plus the number of triangle intersection tests) looks the same for every size of the model:

The peak around ~40 steps for rays intersecting some triangle is about where one would expect it: close to the sum of the average leaf depth of the tree (18 in this case) and the average number of triangles in each leaf (26 in this case), which is what a ray pays when it finds its hit in the first leaf it reaches. In the no-hit case, however, there is a large spread to fairly high values. I suppose this has to do with some kind of problem in empty space splitting. Indeed, the above picture looks very different when we allow more empty space in each node.

Apparently there are huge nodes containing only a few triangles at their boundaries, something that shouldn’t really happen.

K-d trees with CUDA

unvirtual — Mon, 05 Mar 2012 00:00:00 PST

Spatial acceleration structures are crucial whenever relations between multi-dimensional data are to be analyzed. For instance, nearest neighbour searches in particle-based fluid simulations or ray-triangle intersections in raytracing both profit largely from space subdivisions.

Generating these structures on graphics hardware in a fast, parallelized way opens up new opportunities for real-time applications. In the following, the main ingredients for the construction of one of these structures, the so-called k-d tree, on GPUs are discussed. “Real-Time KD-Tree Construction on Graphics Hardware” by Zhou et al. describes such an approach to k-d tree construction, a method consisting mainly of two iterations over the nodes of the tree:

Large Node Stage

In the first step of the k-d tree construction, the top of the tree structure is established by cutting off empty space in node bounding boxes and by splitting nodes into children at the median of a node’s tight triangle bounding box. The split axis is chosen to be the longest node edge. A node is marked as a leaf node as soon as it contains fewer than a fixed number of elements [N_l], with [N_l = 64] in this case.

The resulting child nodes contain a minimal amount of empty space and cutting the tight bounding box guarantees a termination of the process by subdividing triangles into child nodes at each step. In the case of a triangle being cut in the process, the triangle is clipped and the resulting left and right triangle bounding boxes are sorted into the respective child nodes.

Most of this stage is parallelized either over nodes or over triangles, building the tree iteratively from top to bottom as long as nodes with [N \ge N_l] elements exist.
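A sequential sketch of one large-node split may help to see the moving parts. It is a simplification and not the cukd code: the names `AABB`, `tight_bounds`, `median_split` and `partition` are invented here, and triangles straddling the plane are handled by clipping their bounding boxes at the plane, whereas the approach described above clips the triangle itself and recomputes tighter boxes.

```cpp
#include <algorithm>
#include <vector>

struct AABB { float min[3]; float max[3]; };

// Tight bounding box around a non-empty set of triangle bounding boxes.
AABB tight_bounds(const std::vector<AABB>& boxes) {
    AABB r = boxes.front();
    for (const AABB& b : boxes)
        for (int a = 0; a < 3; ++a) {
            r.min[a] = std::min(r.min[a], b.min[a]);
            r.max[a] = std::max(r.max[a], b.max[a]);
        }
    return r;
}

struct Split { int axis; float position; };

// Spatial median along the longest edge of the tight box.
Split median_split(const AABB& tight) {
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (tight.max[a] - tight.min[a] > tight.max[axis] - tight.min[axis])
            axis = a;
    return { axis, 0.5f * (tight.min[axis] + tight.max[axis]) };
}

// Sort boxes into the two children; boxes straddling the plane are
// clipped at the plane and end up in both children.
void partition(const std::vector<AABB>& boxes, const Split& s,
               std::vector<AABB>& left, std::vector<AABB>& right) {
    for (const AABB& b : boxes) {
        if (b.max[s.axis] <= s.position) { left.push_back(b); continue; }
        if (b.min[s.axis] >= s.position) { right.push_back(b); continue; }
        AABB l = b, r = b;
        l.max[s.axis] = s.position;
        r.min[s.axis] = s.position;
        left.push_back(l);
        right.push_back(r);
    }
}
```

On the GPU the loops over boxes become the parallel dimension, and the per-node results are gathered with reductions and scans instead of `push_back`.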

Small Node Stage

In a subsequent step, the resulting leaf nodes (the small nodes) are further subdivided using the Surface Area Heuristic (SAH) (further detail in “Heuristic Ray Shooting Algorithms” by Vlastimil Havran).

SAH provides a model for estimating the cost of ray traversal through a spatial acceleration structure, a crucial measure for fast ray tracing. The basis for SAH cost estimation is the question: given a ray that intersects a bounding box, what is the probability that this same ray also intersects a bounding volume within that box?

The answer is simple: for a convex volume B inside a box A and uniformly distributed rays, it is the ratio of the surface area of B to the surface area of A. With this at hand, one can estimate the average number of internal and external nodes visited as well as the number of necessary ray-triangle intersections per ray. A weighted sum over these numbers (with the weights being per-node and per-intersection costs) gives an approximation of the total cost.

Obviously, one would like to minimize this cost already during the construction of the k-d tree and not only after the fact.

As a prerequisite, given the small nodes, a list of all possible candidate splitting planes is constructed, which are taken to be boundary planes of the triangle bounding boxes in any given node.

At each step during the small node stage, the cost of each candidate splitting plane in every node is estimated using the SAH model and the plane with the minimal SAH cost is chosen. A node is no longer split, and becomes a leaf, as soon as even the cheapest split is estimated to cost more than the number of elements in the node, i.e. more than testing all of its triangles directly.
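The following sequential sketch shows how the SAH cost of a candidate plane and the search for the cheapest plane could look. It is an illustration under assumptions, not the cukd kernel: `Box`, `sah_cost`, `best_sah_split` and the cost weights `c_traverse` and `c_intersect` (both set to 1 here) are names and values chosen for this example, and the node box is assumed to have a non-zero surface area.

```cpp
#include <initializer_list>
#include <limits>
#include <vector>

struct Box { float min[3]; float max[3]; };

float surface_area(const Box& b) {
    float dx = b.max[0] - b.min[0];
    float dy = b.max[1] - b.min[1];
    float dz = b.max[2] - b.min[2];
    return 2.0f * (dx * dy + dy * dz + dz * dx);
}

// Estimated cost of splitting `node` at `position` along `axis`:
// one traversal step plus the intersection tests in each child, weighted
// by the probability (surface area ratio) of a ray entering that child.
float sah_cost(const Box& node, const std::vector<Box>& tri_boxes,
               int axis, float position,
               float c_traverse = 1.0f, float c_intersect = 1.0f) {
    Box left = node, right = node;
    left.max[axis] = position;
    right.min[axis] = position;
    int n_left = 0, n_right = 0;
    for (const Box& t : tri_boxes) {
        bool in_right = t.max[axis] > position;
        // boxes lying flat in the plane are counted on the left
        bool in_left = t.min[axis] < position || !in_right;
        n_left += in_left;
        n_right += in_right;
    }
    return c_traverse + c_intersect *
        (surface_area(left) * n_left + surface_area(right) * n_right) /
        surface_area(node);
}

struct BestSplit { int axis; float position; float cost; };

// Candidates are the boundary planes of the triangle bounding boxes;
// planes on the node boundary are skipped. axis == -1 means "no candidate".
BestSplit best_sah_split(const Box& node, const std::vector<Box>& tri_boxes) {
    BestSplit best = { -1, 0.0f, std::numeric_limits<float>::max() };
    for (int axis = 0; axis < 3; ++axis)
        for (const Box& t : tri_boxes)
            for (float p : { t.min[axis], t.max[axis] }) {
                if (p <= node.min[axis] || p >= node.max[axis]) continue;
                float cost = sah_cost(node, tri_boxes, axis, p);
                if (cost < best.cost) best = { axis, p, cost };
            }
    return best;
}
```

With unit weights, the termination criterion described above amounts to comparing `best.cost` with `tri_boxes.size()` and turning the node into a leaf when splitting is not cheaper.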

Here, again, both the splitting candidate determination and the node splitting are performed in parallel over the small nodes.

Some Implementation Details

These are mainly notes to myself, but maybe they are helpful for someone else, too.

Node and Triangle Data

All relevant node data required for the construction phase is kept in flat lists, stored in Structs of Arrays (SoA) to allow for coalescing during access in CUDA kernels. Simplified, a list containing nodes and elements can be stored in the following way:

#include <thrust/device_vector.h>

struct NodeList {
    // indices to the left and right nodes in this list
    thrust::device_vector<int> left_nodes, right_nodes;
    // split axis of current node
    thrust::device_vector<int> split_axis;
    // split position of current node
    thrust::device_vector<float> split_position;
    // indices to the first triangle contained in the current node
    thrust::device_vector<int> first_element_idx;
    // number of elements in the current node
    thrust::device_vector<int> node_size;
    // indices to triangles in the original triangle array
    thrust::device_vector<int> element_idx;
};

Triangle data (vertex information) is static and stored in an array (maybe texture?). We only keep track of indices to this list in the NodeLists and do not shuffle the original data around. We keep a separate list of triangle bounding boxes, which is allowed to grow. Instead of storing clipped triangles after node splitting, we only use the triangle vertex information to compute left and right triangle bounding boxes after triangle splits and assign them to the corresponding nodes.

Using `thrust`

The most common operations on these arrays are transformations, reductions, scans, appending other lists or new elements, compactions and copies. For most of these operations, the thrust framework comes in handy. To copy nodes from one NodeList to another, for instance, we can use thrust::zip_iterator<>:

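#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

typedef thrust::tuple<int, int, int, float, int, int> NodeTuple;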
typedef thrust::tuple<thrust::device_vector<int>::iterator,
                      thrust::device_vector<int>::iterator,
                      thrust::device_vector<int>::iterator,
                      thrust::device_vector<float>::iterator,
                      thrust::device_vector<int>::iterator,
                      thrust::device_vector<int>::iterator> NodeTupleIterator;

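// list2's vectors must already hold at least as many elements as list1's
// (e.g. via resize()) before they are used as the copy destination
NodeList list1, list2;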
NodeTupleIterator begin =
    thrust::make_tuple(list1.left_nodes.begin(),
                       list1.right_nodes.begin(),
                       list1.split_axis.begin(),
                       list1.split_position.begin(),
                       list1.first_element_idx.begin(),
                       list1.node_size.begin());
NodeTupleIterator end =
    thrust::make_tuple(list1.left_nodes.end(),
                       list1.right_nodes.end(),
                       list1.split_axis.end(),
                       list1.split_position.end(),
                       list1.first_element_idx.end(),
                       list1.node_size.end());
NodeTupleIterator result =
    thrust::make_tuple(list2.left_nodes.begin(),
                       list2.right_nodes.begin(),
                       list2.split_axis.begin(),
                       list2.split_position.begin(),
                       list2.first_element_idx.begin(),
                       list2.node_size.begin());

thrust::copy(thrust::make_zip_iterator(begin),
             thrust::make_zip_iterator(end),
             thrust::make_zip_iterator(result));

Empty space cutting

To ensure that every split gives exactly two new nodes in the next NodeList to be processed, empty space cutting has to be done recursively for each node before median splits are done. The resulting empty splits are not appended to the next NodeList, but placed directly into the intermediate tree representation. The next nodes’ bounding boxes are adjusted accordingly.
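As an illustration of the idea, here is a minimal sequential sketch of empty space cutting for a single node. It is not the cukd implementation: `Cut`, `cut_empty_space` and the threshold parameter `min_ratio` (the fraction of the node extent that has to be empty before a cut is made) are assumptions made for this example.

```cpp
#include <vector>

struct AABB { float min[3]; float max[3]; };

// One empty-space cut: the plane and which side of it is empty.
struct Cut { int axis; float position; bool empty_left; };

// Compare the node box with the tight box around its triangles. Wherever
// the gap on one side exceeds min_ratio times the node extent along that
// axis (measured before cutting), cut the empty slab off and shrink the
// node box. The returned cuts would go straight into the intermediate
// tree representation.
std::vector<Cut> cut_empty_space(AABB& node, const AABB& tight,
                                 float min_ratio) {
    std::vector<Cut> cuts;
    for (int a = 0; a < 3; ++a) {
        float extent = node.max[a] - node.min[a];
        if (extent <= 0.0f) continue;
        if (tight.min[a] - node.min[a] > min_ratio * extent) {
            cuts.push_back({ a, tight.min[a], true });
            node.min[a] = tight.min[a];
        }
        if (node.max[a] - tight.max[a] > min_ratio * extent) {
            cuts.push_back({ a, tight.max[a], false });
            node.max[a] = tight.max[a];
        }
    }
    return cuts;
}
```

After the call, `node` is the shrunken box that takes part in the following median split, so that split still produces exactly two children.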

Chunks

To get the most out of parallelization, the elements are organized in chunks of a fixed number of elements. The idea is to do parallel operations on these independent chunks and gather the per-node results later with a segmented or keyed reduction. Since chunks need not be filled completely, we need to keep track of the chunk size, the index of the first element in element_idx and the owning node index.

Processing these chunks in custom kernels is easy: set the grid size to the number of chunks and the block size to the maximal chunk size. Often, we’ll need to do per-chunk reductions, to get the axis-aligned bounding boxes of the chunks, for instance. This is currently not possible with thrust in a simple way. Instead, we use a single-block reduction function taken from a CUDA SDK example and call it in a custom kernel.

To get node bounding boxes from chunk bounding boxes in the above example, we can simply use thrust::reduce_by_key instead of a custom segmented reduction implementation.
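The two-level reduction can be sketched sequentially on the host. This is a stand-in under assumptions, not the GPU code: `Chunk`, `make_chunks`, `reduce_chunks` and `reduce_by_node` are invented names, bounding boxes are reduced to one-dimensional intervals for brevity, and the elements are assumed to be stored in node order so that the indirection through element_idx can be left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Chunk { int node_idx; int first_element; int size; };
struct Interval { float lo, hi; };

Interval merge(Interval a, Interval b) {
    return { std::min(a.lo, b.lo), std::max(a.hi, b.hi) };
}

// Cut each node's element range into chunks of at most chunk_size elements.
std::vector<Chunk> make_chunks(const std::vector<int>& first_element_idx,
                               const std::vector<int>& node_size,
                               int chunk_size) {
    std::vector<Chunk> chunks;
    for (std::size_t n = 0; n < node_size.size(); ++n)
        for (int off = 0; off < node_size[n]; off += chunk_size)
            chunks.push_back({ static_cast<int>(n),
                               first_element_idx[n] + off,
                               std::min(chunk_size, node_size[n] - off) });
    return chunks;
}

// Per-chunk reduction (one thread block per chunk on the GPU).
std::vector<Interval> reduce_chunks(const std::vector<Chunk>& chunks,
                                    const std::vector<Interval>& elems) {
    std::vector<Interval> out;
    for (const Chunk& c : chunks) {
        Interval r = elems[c.first_element];
        for (int i = 1; i < c.size; ++i)
            r = merge(r, elems[c.first_element + i]);
        out.push_back(r);
    }
    return out;
}

// Keyed reduction: consecutive chunks with the same node index are merged,
// which is what thrust::reduce_by_key does with node_idx as the key.
std::vector<Interval> reduce_by_node(const std::vector<Chunk>& chunks,
                                     const std::vector<Interval>& bounds) {
    std::vector<Interval> out;
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        if (i > 0 && chunks[i].node_idx == chunks[i - 1].node_idx)
            out.back() = merge(out.back(), bounds[i]);
        else
            out.push_back(bounds[i]);
    }
    return out;
}
```

Because the chunks of one node are stored next to each other, the keyed reduction only has to look at runs of equal keys, which is exactly the precondition of `thrust::reduce_by_key`.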

Surface Area Heuristics (SAH)

Once no large nodes are available to be processed, a list of candidate split planes for each small node is needed. We take the boundary planes of the triangles’ axis-aligned bounding boxes, which for a small node containing at most 64 elements generates at most 6*64 split candidates per node.

After some experimenting, the best solution to get the minimal SAH cost of each plane seems to be a somewhat convoluted kernel configuration. We compute the SAH cost for each split in its own thread, and the grid is configured to be as large as the number of nodes. Each SAH cost is stored in shared memory and once the threads of a block have synced, we perform a partitioned reduction to find the minimum split cost, separated by axis. Finally, the minimum of the resulting three values is taken as the minimal SAH cost for the node. To keep the number of threads a power of two, we take 128 threads and process two split candidates (along the same axis) per thread.

The following device function should do the trick:

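// reduce a tuple to a single value
// note: the __syncthreads() calls inside the tid < size branch are only
// well-defined if the block is launched with at most `size` threads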
template<unsigned int size,
         unsigned int tuple_length,
         typename T, class Method>
__device__
T partition_reduction_device(T* input, int* offset,
                             int n_elements, int* index) {
    int tid = threadIdx.x;
    T result;
    __shared__ T temp[size];
    __shared__ int shared_index;
    if(tid == 0)
        shared_index = -1;

    if(tid < size) {
        temp[tid] = Method::neutral_element();
        if(tid < n_elements) {
            temp[tid] =
                Method::reduction_operator(input[tid + offset[0]],
                                           input[tid + offset[1]]);
#pragma unroll
            for(int i = 2; i < tuple_length; ++i)
                temp[tid] =
                    Method::reduction_operator(temp[tid],
                                               input[tid + offset[i]]);
        }
        __syncthreads();

        // reduction in a single block
        result = reduction_device<size, T, Method>(temp);
        __syncthreads();

#pragma unroll
        for(int i = 0; i < tuple_length; ++i)
            // only threads that contributed an element may claim the index
            if(tid < n_elements && result == input[tid + offset[i]])
                shared_index = tid + offset[i];
    }
    __syncthreads();

    if(tid == 0) {
        *index = shared_index;
    }
    return result;
}

Final tree representation

For the final representation of the tree in pre-order, the paper seems to suggest a flat array (the bottom-up procedure computes the number of elements required per node). We can simply use a list of ints, where each node occupies at least two consecutive ints. For non-leaf nodes, we can pack the right child index, the split axis and flags indicating empty space cuts of the left and right children into the first int, while the second one stores the split position cast to an int. The left child directly follows its parent in the array, so its index does not need to be stored. For leaf nodes, we store the element count in the first int and the following N ints contain the indices of the N elements contained in the leaf node.
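One way such a packing could look is sketched below. The bit layout is purely an assumption for illustration (2 bits for the axis, one flag bit per child, the remaining 28 bits for the right child index), as are the function names. I also read “cast to an int” as storing the bit pattern of the float unchanged, since a numeric conversion would truncate the split position. How leaves are told apart from inner nodes is left out here.

```cpp
#include <cstdint>
#include <cstring>

// Assumed layout of the first word of an inner node:
//   bits 0-1 : split axis (0, 1 or 2)
//   bit  2   : left child is an empty-space cut
//   bit  3   : right child is an empty-space cut
//   bits 4-31: index of the right child in the flat array
std::uint32_t pack_inner(std::uint32_t right_child, unsigned axis,
                         bool left_empty, bool right_empty) {
    return (right_child << 4) | (right_empty ? 8u : 0u) |
           (left_empty ? 4u : 0u) | (axis & 3u);
}

unsigned split_axis(std::uint32_t w)       { return w & 3u; }
bool is_left_empty(std::uint32_t w)        { return ((w >> 2) & 1u) != 0; }
bool is_right_empty(std::uint32_t w)       { return ((w >> 3) & 1u) != 0; }
std::uint32_t right_child(std::uint32_t w) { return w >> 4; }

// Second word: the split position, stored bit for bit.
std::uint32_t float_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

float bits_float(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

With 28 bits for the right child index, this particular layout would limit the flat array to 2^28 entries; a real implementation might distribute the bits differently.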

Ray traversal

A nice choice for a parallel ray traversal algorithm is outlined in Appendix C of “Heuristic Ray Shooting Algorithms”. Each ray traverses the scene within two nested while loops, and no information on the node bounding boxes has to be provided, just the splitting plane positions. The basic principle is to traverse the tree from split plane to split plane, processing the nodes intersected first and pushing the far nodes onto a stack. In the case of empty nodes, we should only push the non-empty node onto the stack (if applicable) and jump to the next node immediately.
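The structure of such a traversal, two nested loops and a stack of far children, can be sketched as follows. This is a common textbook formulation of stack-based k-d tree traversal and not a transcription of the algorithm in Havran’s appendix; the `Node` layout with explicit child indices and the function `traverse` are assumptions for this example. Instead of intersecting triangles, the sketch just records which leaves are visited, in order.

```cpp
#include <vector>

struct Node {
    bool leaf;
    int axis;         // inner nodes: split axis
    float split;      // inner nodes: split plane position
    int left, right;  // inner nodes: child indices into the node array
    int leaf_id;      // leaves: some identifier of the leaf
};

struct Ray { float origin[3]; float dir[3]; };

// [tmin, tmax] is the parametric range in which the ray overlaps the
// scene bounding box. Returns the leaf ids in front-to-back order.
std::vector<int> traverse(const std::vector<Node>& nodes, const Ray& ray,
                          float tmin, float tmax) {
    struct Entry { int node; float tmin, tmax; };
    std::vector<Entry> stack;
    std::vector<int> visited;
    int current = 0;                      // start at the root
    while (true) {                        // outer loop: one leaf per pass
        while (!nodes[current].leaf) {    // inner loop: descend to a leaf
            const Node& n = nodes[current];
            float o = ray.origin[n.axis], d = ray.dir[n.axis];
            bool below = o < n.split || (o == n.split && d <= 0.0f);
            int first = below ? n.left : n.right;   // child entered first
            int second = below ? n.right : n.left;  // child behind the plane
            if (d == 0.0f) { current = first; continue; }
            float t = (n.split - o) / d;  // ray parameter at the split plane
            if (t > tmax || t <= 0.0f) {
                current = first;          // plane not reached in this range
            } else if (t < tmin) {
                current = second;         // range lies entirely behind plane
            } else {
                stack.push_back({ second, t, tmax });
                current = first;
                tmax = t;
            }
        }
        visited.push_back(nodes[current].leaf_id);
        // A real tracer would test the leaf's triangles here and stop at
        // the first hit with a ray parameter inside [tmin, tmax].
        if (stack.empty()) break;
        Entry e = stack.back();
        stack.pop_back();
        current = e.node; tmin = e.tmin; tmax = e.tmax;
    }
    return visited;
}
```

Only the split positions and the entry and exit parameters of the ray are needed, which matches the observation above that node bounding boxes do not have to be stored.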

Raytracing and acceleration structures

unvirtual — Sun, 19 Feb 2012 00:00:00 PST

This is going to be a series of posts about raytracing, acceleration structures and, eventually, photon mapping on the GPU, a project I’ve been working on for a while in my spare time. I’ve finally decided to put the current state of this project out into the wild, most of which is an implementation based on “Real-Time KD-Tree Construction on Graphics Hardware” by Zhou et al.

The following is mostly an introductory article for all those around me who are wondering what I’m so excited about. For more details on the current state of the project and the GitHub link, check out this post. Further down is a nice video, though.

Let’s trace some rays …

Raytracing is basically the process of taking data like this:

    # vertex coordinates
    0.0321981 0.0565738 -0.0503247
    0.0330387 0.056612 -0.0501226
    0.0321019 0.0575225 -0.0502063
    0.0330546 0.0575024 -0.0503454
    0.0313077 0.0556267 -0.0500367
    # ...
    # faces (vertex count, then vertex indices)
    3 1 2 3
    3 1 0 2
    3 8 7 5
    3 7 4 5
    # ...

and transforming it into a nice looking picture (created with Blender):

The data is a list of triangles in 3D space that form a 3D model of stuff, a dragon in this case. The triangles are encoded by their vertex positions and by connectivity information, i.e. which vertices form each triangle. Combined with additional information like the positions, brightness and colors of light sources and materials that describe the interaction of light with the model, raytracing is capable of producing photorealistic images like the one above.

What we perceive with our eyes in the real world is a bunch (actually a gazillion) of photons falling onto our retina. These photons bounced around the world, getting reflected or scattered on surfaces and “changing” their wavelength and polarization. Raytracing simulates and approximates light falling into a virtual camera (our eye) after having interacted with objects, but the process of gathering the light information is reversed. Of all those photons flying around, only a very small number in fact make it through our small pupils. So instead of simulating all possible paths of light being emitted from light sources, rays are shot out of the camera into the world and each ray is tested for an intersection with an object that might or might not be illuminated.

At each intersection, the distance and direction to all possible light sources are computed and, given the properties of the material (its translucency, reflectance, etc.), the brightness and color of the light travelling along the ray back to the camera are determined. Shading can be obtained directly from the angle between the surface normal and the direction from the intersection point to the light. With this idea, a whole lot of effects can be simulated by sending out secondary rays from the light source or intersection points: mirror reflections, refraction, shadows, projections and much more. To arrive at the final image, a plane is placed in front of the camera and intersections of the light rays with this plane give pixel positions and colors, composing the projection of the 3D scene on a 2D plane.

Ok, so we know what to do: take 3D triangle data, send out some rays, check for each ray if it intersects with any triangle, compute the light being bounced back, and we are done. If we want reflections and refractions, shoot some more rays from the intersection points, which can be implemented recursively. This is in fact so simple, there exists an implementation that fits on a business card (scroll down to “Minimal ray tracer”).

Not so fast, space partitioning is faster

There is one drawback however: the naive approach requires plenty of computation time, as every ray has to be tested for intersection with every single triangle in the scene. In the vast majority of cases the tested triangle is nowhere even near the ray. Even worse, testing for triangle intersection requires expensive dot and cross products, checking if the found triangle is occluded, etc., just to find 99.9% of the time: nope, no intersection, next triangle, please. Certainly, we can do better than that, right?
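To see where the dot and cross products come from, here is a minimal sketch of the naive approach: one ray tested against every triangle. The posts do not say which intersection test is used; the sketch uses the well-known Möller-Trumbore test, and all type and function names (`Vec3`, `Triangle`, `intersect`, `closest_hit`) are made up for this example.

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 v0, v1, v2; };

Vec3 sub(const Vec3& a, const Vec3& b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

// Moeller-Trumbore test: returns true and sets t if the ray
// orig + t * dir hits the triangle at some t > 0.
bool intersect(const Vec3& orig, const Vec3& dir, const Triangle& tri, float& t) {
    const float eps = 1e-7f;
    Vec3 e1 = sub(tri.v1, tri.v0), e2 = sub(tri.v2, tri.v0);
    Vec3 p = cross(dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return false;      // ray parallel to triangle
    float inv = 1.0f / det;
    Vec3 s = sub(orig, tri.v0);
    float u = dot(s, p) * inv;                   // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return false;
    Vec3 q = cross(s, e1);
    float v = dot(dir, q) * inv;                 // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return false;
    t = dot(e2, q) * inv;
    return t > eps;
}

// Naive approach: test the ray against every triangle and keep the
// closest hit. Returns the triangle index, or -1 if nothing was hit.
int closest_hit(const std::vector<Triangle>& tris, const Vec3& orig, const Vec3& dir) {
    int best = -1;
    float best_t = std::numeric_limits<float>::max();
    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        float t;
        if (intersect(orig, dir, tris[i], t) && t < best_t) {
            best_t = t;
            best = i;
        }
    }
    return best;
}
```

The loop in `closest_hit` is the part that scales with the number of triangles for every single ray, and it is this loop that an acceleration structure is meant to cut short.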

That’s where acceleration data structures come into play.

The basic idea is simple: subdivide the volume containing a soup of triangles to be rendered into smaller boxes and sort triangles into them. Testing for ray-triangle intersection then is a two-step process: traverse the boxes along the ray and if the current box is not empty, check for intersection with the triangles contained in the current box only. If no triangle was hit, we’ll move on to the next box. This technique is especially effective if the full bounding volume can be partitioned such that empty and non-empty space is well-separated and each box contains only a few triangles. Once a ray is found to pass through an empty box, no ray-triangle intersection testing has to be performed for that box at all. How’s that for a speed-up?

There are of course many ways to organize the space partitioning and different data structures have been developed for this task (Octrees, R-Trees, etc.). In the following we will focus on the so-called k-d tree using axis aligned bounding boxes, a binary tree representation of the space subdivision. Each node is associated with a splitting plane defined by a normal vector along one of the main spatial axes. The left and right child nodes represent the bounding volumes to the left and right of this plane, respectively. Constructing the tree top-down, each level of the tree subdivides the initial volume into smaller chunks and keeps track of triangles contained in them.

A basic k-d tree implementation is quite straightforward: just have a look at the dozen or so lines of Python required for the construction of a k-d tree storing point clouds. Only a little more effort is required for triangle soups.
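For comparison, and to match the other code in these posts, here is a minimal sketch of such a point-cloud k-d tree in C++ (it is not the Python code referred to above). It cycles through the axes with the depth of the tree and splits at the median point; `KdNode` and `build` are names chosen for this example.

```cpp
#include <algorithm>
#include <array>
#include <memory>
#include <vector>

using Point = std::array<float, 3>;

struct KdNode {
    Point point;                          // point stored at this node
    int axis;                             // axis of the splitting plane
    std::unique_ptr<KdNode> left, right;  // points below / above the plane
};

// Build a k-d tree over the points in [first, last). The range is
// reordered in place; an empty range yields a null pointer.
std::unique_ptr<KdNode> build(std::vector<Point>::iterator first,
                              std::vector<Point>::iterator last,
                              int depth = 0) {
    if (first == last) return nullptr;
    int axis = depth % 3;
    auto mid = first + (last - first) / 2;
    // Partially sort so that the median along `axis` ends up at `mid`.
    std::nth_element(first, mid, last,
                     [axis](const Point& a, const Point& b) {
                         return a[axis] < b[axis];
                     });
    std::unique_ptr<KdNode> node(new KdNode);
    node->point = *mid;
    node->axis = axis;
    node->left = build(first, mid, depth + 1);
    node->right = build(mid + 1, last, depth + 1);
    return node;
}
```

A tree is then built with `build(points.begin(), points.end())`. The recursion is what makes this version so short, and it is also what does not map well onto a GPU.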


K-d tree construction and traversal on the GPU

Using a k-d tree speeds up raytracing by orders of magnitude on a single CPU system, so beautiful pictures can be rendered in minutes. But what if we’d like even more speed? Maybe so much speed that we could do raytracing in real time?

Before even thinking of the raytracing component, one has to think about the acceleration structure, the k-d tree. If the scene we want to render is static and all triangles composing the scene are fixed in space, the tree has to be constructed only once and one doesn’t have to worry about performance too much. On the other hand, if any part of the scene changes or moves, the tree has to be reconstructed. Since this could happen every frame, the tree building better be quick.

Parallelization to the rescue! Not double or quad core CPUs, but massive parallelization using the GPU.

It turns out, however, that things are no longer as simple as in the dozen lines of code above if we strive for efficiency. We have to worry about using the best parallelized algorithms and respect the restrictions of the hardware. The tree construction has to be split into fragments that can be treated independently, without exchange of information between processes.

Thankfully, as with most awesome things, this work has already been done. “Real-Time KD-Tree Construction on Graphics Hardware” describes an efficient algorithm for k-d tree construction. Following the algorithm outlined in the paper and filling in the missing pieces, I’ve jumped into the implementation.

So far, I have one half of the algorithm working, creating a tree containing a given number of triangles in its leaves after cutting off empty space and performing median splitting.

Implementing this little beast turns out to be quite a bit more involved than it looks at first; more details about the approach will follow in another post. Until then, above is the Stanford Dragon (200k triangles) rendered in Blender with the resulting node bounding boxes of the tree, terminating the construction at 2048, 1024, 512, 256, 128 and 64 triangles per node, respectively. It’s very helpful to have a visual representation for debugging purposes (and some inconsistencies in the empty space cutting are apparent).