Nick's Voxel Blog: Improving Generation Performance

There are quite a few high profile games coming out in the next year or so that make use of voxels to drive dynamic content. Two in particular have really caught my attention, for different reasons. The first was Hello Games' No Man's Sky which is in a similar vein to my own project, using voxel data and Dual Contouring to generate traditional polygonal meshes.

The other is MediaMolecule's new game Dreams which implements a novel rendering technique driven by Signed Distance Fields. You can read the slides presented by Alex Evans at SIGGRAPH here, and there's also a video of a shortened version of the presentation available here. If you've got a spare half an hour I really recommend watching the video.

Seeing both of these games develop has been inspiring. Dreams because they got to a similar-ish place to where my current project is and then decided it was a bit of a dead-end and they could do something much better. That has given me ideas for the long term and some things to chew over in the project as it currently stands. No Man's Sky on the other hand was a much more immediate inspiration since at its core the tech involved is quite similar (No Man's Sky being a lot more sophisticated of course!). One video in particular caught my attention when it came out.

If you've worked on a Dual Contouring engine something in particular about that video should grab your attention: how bloody fast the transitions between the planets are.

If you watch some of my older videos on YouTube its clear how slow my implementation was. I'd since moved most of the work to the GPU via OpenCL but my engine was still orders of magnitude slower than No Man's Sky. That was made even more inspiring when I found out that No Man's Sky was generating the mesh data on the CPU not the GPU! That made it pretty clear that something in my design or implementation was causing the slowdown and there was a better way to do it.

Starting Point

The old implementation was described in a previous post so I'll give an overview here. The world space is divided into discrete 3D volumes (chunks) and each chunk is responsible for storing the Signed Distance Field data required to generate an octree and subsequently a mesh. As an optimisation the octree data (i.e. the fully constructed nodes) for each chunk was stored and could be gathered as required to produce a mesh for any level of detail. Constructing a chunk therefore means evaluating the nodes of every level of detail no matter how far away from the current camera position the chunk happens to be.

A constructed chunk would then look something like this:

Conceptual contents of a chunk

To construct a mesh for a given LOD, 8^n chunks are queried and the nodes for relevant LOD are selected and copied into a new buffer which holds all the LODn nodes for all the chunks inside the volume. For LOD0 nodes this is simple, just lift the LOD0 nodes from a single chunk and run the Dual Contouring algorithm over them. For LOD1 up to 8 nodes are selected (typically only 3-5 would actually contain the surface/exist) and all the LOD1 nodes selected, copied into a buffer and the run through the DC algorithm. And so on.

For the first couple of LODs this isn't too bad but that 8^n term soon starts to catch up with us. 1 node, 8 nodes, 64 nodes (!), 512 nodes (!!), 4096 nodes -- holy shit! That description alone gives a good indication of why the old implementation was so slow but the poor design was compounded by the implementation.

How No to Write GPGPU code

Writing good GPGPU code requires new techniques to perform seemingly simple tasks. For example if you have an array of data and want to categorise it somehow (e.g. find out which voxels actually contain the isosurface...) then you'll need to use a Scan and Compact technique. These problems are obvious when you start writing GPGPU code but there are more subtle traps further down the line. The particular trap I'd fallen into here was organising the chunk data in such a way that constructing the data using the GPU was always going to be incredibly slow.

The data organisation was partly driven by the use case described above where a chunk would be required to supply its LODn nodes for any LOD at any given time. It was also driven by an assumption I had made about how the LOD > 0 nodes had to be constructed. In the original description of the algorithm the octree's the leaf nodes (i.e. LOD0) are created by sampling the SDF and all the other nodes in the tree are created by combining groups of up to 8 child nodes until eventually the root node is created. That meant that for a chunk to be able to handle a request for LODn nodes, LODn-1 to LOD0 nodes would also need to be created first, and so the whole octree needed to be created at once and stored for later use.

Each chunk was represented by a instance of a struct like this (some complications omitted):

    const int MAX_OCTREE_DEPTH = ...;
    struct GPUChunk
    {
        cl::Buffer    positions, normals;
        int           nodeOffsets[MAX_OCTREE_DEPTH];
    };

The data in each buffer is organised of contiguous runs of nodes of a given size, ordered from LOD0 to LOD(MAX_OCTREE_DEPTH-1).

Actual data layout for a chunk

The nodeOffsets array stores the index of the first node of each LOD, and so could be used to find both the first node and the number of nodes in the contiguous block, e.g.:

    const int firstNode = lod == 0 ? 0 : chunk.nodeOffsets[lod];
    const int numNodes = chunk.nodeOffsets[lod+1] - firstNode;

This makes the process of copying the nodes simple: get the start index and size and copy the data from one buffer to the single big buffer. Unfortunately what that meant in reality was issuing potentially hundreds or even thousands of 'Copy Buffer' GPU commands when constructing the bigger LOD nodes. That in itself is bad, but it gets worse.

As I mentioned before there was an assumed dependency on the child nodes of any node during construction, that the creation of LODn nodes depended on all the LODn-1 children having already been created. The only viable option was to construct the nodes in order from LOD0 (the LOD0 being created by sampling the SDF) up to LODn.

The octree is actually implemented as a hash table where each node in the octree has a unique ID created from its position and depth. The ID is created by encoding the node's position in relation to its siblings, i.e. each of the 8 child slots has a unique position starting at [0, 0, 0], [0, 0, 1], ... [1, 1, 1]. This position is encoded as a 3 bit string/number and concatenated with all the node's parent nodes unique IDs. This means that given any node's unique ID the parent's unique ID can be generated by bit shifting the last 3 bits from the node's own code, i.e.

    const int parentID = nodeID >> 3;

The first stage of the chunk construction is sampling the SDF to find the active voxels. This produces a list of unique IDs for the LOD0 nodes. This list can be processed to find the parent codes and so we end up with a loop like this:

    currentIDs = leafIDs
    allIDs.insert(currentIDs)
    while (size(codes) != 1)
    {
        parentIDs = calcParentIDs(currentIDs)
        allIDs.insert(parentIDs)
        currentIDs = parentIDs
    }

Except this is on the GPU so really the loop is more like this:

    cl::Buffer allIDs = CreateGPUBuffer(sizeof(int) * MAX_CHUNK_NODES)
    cl::Buffer currentIDs = leafIDs;
    CopyGPUBuffer(allIDs, currentIDs);
    while (size(codes) != 1)
    {
        cl::Buffer parentIDs = RunGPUKernel(calcParentIDs(currentIDs));
        CopyGPUBuffer(allIDs, parentIDs);
        currentIDs = parentIDs;
    }

(The OpenCL C++ Wrapper makes this code a bit nicer since the 'currentIDs = 's are just handled by updating the cl::Buffer object to reference a different cl_mem rather than requiring a copy.)

This is already looking pretty bad since as the LOD increase from zero the size of the data set being processed is being reduced with the last couple of levels processing only a few hundred or even a few dozen items. That means the overhead of issuing the GPU commands is out-weighing the cost actually performing the work. That's one problem with this code.

There's another complication with this loop that I've not mentioned yet. If any node in the list has a sibling then duplicate parent IDs will be generated. E.g. if node 010011 and node 010010 exist then both will generate the parent code 010. These duplicate values need to be found and removed.

Removing duplicates from a list usually involves either sorting the list explicitly or doing some other processing that involves sort-like behaviour. Sorting on the GPU is not a particularly straightforward operation and tends to have a lot of overhead with lots of GPU commands involved. To really make it worthwhile the data set has to be very large, and in this case the lists of parent IDs will at most be thousands and in the worst case only a dozen or so.

The algorithm I implemented was from this fantastic paper by Guy Blelloch which is itself worthy of a blog post. It was better than a full-blown sort but still involves too much overhead for my data sizes. The algorithm works by iterating over the values and splitting the input list into two new lists, values known to be unique and values that might be duplicates. The process continues until the list of possible duplicates is empty. The algorithm is (very) roughly as follows:

    potentialDuplicates = inputData;
    while (size(potentialDuplicates) > 0)
    {
        sparseKnownUnique = RunGPUKernel(findSomeUniqueValues(potentialDuplicates));
        knownUnique = ScanAndCompact(sparseKnownUnique);

        sparsePotentialDuplicates = RunGPUKernel(findPotentialDuplicates(potentialDuplicates, knownUnique));
        potentialDuplicates = ScanAndCompact(sparsePotentialDuplicates);
    }

The ScanAndCompact operations depend on performing two blocking reads from the GPU, which are slow. Typically the algorithm would run for between 5 and 10 iterations which meant paying the price for up to 20 blocking reads. Anything which calls this process is itself going to be slow, and even worse the octree creating called it multiple times.

The real loop is therefore more like this:

    int bufferOffset = 0;
    int nodeOffsets[MAX_LOD];
    nodeOffsets[0] = bufferOffset;

    cl::Buffer allIDs = CreateGPUBuffer(sizeof(int) * MAX_CHUNK_NODES)
    cl::Buffer currentIDs = leafIDs;
    CopyGPUBuffer(allIDs, currentIDs, bufferOffset);
    
    for (int i = 1; i < MAX_LODS; i++)
    {
        cl::Buffer parentIDs = RunGPUKernel(calcParentIDs(currentIDs));

        // high-cost process that involves multiple blocking reads
        parentIDs = RemoveDuplicates(parentIDs);
        
        bufferOffset += size(parentIDs)
        nodeOffsets[i] = bufferOffset;

        CopyGPUBuffer(allIDs, parentIDs, bufferOffset);
        currentIDs = parentIDs;
    }

There are some more steps involved in actually constructing the octree but it was this process that dominated the time take for the whole process.

All of that adds up to an octree creation process which depends on looping over a very slow operation (along with other problems like having data sizes which are too small to cover the GPU command overhead), and a mesh generation process which involved potentially thousands of GPU copy buffer commands.

The octree creation itself was slow but it was made worse by another requirement: in order to create a mesh for a given LOD all the chunks which contribute to the LOD node need to first be created. For LOD0 that's no problem, create one chunk use the octree to generate a mesh. LOD1 is up to 8 chunks. LOD2 is up 64 chunks. LOD3 is up to 512 chunks. That 8^n term is catching up with us again. In order to be able to create large meshes representing the world at low detail required not only a huge amount of processing (e.g. 512 chunks with a 64^3 volume size means processing 134,217,728 nodes!) but a poor implementation which required a lot of synchronisation between CPU and GPU.

The reasons for my implementation being slow are pretty clear now, but that leaves another question unanswered -- how is No Man's Sky so much faster?

An Ass of U and Me

After banging my head against the wall trying to make that implementation faster and reading lots of papers about voxels, octrees and GPGPU programming I was getting pretty frustrated at not being able to make any progress. I took a step back and looked at the whole process and realise it was all predicated on an assumption I had never actually tested: that the only leaf nodes could be created from sampling the SDF and all other nodes had to be constructed by combining the pre-existing child nodes.

My assumption was that creating non-leaf nodes by sampling the SDF would lead to artifacts that would cause gaps in the seams and potentially other hellish problems I hadn't considered. Since I was not making any progress with my attempts to improve the existing implementation this seemed worth trying, despite my specticism.

Removing the 'LODn-1 must exist before creating LODn' constraint simplifies the octree creation. Now instead of requiring that first the chunk volume was sampled as 64x64x64 voxels and then all the parent nodes were created (involving the painful looping process above), only the initial step of sampling the voxels is required. The sampling process is pretty straightforward, every chunk consists of a 64x64x64 voxel volume with the size of the voxels doubling each time so that the single octree covers the volume of what would have been multiple chunks in the old system. Everything else about the process remains more or less as described in the OpenCL post.

The old process was to pre-generate the octrees and then to fulfill a request for a LODn mesh, gather all the pre-generated nodes from the contributing chunks and use these nodes to create firstly a new octree and then subsequently the requested mesh. With the constraint removed that process becomes a lot simpler: to fufil a request for an LODn mesh, create a LODn octree by directly sampling the SDF and then generate the mesh from the octree. That might seem like a subtle difference for the LOD0 nodes, but for the other LOD nodes it makes a huge difference. Creating an LODn octree only requires sampling a single 64x64x64 volume (i.e. 262,144 voxels) -- compared to the 134,217,728 nodes required for an LOD3 node in the old system that's a huge reduction.

Not only does removing the constraint mean the octree creation itself is a lot faster (no more hellish loops) but it also means that the number of octress that need to be created is also reduced.

I made the changes expecting some improvement, found that in fact I had reduced my test scene's load time from about 80s to just about 2-3s. After making some other changes to account for the new design I was able to shave a bit more off that finally able to get No Man's Sky-like performance when generating the volume from scratch (which is presumably how those fast transitions work).

I posted a video a while ago after I had made these changes, and hopefully through the awful colours, brightness and YouTube compression you can see how fast the regeneration is here.

The video also shows that my assumption was wrong -- there aren't any artifacts introduced by sampling at different resolutions. Overall this rewrite had a fantastic outcome, I managed to get the generation speed somewhere in the No Man's Sky ballpark and learned a few lessons about how not to write GPGPU code!

There's Always a But

Unfortunately these changes completely broke the CSG operation handling and I had to rewrite that too, but that's for another post...

10 comments:

Toxic Wolf30 December 2015 at 11:52
I've been reading your blog for a while now trying to fully understand how DC works so that I can implement it myself. Your posts have been a great source of insight, however there is one topic that still eludes my understanding.

I understand that the algorithm typically samples a Signed Distance Function as a source of the volumetric data, however how does this work with runtime-modifiable terrain? In what format do you store the volumetric data once it has been modified?
Zauber Paracelsus1 January 2016 at 08:25
I had been reading around and discovered another isosurface algorithm called "Adaptive Skeleton Climbing". I lack the mathematical knowledge to fully understand what it is doing, though one thing it does which (allegedly) eliminates seam/gap issues at LOD transitions, is that it first generates the lowest LOD level, and then uses that as input for generating the next LOD level, etc. No idea if it supports smooth and hard edges like dual-contouring, though.

Is that technique (generating lower LOD's first) something that could be applied practically to what you're doing?

The research paper and code is published here: http://www.cse.cuhk.edu.hk/~ttwong/papers/asc/asc.html
JohnnySabotage4 January 2016 at 23:14
Nick,
I'm just a huge fan of yours. I have spoken to you on reddit a few times. Your ability to share solutions is second to none. I love you brother, thank you.
Unknown3 May 2016 at 05:26
Hey man, just stumbled on your blog and it looks awesome! I'm about to give it a thorough read but I was wondering if these techniques are viable for regenerating geometry every frame. I have about 100x100x100 voxels NOT generated by noise, that update every frame. (Something like polygonizing dense particles that have discrete positions in a 3d grid, in real time). I am targeting 2-3 fps for high range hardware. Is it doable?

Nick's Voxel Blog

Tuesday, 22 December 2015

Improving Generation Performance

Starting Point

How No to Write GPGPU code

An Ass of U and Me

There's Always a But

10 comments:

Blogroll

About Me