Monday 19 October 2015

Video update for October 2015

I've posted a new video to YouTube showing some of the speed improvements I've made recently.



Since the last video in February I've moved the entire voxel -> mesh pipeline (except for the seam mesh generation) to the GPU via OpenCL and reworked how the volume is processed. I'm planning a blog post to explain the changes, but the main thing is that I no longer always sample the field at a 1:1 resolution. E.g. for a LOD1 node I now sample a single 128x128x128 volume instead of 8 volumes, and for a LOD2 node a single volume instead of 64. This has had a huge impact on performance, which has helped with other features, like the dragging operations you can see in the video.
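
To give the gist ahead of that post: each clipmap node gets one fixed-resolution sample grid, and only the sample spacing grows with the LOD level, so a coarse node costs the same to sample as a fine one. A minimal sketch of the idea (exact resolutions aside; the names and constants here are illustrative, not lifted from my code):

    #include <vector>

    float Density_Func(float x, float y, float z);   // the scalar field

    // One fixed-size grid per clipmap node: the node's extent doubles with
    // each LOD level, so the sample spacing doubles too, instead of the
    // sample count multiplying by 8 per level.
    std::vector<float> SampleNode(float minX, float minY, float minZ, int lodLevel)
    {
        const int   res      = 64;                        // same at every level
        const float nodeSize = 64.0f * (1 << lodLevel);   // world-space extent
        const float spacing  = nodeSize / res;

        std::vector<float> samples(res * res * res);
        for (int z = 0; z < res; z++)
        for (int y = 0; y < res; y++)
        for (int x = 0; x < res; x++)
        {
            samples[(z * res + y) * res + x] = Density_Func(
                minX + x * spacing, minY + y * spacing, minZ + z * spacing);
        }
        return samples;
    }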

14 comments:

  1. It seems you are some steps ahead of me. Part of the reason might be because I lost the will to work on anything for like a whole month but life is life.

    I have yet to move the voxel -> mesh pipeline entirely to the GPU. Seam generation is still kind of broken, and I've only gotten a start on changing the sampling resolution so it's no longer 1:1.

    What I have done is implement volumetric placement of resources (textures) (however limited that feature currently is). And next, while this might perhaps be a bad move, I figured out I can test whether a chunk of any resolution is empty by sampling it at a rather low resolution. Even if it's a 64 * 64 * 64 chunk, I can test it with a 4 * 4 * 4 resolution check to see if it's empty. I originally implemented this change on a 1:1 ratio, but had it handle the empty checks ahead of the actual chunk filling operations, so that I could maximize the number of empties it could test before it needed to voxelize an actual filled chunk. I did this because I could not run the second half of the voxelization (which I handle in a secondary thread) at the same time as I run a GPU operation. Why? Because Unity (I'm fairly sure it's Unity's fault at least).

    Anyway... The 1:1 ratio thing is what I did at first. But then I figured: hmm, if I can test even high-res chunks at a low res to check if they're empty, why not increase the scale of the test proportionally, covering more space at the same overall sampling resolution with fewer individual GPU accesses? And using an octree-based restructuring, that's exactly what I did (rough sketch below)...
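
    In sketch form the whole thing is roughly this (hypothetical names, not my actual Unity code, and it assumes the density function is a signed distance bound so that a sparse probe can actually prove a region empty):

      #include <cmath>

      float Density_Func(float x, float y, float z);   // the field being voxelized

      // Coarse empty-space test: probe a low-res grid over a (possibly huge)
      // region. If every sample is farther from the surface than its probe
      // cell's half-diagonal, the surface can't pass through the region.
      bool RegionMaybeHasSurface(float minX, float minY, float minZ, float size)
      {
          const int   res     = 4;                       // fixed 4x4x4 probe
          const float spacing = size / res;
          const float safe    = spacing * 0.8660254f;    // half of a cell's diagonal

          for (int z = 0; z < res; z++)
          for (int y = 0; y < res; y++)
          for (int x = 0; x < res; x++)
          {
              const float d = Density_Func(minX + (x + 0.5f) * spacing,
                                           minY + (y + 0.5f) * spacing,
                                           minZ + (z + 0.5f) * spacing);
              if (std::fabs(d) <= safe)
                  return true;               // surface could be inside this cell
          }
          return false;
      }

      // Octree-style driver: one cheap probe can discard a huge region, and
      // only regions that might contain surface get subdivided down to chunks.
      void TestRegion(float minX, float minY, float minZ, float size, float chunkSize)
      {
          if (!RegionMaybeHasSurface(minX, minY, minZ, size))
              return;                                    // whole region empty/full
          if (size <= chunkSize)
          {
              // QueueChunkVoxelization(minX, minY, minZ);   // hypothetical hook
              return;
          }
          const float h = size * 0.5f;
          for (int i = 0; i < 8; i++)
              TestRegion(minX + (i & 1) * h,
                         minY + ((i >> 1) & 1) * h,
                         minZ + ((i >> 2) & 1) * h, h, chunkSize);
      }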

    However, that actually brings me to where I am now. I have a couple of leftover bugs. One: I haven't been able to get modification of empty chunks working right now that I've restructured how they consume space. In general I think it's simply a matter of figuring out where I made the modification, then subdividing down to chunk-size space at the place I made the modification. And for when I completely empty out a chunk, I believe I need to subdivide to account for the newly emptied space.

    I'm probably just getting the position wrong somehow... I initially made a mistake in my refactored empty space testing code in which I wasn't feeding the GPU the proper sampling position (I needed to offset the position by half the size of the empty pocket of space in question).

    There also seems to be a problem with my chunk deletion right now, but no other new bugs besides that.

    Other than that, I made it so that the voxel scale of the world can be less than 1 by multiplying a few things by a floating point value. It was actually easier to implement than I thought it would be. And I implemented Space Colonization-based tree generation code that I haven't really expanded on nearly enough yet (see the sketch below). I do know that the way I'm actually drawing the trees into the world needs to be completely refactored, as drawing each tree as an array of small individual primitive shapes that all need to be sampled during voxelization is WAY too slow. I haven't implemented a true procedural generation system for trees and other such objects yet, but I think I came up with an idea that has some potential quite recently.
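
    For context, the core loop of Space Colonization (after Runions et al.) is pretty small. A rough illustrative sketch, not my actual implementation:

      #include <cmath>
      #include <vector>

      struct Vec3 { float x = 0, y = 0, z = 0; };
      static Vec3  Sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
      static float Len(Vec3 v) { return std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z); }
      static Vec3  Norm(Vec3 v) { const float l = Len(v); return { v.x / l, v.y / l, v.z / l }; }

      struct Branch { Vec3 pos; int parent; };   // parent index forms the tree

      // One growth step: each attraction point pulls its nearest branch node
      // (within influenceRadius); every pulled node grows a new node toward
      // the averaged pull; attractors within killRadius of a node die off.
      void GrowStep(std::vector<Branch>& tree, std::vector<Vec3>& attractors,
                    float influenceRadius, float killRadius, float stepSize)
      {
          std::vector<Vec3> pull(tree.size());
          std::vector<int>  pullCount(tree.size(), 0);

          for (const Vec3& a : attractors) {
              int nearest = -1;
              float nearestDist = influenceRadius;
              for (size_t i = 0; i < tree.size(); i++) {
                  const float d = Len(Sub(a, tree[i].pos));
                  if (d < nearestDist) { nearestDist = d; nearest = int(i); }
              }
              if (nearest < 0) continue;
              const Vec3 dir = Norm(Sub(a, tree[nearest].pos));
              pull[nearest].x += dir.x; pull[nearest].y += dir.y; pull[nearest].z += dir.z;
              pullCount[nearest]++;
          }

          const size_t oldCount = tree.size();
          for (size_t i = 0; i < oldCount; i++) {
              if (pullCount[i] == 0 || Len(pull[i]) < 1e-6f) continue;
              const Vec3 dir = Norm(pull[i]);
              tree.push_back({ { tree[i].pos.x + dir.x * stepSize,
                                 tree[i].pos.y + dir.y * stepSize,
                                 tree[i].pos.z + dir.z * stepSize }, int(i) });
          }

          // Attractors that a branch node has reached are consumed.
          for (size_t i = 0; i < attractors.size(); ) {
              bool reached = false;
              for (const Branch& b : tree)
                  if (Len(Sub(attractors[i], b.pos)) < killRadius) { reached = true; break; }
              if (reached) attractors.erase(attractors.begin() + i); else i++;
          }
      }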

    Other than fixing up those leftover bugs, my next ideas for optimizing chunk loading include: sampling multiple chunks of the same LOD all in one go (which will also save some time in my empty space testing), and eventually moving the rest of the voxel -> mesh process to the GPU so that I can finally get it off that secondary thread, which should double the speed of loading chunks that actually do get a mesh in them. And MAYBE there is something I can do about a better solution for generating collision detection. I'm not really sure about it; I know it can take 30 to 40 ms to generate the collision mesh for a 64 * 64 * 64 res chunk using Unity's collision mesh generation.

    Replies
    1. > And for when I completely empty out a chunk, I believe I need to subdivide to account for the newly emptied space.

      I believe I need to UNsubdivide

    2. This comment has been removed by the author.

    3. I have no idea why my comments doubled.

    4. Ah crud, just realized I should have used that other comment to extend my message, as I had something else to say regarding trees, and I should post a few screenshots of what I've been doing.

      First I'll post my update screenshots:
      http://i.imgur.com/XATqhZ1.jpg
      I implemented a few updates up to this screenshot. I implemented volumetric resources such that you can find grass on the top layer of the planet, dirt below, and rock somewhat below that. And even though it's not visible here, I also made a single-octave noise field for fluctuating the surface height of the planet. And I implemented trees using Space Colonization, with a placement method along the surface of the planet that I can best describe as NOT actual procedural generation.

      http://i.imgur.com/GCNCUEB.jpg
      Here I implemented the voxel scale factor that allows me to have voxels smaller than a meter. The showcased scale in this screenshot is 0.25 meter voxels. Also, I'm not sure if I noted this before at any point, but I'm a brony. The pony in the screenshot doesn't really have much to do with anything regarding the project, but I already had a pony solution from my other project and I wanted to pony it up a bit.

      http://i.imgur.com/c1bKDTW.jpg
      No changes here, just showcasing a tree drawn at 0.25 meter voxel scale. Note you can actually make out the spheres used to draw it at this detail.

      And these two are regarding my latest change, the restructuring of how I test empty space:
      http://i.imgur.com/2t6guK4.png
      The first shows a 2 kilometer scene of 0.5 meter voxels, loaded in somewhat over a minute.

      http://i.imgur.com/UUgwZot.png
      And this shows a 2 kilometer scene of 1 meter voxels, but the point I wanted to show in this case is the Octree of cubes representing each individual empty space test.


      What I have to say regarding trees, and potentially a lot of other things later on, has to do with how I should draw them into the world.

      My issue, as I mentioned before, is that I'm drawing the trees as an array of individual spheres. That's quite a lot of spheres for each voxel in a chunk to test against. And well... I'm not sure how to implement my only ideas for lowering the cost.

      My ideas basically consist of somehow implementing density algorithms that generate customizable lines and curves (which I'm not sure how to do) and telling them to sample shapes along those lines/curves.

      My actual idea for how I tell it to use a shape gets its basis from this:
      http://i.imgur.com/9qesAq1.png
      This is a screenshot from further back than my empty space octree update, but I wanted to show this one here specifically. Basically, the shape you see here is a grid of toruses wrapped around the contours of a sphere. I manage this by sampling only a sphere and a single torus, with just the tiny additional cost of running a modulus operation on the x, y, and z to repeat the torus shape. And because of the sphere, the grid of toruses contours along the sphere's shape. Technically I sample a second sphere to erase the inside of the generated sphere, because the full 3D grid of toruses was still being generated throughout the whole sphere, so I needed to erase the inside to get the effect shown in the screenshot.

      The point still stands that I managed to get all these toruses drawn with a single sampling of the torus shape by using modulus-based repetition. IF I can somehow mix this concept with the shape of the curve being a line or a spline or something, I will be able to lower the sampling cost of drawing a tree, and potentially lots of other shapes in the future, by quite a margin.
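
      To make the trick concrete, here's a rough sketch of that kind of density function (the cell size and radii are made-up values, not the ones from the screenshot):

        #include <cmath>

        // A torus in the XZ plane and a sphere, as classic signed distance
        // functions.
        float SdTorus(float x, float y, float z, float major, float minor) {
            const float q = std::sqrt(x * x + z * z) - major;
            return std::sqrt(q * q + y * y) - minor;
        }

        float SdSphere(float x, float y, float z, float r) {
            return std::sqrt(x * x + y * y + z * z) - r;
        }

        // One torus, sampled once but repeated infinitely by folding space
        // with a modulus, then clipped to a thin spherical shell so the grid
        // hugs the sphere's surface (the "second sphere" erases the inside).
        float GridOfToriOnSphere(float x, float y, float z) {
            const float cell = 4.0f;                       // repetition period
            auto rep = [cell](float v) {                   // fold into [-cell/2, cell/2)
                return v - cell * std::floor(v / cell + 0.5f);
            };
            const float tori  = SdTorus(rep(x), rep(y), rep(z), 1.2f, 0.3f);
            const float outer = SdSphere(x, y, z, 20.0f);  // inside the big sphere
            const float inner = -SdSphere(x, y, z, 18.0f); // outside the small one
            const float shell = std::fmax(outer, inner);   // intersection = thin shell
            return std::fmax(tori, shell);                 // tori clipped to the shell
        }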

    5. That's quite a lot to reply to but I'll do my best!

      > It seems you are some steps ahead of me. Part of the reason might be because I lost the will to work on anything for like a whole month but life is life.

      I think that's quite a natural thing if you're working on a project like this by yourself for an extended period. E.g. a couple of times over the course of this I've taken a break of a month or more from it & every time I've come back with more enthusiasm than I had before. It's better to take a break occasionally than to get burnt out.

      > 4x4x4 check versus the 64x64x64

      Have you got that bit on the GPU? I did some experiments with using a heightmap to reduce the number of voxels I'd need to check based on their Y value against the heightmap, and found that it hardly made any difference. I.e. it didn't cost very much to actually process the empty voxels.

      > However, that actually brings me to where I am now. I actually have a couple leftover bugs. One: I haven't been able to get modification of empty chunks working right ...

      I was looking at this problem myself recently. Originally I had the clipmap octree pruned of any nodes that don't contain the surface, but I found that was a pain in the arse when the CSG ops caused the (now deleted) node to contain the surface. Instead I now just have a "full" octree where a leaf node is created for every chunk, and if the mesh generation doesn't produce a mesh I mark the node "empty". CSG operations can then touch a node and mark it "not empty", in which case the mesh generation is tried again on the next update. That seems a bit wasteful but it makes life a lot simpler.
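
      In sketch form the bookkeeping is tiny (illustrative names, with GenerateMesh standing in for the real pipeline):

        enum class NodeState { Unknown, Empty, Meshed };

        struct ChunkNode {
            NodeState state = NodeState::Unknown;
            // bounds, children, mesh handle, ...
        };

        bool GenerateMesh(ChunkNode& node);   // the actual voxel -> mesh work

        // Leaf nodes are never deleted: a node that produced no mesh is just
        // flagged Empty so it can be skipped until something touches it.
        void UpdateNode(ChunkNode& node) {
            if (node.state != NodeState::Unknown)
                return;                       // Empty or Meshed, nothing to do
            node.state = GenerateMesh(node) ? NodeState::Meshed : NodeState::Empty;
        }

        // A CSG op just resets every node it touches; mesh generation is
        // retried on the next update.
        void TouchNode(ChunkNode& node) {
            node.state = NodeState::Unknown;
        }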

      > I do know that the way I'm actually drawing the trees into the world needs to be completely refactored, as drawing each tree as an array of small individual primitive shapes that need to all be sampled by voxelization is WAY too slow. I haven't implemented a true procedural generation system for trees and other such objects yet, but I think I came up with an idea that has some potential quite recently.

      This sounds pretty cool, it's something I've been meaning to get started with (procedural gen of something other than terrain) for a while but not got around to. Interestingly, I had to tackle the 'apply lots of CSG operations at once' problem to fix the LOD switching after changing the way the data was stored. Previously I always had the LOD0 density field available and just wrote the CSG operations to that, but now only the currently visible LOD node's density field is loaded. That meant you could do a CSG op, then move the camera, and the op would not be present on the mesh when the LOD switched. I'll cover that in the blog post about how this all works now.

      > MAYBE there is something I can do about making a better solution to generating collision detection. I'm not really sure about it really, I know that it can take like 30 to 40 ms to generate the collision mesh for a 64 * 64 * 64 res chunk using Unity's collision mesh.

      Again, something I've just tackled! :) What I do is generate a 3D texture (or really, a 2D image array) recording the signed distance field values for the terrain and any CSG ops. That means I can use sphere tracing to quickly cast rays into the volume, which is what I use for positioning the CSG ops in the video. Unfortunately that comes with quite a bit of overhead since I'm only tracing 1 ray and doing it on the GPU, but I should be able to add collision detection for a lot of entities quite easily in future (and it's a lot better than using Bullet, which is what I did before!).
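
      The tracing loop itself is only a few lines. A minimal CPU-style sketch (illustrative; the real thing is an OpenCL kernel sampling that image array):

        #include <cmath>

        struct Vec3 { float x, y, z; };

        float SampleSdf(const Vec3& p);   // stand-in for the 3D texture lookup

        // Sphere tracing: an SDF value is a guaranteed-safe step size, so the
        // ray can always leap forward by the distance sampled at its current
        // position.
        bool TraceRay(Vec3 origin, Vec3 dir /* unit length */, float maxDist, Vec3* hitPos) {
            float t = 0.0f;
            for (int i = 0; i < 128 && t < maxDist; i++) {   // iteration cap
                const Vec3 p = { origin.x + dir.x * t,
                                 origin.y + dir.y * t,
                                 origin.z + dir.z * t };
                const float d = SampleSdf(p);
                if (d < 0.001f) {             // close enough, call it a hit
                    *hitPos = p;
                    return true;
                }
                t += d;                       // the sphere-trace leap
            }
            return false;
        }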

      > CSG gen stuff

      That all looks pretty cool! I've not really got to that stuff yet so I don't have much input (yet!).

  2. This comment has been removed by the author.

  3. (4th try to post here)
    Thanks for the DC CPU example... it differs slightly from dc.java; I'm not sure if the QEF works the same, because of torn edges at subtracted geometry.

    About empty space skipping, there may be an optimization:

    void sample_grid_spherical_opt(i32_t x, i32_t y, i32_t z) {
        // Sample the field once; for a true SDF, |value| is the radius of a
        // sphere around (x,y,z) that contains no surface.
        float sgn = Density_Func(x, y, z), rad = fabs(sgn);
        grid[x][y][z] = sgn, grid_set[x][y][z] = true;
        if (rad < 3) return; // too close to surface - better to compute real values
        sgn = sign(sgn);

        // Clamp the empty/full sphere to the grid bounds.
        const vec3i cs(max(0.0f, x - rad), max(0.0f, y - rad), max(0.0f, z - rad)),
                    ce(min(rootsize + 1, x + rad), min(rootsize + 1, y + rad), min(rootsize + 1, z + rad));

        loopi(cs.x, ce.x)
        loopj(cs.y, ce.y)
        loopk(cs.z, ce.z) if (!grid_set[i][j][k]) {
            const float val = length(vec3(x, y, z) - vec3(i, j, k));
            if ((val + 3) < rad) {
                // (rad - val) is a conservative triangle-inequality bound on
                // the SDF at (i,j,k), not an exact value.
                grid[i][j][k] = (rad - val) * sgn, grid_set[i][j][k] = true;
            }
        }
    }
    Call it recursively until size 8 is reached. With that, I got a boost from 13.66 sec to 6.59 sec for a tree root size of 256 (about 50% of the time).
    It saves 16,656,995 SDF sample calls (3,081,360 instead of 19,738,355) and does interpolation instead. This is actually a rough approximation, because the SDF doesn't always return the radius of an empty sphere.
    How much time do you spend on one block's generation? CPU | GPU?

    Replies
    1. Hi,

      I think the problem you mention (torn edges) might be related to your optimisations. If you use an approximation of the density func rather than the func directly then you get problems with discontinuities in the surface. Although the CPU implementations of the noise functions are very slow, so I understand why you're trying to do this :)

      Not got any timings for the CPU but it was very slow compared to the GPU version. The first part of the pipeline I moved over was the density field generation and the speed up was amazing. I think it used to take a couple of mins to generate about 8 64^3 voxel chunks.

    2. Just look at your example app, aren't the edges torn for you?
      http://2.bp.blogspot.com/-zlqHzbv-h0s/VF7DHYgJCNI/AAAAAAAAAFs/JUYW7TR85AA/s1600/scene.jpg
      http://imgur.com/GtA7kMg
      That's why I think the QEF might move vertices way off, in a non-manifold way (http://faculty.cs.tamu.edu/schaefer/research/dualsimp_tvcg.pdf), or something else

      So, how many msec does the GPU version take for some cubical space? 256^3 for example.
      Thanks

    3. That's not what I imagined you meant by "torn". I think the problem in my sample is due to the QEF impl needing to be tweaked (use the one from my QEF repo on github instead) & a separate problem with how the normals are calculated for some vertices. E.g. if you think of the vertex on the corner of a cube, it will get an average of the normals of the 3 faces it belongs to, which can make the edges look soft/wrong depending on how they are shaded. In my latest video I'm using normals calculated in the OpenGL shaders to help fix that and make the edges of cubes etc. sharper.

      I can't build a 256^3 volume as it stands (I get some errors that I've not looked into), but to go from nothing to a mesh for a 64^3 volume (which is my current size) is about 10-15 ms I think. Creating an empty/full SDF that has no surface takes about 5-7 ms at that size. The interesting thing with the OpenCL code is that typically the bottlenecks come from having to read some data back to the CPU to drive the next phase of the OpenCL work, rather than the GPU being really busy and taking time to complete tasks. I'm running all this on a GTX 970 but it ran pretty well on my old 9790.

  4. This comment has been removed by the author.

  5. That's an awesome video, Nick! Love the progress!

    How are you doing the LOD changes? I mean, I know the basics that you've been over before, but how do you tell each chunk which LOD to render? I've been toying with having each OctreeNode contain a Surface class and an LOD id, so that you can tell each parent octree what LOD should be displayed and it can dive down... but doing that, or having each octree node independently look up its distance from the player... SO expensive! At this point I'm struggling to get a system running that doesn't use more CPU than the DC component. I would LOVE to see a blog post on that as well!

    Replies
    1. Thanks :)

      The current method is a bit hacky, which is why I've not written it up. I run the update a few times every second (I think at 5 Hz currently) and each time the whole clipmap octree is traversed top-down. I have a set of constant distances associated with each size of clipmap node (i.e. LOD level), and as the octree is traversed I measure the distance from the camera to the node and compare that against the constant distance. If the measured distance is less than the constant distance *and* none of the parent nodes are active then the node becomes active and a mesh generation event is queued. Traversing top-down allows the clean-up of the old active nodes to be queued too.
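
      Boiled down, the traversal is roughly this (an illustrative sketch, not lifted from my code, with the active/parent bookkeeping simplified):

        struct Vec3 { float x, y, z; };
        float Distance(const Vec3& a, const Vec3& b);    // assumed helper

        struct ClipmapNode {
            Vec3 centre;
            int lodLevel;                  // 0 = finest; sketch assumes 5 levels
            ClipmapNode* children[8];
            bool active = false;
        };

        // One constant distance per LOD level: being inside a level's range
        // means the camera deserves finer detail than that level provides.
        const float LOD_DISTANCES[] = { 64.f, 128.f, 256.f, 512.f, 1024.f };

        void SelectLod(ClipmapNode* node, const Vec3& camera) {
            const bool needFiner = node->lodLevel > 0 &&
                Distance(camera, node->centre) < LOD_DISTANCES[node->lodLevel];

            if (!needFiner) {
                // if (!node->active) QueueMeshGeneration(node);  // hypothetical hook
                node->active = true;
                return;    // top-down order: stale descendants get cleaned up here
            }

            // if (node->active) QueueMeshRelease(node);          // hypothetical hook
            node->active = false;
            for (ClipmapNode* child : node->children)
                if (child)
                    SelectLod(child, camera);
        }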

      This is not a solution I'm particularly happy with, which is why I've not written it up (it can get pretty slow when the volume size is large). I do have an idea for a better approach which I might try soon; if that works I'll write that up.
