a hitch in my plan: opening a complicated model on a windows 11 vm running on half a CPU core on a potato takes a hot minute to compile a few hundred shaders 🙃
this is the problem model in the first image. each color is a different shader that had to be generated by tangerine to render the voxel, and the average complexity of the generated shaders is also high.
the second image is the generated part for one of these shaders. it's basically the object structure w/ all the params pulled out.
here's some very entertaining reading about the same problem in a completely different project https://dolphin-emu.org/blog/2017/07/30/ubershaders/
update: I hacked together an interpreted mode and it works great for this :D
anyways, here's the code for just the interpreter if you are curious. it is quite short. https://github.com/Aeva/tangerine/blob/excelsior/shaders/interpreter.glsl
when i get around to adding occlusion culling this should become quite fast, as a lot of the frame time is burned rendering voxels you can't see. visibility feedback could also be used to prioritize the compiling queue. this also might mean a wysiwyg editor could be possible, since the time to render is instant once the octree is solved. lots of exciting stuff.
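a minimal sketch of what visibility-prioritized compiling could look like; CompileJob, VisiblePixels, and the queue itself are made-up names for illustration, not anything in tangerine:

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical: one entry per not-yet-compiled shader permutation.
struct CompileJob
{
    uint32_t ShaderIndex;
    uint32_t VisiblePixels; // visibility feedback from the previous frame
};

// Order the queue so the most visible shaders compile first.
struct ByVisibility
{
    bool operator()(const CompileJob& A, const CompileJob& B) const
    {
        return A.VisiblePixels < B.VisiblePixels;
    }
};

using CompileQueue = std::priority_queue<CompileJob, std::vector<CompileJob>, ByVisibility>;
```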
@jonbro yes :D the root of the tree contains the entire model, which would be too slow to render, compiled or otherwise. the octree splits to eliminate dead space, and as it does so each node removes the parts of the CSG tree that can't affect it, resulting in a simpler SDF
@jonbro yes. here's the implementation if you're interested https://github.com/Aeva/tangerine/blob/excelsior/tangerine/sdfs.cpp#L1156
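a minimal sketch of the idea (not the linked code): bound each operand over an octree cell using the Lipschitz property, and drop operands that can never win the min. this assumes the fields are at worst 1-Lipschitz:

```cpp
#include <functional>

struct Vec3 { float X, Y, Z; };

// Stand-in for the real CSG node type; Eval returns the (approximate) distance.
struct SDFNode
{
    std::function<float(Vec3)> Eval;
};

struct Interval { float Min, Max; };

// For a 1-Lipschitz field, one sample at the cell center C bounds the field
// over the whole cell of bounding radius R: |f(p) - f(c)| <= |p - c| <= R.
Interval BoundOverCell(const SDFNode& Node, Vec3 C, float R)
{
    const float D = Node.Eval(C);
    return { D - R, D + R };
}

// Union is min(A, B): if one operand is strictly larger everywhere in the
// cell, the union reduces to the other operand within that cell.
// (Cut and intersection prune the same way with their min/max flipped.)
SDFNode* SimplifyUnion(SDFNode* A, SDFNode* B, Vec3 C, float R)
{
    const Interval IA = BoundOverCell(*A, C, R);
    const Interval IB = BoundOverCell(*B, C, R);
    if (IA.Min > IB.Max) return B; // A can never be closest here; drop it.
    if (IB.Min > IA.Max) return A; // likewise for B.
    return nullptr;                // both can matter; keep the full union.
}
```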
@aeva I guess I can't think of an alternate way to approach it :D
this is really cool to see an end to end implementation of this.
@aeva awesome! I'm not sure I'm ready to revive my toy voxel sdf thingy, but these notes are gonna be my starting point if i do.
I gave up at the culling SDF ops stage, so I could never really have complex models :(
@jonbro so far this approach is working quite well for me. the main problem is the distance fields aren't exact after any set operators, so it can't cull as aggressively on the CPU as i would like it to. it also definitely needs clustered occlusion culling. I think this strat has promise though.
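to make the "not exact" part concrete, here's a tiny worked example of the bound you get from a cut; both operands are exact SDFs, but max() only promises a lower bound on the distance:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>

// Solid: the half-space x <= 0 with a unit sphere (at the origin) cut out.
float HalfSpace(float X, float, float) { return X; }
float Sphere(float X, float Y, float Z) { return std::sqrt(X * X + Y * Y + Z * Z) - 1.0f; }

int main()
{
    // Sample just outside the cut, at (0.5, 0, 0).
    const float X = 0.5f, Y = 0.0f, Z = 0.0f;
    const float Field = std::max(HalfSpace(X, Y, Z), -Sphere(X, Y, Z));

    // Field says 0.5, but the nearest point actually on the surface is
    // (0, 1, 0), at distance sqrt(0.25 + 1) ~= 1.118. Sphere tracing is
    // still safe (it never overshoots), but a culling test sees "maybe a
    // surface within 0.5" where the exact field could have rejected more.
    std::printf("field: %.3f, true distance: %.3f\n", Field, std::sqrt(1.25f));
    return 0;
}
```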
@aeva I don’t know if it works on any other platform but on macOS you could achieve async shader compilation by using shared contexts — each context could compile one shader at once. ‘Course, some of the compilation was still deferred to see the relevant state at draw time, so you tended to need to draw with each shader and the correct state bound, so it was a huge pain in the butt…
@OneSadCookie i had a go at the shared contexts approach, and it ended up causing a lot of mystery behavior, like long half-minute hangs in strange places like timing queries and such. the problem with shared contexts is that when you opt into them, the driver turns on a ton of hazard tracking and synchronization it normally doesn't have to do, and that path is not as well QA'd as the main stuff.
@OneSadCookie there's also an extension for parallel shader compiling that doesn't work properly on nvidia, and the shader binary trick where you compile the shader in another process and then load the binary. a common problem with all of these is that opengl likes to recompile shaders it has already compiled the first time you use them, and every time you change pipeline state, for Reasons
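for reference, a minimal sketch of the binary trick on the GL side (glGetProgramBinary / glProgramBinary are standard GL 4.1+ API; error handling and the cross-process plumbing are omitted):

```cpp
#include <vector>
// assumes a GL loader (glad, glew, ...) is already set up

GLuint SaveAndReloadProgram(GLuint Program)
{
    // Must be set *before* glLinkProgram for the binary to be retrievable:
    // glProgramParameteri(Program, GL_PROGRAM_BINARY_RETRIEVABLE_HINT, GL_TRUE);

    GLint Length = 0;
    glGetProgramiv(Program, GL_PROGRAM_BINARY_LENGTH, &Length);

    std::vector<char> Binary(Length);
    GLenum Format = 0;
    glGetProgramBinary(Program, Length, nullptr, &Format, Binary.data());

    // ...hand Binary + Format to another process or a later run...

    GLuint Loaded = glCreateProgram();
    glProgramBinary(Loaded, Format, Binary.data(), (GLsizei)Binary.size());

    // Drivers may reject the blob if anything changed (driver version, GPU),
    // so this always needs a fallback to compiling from source.
    GLint Ok = 0;
    glGetProgramiv(Loaded, GL_LINK_STATUS, &Ok);
    return Ok ? Loaded : 0;
}
```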
@OneSadCookie i'm planning on jettisoning gl for vk because of this, but the going is slow because vk is supremely unpleasant to write for some reason.
@aeva ah yeah, that is all very sucks. And I haven’t written Vk myself but I’ve seen somebody’s setup code and some of the complexity reflected through WGPU, so I can understand the desire to stick with GL for a bit longer!
@aeva You could try compiling to SPIR-V in a separate thread, that way the driver only has to go from bytecode to machine code.
I use shaderc:
It has several options, like optimization level, source language, and target environment (I'm using HLSL + vulkan).
It's shipped as part of the vulkan SDK, so I just LoadLibrary and load its functions to use from C.
There's also glslang:
Also dxc can compile hlsl to dxil or spir-v:
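a minimal sketch of driving shaderc's C API as described; Source and SourceLen are assumptions standing in for the HLSL text you want to compile:

```cpp
#include <shaderc/shaderc.h>

// Source / SourceLen are assumed inputs: the shader text to compile.
const char* CompileToSpirV(const char* Source, size_t SourceLen, size_t* OutSize)
{
    shaderc_compiler_t Compiler = shaderc_compiler_initialize();
    shaderc_compile_options_t Options = shaderc_compile_options_initialize();
    shaderc_compile_options_set_source_language(Options, shaderc_source_language_hlsl);
    shaderc_compile_options_set_target_env(Options, shaderc_target_env_vulkan, shaderc_env_version_vulkan_1_1);
    shaderc_compile_options_set_optimization_level(Options, shaderc_optimization_level_performance);

    shaderc_compilation_result_t Result = shaderc_compile_into_spv(
        Compiler, Source, SourceLen, shaderc_fragment_shader, "shader.hlsl", "main", Options);

    const char* SpirV = nullptr;
    if (shaderc_result_get_compilation_status(Result) == shaderc_compilation_status_success)
    {
        *OutSize = shaderc_result_get_length(Result);
        SpirV = shaderc_result_get_bytes(Result); // valid until Result is released
    }
    // Real code would keep Result alive until the bytes are consumed, then
    // call shaderc_result_release and shaderc_compiler_release.
    return SpirV;
}
```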
I am not 100% sure, but it looks like it would not be a huge undertaking if you are already using GL 3.3 core.
There are some code snippets on the GL wiki linked above, it looks pretty trivial to change the cpu code side.
@lh0xfb i'm targeting 4.2 with some extensions. i don't think i'm using anything particularly exotic
@aeva That should be fine; I think the main requirement is explicit attribute and binding locations, so that everything links together consistently.
Here's a site where someone shared their before / after of converting their GLSL shader over to work with SPIR-V: https://eleni.mutantstargoat.com/hikiko/opengl-spirv/
I think it may also affect your cpu code with regard to relying on GL to perform reflection on your shader, e.g. to find the binding id of a uniform. With SPIR-V you'd instead need to know which binding id you want to update.
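to make that concrete, a minimal sketch of the GL side (needs GL 4.6 or ARB_gl_spirv; SpirV and Size are assumed to come from shaderc, glslang, or dxc):

```cpp
// SpirV / Size: the compiled module from shaderc, glslang, or dxc.
GLuint LoadSpirVShader(const void* SpirV, GLsizei Size)
{
    GLuint Shader = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderBinary(1, &Shader, GL_SHADER_BINARY_FORMAT_SPIR_V, SpirV, Size);

    // Replaces glCompileShader for SPIR-V modules; "main" is the entry point.
    glSpecializeShader(Shader, "main", 0, nullptr, nullptr);

    GLint Ok = 0;
    glGetShaderiv(Shader, GL_COMPILE_STATUS, &Ok);
    return Ok ? Shader : 0;
}

// Since name-based reflection may no longer be available, the GLSL needs
// explicit locations/bindings, e.g.:
//   layout(location = 0) uniform mat4 WorldToClip;
//   layout(binding = 1) uniform sampler2D SomeTexture;
```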
It's still a hell of a lot simpler than vulkan, where you have to also provide even more info about *everything* and do descriptor set allocation, writes, and lifetime / state management such that descriptors are not modified while the gpu is using them.
I do the laziest/simplest thing possible, where I only have one global descriptor set per frame, I update it once before any rendering is done, and reclaim it a couple frames later.
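one way to read that pattern, with hypothetical names (pool creation and the per-frame fence wait are assumed to happen elsewhere):

```cpp
#include <vulkan/vulkan.h>

// Hypothetical frame-ring setup: one pool per frame in flight.
static const uint32_t FramesInFlight = 3;
VkDescriptorPool FramePools[FramesInFlight]; // created at init

VkDescriptorSet BeginFrameDescriptors(VkDevice Device, uint32_t FrameIndex, VkDescriptorSetLayout Layout)
{
    // The caller already waited on this frame's fence, so nothing in this
    // pool is still in use by the GPU and we can recycle it wholesale.
    vkResetDescriptorPool(Device, FramePools[FrameIndex], 0);

    VkDescriptorSetAllocateInfo AllocInfo = {};
    AllocInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
    AllocInfo.descriptorPool = FramePools[FrameIndex];
    AllocInfo.descriptorSetCount = 1;
    AllocInfo.pSetLayouts = &Layout;

    VkDescriptorSet Set = VK_NULL_HANDLE;
    vkAllocateDescriptorSets(Device, &AllocInfo, &Set);
    // One vkUpdateDescriptorSets call here, then bind Set for the whole frame.
    return Set;
}
```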
@mcc so tangerine is essentially a shader compiler, where the input is a csg graph and the output is a bag of shaders. most stuff generates a low number of permutations and the driver caches them. at worst it's usually a few seconds of waiting before the model renders, and then everything is fast.
@mcc i've been planning on building an interpreted variant to run while compiling in the background, but async shader compiling on opengl is really hard to do so i've been putting it off. however i think in this use case i don't want to compile the shaders at all, so an interpreter would be perfect
@mcc it wouldn't be that hard to put together from what I have right now since all of the shape parameters are extracted and passed in via a buffer already, and I'd just have to come up with some op codes to pass in the same way
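a sketch of what that could look like with made-up opcodes (illustrative only, not tangerine's actual encoding): flatten the CSG tree postorder into the same kind of flat buffer as the params, then evaluate it as a tiny stack machine. the CPU version below mirrors what the GLSL loop would do per ray step:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Made-up opcode set; a real one would cover every shape and operator.
enum Op : int { OP_SPHERE = 0, OP_UNION = 1, OP_CUT = 2 };

struct Vec3 { float X, Y, Z; };

// Program layout: OP_SPHERE cx cy cz radius | OP_UNION | OP_CUT ...
// e.g. { OP_SPHERE, 0,0,0, 1,  OP_SPHERE, 0.5f,0,0, 1,  OP_CUT }
float Eval(const std::vector<float>& Program, Vec3 P)
{
    float Stack[32];
    int Top = -1;
    for (size_t I = 0; I < Program.size();)
    {
        switch ((int)Program[I++])
        {
        case OP_SPHERE:
        {
            const float DX = P.X - Program[I++];
            const float DY = P.Y - Program[I++];
            const float DZ = P.Z - Program[I++];
            const float Radius = Program[I++];
            Stack[++Top] = std::sqrt(DX * DX + DY * DY + DZ * DZ) - Radius;
            break;
        }
        case OP_UNION: // min(a, b)
        {
            const float B = Stack[Top--];
            Stack[Top] = std::min(Stack[Top], B);
            break;
        }
        case OP_CUT: // max(a, -b)
        {
            const float B = Stack[Top--];
            Stack[Top] = std::max(Stack[Top], -B);
            break;
        }
        }
    }
    return Stack[0];
}
```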
@aeva so I guess like… it sounded like your goal was to put it on a low power device. Why do you need to generate the cache on a low power device? Could you not generate the cache and the shader bag on the high power device and zap it over to the low power one at init time?
@aeva the fact you are planning to deploy exactly one unit makes several workflows viable that otherwise maybe would not be
@mcc so a quick example, these are the shaders that the compiler produces for the step pyramid vs the wheel-o-problems. they can't be produced ahead of time since they are specific to the models
@mcc since i was using llvmpipe on the potato, i could probably have a different machine with the same version of llvmpipe compile the final glsl and extract the binary and so on, but gl makes that kind of thing very brittle. i'm not confident it would work between two different OSes, and idk if llvmpipe uses CPU-specific extensions when available
@mcc however, if i build an interpreter for the generated part then i can avoid the problem entirely