Hey everyone, time for a Christmas update!
Firstly, hope everyone's holiday season has been going well so far!
Now, lots of very interesting developments, so this may run a bit long, but I'm a huge nerd and this stuff is cool.
Ok, so. Hard to say how familiar everyone is with threading. Torque's broadly always been a single-threaded engine. There are exceptions (some sound stuff, the physics thread, etc), but by and large, all the work Torque does through its normal run and sim has been single-threaded.
Why does this matter? Well, single-threaded performance for CPUs - that is, how much work a CPU can do on a single thread, and how fast - has been plateauing for a while now. While efficiency improvements are obviously continuous, ultimately there's only so much computation you can push through in a given timeframe. Which is why, of course, CPUs started adding multiple cores, hyperthreading, etc, so workloads can be divvied up across cores and threads, allowing multiple workloads to happen at once.
Game engines have been slow to multithread, simply because multithreading isn't easy even in simpler programs: you have to make sure different threads never read and write the same memory at the same time, or you get corruption and crashes. And in something as intertwined as a game engine, it's many times more challenging.
That said, ultimately, even game engines need to suck it up and properly utilize all that CPU power that's just sitting there begging to be used. Over the past few years most major engines have shifted over, but Torque's been lagging behind, partially because none of us on core development were especially familiar with it. We'd had brushes with it, but hadn't familiarized ourselves enough with the how-to to really rework the engine and make it happen.
The good news is, that's changing now.
We ended up getting into contact with the Life Is Feudal guys. They'd been working on their own branch since before the engine went MIT, but offered to pass along a good chunk of the work they'd done for us to use and integrate where possible. They've *seriously* worked hard, and what they've passed along so far has been a treasure trove, so make sure to send them some thanks and maybe check out their game.
But one of the things they passed along that particularly piqued our interest was texture streaming. As in, loading textures progressively, lower-to-higher mips. You can quickly load in all the textures your rendering requires, and then, over the course of a few frames, load the higher-quality mips of the same textures to get the proper, full detail. Obviously, this is a workload best done in a threaded way, and sure enough they'd reworked a good chunk of the threading system in Torque to make that happen.
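As an illustration of the idea (this is not LiF's actual code; every name here is invented), a streamed texture can track which mip level is resident and promote one level per frame until the full-detail mip is in memory:

```cpp
#include <cstdint>

// Hypothetical sketch of progressive mip streaming: a texture starts with
// only its smallest mip resident, and each frame we promote one level
// until the full-resolution mip (level 0) is loaded.
struct StreamedTexture
{
   uint32_t totalMipLevels;   // e.g. 10 for a 512x512 texture
   uint32_t lowestResident;   // smallest (highest-index) mip currently loaded

   // lowestResident == 0 means the full-detail mip is in memory
   bool fullyResident() const { return lowestResident == 0; }
};

// Called once per frame per streaming texture: load the next-larger mip.
// Real code would kick the file read to a worker thread and upload the
// mip to the GPU when it completes; here we just advance the counter.
void streamNextMip(StreamedTexture& tex)
{
   if (!tex.fullyResident())
      tex.lowestResident--;
}
```

The renderer can sample whatever `lowestResident` level exists right away, so nothing blocks while the rest streams in.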
@ Azaezel has been the major spearhead in figuring out the port work to get the threading management improvements in, and I've helped, along with @ Hutch, working with our guy at LiF to puzzle out the integration. There are some bugs to hammer out yet, but the basic integration of the improvements is looking solid so far. Once that's been finalized, we can move on to the texture streaming, and the other delicious opportunities this provides.
Opportunities, you say? Indeed I do say. Let me explain.
The core of the improvements mainly updates the thread handling to be more modern and standard (using std::thread, etc) and adds some features modern thread handling provides, such as condition tracking. This lets us understand the status of a thread/work item much better. So, for example, when we go to process an image we're loading, we can spool it up as a work item on the ThreadPool, and the improvements make it *much* easier to track whether it's done or not, letting us actually use the texture once the loading work has concluded. In the old setup, you had to do manual setup work for tracking stuff like that, so using threads was always kind of a pain.
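To make that "track whether the work item is done" idea concrete, here's a rough sketch using plain std::async/std::future as a stand-in. The actual reworked ThreadPool API differs, and decodeImage is a made-up placeholder for the real image-loading work:

```cpp
#include <future>
#include <vector>
#include <cstdint>
#include <cstddef>

// Stand-in for the expensive decode work we want off the main thread.
std::vector<uint8_t> decodeImage(size_t byteCount)
{
   return std::vector<uint8_t>(byteCount, 0xFF); // pretend we decoded pixels
}

std::vector<uint8_t> loadTextureThreaded(size_t byteCount)
{
   // Spool the decode off onto another thread...
   std::future<std::vector<uint8_t>> result =
      std::async(std::launch::async, decodeImage, byteCount);

   // ...the main thread could keep rendering here, polling the future's
   // status each frame (result.wait_for with a zero timeout)...

   // ...and only consume the texture once the work has concluded.
   return result.get();
}
```

The future gives you completion tracking for free, which is the kind of convenience the old hand-rolled setup lacked.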
So, it's easier to manage threads, which means they're easier to use. This means that we can and *will* use them. So, outside of texture streaming, what are we looking at threading?
So, so many things. Two biggies are the next major points, but we plan on sprinkling threaded behavior around in a LOT of spots. GroundCover item placement? Thread that sucker so it goes faster. Loading a model off the disk and doing the setup work to prep it in memory/load the animations? Thread it up. The plan is to thread as many things as can be reasonably converted into threaded workloads, so we're using as many cores of the CPU as we can, as often as we can. This will lead to lower load times, less load stutter, and a much higher baseline of performance.
Not to be confused with Entity-Component systems (hah, just kidding, feel free to be confused, naming conventions are hard!), the distinction between the two is relatively straightforward, but an important one.
The current way I have entities and components set up in the engine is as an Entity/Component configuration. You have an Entity object you throw in the scene, and then add Component objects to it to apply functionality and behavior to said Entity (like adding a MeshComponent to get it to render a mesh). The Components contain data AND implement that data: a MeshComponent holds what model to render, scale, etc, but also has the code logic to actually DO the rendering of the model. When render time came, we iterated over all our mesh components (we actually used a globally accessible interface list, but that's not important yet) and told them to do their work. We similarly iterated over all the regular components that did tick updates and so on.
While working with the LiF guys on the thread stuff, they mentioned that they, too, had begun shifting to components and the like, so we got talking about that, and it turns out they used a modified Entity/Component/Systems setup. I'd looked at this approach back when I started working on E/C stuff, but there were issues with script integration and other problems, so I abandoned it. But as we talked, ideas began to form, so I leapt into action and quickly drafted a prototype.
One of the things you want to avoid at all costs in software (as hard as it is) is cache thrashing, or loss of cache locality. Basically, when you go to process a chunk of memory in the CPU, you pull it out of RAM and toss it into the CPU cache. As fast as RAM is, the cache is way faster. But it's also waaaay smaller. So when you're processing that chunk of memory, if you have to refer to a chunk of memory that isn't in the cache, you pretty much have to dump what's there, fetch the newly relevant memory chunk, cache it, and then keep working.
When dealing with pointered objects, this is hard to avoid. Because Torque uses a central list of objects via the SimDictionary, referencing into that is likely going to kill cache locality, which leads to slowness. Making cache locality happen with Entity/Components is, thus, hard.
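To illustrate the pattern the rework is chasing (illustrative only, not engine code): data packed contiguously in one array gets walked linearly, which is exactly the access pattern caches and hardware prefetchers like, versus chasing pointers all over the heap via a dictionary lookup per object:

```cpp
#include <vector>

// A tiny stand-in for per-object render data.
struct RenderData { float scale; };

// Summing over a contiguous array touches memory strictly in order,
// so the prefetcher keeps the cache warm the whole way through.
// A SimDictionary-style pointer chase per object would instead jump
// around the heap, evicting cache lines constantly.
float sumScalesContiguous(const std::vector<RenderData>& data)
{
   float total = 0.0f;
   for (const RenderData& d : data)  // linear, cache-friendly walk
      total += d.scale;
   return total;
}
```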
But with the discussions, a new notion was formed. We had a convenient system already in place via the component interfaces - a system where when a component is created, it automatically adds itself to a global list that makes it cheap and easy to access them without having to go do a lookup through the SimDictionary and iterate through all loaded objects.
So I repurposed this system to make...uh, Systems.
So calling back to our MeshComponent example: the old way, the MeshComponent contained the data of what to render, and when the time came, it also did the rendering. In the new setup, the implementation is separated out and moved into a MeshRenderSystem.
This does the actual work of making rendering happen. We then have a data interface that is a list inside our system, which contains the actual data-to-use. Model, scale, transform, etc.
The component itself basically acts as our 'real' object, which can be manipulated, changed via script, etc. So the structure is like so:

MeshComponent
    ----Pointer to a unique instance of MeshRenderSystemData
When a MeshComponent is made, it allocates a MeshRenderSystemData, which is stored into a list in the system. The data contains everything relevant to making the rendering of a mesh happen. The component merely gets/sets the data.
When it comes time to render, the MeshRenderSystem basically gets told "do rendering", and it very quickly iterates over its local list of MeshRenderSystemData and renders them. Because the data is locally contained in that array, we maintain locality, and it helps performance even more.
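Here's a hedged sketch of that split. The names mirror the post (MeshComponent, MeshRenderSystem, MeshRenderSystemData), but every signature here is invented for illustration; the real engine code will look different:

```cpp
#include <vector>
#include <cstddef>

// The data the system operates on: plain, contiguous, no behavior.
struct MeshRenderSystemData
{
   int   modelId  = -1;    // what to render
   float scale    = 1.0f;  // how to render it
   bool  rendered = false; // stand-in for "was submitted this frame"
};

class MeshRenderSystem
{
public:
   // Components allocate their data slot here on creation.
   size_t allocateData() { mData.push_back({}); return mData.size() - 1; }
   MeshRenderSystemData& data(size_t idx) { return mData[idx]; }

   // "Do rendering": one tight, cache-friendly loop over local data.
   void render()
   {
      for (MeshRenderSystemData& d : mData)
         d.rendered = true; // real code would submit d to the renderer here
   }

private:
   std::vector<MeshRenderSystemData> mData; // contiguous: cache locality
};

// The component is just a thin get/set view over its slot in the system,
// acting as the script-facing 'real' object.
class MeshComponent
{
public:
   explicit MeshComponent(MeshRenderSystem& sys)
      : mSystem(sys), mDataIdx(sys.allocateData()) {}

   void  setScale(float s)  { mSystem.data(mDataIdx).scale = s; }
   float getScale() const   { return mSystem.data(mDataIdx).scale; }

private:
   MeshRenderSystem& mSystem;
   size_t            mDataIdx; // index, not pointer: survives vector growth
};
```

Note the component holds an index rather than a raw pointer, so the system's array can grow without invalidating anything.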
A bonus with the system feeding off a general data container like our MeshRenderSystemData is that it doesn't care where that data came from, as long as it's formatted correctly. This means we can use one system, like our MeshRenderSystem, to render anything: player models, terrain, particle effects, etc. Doesn't matter; we can just crunch the data and go. Which means we have fewer paths in/out, and the code is cleaner and easier to maintain.
From here, we get to do something really neat. Because the data is all self-contained in the MeshRenderSystemData, we can then implement the MeshRenderSystem to process that data in a threaded way.
So when we go to render our objects, we can chop up our list into chunks, then assign a thread to each chunk and process that data as fast as is physically possible, in a largely asynchronous way. This will DRASTICALLY reduce overhead time when we go to render. Likewise, this is also faaaaaar cleaner a setup code-wise, as we have less jumping and hopping around through a dozen files to get from 'the engine wants to render a frame' to 'time for our object being rendered to submit to the API for actual drawing'.
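The chunking idea can be sketched like so. This is purely illustrative; the key assumption is that each thread owns a disjoint slice of the array, so no locking is needed:

```cpp
#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

// Split the data array into contiguous slices and let one thread crunch
// each slice. The per-item work here (doubling a float) stands in for
// real per-object render prep.
void processChunked(std::vector<float>& data, unsigned threadCount)
{
   if (threadCount == 0)
      threadCount = 1;

   std::vector<std::thread> workers;
   const size_t chunk = (data.size() + threadCount - 1) / threadCount;

   for (unsigned t = 0; t < threadCount; t++)
   {
      const size_t begin = t * chunk;
      const size_t end   = std::min(begin + chunk, data.size());
      if (begin >= end)
         break;

      // Each thread touches only [begin, end): disjoint, so lock-free.
      workers.emplace_back([&data, begin, end]() {
         for (size_t i = begin; i < end; i++)
            data[i] *= 2.0f;
      });
   }

   for (std::thread& w : workers)
      w.join(); // wait for every chunk before the frame moves on
}
```

In practice you'd size threadCount from std::thread::hardware_concurrency() and reuse pooled threads instead of spawning fresh ones each frame.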
This makes the code easier to understand and maintain, and also more performant. One example is how updates will happen for all this stuff.
Currently, we have a global list of 'tickable' items: they do processTick, then advanceTime if appropriate. Then at some point later during the update, they render as well.
With the systems-focused setup, the flow is a little more linear and cohesive. Namely, when we go to do our main update, we'll walk through our systems in order, doing our update systems first (physics, animation updates, etc).
Then, building off that data, we can do our render systems. Because we do these linearly, the code'll be a little more comprehensible in how a given update goes. But inside each system, we can spool up a number of threads equal to the cores available and crunch our data as quickly as possible before moving on to the next system. This lets us thread stuff aggressively, but keeps the actual order of tasks sane and easier to debug/expand. It's not as theoretically fast as a work-stealing fiber-task system where threads grab whatever work is available and just make it happen, but it IS easier to maintain and debug.
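A minimal sketch of that linear ordering (names invented, not engine API): update systems run first, render systems after, in a fixed sequence, with each system free to fan out to worker threads internally:

```cpp
#include <vector>
#include <functional>
#include <string>

// One entry per system; update() may thread internally, but systems
// themselves always execute in a fixed, debuggable order.
struct System
{
   std::string name;
   std::function<void()> update;
};

void runFrame(const std::vector<System>& updateSystems,
              const std::vector<System>& renderSystems)
{
   for (const System& s : updateSystems)
      s.update(); // physics, animation, etc...
   for (const System& s : renderSystems)
      s.update(); // ...then rendering, built off that updated data
}
```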
I've already got this implemented in the R&D with the mesh component, so the new setup definitely works. I just need to get the other components converted over.
This ties into the next nifty bit of work:
So, the GFX render API wrapper system is a good system, but it's showing its age. It was written for D3D9, upgraded to D3D11, and then had OpenGL support added onto it. But ultimately, it was still designed around the D3D9 ethos, which is not how modern render APIs behave, meaning we're not nearly as performant as we could and should be when rendering.
We've been looking at options for updating it for a good while now, but it's a bit of a 'thing' because it touches a lot of places, so a good approach hadn't solidified. @ Timmy and @ Hutch had looked into this sort of thing before as a side project in their free time, and @ Hutch recently started messing with it in a fuller way.
The idea is a clean, new implementation of GFX. He started out by learning how Vulkan works, then used those design notions to build an OpenGL wrapper. So you structure the data in a Vulkan-like way, but the backend interprets that data for OpenGL and renders it very efficiently. The good news is that modern OpenGL is also absurdly efficient when you structure things correctly, so starting with OpenGL keeps platform flexibility, but will still render really fast.
Afterwards, we can look at implementing Vulkan properly for more forward-looking prospects on newer hardware.
But in the meantime, this lends some neat advantages. Namely, we can do stuff like tying back into the Systems mentioned above and very quickly and cleanly processing our render data directly into universal buffer objects for drawing.
This is important because the actual overhead cost of that is minuscule. We can basically just throw our data into that buffer en masse when we do our render updates, and it propagates to the GPU with minimal driver overhead. Which means the actual draw calls are very cheap, because the data they care about is all efficiently pushed via the buffer objects. We also only ever need to update stuff that changes, so if a model hasn't changed since the last frame, we can leave the data alone and cut down on more processing.
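The "only update what changed" part can be sketched with a dirty flag per entry. Here the GPU-side buffer is simulated with a plain vector; real code would write into a mapped GL/Vulkan buffer instead, and all names are invented for illustration:

```cpp
#include <vector>
#include <cstddef>

// Per-instance data mirrored on the CPU, with a dirty flag marking
// entries that changed since the last flush.
struct InstanceData
{
   float transform[3];
   bool  dirty;
};

struct BufferMirror
{
   std::vector<InstanceData> cpu;
   std::vector<InstanceData> gpu; // stand-in for the GPU-side buffer
   size_t uploadsLastFlush = 0;

   // Push only the dirty entries "to the GPU"; unchanged data stays put.
   void flush()
   {
      uploadsLastFlush = 0;
      gpu.resize(cpu.size());
      for (size_t i = 0; i < cpu.size(); i++)
      {
         if (!cpu[i].dirty)
            continue;            // unchanged since last frame: skip
         gpu[i] = cpu[i];        // "upload" this entry
         cpu[i].dirty = false;
         uploadsLastFlush++;
      }
   }
};
```

If nothing moved, a flush touches nothing, which is exactly where the cheap-draw-call win comes from.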
Because we skip so much extra fluff by going system-to-GFX3 in that way, we further compound the efficiency gains. This means rendering any given object should have FAR lower CPU requirements, and it'd mostly be about actually crunching the workload on the GPU. Which leads to much more free CPU for other things, which we can then utilize in all this new threaded work.
One convenient thing with all this ties back into 'updating GFX is a huge pain in the butt': between my prior workblog mention of refactoring the render path to be camera-oriented and the new Systems-based entity/component stuff, almost all of our core render path for a lot of objects is essentially parallel to the old code. Meaning we can conveniently sandbox the new stuff out and test it without having to go all-in on a total strip-out-and-replace until everything is actually ready. This makes the work faster and easier, and keeps everyone sane.
I don't believe this will be done for 4.0, but we *might* have an experimental-tagged version of it in there for prototyping/testing. If we do, I would suggest fully expecting things to break horribly.
A much better bet is to expect it for 4.1.
Another recent improvement was rewriting how components are networked. Originally, I had components inherit from NetObject, and they were ghosted down if the component needed to do stuff on the client, such as rendering. However, this became pretty complicated when you had components with dependencies (the animation component requires a mesh component to exist and be added/loaded to work), and relying on delay-detection ghosting voodoo was a bad time.
So I reworked it all. Now, components don't ghost down at all. Instead, entities are in charge of managing a component's network behavior. If a component is flagged as networked, the entity will network across the type of component to be created, spool up an instance on the client side, and do the adding automatically. Then, for any networked components, the entity will get the network data from the components and integrate it into the entity's network stream.
This keeps components having a full netmask, so we only have to send the data we actually updated, but it's cleaner and more straightforward than relying on them being ghosted along like regular NetObjects. This removes the uncertainty and has helped stabilize things a lot. From here, we can further refine how much data we need to network across to make things even more efficient, but this is a nice boost. Currently, if nothing is happening (nothing is moving/updating), an entity and its components consume 0 bandwidth, which is pretty excellent.
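As a toy illustration of netmask-style delta updates (field names and layout invented here, not the actual entity code): each changed field flips a mask bit, and packing writes the mask plus only the dirty fields, so an idle component produces zero bytes on the wire:

```cpp
#include <cstdint>
#include <vector>

struct NetComponent
{
   enum MaskBits : uint32_t
   {
      PositionMask = 1 << 0,
      HealthMask   = 1 << 1,
   };

   float    position  = 0.0f;
   uint8_t  health    = 100;
   uint32_t dirtyMask = 0;

   // Setters flag which fields need to go out next update.
   void setPosition(float p) { position = p; dirtyMask |= PositionMask; }
   void setHealth(uint8_t h) { health = h;   dirtyMask |= HealthMask; }

   // Returns the bytes that would go on the wire for this update.
   std::vector<uint8_t> packUpdate()
   {
      std::vector<uint8_t> stream;
      if (dirtyMask == 0)
         return stream;                      // idle: zero bandwidth

      stream.push_back(uint8_t(dirtyMask));  // mask header
      if (dirtyMask & PositionMask)
      {
         const uint8_t* p = reinterpret_cast<const uint8_t*>(&position);
         stream.insert(stream.end(), p, p + sizeof(position));
      }
      if (dirtyMask & HealthMask)
         stream.push_back(health);

      dirtyMask = 0;                         // clean until something changes
      return stream;
   }
};
```

The receiving side reads the mask first and unpacks only the fields whose bits are set, keeping both ends in lockstep.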
I've been working on this for a good while now, but the initial pass is almost done. Was hoping to get the last main issues wrapped this weekend, but the holidays predictably slowed everything down. My current goal is to get this wrapped and PR'd by the end of the week.
Beyond all that, there's a good bit of stuff from the LiF guys we haven't really even bitten into yet regarding other improvements, some examples of shaders they wrote we can port some math from, etc, etc. I'd also mentioned it before, but I'm hashing out a new, proper demo/map that we can use for testing stuff, which we should be able to see more of in the coming weeks. @ Azaezel also got reflection probes working in OpenGL for the PBR stuff, which is fantastic.
So yeah, a whooooole crap-ton of really interesting work that's going to give T3D a hearty boost performance- and functionality-wise while making stuff easier to maintain. Good times!
Peace out, and have an excellent New Year!
(Also, if you think all this development is pretty neat, maybe consider tossing a bit of support our way via my Patreon.)