Re: Multithreading Doom's playsim

Wed Mar 24, 2021 4:55 am

Data-oriented programming makes more sense on platforms like the PS3 or Xbox 360.
It could be done in a quite complex way involving a domain-specific language, code generation, and manual handling of corner cases.

Do note that the approach was advertised for console projects without any modding capabilities.
Those were platforms with a fixed amount of RAM and VRAM, no paging (aka swap), and relatively high branching and cache miss penalties.
Cache friendliness by itself is very nice, although carelessly coded mods and huge detailed maps can slow down any engine on a superfast system.
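
To make the cache-friendliness point concrete, here is a minimal C++ sketch of the usual data-oriented contrast between an array of structures and a structure of arrays. All type and field names are made up for illustration and are not taken from any Doom source port; whether the second layout is actually faster depends on the platform and the access pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures: every object carries all of its fields, so a loop that
// only needs positions still pulls health/flags/sprite through the cache.
struct ActorAoS {
    float x, y, z;
    float velx, vely, velz;
    int health;
    int flags;
    int sprite;
};

// Structure-of-arrays: each field lives in its own contiguous array, so the
// movement loop below touches only the six float arrays it needs.
struct ActorsSoA {
    std::vector<float> x, y, z;
    std::vector<float> velx, vely, velz;
    std::vector<int> health, flags, sprite;

    std::size_t size() const { return x.size(); }

    void add(const ActorAoS& a) {
        x.push_back(a.x); y.push_back(a.y); z.push_back(a.z);
        velx.push_back(a.velx); vely.push_back(a.vely); velz.push_back(a.velz);
        health.push_back(a.health); flags.push_back(a.flags); sprite.push_back(a.sprite);
    }
};

// Same update for both layouts: position += velocity * dt.
void moveAll(std::vector<ActorAoS>& actors, float dt) {
    for (ActorAoS& a : actors) {
        a.x += a.velx * dt;
        a.y += a.vely * dt;
        a.z += a.velz * dt;
    }
}

void moveAll(ActorsSoA& actors, float dt) {
    const std::size_t n = actors.size();
    assert(actors.velx.size() == n);  // all arrays are expected to stay in step
    for (std::size_t i = 0; i < n; ++i) {
        actors.x[i] += actors.velx[i] * dt;
        actors.y[i] += actors.vely[i] * dt;
        actors.z[i] += actors.velz[i] * dt;
    }
}
```

The second layout is also where the readability cost comes from: an "actor" is no longer one object but an index into several arrays, and every field that is added has to be kept in step by hand or by generated code.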

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 10:31 am

Graf Zahl wrote:It really depends on what you need. 9 years ago I had the choice of going mid-range or spending €200 more on a more powerful CPU. So, knowing my gaming habits, I merely bought a low mid-range GPU but invested a bit more into the CPU. I still run that system, admittedly it has problems decoding 4K videos but since I cannot display them, who cares?

But I am slowly reaching the stage where an upgrade may make sense because for many tasks the 4 core CPU and 8 GB of RAM won't cut it anymore. Still, getting 9 years of life out of a computer surely isn't bad at all! With a weaker CPU I may have had to upgrade 4 years ago already.

Given that I believe you're using a first gen Core i7, which is similar to the i5-760 that I was using, I definitely disagree here. It's a little hard to say for sure since I don't know of anyone who's done retrospective benchmarking on the first gen i5 vs i7, but let's assume for a minute that the effectiveness of hyperthreading is similar to that of Skylake/Kaby Lake. What we've seen is that it was only in the last couple of years that the additional threads have proven particularly useful for games. At which point, if you upgraded in 2017/2018, you'd be sitting with a 6 or 8 core processor. Combined with architectural and other platform improvements, even if this doesn't necessarily help you for gaming, modern processors shred compiles compared to Nehalem. (Although I'm judging compile performance on Linux, which has less of a file IO bottleneck.)

Even if one believed that AMD had no hope of recovering and assumed Intel would release mainstream quad cores forever, my dual core Broadwell laptop was actually able to keep pace with my i5-760 in compiling, although it's been too long for me to recall the exact numbers and there may have been a few other variables to consider. In any case the gap was a lot smaller than Intel's stagnation would have suggested.

Now I get that there are other variables (cost of a new motherboard, memory, etc), and I'm not necessarily saying that a Nehalem system isn't still very usable today. But I do feel like certain people get so caught up in trying to avoid upgrading that they actually do themselves a disservice.

I personally run a high end system now, but that's because I realize that with the time savings I get, the high end processor pays for itself. Still targeting a 3-5 year upgrade cycle myself. Or more specifically an every-other-generation cycle; the only reason I haven't upgraded to 3rd gen Threadripper is that none of the boards have 5 PCIe slots like my X399 board does.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 11:22 am

Well, my computer still runs any but the most overblown Doom levels at 60+ fps. It also compiles the GZDoom project in less than a minute.
From that point there really was no good reason to upgrade earlier. Since a full upgrade costs €1000, it must be worth that - so far it wasn't.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 12:25 pm

Graf Zahl wrote: such an approach is very, very hostile toward code readability and refactorability...The loss of productivity and maintainability will inevitably take its toll here.

Then why do the most popular game engines (Unreal, Unity) use a similar approach to store data, an entity component system? It literally works the same way: group values that are mostly used together and make them a separate class.
Or is it used only to obtain actors with multiple, different behaviors (dynamic light, model, collision, etc) without using multiple inheritance?
Or are these similar, but not identical, ways to manage large chunks of data?

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 12:33 pm

Blzut3 wrote:By the way you do know "server-grade memory" means slower with higher latency right? Granted the modules usually have a lot of overclocking headroom, but most server boards don't allow taking advantage of that.

Pity, I naturally assume server things are better than consumer things :( I often see server CPUs in the same family clocked slower than the slower clock speed of the desktop chips and have wondered why for a long time. I guess they are hoping for more parallel use and thus better use of more cores.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 12:41 pm

This is very different. In these engines the actor is not the entity, it owns the entity. It is more a means of encapsulation than of cache friendliness, so that the game code is free to detach its own data from the engine, e.g. have actors without an entity, actors with multiple entities, or entities attached to very different game objects. This lack of separation is admittedly one of the bigger problems in Doom, because to render a sprite you need an actor. If actors were separated from entities you could have effect actors managing multiple sprites, or easily build a better particle system. But Doom's design, where these two areas are not well separated, makes things a lot harder.
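
A rough C++ sketch of the "actor owns the entity" arrangement described above. Everything here (`SpriteEntity`, `EntityStore`, `Actor`, `makeExplosion`) is a hypothetical name invented for the example; it is not the API of Unreal, Unity, or GZDoom, only an illustration of the separation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical engine-side renderable: the renderer only ever sees these.
struct SpriteEntity {
    float x, y, z;
    int spriteId;
};

// Engine-side storage. Handles are plain indices into the array.
class EntityStore {
public:
    using Handle = std::size_t;

    Handle create(const SpriteEntity& e) {
        entities_.push_back(e);
        return entities_.size() - 1;
    }
    SpriteEntity& get(Handle h) {
        assert(h < entities_.size());
        return entities_[h];
    }
    std::size_t count() const { return entities_.size(); }

private:
    std::vector<SpriteEntity> entities_;
};

// Game-side object: it owns zero, one, or many entities instead of being one.
struct Actor {
    int health = 100;
    std::vector<EntityStore::Handle> entities;
};

// An effect actor that manages several sprites at once.
Actor makeExplosion(EntityStore& store, float x, float y, float z, int pieces) {
    Actor fx;
    fx.health = 0;  // pure effect, no gameplay state needed
    for (int i = 0; i < pieces; ++i) {
        fx.entities.push_back(store.create({x + float(i), y, z, 100 + i}));
    }
    return fx;
}
```

With this split a logic-only actor simply has an empty `entities` list, while an effect can own as many sprites as it likes, which is the flexibility that is missing when a sprite can only be rendered through an actor.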

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 12:44 pm

MartinHowe wrote:
Blzut3 wrote:By the way you do know "server-grade memory" means slower with higher latency right? Granted the modules usually have a lot of overclocking headroom, but most server boards don't allow taking advantage of that.

Pity, I naturally assume server things are better than consumer things :(

Better is relative to the use case.

MartinHowe wrote:I often see server CPUs in the same family clocked slower than the slower clock speed of the desktop chips and have wondered why for a long time. I guess they are hoping for more parallel use and thus better use of more cores.


For a server the reliability factor is very important. And yes, single process performance is far less important compared to good parallelism. The individual workloads are rarely that demanding, but a server needs to process lots of them.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 12:54 pm

Graf Zahl wrote:the only bloat we have to contend with is AMD's lousy OpenGL performance.

Okay, I am going to bite because this claim keeps getting reiterated. Let's test GZDoom against a reference rasterizer from both Nvidia and AMD then. That there is a regression is one thing, but the orders of magnitude that get mentioned in various terms are another.

This was also asked years ago, but to this day no definitive consensus on this myth, for better or worse, has been reached, which is also the reason why I'm bringing this up. Back then a consensus never came because the requester went on to ask for it in increasingly hostile ways.

The worst part of bringing all this up? I can't find that forsaken thread anymore, nor the user's name. But it listed the results of a reference OpenGL implementation, and it challenged Graf to run that test on ATI/AMD hardware. Graf's words were, as far as I can remember, along the lines of "Test it for me instead and we will talk." The requester didn't take that kindly however and called dibs.

It's on the tip of my tongue, but I cannot for the life of me come up with the user's name or the thread. It's annoying because it severely weakens the argument I'm trying to make here. :cry:

Edit: Hold on to your hineys. I just remembered who it was. Thank you, old me. (Contains some interesting commentary regarding the AMD regression).
Edit 2: This 2019 post by Graf explains things. But this 2010 post by Rachael agreed with VC. And this 2014 post by Leilei suggested a similar thing to test GZ against the Mesa3D software rasterizer (In the sense of setting a reference)
Edit 3: Found it.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 1:20 pm

AMD's lousy performance is tied to one property of their driver that has been an issue for at least 12 years and that has never changed: Issuing a draw call (i.e. calling glEnd in immediate mode, or glDrawArrays or glDrawElements) blocks the entire thread it runs on for a significant amount of time. There's no way around it, vertex buffers change nothing about it *AT ALL*.
The only way to speed up AMD with OpenGL is to reduce the number of draw calls - without incurring even more overhead by doing so.
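
As an illustration of what "reducing the number of draw calls" can mean in practice, here is a small C++ sketch that merges per-wall requests sharing a texture into one batch, so that each batch could go out as a single draw call. The `DrawItem` and `Batch` types and the `buildBatches` function are hypothetical and are not how GZDoom organises its renderer. Note that the sort and the vertex copying are exactly the kind of extra CPU overhead the caveat above warns about, and that reordering is not safe for geometry that depends on draw order (translucent surfaces, for instance).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-wall draw request: a texture plus a run of vertices.
struct DrawItem {
    int texture;
    std::vector<float> vertices;  // flattened x, y, z triples
};

// One merged submission: everything sharing a texture goes out together.
struct Batch {
    int texture;
    std::vector<float> vertices;
};

// Group items by texture so each texture costs one draw call instead of one
// per item. Items are taken by value because they get reordered.
std::vector<Batch> buildBatches(std::vector<DrawItem> items) {
    // stable_sort keeps the submission order within each texture.
    std::stable_sort(items.begin(), items.end(),
                     [](const DrawItem& a, const DrawItem& b) {
                         return a.texture < b.texture;
                     });

    std::vector<Batch> batches;
    for (const DrawItem& item : items) {
        if (batches.empty() || batches.back().texture != item.texture) {
            batches.push_back(Batch{item.texture, {}});
        }
        std::vector<float>& out = batches.back().vertices;
        out.insert(out.end(), item.vertices.begin(), item.vertices.end());
    }
    assert(batches.size() <= items.size());
    return batches;
}
```

Four walls using two textures collapse into two batches here; whether that is a net win depends on how the time spent building the batches compares with the per-call cost in the driver.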

VortexCortex's efforts would have led to nothing - his main argument was the same old "immediate mode = bad, vertex buffers = good" that you can read everywhere but that totally sidesteps the cold hard truth of a blocking API call in the driver being performance poison.
Based on witnessed performance, NVidia only queues the request but actually performs the dispatch on a worker thread, freeing the app side dispatcher to do more work during that time.
This also gets somewhat confirmed when considering that it is beneficial on NVidia not to let the dispatcher run at full speed but to interleave draw calls with some other processing. GZDoom does this by deferring vertex generation for walls until the draw loop. If I pregenerate them in the collection loop, the engine becomes slower, because the collection loop takes longer and the draw call dispatcher occasionally gets stalled because the worker thread isn't fast enough.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 1:27 pm

Redneckerz wrote:
Graf Zahl wrote:the only bloat we have to contend with is AMD's lousy OpenGL performance.

Okay, I am going to bite because this claim keeps getting reiterated. Let's test GZDoom against a reference rasterizer from both Nvidia and AMD then. That there is a regression is one thing, but the orders of magnitude that get mentioned in various terms are another.

This was also asked years ago, but to this day no definitive consensus on this myth, for better or worse, has been reached, which is also the reason why I'm bringing this up. Back then a consensus never came because the requester went on to ask for it in increasingly hostile ways.

The worst part of bringing all this up? I can't find that forsaken thread anymore, nor the user's name. But it listed the results of a reference OpenGL implementation, and it challenged Graf to run that test on ATI/AMD hardware. Graf's words were, as far as I can remember, along the lines of "Test it for me instead and we will talk." The requester didn't take that kindly however and called dibs.

It's on the tip of my tongue, but I cannot for the life of me come up with the user's name or the thread. It's annoying because it severely weakens the argument I'm trying to make here. :cry:

Edit: Hold on to your hineys. I just remembered who it was. Thank you, old me. (Contains some interesting commentary regarding the AMD regression).
Edit 2: This 2019 post by Graf explains things. But this 2010 post by Rachael agreed with VC. And this 2014 post by Leilei suggested a similar thing to test GZ against the Mesa3D software rasterizer (In the sense of setting a reference)
Edit 3: Found it.

First of all - most of what I see you doing here is bringing up old drama just for the sake of it.

Second of all - I wish I knew then what I know now.

Third of all - AMD is provably worse with OpenGL on Windows. It's not just GZDoom, it's every OpenGL game in existence. Does AMD make good cards? Sure. But the OpenGL drivers are trouble. Always have been. Plus, literally every other major update, AMD introduces new bugs in their OpenGL implementation. It's simply non-stop fuckery on that front. The only reason it works better on Linux is because AMD is a lot friendlier to open source on Linux than NVidia ever was, and therefore AMD drivers are no longer a colossal fuckfest, it can be improved by the community, there.

I have no doubt that when it comes down to the bare metal that AMD cards are cheaper and can outperform equivalent generation NVidia cards in at least half or more tests in the bare-bone metrics. But when you add the cruft on top of it that is the AMD drivers, that all goes right down the drain.

I really would appreciate you not bringing that up again, much less name dropping me for something I said a decade ago.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 2:09 pm

Graf Zahl wrote:AMD's lousy performance is tied to one property of their driver that has been an issue for at least 12 years and that has never changed: Issuing a draw call (i.e. calling glEnd in immediate mode, or glDrawArrays or glDrawElements) blocks the entire thread it runs on for a significant amount of time. There's no way around it, vertex buffers change nothing about it *AT ALL*.
The only way to speed up AMD with OpenGL is to reduce the number of draw calls - without incurring even more overhead by doing so.

Yeah, that is what the 2019 posting from your end is about. I wish I had found that sooner.

I don't agree with VC's efforts back then, but the general idea - testing against a reference to determine what the cause of the significant regression is - is not an unreasonable one. As the Edit post mentions, I don't disagree that there is a performance deficit due to AMD's driver implementation - I am more interested in why said deficit is so significant, to the point of being unreasonable.

ZDoomGL was tested against a reference implementation, and that supported what users were already experiencing back then (that ZDoomGL's performance flunked against the GZDoom builds of the time). The reference implementation test showed that ZDoomGL had half the performance of GZDoom and that unoptimized code was the cause.

Something which you also said back in the day, either way. But now there was permanent and irrefutable evidence for that statement.

Rachael wrote:First of all - most of what I see you doing here is bringing up old drama just for the sake of it.

You know me better than that. I'm not bringing it up just to stick it to Graf - heck, the first Edit highlights what I was after back in 2020. It's about the general concept of a reference implementation, not VC's execution of that idea years ago.

Rachael wrote:Third of all - AMD is provably worse with OpenGL on Windows. It's not just GZDoom, it's every OpenGL game in existence. Does AMD make good cards? Sure. But the OpenGL drivers are trouble. Always have been. Plus, literally every other major update, AMD introduces new bugs in their OpenGL implementation. It's simply non-stop fuckery on that front. The only reason it works better on Linux is because AMD is a lot friendlier to open source on Linux than NVidia ever was, and therefore AMD drivers are no longer a colossal fuckfest, it can be improved by the community, there.

Be that as it may, and I don't disagree there. However, GCN cards are known for their longevity in newer games (which is undoubtedly traceable to the simple fact that the last-gen consoles used GCN-based GPUs, for one).

Rachael wrote:I have no doubt that when it comes down to the bare metal that AMD cards are cheaper and can outperform equivalent generation NVidia cards in at least half or more tests in the bare-bone metrics. But when you add the cruft on top of it that is the AMD drivers, that all goes right down the drain.

I mean, it makes sense. Even in an extreme edge case like the demoscene, Nvidia often gets the nod for stability (though most of the issues center on specific DirectX-related DLLs). But for me, it's about the performance deficit exhibited by the OGL driver on AMD hardware. Based on the implementation, I would expect a regression, but not of the orders of magnitude on display here.

That's the only reason I'm arguing the reference implementation case - to determine whether the performance regression of GZDoom on OGL AMD hardware is this significant when exposed to a reference test case instead of a user test case.

Rachael wrote:I really would appreciate you not bringing that up again, much less name dropping me for something I said a decade ago.

I was collecting evidence to support the general idea of a reference implementation, not to showcase support for the requester's method of trickling down on Graf. If I gave that impression, I apologize.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 2:26 pm

Redneckerz wrote:I don't agree with VC's efforts back then, but the general idea - testing against a reference to determine what the cause of the significant regression is - is not an unreasonable one. As the Edit post mentions, I don't disagree that there is a performance deficit due to AMD's driver implementation - I am more interested in why said deficit is so significant, to the point of being unreasonable.


I cannot look into the driver - but the overall performance characteristics point to what I already said:
- AMD performs the entire render call dispatch - validation, setup, etc. on the calling thread.
- NVidia just queues the request and processes on a background worker thread. This may take just as long but doesn't block the calling thread which is then free to do more work while the draw call gets processed.
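
The difference between the two behaviours can be sketched in a few lines of C++. This is only a model of the observation above, not a description of either vendor's driver: `DriverWork`, `BlockingDispatcher` and `QueuedDispatcher` are hypothetical names, and to keep the sketch deterministic the queue is drained by an explicit `drain()` call where a real implementation would run that loop on a background worker thread.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Stand-in for the expensive part of a draw call (validation, state setup).
// Hypothetical: real drivers do not expose this step to the application.
using DriverWork = std::function<void()>;

// Blocking style: the caller pays for the driver work before draw() returns.
class BlockingDispatcher {
public:
    void draw(const DriverWork& work) {
        work();
        ++completed_;
    }
    std::size_t completed() const { return completed_; }

private:
    std::size_t completed_ = 0;
};

// Queued style: draw() only records the request and returns immediately.
// drain() is what a worker thread would run; here it is called by hand.
class QueuedDispatcher {
public:
    void draw(DriverWork work) { pending_.push(std::move(work)); }

    void drain() {
        while (!pending_.empty()) {
            assert(pending_.front() && "empty work item");
            pending_.front()();
            pending_.pop();
            ++completed_;
        }
    }

    std::size_t pending() const { return pending_.size(); }
    std::size_t completed() const { return completed_; }

private:
    std::queue<DriverWork> pending_;
    std::size_t completed_ = 0;
};
```

The total amount of work is the same in both cases, which matches the remark that the queued path "may take just as long"; what changes is that the calling thread gets control back after the cheap enqueue and can generate the next batch of vertices in the meantime.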

Redneckerz wrote:ZDoomGL was tested against a reference implementation, and that supported what users were already experiencing back then (that ZDoomGL's performance flunked against the GZDoom builds of the time). The reference implementation test showed that ZDoomGL had half the performance of GZDoom and that unoptimized code was the cause.


ZDoomGL's code is overall half as fast - that matches my own analysis. But on top of that it is doing a readback of the GL matrices - which causes a bad render stall. This is actually what kills it. And from looking at the code, this may also be what kills Polymer for Build - but I haven't done any closer analysis there.
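
One common way to avoid reading matrices back from GL (for example with a call like glGetFloatv(GL_MODELVIEW_MATRIX, ...)) is to keep an application-side copy of the matrix stack and query that instead. The sketch below assumes column-major 4x4 matrices as OpenGL uses them; `MatrixStack` and the helper functions are illustrative names, not code from ZDoomGL, GZDoom or Polymer, and only translation is implemented to keep it short.

```cpp
#include <array>
#include <cassert>
#include <vector>

using Mat4 = std::array<float, 16>;  // column-major: element (row, col) is m[col * 4 + row]

Mat4 identity() {
    Mat4 m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    return m;
}

Mat4 translation(float x, float y, float z) {
    Mat4 m = identity();
    m[12] = x;
    m[13] = y;
    m[14] = z;
    return m;
}

// result = a * b for column-major matrices.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 r{};
    for (int col = 0; col < 4; ++col)
        for (int row = 0; row < 4; ++row)
            for (int k = 0; k < 4; ++k)
                r[col * 4 + row] += a[k * 4 + row] * b[col * 4 + k];
    return r;
}

// Application-side shadow of the modelview stack. The current matrix can be
// read at any time without asking the driver for it.
class MatrixStack {
public:
    MatrixStack() { stack_.push_back(identity()); }

    void push() { stack_.push_back(stack_.back()); }
    void pop() {
        assert(stack_.size() > 1);
        stack_.pop_back();
    }
    void translate(float x, float y, float z) {
        stack_.back() = multiply(stack_.back(), translation(x, y, z));
    }
    const Mat4& current() const { return stack_.back(); }

private:
    std::vector<Mat4> stack_;
};
```

The renderer would upload `current()` when it needs the matrix on the GPU, so data only ever flows from the application to the driver and the pipeline never has to be flushed to answer a query.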

Redneckerz wrote:Be that as it may, and I don't disagree there. However, GCN cards are known for their longevity in newer games (which is undoubtedly traceable to the simple fact that the last-gen consoles used GCN-based GPUs, for one).


AMD is fine with D3D and Vulkan. The problem is solely with OpenGL. But that's an important point: If you actively use some OpenGL software, AMD is utterly toxic.

Redneckerz wrote:That's the only reason I'm arguing the reference implementation case - to determine whether the performance regression of GZDoom on OGL AMD hardware is this significant when exposed to a reference test case instead of a user test case.


Here's your reference case: GZDoom's Vulkan backend does not exhibit any of these issues, it is just as fast as on NVidia hardware.

Re: Multithreading Doom's playsim

Wed Mar 24, 2021 3:35 pm

MartinHowe wrote:Pity, I naturally assume server things are better than consumer things :( I often see server CPUs in the same family clocked slower than the slower clock speed of the desktop chips and have wondered why for a long time. I guess they are hoping for more parallel use and thus better use of more cores.

As Graf said, server grade stuff is better for a specific definition of better. Hardware performance isn't a 1 dimensional scale. In the most general sense, server hardware is binned for throughput at the cost of latency and peak performance.

  • Server memory is probably the most meaningless difference once you get past ECC vs non-ECC. It's all the same dies, but server memory is strictly binned to JEDEC specifications. That is, you could theoretically have a die that hits JEDEC spec and not a clock faster. Most consumer memory modules are rated well beyond JEDEC. Since a server memory module uses the same dies, one could theoretically take a server DIMM and overclock it to perform like gamer memory, but this requires a board that can do memory overclocking and supports server memory. In the case of unbuffered DIMMs you can find this with Ryzen. I personally am running ECC 2666MHz UDIMMs overclocked to 2933MHz with vastly reduced timings. I did this because I could, and the cost of ECC wasn't (at the time at least) that much higher than non-ECC memory. But since the server memory isn't binned for low latency/high speeds, I took the risk of getting even one dud die that doesn't overclock at all. Fortunately I've been rock solid.
  • Server hardware does allow you to run RDIMM/LRDIMM/FBDIMMs, which allow you to get higher capacities, but RDIMMs do have higher latency, and LRDIMMs even higher. I never was able to find a clear answer on how much latency we're talking about when I was doing research on the topic, since the target audience doesn't care. Either you need the capacity or you don't.
  • As an aside there was once a time where ECC actually was a performance hit in exchange for stability. As I understand it this stopped being a thing around the time of DDR2.
  • Server CPUs and consumer CPUs have largely been identical except for the clocks and core counts. Here server hardware is binned for lower voltage which allows higher average clocks for a given amount of power while potentially reducing maximum clocks. HOWEVER, I must note that Epyc 2nd generation has different firmware from Ryzen/Threadripper. In the case of Epyc the prefetcher/branch predictor is tuned differently to increase throughput, and IIRC the turbo algorithm is altered to produce more consistent clocks.
  • Having more cores sometimes comes with hidden performance deficits. Specifically in many generations the highest core count parts are effectively equivalent to having two (or more) sockets in one. Be it literally as in the case of Socket 771, or AMD's Socket C32 and G34. Or as sort of an intermediate speed tier which we can see with the dual ring Xeons (Haswell and Broadwell), or first gen Epyc and Threadripper.
  • Server motherboards allow for multiple CPUs, and while having more compute available will extend the life of a system long term, this isn't a linear increase. Not just because not everything is parallel, but also because the CPU to CPU interconnect is slow. Most server workloads are largely independent processes and can avoid having one socket talk to the other. But consumer workloads are generally not of this type. In your case you avoid some of the problem because Socket 771 uses a traditional northbridge, and Pentium D and Core 2 Quad already had to bounce off the northbridge to communicate with the other cores. But with tighter integration the difference between in socket and out of socket communication has increased.
  • Server GPUs are largely the same as the consumer GPU, but you'll be arbitrarily locked out of things like fan control. Driver optimizations are different, but of course if you use a Pro AMD GPU on Linux such optimizations don't exist. That and the pro drivers have at times not included the game optimizations for whatever reason. But at least Autodesk will acknowledge your issues.
  • Server storage devices are about the only thing that is universally better than consumer stuff, but in the case of hard drives this comes at the cost of noise and power (one of the few cases where server hardware actually wants to draw more power). I'm not aware of any particular disadvantage to server SSDs, they just cost a lot since they usually have more flash for higher endurance.
On the second-hand market server hardware is a lot of fun, since once it comes off lease it depreciates heavily. So despite all the downsides it's hard to argue with an old dual Xeon system if one is paying like $300 to get similar performance in some workloads to a modern $1,000 build, if you ignore the gotchas. But it absolutely is the case that the hardware is tuned for different workloads and isn't strictly better.
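
For the memory bullet above, the usual back-of-the-envelope way to compare speed and timings is to convert CAS latency from cycles to nanoseconds. The C++ sketch below treats the "MHz" ratings as transfer rates in MT/s, as is conventional for DDR memory; the CL values used in the usage note are assumptions for illustration, since the post does not state the actual timings.

```cpp
#include <cassert>

// First-word CAS latency in nanoseconds.
// DDR transfers twice per clock, so the clock in MHz is half the MT/s rating
// and one cycle lasts 2000 / (MT/s) nanoseconds.
double casLatencyNs(int casCycles, double megaTransfersPerSecond) {
    assert(casCycles > 0 && megaTransfersPerSecond > 0.0);
    return casCycles * 2000.0 / megaTransfersPerSecond;
}
```

With assumed timings of CL19 at 2666 and CL16 at 2933, `casLatencyNs(19, 2666.0)` comes out at roughly 14.3 ns and `casLatencyNs(16, 2933.0)` at roughly 10.9 ns, which shows how a modest clock bump combined with tighter timings can cut latency noticeably.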

It's a shame that Intel and AMD started segregating their server and consumer sockets. It was nice how with X79 and X99 you could get a low end quad or hex core and then upgrade to a 12 or 22 core later when they came off lease. Or how you used to be able to get a Xeon E for the mainstream socket, which gave you hyper threading at a lower cost than an i7, at the cost of being multiplier locked and having no iGPU. I suppose you could do the same with low end Xeon and Epyc parts on a server board, but then you're paying the performance penalties from the start.

Re: Multithreading Doom's playsim

Thu Mar 25, 2021 11:18 am

Redneckerz wrote:I don't agree with VC's efforts back then, but the general idea - testing against a reference to determine what the cause of the significant regression is - is not an unreasonable one. As the Edit post mentions, I don't disagree that there is a performance deficit due to AMD's driver implementation - I am more interested in why said deficit is so significant, to the point of being unreasonable.


I don't think testing against a reference implementation is going to tell us much more than we already know. What counts as "unreasonable"? Let me reiterate that AMD's OpenGL drivers have been troublesome from the very beginning. I've had AMD cards off and on for over two decades now, and the only thing that has improved "relatively" in that time was their Direct3D (which used to be a whole lot worse as well).

I can understand getting 20-30 FPS on an AMD sucks, but it's not what I would call "unreasonable". I call that "beyond our control". Unreasonable is when there are actual graphical glitches, or it doesn't render at all, or even just crashes.

Despite your justifications given here I still think you are quoting as "reasonable" what I see as "entitled". For two reasons:

1) From Graf's point of view, GZDoom is purely a hobby/free-time project.

2) Fixing OpenGL code is a massive pain in the ass. No doubt there is a desire to do it in the first place but I am pretty sure leaving an OpenGL debugging session does not give feelings of euphoria, satisfaction, and happiness. You're asking him to trawl in the mud for an issue that is beyond his control and that he does not have the hardware for in the first place, and for what, maybe a minor performance boost if that, for maybe 15% of users? (based on a broad assumption that only 30% of people can't use Vulkan and maybe half of those use AMD - but that number is truthfully likely to be far less anyhow)

The reason why this is such a sticking point for me is that I invited Graf to remotely debug on my own machine, which at the time had an AMD HD4850, an entry-level OpenGL 3.3 card. At the time the card was about a year from the end of its support life, when AMD was to terminate support for it completely.

And the thing I remember most was just how much he went through the code trying to change things, this and that and the other, and just simply nothing seemed to work. In the end he was able to figure out what the actual problem was (the stencil buffer was effectively broken) but it was an issue that he could not fix at the time. I remember that I felt bad putting him through that.

At least these days the AMD drivers work most of the time. (keyword: most) - yes there are driver pushes that break things but the majority of the time things work, even if they are slow. Furthermore, the Vulkan alternative that is now available works much better anyhow, and pretty much negates the need to work on AMD GL support to begin with because honestly, that platform is rapidly on its way out now, it's really not worth the time and energy to invest into it. If you can run Vulkan, do so.

Honestly, if I didn't need OpenGL support for other things I would have said "fuck you" to NVidia a long time ago. I hate NVidia as a company even more than I hate AMD, and AMD's cards are more reasonably priced. But even today (and outside of GZDoom) I still need good OpenGL support and unfortunately NVidia holds the monopoly on that right now.

Redneckerz wrote:Based on the implementation, I would expect a regression, but not of the orders of magnitude on display here.

The thing NVidia and AMD have in common is they are both shitty companies looking out only for the bottom line.

The only point where they diverge is where their priorities are. OpenGL simply isn't it for AMD, and I am fairly certain NVidia is well aware of this. Once OpenGL-to-Vulkan wrappers start becoming more available and widespread this paradigm will no doubt shift, but that point will also be a nail in the coffin for OpenGL overall anyhow.

I think this is relevant because you say you don't expect things to be as shitty as they are for AMD - but you fail to realize that AMD is just focusing on what butters their bread. Just like any corporation. Even on NVidia, OpenGL is pretty shitty, it's just less so. Literally anything that uses any other driver interface on Windows simply works better. Be it Vulkan, classic Direct3D, or even D3D12.

Redneckerz wrote:I was collecting evidence to support the general idea of a reference implementation, not to showcase support for the requester's method of trickling down on Graf. If I gave that impression, I apologize.

If it's so important to you, I suggest you read up on the documentation on how to set this up yourself, and then run GZDoom in a debugger with the performance counters enabled. If you really do think there is any insight to be gleaned from this, then I encourage you to collect the data and share it. Keep in mind that using real hardware is always going to give you better information than a reference driver will, but if you do not have an AMD card and you do somehow manage to find one of their reference drivers, then that would be the way to go.

Re: Multithreading Doom's playsim

Thu Mar 25, 2021 11:27 am

Rachael wrote:And the thing I remember most was just how much he went through the code trying to change things, this and that and the other, and just simply nothing seemed to work. In the end he was able to figure out what the actual problem was (the stencil buffer was effectively broken) but it was an issue that he could not fix at the time. I remember that I felt bad putting him through that.


For the record - it was the clip planes, not the stencil buffer, that were broken. And they were so broken that just enabling them made the driver lose it.

Rachael wrote:I think this is relevant because you say you don't expect things to be as shitty as they are for AMD - but you fail to realize that AMD is just focusing on what butters their bread. Just like any corporation. Even on NVidia, OpenGL is pretty shitty, it's just less so. Literally anything that uses any other driver interface on Windows simply works better. Be it Vulkan, classic Direct3D, or even D3D12.


Which tells us clearly that the problem is how OpenGL is designed. This simply cannot work. I still wish there was something more accessible than Vulkan. These low-level APIs are not made for hobbyist projects.