[gizdoom] Lazy palette shader

Post by **dpJudas** » Sun Sep 04, 2016 11:27 pm

One other thing that could probably improve the algorithm a little bit would be to convert the colors to linear space first before calculating the distance. Doom uses sRGB, which has a gamma of 2.2, but a fast approximation is 2.0. The BestColor function thus ends up like this:

Code: Select all

int BestColor (const uint32 *pal_in, int r, int g, int b, int first, int num)
{
	// Convert search color to linear (using 2.0 gamma instead of 2.2 for speed reasons):
	r = r * r;
	g = g * g;
	b = b * b;

	const PalEntry *pal = (const PalEntry *)pal_in;
	int bestcolor = first;
	uint32_t bestdist = 0xffffffff;

	for (int color = first; color < num; color++)
	{
		int x = r - (pal[color].r * (int)pal[color].r);
		int y = g - (pal[color].g * (int)pal[color].g);
		int z = b - (pal[color].b * (int)pal[color].b);
		uint32_t dist = (uint32_t)(x*x) + (uint32_t)(y*y) +(uint32_t)(z*z);
		if (dist < bestdist)
		{
			if (dist == 0)
				return color;

			bestdist = dist;
			bestcolor = color;
		}
	}
	return bestcolor;
}

Edit: removed the divide by 255 because it only causes precision loss for this algorithm.

Post by **Rachael** » Mon Sep 05, 2016 12:08 am

Well - again, you're right.

If you want to test this diff, the cvar has been renamed to "gl_palette_tonemap_algorithm", but I think you already know what the results are going to be.

I had an idea about writing an HSV checker, instead of an RGB checker. I think the code for conversion is already available in v_palette.cpp, maybe? However, doing this for 256k colors will be a bit of a hit on the processing time, I'm guessing. Everything else in the algorithm would be the same; it would just take and compare HSV values instead.

Also in this diff I went ahead and implemented the table clear on CVar change, so a restart is no longer required since it seems to automatically rebuild it. What I would like to see is at least a higher precision in selecting the hue and luminance values; I am not *as* worried about saturation. It seems that by comparing last-values my algorithm was inadvertently correct, but only because of the order of the Doom palette. It's the same algorithm by which I generate custom COLORMAPs and TINTTABs (Heretic/Hexen), though, and I've used it so much in the past 18 years that I have it memorized.

Post by **Gez** » Mon Sep 05, 2016 1:36 am

Eruanna wrote:I had an idea about writing an HSV checker, instead of an RGB checker.

Take a look at SLADE's colorimetry options. You can make its color matcher check in RGB, HSV, or Lab spaces. Honestly I feel the RGB matching is what works best for Doom.

Post by **Rachael** » Mon Sep 05, 2016 6:54 am

dpJudas wrote:One other thing that could probably improve the algorithm a little bit would be to convert the colors to linear space first before calculating the distance. Doom uses sRGB, which has a gamma of 2.2, but a fast approximation is 2.0. The BestColor function thus ends up like this:

Oh wow, I didn't even see that post. That implementation has errors (I suspect overflow), so I am going to try and fix it and then try it with GZDoom. If that works it may even work better in software mode, too.

Gez wrote:Take a look at SLADE's colorimetry options. You can make its color matcher check in RGB, HSV, or Lab spaces. Honestly I feel the RGB matching is what works best for Doom.

I'll take your word for it.

Post by **Rachael** » Mon Sep 05, 2016 8:11 am

Okay, I fixed the new version of BestColor by using doubles instead of uint32_t's. Each squared difference can reach 255 to the power of 4; while a single term technically fits inside a uint32_t, it does not allow for much else, and the sum of three such terms does not fit.

For now, I kept all code for comparison using CVars, but that's not going to be suitable for a pull request. If I do such a thing, I will most likely be removing these CVars and there will be only one algorithm available.

Now - onto the implementation:

I like it, but it does have problems of its own. Mostly - it gets the primary colors correct, but when there are colors that are really distant from the palette (have higher/lower saturations than what's available) it doesn't seem to always pick the best "looking" color.

The best place to try this out is Doom 1, "E3M3": do "warp -800 150" and use "gl_lightmode 8". Pretty much all of E3M7 works as well.

Here's the diff with changes, along with a pre-compiled exe for others to try (since devbuilds are behind):

https://mega.nz/#!9JkjABwZ!Fh8ci7vKUPZc ... drE4T7Qi5k (reuploaded, was missing gzdoom.pk3 which also changed since latest devbuild)

And here's just the diff:

gzdoom-custom-algorithms.diff.gz: (1.72 KiB) Downloaded 107 times

To test this implementation, please use

Code: Select all

gzdoom +set gl_palette_tonemap_algorithm 2 +set r_colormatcher_algorithm 1

This only needs to be done once - the CVars will save to your config.

If you change r_colormatcher_algorithm, in both Software and GL mode the tables do not automatically get rebuilt. A "restart" ccmd will fix that. gl_palette_tonemap_algorithm does not have this problem and will take effect immediately. These CVars are in place for comparison and notation purposes.

With all that said - maybe there's a happy medium somewhere? A slightly lower gamma ramp, if available, maybe?

Also - I really do not like the idea of setting a "max" distance. The reason why I initialize the distance on the first iteration of the loop is it allows me to alter the algorithm any way I choose without having a maximum other than what the number structures themselves support. Notice how even you had to set a new max after putting in the gamma ramp? If we fix the v_palette.cpp code, I really would prefer the first iteration to initialize the numbers, rather than allowing it to be done outside the loop as before. I think the code looks cleaner, and it is more flexible. Correct me if I am wrong.

One more thing - if we do put in a floating exponent gamma ramp (such as 2.2 or 1.4 or whatever) - you can still retain the speed in processing simply by first going ahead and applying exponents to (0-255) and storing the results in an array, since these are the only numbers you are going to use, anyway. Then you only have to read the array to get the proper results, rather than recalculating it 768k times. Will speed up processing tremendously. Doom's original Gamma correction system did hard-coded precalculated arrays for exactly this reason.
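The precalculated table described above could look roughly like the following. This is only a sketch: the function name `MakeGammaTable` and the normalization of channel values to the 0..1 range are assumptions made for illustration, not code from the diff.

```cpp
#include <array>
#include <cmath>

// Builds a 256-entry table of pow(i / 255, gamma), so the inner matching
// loop can replace every pow() call with a single array read.
// Name and 0..1 normalization are illustrative assumptions.
std::array<double, 256> MakeGammaTable(double gamma)
{
	std::array<double, 256> table{};
	for (int i = 0; i < 256; i++)
		table[i] = std::pow(i / 255.0, gamma);
	return table;
}
```

With a table like this, `table[pal[color].r]` stands in for the per-channel exponent, and the 256 pow() calls happen once instead of once per comparison.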

Post by **dpJudas** » Mon Sep 05, 2016 10:07 pm

Eruanna wrote: Okay, I fixed the new version of BestColor by using doubles instead of uint32_t's. Each squared difference can reach 255 to the power of 4; while a single term technically fits inside a uint32_t, it does not allow for much else, and the sum of three such terms does not fit.

Oops. Yes, my code did overflow a 32 bit integer. For a final version I would probably have divided by two (a very tiny loss of precision in this case), but your double version works too of course.
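For reference, the overflow comes from the ranges involved: a squared channel is at most 255 * 255 = 65025, so one squared difference can reach 65025 * 65025 (about 4.23 billion), which is beyond a signed 32-bit int, and the sum of three such terms is beyond an unsigned one. A third option besides halving or doubles is to accumulate in 64-bit integers. The sketch below assumes a hypothetical `Rgb` struct standing in for PalEntry and is not the code from either poster's diff.

```cpp
#include <cstdint>

// Hypothetical stand-in for ZDoom's PalEntry; only the three channels matter here.
struct Rgb { uint8_t r, g, b; };

// Nearest palette entry in approximate linear space (gamma 2.0).
// Searches indices [first, num), like the function posted earlier in the thread.
int BestColorLinear64(const Rgb *pal, int r, int g, int b, int first, int num)
{
	// Squaring approximates the sRGB-to-linear conversion; results fit in 17 bits.
	const int64_t lr = int64_t(r) * r;
	const int64_t lg = int64_t(g) * g;
	const int64_t lb = int64_t(b) * b;

	int bestcolor = first;
	uint64_t bestdist = UINT64_MAX;

	for (int color = first; color < num; color++)
	{
		const int64_t x = lr - int64_t(pal[color].r) * pal[color].r;
		const int64_t y = lg - int64_t(pal[color].g) * pal[color].g;
		const int64_t z = lb - int64_t(pal[color].b) * pal[color].b;
		// Each term is at most 65025^2, so the sum stays far below 2^63.
		const uint64_t dist = uint64_t(x * x + y * y + z * z);
		if (dist < bestdist)
		{
			if (dist == 0)
				return color;

			bestdist = dist;
			bestcolor = color;
		}
	}
	return bestcolor;
}
```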

Eruanna wrote:I like it, but it does have problems of its own. Mostly - it gets the primary colors correct, but when there are colors that are really distant from the palette (have higher/lower saturations than what's available) it doesn't seem to always pick the best "looking" color.

My knowledge of color theory is a bit too poor to suggest anything else than the calculation of distance in linear space I already did. I know that for light calculations linear colors are very important, but for a 'best color' algorithm? I honestly don't know.

I can create an SSE2-accelerated version of the function (since apparently the original version needed enough speed to get an MMX version), but this only makes sense if you'd rather have this color match algorithm over the original.

Eruanna wrote:With all that said - maybe there's a happy medium somewhere? A slightly lower gamma ramp, if available, maybe?

You can certainly experiment with that. Just change the "r = r*r" stuff to "r = pow(r, gamma)" and see what effects you get. Linear comparison is with 2.2 as gamma, and sRGB is with 1.0, which is the same as the original algorithm.
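To make that kind of experiment concrete, here is a small sketch of a distance function with the exponent as a parameter. The helper name `GammaDistance` is hypothetical, and the channels are normalized to 0..1 before the exponent is applied (an assumption, so that results stay on a comparable scale for any gamma).

```cpp
#include <cmath>

// Squared distance between two 8-bit RGB colors after raising each channel
// (normalized to 0..1) to the given exponent. Hypothetical helper for experiments;
// gamma = 1.0 reduces to the plain RGB distance of the original algorithm.
double GammaDistance(int r1, int g1, int b1, int r2, int g2, int b2, double gamma)
{
	auto lin = [gamma](int c) { return std::pow(c / 255.0, gamma); };
	const double x = lin(r1) - lin(r2);
	const double y = lin(g1) - lin(g2);
	const double z = lin(b1) - lin(b2);
	return x * x + y * y + z * z;
}
```

As an example of how much the exponent matters: with only black and white to choose from, gray (128,128,128) is nearer to white at gamma 1.0 but nearer to black at gamma 2.2.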

Eruanna wrote:Also - I really do not like the idea of setting a "max" distance.

I normally would write the code as you do too, but in this case I think the usage of a "max" distance is to get rid of an extra comparison in the speed critical inner loop.

I agree that for general code one should always go for a cleaner, more readable, and more flexible layout. The only exception, I'd say, is when you can no longer afford the luxury, which is often the case in any part of ZDoom where there's suddenly MMX or SSE involved. Whoever wrote the original function found the entire C function to be too slow and replaced it with MMX.
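One possible middle ground between a "max" sentinel and a first-iteration check inside the loop is to seed the running minimum from the first candidate before the loop starts. The sketch below is a generic illustration with a hypothetical name (`IndexOfSmallest`); it is not taken from v_palette.cpp or from either diff.

```cpp
#include <cstddef>

// Returns the index of the smallest element in dist[0..count), or -1 if empty.
// The running minimum is seeded from the first element instead of a "max"
// constant, so nothing needs updating if the distance type or range changes,
// and the loop body still contains only one comparison.
template <typename T>
int IndexOfSmallest(const T *dist, std::size_t count)
{
	if (count == 0)
		return -1;

	std::size_t best = 0;
	T bestdist = dist[0];
	for (std::size_t i = 1; i < count; i++)
	{
		if (dist[i] < bestdist)
		{
			bestdist = dist[i];
			best = i;
		}
	}
	return int(best);
}
```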

Eruanna wrote:One more thing - if we do put in a floating exponent gamma ramp (such as 2.2 or 1.4 or whatever) - you can still retain the speed in processing simply by first going ahead and applying exponents to (0-255) and storing the results in an array, since these are the only numbers you are going to use, anyway. Then you only have to read the array to get the proper results, rather than recalculating it 768k times. Will speed up processing tremendously. Doom's original Gamma correction system did hard-coded precalculated arrays for exactly this reason.

If that optimization alone makes it meet the speed requirements, yes. Not sure why this function needs to be so fast, but apparently it does. If it still needs to be faster you can apply the table pre-process, "max" distance, and SSE all at the same time.

Post by **Rachael** » Mon Sep 05, 2016 10:58 pm

dpJudas wrote:Not sure why this function needs to be so fast, but apparently it does.

Potentially 67.1~ million loops. That's just for the tonemap part.

You know, I am amazed it works as fast as it does with that many iterations.

Here's some other uses of that function:

ZDoom's internal rgb555 table: 8.4~ million iterations
ZDoom creates a new colormap/fogmap (live while playing): 2~ million iterations
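The first two figures can be reproduced with a little arithmetic, assuming the tonemap covers 6 bits per channel (18 bits, the "256k colors" mentioned earlier), the rgb555 table covers 5 bits per channel, and every entry is compared against all 256 palette colors. The constant names are made up for this sketch.

```cpp
#include <cstdint>

// Assumptions: 18-bit tonemap (2^18 entries), 15-bit rgb555 table (2^15 entries),
// and one BestColor inner-loop pass per palette color for every entry.
constexpr std::uint64_t kPaletteSize  = 256;
constexpr std::uint64_t kTonemapLoops = (1ull << 18) * kPaletteSize; // 67,108,864
constexpr std::uint64_t kRgb555Loops  = (1ull << 15) * kPaletteSize; // 8,388,608

static_assert(kTonemapLoops == 67108864, "about 67.1 million");
static_assert(kRgb555Loops == 8388608, "about 8.4 million");
```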

I'll try and get something working later on. I'm probably going to remove the CVars from my working copy and just use the defaults I suggested earlier - if it does seem like we need to improve ZDoom's original algorithm, I would prefer my code to be submittable soon.

Post by **dpJudas** » Mon Sep 05, 2016 11:26 pm

Eruanna wrote:Potentially 67.1~ million loops. That's just for the tonemap part. You know, I am amazed it works as fast as it does with that many iterations.

You make that sound like it is a lot. That's about 33 frames of pixels at 1920x1080, something zdoom does at 200+ FPS on my computer - on a single core. For a one time calculation of a table this is not a big deal. If that was its only usage I'd personally not optimize the function unless someone noticed slow boot times.

Eruanna wrote:ZDoom creates a new colormap/fogmap (live while playing): 2~ million iterations

Now this is a much better reason. Microstuttering while playing sucks.

Post by **Rachael** » Mon Sep 05, 2016 11:37 pm

You may already know this - and sorry if you do, it's not my intent to explain something you already know.

If you want to test what it "feels" like, CPU-wise, on older systems, Windows Vista and later includes an option to dial down the CPU frequency directly in the power options. Just set the CPU frequency to something like 5% minimum 5% maximum (it will only go as low as your CPU actually allows, don't worry), and then force ZDoom to run on a single core. (Using cmd.exe, you can type "start /affinity 0x1 zdoom.exe" to do this) If ZDoom still runs decently after you do this, you've probably optimized it enough.

The TestFade and TestColor CCMDs are great for testing this function, because they will do those "live" iterations I mentioned. If you notice a delay after using them, then others may notice it too.

It's not a "perfect" emulation, per se, I am sure you already know, because modern CPU's have a lot of optimizations and better cache than older ones, but it does help you to get a feel for what's going on.

Post by **dpJudas** » Mon Sep 05, 2016 11:55 pm

People with older systems are used to waiting anyway.

More seriously, because the old function used MMX I would personally include a SSE intrinsics version for any updated version of it - with the assumption that it was time critical in the past and therefore might still be.

If my goal was to make zdoom boot times better tho, then I'd focus my attention far more on other areas where there is more to gain. Like the JPEG, PNG, and pk3 loaders.

Post by **Rachael** » Tue Sep 06, 2016 12:07 am

I think you're right - the oldest system running ZDoom right now probably isn't going to notice much difference between any ASM algorithm at all and any C algorithm as far as this function goes simply because of how little it is actually used during live play.

However, I will still pre-calculate the exponent table, because there's no way in hell I am running a floating exponent calculation millions of times for a single table, especially with numbers that are guaranteed to repeat a number of times through said calculations.

Post by **leileilol** » Tue Sep 06, 2016 12:11 am

For Engoo I deal with generation at load time only. It's still somewhat playable on a Pentium, though the 18-bit lookup generation on first startup takes too long, making it a bit prohibitive on lower specs (486/P5s).

It would be faster if that 18bpp table was cached to a file keyed by the palette lump's checksum, and then all the map-load color table builds went through that table instead of BestColor.
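A rough sketch of the keying part of such a cache follows. The checksum choice (32-bit FNV-1a), the function names, and the file name pattern are all assumptions for illustration; neither Engoo nor ZDoom is claimed to do it this way.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// 32-bit FNV-1a over the raw palette bytes. Any stable checksum would do;
// this one is picked only because it is short (assumption).
std::uint32_t PaletteChecksum(const std::uint8_t *bytes, std::size_t len)
{
	std::uint32_t h = 2166136261u;
	for (std::size_t i = 0; i < len; i++)
	{
		h ^= bytes[i];
		h *= 16777619u;
	}
	return h;
}

// Hypothetical cache file name derived from the checksum, so that each
// distinct palette gets its own precomputed lookup table on disk.
std::string CacheFileName(std::uint32_t checksum)
{
	char buf[32];
	std::snprintf(buf, sizeof buf, "rgb666_%08x.tbl", (unsigned)checksum);
	return buf;
}
```

On startup the engine would hash the 768 palette bytes, look for the matching file, and only fall back to the full table build when the file is missing.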

FmodEx and all the new actor code already make ZDoom inappropriate for the low Pentium end anyway, and the more that's brought up, the more useful legacy performance features get deprecated in spite of it *cough*r_detail*cough*

Post by **Rachael** » Tue Sep 06, 2016 12:53 am

I didn't get nearly such slow times, even when I crippled my CPU. If that does become a problem we can definitely cache the tables, but I don't think it will be needed.

However, doing pow()'s without tables did increase the load times a lot. It took an additional 2-4 seconds to load the game on V_Init, and then upon entering the game (during tonemap gen) it took another 10 or so seconds. This was on a pretty powerful CPU, too (about 2~ years old). Keep in mind - without pow()'s it was nearly instant before, which is why I think such caching shouldn't be necessary.

I'm going to develop something that creates a pow() table the first time the function is executed only, and then reuses it on every further iteration. That should decrease load times back to normal, again, I think.

Post by **Rachael** » Tue Sep 06, 2016 1:43 am

Alright. Done. This really looks the best so far, in my opinion. Still uses BestColor. @Lei - if you see this, can you test this with any of your older systems you may have available and let me know if the load times are acceptable? (You will have to compile 32-bit builds yourself, though, sorry)

Diff:

patch3.diff.gz: (1.27 KiB) Downloaded 106 times

Precompiled:
https://mega.nz/#!RMV2ST7L!bvefG5ivawEM ... 9hrZiOVgp0

Post by **leileilol** » Tue Sep 06, 2016 1:56 am

Can't at the moment. I'll mention PCem though, since it's a little more reliable for canon instruction cycle timings than most PC emulators/VMs. The only problem is the usual setup and ROM hunts.

IIRC, in Engoo, making an rgb555 table takes 2 seconds on a Pentium 166, and an rgb666 table would take a bit over 20, and that's with BestColor pulled out of qlumpy. On an AM5x86-160 (one of the fastest 486s) this process takes at least a minute.
