Yet another new ZDBSP!

Post by Graf Zahl »

It uses SSE2 and its use is optional. There's exactly one function that's using it.

Post by randi »

And plain old SSE offered virtually no benefit over x87 math, so that's why it's SSE2. (Although that was with VC++, GCC might do better with it; I haven't tried.)

Post by Hirogen2 »

Here are some results. I picked one of the most complex mapsets ever for this: Phobos: Anomaly Return.

Normal nodes.

shanghai$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 8
model name      : AMD Athlon(tm) XP 2000+
stepping        : 0
cpu MHz         : 1666.765
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow up ts
bogomips        : 3335.23
gcc version 4.1.0 (SUSE Linux 10.1)
*** regular compile, regular run (i.e. without SSE*)
16.23s 16.22s 16.37s
*** compiled with -msse -mfpmath=sse, regular run (i.e. with generated SSE1)
14.98s 14.99s 14.89s

Speedup: 8.82% (may vary - the longer the BSP process takes, the more accurate this gets)
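
For anyone who wants to redo the percentages, here is a minimal sketch of the arithmetic, assuming the figure is mean(slower runs) / mean(faster runs) - 1. The function names are made up for illustration and are not part of zdbsp.

```cpp
#include <numeric>
#include <vector>

// Arithmetic mean of a set of wall-clock timings in seconds.
// Assumes `runs` is non-empty.
double MeanSeconds(const std::vector<double>& runs)
{
    return std::accumulate(runs.begin(), runs.end(), 0.0) / runs.size();
}

// Relative gain of `candidate` over `baseline`, in percent.
// Positive means the candidate build finished faster.
double SpeedupPercent(const std::vector<double>& baseline,
                      const std::vector<double>& candidate)
{
    return (MeanSeconds(baseline) / MeanSeconds(candidate) - 1.0) * 100.0;
}
```

Feeding in the Athlon XP timings above (16.23/16.22/16.37 against 14.98/14.99/14.89) gives roughly 8.8%.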

athlon$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 6
model           : 4
model name      : AMD Athlon(tm) processor
stepping        : 2
cpu MHz         : 900.093
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips        : 1801.56
*** regular compile, regular run (i.e. without SSE*)
26.3s 26.28s 26.3s
*** compile with -mmmx -m3dnow
26.35s (I aborted here; almost the same)

$ cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Celeron(R) CPU 2.00GHz
stepping        : 9
cpu MHz         : 1997.313
cache size      : 128 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips        : 3999.55
gcc version 3.3.5 20050117 (prerelease) (SUSE Linux 9.3)
*** regular compile, run with --no-sse
15.13s 15.00s 15.02s
*** regular compile, regular run (i.e. with hand-written SSE2)
15.7s 15.8s 16.3s 15.7s
(slowdown compared to 15.13/15/15.02: 5.48%)
*** compile with -msse -mfpmath=sse, run with --no-sse (i.e. we run with generated SSE1)
14.64s 14.63s 14.68s
(speedup compared to 15.7/15.8/16.3/15.7: 8.36%; compared to 15.13/15/15.02 it is only about 2.7%)
*** compile with -msse -mfpmath=sse, regular run (i.e. using generated SSE1, hand-written SSE2)
15.63s 15.97s 15.66s
*** compile with -msse -msse2 -mfpmath=sse, run with --no-sse (i.e. we run with generated SSE1/SSE2)
15.83s 16.34s 15.82s
*** compile with -msse -msse2 -mfpmath=sse, regular run (i.e. with all the fun)
15.78s 15.92s 15.82s

gwdu105$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 5
model name      : AMD Opteron(tm) Processor 248
stepping        : 10
cpu MHz         : 2191.962
cache size      : 1024 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips        : 4308.99
TLB size        : 1088 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor       : 1
(same thing)
gcc version 3.3.3 (SuSE Linux) (SUSE LINUX Enterprise Server 9 (x86_64))
*** regular compile, regular run (hand-written SSE)
5.48s 5.44s 5.45s
*** compile with -msse -msse2 -mfpmath=sse, regular run (hand-written/generated SSE)
5.48s
*** compile with -msse -msse2 -mfpmath=sse, rip VerifySSE() out of the source code, regular run (only generated SSE)
5.41s
*** regular compile, rip VerifySSE() out
5.51s

Post by Hirogen2 »

GL nodes only this time (zdbsp -x)

athlon-xp$
CFLAGS+="", args+="":
  18.26s 18.32s 18.47s
CFLAGS+="-msse -mfpmath=sse", args+="":
  16.85s 16.83s 16.88s
Speedup: 8.70%

celeron$
CFLAGS+="", args+="--no-sse":
  17.67s 17.68s 18.19s
CFLAGS+="", args+="":
  18.57s 18.99s 18.5s (slowdown compared to the first: 4.70%)
CFLAGS+="-msse -mfpmath=sse", args+="":
  18.37s 18.52s 18.35s
CFLAGS+="-msse -mfpmath=sse", args+="--no-sse":
  17.29s 17.75s 17.96s
CFLAGS+="-msse -msse2 -mfpmath=sse", args+="":
  18.65s 18.98s 18.62s
CFLAGS+="-msse -msse2 -mfpmath=sse", args+="--no-sse":
  18.62s 19.12s 19.27s 18.63s
Did I just show that (Intel's implementation of) SSE2 sucks compared to SSE1? :p Or maybe it's just that the compiler is not too good at SSE2 yet...

(Note for those who can't grasp the combination of -msse and --no-sse: --no-sse turns off the hand-optimized routine and instead uses whatever the compiler generates, which in turn is determined by CFLAGS.)
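
To make that switch concrete, here is a rough sketch of how such a toggle could be wired up. It is not zdbsp's actual code: CrossScalar, CrossSSE2 and SelectCross are hypothetical names, the cross product is only a stand-in for the one hand-optimized function, and the compile-time __SSE2__ check replaces the runtime CPU detection a real build would need.

```cpp
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Plain C++ version; the compiler decides (via CFLAGS) whether this
// becomes x87 or SSE code.
double CrossScalar(double dx, double dy, double px, double py)
{
    return dx * py - dy * px;
}

#if defined(__SSE2__)
// Hand-written SSE2 version of the same computation.
double CrossSSE2(double dx, double dy, double px, double py)
{
    __m128d a = _mm_set_pd(dy, dx);             // low = dx, high = dy
    __m128d b = _mm_set_pd(px, py);             // low = py, high = px
    __m128d prod = _mm_mul_pd(a, b);            // low = dx*py, high = dy*px
    __m128d hi = _mm_unpackhi_pd(prod, prod);   // both lanes = dy*px
    return _mm_cvtsd_f64(_mm_sub_sd(prod, hi)); // dx*py - dy*px
}
#endif

using CrossFn = double (*)(double, double, double, double);

// Choose the implementation once at startup. `noSSE` models a
// --no-sse style command-line switch (assumed behavior).
CrossFn SelectCross(bool noSSE)
{
#if defined(__SSE2__)
    if (!noSSE)
        return CrossSSE2;
#endif
    (void)noSSE;
    return CrossScalar;
}
```

With this layout the switch only decides between the hand-written routine and the plain C++ one; what the plain one compiles to still depends entirely on the flags, which is why the two settings can be combined freely.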


These results should give us something to think about. Let the compiler do the optimizations; it is mostly as good as hand-optimization.
randi wrote: And plain old SSE offered virtually no benefit over x87 math, so that's why it's SSE2. (Although that was with VC++, GCC might do better with it; I haven't tried.)
A benefit of 8% is quite something. Depending on how much math one does, it may go up to 17% (a timing test I did with the oggvorbis encoder some time ago).

All tests were done on zdbsp svn rev 226.