Yet another new ZDBSP!
- Hirogen2
- Posts: 2033
- Joined: Sat Jul 19, 2003 6:15 am
- Operating System Version (Optional): Tumbleweed x64
- Graphics Processor: Intel with Vulkan/Metal Support
- Location: Central Germany
- Contact:
Here are some results. I picked one of the most complex mapsets ever: Phobos: Anomaly Return.
Normal nodes.
Code:
shanghai$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 8
model name : AMD Athlon(tm) XP 2000+
stepping : 0
cpu MHz : 1666.765
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow up ts
bogomips : 3335.23
gcc version 4.1.0 (SUSE Linux 10.1)
*** regular compile, regular run (i.e. without SSE*)
16.23s 16.22s 16.37s
*** compiled with -msse -mfpmath=sse, regular run (i.e. with generated SSE1)
14.98s 14.99s 14.89s
Speedup: 8.82% (may vary - the longer the BSP process takes, the more accurate this gets)
Code:
athlon$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 4
model name : AMD Athlon(tm) processor
stepping : 2
cpu MHz : 900.093
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 sep mtrr pge mca cmov pat pse36 mmx fxsr syscall mmxext 3dnowext 3dnow
bogomips : 1801.56
*** regular compile, regular run (i.e. without SSE*)
26.3s 26.28s 26.3s
*** compile with -mmmx -m3dnow
26.35s (I stopped after one run; almost the same)
Code:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 2
model name : Intel(R) Celeron(R) CPU 2.00GHz
stepping : 9
cpu MHz : 1997.313
cache size : 128 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe cid xtpr
bogomips : 3999.55
gcc version 3.3.5 20050117 (prerelease) (SUSE Linux 9.3)
*** regular compile, run with --no-sse
15.13s 15.00s 15.02s
*** regular compile, regular run (i.e. with hand-written SSE2)
15.7s 15.8s 16.3s 15.7s
(slowdown compared to 15.13/15/15.02: 5.48%)
*** compile with -msse -mfpmath=sse, run with --no-sse (i.e. we run with generated SSE1)
14.64s 14.63s 14.68s
(speedup compared to 15.13/15/15.02: 2.73%; compared to the hand-written SSE2 run 15.7/15.8/16.3/15.7: 8.36%)
*** compile with -msse -mfpmath=sse, regular run (i.e. using generated SSE1, hand-written SSE2)
15.63s 15.97s 15.66s
*** compile with -msse -msse2 -mfpmath=sse, run with --no-sse (i.e. we run with generated SSE1/SSE2)
15.83s 16.34s 15.82s
*** compile with -msse -msse2 -mfpmath=sse, regular run (i.e. with all the fun)
15.78s 15.92s 15.82s
Code:
gwdu105$ cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 248
stepping : 10
cpu MHz : 2191.962
cache size : 1024 KB
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 4308.99
TLB size : 1088 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp
processor : 1
(same thing)
gcc version 3.3.3 (SuSE Linux) (SUSE LINUX Enterprise Server 9 (x86_64))
*** regular compile, regular run (hand-written SSE)
5.48s 5.44s 5.45s
*** compile with -msse -msse2 -mfpmath=sse, regular run (hand-written/generated SSE)
5.48s
*** compile with -msse -msse2 -mfpmath=sse, rip VerifySSE() out of the source code, regular run (only generated SSE)
5.41s
*** regular compile, rip VerifySSE() out
5.51s
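The percentages quoted in the runs above can be rechecked with a few lines of Python. This is only a sketch: it assumes each percentage is the ratio of mean run times minus one, which matches the posted figures to within rounding, and the function and variable names are made up for illustration.

```python
from statistics import mean

def percent_slower(slow_runs, fast_runs):
    """How much longer (in percent) the first set of runs takes, comparing mean times."""
    return (mean(slow_runs) / mean(fast_runs) - 1.0) * 100.0

# Timings in seconds, copied from the normal-nodes runs above.
athlon_xp_x87 = [16.23, 16.22, 16.37]        # regular compile, regular run
athlon_xp_gen_sse1 = [14.98, 14.99, 14.89]   # -msse -mfpmath=sse

celeron_no_sse = [15.13, 15.00, 15.02]       # regular compile, --no-sse
celeron_hand_sse2 = [15.7, 15.8, 16.3, 15.7] # regular compile, regular run
celeron_gen_sse1 = [14.64, 14.63, 14.68]     # -msse -mfpmath=sse, --no-sse

# Athlon XP: x87 build versus generated SSE1 (roughly 8.8%).
print(f"{percent_slower(athlon_xp_x87, athlon_xp_gen_sse1):.2f}%")
# Celeron: hand-written SSE2 versus --no-sse (roughly 5.5%).
print(f"{percent_slower(celeron_hand_sse2, celeron_no_sse):.2f}%")
# Celeron: --no-sse baseline versus generated SSE1 (roughly 2.7%).
print(f"{percent_slower(celeron_no_sse, celeron_gen_sse1):.2f}%")
# Celeron: hand-written SSE2 versus generated SSE1 (roughly 8.4%).
print(f"{percent_slower(celeron_hand_sse2, celeron_gen_sse1):.2f}%")
```

With these numbers, the 8.36% figure for generated SSE1 on the Celeron corresponds to the comparison against the hand-written SSE2 run; against the --no-sse baseline the gain is closer to 2.7%.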
- Hirogen2
GL nodes (-only) this time (zdbsp -x)
Did I just show that (Intel's implementation of) SSE2 sucks compared to SSE1?
Or maybe it's just that the compiler is not too good at SSE2 yet...
(Note for those confused by the -msse and --no-sse combination: --no-sse turns off the hand-optimized code and instead uses whatever the compiler generates, which in turn is determined by CFLAGS.)
These results should give us something to think about. Let the compiler do the optimizations; it is mostly as good as hand-optimization.
All tests were done on zdbsp svn rev 226.
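To make the -msse / --no-sse combinations easier to follow, here is a small sketch that maps a (CFLAGS, arguments) pair to the floating-point code that ends up running. It is not taken from zdbsp; the function name and the `cpu_has_sse2` parameter are hypothetical, and it assumes a 32-bit x86 build where GCC defaults to x87 math and where the hand-written code is only used on CPUs that report SSE2, as the run labels in this thread suggest.

```python
def math_code_path(cflags: str, no_sse: bool, cpu_has_sse2: bool = True) -> str:
    """Describe which floating-point code a run exercises (illustrative only)."""
    flags = cflags.split()
    # What the compiler generates is decided by CFLAGS alone.
    if "-mfpmath=sse" in flags and "-msse2" in flags:
        generated = "generated SSE1/SSE2"
    elif "-mfpmath=sse" in flags and "-msse" in flags:
        generated = "generated SSE1"
    else:
        generated = "x87 only"
    # --no-sse (or a CPU without SSE2) leaves just the compiler's output.
    if no_sse or not cpu_has_sse2:
        return generated
    return generated + " + hand-written SSE2"

# The Celeron combinations from this thread:
for cflags in ("", "-msse -mfpmath=sse", "-msse -msse2 -mfpmath=sse"):
    for no_sse in (True, False):
        args = "--no-sse" if no_sse else ""
        print(f'CFLAGS+="{cflags}", args+="{args}": {math_code_path(cflags, no_sse)}')
```

On the x86_64 Opteron this mapping does not apply as written, since GCC already uses SSE math by default there.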
Code:
athlon-xp$
CFLAGS+="", args+="":
18.26s 18.32s 18.47s
CFLAGS+="-msse -mfpmath=sse", args+="":
16.85s 16.83s 16.88s
Speedup: 8.70%
celeron$
CFLAGS+="", args+="--no-sse":
17.67s 17.68s 18.19s
CFLAGS+="", args+="":
18.57s 18.99s 18.5s (slowdown cf 1st: 4.70%)
CFLAGS+="-msse -mfpmath=sse", args+="":
18.37s 18.52s 18.35s
CFLAGS+="-msse -mfpmath=sse", args+="--no-sse":
17.29s 17.75s 17.96s
CFLAGS+="-msse -msse2 -mfpmath=sse", args+="":
18.65s 18.98s 18.62s
CFLAGS+="-msse -msse2 -mfpmath=sse", args+="--no-sse":
18.62s 19.12s 19.27s 18.63s
A benefit of 8% is quite a thing. Depending on how much math one does, it may go up to 17% (from a timing test I did with the Ogg Vorbis encoder some time ago).
And plain old SSE offered virtually no benefit over x87 math, so that's why it's SSE2. (Although that was with VC++; GCC might do better with it. I haven't tried.)
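For anyone who wants to repeat this kind of comparison, a rough timing harness could look like the following. It is a sketch under assumptions: it measures wall-clock time, which may not be how the numbers in this thread were taken, and the script name and WAD file name in the comment are placeholders.

```python
import subprocess
import sys
import time
from statistics import mean

def time_command(argv, runs=3):
    """Run argv `runs` times and return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so a broken run is not timed silently.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations

if __name__ == "__main__":
    # Hypothetical usage: python time_runs.py ./zdbsp -x some.wad
    # With no arguments, a no-op Python process is timed so the script still runs.
    argv = sys.argv[1:] or [sys.executable, "-c", "pass"]
    times = time_command(argv)
    print(" ".join(f"{t:.2f}s" for t in times), f"(mean {mean(times):.2f}s)")
```

Rebuilding with different CFLAGS between invocations and comparing the means gives figures of the same kind as above; the longer each run takes, the less the per-run noise matters.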
