In the asmjit case it's a bit strange though, because as far as I can tell those functions didn't even need to be templated. It is almost as if he's trying to do compile-time checks (like the gzdoom drawers) to speed things up, but in code where you shouldn't need to be so aggressive about that. On top of that, everything is stored as compactly as possible in memory. While such code may be faster, I don't think it is worth it except for the most critical loops in a program.
Btw, I implemented the XOR and the bug itself seems to be gone. But there's still something wrong: while looking at the dumpjit output I noticed it does this in Actor.AimBulletMissile:
; line 17: 19000002 LDP
test rbx, rbx ; test regA0, regA0
je L8 ; je L8
movsd xmm0, qword [rbx+344] ; movsd regF0, qword [regA0+344] // speed
; line 17: 19010003 LDP
test rbx, rbx ; test regA0, regA0
je L9 ; je L9
movsd xmm1, qword [rbx+120] ; movsd regF1, qword [regA0+120] // angle
; line 17: 19020004 LDP
test rbx, rbx ; test regA0, regA0
je L10 ; je L10
movsd xmm3, qword [rbx+112] ; movsd regF2, qword [regA0+112] // pitch
; line 17: 50020400 CALL_K
xorps xmm0, xmm1 ; [Swap] regF0, regF1
xorps xmm1, xmm0
xorps xmm0, xmm1
movapd xmm2, xmm0 ; [Move] regF1
mov rcx, rbx ; [Duplicate] regA0
call 140701174325264 ; Actor.Vel3DFromAngle [Native] // xmm0 = angle, xmm1 = speed, xmm2 = angle ?? clearly not right!
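The register contents at the call can be double-checked by replaying the three xorps instructions and the movapd on plain 64-bit values. This is only a rough C++ sketch: the `Regs` struct and the `bits`, `value` and `replay` helpers are made up for illustration, they model just the low 64 bits of each register, and none of this is the JIT's actual code.

```cpp
#include <cstdint>
#include <cstring>

// xorps operates on raw bits, so the doubles are handled as 64-bit integers.
static std::uint64_t bits(double d)
{
    std::uint64_t u;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

static double value(std::uint64_t u)
{
    double d;
    std::memcpy(&d, &u, sizeof d);
    return d;
}

// Hypothetical stand-in for the low 64 bits of xmm0..xmm3.
struct Regs
{
    std::uint64_t xmm0, xmm1, xmm2, xmm3;
};

// Replays the CALL_K shuffle from the dump.
// State after the three LDPs: xmm0 = speed, xmm1 = angle, xmm3 = pitch.
Regs replay(double speed, double angle, double pitch)
{
    Regs r{bits(speed), bits(angle), 0, bits(pitch)};
    r.xmm0 ^= r.xmm1; // xorps xmm0, xmm1
    r.xmm1 ^= r.xmm0; // xorps xmm1, xmm0  -> xmm1 now holds speed
    r.xmm0 ^= r.xmm1; // xorps xmm0, xmm1  -> xmm0 now holds angle
    r.xmm2 = r.xmm0;  // movapd xmm2, xmm0 -> xmm2 is a copy of angle
    return r;
}
```

Calling `replay(5.0, 90.0, -10.0)` and converting each field back with `value()` leaves xmm0 = 90, xmm1 = 5, xmm2 = 90 and xmm3 = -10, which matches the annotation on the call line above: the swap itself does what it says, and angle ends up in both xmm0 and xmm2.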