Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sub-optimal code when the operation "% 1" on integers is met #10956

Closed
voinokin opened this issue Aug 23, 2018 · 2 comments · Fixed by #77760
Closed

Sub-optimal code when the operation "% 1" on integers is met #10956

voinokin opened this issue Aug 23, 2018 · 2 comments · Fixed by #77760
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors optimization
Milestone

Comments

@voinokin
Copy link

voinokin commented Aug 23, 2018

Currently, the CLR gives result "0" (which seems correct) for "remainder of 1" operation on integers.

The formula a % b will always return a value on the range (-b, b), exclusive (it can never return b or -b), keeping the sign of the dividend. For integer division, the remainder operator satisfies the rule a % b = a - (a / b) * b. This is not to be confused with canonical modulus, which satisfies a similar rule but with floored division and returns values on the range [0, b). C# does not have an operator for canonical modulus. However, the behavior is the same for positive dividends.

For b = 1 the resulting range is obviously (-1; 1) ==> just plain 0.

However, it looks like there is some space for improvements in JIT codegen, since JIT does not seem to fold the result of evaluation to zero constant, which prohibits further optimizations - see the repro code below.

To my understanding, ALL % 1 operations can be folded to just use constant values, with maybe notable exception when the memory read and/or write is required to observe some side effects (eg. volatile variables). For signed integers obtained directly or indirectly (thru implicit type convertion to Int32) the code looks quite verbose, given that the result is known ahead. It's better for UInt32 and UInt64 where AND r/m, 0 is issued, but of these AND [m], 0 only makes sense when again side effects are to be observed. I don't see any reason for AND r, 0 - this loads CPU scheduler much more than XOR r, r.

Understanding that % 1 does not often occur in the code, here's real case when it may appear (happened to me). Say there is ring buffer with N items, and that N is provided externally - either thru static readonly field (resolve during runtime), or thru const definition (resolve during compile time). To calculate the slot index for current or next item, the % N opration is used. Since in degenerate case that N can be set to 1, there is desire to reduce all related calculations when there's just one slot for everything. At the moment additional checks placed manually are required and that provides more IL, which reduces inlineability for methods.


The repro code:

static sbyte i8 = sbyte.MinValue;
static byte u8 = byte.MaxValue;
static short i16 = short.MinValue;
static ushort u16 = ushort.MaxValue;
static int i32 = int.MinValue;
static uint u32 = uint.MaxValue;
static long i64 = long.MinValue;
static ulong u64 = ulong.MaxValue;

static void Main(string[] args)
{
    // Test with m.loc.
    i8 %= (sbyte)1;
    i16 %= (short)1;
    i32 %= (int)1;
    i64 %= (long)1;

    // Test with r.loc.
    Console.WriteLine((sbyte)(i8 + i8) % (sbyte)1);
    Console.WriteLine((short)(i16 + i16) % (short)1);
    Console.WriteLine((int)(i32 + i32) % (int)1);
    Console.WriteLine((long)(i64 + i64) % (long)1);

    // Test with m.loc.
    u8 %= (byte)1;
    u16 %= (ushort)1;
    u32 %= (uint)1;
    u64 %= (ulong)1;

    // Test with r.loc.
    Console.WriteLine((byte)(u8 + u8) % (byte)1);
    Console.WriteLine((ushort)(u16 + u16) % (ushort)1);
    Console.WriteLine((uint)(u32 + u32) % (uint)1);
    Console.WriteLine((ulong)(u64 + u64) % (ulong)1);
}

The disasm:

--- ...\Program2.cs ---
    i8 %= (sbyte)1;
000007FE749F1450  sub         rsp,28h  
000007FE749F1454  mov         rcx,7FE748D4B30h  
000007FE749F145E  mov         edx,1  
000007FE749F1463  call        000007FED44D04F0  

    // Test with m.loc.
    i8 %= (sbyte)1;
000007FE749F1468  movsx       rcx,byte ptr [7FE748D4B84h]  
000007FE749F1470  mov         eax,ecx  
000007FE749F1472  sar         eax,1Fh  
000007FE749F1475  and         eax,0  
000007FE749F1478  add         eax,ecx  
000007FE749F147A  and         eax,0FFFFFFFFh  
000007FE749F147D  sub         ecx,eax  
000007FE749F147F  movsx       rcx,cl  
000007FE749F1483  mov         byte ptr [7FE748D4B84h],cl  

    i16 %= (short)1;
000007FE749F1489  movsx       rcx,word ptr [7FE748D4B80h]  
000007FE749F1491  mov         eax,ecx  
000007FE749F1493  sar         eax,1Fh  
000007FE749F1496  and         eax,0  
000007FE749F1499  add         eax,ecx  
000007FE749F149B  and         eax,0FFFFFFFFh  
000007FE749F149E  sub         ecx,eax  
000007FE749F14A0  movsx       rcx,cx  
000007FE749F14A4  mov         word ptr [7FE748D4B80h],cx  

    i32 %= (int)1;
000007FE749F14AB  mov         ecx,dword ptr [7FE748D4B78h]  
000007FE749F14B1  mov         eax,ecx  
000007FE749F14B3  sar         eax,1Fh  
000007FE749F14B6  and         eax,0  
000007FE749F14B9  add         eax,ecx  
000007FE749F14BB  and         eax,0FFFFFFFFh  
000007FE749F14BE  sub         ecx,eax  
000007FE749F14C0  mov         dword ptr [7FE748D4B78h],ecx  

    i64 %= (long)1;
000007FE749F14C6  mov         rcx,qword ptr [7FE748D4B68h]  
000007FE749F14CD  mov         rax,rcx  
000007FE749F14D0  sar         rax,3Fh  
000007FE749F14D4  and         rax,0  
000007FE749F14D8  add         rax,rcx  
000007FE749F14DB  and         rax,0FFFFFFFFFFFFFFFFh  
000007FE749F14DF  sub         rcx,rax  
000007FE749F14E2  mov         qword ptr [7FE748D4B68h],rcx  

    // Test with r.loc.
    Console.WriteLine((sbyte)(i8 + i8) % (sbyte)1);
000007FE749F14E9  movsx       rcx,byte ptr [7FE748D4B84h]  
000007FE749F14F1  add         ecx,ecx  
000007FE749F14F3  movsx       rcx,cl  
000007FE749F14F7  mov         eax,ecx  
000007FE749F14F9  sar         eax,1Fh  
000007FE749F14FC  and         eax,0  
000007FE749F14FF  add         eax,ecx  
000007FE749F1501  and         eax,0FFFFFFFFh  
000007FE749F1504  sub         ecx,eax  
000007FE749F1506  call        000007FE749F12D0  

    Console.WriteLine((short)(i16 + i16) % (short)1);
000007FE749F150B  movsx       rcx,word ptr [7FE748D4B80h]  
000007FE749F1513  add         ecx,ecx  
000007FE749F1515  movsx       rcx,cx  
000007FE749F1519  mov         eax,ecx  
000007FE749F151B  sar         eax,1Fh  
000007FE749F151E  and         eax,0  
000007FE749F1521  add         eax,ecx  
000007FE749F1523  and         eax,0FFFFFFFFh  
000007FE749F1526  sub         ecx,eax  
000007FE749F1528  call        000007FE749F12D0  

    Console.WriteLine((int)(i32 + i32) % (int)1);
000007FE749F152D  mov         ecx,dword ptr [7FE748D4B78h]  
000007FE749F1533  add         ecx,ecx  
000007FE749F1535  mov         eax,ecx  
000007FE749F1537  sar         eax,1Fh  
000007FE749F153A  and         eax,0  
000007FE749F153D  add         eax,ecx  
000007FE749F153F  and         eax,0FFFFFFFFh  
000007FE749F1542  sub         ecx,eax  
000007FE749F1544  call        000007FE749F12D0  

    Console.WriteLine((long)(i64 + i64) % (long)1);
000007FE749F1549  mov         rcx,qword ptr [7FE748D4B68h]  
000007FE749F1550  add         rcx,rcx  
000007FE749F1553  mov         rax,rcx  
000007FE749F1556  sar         rax,3Fh  
000007FE749F155A  and         rax,0  
000007FE749F155E  add         rax,rcx  
000007FE749F1561  and         rax,0FFFFFFFFFFFFFFFFh  
000007FE749F1565  sub         rcx,rax  
000007FE749F1568  call        000007FE749F12E0  

    // Test with m.loc.
    u8 %= (byte)1;
000007FE749F156D  movzx       ecx,byte ptr [7FE748D4B85h]  
000007FE749F1574  mov         eax,ecx  
000007FE749F1576  sar         eax,1Fh  
000007FE749F1579  and         eax,0  
000007FE749F157C  add         eax,ecx  
000007FE749F157E  and         eax,0FFFFFFFFh  
000007FE749F1581  sub         ecx,eax  
000007FE749F1583  movzx       ecx,cl  
000007FE749F1586  mov         byte ptr [7FE748D4B85h],cl  

    u16 %= (ushort)1;
000007FE749F158C  movzx       ecx,word ptr [7FE748D4B82h]  
000007FE749F1593  mov         eax,ecx  
000007FE749F1595  sar         eax,1Fh  
000007FE749F1598  and         eax,0  
000007FE749F159B  add         eax,ecx  
000007FE749F159D  and         eax,0FFFFFFFFh  
000007FE749F15A0  sub         ecx,eax  
000007FE749F15A2  movzx       ecx,cx  
000007FE749F15A5  mov         word ptr [7FE748D4B82h],cx  

    u32 %= (uint)1;
000007FE749F15AC  and         dword ptr [7FE748D4B7Ch],0  

    u64 %= (ulong)1;
000007FE749F15B3  and         qword ptr [7FE748D4B70h],0  

    // Test with r.loc.
    Console.WriteLine((byte)(u8 + u8) % (byte)1);
000007FE749F15BB  movzx       ecx,byte ptr [7FE748D4B85h]  
000007FE749F15C2  add         ecx,ecx  
000007FE749F15C4  movzx       ecx,cl  
000007FE749F15C7  mov         eax,ecx  
000007FE749F15C9  sar         eax,1Fh  
000007FE749F15CC  and         eax,0  
000007FE749F15CF  add         eax,ecx  
000007FE749F15D1  and         eax,0FFFFFFFFh  
000007FE749F15D4  sub         ecx,eax  
000007FE749F15D6  call        000007FE749F12D0  

    Console.WriteLine((ushort)(u16 + u16) % (ushort)1);
000007FE749F15DB  movzx       ecx,word ptr [7FE748D4B82h]  
000007FE749F15E2  add         ecx,ecx  
000007FE749F15E4  movzx       ecx,cx  
000007FE749F15E7  mov         eax,ecx  
000007FE749F15E9  sar         eax,1Fh  
000007FE749F15EC  and         eax,0  
000007FE749F15EF  add         eax,ecx  
000007FE749F15F1  and         eax,0FFFFFFFFh  
000007FE749F15F4  sub         ecx,eax  
000007FE749F15F6  call        000007FE749F12D0  

    Console.WriteLine((uint)(u32 + u32) % (uint)1);
000007FE749F15FB  mov         ecx,dword ptr [7FE748D4B7Ch]  
000007FE749F1601  add         ecx,ecx  
000007FE749F1603  and         ecx,0  
000007FE749F1606  call        000007FE749F12D8  

    Console.WriteLine((ulong)(u64 + u64) % (ulong)1);
000007FE749F160B  mov         rcx,qword ptr [7FE748D4B70h]  
000007FE749F1612  add         rcx,rcx  
000007FE749F1615  and         rcx,0  
000007FE749F1619  call        000007FE749F12E8  

000007FE749F161E  nop  
000007FE749F161F  add         rsp,28h  
000007FE749F1623  ret  
--- No source file -------------------------------------------------------------

category:cq
theme:expression-opts
skill-level:beginner
cost:small

@mikedn
Copy link
Contributor

mikedn commented Aug 23, 2018

Eh, when I did power of 2 and magic division in lowering I didn't include the case x % 1 since that's something that the frontend should handle. Turns out that both morph and VN handle x / 1 but not x % 1.

Should be easy to fix, I'll take a look once I clear up some other stuff. If nobody else gets to it before me, of course.

@am11
Copy link
Member

am11 commented Nov 24, 2019

The formula a % b will always return a value on the range (-b, b)

Can this range be the basis of further, more thorough, optimizations? such as, detecting and eliminating dead-code. For example, given:

int foo (int i) {
    int x = 4096;
    for (int j = 0; j < i; j += x) {
        if (j % x != 0) {
            return j;
        }
    }
    return 0;
}

clang (9.0) and gcc (9.2) with -O2 generates:

xor     eax, eax
ret

playgrounds: https://gcc.godbolt.org/z/Gk2eAF vs. https://sharplab.io/#v2:EYLg...

but only for all x in 2n series, although the optimization would hold for any value of x (except only when x is negative and i is positive, it leads to overflow exception; for which the code path can also be optimized?). Similarly, if int x is passed as argument to foo, the optimization doesn't apply by those compilers.

In cases where clang and gcc do not optimize to just return 0, clang seems to still generate better/leaner code than other (4) compilers in playgrounds.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@BruceForstall BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020
@BruceForstall BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 2, 2022
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Dec 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Jan 7, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions help wanted [up-for-grabs] Good issue for external contributors optimization
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants