Improve System.Decimal performance for x64 platform #7778
@Daniel-Svensson thank you for looking into this! The numbers from your experiments look really great. As for your questions:
It would first have to be tested if these changes would improve desktop. Since on desktop, we are calling into OLEAUT32, it is possible that the code in there was already optimized for 64 bits.
It really depends on the actual applications. Decimals are quite special.
I would stay with these four operations. If you look at the list of methods on decimal implemented by the native runtime (https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h#L834-L849), you'll see that they all use the VarDecXXX functions that you've already optimized.
I wonder if we could somehow unify the code for 32 and 64 bit. What if we implemented the 64 bit intrinsics for 32 bit just in C and see if the perf stays the same for 32 bits. Just from looking at a sample of your functions, it looks like it might be possible.
I'd be happy to give you guidance on the cmake stuff.
I also have a question - will the intrinsics used in your changes work for Unix ARM64?
Desktop is calling the implementation of these methods in oleaut32. I doubt that fixing an oleaut32 perf problem by duplicating its implementation would fly for desktop. It would most likely turn into a suggestion to Windows to fix the perf problem in oleaut32. Ideally, we would have fully managed implementations of these methods in C# in .NET Core. There is a C# implementation of these methods in CoreRT: https://github.com/dotnet/corert/blob/master/src/System.Private.CoreLib/src/System/Decimal.DecCalc.cs . I would recommend focusing the optimization effort on this C# implementation, and once it is clearly better than what we have in CoreCLR today, switch CoreCLR to use it as well.
Ah ok, I hadn't realized we already have a managed implementation in CoreRT. So I agree with @jkotas that we should focus on improving the managed implementation.
Just some quick replies; I will try to get some time this weekend to address most of the other comments.
CoreCLR and managed implementations
Regarding focusing on the managed version: it should be possible to get a few percent improvement if we somehow get access to bit-scanning operations, but without 128-bit results I see no way to come anywhere close to most of these numbers.
ARM64
I don't think the intrinsics work as they are right now for ARM64, but most of them should hopefully be easy to fix. That said, the code can probably be used as a basis for ARM64. The corresponding instructions (add, add with carry, etc.) as well as multiplication of two unsigned 64-bit words producing a 128-bit result (two instructions: UMULL, UMULH) seem to be available.
Unifying 32-bit and 64-bit
A reasonable amount of the code can probably be shared, such as some of the helper methods, but the impact in 32-bit mode must be measured. The original implementation seems quite optimized for 32-bit, with more short-circuits and special cases than required in the prototype.
Mixed comments
My numbers are primarily measurements against the oleaut32 implementation on 64-bit Windows (with coreclr/palrt as separate measurements against the code in decimal/decarith.cpp). I was assuming that https://github.com/dotnet/coreclr/blob/master/src/classlibnative/bcltype/decimal.cpp is the code that https://github.com/dotnet/coreclr/blob/master/src/vm/ecalllist.h#L834-L849 refers to. So it does not look like they are optimized in Windows 10 Anniversary, and I doubt that will change soon.
I would love to have the code integrated to get the benefits, but Windows is not exactly OSS, so that seems unlikely.
Where to insert the code (if anywhere), and additional questions
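The 32/64-bit unification idea discussed above can be made concrete: a 64×64→128-bit multiply can be written once in plain C from four 32×32→64 partial products, and a 64-bit compiler typically collapses it to a single MUL (or UMULL/UMULH on ARM64), while a 32-bit build keeps the partial products. This is an illustrative sketch under that assumption, not code from the linked prototype:

```cpp
#include <cstdint>

struct UInt128 { uint64_t lo, hi; };

// Portable 64x64 -> 128-bit multiply built from four 32x32 -> 64
// partial products. On a 64-bit target a compiler typically collapses
// this to a single MUL (UMULL/UMULH on ARM64); on a 32-bit target the
// partial products remain, which is the "write it once in C" idea.
UInt128 Mul64x64To128(uint64_t a, uint64_t b) {
    uint32_t aLo = (uint32_t)a, aHi = (uint32_t)(a >> 32);
    uint32_t bLo = (uint32_t)b, bHi = (uint32_t)(b >> 32);

    uint64_t p0 = (uint64_t)aLo * bLo;   // low  x low
    uint64_t p1 = (uint64_t)aLo * bHi;   // low  x high
    uint64_t p2 = (uint64_t)aHi * bLo;   // high x low
    uint64_t p3 = (uint64_t)aHi * bHi;   // high x high

    uint64_t mid = p1 + (p0 >> 32);      // cannot overflow 64 bits
    uint64_t carry = 0;
    mid += p2;
    if (mid < p2) carry = 1;             // carry out of the middle word

    UInt128 r;
    r.lo = (mid << 32) | (uint32_t)p0;
    r.hi = p3 + (mid >> 32) + (carry << 32);
    return r;
}
```

Whether the 32-bit codegen for such shared code matches the existing hand-tuned 32-bit paths is exactly the measurement question raised above.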
@janvorli I tried the approach and it seems to work fine after some minor tweaking.
Not directly.
Feel free to contact me if anybody is interested in using my work to improve the Windows/"oleaut" implementation.
Updated measurements
I have done very little tweaking of the 32-bit implementation; there might be some additional easy things to do in order to increase performance. These measurements were made running an Intel Core i5 2500K; other architectures will probably get slightly different numbers.
Where to go from here?
I would prefer to try to update the current implementation and see what the final perf impact would be.
@Daniel-Svensson I am really sorry for not responding for such a long time. I have seen your update and wanted to discuss it further with @jkotas, but I have completely forgotten about it.
@janvorli @jkotas I am in no hurry with this issue, but I am hoping for this year. It should be possible to add the managed implementation to the benchmark project in order to compare it. I've added a managed C++ wrapper for the new code and added a few simple benchmarks as a start.
New divide measurements
Both x86 implementations are significantly slower than the Windows version in oleaut32, which should be used for RyuJIT x86 (it seems to use the coreclr version).
BenchmarkDotNet measurements for a single invocation
Update: removed benchmarks, since updated values can be found in the next post. NetFramework: managed method in the C++ assembly calls System.Decimal methods.
Summary for managed benchmarks
New measurements for C# benchmark
I've managed to add the decimal implementation from CoreRT, as well as compiling a custom coreclr build (commit 6747fa1) both with and without the code changes to the bclnative library. Below follow measurements for:
"Method" explanation:
The benchmarks being run are:
The C# benchmarks are currently nowhere near as exhaustive as the C++ variants, since they only measure a single invocation with constant values. The C# benchmarks can be made much better (feel free to add more tests :) ) but I think they at least highlight that the new native code would result in a measurable performance gain.
Custom CoreCLR build (x64)
Both the benchmark dll and the CoreRT dll with the CoreRT decimal type have been crossgen'ed.
BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
.NET Core SDK=2.0.0-preview2-006497
[Host] : .NET Core ? (Framework 4.6.00001.0), 64bit RyuJIT
Toolchain=InProcessToolchain
Mul 96by96
Div 96by96
Notes: it seems strange that the PInvoke method appears faster than the NetFramework implementation.
Add 96by96
Notes: it seems like there could be a perf regression for this specific pair; a larger set of numbers, such as those in the C++ project, would give more reliable measurements.
64-bit results, .NET 4.7
BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
[Host] : .NET Framework 4.7 (CLR 4.0.30319.42000), 64bit RyuJIT-v4.7.2101.1
Platform=X64 Runtime=Clr
Mul 96by96
Div 96by96
Add 96by96
32-bit results
BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i5-2500K CPU 3.30GHz (Sandy Bridge), ProcessorCount=4
Frequency=14318180 Hz, Resolution=69.8413 ns, Timer=HPET
[Host] : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2101.1
DefaultJob : .NET Framework 4.7 (CLR 4.0.30319.42000), 32bit LegacyJIT-v4.7.2101.1
Mul 96by96
Div 96by96
Add 96by96
Summary/Conclusions
CoreRT managed implementation
Using only two constants for performance testing isn't sufficient - the CPU branch predictor will skip large parts of the code. The current CoreRT implementation is a straightforward port from oleaut32 optimised for correctness and can definitely be improved. It allocates memory on every add/mul/div/round operation, uses 32bit math, etc. Like the proposed native variant, it could also benefit from runtime/JIT support for 64/128bit mul/div operations (maybe as private intrinsics).
@pentp thanks for the tip about optimizing the managed implementation; I've updated most of the implementation to remove allocations (but I did not find all of them) and updated the results for .NET 4.7. As for it using only constants, that is because the intention was to see if the performance from the more rigorous C++ benchmarks can translate all the way to C# code.
@Daniel-Svensson by "decimal.cpp from coreRT" you meant "decimal.cs from coreRT", right?
Thanks for spotting that @janvorli, I've edited the comment.
Because I like (fast) managed code I made some basic optimisations to the CoreRT decimal code (no significant changes in algorithm structure) and measured using a weighted random distribution of decimals that might better represent real world use (gist is sorted for better overview). A total of 440^2=~190K combinations were tested (excluding overflows/div-by-0). BenchmarkDotNet=v0.10.9, OS=Windows 10 Redstone 2 (10.0.15063)
Processor=Intel Core i7-7700K CPU 4.20GHz (Kaby Lake), ProcessorCount=8
.NET Core SDK=2.0.0-preview2-006127
[Host] : .NET Core ? (Framework 4.6.00001.0), 64bit RyuJIT
Toolchain=InProcessToolchain
Obstacles to further optimisation are #4155 / #5213 and lack of 128bit DIV/MUL intrinsics. I suspect both of these issues depend on the JIT supporting multiple output registers in IR, although 128bit DIV could partially work with only a single output. @jkotas Should I start integrating these optimisations into the CoreRT repo? When it's faster than the current native implementation in CoreCLR then … And should I open a new issue for 128bit DIV/MUL support in …? Also found a probable bug in …
Yes, we would love to have these optimizations in the CoreRT repo. Thank you for doing the work!
https://github.com/dotnet/corefx/issues/17147 is about 128bit integer support in general. Not sure whether we need a separate issue for just DIV/MUL support at this point.
Closing, since x64 perf is at least a little better now and most of the relevant performance improvements cannot be ported to C# without the help of intrinsics for basic x64 instructions, which do not seem to be coming in a reasonable time frame.
The current implementation forwards calls to the Windows implementation of "decimal"
(VarDecAdd, VarDecSub, VarDecMul) or to the port of these functions found under palrt (as far as I understand the code).
These methods are optimised for 32-bit platforms; by using 64-bit instructions it is possible to significantly improve performance.
I have written a small proof of concept to illustrate the gains. I am looking forward to trying to integrate the code with coreclr, but I have some questions and want feedback on how best to integrate it before submitting a PR with those methods.
Questions:
I would assume that, apart from the basic arithmetic operations, conversions to/from text as well as double/int can be quite common.
I have focused on +, -, / and * for now, but there might be some other low-hanging fruit.
I am thinking along the lines of keeping an _x64 suffix on these methods (VarDecAdd_x64)
and only including the code for x64 platforms.
This would be coupled with a macro to redefine VarDecAdd as VarDecAdd_x64 to route all calls to the x64 implementation.
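The routing described above could look roughly like this; the declarations and the arch guard are a hypothetical sketch, not the actual coreclr headers:

```cpp
// Hypothetical sketch: declare the x64 variants next to the existing
// prototypes and redirect callers to them on 64-bit builds only.
#if defined(_TARGET_AMD64_)
HRESULT VarDecAdd_x64(DECIMAL* pdecL, DECIMAL* pdecR, DECIMAL* pdecRes);
HRESULT VarDecMul_x64(DECIMAL* pdecL, DECIMAL* pdecR, DECIMAL* pdecRes);

// Route all existing VarDecXXX call sites to the x64 implementations
// without touching the callers.
#define VarDecAdd VarDecAdd_x64
#define VarDecMul VarDecMul_x64
#endif
```

On non-x64 platforms the macros are simply absent, so the existing 32-bit code paths are used unchanged.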
Proof of Concept
I have created x64-aware methods for the arithmetic instructions Add, Sub, Mul and Div
based on the current code in coreclr; there are no real changes to algorithms or other logic
apart from changing 32-bit arithmetic to 64-bit and the consequences of that,
and using some intrinsics for bit search and carry propagation.
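The two intrinsic patterns mentioned (bit search and carry propagation) can be sketched portably with GCC/Clang builtins; on MSVC the same operations are spelled _BitScanReverse64 and _addcarry_u64. This is an illustrative sketch, not the prototype's actual code:

```cpp
#include <cstdint>

// Index of the most significant set bit (v must be non-zero, matching
// the x64 BSR instruction's contract). GCC/Clang builtin shown; MSVC
// spells the same operation _BitScanReverse64.
static inline uint32_t BitScanMsb(uint64_t v) {
    return 63u - (uint32_t)__builtin_clzll(v);
}

struct UInt128 { uint64_t lo, hi; };

// 128-bit add with carry propagation between the two 64-bit words;
// on x64 this compiles down to ADD + ADC (MSVC: _addcarry_u64).
static inline UInt128 Add128(UInt128 a, UInt128 b) {
    UInt128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < b.lo);  // carry out of the low word
    return r;
}
```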
This is a summary of the measurements from the example project, which can be found at:
https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting
Measurement
https://github.com/Daniel-Svensson/ClrExperiments/blob/master/ClrDecimal/coreclrtesting/main.cpp
In short, the program generates a number of "semi random" inputs where different numbers of bits are
set. Then it calls the method under test for all combinations of the inputs.
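The input generation described above can be sketched as follows; the xorshift generator, seed, and exact distribution here are illustrative assumptions, the real harness lives in main.cpp of the linked repo:

```cpp
#include <cstdint>
#include <vector>

// Tiny deterministic xorshift64 PRNG, good enough for generating a
// reproducible test corpus (illustrative, not the generator in main.cpp).
static uint64_t NextRand(uint64_t* state) {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

// Generate `count` values, each with exactly `bitsPerValue` bits set at
// pseudo-random positions, so the corpus spans small through full-width
// operands as the description above suggests.
std::vector<uint64_t> MakeInputs(int bitsPerValue, int count) {
    std::vector<uint64_t> values;
    uint64_t state = 0x9E3779B97F4A7C15ull;  // arbitrary non-zero seed
    for (int i = 0; i < count; ++i) {
        uint64_t v = 0;
        int set = 0;
        while (set < bitsPerValue) {
            uint64_t bit = 1ull << (NextRand(&state) % 64);
            if ((v & bit) == 0) { v |= bit; ++set; }
        }
        values.push_back(v);
    }
    return values;
}
```

The method under test is then invoked for every pair (or triple) of generated inputs, across all scales and signs.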
Results:
See the results folder (https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting/results)
for complete output results.
I have tried to summarize the results for both the Core i5 2500K and the i7 6700K below.
The i5 results are with a few minor changes but otherwise the same code;
the biggest performance spread is in some of the division tests.
The results below are against the oleaut32 implementation; when compared against the
implementations in palrt the results are very similar, since those results are within a few
% of the timings for oleaut32.
Multiply
Add / Sub
Div
Speedup range: 10-270%
For mixed input (all 00...111 bit patterns, with all scales and signs): ~100%