
Add Ord::cmp for primitives as a BinOp in MIR #118310

Merged
merged 4 commits into rust-lang:master from scottmcm:three-way-compare on Apr 2, 2024

Conversation

@scottmcm (Member) commented Nov 26, 2023

Update: most of this OP was written months ago. See #118310 (comment) below for where we got to recently that made it ready for review.


There are dozens of reasonable ways to implement `Ord::cmp` for integers using comparison, bit-ops, and branches. Those differences are irrelevant at the Rust level, however, so we can make things better by adding `BinOp::Cmp` at the MIR level:

  1. Exactly how to implement it is left up to the backends, so LLVM can use whatever pattern its optimizer best recognizes and Cranelift can use whichever pattern codegens the fastest.
  2. By not inlining those details for every use of `cmp`, we drastically reduce the amount of MIR generated for `derive`d `PartialOrd`, while also making it more amenable to MIR-level optimizations.

Having extremely careful `if` ordering to μoptimize resource usage on Broadwell (#63767) is great, but it really feels to me like libcore is the wrong place to put that logic. Similarly, using subtraction [tricks](https://graphics.stanford.edu/~seander/bithacks.html#CopyIntegerSign) (#105840) is arguably even nicer, but depends on the optimizer understanding it (llvm/llvm-project#73417) to be practical. Or maybe [bitor is better than add](https://discourse.llvm.org/t/representing-in-ir/67369/2?u=scottmcm)? But maybe only on a future LLVM version that [has `or disjoint` support](https://discourse.llvm.org/t/rfc-add-or-disjoint-flag/75036?u=scottmcm)? And just because one of those forms happens to be good for LLVM, there's no guarantee that it'd be the same form that GCC or Cranelift would rather see -- especially given their very different optimizers. Not to mention that if LLVM gets a spaceship intrinsic -- [which it should](https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/Suboptimal.20inlining.20in.20std.20function.20.60binary_search.60/near/404250586) -- we'll need at least a rustc intrinsic to be able to call it.

As for simplifying it in Rust, we now regularly inline `{integer}::partial_cmp`, but it's quite a large amount of IR. The best way to see that is with 8811efa#diff-d134c32d028fbe2bf835fef2df9aca9d13332dd82284ff21ee7ebf717bfa4765R113 -- I added a new pre-codegen MIR test for a simple 3-tuple struct, and this PR changes it from 36 locals and 26 basic blocks down to 24 locals and 8 basic blocks. Even better, as soon as the construct-`Some`-then-match-it-in-same-BB noise is cleaned up, this'll expose the `Cmp == 0` branches clearly in MIR, so that an InstCombine (#105808) can simplify that to just a `BinOp::Eq` and thus fix some of our generated code perf issues. (Tracking that through today's `if a < b { Less } else if a == b { Equal } else { Greater }` would be *much* harder.)
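
As a rough illustration (this is not libcore's actual implementation, just one of the many equivalent spellings), the subtraction trick mentioned above can be written in Rust like this, mapping straight onto `Ordering`'s `#[repr(i8)]` discriminants:

```rust
use std::cmp::Ordering;

// Illustrative sketch only: one of the many source-level patterns that
// BinOp::Cmp lets a backend choose between, rather than libcore committing
// to any particular one.
fn three_way_cmp(a: u32, b: u32) -> Ordering {
    // (a > b) - (a < b) is -1, 0, or 1, matching Ordering's #[repr(i8)] values.
    match (a > b) as i8 - (a < b) as i8 {
        -1 => Ordering::Less,
        0 => Ordering::Equal,
        _ => Ordering::Greater,
    }
}

fn main() {
    assert_eq!(three_way_cmp(1, 2), Ordering::Less);
    assert_eq!(three_way_cmp(2, 2), Ordering::Equal);
    // Once Cmp is a single MIR operation, an InstCombine could rewrite
    // `three_way_cmp(a, b) == Ordering::Equal` into a plain `a == b`.
    assert_eq!(three_way_cmp(3, 2), Ordering::Greater);
}
```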


r? @ghost
But first I should check that perf is ok with this
...and my true nemesis, tidy.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Nov 26, 2023
@scottmcm (Member, Author)

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 26, 2023
@bors (Contributor) commented Nov 26, 2023

⌛ Trying commit 9cfc79a with merge d5cf045...

bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 26, 2023
Add `Ord::cmp` for primitives as a `BinOp` in MIR

@bors (Contributor) commented Nov 26, 2023

☀️ Try build successful - checks-actions
Build commit: d5cf045 (d5cf04528062d622d35ccb36d43a437871ee8b27)


@rust-timer (Collaborator)

Finished benchmarking commit (d5cf045): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.

Next Steps: If you can justify the regressions found in this try perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please fix the regressions and do another perf run. If the next run shows neutral or positive results, the label will be automatically removed.

@bors rollup=never
@rustbot label: -S-waiting-on-perf +perf-regression

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 0.9% | [0.4%, 2.2%] | 5 |
| Regressions ❌ (secondary) | 0.2% | [0.2%, 0.2%] | 1 |
| Improvements ✅ (primary) | -0.7% | [-0.7%, -0.7%] | 1 |
| Improvements ✅ (secondary) | -3.4% | [-3.4%, -3.4%] | 1 |
| All ❌✅ (primary) | 0.7% | [-0.7%, 2.2%] | 6 |

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 5.7% | [2.1%, 11.8%] | 3 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -3.9% | [-3.9%, -3.9%] | 1 |
| Improvements ✅ (secondary) | -5.0% | [-5.0%, -5.0%] | 1 |
| All ❌✅ (primary) | 3.3% | [-3.9%, 11.8%] | 4 |

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 2.2% | [2.2%, 2.2%] | 1 |
| Regressions ❌ (secondary) | 2.4% | [2.4%, 2.4%] | 1 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.9% | [-2.9%, -2.9%] | 1 |
| All ❌✅ (primary) | 2.2% | [2.2%, 2.2%] | 1 |

Binary size

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 0.2% | [0.1%, 0.3%] | 10 |
| Regressions ❌ (secondary) | 0.7% | [0.2%, 1.3%] | 2 |
| Improvements ✅ (primary) | -0.2% | [-0.4%, -0.0%] | 19 |
| Improvements ✅ (secondary) | -6.6% | [-6.6%, -6.6%] | 1 |
| All ❌✅ (primary) | -0.1% | [-0.4%, 0.3%] | 29 |

Bootstrap: 675.058s -> 677.549s (0.37%)
Artifact size: 313.35 MiB -> 313.33 MiB (-0.01%)

@rustbot rustbot added perf-regression Performance regression. and removed S-waiting-on-perf Status: Waiting on a perf run to be completed. labels Nov 26, 2023
Comment on lines +16 to +22
// DEBUG: cmp
// DEBUG: setg
// DEBUG: and
// DEBUG: cmp
// DEBUG: setl
// DEBUG: and
// DEBUG: sub
scottmcm (Member, Author) commented:

With any optimization this would be down to just cmp+setg+setl+sub, but at opt-level=0 it doesn't know that.

use std::cmp::Ordering;
debug_assert!(!is_float);
let pred = |op| base::bin_op_to_icmp_predicate(op, is_signed);
if bx.cx().tcx().sess.opts.optimize == OptLevel::No {
scottmcm (Member, Author) commented:

Using a different approach here makes quite a remarkable difference in unoptimized mode. I haven't investigated exactly what's happening in tablegen, but in an unoptimized build the gt+lt+sub approach gives

unsigned_cmp:
	cmp	cx, dx
	seta	al
	and	al, 1
	cmp	cx, dx
	setb	cl
	and	cl, 1
	sub	al, cl
	ret

whereas the other approach gives

unsigned_cmp:
	push	rax
	mov	word ptr [rsp + 2], dx
	mov	word ptr [rsp + 4], cx
	mov	al, 1
	xor	r8d, r8d
	mov	byte ptr [rsp + 6], r8b
	cmp	cx, dx
	mov	byte ptr [rsp + 7], al
	jne	.LBB1_2
	mov	al, byte ptr [rsp + 6]
	mov	byte ptr [rsp + 7], al
.LBB1_2:
	mov	cx, word ptr [rsp + 4]
	mov	dx, word ptr [rsp + 2]
	mov	al, byte ptr [rsp + 7]
	mov	byte ptr [rsp], al
	mov	al, 255
	cmp	cx, dx
	mov	byte ptr [rsp + 1], al
	jb	.LBB1_4
	mov	al, byte ptr [rsp]
	mov	byte ptr [rsp + 1], al
.LBB1_4:
	mov	al, byte ptr [rsp + 1]
	pop	rcx
	ret
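
For reference, here's a hypothetical Rust-level analogue of the two shapes being compared (the PR builds this codegen directly, so these bodies only approximate the patterns; the function names are made up for illustration):

```rust
// gt+lt+sub shape: two flag-setting compares and a subtraction, branch-free
// even without optimization (roughly the first listing above).
pub fn cmp_sub(a: u16, b: u16) -> i8 {
    (a > b) as i8 - (a < b) as i8
}

// select/branch shape: nested comparisons, which unoptimized codegen turns
// into the stack-shuffling, branchy sequence in the second listing above.
pub fn cmp_select(a: u16, b: u16) -> i8 {
    if a < b { -1 } else if a == b { 0 } else { 1 }
}
```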

@scottmcm (Member, Author)

@bors try @rust-timer queue


@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Nov 27, 2023
@bors (Contributor) commented Nov 27, 2023

⌛ Trying commit 82551ca with merge 3d40b1d...

bors added a commit to rust-lang-ci/rust that referenced this pull request Nov 27, 2023
Add `Ord::cmp` for primitives as a `BinOp` in MIR

@cjgillot cjgillot removed their assignment Mar 25, 2024
@davidtwco (Member) left a comment


LGTM, but I'm not familiar with Stable MIR and whether or not this should be added to that.

cc @rust-lang/project-stable-mir

r=me once someone from that group signs-off on that part

Comment on lines +137 to 138
BinOp::Cmp => unreachable!(),
BinOp::Lt | BinOp::Le | BinOp::Eq | BinOp::Ge | BinOp::Gt | BinOp::Ne => unreachable!(),


In the same file, on line 71, BinOp::Cmp was added to the same branch as the other comparisons. Since it doesn't make a difference here, the same could be done here.

Suggested change:

- BinOp::Cmp => unreachable!(),
- BinOp::Lt | BinOp::Le | BinOp::Eq | BinOp::Ge | BinOp::Gt | BinOp::Ne => unreachable!(),
+ BinOp::Lt | BinOp::Le | BinOp::Eq | BinOp::Ge | BinOp::Gt | BinOp::Ne | BinOp::Cmp => unreachable!(),

@oli-obk (Contributor) left a comment

sgtm from the smir side. We'll figure out how to get the type of a lang item

@scottmcm (Member, Author) commented Apr 2, 2024

@bors r=davidtwco

@bors (Contributor) commented Apr 2, 2024

📌 Commit c59e93c has been approved by davidtwco

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Apr 2, 2024
@bors (Contributor) commented Apr 2, 2024

⌛ Testing commit c59e93c with merge a77322c...

@bors (Contributor) commented Apr 2, 2024

☀️ Test successful - checks-actions
Approved by: davidtwco
Pushing a77322c to master...

@bors bors added the merged-by-bors This PR was explicitly merged by bors. label Apr 2, 2024
@bors bors merged commit a77322c into rust-lang:master Apr 2, 2024
12 checks passed
@rustbot rustbot added this to the 1.79.0 milestone Apr 2, 2024
@rust-timer (Collaborator)

Finished benchmarking commit (a77322c): comparison URL.

Overall result: ❌✅ regressions and improvements - ACTION NEEDED

Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression-triaged along with sufficient written justification. If you cannot justify the regressions please open an issue or create a new PR that fixes the regressions, add a comment linking to the newly created issue or PR, and then add the perf-regression-triaged label to this PR.

@rustbot label: +perf-regression
cc @rust-lang/wg-compiler-performance

Instruction count

This is a highly reliable metric that was used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 0.4% | [0.2%, 0.7%] | 3 |
| Regressions ❌ (secondary) | 0.3% | [0.2%, 0.4%] | 2 |
| Improvements ✅ (primary) | -0.6% | [-0.6%, -0.6%] | 1 |
| Improvements ✅ (secondary) | -3.1% | [-3.1%, -3.1%] | 1 |
| All ❌✅ (primary) | 0.2% | [-0.6%, 0.7%] | 4 |

Max RSS (memory usage)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 3.7% | [2.7%, 4.5%] | 4 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | -4.5% | [-5.9%, -2.5%] | 3 |
| Improvements ✅ (secondary) | - | - | 0 |
| All ❌✅ (primary) | 0.2% | [-5.9%, 4.5%] | 7 |

Cycles

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | - | - | 0 |
| Regressions ❌ (secondary) | - | - | 0 |
| Improvements ✅ (primary) | - | - | 0 |
| Improvements ✅ (secondary) | -2.9% | [-2.9%, -2.9%] | 1 |
| All ❌✅ (primary) | - | - | 0 |

Binary size

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

|  | mean | range | count |
| --- | --- | --- | --- |
| Regressions ❌ (primary) | 0.2% | [0.0%, 0.5%] | 6 |
| Regressions ❌ (secondary) | 0.0% | [0.0%, 0.0%] | 1 |
| Improvements ✅ (primary) | -0.2% | [-0.8%, -0.0%] | 16 |
| Improvements ✅ (secondary) | -5.6% | [-5.6%, -5.6%] | 1 |
| All ❌✅ (primary) | -0.1% | [-0.8%, 0.5%] | 22 |

Bootstrap: 667.286s -> 667.104s (-0.03%)
Artifact size: 315.68 MiB -> 315.58 MiB (-0.03%)

@scottmcm scottmcm deleted the three-way-compare branch April 3, 2024 00:29
github-actions bot pushed a commit to rust-lang/miri that referenced this pull request Apr 3, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

@scottmcm (Member, Author) commented Apr 3, 2024

image-0.24.1 opt-full is +0.69% on instructions, but also -6.23% on wall time, so I don't know what to think about these perf results. I guess it's the classic unpredictable results from more inlining.

bjorn3 pushed a commit to rust-lang/rustc_codegen_cranelift that referenced this pull request Apr 5, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

lnicola pushed a commit to lnicola/rust-analyzer that referenced this pull request Apr 7, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

@pnkfelix (Member) commented Apr 9, 2024

@scottmcm from looking at the graphs, I'm inclined to categorize the majority of the outliers w.r.t. wall-time (especially the negative ones) as just noise here. (Of course wall-time itself is already ridiculously noisy.)

image-0.24.1 itself does seem like something interesting though, I freely admit that.

@pnkfelix (Member) commented Apr 9, 2024

The impact here is somewhat limited, and the graph indicates that the regression for image-0.24.1 was subsequently recovered.

@rustbot label: +perf-regression-triaged

@rustbot rustbot added the perf-regression-triaged The performance regression has been triaged. label Apr 9, 2024
celinval added a commit to model-checking/kani that referenced this pull request Apr 16, 2024
Related changes:
  - rust-lang/rust#118310:
    Add `Ord::cmp` for primitives as a `BinOp` in MIR
  - rust-lang/rust#120131:
    Add support to `Pat` pattern type
  - rust-lang/rust#122935:
    Rename CastKind::PointerWithExposedProvenance
  - rust-lang/rust#123097:
    Adapt to changes to local_def_path_hash_to_def_id

Resolves #3130, #3142
celinval added a commit to celinval/kani-dev that referenced this pull request Apr 17, 2024
Related changes:
  - rust-lang/rust#118310:
    Add `Ord::cmp` for primitives as a `BinOp` in MIR
  - rust-lang/rust#120131:
    Add support to `Pat` pattern type
  - rust-lang/rust#122935:
    Rename CastKind::PointerWithExposedProvenance
  - rust-lang/rust#123097:
    Adapt to changes to local_def_path_hash_to_def_id

Resolves model-checking#3130, model-checking#3142
RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this pull request Apr 27, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

zpzigi754 pushed a commit to zpzigi754/kani that referenced this pull request May 8, 2024
Related changes:
  - rust-lang/rust#118310:
    Add `Ord::cmp` for primitives as a `BinOp` in MIR
  - rust-lang/rust#120131:
    Add support to `Pat` pattern type
  - rust-lang/rust#122935:
    Rename CastKind::PointerWithExposedProvenance
  - rust-lang/rust#123097:
    Adapt to changes to local_def_path_hash_to_def_id

Resolves model-checking#3130, model-checking#3142
GuillaumeGomez pushed a commit to GuillaumeGomez/rustc_codegen_gcc that referenced this pull request May 21, 2024
Add `Ord::cmp` for primitives as a `BinOp` in MIR

Labels
merged-by-bors This PR was explicitly merged by bors. perf-regression Performance regression. perf-regression-triaged The performance regression has been triaged. S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.