Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Arm N2 SVE perfscore file #95700

Closed
wants to merge 1 commit into from
Closed

Conversation

a74nh
Copy link
Contributor

@a74nh a74nh commented Dec 6, 2023

The SVE perfscores were full of small errors. I've attempted to fix this by autogenerating a bunch of #defines in perfscorearm64sven2.h from the table in #94549 (comment)

Sadly there's no obvious way of auto mapping from instruction group to perfscore group. That still needs manually checking. But, thankfully many instruction groups map to exactly one perfscore group.

When adding new instructions groups to codegen, then easiest route for the perfscore is to:

  • Lookup the instructions in your group using the table in Arm64: Implement SVE encodings #94549
  • The name of the perf group(s) in the table can then be searched for in perfscorearm64sven2.h to find the correct defines.

I've named everything with N2 with the intention that at some point in the future we may want to add encodings for additional chipsets. It would also be useful at some point for the general and neon instructions to match the same CPU.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 6, 2023
@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Dec 6, 2023
@ghost
Copy link

ghost commented Dec 6, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: a74nh
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@a74nh
Copy link
Contributor Author

a74nh commented Dec 6, 2023

@kunalspathak @BruceForstall @TIHan

Keeping as a draft PR for the moment in case anyone has a better idea.

I wondered at replacing the switch with a lookup table. But, all the exceptions make that non-obvious. Maybe best to keep as the switch for the time being.

* Returns the 1/2/4/8 byte elemsize for an Arm64 Sve vector instruction
*/

/*static*/ unsigned emitter::insSveElemsize(insOpts opt)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now thinking this function should return bits instead of bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TIHan added this in #95618

@BruceForstall
Copy link
Member

This looks completely reasonable to me.

@kunalspathak
Copy link
Member

I've attempted to fix this by autogenerating a bunch of #defines in perfscorearm64sven2.h from the table in #94549 (comment)

Are they different from what is defined in emitPerfScore_sve.cpp that I have uploaded in #94549?

@@ -0,0 +1,868 @@
// Licensed to the .NET Foundation under one or more agreements.
Copy link
Member

@kunalspathak kunalspathak Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be included in CMakeLists.txt.

#define PERFSCORE_THROUGHPUT_N2_LOOP_CONTROL_PROPAGATING PERFSCORE_THROUGHPUT_1C

// Loop control, propagating and flag setting
#define PERFSCORE_LATENCY_N2_LOOP_CONTROL_PROPAGATING_AND_FLAG_SETTING PERFSCORE_LATENCY_3C
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do not have platform or hardware specific perfscores, it is probably good to have it that way, but:

  • During runtime, I am not sure if we detect hardware and will ever pick the appropriate perfscores PERFSCORE_LATENCY_N2_* vs ``PERFSCORE_LATENCY_N3_*`, etc.
  • This easily can get outdated (which most of the perfscore happen) or unused once we add scores for newer hardware.
  • With the new hardware perf scores added, it will be more confusing to pick which one is best for given instruction.
  • They are purely used for display purpose and we do not depend on them for any optimizations.
  • IMO, we should just follow the existing approach of having these numbers to the best of our ability and then generalize them as the missing instruction numbers are made available.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they different from what is defined in emitPerfScore_sve.cpp that I have uploaded in #94549?

I hadn't spotted the latest update of emitPerfScore_sve.cpp.

But, I now realise where the errors are coming in .... for PERFSCORE_THROUGHPUT you've mixed up the X and C cases. X is when there are more than one per cycle. Looking at the file, INS_sve_fsqrt is marked as PERFSCORE_THROUGHPUT_14X which can't be right as it's always going to be a slow instruction.

With that fixed I'm thinking maybe we should stick to your file instead of this PR.

What is missing from your file is the cases where the perfscore changes depending on the element size (eg IF_SVE_AK_3A). I noticed that NEON has these.

I also think for currently unsupported instructions we should set the score high, rather than 1C.

They are purely used for display purpose and we do not depend on them for any optimizations.

What are they used for? Is it just for manually comparing the relative performance of one version of a compiled routine against another? Deciding whether a new coreclr optimisation improves or regresses the code?

If so, then my concern is that the relative performance of difference instructions can change quite a bit over time depending on microarchitecture design, and especially between different types of CPUs (eg: a mobile chip vs a server one) as they are design for different purposes. So, basing decisions on old architectures won't match the actual performance seen on real hardware. Could quite easily add an optimisation that that has a better perfscore, but worse real performance. Right now, comparing a NEON loop against an SVE loop using the existing perfscore data would be a bad idea, as I suspect NEON comes from a much older chip.

Ok, I can see why you wouldn't want multiple sets of perfscores. But, on the other hand I don't think a mix up of scores from a variety of sources is helpful. I'm not sure what a good solution would be though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfscore is currently an estimate of performance that can be used in asmdiffs to determine if a change is good from a performance perspective, not just a code size perspective. See "superpmi.py asmdiffs -metrics" and jit-analyze "--metrics".

We're not sure how good these scores are. @AndyAyersMS has been doing various analyses to determine how usable perfscore is for (auto) tuning CSE instead of running benchmarks and measuring time (which is noisy, etc.).

I think choosing a particular platform and using it for all the data makes sense. Whether we ever change the set of numbers we use based on the particular target platform is TBD. We're nowhere close to being able to make that decision. Updating the existing numbers to be based on a particular modern NEON platform makes sense, but might not be worth the effort, either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

X is when there are more than one per cycle. Looking at the file, INS_sve_fsqrt is marked as PERFSCORE_THROUGHPUT_14X which can't be right as it's always going to be a slow instruction.

The way X is currently used for throughput < 1. fsqrt, that sounds correct because it is 1/14 as minimum.

image

What are they used for? Is it just for manually comparing the relative performance of one version of a compiled routine against another? Deciding whether a new coreclr optimisation improves or regresses the code?

That's right. We do not even have them on by default in spmi-diffs pipeline, the default we use is codeSize. But for register allocation changes, PerfScore comes handy because those changes can increase the code size but eliminate memory access and that increases the code size, but lowers the perfscore.

If so, then my concern is that the relative performance of difference instructions can change quite a bit over time depending on microarchitecture design, and especially between different types of CPUs

Yes, that is a true statement and we do not keep it upto date.

Right now, comparing a NEON loop against an SVE loop using the existing perfscore data would be a bad idea, as I suspect NEON comes from a much older chip.

I don't think we would do that. We will just compare the disassembly :)

So, basing decisions on old architectures won't match the actual performance seen on real hardware

pretty much to what @BruceForstall mentioned about our experimentation in this area.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right now, comparing a NEON loop against an SVE loop using the existing perfscore data would be a bad idea, as I suspect NEON comes from a much older chip.

Would it make sense to just update all the data to be from a single chip then?

@BruceForstall BruceForstall added the arm-sve Work related to arm64 SVE/SVE2 support label Dec 7, 2023
@ghost ghost closed this Jan 7, 2024
@ghost
Copy link

ghost commented Jan 7, 2024

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 7, 2024
This pull request was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants