Performance triage by bisecting the range #2916

Open
Tracked by #4095
kotlarmilos opened this issue Feb 24, 2023 · 6 comments
@kotlarmilos
Member

The idea is to improve performance triage for complex cases (e.g. dotnet/perf-autofiling-issues#11539 and dotnet/perf-autofiling-issues#11536) by bisecting the commit range and identifying the commit that caused the regression. Such a tool could take a commit range and a microbenchmark grouping as input, and output the commit that caused the regression along with logs of the searched path.

The tool could be made available for local environments, where it would be limited to a particular OS and architecture; I consider this low effort. It could also be added as an extension to the dashboard, where it could be executed for a particular set of microbenchmarks; I consider this mid-to-high effort, with a broader scope and an easier-to-use workflow, but with security concerns.
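As an illustration, a local command-line interface for such a tool might look like the sketch below. The script name and flags are hypothetical, not an existing interface:

```python
# Hypothetical CLI for the proposed triage tool (names are illustrative).
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Bisect a commit range to find the commit causing a perf regression")
    parser.add_argument("--baseline", required=True,
                        help="known-good commit hash")
    parser.add_argument("--compare", required=True,
                        help="known-bad commit hash")
    parser.add_argument("--filter", required=True,
                        help="microbenchmark filter, e.g. 'System.Memory.ReadOnlySpan*'")
    parser.add_argument("--log", default="bisect.log",
                        help="file recording every commit visited during the search")
    return parser.parse_args(argv)

# Example invocation with placeholder hashes:
args = parse_args(["--baseline", "abc123", "--compare", "def456",
                   "--filter", "System.Memory.ReadOnlySpan*"])
print(args.baseline, args.compare, args.filter, args.log)
```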

I would like to get feedback before making it available for local environments. If such a tool is considered useful for the dashboard, I am ready to help.

@kotlarmilos kotlarmilos added the enhancement New feature or request label Feb 24, 2023
@kotlarmilos kotlarmilos self-assigned this Feb 24, 2023
@sblom
Contributor

sblom commented Feb 27, 2023

Would love to see something like this! Are you imagining enhancing benchmarks_ci.py to do this?

It would be very interesting to try to factor this as a git bisect predicate, so that git does the bisection work while we help it narrow in, although I worry that without perfect test stability we might confuse git bisect.
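For reference, `git bisect run` treats exit code 0 as "good", 125 as "skip this commit", and any other code in 1–127 as "bad". A predicate under those semantics could be sketched as follows; the benchmark invocation and the threshold value are placeholders, not existing tooling:

```python
# Sketch of a predicate usable with `git bisect run`.
# Exit codes: 0 = good, 1 = bad, 125 = skip (cannot test this commit).
import sys

THRESHOLD_NS = 100.0  # assumed midpoint between the good and bad means

def run_benchmark() -> float:
    """Placeholder: run the filtered benchmark at the current checkout
    (e.g. via benchmarks_ci.py) and return its mean time in nanoseconds."""
    raise NotImplementedError

def classify(mean_ns: float, threshold_ns: float = THRESHOLD_NS) -> int:
    """Map a measurement to a git-bisect exit code: 0 = good, 1 = bad."""
    return 0 if mean_ns < threshold_ns else 1

def main() -> int:
    try:
        return classify(run_benchmark())
    except Exception:
        return 125  # exit 125 tells `git bisect run` to skip this commit

# A real script would end with: sys.exit(main())
```

With a real `run_benchmark`, this would be driven by `git bisect start <bad> <good>` followed by `git bisect run python predicate.py`.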

@kotlarmilos
Member Author

Enhancing benchmarks_ci.py is a good idea, and I agree that it would be interesting to factor this as a git bisect predicate. For regressions with an order-of-magnitude difference, we might get correct results even without perfect test stability.

I can propose changes to benchmarks_ci.py in a draft PR.

@matouskozak
Member

I would like to revive the discussion about the triage script.

I think that with the recent advancements by @LoopedBard3 on https://github.com/dotnet/performance/blob/main/scripts/benchmarks_local.py, we could utilize this script to binary search through commits to identify the source of a regression (or improvement).

My idea for the script would be:
Input: baseline hash, compare hash, filter for microbenchmarks
Output: hash of commit causing regression

Algorithm steps:

  1. Measure the magnitude of the regression between the baseline and compare hashes. We could use the ResultsComparer tool (https://github.com/dotnet/performance/tree/main/src/tools/ResultsComparer) for measuring. The tool can save results to a CSV, which the script could process later.
  2. Using binary search algorithm, try to find commit causing the same magnitude of regression as computed in 1.
    • It is possible that the results between commits are not monotonic, which could break the binary search. Small variance could be detected and ignored; if there are bigger spikes, the script should either abort or "split" itself to work on subintervals.
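The binary search in step 2, with a simple relative tolerance to absorb small run-to-run variance, could be sketched as follows; `measure` and the toy numbers are placeholders:

```python
# Minimal binary-search sketch over a linearized commit list.
# measure() stands in for benchmarking one commit; rel_tolerance absorbs
# run-to-run noise so small variance does not flip a comparison.
def bisect_regression(commits, measure, baseline_mean, rel_tolerance=0.05):
    """Return the first commit whose measurement looks regressed.

    Assumes measurements are monotonic up to `rel_tolerance` noise;
    a commit within tolerance of the baseline counts as 'good'.
    """
    cutoff = baseline_mean * (1 + rel_tolerance)
    lo, hi = 0, len(commits) - 1  # commits[0] is good, commits[-1] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(commits[mid]) <= cutoff:
            lo = mid + 1   # still fast: the regression is later
        else:
            hi = mid       # slow: the regression is here or earlier
    return commits[lo]

# Toy example: commits c0..c6, regression introduced at c4.
means = {f"c{i}": (100.0 if i < 4 else 200.0) for i in range(7)}
found = bisect_regression(list(means), means.get, baseline_mean=100.0)
print(found)  # -> c4
```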

Further ideas:

  • Reverting the found commit to verify the cause of regression.
  • Checking out a fresh repository and building from scratch for each tested commit might be too time-consuming. Having three repos (left, mid, right) and just using git checkout <commit> without cleaning artifacts could make the search faster, though it could cause some unwanted behavior. I would make this an optional feature for the user to decide.

What do you think? @sblom @kotlarmilos @LoopedBard3

@kotlarmilos
Member Author

Good idea. There is already a --diff command that allows for easy comparison.

> Using binary search algorithm, try to find commit causing the same magnitude of regression as computed in 1.

I would like to explore the possibility of implementing it using git bisect (https://git-scm.com/docs/git-bisect). Additionally, I think we can extend the script for local testing to support this feature.

@matouskozak @LoopedBard3 I suggest we schedule a call to discuss it further.

@matouskozak
Member

I selected a few benchmarks from Mono perf issues that regressed in the past and have stable results, making them suitable for testing the bisecting algorithm.


The first set comes from dotnet/perf-autofiling-issues#21818 (linux-x64, AOT):

System.Memory.ReadOnlySpan

baseline: 9d08b24d743d0c57203a55c3d0c6dd1bc472e57e
compare: f6c592995a4f8526508da33761230a3850d942ff
expected result: 4a09c82215399c27f52277a8db7178270410c693

System.Globalization.Tests.StringSearch

baseline: da4e544809b3b10b7db8dc170331988d817240d7
compare: f6c592995a4f8526508da33761230a3850d942ff
expected result: 4a09c82215399c27f52277a8db7178270410c693


The second set comes from dotnet/perf-autofiling-issues#15795 (linux-arm64, AOT):

System.Tests.Perf_DateTimeOffset, System.Tests.Perf_DateTime, System.Globalization.Tests.Perf_DateTimeCultureInfo

baseline: 0fc78e62bb0b8d824efe7421983702b97a60def6
compare: 86b48d7c6f081c12dcc9c048fb53de1b78c9966f
expected result: ef1ba771347dc1d7d626907e4731b8f2e3cf78b3


The third set comes from dotnet/perf-autofiling-issues#14570 (linux-arm64, AOT):

System.Runtime.Intrinsics.Tests.Perf_Vector128Of, System.Runtime.Intrinsics.Tests.Perf_Vector128Of, System.Runtime.Intrinsics.Tests.Perf_Vector128Float

baseline: dce343987e81ce6cf045d983ce62d8e117d2c142
compare: c6cc201665cb3c4d61851392bd8f8d8093688500
expected result: 9bf28fe5a0bc2344bac261e07b27e6c55033318f


The last set comes from dotnet/perf-autofiling-issues#22688 (linux-x64, JIT):

This set represents an improvement (in contrast with the previous three sets, which represented regressions).

System.Text.Json.Serialization.Tests.WriteJson

baseline: 736dabeca728ccf8b911d96d1b4c575b4d0db7d2
compare: fc5c29692fc1a92426b7d1ce8c501e7696062bb6
expected result: 169e22c8f9f00719d87f0674954fee688b556b4a

@kotlarmilos
Member Author

Thanks @matouskozak! Let's create mock data based on your input and prototype the bisecting process using a moving-average-filter-like approach. @LoopedBard3, since you worked on the script for local benchmarking, feel free to point out missing features (comparer, bisect, etc.).
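A moving-average-style filter for smoothing noisy per-commit measurements before classifying them could look like this minimal sketch (the window size and the toy numbers are assumptions):

```python
# Centered moving average over per-commit measurements; smoothing noisy
# values before comparison reduces the chance that a single spiky run
# misleads the bisection. Window size is a tuning assumption.
def moving_average(values, window=3):
    """Centered moving average; edges use only the available neighbors."""
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - window // 2)
        hi = min(len(values), i + window // 2 + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Toy series: noise around 100 ns, then a jump to ~200 ns at the regression.
noisy = [100, 102, 98, 101, 200, 203, 198]
print(moving_average(noisy))
```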
