New benchmarking system #5340
Conversation
😍 amazing, @jfirebaugh!
👍 Rather than running for a minimum amount of time, I think a reasonable approach would be to have a
One way or another, I definitely think we should be breaking down some of the larger benchmarks (like Layout) into smaller ones. The problems with such benchmarks are (a) that each iteration is long, meaning that they have to run for much longer to get statistically meaningful results, and (b) if they run on multiple fixtures (e.g. doing layout on several tiles, or evaluating several filters), that's a somewhat arbitrary assessment -- better would be to select a single fixture; or if no single fixture is adequately representative, to select a few, and run each as its own benchmark / sub-benchmark.
If we can, that's probably a good idea -- maybe even between each individual benchmark? Somewhat related to this, we might want to consider sampling the minimum run time of a few iterations, inspired by the ideas presented in https://arxiv.org/pdf/1608.04295.pdf. It's counterintuitive -- my instinct is to pay attention to the worst-case data, not drop it! -- but, done right, this could make a lot of sense.
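For illustration, a minimal sketch of that sampling scheme (the helper and its parameters are hypothetical, not part of this PR): each recorded sample is the minimum over a small batch of iterations, which discards one-off slowdowns from GC or scheduling rather than the benchmark's own variation.

```js
// Hypothetical sketch: record the minimum of each batch of k iterations,
// so isolated slowdowns (GC, scheduling) don't inflate the sample.
function sampleMinOfK(bench, k = 5, numSamples = 100) {
    const samples = [];
    for (let i = 0; i < numSamples; i++) {
        let best = Infinity;
        for (let j = 0; j < k; j++) {
            const start = performance.now();
            bench();
            best = Math.min(best, performance.now() - start);
        }
        samples.push(best);
    }
    return samples;
}
```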
Spending more time reading the criterion code, it looks like the strategy is roughly:
Note that this is in fact collecting a multivariate sample; in particular the
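A rough sketch of the regression step that strategy feeds into -- this is an assumption about the general technique, not a transcription of criterion's code: each observation pairs an iteration count with the total elapsed time, and the per-iteration estimate is the least-squares slope through the origin.

```js
// Hypothetical sketch: estimate per-iteration time as the slope of a
// least-squares fit (through the origin) of elapsed time vs. iteration count.
function estimateIterationTime(observations) {
    // observations: [{iterations, elapsed}, ...]
    let sumXY = 0, sumXX = 0;
    for (const {iterations, elapsed} of observations) {
        sumXY += iterations * elapsed;
        sumXX += iterations * iterations;
    }
    return sumXY / sumXX; // ms per iteration
}
```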
Maybe let's simply remove all the bogus benchmarks for now?
@jfirebaugh Criterion's focus on In our case, given that (a)
Force-pushed from 594daea to 6555c37
I flipped the axes on the density plot and added a visualization of individual samples, mean, median, the range between first and third quartiles, and extrema. Worth noting: the results in the screenshots posted here often suggest that
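For reference, all of the summary statistics drawn in that view can be computed with d3-array; a minimal sketch, with `version.samples` assumed to be the array of measured times:

```js
// Sketch: summary statistics for one version's samples, via d3-array.
const sorted = version.samples.slice().sort((a, b) => a - b);
const summary = {
    mean: d3.mean(sorted),
    median: d3.median(sorted),
    q1: d3.quantile(sorted, 0.25),   // first quartile
    q3: d3.quantile(sorted, 0.75),   // third quartile
    min: d3.min(sorted),
    max: d3.max(sorted)
};
```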
@jfirebaugh in most cases, the differences we're seeing between mean (or min) execution times are a very small fraction of the standard deviation of the distribution, which is what we'd expect. I think we can come up with a metric for the delta -- basically some measure of effect size, like Cohen's d -- that takes this into account and would allow us to "read off numbers" more meaningfully.
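A minimal sketch of such a metric, using Cohen's d with a pooled standard deviation (the helper name is hypothetical; `d3.mean` and `d3.variance` are already used elsewhere in the bench code):

```js
// Hypothetical sketch: Cohen's d between two sample arrays,
// using the pooled standard deviation as the scale.
function cohensD(a, b) {
    const pooledVariance =
        ((a.length - 1) * d3.variance(a) + (b.length - 1) * d3.variance(b)) /
        (a.length + b.length - 2);
    return (d3.mean(a) - d3.mean(b)) / Math.sqrt(pooledVariance);
}
```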
Oh, that would be great!
Force-pushed from 7714ff5 to 1868098
@jfirebaugh this allows more breathing room, but makes it harder to interpret the (relative) magnitude of a shift.
Force-pushed from f095363 to 30e5f73
Yeah, it's a tradeoff. I think being able to show more detail in the samples and summary statistics is more valuable though, especially if we can provide an effect size summary. |
This looks great! Aside from some smaller comments inline, a couple of changes I think we should make:
- Either add sub-benchmarking, for the reasons given in #5340 (comment), or if not, then update `Paint`, `Layout`, `MapLoad`, and `Query{Box,Point}` to use only a single tile or zoom level.
- Deal with high outliers. They seem to be occurring in most of the benchmarks, and I think they're almost certainly all due to external factors like GC. With these large outliers, I worry that we're not going to be able to rely on our mean / stddev / regression results, which are all pretty sensitive to outliers. One option is to automatically drop them (just for the mean/stddev/regression calculations, but not in the plots), using some criterion like Tukey's fences with `k` equal to 2 or 3 (a sketch follows this list). Another option would be to add, e.g., a 10th percentile statistic to the table and treat that as the primary metric (with the other stats serving more of a supporting role).
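To make the first option concrete, a sketch of Tukey's fences with a configurable `k` (hypothetical helper; assumes d3-array is available in the bench code, as it already is in the view):

```js
// Hypothetical sketch: drop samples above Q3 + k * IQR before computing
// mean / stddev / regression, while keeping the full set for plotting.
function dropHighOutliers(samples, k = 3) {
    const sorted = samples.slice().sort((a, b) => a - b);
    const q1 = d3.quantile(sorted, 0.25);
    const q3 = d3.quantile(sorted, 0.75);
    const fence = q3 + k * (q3 - q1);
    return sorted.filter((s) => s <= fence);
}
```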
bench/benchmarks.js (outdated)
```js
function register(Benchmark) {
    window.mapboxglBenchmarks[Benchmark.name] = window.mapboxglBenchmarks[Benchmark.name] || {};
    window.mapboxglBenchmarks[Benchmark.name].master = new Benchmark();
```
Remove before merging
```diff
@@ -99,7 +100,8 @@
     "bubleError": true,
     "transforms": {
       "dangerousForOf": true
-    }
+    },
+    "objectAssign": "Object.assign"
```
This isn't supported in IE - could we just use our own `util.extend()`?
No, because it needs to be the name of something that's globally accessible without a `require`. What we want is to make object spread available only in benchmarks/JSX, but I couldn't figure out a good way to do so.
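For context, a sketch of what the option does -- the values here are made up for illustration: bublé compiles object spread into calls to whatever global the `objectAssign` option names, which is why it has to resolve without a `require`.

```js
const defaults = {width: 256, height: 256};
// Object spread in benchmark/JSX source:
const options = {...defaults, width: 512};
// With "objectAssign": "Object.assign", bublé compiles the line above to roughly:
//   var options = Object.assign({}, defaults, {width: 512});
console.log(options); // {width: 512, height: 256}
```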
Ah, gotcha. We can enforce with eslint: 6afaa79
bench/benchmarks_view.js (outdated)
```js
{this.renderStatistic('Minimum',
    (version) => `${formatSample(d3.min(version.samples))} ms`)}
{this.renderStatistic('Variance',
    (version) => `${formatSample(d3.variance(version.samples))} ms`)}
```
I don't think we really need variance here. It's redundant with -- and harder to interpret than -- standard deviation.
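If we swap it out, the replacement is a one-liner -- a sketch assuming d3-array's `d3.deviation`, which is just the square root of `d3.variance`:

```js
{this.renderStatistic('Standard Deviation',
    (version) => `${formatSample(d3.deviation(version.samples))} ms`)}
```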
bench/lib/benchmark.js (outdated)
```js
_done() {
    // 210 samples => 20 observations for regression
    return this._elapsed >= 300 && this._samples.length > 210;
```
I'm still not totally happy with this: for benchmarks with a short running time, we can both afford to and would benefit from getting more samples. Maybe let's bump the min runtime up to 500ms or so?
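Concretely, that would just raise the time floor in `_done()` -- a sketch of the suggested tweak, with everything else unchanged:

```js
_done() {
    // 500ms minimum runtime; 210 samples => 20 observations for regression
    return this._elapsed >= 500 && this._samples.length > 210;
}
```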
```js
        this._start = performance.now();
        this.bench();
    }
}
```
Noticed that this locks up the UI, with the currently-running benchmark below the fold so that it's not immediately obvious why the page is "frozen". Not a huge deal, but it might be a slightly better experience if each benchmark's results were collapsed by default, except for the currently-running one.
Let's keep them expanded for now, so that if you use an extension to snapshot the page you get everything by default. I did add a "Running" / "Finished" indicator to the top of the page so you can see without scrolling if it's done or not. I anticipate the most common usage will be either:
- Focused on one benchmark (using URL hash)
- Running all benchmarks (while getting coffee/lunch)
After more thought about outliers and research about effect size, here's what I'd propose:
LGTM (see comment above re: eslint change)
@jfirebaugh oops 806b4d4
Force-pushed from 7404bdf to b8e40a0
Force-pushed from b8e40a0 to ecaa76a
I wrote a long prose description of the changes, and then my editor threw it away. 😭 Anyway, this is what it looks like:
Fixes #3237
Fixes #3238
TODO
- ~~Fix~~ Remove GeoJSON `setData` benchmarks -- stats say they're currently completely bogus.
- Decide whether to break the `Layout` benchmark into smaller benchmark tests, and if so, how.
- ~~Maybe we should force GC between each "version"?~~ (Not possible.) Or add back other voodoo that was included before, like a 250ms `setTimeout` between benchmarks (sketch below).
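If we do bring that back, a sketch of what the delay could look like (hypothetical runner code, including the `bench.run()` call, not part of this PR):

```js
// Hypothetical sketch: leave ~250ms of idle time between benchmarks so GC
// and leftover work from the previous run don't bleed into the next one.
function runSequentially(benchmarks, delayMs = 250) {
    return benchmarks.reduce((chain, bench) =>
        chain
            .then(() => new Promise((resolve) => setTimeout(resolve, delayMs)))
            .then(() => bench.run()),
    Promise.resolve());
}
```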