New benchmarking system #5340

Merged (1 commit) · Sep 27, 2017
Conversation

@jfirebaugh (Contributor) commented Sep 21, 2017

I wrote a long prose description of the changes, and then my editor threw it away. 😭 Anyway, this is what it looks like:

[screenshot: the new benchmark results page]

Fixes #3237
Fixes #3238

TODO

  • Fix lint and type errors
  • Fix "WARNING: Too many active WebGL contexts. Oldest context will be lost."
  • Visualize summary statistics on density plot.
  • Fix or remove GeoJSON setData benchmarks -- stats say they're currently completely bogus.
  • Implement a more robust way of determining when we have an adequate number of samples. (Hopefully many benchmarks can be sampled for less than 5 seconds.)
  • Decide if we want to do sub-benchmark breakdowns / benchmark parameterization (e.g. for each zoom level that Layout benchmark tests), and if so, how.
  • Figure out a statistically sound way of generating the groups for regression (bootstrapping?)
  • Maybe we should force GC between each "version"? (Not possible.) Or add back other voodoo that was included before, like a 250ms setTimeout between benchmarks.
  • Figure out how to get results into a PR, like the "Copy to Clipboard" button in the previous version.

@anandthakker (Contributor) commented Sep 21, 2017

😍 amazing, @jfirebaugh!

Implement a more robust way of determining when we have an adequate number of samples. (Hopefully many benchmarks can be sampled for less than 5 seconds.)

👍 Rather than running for a minimum amount of time, I think a reasonable approach would be to have a Benchmark#minSamples(), set to a value that gives us enough samples for meaningful regression. (Using our current regression approach, if we want N points for regression, we need N*(N+1)/2 samples.) This will also be helped by 👇
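
For example, 20 regression points would need 20 * 21 / 2 = 210 samples. A minimal sketch of that hypothetical minSamples helper (the name and shape are assumptions, not code from this PR):

function minSamples(regressionPoints) {
    // Presumably the k-th regression observation aggregates k raw samples,
    // so N observations need 1 + 2 + … + N = N * (N + 1) / 2 samples.
    return regressionPoints * (regressionPoints + 1) / 2;
}

minSamples(20); // => 210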

Decide if we want to do sub-benchmark breakdowns / benchmark parameterization (e.g. for each zoom level that Layout benchmark tests), and if so, how

One way or another, I definitely think we should be breaking down some of the larger benchmarks (like Layout) into smaller ones. The problems with such benchmarks are (a) that each iteration is long, meaning that they have to run for much longer to get statistically meaningful results, and (b) if they run on multiple fixtures (e.g. doing layout on several tiles, or evaluating several filters), that's a somewhat arbitrary assessment -- better would be to select a single fixture; or if no single fixture is adequately representative, to select a few, and run each as its own benchmark / sub-benchmark.

Maybe we should force GC between each "version"?

If we can, that's probably a good idea -- maybe even between each individual benchmark? Somewhat related to this, we might want to consider sampling the minimum run time of a few iterations, inspired by the ideas presented in https://arxiv.org/pdf/1608.04295.pdf. It's counterintuitive -- my instinct is to pay attention to the worst-case data, not drop it! -- but, done right, this could make a lot of sense.

@jfirebaugh (Contributor, Author):

Implement a more robust way of determining when we have an adequate number of samples.

Spending more time reading the criterion code, it looks like the strategy is roughly:

const THRESHOLD = 30; // ms

function runBenchmark(bench, timeLimit) {
    bench(); // run once as a warm-up
    // run GC here, if the environment exposes it

    let count = 0;
    let totalTime = 0;
    let timeOverThreshold = 0;

    // iterate over the distinct integers in the series
    // floor(1.05^1), floor(1.05^2), floor(1.05^3), …
    // (geometricSeries is a stand-in for that generator)
    for (const iterations of geometricSeries(1.05)) {
        // measure running the benchmark `iterations` times
        const start = performance.now();
        for (let i = 0; i < iterations; i++) {
            bench();
        }
        const t = performance.now() - start;

        count += 1;
        totalTime += t;
        timeOverThreshold += t - THRESHOLD;

        if (totalTime >= timeLimit && timeOverThreshold >= 10 * THRESHOLD && count >= 5) {
            break;
        }
    }
}

Note that this is in fact collecting a multivariate sample; in particular the Measurement type includes both elapsed time and number of iterations.

@mourner (Member) commented Sep 21, 2017

Fix GeoJSON setData benchmarks -- stats say they're currently completely bogus.

Maybe let's simply remove all the bogus benchmarks for now?

@anandthakker (Contributor) commented Sep 21, 2017

@jfirebaugh Criterion's focus on timeOverThreshold makes me think that they're primarily concerned with making sure that the cost of measurement is small relative to the values being measured. (Corroborated by this comment.)

In our case, given that (a) performance.now()'s precision is way smaller than any of our benchmark runtimes are likely to be, and (b) much of our overhead likely comes from the async/promise handling and is thus per-iteration, running N iterations at a time doesn't really help us. That's the reason I think we're better off focusing on sample count rather than elapsed time.

@jfirebaugh (Contributor, Author):

I flipped the axes on the density plot and added a visualization of individual samples, mean, median, the range between first and third quartiles, and extrema.

[screenshot: density plot showing individual samples, mean, median, interquartile range, and extrema]

Worth noting: the results in the screenshots posted here often suggest that master is faster than the benchmarks branch, or vice versa. However, despite the labels, they're all results of benchmarking the benchmarks branch against itself! Sadly, it seems we'll rarely be able to simply read off numbers and be certain that one thing is faster than another.

@anandthakker (Contributor):

Sadly, it seems we'll rarely be able to simply read off numbers and be certain that one thing is faster than another.

@jfirebaugh in most cases, the differences we're seeing between mean (or min) execution times are a very small fraction of the standard deviation of the distribution, which is what we'd expect.

I think we can come up with a metric for the delta--basically some measure of effect size, like Cohen's d--that takes this into account and would allow us to "read off numbers" more meaningfully.
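
To make that concrete, here's a rough sketch of computing Cohen's d for two versions' samples (illustrative only, not code from this branch; it assumes each argument is an array of per-iteration times in milliseconds):

function cohensD(samplesA, samplesB) {
    const mean = (s) => s.reduce((sum, x) => sum + x, 0) / s.length;
    const variance = (s, m) => s.reduce((sum, x) => sum + (x - m) * (x - m), 0) / (s.length - 1);

    const meanA = mean(samplesA);
    const meanB = mean(samplesB);

    // Pooled standard deviation of the two sample sets.
    const pooled = Math.sqrt(
        ((samplesA.length - 1) * variance(samplesA, meanA) +
         (samplesB.length - 1) * variance(samplesB, meanB)) /
        (samplesA.length + samplesB.length - 2));

    // Sign says which version is slower; magnitude is in units of pooled stddev,
    // so it stays meaningful even when the raw deltas are tiny.
    return (meanA - meanB) / pooled;
}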

@jfirebaugh (Contributor, Author):

Oh, that would be great!

@jfirebaugh force-pushed the benchmarks branch 2 times, most recently from 7714ff5 to 1868098 on September 22, 2017 at 22:39

@anandthakker (Contributor):

Non-zero t-intercept allows the plot more breathing room

@jfirebaugh this allows more breathing room, but makes it harder to interpret the (relative) magnitude of a shift.

@jfirebaugh (Contributor, Author):

Yeah, it's a tradeoff. I think being able to show more detail in the samples and summary statistics is more valuable though, especially if we can provide an effect size summary.

@anandthakker (Contributor) left a comment:

This looks great! Aside from some smaller comments inline, there are a couple of changes I think we should make:

  1. Either add sub-benchmarking, for the reasons given in my earlier comment on this PR, or if not, then update Paint, Layout, MapLoad, and Query{Box,Point} to use only a single tile or zoom level.
  2. Deal with high outliers. They seem to be occurring in most of the benchmarks, and I think they're almost certainly all due to external factors like GC. With these large outliers, I worry that we're not going to be able to rely on our mean / stddev / regression results, which are all pretty sensitive to outliers. One option is to automatically drop them (just for the mean/stddev/regression calculations, but not in the plots), using some criterion like Tukey's fences with k equal to 2 or 3 (see the sketch below). Another option would be to add, e.g., a 10th percentile statistic to the table and treat that as the primary metric (with the other stats serving more of a supporting role).
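
A rough sketch of what the outlier filter in option 2 could look like, using Tukey's fences (illustrative only; the crude nearest-rank quantile and the default k are assumptions):

function dropHighOutliers(samples, k = 3) {
    const sorted = samples.slice().sort((a, b) => a - b);
    // Crude quantile: nearest rank, no interpolation.
    const quantile = (p) => sorted[Math.floor(p * (sorted.length - 1))];
    const q1 = quantile(0.25);
    const q3 = quantile(0.75);
    const upperFence = q3 + k * (q3 - q1); // Tukey's upper fence
    // Exclude high outliers from mean/stddev/regression, but keep plotting the raw samples.
    return sorted.filter((x) => x <= upperFence);
}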


function register(Benchmark) {
    window.mapboxglBenchmarks[Benchmark.name] = window.mapboxglBenchmarks[Benchmark.name] || {};
    window.mapboxglBenchmarks[Benchmark.name].master = new Benchmark();

Contributor:
Remove before merging

@@ -99,7 +100,8 @@
"bubleError": true,
"transforms": {
"dangerousForOf": true
}
},
"objectAssign": "Object.assign"

Contributor:

This isn't supported in IE - could we just use our own util.extend()?

@jfirebaugh (Contributor, Author) commented Sep 27, 2017:

No, because it needs to be the name of something that's globally accessible without a require. What we want is to make object spread available only in benchmarks/JSX, but I couldn't figure out a good way to do so.

Contributor:

Ah, gotcha. We can enforce with eslint: 6afaa79

{this.renderStatistic('Minimum',
    (version) => `${formatSample(d3.min(version.samples))} ms`)}
{this.renderStatistic('Variance',
    (version) => `${formatSample(d3.variance(version.samples))} ms`)}

@anandthakker (Contributor) commented Sep 26, 2017:

I don't think we really need variance here. It's redundant with -- and harder to interpret than -- standard deviation.


_done() {
    // 210 samples => 20 observations for regression
    return this._elapsed >= 300 && this._samples.length > 210;

Contributor:

I'm still not totally happy with this: for benchmarks with a short running time, we can both afford to and would benefit from getting more samples. Maybe let's bump the min runtime up to 500ms or so?

this._start = performance.now();
this.bench();
}
}

Contributor:

Noticed that this locks up the UI, with the currently-running benchmark below the fold so that it's not immediately obvious why the page is "frozen". Not a huge deal, but it might be a slightly better experience if each benchmark's results were collapsed by default, except for the currently-running one.

Contributor (Author):

Let's keep them expanded for now, so that if you use an extension to snapshot the page you get everything by default. I did add a "Running" / "Finished" indicator to the top of the page so you can see without scrolling if it's done or not. I anticipate the most common usage will be either:

  • Focused on one benchmark (using URL hash)
  • Running all benchmarks (while getting coffee/lunch)

@anandthakker (Contributor) commented Sep 27, 2017

After more thought about outliers and research about effect size, here's what I'd propose:

  • Replace our mean and standard deviation statistics with a trimmed mean and Winsorized standard deviation. This will make them more robust to outliers, so that we can report the difference between the (trimmed) means as one metric of effect size. (We certainly don't want to ignore outliers. However, throwing them into the mean doesn't really help us to assess them. Instead, we're better off handling them separately: e.g., by inspecting the plots for suspicious changes in the number or pattern of outliers.)
  • Also calculate and report the "probability of superiority" (or, in our case, "inferiority"): i.e., the probability that a randomly chosen sample from the non-master branch is slower than a randomly chosen sample from the master branch. This is a simple, intuitive score that's insensitive to outliers, and I think it would support the difference-between-trimmed-means metric nicely (see the sketch below).
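
Roughly, those statistics could look something like the following sketch (illustrative only, not this PR's implementation; the 10% trim fraction and the tie handling are assumptions):

function trimmedMean(samples, trim = 0.1) {
    const sorted = samples.slice().sort((a, b) => a - b);
    const cut = Math.floor(sorted.length * trim);
    const kept = sorted.slice(cut, sorted.length - cut);
    return kept.reduce((sum, x) => sum + x, 0) / kept.length;
}

function winsorizedStdDev(samples, trim = 0.1) {
    const sorted = samples.slice().sort((a, b) => a - b);
    const cut = Math.floor(sorted.length * trim);
    // Clamp the tails to the trim boundaries instead of dropping them.
    const lo = sorted[cut];
    const hi = sorted[sorted.length - 1 - cut];
    const w = sorted.map((x) => Math.min(Math.max(x, lo), hi));
    const mean = w.reduce((sum, x) => sum + x, 0) / w.length;
    return Math.sqrt(w.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (w.length - 1));
}

// Probability that a random sample from the non-master branch is slower than a
// random sample from master ("probability of inferiority"); 0.5 means no difference.
function probabilityOfInferiority(branchSamples, masterSamples) {
    let slower = 0;
    for (const b of branchSamples) {
        for (const m of masterSamples) {
            if (b > m) slower += 1;
            else if (b === m) slower += 0.5;
        }
    }
    return slower / (branchSamples.length * masterSamples.length);
}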

@anandthakker (Contributor) left a comment:

LGTM (see comment above re: eslint change)

@anandthakker (Contributor):

@jfirebaugh oops 806b4d4
