New benchmarking system #5340

Merged (1 commit) · Sep 27, 2017
Conversation

@jfirebaugh (Contributor) commented Sep 21, 2017

I wrote a long prose description of the changes, and then my editor threw it away. 😭 Anyway, this is what it looks like:

[screenshot: the new benchmark results page]

Fixes #3237
Fixes #3238

TODO

  • Fix lint and type errors
  • Fix "WARNING: Too many active WebGL contexts. Oldest context will be lost."
  • Visualize summary statistics on density plot.
  • Fix or remove GeoJSON setData benchmarks -- stats say they're currently completely bogus.
  • Implement a more robust way of determining when we have an adequate number of samples. (Hopefully many benchmarks can be sampled for less than 5 seconds.)
  • Decide if we want to do sub-benchmark breakdowns / benchmark parameterization (e.g. for each zoom level that Layout benchmark tests), and if so, how.
  • Figure out a statistically sound way of generating the groups for regression (bootstrapping?)
  • Maybe we should force GC between each "version"? (Not possible.) Or add back other voodoo that was included before, like a 250ms setTimeout between benchmarks.
  • Figure out how to get results into a PR, like the "Copy to Clipboard" button in the previous version.

@anandthakker (Contributor) commented Sep 21, 2017

😍 amazing, @jfirebaugh!

Implement a more robust way of determining when we have an adequate number of samples. (Hopefully many benchmarks can be sampled for less than 5 seconds.)

👍 Rather than running for a minimum amount of time, I think a reasonable approach would be to have a Benchmark#minSamples(), set to a value that gives us enough samples for meaningful regression. (Using our current regression approach, if we want N points for regression, we need N*(N+1)/2 samples.) This will also be helped by 👇
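
For example, 20 regression points would need 20 * 21 / 2 = 210 samples. A minimal sketch of that hypothetical minSamples helper (the name and shape are assumptions, not code from this PR):

function minSamples(regressionPoints) {
    // Presumably the k-th regression observation aggregates k raw samples,
    // so N observations need 1 + 2 + … + N = N * (N + 1) / 2 samples.
    return regressionPoints * (regressionPoints + 1) / 2;
}

minSamples(20); // => 210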

Decide if we want to do sub-benchmark breakdowns / benchmark parameterization (e.g. for each zoom level that Layout benchmark tests), and if so, how

One way or another, I definitely think we should be breaking down some of the larger benchmarks (like Layout) into smaller ones. The problems with such benchmarks are (a) that each iteration is long, meaning that they have to run for much longer to get statistically meaningful results, and (b) if they run on multiple fixtures (e.g. doing layout on several tiles, or evaluating several filters), that's a somewhat arbitrary assessment -- better would be to select a single fixture; or if no single fixture is adequately representative, to select a few, and run each as its own benchmark / sub-benchmark.

Maybe we should force GC between each "version"?

If we can, that's probably a good idea -- maybe even between each individual benchmark? Somewhat related to this, we might want to consider sampling the minimum run time of a few iterations, inspired by the ideas presented in https://arxiv.org/pdf/1608.04295.pdf. It's counterintuitive -- my instinct is to pay attention to the worst-case data, not drop it! -- but, done right, this could make a lot of sense.

@jfirebaugh (Contributor, Author):

Implement a more robust way of determining when we have an adequate number of samples.

Spending more time reading the criterion code, it looks like the strategy is roughly:

const THRESHOLD = 30; // ms

function runBenchmark(bench, timeLimit) {
    bench(); // run once as a warm-up
    // run GC here, if the environment exposes it

    let count = 0;
    let totalTime = 0;
    let timeOverThreshold = 0;

    // iterate over the distinct integers in the series
    // floor(1.05^1), floor(1.05^2), floor(1.05^3), …
    // (geometricSeries is a stand-in for that generator)
    for (const iterations of geometricSeries(1.05)) {
        // measure running the benchmark `iterations` times
        const start = performance.now();
        for (let i = 0; i < iterations; i++) {
            bench();
        }
        const t = performance.now() - start;

        count += 1;
        totalTime += t;
        timeOverThreshold += t - THRESHOLD;

        if (totalTime >= timeLimit && timeOverThreshold >= 10 * THRESHOLD && count >= 5) {
            break;
        }
    }
}

Note that this is in fact collecting a multivariate sample; in particular the Measurement type includes both elapsed time and number of iterations.

@mourner (Member) commented Sep 21, 2017

Fix GeoJSON setData benchmarks -- stats say they're currently completely bogus.

Maybe let's simply remove all the bogus benchmarks for now?

@anandthakker (Contributor) commented Sep 21, 2017

@jfirebaugh Criterion's focus on timeOverThreshold makes me think that they're primarily concerned with making sure that the cost of measurement is small relative to the values being measured. (Corroborated by this comment.)

In our case, given that (a) performance.now()'s precision is way smaller than any of our benchmark runtimes are likely to be, and (b) much of our overhead likely comes from the async/promise handling and is thus per-iteration, running N iterations at a time doesn't really help us. That's the reason I think we're better off focusing on sample count rather than elapsed time.

@jfirebaugh (Contributor, Author):

I flipped the axes on the density plot and added a visualization of individual samples, mean, median, the range between first and third quartiles, and extrema.

[screenshot: density plot showing individual samples, mean, median, interquartile range, and extrema]

Worth noting: the results in the screenshots posted here often suggest that master is faster than the benchmarks branch, or vice versa. However, despite the labels, they're all results of benchmarking the benchmarks branch against itself! Sadly, it seems we'll rarely be able to simply read off numbers and be certain that one thing is faster than another.

@anandthakker (Contributor):

Sadly, it seems we'll rarely be able to simply read off numbers and be certain that one thing is faster than another.

@jfirebaugh in most cases, the differences we're seeing between mean (or min) execution times are a very small fraction of the standard deviation of the distribution, which is what we'd expect.

I think we can come up with a metric for the delta--basically some measure of effect size, like Cohen's d--that takes this into account and would allow us to "read off numbers" more meaningfully.
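
To make that concrete, here's a rough sketch of computing Cohen's d for two versions' samples (illustrative only, not code from this branch; it assumes each argument is an array of per-iteration times in milliseconds):

function cohensD(samplesA, samplesB) {
    const mean = (s) => s.reduce((sum, x) => sum + x, 0) / s.length;
    const variance = (s, m) => s.reduce((sum, x) => sum + (x - m) * (x - m), 0) / (s.length - 1);

    const meanA = mean(samplesA);
    const meanB = mean(samplesB);

    // Pooled standard deviation of the two sample sets.
    const pooled = Math.sqrt(
        ((samplesA.length - 1) * variance(samplesA, meanA) +
         (samplesB.length - 1) * variance(samplesB, meanB)) /
        (samplesA.length + samplesB.length - 2));

    // Sign says which version is slower; magnitude is in units of pooled stddev,
    // so it stays meaningful even when the raw deltas are tiny.
    return (meanA - meanB) / pooled;
}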

@jfirebaugh (Contributor, Author):

Oh, that would be great!

@jfirebaugh force-pushed the benchmarks branch 2 times, most recently from 7714ff5 to 1868098 on September 22, 2017 at 22:39

@anandthakker (Contributor):

Non-zero t-intercept allows the plot more breathing room

@jfirebaugh this allows more breathing room, but makes it harder to interpret the (relative) magnitude of a shift.

@jfirebaugh (Contributor, Author):

Yeah, it's a tradeoff. I think being able to show more detail in the samples and summary statistics is more valuable though, especially if we can provide an effect size summary.

@anandthakker (Contributor) left a comment:

This looks great! Aside from some smaller comments inline, there are a couple of changes I think we should make:

  1. Either add sub-benchmarking, for the reasons given in my earlier comment on this PR, or if not, then update Paint, Layout, MapLoad, and Query{Box,Point} to use only a single tile or zoom level.
  2. Deal with high outliers. They seem to be occurring in most of the benchmarks, and I think they're almost certainly all due to external factors like GC. With these large outliers, I worry that we're not going to be able to rely on our mean / stddev / regression results, which are all pretty sensitive to outliers. One option is to automatically drop them (just for the mean/stddev/regression calculations, but not in the plots), using some criterion like Tukey's fences with k equal to 2 or 3 (see the sketch below). Another option would be to add, e.g., a 10th percentile statistic to the table and treat that as the primary metric (with the other stats serving more of a supporting role).
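
A rough sketch of what the outlier filter in option 2 could look like, using Tukey's fences (illustrative only; the crude nearest-rank quantile and the default k are assumptions):

function dropHighOutliers(samples, k = 3) {
    const sorted = samples.slice().sort((a, b) => a - b);
    // Crude quantile: nearest rank, no interpolation.
    const quantile = (p) => sorted[Math.floor(p * (sorted.length - 1))];
    const q1 = quantile(0.25);
    const q3 = quantile(0.75);
    const upperFence = q3 + k * (q3 - q1); // Tukey's upper fence
    // Exclude high outliers from mean/stddev/regression, but keep plotting the raw samples.
    return sorted.filter((x) => x <= upperFence);
}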


function register(Benchmark) {
    window.mapboxglBenchmarks[Benchmark.name] = window.mapboxglBenchmarks[Benchmark.name] || {};
    window.mapboxglBenchmarks[Benchmark.name].master = new Benchmark();

Contributor:
Remove before merging

@@ -99,7 +100,8 @@
"bubleError": true,
"transforms": {
"dangerousForOf": true
}
},
"objectAssign": "Object.assign"

Contributor:

This isn't supported in IE - could we just use our own util.extend()?

@jfirebaugh (Contributor, Author) commented Sep 27, 2017:

No, because it needs to be the name of something that's globally accessible without a require. What we want is to make object spread available only in benchmarks/JSX, but I couldn't figure out a good way to do so.

Contributor:

Ah, gotcha. We can enforce with eslint: 6afaa79

{this.renderStatistic('Minimum',
    (version) => `${formatSample(d3.min(version.samples))} ms`)}
{this.renderStatistic('Variance',
    (version) => `${formatSample(d3.variance(version.samples))} ms`)}

@anandthakker (Contributor) commented Sep 26, 2017:

I don't think we really need variance here. It's redundant with -- and harder to interpret than -- standard deviation.


_done() {
    // 210 samples => 20 observations for regression
    return this._elapsed >= 300 && this._samples.length > 210;

Contributor:

I'm still not totally happy with this: for benchmarks with a short running time, we can both afford to and would benefit from getting more samples. Maybe let's bump the min runtime up to 500ms or so?

this._start = performance.now();
this.bench();
}
}

Contributor:

Noticed that this locks up the UI, with the currently-running benchmark below the fold so that it's not immediately obvious why the page is "frozen". Not a huge deal, but it might be a slightly better experience if each benchmark's results were collapsed by default, except for the currently-running one.

Contributor (Author):

Let's keep them expanded for now, so that if you use an extension to snapshot the page you get everything by default. I did add a "Running" / "Finished" indicator to the top of the page so you can see without scrolling if it's done or not. I anticipate the most common usage will be either:

  • Focused on one benchmark (using URL hash)
  • Running all benchmarks (while getting coffee/lunch)

@anandthakker (Contributor) commented Sep 27, 2017

After more thought about outliers and research about effect size, here's what I'd propose:

  • Replace our mean and standard deviation statistics with a trimmed mean and Winsorized standard deviation. This will make them more robust to outliers, so that we can report the difference between the (trimmed) means as one metric of effect size. (We certainly don't want to ignore outliers. However, throwing them into the mean doesn't really help us to assess them. Instead, we're better off handling them separately: e.g., by inspecting the plots for suspicious changes in the number or pattern of outliers.)
  • Also calculate and report the "probability of superiority" (or, in our case, "inferiority"): i.e., the probability that a randomly chosen sample from the non-master branch is slower than a randomly chosen sample from the master branch. This is a simple, intuitive score that's insensitive to outliers, and I think it would support the difference-between-trimmed-means metric nicely (see the sketch below).
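
Roughly, those statistics could look something like the following sketch (illustrative only, not this PR's implementation; the 10% trim fraction and the tie handling are assumptions):

function trimmedMean(samples, trim = 0.1) {
    const sorted = samples.slice().sort((a, b) => a - b);
    const cut = Math.floor(sorted.length * trim);
    const kept = sorted.slice(cut, sorted.length - cut);
    return kept.reduce((sum, x) => sum + x, 0) / kept.length;
}

function winsorizedStdDev(samples, trim = 0.1) {
    const sorted = samples.slice().sort((a, b) => a - b);
    const cut = Math.floor(sorted.length * trim);
    // Clamp the tails to the trim boundaries instead of dropping them.
    const lo = sorted[cut];
    const hi = sorted[sorted.length - 1 - cut];
    const w = sorted.map((x) => Math.min(Math.max(x, lo), hi));
    const mean = w.reduce((sum, x) => sum + x, 0) / w.length;
    return Math.sqrt(w.reduce((sum, x) => sum + (x - mean) * (x - mean), 0) / (w.length - 1));
}

// Probability that a random sample from the non-master branch is slower than a
// random sample from master ("probability of inferiority"); 0.5 means no difference.
function probabilityOfInferiority(branchSamples, masterSamples) {
    let slower = 0;
    for (const b of branchSamples) {
        for (const m of masterSamples) {
            if (b > m) slower += 1;
            else if (b === m) slower += 0.5;
        }
    }
    return slower / (branchSamples.length * masterSamples.length);
}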

@anandthakker (Contributor) left a comment:

LGTM (see comment above re: eslint change)

@anandthakker (Contributor):

@jfirebaugh oops 806b4d4
