Fixing Coordinated omission + replacing histogram with HdrHistogram #214

nitsanw · 2015-03-04T13:39:06Z

Hi,
I'm submitting this PR as a halfway point to an agreed upon PR. There are a few changes, some are cosmetic (which I should have perhaps avoided in the name of having a cleaner PR for you to look at), but the 2 items listed in the title are of importance:

Fixing coordinated omission(as described and discussed by Gil Tene[1][2][3]). This required the introduction of an intended start time per operation to be used in computing the latency. So if operation is delayed from executing due to previous operation not completing on time the latency is inclusive of that delay. This divorces the load generator intended execution schedule from the server push back behaviour.
Replacing OneMeasurementHistogram with OneMeasurementHdrHistogram which uses HdrHistogram[4] to capture latency thus enjoying a better suited data structure and allowing loss less compressed logging of interval histograms[5].
Your feedback is most welcome.
Thanks,
Nitsan
PS: I would like to cover this PR in a blog post in the near future and would value your comments on the draft as well
[1] https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ
[2] http://www.infoq.com/presentations/latency-pitfalls
[3] https://www.youtube.com/watch?v=9MKY4KypBzg
[4] https://github.com/HdrHistogram/HdrHistogram
[5] http://psy-lob-saw.blogspot.com/2015/02/hdrhistogram-better-latency-capture.html

throttle method in client.

Add option to randomize delay(default=true, as before)

Remove needless synchronization by using CHM in Measurements.

Add option for controlling the feature(default on).

Add a joined histogram measuremement to compare old/new histogram outputs.

- Output old histogram or HdrHistogram or both - Potentially output HdrHistogram to log file on status interval.

- Make status interval configurable

cmatser · 2015-03-04T19:57:49Z

This is a great addition to YCSB. The usefulness of the current Histogram implementation is dubious. When I was evaluating performance values, I would only rarely look at the histogram results. This would require some adjustments and re-runs in order to capture the correct "window" of data. This PR sounds like it would solve that problem and help provide more insight into the performance results. Thanks!

nitsanw · 2015-03-05T07:19:57Z

Glad you approve, not sure what you mean by "window" of data? is that the bucket sizing for original histogram?
The extra benefit of this patch when serializing the HdrHistograms to file is that you can get correct histogram data over parts of the run by using the HistogramLogProcessor and setting a time period of interest (or crafting your own tooling). This is helpful when trying to gain insight to application latency behaviour over time.

cmatser · 2015-03-05T16:44:40Z

The bucket sizing, exactly. We would often find that the histogram data would not be properly represented because we had a bulk of data ending up in the last "overflow" bucket. This would change based on our test parameters, specifically when we changed threadcount. We would then try re-runs with the number of buckets large enough to fit the current run and then wondered if we were adding undue complexity to the client, thus skewing the results. Needless to say, we just did this a few times, enough to confirm expected results, and then gave up on it.

nitsanw · 2015-03-05T19:23:26Z

HdrHistogram completely solves that problem, as well as a few others :)

 On Thursday, March 5, 2015 6:44 PM, Chrisjan Matser <notifications@github.com> wrote:

The bucket sizing, exactly. We would often find that the histogram data would not be properly represented because we had a bulk of data ending up in the last "overflow" bucket. This would change based on our test parameters, specifically when we changed threadcount. We would then try re-runs with the number of buckets large enough to fit the current run and then wondered if we were adding undue complexity to the client, thus skewing the results. Needless to say, we just did this a few times, enough to confirm expected results, and then gave up on it.—
Reply to this email directly or view it on GitHub.

busbey · 2015-05-17T08:44:30Z

Some feedback:

The details in CHANGES.md should be in a wiki page.
Please break the jar-with-dependencies change into a different PR.

Am I reading this correctly that this defaults to the same latency measuring behavior as before?

nitsanw · 2015-05-17T16:26:46Z

It's been so long I can't remember why I added the dependency packaging... I'll look into it and you can exclude it from the merge if you don't think it's important. If you think it's a good idea maybe open a separate issue for it and make this PR fix it. It does fix more than just one isolated thing and I realize not everyone likes to work that way.
I'm not sure which wiki you refer to and how can I edit it as part of a branch...

nitsanw · 2015-05-17T16:37:38Z

OK, having looked at the changes I think my change was the simplest way to deliver the dependency to the release package. The assembly configuration picks up the jars from the target folders of all modules and does not bring in modules dependencies. Removing the all in one change will break the HdrHistogram dependency for the end package.

nitsanw · 2015-05-17T16:42:37Z

Please note that given the slow response to PRs on this repository I have joined efforts with DB vendors and am trying to encourage collaboration on a fork:
https://github.com/YCSB/YCSB
The fork includes many contributions from @jbellis and has the CO fix merged. I'm not sure what would be the best course of action going forward in maintaining YCSB.

busbey · 2015-05-23T15:20:28Z

The problem with grabbing top level dependencies is that not all of our modules use the same versions of things.

Instead, we should update the distribution module to pull in the core dependencies and add them to the classpath.

nitsanw · 2015-05-25T04:26:01Z

Are you sure this is something that needs to be done as part of this PR? I'm not sure this is the place to enforce new policy for new problems. Alternatively you can manually merge and fix the issue as you see fit? I'm saying this not to spite, but as a maintainer of an open-source library.

nitsanw · 2015-06-01T07:26:53Z

Is there any particular action pending on this merge? I see the dependency issue as a continuation of #250 but if that is not the case I'm not sure what you need from me to resolve this PR

busbey · 2015-06-01T13:54:20Z

Here are the outstanding review feedback pieces, the first two are blockers:

I need someone to answer this question from above, so that we can decide if something different needs to happen or if we need a release note:

Am I reading this correctly that this defaults to the same latency measuring behavior as before?
CHANGES.md needs to be removed from the code base and made into a part of our docs. Since wiki pages are just markdown (our wiki is over here), this should be straight forward. Placing the content in a gist and linking here will allow me to move it into the docs (or waiting for make community-maintainable documentation #258 would mean it could be a PR). It might need some edits (like referencing that it's a branch, 'further suggestions' should be filed as issues, etc) but I think those can largely be follow-on.
It would be nice if the new core dependency was placed in the assembly (as the jackson dep in core must be already) instead of building a jar-with-dependencies. That the jackson dependencies work fine but the HdrHistogram one doesn't causes me some concern. So long as Improve deployment of optional dependencies #250 (refactor deployment) is a blocker for Release version 0.2.0 #266 (release 0.2.0) I'd be fine punting to that issue.
It would be nice to drop the commit that disables mapkeeper (since it's already disabled). But that's a nit.

I've been trying to find someone to pick things up, since your previous comment indicated that you didn't have time to take care of the needed changes. If you do now, that'd be great. Eventually I'll end up taking care of things myself, but my availability to handle PRs that need work is limited.

nitsanw · 2015-06-01T14:19:16Z

Default is latency measurement as before. I've changed this on https://github.com/YCSB/YCSB and would consider it a good idea to do so here to before release. The PR as it stands highlights the issue and allows before/after comparison. If we are agreed that the non-coordinated measurement is the way to go things could be simplified.
I'll have a look and see if I can find the relevant place for it. Dumping it as is seems a bit harsh. I could add the notes to the PR, allowing you to split it as you see fit when you release (e.g. parts to the release notes, parts to the wiki etc.)
I'll have a look at how the jackson dependency is handled and follow same pattern.
Dropping mapkeeper was probably done to make the build pass.

busbey · 2015-06-03T15:48:38Z

That plan sounds great. I'm hesitant to make the latency change the default until we've had some time to have folks knock it around (so maybe we can plan for making it the default in 0.3.0 or 0.4.0). The rest sounds great.

Let me know how things are going. I'm budgeting some dedicated time this weekend and need to decide if I can spend it on #250.

nitsanw · 2015-06-08T09:28:26Z

Did not end up having time for it. Life got in the way...
Will hopefully spend some time on it this week.
When are you aiming to release 0.2.0?

busbey · 2015-06-08T14:07:39Z

The 0.2.0 RC will branch on June 15th. (The next release will branch on July 15th).

.3. I'll have a look at how the jackson dependency is handled and follow same pattern.

You can ignore this one and forget about the dependency thing. I got far enough on #250 to see that it's probably only luck that made the jackson dependency work before.

nitsanw · 2015-06-16T22:55:18Z

Adding the changes notes here:
--- Please Copy/Paste/Break up as you think is appropriate ---
When used as a latency under load benchmark YCSB in it's original form suffers from
Coordinated Omission[1] and related measurement issue:

Load is controlled by response time
Measurement does not account for missing time
Measurement starts at beginning of request rather than at intended beginning
Measurement is limited in scope as the histogram does not provide data on overflow values

To provide a minimal correction patch the following were implemented:

Replace internal histogram implementation with HdrHistogram[2]:
HdrHistogram offers a dynamic range of measurement at a given precision and will
improve the fidelity of reporting. It allows capturing a much wider range of latencies.
HdrHistogram also supports compressed loss-less serialization which enable capturing
snapshot histograms from which lower resolution histograms can be constructed for plotting
latency over time. Snapshot interval histograms are serialized on status reporting which
must be enabled using the '-s' option.
Track intended operation start and report latencies from that point in time:
Assuming the benchmark sets a target schedule of execution in which every operation
is supposed to happen at a given time the benchmark should measure the latency between
intended start time and operation completion.
This required the introduction of a new measurement point and inevitably
includes measuring some of the internal preparation steps of the load generator.
These overhead should be negligible in the context of a network hop, but could
be corrected for by estimating the load-generator overheads (e.g. by measuring a
no-op DB or by measuring the setup time for an operation and deducting that from total).
This intended measurement point is only used when there is a target load (specified by
the -target paramaeter)

This branch supports the following new options:

-p measurementtype=histogram|hdrhistogram|hdrhistogram+histogram|timeseries
The new measurement types are hdrhistogram and hdrhistogram+histogram. Default is still
histogram, which is the old histogram. Ultimately we would remove the old measurement types
and use only HdrHistogram but the old measurement is left in there for comparison sake.
-p measurement.interval=op|intended|both
This new option deferentiates between measured intervals and adds the intended interval(as described)
above, and the option to record both the op and intended for comparison.
-p hdrhistogram.fileoutput=true|false
This new option will enable periodical writes of the interval histogram into an output file. The path can be set using '-p hdrhistogram.output.path='.

Example parameters:
-target 1000 -s -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p basicdb.verbose=false -p basicdb.simulatedelay=4 -p measurement.interval=both -p measurementtype=hdrhistogram -p hdrhistogram.fileoutput=true -p maxexecutiontime=60

Further changes made:

-p status.interval= (default=10)
Controls the number of seconds between status reports and therefore between HdrHistogram snapshots reported.
-p basicdb.randomizedelay=true|false
Controls weather the delay simulated by the mock DB is uniformly random or not.

Further suggestions:

Correction load control: currently after a pause the load generator will do
operations back to back to catchup, this leads to a flat out throughput mode
of testing as opposed to controlled load.
Move to async model: Scenarios where Ops have no dependency could delegate the
Op execution to a threadpool and thus separate the request rate control from the
synchronous execution of Ops. Measurement would start on queuing for execution.
https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ
https://github.com/HdrHistogram/HdrHistogram

[core] use HdrHistogram and fix coordinated omission also closes #10

busbey · 2015-06-17T05:00:40Z

I merged this manually in 4d68068. not sure why it didn't show up.

Filed a follow on to clean up the CHANGES.md stuff.

Thanks for working on this, it's a great addition.

nitsanw · 2015-06-17T06:16:11Z

AWESOME!!!!
Thanks for picking this up, this is great news :-)

cmccoy · 2015-06-17T15:02:23Z

👍 - thanks!

[core] use HdrHistogram and fix coordinated omission also closes brianfrankcooper#10

nitsanw added 14 commits January 14, 2015 17:46

Fix record count property default value in CoreWorkload, also refactor

2d71f50

throttle method in client.

Improve delay code

7b0d528

Add option to randomize delay(default=true, as before)

Wait for status thread to properly exit in shutdown sequence

166b31e

Add HdrHistogram based measurement.

c824ee7

Remove needless synchronization by using CHM in Measurements.

Switch to managing time in nanoseconds in the ClientThread

b6f0c35

Add dependency on HdrHistogram

e2d099d

Read me on CO in the context of YCSB

5272577

Add intended deadline reporting and measurement.

5a034da

Add option for controlling the feature(default on).

Measure intended latency and old latency side by side.

e54c8f9

Add a joined histogram measuremement to compare old/new histogram outputs.

typo

cd99ef6

- Output intended vs. per-op latency or both

ad7aaa3

- Output old histogram or HdrHistogram or both - Potentially output HdrHistogram to log file on status interval.

- Correct status reporting sliding timing

a957043

- Make status interval configurable

Add notes for status.interval and basicdb.ranomizedelay options.

4d24a6f

Upgrade to HdrHistogram 2.1.4

efe8ba0

nitsanw added 5 commits March 5, 2015 09:50

Add a spin.sleep option

2367e2f

pack dependencies in jar

2ff36f9

disable mapper as it fails to build

9b67844

correct typo

d0f1683

typo

c336223

Add new mock DB simulator

92f6b3a

slfritchie mentioned this pull request May 7, 2015

Fix/avoid measurement errors due to coordinated omission [JIRA: TOOLS-114] basho/basho_bench#183

Open

busbey mentioned this pull request May 17, 2015

Better model of request rates? #10

Closed

busbey mentioned this pull request May 17, 2015

Synchronize Measurements.getSummary() method to prevent ConcurrentModificationExceptions #207

Merged

meteorfox mentioned this pull request May 20, 2015

Proposal: Use Corrected YCSB for Coordinated-Omission Problem GoogleCloudPlatform/PerfKitBenchmarker#169

Closed

busbey added a commit that referenced this pull request Jun 17, 2015

Merge pull request #214 from LatencyUtils:master

4d68068

[core] use HdrHistogram and fix coordinated omission also closes #10

busbey mentioned this pull request Jun 17, 2015

Cleanup docs around new histogram / coordinated omission changes #299

Open

busbey closed this Jun 17, 2015

busbey mentioned this pull request Jun 18, 2015

Release version 0.2.0 #266

Closed

jaricftw pushed a commit to jaricftw/YCSB that referenced this pull request Jul 19, 2016

Merge pull request brianfrankcooper#214 from LatencyUtils:master

f40a230

[core] use HdrHistogram and fix coordinated omission also closes brianfrankcooper#10

jaricftw pushed a commit to jaricftw/YCSB that referenced this pull request Jul 19, 2016

Merge pull request brianfrankcooper#214 from LatencyUtils:master

faa152f

[core] use HdrHistogram and fix coordinated omission also closes brianfrankcooper#10

vlsi mentioned this pull request Feb 3, 2018

Mixed read/update proportions produce swinging latencies #1066

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing Coordinated omission + replacing histogram with HdrHistogram #214

Fixing Coordinated omission + replacing histogram with HdrHistogram #214

nitsanw commented Mar 4, 2015

cmatser commented Mar 4, 2015

nitsanw commented Mar 5, 2015

cmatser commented Mar 5, 2015

nitsanw commented Mar 5, 2015

busbey commented May 17, 2015

nitsanw commented May 17, 2015

nitsanw commented May 17, 2015

nitsanw commented May 17, 2015

busbey commented May 23, 2015

nitsanw commented May 25, 2015

nitsanw commented Jun 1, 2015

busbey commented Jun 1, 2015

nitsanw commented Jun 1, 2015

busbey commented Jun 3, 2015

nitsanw commented Jun 8, 2015

busbey commented Jun 8, 2015

nitsanw commented Jun 16, 2015

busbey commented Jun 17, 2015

nitsanw commented Jun 17, 2015

cmccoy commented Jun 17, 2015

Fixing Coordinated omission + replacing histogram with HdrHistogram #214

Fixing Coordinated omission + replacing histogram with HdrHistogram #214

Conversation

nitsanw commented Mar 4, 2015

cmatser commented Mar 4, 2015

nitsanw commented Mar 5, 2015

cmatser commented Mar 5, 2015

nitsanw commented Mar 5, 2015

busbey commented May 17, 2015

nitsanw commented May 17, 2015

nitsanw commented May 17, 2015

nitsanw commented May 17, 2015

busbey commented May 23, 2015

nitsanw commented May 25, 2015

nitsanw commented Jun 1, 2015

busbey commented Jun 1, 2015

nitsanw commented Jun 1, 2015

busbey commented Jun 3, 2015

nitsanw commented Jun 8, 2015

busbey commented Jun 8, 2015

nitsanw commented Jun 16, 2015

busbey commented Jun 17, 2015

nitsanw commented Jun 17, 2015

cmccoy commented Jun 17, 2015