InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents #30121

Merged
merged 9 commits into elastic:master from engine_test_two_histories
May 3, 2018

bleskes
Contributor

@bleskes bleskes commented Apr 25, 2018

We were recently looking at bugs that can only occur if two different documents are indexed concurrently. For example, what happens if the local checkpoint advances above the sequence number of a document that's being indexed? That can only happen if another concurrent operation caused the checkpoint to advance, and that operation has to be on another document to allow concurrency, since we acquire a per-uid lock. While our investigation proved that the suspected bug doesn't exist, we did discover that our unit test coverage is not good enough to cover this case.
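To make the concurrency argument concrete, here is a minimal sketch under stated assumptions. `CheckpointSketch`, `apply`, `markProcessed` and `interleaveTwoDocs` are hypothetical names, not Elasticsearch APIs. The model assumes only two things: writes take a per-uid lock, and the local checkpoint is the highest sequence number with no unprocessed sequence number at or below it. While the write to doc1 (seqNo 1) is paused under its lock, a write to doc2 (seqNo 0) completes on another thread and moves the checkpoint. With a single uid that interleaving cannot happen, because the second write would wait on the same lock.

```java
import java.util.BitSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical, heavily simplified model (not Elasticsearch's real classes): writes are
 * serialized per uid, and a local checkpoint tracks the highest seqNo at or below which
 * every operation has been processed.
 */
final class CheckpointSketch {
    private final ConcurrentHashMap<String, ReentrantLock> uidLocks = new ConcurrentHashMap<>();
    private final BitSet processed = new BitSet();
    private long checkpoint = -1; // nothing processed yet

    synchronized void markProcessed(long seqNo) {
        processed.set((int) seqNo);
        // advance over every contiguous processed seqNo
        while (processed.get((int) (checkpoint + 1))) {
            checkpoint++;
        }
    }

    synchronized long getCheckpoint() {
        return checkpoint;
    }

    /** Runs a write holding only its own uid's lock, then marks the seqNo as processed. */
    void apply(String uid, long seqNo, Runnable write) {
        ReentrantLock lock = uidLocks.computeIfAbsent(uid, k -> new ReentrantLock());
        lock.lock();
        try {
            write.run();
            markProcessed(seqNo);
        } finally {
            lock.unlock();
        }
    }

    /**
     * doc1 (seqNo 1) pauses mid-write while doc2 (seqNo 0) completes on another thread.
     * Returns {checkpoint seen inside doc1's write, checkpoint after both finish}.
     */
    static long[] interleaveTwoDocs() {
        CheckpointSketch engine = new CheckpointSketch();
        CountDownLatch doc1Writing = new CountDownLatch(1);
        CountDownLatch doc2Done = new CountDownLatch(1);
        long[] seenMidWrite = new long[1];
        Thread writer = new Thread(() -> engine.apply("doc1", 1, () -> {
            doc1Writing.countDown();
            awaitQuietly(doc2Done);
            seenMidWrite[0] = engine.getCheckpoint();
        }));
        writer.start();
        awaitQuietly(doc1Writing);
        // A different uid takes a different lock, so this does not wait for doc1.
        // With the same uid it would block on doc1's lock and the demo would deadlock.
        engine.apply("doc2", 0, () -> { });
        doc2Done.countDown();
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {seenMidWrite[0], engine.getCheckpoint()};
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

`interleaveTwoDocs()` returns `{0, 1}`: the checkpoint moved from -1 to 0 while doc1's lock was still held.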

This PR extends the concurrent out-of-order replica processing test to use two documents in its history.
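A rough sketch of what a two-document history means for such a test, under assumptions: `TwoDocHistorySketch`, `Op`, `buildHistory`, `applyConcurrently` and `finalStateMatchesHistory` are hypothetical names, a `ConcurrentHashMap` stands in for the replica engine, and "highest seqNo wins per uid" stands in for the engine's out-of-order handling. Writes to two uids draw from a single sequence-number counter, the combined list is shuffled and applied from several threads (mirroring the test's `shuffle(...)` and `concurrentlyApplyOps(...)` steps), and the final state must still equal the last write to each document in sequence-number order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a replica receiving a two-document history out of order. */
final class TwoDocHistorySketch {
    static final class Op {
        final String uid;
        final long seqNo;
        final String value;

        Op(String uid, long seqNo, String value) {
            this.uid = uid;
            this.seqNo = seqNo;
            this.value = value;
        }
    }

    /** Randomly interleaves opsPerDoc writes to "doc1" and "doc2", drawing seqNos from one counter. */
    static List<Op> buildHistory(int opsPerDoc, Random random) {
        List<Op> ops = new ArrayList<>();
        int left1 = opsPerDoc;
        int left2 = opsPerDoc;
        for (long seqNo = 0; left1 + left2 > 0; seqNo++) {
            boolean doc1 = left2 == 0 || (left1 > 0 && random.nextBoolean());
            String uid = doc1 ? "doc1" : "doc2";
            ops.add(new Op(uid, seqNo, uid + "_" + seqNo));
            if (doc1) left1--; else left2--;
        }
        return ops;
    }

    /** Applies ops on a thread pool; per uid, the op with the highest seqNo wins. */
    static Map<String, Op> applyConcurrently(List<Op> ops, int threads) {
        ConcurrentHashMap<String, Op> store = new ConcurrentHashMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (Op op : ops) {
            // merge() is atomic per key, playing the role of the per-uid lock
            pool.execute(() -> store.merge(op.uid, op, (cur, inc) -> inc.seqNo > cur.seqNo ? inc : cur));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("ops did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return store;
    }

    /** Shuffles a random history, applies it concurrently, and compares with the in-order result. */
    static boolean finalStateMatchesHistory(long seed, int opsPerDoc, int threads) {
        Random random = new Random(seed);
        List<Op> history = buildHistory(opsPerDoc, random);
        Map<String, String> expected = new HashMap<>();
        for (Op op : history) {
            expected.put(op.uid, op.value); // history is in seqNo order, so the last write wins
        }
        List<Op> shuffled = new ArrayList<>(history);
        Collections.shuffle(shuffled, random);
        Map<String, String> actual = new HashMap<>();
        applyConcurrently(shuffled, threads).forEach((uid, op) -> actual.put(uid, op.value));
        return expected.equals(actual);
    }
}
```

Because the two uids use independent keys, operations on doc1 and doc2 genuinely run in parallel, which is exactly the interleaving a single-document history cannot produce.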

@bleskes bleskes added >non-issue >test Issues or PRs that are addressing/adding tests v7.0.0 v6.4.0 labels Apr 25, 2018
@bleskes bleskes added the :Distributed/Engine Anything around managing Lucene and the Translog in an open shard. label Apr 25, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@bleskes
Contributor Author

bleskes commented May 1, 2018

@ywelsch @DaveCTurner ping

shuffle(ops, random());
concurrentlyApplyOps(ops, engine);
// randomly interleave
AtomicLong seqNoGenerator = new AtomicLong();
Contributor

I don't think it needs to be an AtomicLong - it's only updated on this thread.

Contributor Author

Just habit. I'll change.

Contributor Author

It has to be an AtomicLong because it's referenced in a lambda.

Contributor

Bah, daft scoping rules; you're right, it can't just be a long.
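The constraint behind this exchange is a Java language rule: a local variable used inside a lambda body must be final or effectively final, so a plain `long` counter cannot be incremented from inside one. A minimal sketch (class and method names are hypothetical) of the two usual workarounds: the `AtomicLong` the test uses, and a one-element array, which is enough when, as the reviewer noted, only one thread touches the counter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class LambdaCaptureSketch {
    // long seqNo = 0;
    // LongSupplier next = () -> seqNo++;
    //   does not compile: a local variable captured by a lambda must be final or effectively final

    /** As in the test: the AtomicLong reference is effectively final, its value is not. */
    static List<Long> withAtomicLong(int count) {
        AtomicLong seqNoGenerator = new AtomicLong();
        LongSupplier next = seqNoGenerator::getAndIncrement;
        return draw(next, count);
    }

    /** Single-threaded alternative: a one-element array is also an effectively final reference. */
    static List<Long> withArrayHolder(int count) {
        long[] seqNo = {0};
        LongSupplier next = () -> seqNo[0]++;
        return draw(next, count);
    }

    private static List<Long> draw(LongSupplier next, int count) {
        List<Long> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(next.getAsLong());
        }
        return out;
    }
}
```

Both variants hand out 0, 1, 2, ... The `AtomicLong` costs little and stays correct if the lambda is ever invoked from several threads, which is a reasonable argument for keeping it.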

Contributor

@DaveCTurner DaveCTurner left a comment

Looks good. I asked for less duplication, but it's a matter of taste.

lastFieldValue = null;
lastFieldValueDoc1 = null;
}
final List<Engine.Operation> opsDoc2 =
Contributor

This duplication isn't to my taste - I think I'd try and pull the notion of "doc" out into a class of its own and have this kind of thing be methods there.

Contributor Author

What kind of duplication do you mean? The extraction of the last value?

Contributor

I mean that there's a block of lines that does something to doc1 followed by an essentially identical block of lines that does the same thing to doc2, both here and below in the blocks containing assertThat(collector.getTotalHits(), equalTo(1));. Also the parallel variables opsDoc{1,2}, lastOpDoc{1,2}, lastFieldValueDoc{1,2}. The nice thing about combining this stuff together is that it lets the reader see that there are no differences between the two treatments without needing to check the parallels line by line.

Contributor Author

I see the sentiment. I feel a class would be overkill for just one test. It's now 21 lines of code, all in one place. I prefer to keep it as is and refactor if we need it more often.
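The refactoring the reviewer had in mind might look roughly like the sketch below; the author kept the parallel `opsDoc1`/`opsDoc2` variables instead. `DocHistorySketch`, `DocHistory`, `recordWrite`, `expectedLastValue` and `storeMatches` are hypothetical names, and a plain `Map` stands in for the engine. The point is that one loop applies the same check to every document, so a reader does not have to compare two copies line by line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class DocHistorySketch {
    /** One simulated document: its uid and the values written to it, in seqNo order. */
    static final class DocHistory {
        final String uid;
        private final List<String> values = new ArrayList<>();

        DocHistory(String uid) {
            this.uid = uid;
        }

        /** Records a write at the given seqNo and returns the value written. */
        String recordWrite(long seqNo) {
            String value = uid + "_" + seqNo;
            values.add(value);
            return value;
        }

        /** The value a correct engine should end up with, or null if the doc was never written. */
        String expectedLastValue() {
            return values.isEmpty() ? null : values.get(values.size() - 1);
        }
    }

    /** One assertion path shared by every document, instead of a copy per document. */
    static boolean storeMatches(Map<String, String> store, List<DocHistory> docs) {
        for (DocHistory doc : docs) {
            if (!Objects.equals(store.get(doc.uid), doc.expectedLastValue())) {
                return false;
            }
        }
        return true;
    }
}
```

Adding a third document would then mean one more `DocHistory` in the list rather than another parallel block, which is the trade-off the thread weighs against the cost of a class used by a single test.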

}
final String valuePrefix = forReplica ? "r_" : "p_";
final Term id = newUid(docId);
final int startWithSeqNo = 0;
Contributor

Just to check, this is only for master (i.e. >= 7.0.0), right?

Contributor Author

I plan for this to also go to 6.x, but all the callers of this method were using false for partialOldPrimary, so I removed it. I'll make sure to keep it in the backport.

shuffle(allOps, random());
concurrentlyApplyOps(allOps, engine);


Contributor

Nit: extra blank line

Contributor Author

Removed.

@bleskes
Contributor Author

bleskes commented May 3, 2018

@DaveCTurner I responded to your feedback. Thanks. Can you take another look?

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM

@bleskes bleskes merged commit ccd791b into elastic:master May 3, 2018
@bleskes bleskes deleted the engine_test_two_histories branch May 3, 2018 12:57
@bleskes
Contributor Author

bleskes commented May 3, 2018

Thx @DaveCTurner

bleskes added a commit that referenced this pull request May 3, 2018
…two documents (#30121)
dnhatn added a commit that referenced this pull request May 4, 2018
* master:
  Set the new lucene version for 6.4.0
  [ML][TEST] Clean up jobs in ModelPlotIT
  Upgrade to 7.4.0-snapshot-1ed95c097b (#30357)
  Watcher: Ensure trigger service pauses execution (#30363)
  [DOCS] Added coming qualifiers in changelog
  [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372)
  Security: reduce garbage during index resolution (#30180)
  Make RepositoriesMetaData contents unmodifiable (#30361)
  Change quad tree max levels to 29. Closes #21191 (#29663)
  Test: use trial license in qa tests with security
  [ML] Add integration test for model plots (#30359)
  SQL: Fix bug caused by empty composites (#30343)
  [ML] Account for gaps in data counts after job is reopened (#30294)
  InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121)
  Change signature of Get Repositories Response (#30333)
  Tests: Use different watch ids per test in smoke test (#30331)
  [Docs] Add term query with normalizer example
  Adds Eclipse config for xpack licence headers (#30299)
  Watcher: Make start/stop cycle more predictable and synchronous (#30118)
  [test] add debug logging for packaging test
  [DOCS] Removed X-Pack Breaking Changes
  [DOCS] Fixes link to TLS LDAP info
  Update versions for start_trial after backport (#30218)
  Packaging: Set elasticsearch user to have non-existent homedir (#29007)
  [DOCS] Fixes broken links to bootstrap user (#30349)
  Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641)
  Make licensing FIPS-140 compliant (#30251)
  [DOCS] Reorganizes authentication details in Stack Overview (#30280)
  Network: Remove http.enabled setting (#29601)
  Fix merging logic of Suggester Options (#29514)
  [DOCS] Adds LDAP realm configuration details (#30214)
  [DOCS] Adds native realm configuration details (#30215)
  ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316)
  [DOCS] Enables edit links for X-Pack pages (#30278)
  Packaging: Unmark systemd service file as a config file (#29004)
  SQL: Reduce number of ranges generated for comparisons (#30267)
  Tests: Simplify VersionUtils released version splitting (#30322)
  Cancelling a peer recovery on the source can leak a primary permit (#30318)
  Added changelog entry for deb prerelease version change (#30184)
  Convert server javadoc to html5 (#30279)
  Create default ES_TMPDIR on Windows (#30325)
  [Docs] Clarify `fuzzy_like_this` redirect (#30183)
  Post backport of #29658.
  Fix docs of the `_ignored` meta field.
  Remove MapperService#types(). (#29617)
  Remove useless version checks in REST tests. (#30165)
  Add a new `_ignored` meta field. (#29658)
  Move repository-azure fixture test to QA project (#30253)

# Conflicts:
#	buildSrc/version.properties
#	server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
dnhatn added a commit that referenced this pull request May 4, 2018
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request May 6, 2018
* origin/ccr: (166 commits)
  Introduce soft-deletes retention policy based on global checkpoint (elastic#30335)
  Enable MockHttpTransport in ShardChangsIT
  Remove old sha files from dated Lucene snapshot
  Update InternalEngine tests on ccr side for elastic#30121
  Set the new lucene version for 6.4.0
  [ML][TEST] Clean up jobs in ModelPlotIT
  Upgrade to 7.4.0-snapshot-1ed95c097b (elastic#30357)
  Watcher: Ensure trigger service pauses execution (elastic#30363)
  [DOCS] Added coming qualifiers in changelog
  [DOCS] Commented out empty sections in the changelog to fix the doc build. (elastic#30372)
  Security: reduce garbage during index resolution (elastic#30180)
  Make RepositoriesMetaData contents unmodifiable (elastic#30361)
  Change quad tree max levels to 29. Closes elastic#21191 (elastic#29663)
  Test: use trial license in qa tests with security
  [ML] Add integration test for model plots (elastic#30359)
  SQL: Fix bug caused by empty composites (elastic#30343)
  [ML] Account for gaps in data counts after job is reopened (elastic#30294)
  InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (elastic#30121)
  Change signature of Get Repositories Response (elastic#30333)
  Tests: Use different watch ids per test in smoke test (elastic#30331)
  ...
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request May 6, 2018
* origin/ccr: (127 commits)
  Introduce soft-deletes retention policy based on global checkpoint (elastic#30335)
  Enable MockHttpTransport in ShardChangsIT
  Remove old sha files from dated Lucene snapshot
  Update InternalEngine tests on ccr side for elastic#30121
  Set the new lucene version for 6.4.0
  [ML][TEST] Clean up jobs in ModelPlotIT
  Upgrade to 7.4.0-snapshot-1ed95c097b (elastic#30357)
  Watcher: Ensure trigger service pauses execution (elastic#30363)
  [DOCS] Added coming qualifiers in changelog
  [DOCS] Commented out empty sections in the changelog to fix the doc build. (elastic#30372)
  Security: reduce garbage during index resolution (elastic#30180)
  Make RepositoriesMetaData contents unmodifiable (elastic#30361)
  Change quad tree max levels to 29. Closes elastic#21191 (elastic#29663)
  Test: use trial license in qa tests with security
  [ML] Add integration test for model plots (elastic#30359)
  SQL: Fix bug caused by empty composites (elastic#30343)
  [ML] Account for gaps in data counts after job is reopened (elastic#30294)
  InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (elastic#30121)
  Change signature of Get Repositories Response (elastic#30333)
  Tests: Use different watch ids per test in smoke test (elastic#30331)
  ...
dnhatn added a commit that referenced this pull request May 8, 2018
* 6.x:
  Stop forking javac (#30462)
  Fix tribe tests
  Docs: Use task_id in examples of tasks (#30436)
  Security: Rename IndexLifecycleManager to SecurityIndexManager (#30442)
  Packaging: Set elasticsearch user to have non-existent homedir (#29007)
  [Docs] Fix typo in cardinality-aggregation.asciidoc (#30434)
  Avoid NPE in `more_like_this` when field has zero tokens (#30365)
  Build: Switch to building javadoc with html5 (#30440)
  Add a quick tour of the project to CONTRIBUTING (#30187)
  Add stricter geohash parsing (#30376)
  Reindex: Use request flavored methods (#30317)
  Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure.  (#30432)
  Auto-expand replicas when adding or removing nodes (#30423)
  Silence IndexUpgradeIT test failures. (#30430)
  Fix line length violation in cache tests
  Add failing test for core cache deadlock
  [DOCS] convert forcemerge snippet
  Update forcemerge.asciidoc (#30113)
  Added zentity to the list of API extension plugins (#29143)
  Fix the search request default operation behavior doc (#29302) (#29405)
  Watcher: Mark watcher as started only after loading watches (#30403)
  Correct wording in log message (#30336)
  Do not fail snapshot when deleting a missing snapshotted file (#30332)
  AwaitsFix testCreateShrinkIndexToN
  DOCS: Correct mapping tags in put-template api
  DOCS: Fix broken link in the put index template api
  Add put index template api to high level rest client (#30400)
  [Docs] Add snippets for POS stop tags default value
  Remove entry inadvertently picked into changelog
  Move respect accept header on no handler to 6.3.1
  Respect accept header on no handler (#30383)
  [Test] Add analysis-nori plugin to the vagrant tests
  [Rollup] Validate timezone in range queries (#30338)
  [Docs] Fix bad link
  [Docs] Fix end of section in the korean plugin docs
  add the Korean nori plugin to the change logs
  Expose the Lucene Korean analyzer module in a plugin (#30397)
  Docs: remove transport_client from CCS role example (#30263)
  Test: remove cluster permission from CCS user (#30262)
  Watcher: Remove unneeded index deletion in tests
  fix docs branch version
  fix lucene snapshot version
  Upgrade to 7.4.0-snapshot-1ed95c097b (#30357)
  [ML][TEST] Clean up jobs in ModelPlotIT
  Watcher: Ensure trigger service pauses execution (#30363)
  [DOCS] Fixes ordering of changelog sections
  [DOCS] Commented out empty sections in the changelog to fix the doc build. (#30372)
  Make RepositoriesMetaData contents unmodifiable (#30361)
  Change signature of Get Repositories Response (#30333)
  6.x Backport: Terms query validate bug  (#30319)
  InternalEngineTests.testConcurrentOutOfOrderDocsOnReplica should use two documents (#30121)
  Security: reduce garbage during index resolution (#30180)
  Test: use trial license in qa tests with security
  [ML] Add integration test for model plots (#30359)
  SQL: Fix bug caused by empty composites (#30343)
  [ML] Account for gaps in data counts after job is reopened (#30294)
  [ML] Refactor DataStreamDiagnostics to use array (#30129)
  Make licensing FIPS-140 compliant (#30251)
  Do not load global state when deleting a snapshot (#29278)
  Don't load global state when only restoring indices (#29239)
  Tests: Use different watch ids per test in smoke test (#30331)
  Watcher: Make start/stop cycle more predictable and synchronous (#30118)
  [Docs] Add term query with normalizer example
  Adds Eclipse config for xpack licence headers (#30299)
  Fix message content in users tool (#30293)
  [DOCS] Removed X-Pack breaking changes page
  [DOCS] Added security breaking change
  [DOCS] Fixes link to TLS LDAP info
  [DOCS] Merges X-Pack release notes into changelog (#30350)
  [DOCS] Fixes broken links to bootstrap user (#30349)
  [Docs] Remove errant changelog line
  Fix NPE when CumulativeSum agg encounters null/empty bucket (#29641)
  [DOCS] Reorganizes authentication details in Stack Overview (#30280)
  Tests: Simplify VersionUtils released version splitting (#30322)
  Fix merging logic of Suggester Options (#29514)
  ReplicationTracker.markAllocationIdAsInSync may hang if allocation is cancelled (#30316)
  [DOCS] Adds LDAP realm configuration details (#30214)
  [DOCS] Adds native realm configuration details (#30215)
  Disable SSL on testing old BWC nodes (#30337)
  [DOCS] Enables edit links for X-Pack pages
  Cancelling a peer recovery on the source can leak a primary permit (#30318)
  SQL: Reduce number of ranges generated for comparisons (#30267)
  [DOCS] Adds links to changelog sections
  Convert server javadoc to html5 (#30279)
  REST Client: Add Request object flavored methods (#29623)
  Create default ES_TMPDIR on Windows (#30325)
  [Docs] Clarify `fuzzy_like_this` redirect (#30183)
  Fix docs of the `_ignored` meta field.
  Add a new `_ignored` meta field. (#29658)
  Move repository-azure fixture test to QA project (#30253)
dnhatn added a commit that referenced this pull request May 8, 2018
Labels
:Distributed/Engine Anything around managing Lucene and the Translog in an open shard. >non-issue >test Issues or PRs that are addressing/adding tests v6.4.0 v7.0.0-beta1

4 participants