Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] CreateIndexIT.testIndexWithUnknownSetting failure #32118

Closed
davidkyle opened this issue Jul 17, 2018 · 2 comments
Closed

[CI] CreateIndexIT.testIndexWithUnknownSetting failure #32118

davidkyle opened this issue Jul 17, 2018 · 2 comments
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI v7.0.0-beta1

Comments

@davidkyle
Copy link
Member

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2335/

The error is

com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=6940, name=elasticsearch[node_td3][write][T#1], state=RUNNABLE, group=TGRP-CreateIndexIT]
	at __randomizedtesting.SeedInfo.seed([4C96DA64C5B54786:A3C49766AA9A4A5B]:0)
Caused by: java.lang.AssertionError: shard term already update.  op term [2], shardTerm [3]
	at __randomizedtesting.SeedInfo.seed([4C96DA64C5B54786]:0)
	at org.elasticsearch.index.shard.IndexShard.lambda$acquireReplicaOperationPermit$9(IndexShard.java:2225)
	at org.elasticsearch.index.shard.IndexShardOperationPermits.doBlockOperations(IndexShardOperationPermits.java:173)
	at org.elasticsearch.index.shard.IndexShardOperationPermits.blockOperations(IndexShardOperationPermits.java:110)
	at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2224)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:616)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493)
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:479)
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63)
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1679)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723)
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Does not reproduce

./gradlew :server:integTest \
  -Dtests.seed=4C96DA64C5B54786 \
  -Dtests.class=org.elasticsearch.action.admin.indices.create.CreateIndexIT \
  -Dtests.method="testIndexWithUnknownSetting" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt \
  -Dtests.timezone=US/Pacific-New
@davidkyle davidkyle added >test-failure Triaged test failures from CI v7.0.0 :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. labels Jul 17, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

ywelsch added a commit that referenced this issue Aug 3, 2018
We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues
linked below), leading to the discovery of a race between resetting a replica when it learns about a
higher term and when the same replica is promoted to primary. This commit fixes the race by
distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level
operation term. The former is set during the cluster state update or when a replica learns about a
new primary. The latter is only incremented under the operation block, which can happen in a
delayed fashion. It also solves the issue where a replica that's still adjusting to the new term
receives a cluster state update that promotes it to primary, which can happen in the situation of
multiple nodes being shut down in short succession. In that case, the cluster state update thread
would call `asyncBlockOperations` in `updateShardState`, which in turn would throw an exception
as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard.
This commit therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks
(which will all take precedence over operations acquiring permits). Finally, it also moves the primary
activation of the replication tracker under the operation block, so that the actual transition to
primary only happens under the operation block.

Relates to #32431, #32304 and #32118
ywelsch added a commit that referenced this issue Aug 3, 2018
We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues
linked below), leading to the discovery of a race between resetting a replica when it learns about a
higher term and when the same replica is promoted to primary. This commit fixes the race by
distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level
operation term. The former is set during the cluster state update or when a replica learns about a
new primary. The latter is only incremented under the operation block, which can happen in a
delayed fashion. It also solves the issue where a replica that's still adjusting to the new term
receives a cluster state update that promotes it to primary, which can happen in the situation of
multiple nodes being shut down in short succession. In that case, the cluster state update thread
would call `asyncBlockOperations` in `updateShardState`, which in turn would throw an exception
as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard.
This commit therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks
(which will all take precedence over operations acquiring permits). Finally, it also moves the primary
activation of the replication tracker under the operation block, so that the actual transition to
primary only happens under the operation block.

Relates to #32431, #32304 and #32118
ywelsch added a commit that referenced this issue Aug 3, 2018
We've recently seen a number of test failures that tripped an assertion in IndexShard (see issues
linked below), leading to the discovery of a race between resetting a replica when it learns about a
higher term and when the same replica is promoted to primary. This commit fixes the race by
distinguishing between a cluster state primary term (called pendingPrimaryTerm) and a shard-level
operation term. The former is set during the cluster state update or when a replica learns about a
new primary. The latter is only incremented under the operation block, which can happen in a
delayed fashion. It also solves the issue where a replica that's still adjusting to the new term
receives a cluster state update that promotes it to primary, which can happen in the situation of
multiple nodes being shut down in short succession. In that case, the cluster state update thread
would call `asyncBlockOperations` in `updateShardState`, which in turn would throw an exception
as blocking permits is not allowed while an ongoing block is in place, subsequently failing the shard.
This commit therefore extends the IndexShardOperationPermits to allow it to queue multiple blocks
(which will all take precedence over operations acquiring permits). Finally, it also moves the primary
activation of the replication tracker under the operation block, so that the actual transition to
primary only happens under the operation block.

Relates to #32431, #32304 and #32118
@ywelsch
Copy link
Contributor

ywelsch commented Aug 3, 2018

Closed by #32442. If this still occurs, please reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI v7.0.0-beta1
Projects
None yet
Development

No branches or pull requests

4 participants