Enable global checkpoint listeners to timeout #33620

jasontedor · 2018-09-12T05:13:59Z

In cross-cluster replication, we will use global checkpoint listeners to long poll for updates to a shard. However, we do not want these polls to wait indefinitely, as it could be difficult to discern whether the listener is still waiting for updates or something has gone horribly wrong and cross-cluster replication is stuck. Instead, we want these listeners to time out after some period (for example, one minute) so that they are notified and we can update the status on the following side to show that cross-cluster replication is still active. After this, we will immediately go back into poll mode.

To do this, we need the ability to associate a timeout with a global checkpoint listener. This commit adds this capability.

Relates #32696
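
As an illustration of associating a timeout with a listener, here is a minimal Java sketch. It is not the code from this PR: the class `TimeoutListenersSketch`, its `Listener` interface, the `UNASSIGNED` sentinel, and all method names are hypothetical, and only JDK `java.util.concurrent` classes are used. It assumes each listener is stored next to the `ScheduledFuture` of its pending timeout, so that whichever happens first (an update or the timeout) removes the listener and notifies it exactly once.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of listeners with timeouts; not the Elasticsearch implementation. */
final class TimeoutListenersSketch {

    /** Called with an updated checkpoint (e == null) or with a failure (e != null). */
    interface Listener {
        void accept(long globalCheckpoint, Exception e);
    }

    /** Hypothetical sentinel passed as the checkpoint when a listener is notified of a failure. */
    static final long UNASSIGNED = -2L;

    private final ScheduledExecutorService scheduler;

    // guarded by this: each waiting listener is paired with the future of its pending timeout
    private final Map<Listener, ScheduledFuture<?>> listeners = new LinkedHashMap<>();

    TimeoutListenersSketch(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    /** Registers a listener that is notified on the next update or after the timeout, whichever is first. */
    synchronized void add(Listener listener, long timeout, TimeUnit unit) {
        ScheduledFuture<?> future = scheduler.schedule(() -> {
            boolean removed;
            synchronized (this) {
                // only the path that removes the listener is allowed to notify it
                removed = listeners.remove(listener) != null;
            }
            if (removed) {
                listener.accept(UNASSIGNED, new TimeoutException("timed out after " + timeout + " " + unit));
            }
        }, timeout, unit);
        listeners.put(listener, future);
    }

    /** Notifies every waiting listener of the new checkpoint and cancels its timeout. */
    void globalCheckpointUpdated(long globalCheckpoint) {
        Map<Listener, ScheduledFuture<?>> current;
        synchronized (this) {
            current = new LinkedHashMap<>(listeners);
            listeners.clear();
        }
        // notify outside the lock so that a slow listener does not block registration
        current.forEach((listener, future) -> {
            future.cancel(false);
            listener.accept(globalCheckpoint, null);
        });
    }

    /** Number of listeners still waiting. */
    synchronized int pending() {
        return listeners.size();
    }
}
```

Under these assumptions, a long-polling caller that receives a `TimeoutException` would record that it is still alive and then call `add` again to resume polling.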

elasticmachine · 2018-09-12T05:14:01Z

Pinging @elastic/es-distributed

bleskes

LGTM. Left some nits to accept or reject.

bleskes · 2018-09-12T12:08:49Z

server/src/main/java/org/elasticsearch/index/shard/GlobalCheckpointListeners.java

                logger.warn("error notifying global checkpoint listener of closed shard", caught);
+            } else {
+                assert e instanceof TimeoutException : e;
+                logger.warn("error notifying global checkpoint listener of timeout", caught);


nit: why not just always log the exception (to serve as an indicator of what happened) and assert it's either IndexShardClosedException or TimeoutException?

I prefer it as is since it's more explicit, and since your approach only saves two lines of code I'll stick with mine. 😇

jasontedor · 2018-09-12T12:36:43Z

Thanks @bleskes, will merge on green.

martijnvg

LGTM

dnhatn

LGTM.

dnhatn · 2018-09-12T12:48:13Z

server/src/main/java/org/elasticsearch/index/shard/GlobalCheckpointListeners.java

    }

    // guarded by this
    private boolean closed;
-    private volatile List<GlobalCheckpointListener> listeners;
+    private volatile Map<GlobalCheckpointListener, ScheduledFuture<?>> listeners;


afaics, all accesses to listeners are under lock, so volatile should not be needed here, but I may be missing something.

You're right, good catch @dnhatn. In an early version of #32696 this was not the case but during review we changed it so that it is the case, and then missed that volatile is no longer necessary here. I will remove this in a follow-up.
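
The point about `volatile` can be illustrated with a small Java sketch. The class `GuardedRegistrySketch` and its methods are hypothetical and are not taken from `GlobalCheckpointListeners`. The assumption is the one stated in the review: every read and write of the field happens while holding the same monitor. In that case the monitor already provides the visibility guarantees, so marking the field `volatile` adds nothing.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a field accessed only under a lock does not need volatile. */
final class GuardedRegistrySketch {

    // guarded by this: every access below is inside a synchronized method,
    // so releasing and acquiring the monitor makes writes visible to later readers
    private Map<String, Integer> entries;

    /** Adds an entry, lazily creating the map on first use. */
    synchronized void put(String key, int value) {
        if (entries == null) {
            entries = new HashMap<>();
        }
        entries.put(key, value);
    }

    /** Number of entries currently held. */
    synchronized int size() {
        return entries == null ? 0 : entries.size();
    }

    /** Hands the current entries to the caller and resets the field. */
    synchronized Map<String, Integer> drain() {
        Map<String, Integer> current = entries == null ? new HashMap<>() : entries;
        entries = null;
        return current;
    }
}
```

If any access were moved outside the lock, for example an unsynchronized read of the field on a fast path, the reasoning would no longer hold and `volatile` (or another form of safe publication) would be needed again.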

jasontedor added the review, v7.0.0, :Distributed/Distributed, and v6.5.0 labels Sep 12, 2018

jasontedor requested review from martijnvg, bleskes, ywelsch and dnhatn September 12, 2018 05:14

jasontedor added 2 commits September 12, 2018 01:29

Fix forbidden API invocation

88563e0

Add license

574deec

bleskes approved these changes Sep 12, 2018

jasontedor added 2 commits September 12, 2018 08:24

Rename method

56efb92

Code simplification

d9abe55

martijnvg approved these changes Sep 12, 2018

dnhatn approved these changes Sep 12, 2018

jasontedor merged commit 36ba3cd into elastic:master Sep 12, 2018

jasontedor deleted the global-checkpoint-listener-timeout branch September 12, 2018 15:10

jasontedor mentioned this pull request Sep 13, 2018

Use serializable exception in GCP listeners #33657

Merged

vladimirdolzhenko mentioned this pull request Sep 13, 2018

[CI] GlobalCheckpointListenersTests.testFailingListenerAfterTimeout fails #33665

Closed

colings86 added the >bug label Oct 25, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
