Enable global checkpoint listeners to timeout #33620
In cross-cluster replication, we will use global checkpoint listeners to long poll for updates to a shard. However, we do not want these polls to wait indefinitely, as it could be difficult to discern whether the listener is still waiting for updates or something has gone horribly wrong and cross-cluster replication is stuck. Instead, we want these listeners to time out after some period (for example, one minute) so that they are notified and we can update status on the following side that cross-cluster replication is still active. After this, we will immediately enter back into a poll mode. To do this, we need the ability to associate a timeout with a global checkpoint listener. This commit adds this capability.
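The idea described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `TimedCheckpointListeners`, its `Listener` interface, and the method names are hypothetical stand-ins, and the only assumption taken from the change itself is that each pending listener is paired with a `ScheduledFuture` for its timeout.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical, simplified stand-in for a listener registry; not the Elasticsearch class. */
final class TimedCheckpointListeners {

    /** Receives either a checkpoint with a null exception, or -1 with the failure. */
    interface Listener {
        void accept(long globalCheckpoint, Exception e);
    }

    private final ScheduledExecutorService scheduler;
    // guarded by this: each pending listener is paired with its timeout task
    private final Map<Listener, ScheduledFuture<?>> listeners = new LinkedHashMap<>();
    // guarded by this
    private long lastKnownCheckpoint = -1;

    TimedCheckpointListeners(ScheduledExecutorService scheduler) {
        this.scheduler = scheduler;
    }

    /** Notifies at once if the checkpoint is already ahead of the caller; otherwise waits up to the timeout. */
    synchronized void add(long currentCheckpoint, Listener listener, long timeout, TimeUnit unit) {
        if (lastKnownCheckpoint > currentCheckpoint) {
            listener.accept(lastKnownCheckpoint, null);
            return;
        }
        ScheduledFuture<?> future = scheduler.schedule(() -> {
            boolean removed;
            synchronized (this) {
                // only time out listeners that have not already been notified
                removed = listeners.remove(listener) != null;
            }
            if (removed) {
                listener.accept(-1, new TimeoutException("timed out after " + timeout + " " + unit));
            }
        }, timeout, unit);
        listeners.put(listener, future);
    }

    /** Delivers the update to every pending listener and cancels its timeout task. */
    void globalCheckpointUpdated(long globalCheckpoint) {
        Map<Listener, ScheduledFuture<?>> toNotify;
        synchronized (this) {
            lastKnownCheckpoint = globalCheckpoint;
            toNotify = new LinkedHashMap<>(listeners);
            listeners.clear();
        }
        toNotify.forEach((listener, future) -> {
            future.cancel(false);
            listener.accept(globalCheckpoint, null);
        });
    }
}
```

In this sketch, a caller that receives a `TimeoutException` would record that it is still alive and then call `add` again, which is the "enter back into a poll mode" step from the description.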
Pinging @elastic/es-distributed
LGTM. Left some nits to accept or reject.
server/src/main/java/org/elasticsearch/index/shard/GlobalCheckpointListeners.java
    logger.warn("error notifying global checkpoint listener of closed shard", caught);
} else {
    assert e instanceof TimeoutException : e;
    logger.warn("error notifying global checkpoint listener of timeout", caught);
nit: why not just always log the exception (to serve as an indicator of what happened) and assert it's either IndexShardCloseException or TimeoutException?
I prefer it as is since it's more explicit, and since your approach only saves two lines of code I'll stick with mine. 😇
Thanks @bleskes, will merge on green.
LGTM
LGTM.
}

// guarded by this
private boolean closed;
- private volatile List<GlobalCheckpointListener> listeners;
+ private volatile Map<GlobalCheckpointListener, ScheduledFuture<?>> listeners;
afaics, all accesses to listeners are under lock, but I may miss something.
In cross-cluster replication, we will use global checkpoint listeners to long poll for updates to a shard. However, we do not want these polls to wait indefinitely, as it could be difficult to discern whether the listener is still waiting for updates or something has gone horribly wrong and cross-cluster replication is stuck. Instead, we want these listeners to time out after some period (for example, one minute) so that they are notified and we can update status on the following side that cross-cluster replication is still active. After this, we will immediately enter back into a poll mode.
To do this, we need the ability to associate a timeout with a global checkpoint listener. This commit adds this capability.
Relates #32696