[BUG] RapidsShuffleHeartbeatManager needs to remove executors that are stale #2589

Closed
abellina opened this issue Jun 4, 2021 · 1 comment · Fixed by #2977
Labels: bug (Something isn't working), shuffle (things that impact the shuffle plugin)

Comments


abellina commented Jun 4, 2021

We only add executors in RapidsShuffleHeartbeatManager but we never remove them.

We need to track the time of each executor's last heartbeat and, as more registrations arrive, check whether any executors are stale and exclude them from executor update messages. We don't need to tell already-registered executors about the loss of a peer, because UCX should be able to figure that out on its own.
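
A minimal sketch of the bookkeeping described above, written in Python for brevity rather than in the plugin's own language. Every name here (`HeartbeatRegistry`, `stale_after_s`, the injected `clock`) is hypothetical and chosen for illustration; this is not the actual `RapidsShuffleHeartbeatManager` API, and the staleness threshold is an assumed parameter.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ExecutorEntry:
    executor_id: str
    last_heartbeat: float  # seconds, as reported by the injected clock


class HeartbeatRegistry:
    """Hypothetical driver-side registry (assumption, not the plugin's class)."""

    def __init__(self, stale_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._stale_after_s = stale_after_s  # assumed staleness threshold
        self._clock = clock
        self._executors: Dict[str, ExecutorEntry] = {}

    def _remove_stale(self, now: float) -> None:
        # Collect ids first so the dict is not mutated while iterating.
        stale = [eid for eid, entry in self._executors.items()
                 if now - entry.last_heartbeat > self._stale_after_s]
        for eid in stale:
            del self._executors[eid]

    def register(self, executor_id: str) -> List[str]:
        """Prune stale executors, then return the live peers for the newcomer."""
        now = self._clock()
        self._remove_stale(now)
        peers = [eid for eid in self._executors if eid != executor_id]
        self._executors[executor_id] = ExecutorEntry(executor_id, now)
        return peers

    def heartbeat(self, executor_id: str) -> None:
        """Refresh the last-seen time of a known executor."""
        entry = self._executors.get(executor_id)
        if entry is not None:
            entry.last_heartbeat = self._clock()
```

Pruning happens only on registration, matching the "as more registrations arrive" wording, and nothing is pushed to surviving executors when a peer is dropped. A heartbeat from an executor that was already pruned is silently ignored here; a real implementation would have to decide whether to re-register it.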

We also need to update the logic that decides which executors count as "new" for a given peer. Currently it's based on registration order; I believe it will need to be time-based.
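
One way a time-based rule could look, sketched in Python under stated assumptions: each executor carries a registration timestamp, and each peer carries a watermark recording when it was last sent an update. The names `NewPeerTracker`, `register`, and `new_peers_for` are hypothetical, and the caller supplies `now` explicitly so the logic stays deterministic.

```python
from typing import Dict, List


class NewPeerTracker:
    """Hypothetical sketch: pick "new executors" by timestamp, not list position."""

    def __init__(self) -> None:
        self._registered_at: Dict[str, float] = {}
        self._last_update: Dict[str, float] = {}  # per-peer watermark

    def register(self, executor_id: str, now: float) -> None:
        self._registered_at[executor_id] = now
        self._last_update[executor_id] = now

    def new_peers_for(self, executor_id: str, now: float) -> List[str]:
        """Return peers registered since this executor's previous update."""
        since = self._last_update[executor_id]
        # ">=" re-sends a peer on a timestamp tie instead of missing it.
        fresh = [eid for eid, registered in self._registered_at.items()
                 if eid != executor_id and registered >= since]
        self._last_update[executor_id] = now  # advance the watermark
        return fresh
```

Because the decision depends only on timestamps, removing an entry from the middle of the collection does not shift what any peer considers new, which an index into a registration-ordered list would not tolerate. The `>=` comparison can report the same peer twice when timestamps tie, so the receiver is assumed to deduplicate.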

abellina added the bug, "? - Needs Triage", and shuffle labels Jun 4, 2021

abellina commented Jun 4, 2021

PR #2587 is related and improves things slightly for 21.06.

sameerz removed the "? - Needs Triage" label Jun 8, 2021
abellina self-assigned this Jul 14, 2021
sameerz added this to the July 19 - July 30 milestone Jul 18, 2021