[FEA] Initialize UCX early #1558

Closed
abellina opened this issue Jan 20, 2021 · 0 comments · Fixed by #1891
Labels: performance (A performance related task/issue), shuffle (things that impact the shuffle plugin)

Comments

abellina (Collaborator) commented Jan 20, 2021

We have several issues that could be improved if UCX started early:

  1. IPC mem handles are slow to open, especially on systems with many GPUs
  2. nv_peer_mem registration takes time
  3. UCX issues surface during the query rather than up front.
  4. We have seen pauses when handshaking UCX worker addresses. With everything else going on in the application, the late start makes these cases harder to debug.

For issues 3 and 4: if there is a configuration or network problem, we currently detect it only after the query has partially executed. That alone should be enough motivation for this change.

Early UCX initialization should also include sending a ping message between peers, so that the IPC and GPUDirect RDMA mappings are set up early rather than during the query.
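As a rough illustration of the ping idea (not the plugin's actual implementation), here is a minimal Python sketch that contacts every known peer once at startup and reports failures before any query runs. The names `warm_up_peers`, `fail_fast`, and the `ping` callable are hypothetical; `ping` stands in for whatever transport call would establish the connection and trigger the mappings.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def warm_up_peers(
    peers: Iterable[str],
    ping: Callable[[str], None],  # hypothetical: raises if the peer is unreachable
    max_workers: int = 8,
) -> Dict[str, Optional[BaseException]]:
    """Ping every peer once at startup; map each peer to its error, or None."""
    peer_list = list(peers)
    results: Dict[str, Optional[BaseException]] = {}
    if not peer_list:
        return results
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all pings first so slow peers are contacted concurrently.
        futures = {peer: pool.submit(ping, peer) for peer in peer_list}
        for peer, future in futures.items():
            # exception() waits for completion and returns the raised error, if any.
            results[peer] = future.exception()
    return results


def fail_fast(results: Dict[str, Optional[BaseException]]) -> None:
    """Raise at startup if any peer could not be reached."""
    failed = {peer: err for peer, err in results.items() if err is not None}
    if failed:
        details = ", ".join(f"{peer}: {err!r}" for peer, err in failed.items())
        raise RuntimeError(f"early peer warm-up failed for {details}")
```

Calling `fail_fast(warm_up_peers(peers, ping))` during startup would surface a configuration or network problem before the query begins, which is the behavior asked for in items 3 and 4.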

abellina added the feature request, ? - Needs Triage, and shuffle labels Jan 20, 2021
sameerz removed the ? - Needs Triage label Jan 26, 2021
abellina self-assigned this Mar 3, 2021
abellina added this to the Mar 1 - Mar 12 milestone Mar 3, 2021
sameerz added the performance label and removed the feature request label Mar 3, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#1558)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>