We have several issues that could be improved if UCX started early:
1. IPC mem handles are slow to open, especially on systems with many GPUs.
2. nv_peer_mem registration takes time.
3. UCX issues happen during the query, rather than upfront.
4. We have seen pauses when handshaking UCX worker addresses. With all else going on in the application, the late start can complicate debugging these cases.
For issues (3, 4), a configuration or network problem is currently detected only after the query has partially executed. That alone should be enough motivation for this issue.
Early UCX initialization should also include sending a ping message between peers. This would allow the IPC and GPUDirect RDMA mappings to be set up early, before any query runs.
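As a rough illustration of the early ping idea, here is a minimal sketch in Python. It does not call UCX: `warm_up_peers` and its parameters are hypothetical names, and the `ping` callable is an assumed stand-in for whatever round trip the real transport performs (the step that would trigger the IPC and GPUDirect RDMA mappings). The point is only the shape: contact every peer at startup and surface failures before a query begins.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def warm_up_peers(
    peers: Iterable[str],
    ping: Callable[[str], None],
    timeout_s: float = 5.0,
) -> Dict[str, Optional[str]]:
    """Ping every peer once at startup.

    Returns a map from peer to None (ping succeeded) or an error string.
    `ping` is a hypothetical placeholder for the real transport round trip.
    """
    peers = list(peers)
    results: Dict[str, Optional[str]] = {}
    if not peers:
        return results

    # One thread per peer so a slow peer does not delay contacting the others.
    pool = ThreadPoolExecutor(max_workers=len(peers))
    try:
        futures = {peer: pool.submit(ping, peer) for peer in peers}
        for peer, future in futures.items():
            try:
                # Each individual wait is bounded by timeout_s.
                future.result(timeout=timeout_s)
                results[peer] = None
            except Exception as exc:
                # Record the failure instead of raising, so every peer is reported.
                results[peer] = f"{type(exc).__name__}: {exc}"
    finally:
        # Do not block startup on pings that are still outstanding.
        pool.shutdown(wait=False)
    return results
```

A caller could run this once during startup and fail fast, or log a warning, if any value in the returned map is not `None`, rather than discovering the problem partway through a query.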