
UCP/RNDV/CUDA: RNDV protocol improvements for CUDA #5473

Merged
merged 4 commits into from
Aug 13, 2020

Conversation

bureddy
Contributor

@bureddy bureddy commented Jul 25, 2020

What

Improve out-of-the-box behavior for CUDA transfers:

  • Try the GET protocol for CUDA by default, and fall back to the pipeline protocol if the GET protocol fails
  • Add an option to tune the sender-side pipelining scheme

Why?

The current default RNDV scheme for CUDA transfers is sender-side pipelining, which overcomes the GPUDirect RDMA performance limitation on architectures where GPUs are connected directly to a CPU socket.
However, most current GPU system architectures place the GPU and the NIC behind the same PCIe switch (e.g. all DGX systems and various vendor GPU server architectures). On these architectures we currently recommend that users set UCX_RNDV_SCHEME=get_zcopy explicitly in order to get optimal GPUDirect RDMA performance. We also lack an optimal fallback when the GET protocol fails (e.g. GPUs connected to different CPU sockets, as on DGX-1).

How?

  • Support a get-zcopy fallback: pre-compute GET zcopy lanes from the rkey, and fall back to the optimal scheme if no suitable lanes are found. Most of the lane pre-computation code is taken from RNDV/GET: try to push request to pending yosefe/ucx#16 (@hoopoepg)
  • Add the option UCX_RNDV_PIPELINE_SEND_THRESH to control the sender-side pipelining scheme if needed
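A usage sketch of the tuning knobs above (the application name and the threshold value are illustrative, not recommendations):

```shell
# Pre-PR workaround: force the GET protocol explicitly on systems where
# the GPU and NIC share a PCIe switch (e.g. DGX):
UCX_RNDV_SCHEME=get_zcopy ./app

# With this PR, GET is tried by default and falls back if it cannot be
# used; the new knob below controls when sender-side pipelining engages:
UCX_RNDV_PIPELINE_SEND_THRESH=512k ./app
```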

@Akshay-Venkatesh @yosefe

@azure-pipelines

Commenter does not have sufficient privileges for PR 5473 in repo openucx/ucx

@bureddy
Contributor Author

bureddy commented Jul 25, 2020

bot:pipe:retest

@bureddy
Contributor Author

bureddy commented Jul 26, 2020

bot:pipe:retest

1 similar comment
@bureddy
Contributor Author

bureddy commented Jul 27, 2020

bot:pipe:retest

@bureddy
Contributor Author

bureddy commented Jul 27, 2020

@hoopoepg @yosefe @brminich please review

ucp_rkey_h rkey; /* key for remote send buffer */
ucp_lane_map_t lanes_map_avail; /* used lanes map */
ucp_lane_map_t lanes_map_all; /* actual lanes map */
uint8_t lanes_count; /* actual lanes map */
Contributor

lanes count (in the comment)

@@ -66,12 +66,14 @@ void ucp_rndv_receive(ucp_worker_h worker, ucp_request_t *rreq,
const ucp_rndv_rts_hdr_t *rndv_rts_hdr);

static UCS_F_ALWAYS_INLINE int ucp_rndv_is_get_zcopy(ucs_memory_type_t mem_type,
size_t length,
Contributor

maybe pass just 2 params to this func: request and context?
The first 2 params are taken from the request, the others from the context.

src/ucp/proto/rndv.c (resolved)
freq->send.rndv_get.rreq = sreq;
ucp_rndv_req_init_get_zcopy_lane_map(freq);
Contributor

do you really need to re-calculate it for every fragment?

Contributor Author

I think so, because each fragment is tracked in a separate rndv frag request.

Contributor

can't we cache it somewhere, because it is supposed to be the same for all fragments?

Contributor Author

fixed. can you check it?

Contributor Author

@brminich can you please check?

@yosefe
Contributor

yosefe commented Jul 30, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bureddy
Contributor Author

bureddy commented Jul 31, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@bureddy
Contributor Author

bureddy commented Aug 1, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@brminich brminich left a comment

@hoopoepg, plz review


ucp_request_recv_buffer_reg(rreq, ep_config->key.rma_bw_md_map,
rndv_rts_hdr->size);

if ((rndv_mode == UCP_RNDV_MODE_PUT_ZCOPY) ||
Contributor

Looks like now we do not register the RX buffer for the PUT protocol in the case of UCP_RNDV_MODE_PUT_AUTO with non-CUDA memory.

Contributor Author

@bureddy bureddy Aug 3, 2020

Correct. This PR does not change the current behavior for HOST memory. Today, GET is used by default for HOST memory; it sends RTR without registering the RX buffer in order to switch to active-message rndv if it fails to do GET (https://github.com/openucx/ucx/blob/master/src/ucp/proto/rndv.c#L473).

Contributor Author

@brminich are we ok here?

Contributor

ok

@bureddy
Contributor Author

bureddy commented Aug 4, 2020

@Akshay-Venkatesh @yosefe can you please review?

@bureddy
Contributor Author

bureddy commented Aug 11, 2020

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

src/ucp/core/ucp_context.c (outdated, resolved)
src/ucp/proto/rndv.c (outdated, resolved)

if ((lane_bw/max_lane_bw) <
(1. / context->config.ext.multi_lane_max_ratio)) {
lane_map &= ~UCS_BIT(lane_idx);
Contributor

Seems it can make rndv_req->send.rndv_get.rkey_index[i] invalid, because some lanes would be removed from the middle.

src/ucp/proto/rndv.c (3 outdated threads, resolved)
@bureddy
Contributor Author

bureddy commented Aug 13, 2020

bot:pipe:retest

@yosefe yosefe merged commit 5d914ec into openucx:master Aug 13, 2020