Rework "scatter data if graph is too large" warning #8814

Closed
fjetter opened this issue Aug 2, 2024 · 0 comments · Fixed by #8815
fjetter commented Aug 2, 2024

We currently warn users about large graphs and suggest that they scatter their data. We should revise this warning. Ideally, it would point to a page in the documentation that discusses the problem.

Primarily, I would like to recommend a safer and conceptually simpler approach: wrapping the data in delayed or passing it through Client.submit instead of scattering it. The benefits are that code that does not use a dask client can also use this approach, and that it is more resilient.
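A minimal sketch of the delayed variant, assuming a small in-memory payload stands in for the large object and that the synchronous scheduler is acceptable for illustration. The names `checksum`, `data`, and `data_d` are hypothetical and not part of dask. The payload is wrapped in `delayed` once, so it becomes a single node in the graph that every downstream task references, and no client is required.

```python
import dask
from dask import delayed


def checksum(buf, modulus):
    # Hypothetical workload: reduce the payload to a small number.
    return sum(buf) % modulus


# Stand-in for a large object (about 1 MB of bytes).
data = bytes(range(256)) * 4_000

# Wrap the payload once; downstream tasks refer to this single graph node
# instead of each carrying its own copy of the data.
data_d = delayed(data)

tasks = [delayed(checksum)(data_d, m) for m in (7, 11, 13)]

# No dask client is needed; the synchronous scheduler runs in this process.
results = dask.compute(*tasks, scheduler="synchronous")
print(results)
```

The same `tasks` could be computed through a client if one exists; the code does not change.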

The shortcomings of delayed/submit compared to scatter are:

  • Direct-to-worker communication is not possible; the data always flows through the scheduler. Depending on the network topology, however, direct communication may not be possible anyway.
  • A copy of the data is stored on the scheduler. This is why the approach is more resilient, but it may also push the scheduler over its memory limit.
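The two routes can be put side by side in a rough sketch. It assumes an in-process cluster (`processes=False`) purely for illustration, and `identity` and `compare_scatter_and_submit` are hypothetical helper names. Passing the payload through an identity task is one possible reading of "use Client.submit instead", not necessarily the pattern the warning will end up recommending.

```python
from distributed import Client


def identity(x):
    # Hypothetical helper: returns its argument so the payload becomes a task result.
    return x


def compare_scatter_and_submit(payload):
    # Small in-process cluster, for illustration only.
    with Client(
        processes=False,
        n_workers=1,
        threads_per_worker=1,
        dashboard_address=None,
    ) as client:
        # Route 1: scatter places the payload in worker memory and returns a future.
        scattered = client.scatter(payload)
        via_scatter = client.submit(len, scattered).result()

        # Route 2: the payload is an argument of an ordinary task, so it is sent
        # along with the task definition via the scheduler.
        submitted = client.submit(identity, payload)
        via_submit = client.submit(len, submitted).result()

    return via_scatter, via_submit


if __name__ == "__main__":
    print(compare_scatter_and_submit(bytes(1_000_000)))
```

Both routes yield a future that downstream tasks can consume; they differ in how the payload reaches the workers and in whether the scheduler holds enough information to recreate it.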

If scheduler memory or the lack of direct communication is actually a problem for users, going the extra mile and using remote storage might even be necessary. Since this topic is not exactly trivial, it might be appropriate to let the warning point to a documentation page.
