
[FEA] Support binding multi gpu per executor #1486

Open
coderyangyangyang opened this issue Jan 11, 2021 · 7 comments
Labels
feature request New feature or request

Comments

@coderyangyangyang

In our project, one task binds multiple GPU devices to parallelize the computation, so we are looking forward to this feature being supported in the future.

@coderyangyangyang coderyangyangyang added ? - Needs Triage Need team to review and classify feature request New feature or request labels Jan 11, 2021
@coderyangyangyang coderyangyangyang changed the title [FEA] Support binding multi gpu per excutor [FEA] Support binding multi gpu per executor Jan 11, 2021
@revans2
Collaborator

revans2 commented Jan 11, 2021

How exactly would you want the computing to be parallelized? Would you expect a single task to run across multiple GPUs at the same time? Or do you want a single executor to select the best free GPU to use when assigning a task? cuDF does not support either of these use cases yet. A single task using multiple GPUs would require cuDF to rewrite most of its kernels/algorithms to try to do this, which is not simple, but might be doable. If you want multiple tasks to run on different GPUs, we can sort of do that today, but you have to have multiple separate executors. Is there a reason you cannot ask Spark to launch each executor with only 1 GPU and proportionally fewer tasks?
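The suggestion above (one GPU per executor, proportionally fewer task slots) can be sketched as a submit-time configuration. This is illustrative only; the core count, discovery-script path, and jar name are assumptions, not from this thread.

```shell
# Sketch: each executor requests one whole GPU, and the task-slot count
# is scaled down so work per executor stays proportional to one GPU.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  your-app.jar
```

With these settings Spark schedules whole GPUs to executors, which is the granularity the resource managers discussed below actually support.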

@coderyangyangyang
Author

Thank you for your reply. In our project, we built an external search engine wrapper for Spark that binds multiple GPU devices to search large-scale vectors; each task calls the search method, which is a member method of the search engine instance. So, in our case, one executor binds multiple GPU devices to create a search engine instance, and we don't know whether spark-rapids would compete with our search engine for GPU resources.

@revans2
Collaborator

revans2 commented Jan 12, 2021

RAPIDS needs some kind of a GPU to run on. CUDA does have the ability to share GPUs between multiple processes, but context switching becomes a performance problem, so generally it should be avoided. Most resource managers, like Kubernetes and YARN, will hand out GPU resources, but not partial GPUs.

https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.

https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html

Spark does the same thing and will request/schedule whole GPUs for an executor, but it will allow you to split up the whole GPU between tasks in the executor.
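The "split the whole GPU between tasks" behavior comes from Spark's fractional task-level GPU request. A minimal sketch, with illustrative values only:

```shell
# Sketch: the executor owns one whole GPU; a fractional per-task request
# lets four concurrent tasks share it (1 GPU / 0.25 per task = 4 tasks).
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.executor.cores=4
```

Note the fraction only controls Spark's task scheduling; it does not partition the GPU's memory or compute between the tasks.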

In our project, we build an outer search engine wrapper for spark which bind multi gpu devices to search large scale vectors

I am still a little confused about what you want. Are you asking to have a single query that will use both your search engine code and the RAPIDS plugin at the same time? Or are you asking for RAPIDS queries to co-exist in the same multi-tenant cluster as search engine queries?

If you are asking for the first one (search engine and RAPIDS in the same query), then it is a really hard problem because of resource scheduling, like you mentioned. This is very similar to doing ML/DL training at the same time as using the RAPIDS plugin, and sadly we don't have a good solution for this yet, because the two really need to be able to coordinate with one another so that they can share the resources efficiently.
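One partial mitigation when another GPU library must co-exist with the plugin is to cap the fraction of GPU memory spark-rapids pools at startup, leaving headroom for the other code. A hedged sketch, assuming the memory-tuning config of roughly that release; config names change between versions, so check the current tuning guide:

```shell
# Illustrative only: limit the RAPIDS plugin's startup memory pool to half
# of the GPU, leaving the remainder for other on-GPU code.
--conf spark.rapids.memory.gpu.allocFraction=0.5
```

This bounds memory contention but does not coordinate compute between the two libraries.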

@coderyangyangyang
Author

Thank you very much. I think we should modify our GPU library to adapt to the spark-rapids rules.

@revans2
Collaborator

revans2 commented Jan 13, 2021

@coderyangyangyang please let me know if you need some help with this. The main thing you would need to do is use RMM for GPU memory allocation/deallocation. RMM is not really designed for multi-GPU use in a single process, so I don't know how well it will work. Be aware that a lot of the Java RAPIDS code also assumes that there will be a single GPU and tries to set the GPU automatically to avoid issues around new threads being created and auto-initializing to GPU-0. If you really have to have support for multiple GPUs in a single process, we can work with you to try to overcome some of those issues in RMM and the RAPIDS Java API.

@coderyangyangyang
Author

We could sacrifice some worker memory to increase the number of executors to match the number of GPU devices, and let the search engine bind only one GPU on each executor.

@sameerz
Collaborator

sameerz commented Jan 19, 2021

We will leave this open as a feature request, but we do not have plans to address this soon.

@sameerz sameerz closed this as completed Jan 19, 2021
@sameerz sameerz reopened this Jan 19, 2021
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 19, 2021
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
[auto-merge] bot-auto-merge-branch-23.10 to branch-23.12 [skip ci] [bot]