[FEA] Support binding multi gpu per executor #1486
How exactly would you want the computing to be parallelized? Would you expect to have a single task run across multiple GPUs at the same time? Or do you want a single executor to select the best free GPU to use when assigning a task? cuDF does not support either of these use cases yet. A single task using multiple GPUs would require cuDF to rewrite most of its kernels/algorithms, which is not simple, but might be doable. If you want multiple tasks to run on different GPUs, we can sort of do that today, but you have to have multiple separate executors. Is there a reason you cannot ask Spark to launch the executor with only 1 GPU and proportionally fewer tasks?
Thank you for your reply. In our project, we built an outer search-engine wrapper for Spark that binds multiple GPU devices to search large-scale vectors. Each task calls the search method, which is a member method of the search engine instance. So in our case, one executor binds multiple GPU devices to create a new search engine, and we don't know whether spark-rapids would compete for GPU resources with our search engine.
RAPIDS needs some kind of GPU to run on. CUDA does have the ability to share GPUs between multiple processes, but context switching becomes a performance problem and generally it should be avoided. Most resource managers, like Kubernetes and YARN, will hand out GPU resources, but not partial GPUs:
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
Spark does the same thing and will request/schedule whole GPUs for an executor, but will allow you to split up the whole GPU between tasks in the executor.
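As a concrete sketch of the "whole GPU per executor, shared between tasks" model described above, the Spark 3.x GPU-scheduling properties can be set roughly like this (the discovery-script path and core count are placeholders for your environment, not values from this thread):

```shell
# Sketch: request one whole GPU per executor, then let up to four
# concurrent tasks share it via a fractional task-level amount.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.executor.cores=4 \
  ...
```

With `spark.task.resource.gpu.amount=0.25` and 4 executor cores, up to four tasks can run concurrently against the executor's single GPU; Spark still only ever requests whole GPUs from the cluster manager.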
I am still a little confused about what you want. Are you asking to have a single query that will use both your search engine code and the RAPIDS plugin at the same time? Or are you asking for RAPIDS queries to co-exist in the same multi-tenant cluster as search engine queries? If you are asking for the first one (search engine and RAPIDS in the same query), then it is a really hard problem because of resource scheduling, like you mentioned. This is very similar to doing ML/DL training at the same time as using the RAPIDS plugin, and sadly we don't have a good solution for this yet, because the two really need to be able to coordinate with one another so that they can share the resources efficiently.
Thank you very much. I think we should modify our GPU library to adapt to the spark-rapids rules.
@coderyangyangyang please let me know if you need some help with this. The main thing you would need to do is to use RMM for GPU memory allocation/deallocation. RMM is not really designed for multi-GPU use in a single process, so I don't know how well it will work. Be aware that a lot of the Java RAPIDS code also assumes that there will be a single GPU and tries to set the GPU automatically to avoid issues around new threads being created and auto-initializing to GPU-0. If you really have to have support for multiple GPUs in a single process, we can work with you to try and overcome some of those issues in RMM and the RAPIDS Java API.
We could sacrifice some worker memory to increase the number of executors to match the number of GPU devices, and let the search engine bind only one GPU on each executor.
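The workaround above, one smaller executor per GPU, could be sketched with settings like the following (the executor count and memory figures are illustrative assumptions, not values from this thread):

```shell
# Illustrative only: trade per-executor memory for more executors,
# each bound to exactly one GPU.
spark-submit \
  --num-executors 8 \
  --conf spark.executor.memory=16g \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  ...
```

This way each executor (and the search engine instance inside it) sees exactly one GPU, which matches the single-GPU assumptions in RMM and the RAPIDS Java layer mentioned earlier.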
We will leave this open as a feature request, but we do not have plans to address this soon. |
In our project, one task binds multiple GPU devices to parallelize the computation, so we look forward to this feature being supported in the future.