[FEA] Support binding multi gpu per executor #1486
How exactly would you want the computing to be parallelized? Would you expect to have a single task run across multiple GPUs at the same time? Or do you want a single executor to select the best free GPU to use when assigning a task? cuDF does not support either of these use cases yet. A single task using multiple GPUs would require cuDF to rewrite most of its kernels/algorithms, which is not simple, but might be doable. If you want multiple tasks to run on different GPUs, we can sort of do that today, but you have to have multiple separate executors. Is there a reason you cannot ask Spark to launch the executor with only 1 GPU and proportionally fewer tasks?
Thank you for your reply. In our project, we built an outer search-engine wrapper for Spark that binds multiple GPU devices to search large-scale vectors. Each task calls the search method, which is a member method of the search engine instance. So in our case, one executor binds multiple GPU devices to create a new search engine, and we don't know whether spark-rapids would compete for GPU resources with our search engine.
RAPIDS needs some kind of GPU to run on. CUDA does have the ability to share GPUs between multiple processes, but context switching becomes a performance problem and generally it should be avoided. Most resource managers, like Kubernetes and YARN, will hand out GPU resources, but not partial GPUs:
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
Spark does the same thing and will request/schedule whole GPUs for an executor, but will allow you to split up the whole GPU between tasks in the executor.
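As a concrete sketch of the "whole GPU per executor, shared between tasks" model described above, the Spark 3.x GPU-scheduling properties can be set roughly like this (the discovery-script path and core count are placeholders for your environment, not values from this thread):

```shell
# Sketch: request one whole GPU per executor, then let up to four
# concurrent tasks share it via a fractional task-level amount.
spark-submit \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.executor.resource.gpu.discoveryScript=/opt/spark/getGpusResources.sh \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.executor.cores=4 \
  ...
```

With `spark.task.resource.gpu.amount=0.25` and 4 executor cores, up to four tasks can run concurrently against the executor's single GPU; Spark still only ever requests whole GPUs from the cluster manager.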
I am still a little confused about what you want. Are you asking to have a single query that will use both your search engine code and the RAPIDS plugin at the same time? Or are you asking for RAPIDS queries to co-exist in the same multi-tenant cluster as search engine queries? If you are asking for the first one (search engine and RAPIDS in the same query), then it is a really hard problem because of resource scheduling, like you mentioned. This is very similar to doing ML/DL training at the same time as using the RAPIDS plugin, and sadly we don't have a good solution for this yet, because the two really need to be able to coordinate with one another so that they can share the resources efficiently.
Thank you very much. I think we should modify our GPU library to adapt to the spark-rapids rules.
@coderyangyangyang please let me know if you need some help with this. The main thing you would need to do is to use RMM for GPU memory allocation/deallocation. RMM is not really designed for multi-GPU use in a single process, so I don't know how well it will work. Be aware that a lot of the Java RAPIDS code also assumes that there will be a single GPU and tries to set the GPU automatically to avoid issues around new threads being created and auto-initializing to GPU-0. If you really have to have support for multiple GPUs in a single process, we can work with you to try and overcome some of those issues in RMM and the RAPIDS Java API.
We could sacrifice some worker memory to increase the number of executors to match the number of GPU devices, and let the search engine bind only one GPU on each executor.
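The workaround above, one smaller executor per GPU, could be sketched with settings like the following (the executor count and memory figures are illustrative assumptions, not values from this thread):

```shell
# Illustrative only: trade per-executor memory for more executors,
# each bound to exactly one GPU.
spark-submit \
  --num-executors 8 \
  --conf spark.executor.memory=16g \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  ...
```

This way each executor (and the search engine instance inside it) sees exactly one GPU, which matches the single-GPU assumptions in RMM and the RAPIDS Java layer mentioned earlier.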
We will leave this open as a feature request, but we do not have plans to address this soon. |
In our project, one task binds multiple GPU devices to parallelize the computation, so we look forward to this feature being supported in the future.