
Cannot find cuda_profiler_api.h when building cpu_adam #2622

Closed
hagope opened this issue Dec 16, 2022 · 11 comments
Labels: bug (Something isn't working), training

@hagope

hagope commented Dec 16, 2022

Describe the bug
When trying to pip install and build the cpu_adam op, the build fails because cuda_profiler_api.h is missing from the sources.

To Reproduce
Steps to reproduce the behavior:

  1. Pip install with Adam flag set: DS_BUILD_CPU_ADAM=1 pip install .
  2. An error stops the build indicating that cuda_profiler_api.h cannot be found.

Expected behavior
I believe the cuda_profiler_api.h should be added to the csrc/includes path? I was able to work around the problem with a symlink to my local cuda install: ln -s /usr/local/cuda-11.7/targets/x86_64-linux/include/cuda_profiler_api.h csrc/includes/cuda_profiler_api.h
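A generalized sketch of that workaround, as a small function (the include path and checkout layout are assumptions about my install; adjust to yours):

```shell
# link_profiler_header: symlink cuda_profiler_api.h from a CUDA toolkit
# include dir into a DeepSpeed checkout so the DS_BUILD_CPU_ADAM=1 build
# can find it. Both arguments depend on your local layout.
link_profiler_header() {
    cuda_include="$1"   # e.g. /usr/local/cuda-11.7/targets/x86_64-linux/include
    ds_root="$2"        # root of the DeepSpeed source tree
    ln -sf "$cuda_include/cuda_profiler_api.h" \
           "$ds_root/csrc/includes/cuda_profiler_api.h"
}
```

Usage from the DeepSpeed checkout: `link_profiler_header /usr/local/cuda-11.7/targets/x86_64-linux/include . && DS_BUILD_CPU_ADAM=1 pip install .`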

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY] << AFTER using workaround above
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/omar/anaconda3/envs/diff2/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed install path ........... ['/home/omar/anaconda3/envs/diff2/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.7+2076bf23, 2076bf23, HEAD
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information):

  • OS: Ubuntu 20.04 on WSL2
@hagope hagope added bug Something isn't working training labels Dec 16, 2022
@tjruwase tjruwase assigned ShijieZZZZ and loadams and unassigned ShijieZZZZ Dec 19, 2022
@tjruwase
Contributor

@hagope, thanks for reporting this issue. Do you mind submitting a PR with your fix?

@loadams loadams self-assigned this Jan 3, 2023
@DakeZhang1998

Hi guys, I have encountered the same issue. Any update on this? Thanks!

@loadams
Contributor

loadams commented Jan 5, 2023

@DakeZhang1998 - I'm reproducing this now, but are you able to symlink to work around this in the meantime?

@DakeZhang1998

@DakeZhang1998 - I'm reproducing this now, but are you able to symlink to work around this in the meantime?

I am quite new to this. I am trying to use DeepSpeed on an Ubuntu server where I don't have sudo access, so I am managing CUDA in Anaconda. I installed PyTorch using the command from their official website: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia. I am not sure where I can find the file cuda_profiler_api.h.

@abhay-agarwal

This is probably a simple case of your Python environment not containing the right PATH and LD_LIBRARY_PATH environment variables.

export PATH="/usr/local/cuda-11.7/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"

I think DeepSpeed could also do better at finding your CUDA install even when these variables are not set, since it could follow the /usr/local/cuda convention.
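A sketch of that idea: prefer an existing CUDA_HOME and only fall back to the conventional /usr/local/cuda symlink (the fallback is the common convention, not a guarantee):

```shell
# Use CUDA_HOME if the environment already defines it; otherwise fall
# back to the /usr/local/cuda symlink most toolkit installs create.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
```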

@DakeZhang1998

Thanks to the technical manager at my school, my issue was fixed by forcing Conda to install cuda-nvprof 11.7 instead of the v12 it defaults to.

@loadams
Contributor

loadams commented Jan 13, 2023

Thanks @abhay-agarwal, and glad that works for you, @DakeZhang1998. @hagope, can you check your PATH and LD_LIBRARY_PATH? I'm also taking a look at why DeepSpeed doesn't discover the CUDA install on its own.

@loadams
Contributor

loadams commented Jan 13, 2023

This actually appears to be an issue on the PyTorch side. Credit to @HeyangQin for this; links below to a PyTorch discussion and another DeepSpeed issue with more info. It seems we'll have to wait for the next PyTorch release, so I'm closing this issue for now.

https://discuss.pytorch.org/t/not-able-to-include-cusolverdn-h/169122

#2684

@loadams loadams closed this as completed Jan 13, 2023
@lamnguyenx

Thanks to the technical manager in my school, my issue is fixed by forcing Conda to install cuda-nvprof (11.7) instead of v12 by default.

This worked. In my case, when building FasterTransformer, I had to use nvcc --version to check the actual CUDA version (11.6) and then run conda install -c nvidia cuda-nvprof=11.6
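For reference, a sketch of pulling that version string out of nvcc --version automatically, so the matching package can be requested (the sed pattern assumes the standard "release X.Y" line nvcc prints):

```shell
# nvcc --version prints a line like:
#   Cuda compilation tools, release 11.6, V11.6.124
# cuda_release reads that output on stdin and extracts just "11.6".
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
# usage: conda install -c nvidia "cuda-nvprof=$(nvcc --version | cuda_release)"
```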

@zplizzi

zplizzi commented Feb 26, 2023

For people installing CUDA with apt, doing apt install cuda-nvprof-11-7 fixed this problem for me.

@minuenergy

This is probably a simple case of your Python environment not containing the right PATH and LD_LIBRARY_PATH environment variables.

export PATH="/usr/local/cuda-11.7/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"

I think deepspeed could also do better at finding your cuda install even when these variables are not specified given that they could follow the /usr/local/cuda convention.

I don't know what to put in those environment variables. When I run find / -name cuda, this is what I get:

[screenshot of find / -name cuda output]

In this situation, what should I put below?
export PATH="??"
export LD_LIBRARY_PATH="?"

In addition, which nvcc returns /opt/conda/bin/nvcc.
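Since which nvcc reports /opt/conda/bin/nvcc here, one sketch is to derive the CUDA root from the compiler's location instead of hard-coding a /usr/local path (the path below is the one reported in this comment; adjust if yours differs):

```shell
# Derive the CUDA root as two directories above the nvcc binary.
# /opt/conda/bin/nvcc is the path `which nvcc` reported above.
NVCC_PATH="/opt/conda/bin/nvcc"
CUDA_HOME="$(dirname "$(dirname "$NVCC_PATH")")"   # -> /opt/conda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
```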


9 participants