
Cannot find cuda_profiler_api.h when building cpu_adam #2622

Closed
hagope opened this issue Dec 16, 2022 · 11 comments
Labels: bug (Something isn't working), training

@hagope

hagope commented Dec 16, 2022

Describe the bug
When trying to pip install and build the cpu_adam op, the build fails because cuda_profiler_api.h is missing from the sources.

To Reproduce
Steps to reproduce the behavior:

  1. Pip install with Adam flag set: DS_BUILD_CPU_ADAM=1 pip install .
  2. An error stops the build indicating that cuda_profiler_api.h cannot be found.

Expected behavior
I believe the cuda_profiler_api.h should be added to the csrc/includes path? I was able to work around the problem with a symlink to my local cuda install: ln -s /usr/local/cuda-11.7/targets/x86_64-linux/include/cuda_profiler_api.h csrc/includes/cuda_profiler_api.h
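A generalized sketch of that workaround, as a small function (the include path and checkout layout are assumptions about my install; adjust to yours):

```shell
# link_profiler_header: symlink cuda_profiler_api.h from a CUDA toolkit
# include dir into a DeepSpeed checkout so the DS_BUILD_CPU_ADAM=1 build
# can find it. Both arguments depend on your local layout.
link_profiler_header() {
    cuda_include="$1"   # e.g. /usr/local/cuda-11.7/targets/x86_64-linux/include
    ds_root="$2"        # root of the DeepSpeed source tree
    ln -sf "$cuda_include/cuda_profiler_api.h" \
           "$ds_root/csrc/includes/cuda_profiler_api.h"
}
```

Usage from the DeepSpeed checkout: `link_profiler_header /usr/local/cuda-11.7/targets/x86_64-linux/include . && DS_BUILD_CPU_ADAM=1 pip install .`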

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [YES] ...... [OKAY] << AFTER using workaround above
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/omar/anaconda3/envs/diff2/lib/python3.9/site-packages/torch']
torch version .................... 1.13.1
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed install path ........... ['/home/omar/anaconda3/envs/diff2/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.7.7+2076bf23, 2076bf23, HEAD
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information):

  • OS: Ubuntu 20.04 on WSL2
@hagope hagope added bug Something isn't working training labels Dec 16, 2022
@tjruwase tjruwase assigned ShijieZZZZ and loadams and unassigned ShijieZZZZ Dec 19, 2022
@tjruwase
Contributor

@hagope, thanks for reporting this issue. Do you mind submitting a PR with your fix?

@loadams loadams self-assigned this Jan 3, 2023
@DakeZhang1998

Hi guys, I have encountered the same issue. Any update on this? Thanks!

@loadams
Contributor

loadams commented Jan 5, 2023

@DakeZhang1998 - I'm reproducing this now, but are you able to symlink to work around this in the meantime?

@DakeZhang1998

@DakeZhang1998 - I'm reproducing this now, but are you able to symlink to work around this in the meantime?

I am quite new to this. I am trying to use DeepSpeed on an Ubuntu server where I don't have sudo access, so I am managing CUDA in Anaconda. I installed PyTorch using the command from their official website: conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia. I am not sure where I can find the file cuda_profiler_api.h.

@abhay-agarwal

This is probably a simple case of your Python environment not containing the right PATH and LD_LIBRARY_PATH environment variables.

export PATH="/usr/local/cuda-11.7/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"

I think DeepSpeed could also do better at finding your CUDA install even when these variables are not set, since it could follow the /usr/local/cuda convention.
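A sketch of that idea: prefer an existing CUDA_HOME and only fall back to the conventional /usr/local/cuda symlink (the fallback is the common convention, not a guarantee):

```shell
# Use CUDA_HOME if the environment already defines it; otherwise fall
# back to the /usr/local/cuda symlink most toolkit installs create.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
```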

@DakeZhang1998

Thanks to the technical manager at my school, my issue was fixed by forcing Conda to install cuda-nvprof 11.7 instead of the v12 it defaults to.

@loadams
Contributor

loadams commented Jan 13, 2023

Thanks @abhay-agarwal, and glad that works for you, @DakeZhang1998. @hagope, can you check your PATH and LD_LIBRARY_PATH? I'm also taking a look at why DeepSpeed doesn't discover the CUDA install on its own.

@loadams
Contributor

loadams commented Jan 13, 2023

This actually appears to be an issue on the PyTorch side. Credit to @HeyangQin for this; links below to a PyTorch discussion and another DeepSpeed issue with more info. It seems we'll have to wait for the next PyTorch release, so I'm closing this issue for now.

https://discuss.pytorch.org/t/not-able-to-include-cusolverdn-h/169122

#2684

@loadams loadams closed this as completed Jan 13, 2023
@lamnguyenx

Thanks to the technical manager in my school, my issue is fixed by forcing Conda to install cuda-nvprof (11.7) instead of v12 by default.

This worked. In my case, when building FasterTransformer, I had to use nvcc --version to check the actual CUDA version (11.6) and then run conda install -c nvidia cuda-nvprof=11.6
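For reference, a sketch of pulling that version string out of nvcc --version automatically, so the matching package can be requested (the sed pattern assumes the standard "release X.Y" line nvcc prints):

```shell
# nvcc --version prints a line like:
#   Cuda compilation tools, release 11.6, V11.6.124
# cuda_release reads that output on stdin and extracts just "11.6".
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
# usage: conda install -c nvidia "cuda-nvprof=$(nvcc --version | cuda_release)"
```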

@zplizzi

zplizzi commented Feb 26, 2023

For people installing CUDA with apt, doing apt install cuda-nvprof-11-7 fixed this problem for me.

@minuenergy

This is probably a simple case of your Python environment not containing the right PATH and LD_LIBRARY_PATH environment variables.

export PATH="/usr/local/cuda-11.7/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda-11.7/lib64:$LD_LIBRARY_PATH"

I think deepspeed could also do better at finding your cuda install even when these variables are not specified given that they could follow the /usr/local/cuda convention.

I don't know what to put in those environment variables. When I run find / -name cuda, this is what I get:

[screenshot of find / -name cuda output]

In this situation, what should I put below?
export PATH="??"
export LD_LIBRARY_PATH="?"

In addition, which nvcc returns /opt/conda/bin/nvcc.
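Since which nvcc reports /opt/conda/bin/nvcc here, one sketch is to derive the CUDA root from the compiler's location instead of hard-coding a /usr/local path (the path below is the one reported in this comment; adjust if yours differs):

```shell
# Derive the CUDA root as two directories above the nvcc binary.
# /opt/conda/bin/nvcc is the path `which nvcc` reported above.
NVCC_PATH="/opt/conda/bin/nvcc"
CUDA_HOME="$(dirname "$(dirname "$NVCC_PATH")")"   # -> /opt/conda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"
```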


9 participants