
cudnn library is not accessible #2656

Open
yeounoh opened this issue Mar 5, 2022 · 2 comments


yeounoh commented Mar 5, 2022

Describe the current behavior
The Colab GPU runtime has cudnn8 pre-installed alongside cuda11.2, but the cudnn library sits outside $LD_LIBRARY_PATH (/usr/local/nvidia/lib:/usr/local/nvidia/lib64) and outside /usr/local/cuda, where a typical CUDA user would expect it. Instead it is placed in /usr/lib/x86_64-linux-gnu without being exposed to the Colab user:

# current env vars
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
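
For reference, two quick ways to confirm where the loader can actually see cudnn (run from a Colab shell cell):

# list the libcudnn entries the dynamic linker knows about
ldconfig -p | grep libcudnn
# or search the filesystem directly
find /usr -name 'libcudnn*' 2>/dev/null

Both point at /usr/lib/x86_64-linux-gnu on the affected runtime.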

This causes training a PyTorch model on CUDA to fail with a missing-cudnn-library error for some operators (e.g., convolution):

RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm: INTERNAL: All algorithms tried for %custom-call.1 = (f32[1,112,112,64]{2,1,3,0}, u8[0]{0}) custom-call(f32[1,229,229,3]{2,1,3,0} %pad, f32[7,7,3,64]{1,0,2,3} %copy.4), window={size=7x7 stride=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="jit(conv_general_dilated)/conv_general_dilated[\n batch_group_count=1\n dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2))\n feature_group_count=1\n lhs_dilation=(1, 1)\n lhs_shape=(1, 224, 224, 3)\n padding=((2, 3), (2, 3))\n precision=None\n preferred_element_type=None\n rhs_dilation=(1, 1)\n rhs_shape=(7, 7, 3, 64)\n window_strides=(2, 2)\n]" source_file="/media/node/Materials/anaconda3/envs/xmcgan/lib/python3.9/site-packages/flax/linen/linear.py" source_line=282}, backend_config="{"algorithm":"0","tensor_ops_enabled":false,"conv_result_scale":1,"activation_mode":"0","side_input_scale":0}" failed. Falling back to default algorithm.

Describe the expected behavior
Unless we intend users to examine the directory structure themselves to find the scattered CUDA-related library packages, we should export $LD_LIBRARY_PATH (or similar) to include both the cuda and cudnn library directories, so users can run CUDA workloads without re-installing or re-linking.

For instance,

# note: LD_LIBRARY_PATH entries are colon-separated on Linux, not semicolon
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/x86_64-linux-gnu"
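
Note that an export only helps processes launched after it; glibc captures LD_LIBRARY_PATH at process startup, so an already-running notebook kernel won't see the change. A session-level workaround sketch (assuming /usr/local/nvidia/lib64 exists and is writable on the VM, and that the links don't collide with existing files):

# expose the distro copy of cudnn through a directory that was already
# on LD_LIBRARY_PATH when the kernel started
ln -s /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/nvidia/lib64/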

What web browser are you using (Chrome, Firefox, Safari, etc.)?
Chrome

Additional context
I was testing torch 1.11 and torch-xla 1.11 before the release, and found this issue and #2649.

yeounoh added the bug label Mar 5, 2022

yeounoh commented Mar 10, 2022

I also noticed that the pre-installed cudnn version is lower than what we require:

Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.1.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

Would it be possible to bump up the version? Also, the GPU box runs CUDA 11.2, which is not officially supported per the cudnn 8.0.5 release notes; CUDA 11.2 is compatible with cudnn 8.1.0+.
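
For anyone reproducing this, the installed version can be read off the file names (and off the headers, if the dev package happens to be installed):

# the fully-resolved library name carries the version, e.g. libcudnn.so.8.0.5
ls -l /usr/lib/x86_64-linux-gnu/libcudnn.so*
# with the dev headers present, this prints CUDNN_MAJOR/MINOR/PATCHLEVEL
grep -A2 'define CUDNN_MAJOR' /usr/include/cudnn_version.h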


yeounoh commented Mar 10, 2022

It seems that /usr/local/cuda/ and /usr/lib/x86_64-linux-gnu hold different versions of cudnn (8.0.5 and 8.1.1); I will test with a clean-slate box.

EDIT: tried again on a clean GPU box; the pre-installed version appears to be 8.0.5.
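
For reference, one way to compare the copies in the two locations (a sketch; assumes the copy under /usr/local/cuda lives in lib64 and that both carry the full version in the resolved file name):

# symlink targets show which copy is 8.0.5 and which is 8.1.1
ls -l /usr/local/cuda/lib64/libcudnn.so* /usr/lib/x86_64-linux-gnu/libcudnn.so* 2>/dev/null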
