
cudnn library is not accessible #2656

Open
yeounoh opened this issue Mar 5, 2022 · 2 comments


yeounoh commented Mar 5, 2022

Describe the current behavior
The Colab GPU runtime has cudnn8 pre-installed alongside cuda11.2, but the cudnn library sits outside $LD_LIBRARY_PATH (/usr/local/nvidia/lib:/usr/local/nvidia/lib64) and outside /usr/local/cuda, where a typical CUDA user would expect it. Instead it is placed in /usr/lib/x86_64-linux-gnu without being exposed to the Colab user:

# current env vars
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PATH=/opt/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/tools/node/bin:/tools/google-cloud-sdk/bin
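
For reference, two quick ways to confirm where the loader can actually see cudnn (run from a Colab shell cell):

# list the libcudnn entries the dynamic linker knows about
ldconfig -p | grep libcudnn
# or search the filesystem directly
find /usr -name 'libcudnn*' 2>/dev/null

Both point at /usr/lib/x86_64-linux-gnu on the affected runtime.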

This causes training a PyTorch model on CUDA to fail with a missing-cudnn-library error for some operators (e.g., convolution):

RuntimeError: UNKNOWN: Failed to determine best cudnn convolution algorithm: INTERNAL: All algorithms tried for %custom-call.1 = (f32[1,112,112,64]{2,1,3,0}, u8[0]{0}) custom-call(f32[1,229,229,3]{2,1,3,0} %pad, f32[7,7,3,64]{1,0,2,3} %copy.4), window={size=7x7 stride=2x2}, dim_labels=b01f_01io->b01f, custom_call_target="__cudnn$convForward", metadata={op_type="conv_general_dilated" op_name="jit(conv_general_dilated)/conv_general_dilated[\n batch_group_count=1\n dimension_numbers=ConvDimensionNumbers(lhs_spec=(0, 3, 1, 2), rhs_spec=(3, 2, 0, 1), out_spec=(0, 3, 1, 2))\n feature_group_count=1\n lhs_dilation=(1, 1)\n lhs_shape=(1, 224, 224, 3)\n padding=((2, 3), (2, 3))\n precision=None\n preferred_element_type=None\n rhs_dilation=(1, 1)\n rhs_shape=(7, 7, 3, 64)\n window_strides=(2, 2)\n]" source_file="/media/node/Materials/anaconda3/envs/xmcgan/lib/python3.9/site-packages/flax/linen/linear.py" source_line=282}, backend_config="{"algorithm":"0","tensor_ops_enabled":false,"conv_result_scale":1,"activation_mode":"0","side_input_scale":0}" failed. Falling back to default algorithm.

Describe the expected behavior
Unless we intend users to examine the directory structure themselves to find the scattered CUDA-related library packages, we should export $LD_LIBRARY_PATH (or similar) to include both the cuda and cudnn library directories, so users can run CUDA workloads without re-installing or re-linking.

For instance,

# note: LD_LIBRARY_PATH entries are colon-separated on Linux, not semicolon
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/lib/x86_64-linux-gnu"
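
Note that an export only helps processes launched after it; glibc captures LD_LIBRARY_PATH at process startup, so an already-running notebook kernel won't see the change. A session-level workaround sketch (assuming /usr/local/nvidia/lib64 exists and is writable on the VM, and that the links don't collide with existing files):

# expose the distro copy of cudnn through a directory that was already
# on LD_LIBRARY_PATH when the kernel started
ln -s /usr/lib/x86_64-linux-gnu/libcudnn* /usr/local/nvidia/lib64/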

What web browser are you using (Chrome, Firefox, Safari, etc.)?
Chrome

Additional context
I was testing torch 1.11 and torch-xla 1.11 before the release, and found this issue and #2649.

yeounoh added the bug label Mar 5, 2022

yeounoh commented Mar 10, 2022

I also noticed that the pre-installed cudnn version is lower than what we require:

Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.1.  CuDNN library needs to have matching major version and equal or higher minor version. If using a binary install, upgrade your CuDNN library.  If building from sources, make sure the library loaded at runtime is compatible with the version specified during compile configuration.

Would it be possible to bump up the version? Also, the GPU box runs CUDA 11.2, which is not officially supported per the cudnn 8.0.5 release notes; CUDA 11.2 is compatible with cudnn 8.1.0+.
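
For anyone reproducing this, the installed version can be read off the file names (and off the headers, if the dev package happens to be installed):

# the fully-resolved library name carries the version, e.g. libcudnn.so.8.0.5
ls -l /usr/lib/x86_64-linux-gnu/libcudnn.so*
# with the dev headers present, this prints CUDNN_MAJOR/MINOR/PATCHLEVEL
grep -A2 'define CUDNN_MAJOR' /usr/include/cudnn_version.h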


yeounoh commented Mar 10, 2022

It seems that /usr/local/cuda/ and /usr/lib/x86_64-linux-gnu hold different versions of cudnn (8.0.5 and 8.1.1); I will test with a clean-slate box.

EDIT: tried again on a clean GPU box; the pre-installed version appears to be 8.0.5.
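
For reference, one way to compare the copies in the two locations (a sketch; assumes the copy under /usr/local/cuda lives in lib64 and that both carry the full version in the resolved file name):

# symlink targets show which copy is 8.0.5 and which is 8.1.1
ls -l /usr/local/cuda/lib64/libcudnn.so* /usr/lib/x86_64-linux-gnu/libcudnn.so* 2>/dev/null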
