Fix a segmentation fault due to unloaded libcufile #158

Merged: 2 commits into rapidsai:branch-21.12 on Nov 20, 2021


@gigony gigony commented Nov 20, 2021

Do not unload the cufile (GDS) library because libcufile registers a cleanup function with atexit() and unloading the library will cause a segfault (calling the cleanup function that doesn't exist anymore).

It turns out that CUDA 11.5 bundles the GDS library (libcufile), which is available in GPUCI's build/test container images (such as gpuci/rapidsai:21.12-cuda11.5-devel-ubuntu18.04-py3.7).
cuCIM dynamically loads the libcufile.so shared library and unloads it when a global static variable in cuCIM is destroyed.
Since libcufile's cleanup function (through an atexit_thread handler) is registered after libcufile is loaded, a segmentation fault occurs at exit if libcufile has been explicitly unloaded through dlclose().
(See #153)

Searching for the keywords "atexit in dynamically loaded shared library" turns up discussions of this problem (the actual root cause is the use of a thread_local variable in libcufile.so).

Using the destructor attribute might fix the issue on the GDS (cufile) side.
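For readers unfamiliar with the attribute, here is a minimal sketch of the GCC/Clang constructor and destructor function attributes. According to dlopen(3), constructor functions in a shared object run before dlopen() returns and destructor functions run before dlclose() returns. All names below are hypothetical and are not taken from libcufile; whether this approach would actually resolve the thread_local problem in libcufile is not established in this thread.

```cpp
#include <cstdio>

namespace {
// Hypothetical library-wide state; stands in for whatever a real library sets up.
bool g_library_ready = false;
}  // namespace

// Runs when the containing object is loaded. For a shared library this is
// before dlopen() returns; for an executable it is before main().
__attribute__((constructor)) static void example_library_init() {
    g_library_ready = true;
}

// Runs when the containing object is unloaded (before dlclose() returns)
// or at normal process exit, so the function is still mapped when it is called.
__attribute__((destructor)) static void example_library_fini() {
    g_library_ready = false;
    std::fputs("example_library_fini: cleanup done\n", stderr);
}

// Hypothetical accessor so callers can observe the state.
bool example_library_ready() { return g_library_ready; }
```

Because the cleanup is tied to the lifetime of the object that contains it, it cannot be invoked after that object has been unmapped.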

This patch leaves the libcufile library loaded by not calling ::dlclose(library_handle) to unload libcufile.so.
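The idea can be sketched as a small wrapper whose destructor deliberately skips ::dlclose(). This is an illustration only: the class name SharedLibrary and its members are hypothetical and do not reproduce cuCIM's actual loader code.

```cpp
#include <dlfcn.h>

// Hypothetical wrapper (not cuCIM's real class): loads a shared library
// and intentionally never unloads it.
class SharedLibrary {
public:
    // A null path asks dlopen() for a handle to the main program.
    explicit SharedLibrary(const char* path)
        : handle_(::dlopen(path, RTLD_NOW | RTLD_LOCAL)) {}

    ~SharedLibrary() {
        // Intentionally no ::dlclose(handle_): code in the library may still
        // be called at exit time, so the mapping is left in place and the
        // operating system reclaims it when the process terminates.
    }

    SharedLibrary(const SharedLibrary&) = delete;
    SharedLibrary& operator=(const SharedLibrary&) = delete;

    bool loaded() const { return handle_ != nullptr; }

    // Looks up a symbol; returns nullptr if the library failed to load
    // or the symbol is not found.
    void* symbol(const char* name) const {
        return handle_ != nullptr ? ::dlsym(handle_, name) : nullptr;
    }

private:
    void* handle_;
};
```

With this shape, destroying a global static SharedLibrary at exit has no effect on the loaded library, which is the behavior the patch aims for.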

Update (2021-11-20): I couldn't find atexit() used in libcufile itself (though I can see an atexit() call in an executable, fio), but it seems that some function is registered and called at exit time, so we have no choice but to leave the dynamically loaded library loaded.
Update (2021-11-23): libcufile.so has used a thread_local variable since v1.1, which makes the shared library impossible to unload safely.
For this reason, this patch is the correct fix until libcufile is updated to make unloading possible.

Related information:

Do not unload the cufile (GDS) library as libcufile
registers a cleanup function with atexit() and unloading
the library will cause a segfault (calling the cleanup
function that doesn't exist anymore).

- Leave the libcufile library loaded

@gigony added the "bug" (Something isn't working) label on Nov 20, 2021
@gigony added this to the v21.12.00 milestone on Nov 20, 2021
@gigony self-assigned this on Nov 20, 2021
@jakirkham added the "non-breaking" (Introduces a non-breaking change) label on Nov 20, 2021


gigony commented Nov 20, 2021

I need to take some time to dig into libcufile in detail to determine whether this is actually due to an atexit() call, or whether cuCIM's destructor (of a static variable) is calling a function in the already-unloaded libcufile, because I can see only one atexit() call, and it is in executable code rather than in the library.

@gigony changed the title from "Fix a segmentation fault due to libcufile's atexit() call" to "WIP: Fix a segmentation fault due to libcufile's atexit() call" on Nov 20, 2021
@gigony changed the title from "WIP: Fix a segmentation fault due to libcufile's atexit() call" to "Fix a segmentation fault due to libcufile's atexit() call" on Nov 20, 2021
@gigony changed the title from "Fix a segmentation fault due to libcufile's atexit() call" to "Fix a segmentation fault due to unloaded libcufile" on Nov 20, 2021
@ajschmidt8 merged commit 891dc45 into rapidsai:branch-21.12 on Nov 20, 2021


gigony commented Nov 23, 2021

@jakirkham More updates on the root cause:

libcufile.so has used a thread_local variable since v1.1, which makes the shared library impossible to unload safely.
For this reason, this patch is the correct fix until libcufile is updated to make unloading possible.

Related information:


gigony commented Dec 3, 2021

The GDS team suggested passing RTLD_NODELETE to dlopen() so that the library is not unloaded.
They will fix the issue on their side by adding -z nodelete to the link flags when building libcufile.so.

dlopen("/usr/local/cuda/targets/x86_64-linux/lib/libcufile.so", RTLD_NOW | RTLD_LOCAL | RTLD_NODELETE);

see #177
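A minimal sketch of how the effect of RTLD_NODELETE could be checked, assuming a glibc-style dlopen() that also supports RTLD_NOLOAD. The helper names open_pinned and is_resident are hypothetical, and any library path a caller passes in is only an example.

```cpp
#include <dlfcn.h>

// Hypothetical helper: opens `path` so that a later dlclose() does not
// unmap it from the process. Returns nullptr if the library cannot be loaded.
void* open_pinned(const char* path) {
    return ::dlopen(path, RTLD_NOW | RTLD_LOCAL | RTLD_NODELETE);
}

// Hypothetical helper: reports whether `path` is currently resident.
// RTLD_NOLOAD makes dlopen() return a handle only if the library is
// already loaded, without loading it.
bool is_resident(const char* path) {
    void* probe = ::dlopen(path, RTLD_NOW | RTLD_NOLOAD);
    if (probe == nullptr) {
        return false;
    }
    ::dlclose(probe);  // drop the extra reference taken by the probe
    return true;
}
```

A caller could run open_pinned() on a library, call ::dlclose() on the returned handle, and then expect is_resident() to still report true for that library. With this flag, existing code paths that call dlclose() would not need to change.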
