Manually add libtorch op test #10758
base: main
Conversation
Co-authored-by: driazati <9407960+driazati@users.noreply.github.com>
So the PyTorch version we are using has the old, incompatible DLPack.
For context: pytorch/pytorch#64995
@masahi Do you think bumping to PyTorch 1.11 would be an option?
I'm attempting to bump the version in #10794
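Since the DLPack incompatibility is keyed to the PyTorch version, one low-tech guard in a CI script is a plain version compare before enabling the libtorch path. A minimal sketch with illustrative values (`have` stands in for whatever `python -c "import torch; print(torch.__version__)"` reports locally; not part of this PR):

```shell
# Hypothetical version gate: require at least PyTorch 1.11, which ships
# the updated DLPack. "have" is a stand-in for the installed version.
have="1.10.2"
need="1.11.0"
# sort -V orders version strings numerically; if "need" sorts first,
# then "have" is new enough.
if [ "$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n1)" = "$need" ]; then
    echo "torch is new enough"
else
    echo "torch too old for the new DLPack"
fi
```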
I'm testing the new docker image in […]
The keras failure looks like it's some compat-breaking change somewhere, I don't know if we accidentally (or intentionally) upgraded that... I'll check the ONNX change, maybe the input has been renamed (I'm not sure that they would be terribly stable unless specified), but likely not before tomorrow.
No worries, I fixed both issues and the test passed: https://github.com/apache/tvm/tree/ci-docker-staging. I will send a PR tomorrow.
Awesome! Thank you!
#10849 was merged, this PR is finally unblocked now.
Any ideas what that could be about?
No idea, this happens only when linked with libtorch. https://discuss.pytorch.org/t/building-pytorch-from-source-failed-on-ubuntu-16-04/43026 suggests our cuda install is broken, but since CI has passed with PT 1.11, I don't think that's the case. One thing that could be related is that our […]. In the worst case, we can install the CPU version of PT, since we don't need cuda capability for CI purposes. cc @driazati
So I wonder if it's that libtorch and tvm try to link to different versions of libcuda...
When we install torch via […]. Or we can try updating our NVIDIA container image to update cuda itself (we should do that anyway sooner or later). I don't know what process is required for that.
I think mostly from looking at the CUDA version it advertises on https://pytorch.org/get-started/locally/ (for the current PyTorch, CUDA 10.2).
So some of the tests seem to look at whether cuda is available to torch. We could go over these and avoid that. I'm wondering whether the […]
Another idea could be to use a CPU-only PyTorch in the build setup and then use a GPU-enabled one in the testing, if the problem is mainly with the linkage of libtvm->libtorch->cuda...
That's a leftover from the initial PR from AWS. It is completely irrelevant and should be removed. We never need GPU-enabled PT for testing purposes. To avoid driver-related problems in the future, we should switch to a CPU-only build.
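As a sketch of what that switch could look like in the CI Docker install scripts (version pins and the wheel index URL follow PyTorch's standard CPU-wheel install instructions of that era; they are illustrative, not from this PR):

```shell
# Install CPU-only PyTorch wheels (no libcuda dependency), so the
# libtvm -> libtorch linkage cannot conflict with the container's CUDA driver.
# Versions are illustrative; the +cpu builds come from PyTorch's wheel index.
pip install torch==1.11.0+cpu torchvision==0.12.0+cpu \
    -f https://download.pytorch.org/whl/cpu/torch_stable.html
```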
Thank you. I'll send a PR.
We can come back to this now. Recently when I tried to test PyTorchTVM stuff #8777, I hit an undefined symbol issue from libtorch. So even though we cleared the cuda issues, I fear we might encounter other issues.
Was this with libtorch symbols?
It's the same problem as https://discuss.tvm.apache.org/t/can-someone-please-give-me-the-steps-to-use-pt-tvmdsoop/12525, something related to […]
Yeah. So I think the official PyTorch Python packages don't use the "new" C++ API due to Python module conventions buried somewhere...
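One way to see which C++ ABI is in play is to ask the installed wheel and to inspect the built library's unresolved symbols. A diagnostic sketch (not part of this PR; the `build/libtvm.so` path is illustrative, and it assumes PyTorch exposes the `torch._C._GLIBCXX_USE_CXX11_ABI` flag, which official wheels of this era report as `False`, i.e. the pre-C++11 ABI):

```shell
# Ask the installed PyTorch which C++ ABI it was compiled with.
python -c "import torch; print(torch._C._GLIBCXX_USE_CXX11_ABI)"

# List undefined (U) dynamic symbols in libtvm, demangled, and look for
# ABI-tagged std::__cxx11 names; a mismatch with the flag above hints at
# linking two different C++ ABIs together.
nm -D -C build/libtvm.so | grep ' U ' | grep -i cxx11 | head
```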
So I'm having trouble reproducing the exact failure on my dev machine, but I use a newer cmake (not sure if that matters, but I don't have a good idea as to what that error message wants to tell me...)
You mean the error […]? Note that updating […]
I don't really know if the cmake version is the problem, as I have not been able to reproduce it locally, nor do I fully understand the error message. So I guess it's nontrivial to find out in the CI if updating cmake helps. :/
No idea either, my cmake-fu is very low unfortunately...
After quite a bit of work, I have reproduced the failure locally and can confirm that bumping cmake to the version compiled from source does solve the error. |
Of course, if a bump of the ubuntu base were imminent, that could be an alternative. |
I see, I don't know if we are updating ubuntu any time soon (cc @driazati), so can you send […]
sorry for my delays, I was able to repro the error using

```shell
gh pr checkout -f 10758

# errors out
python tests/scripts/ci.py gpu
```

upgrading to cmake 3.11.4 (via this patch to `tests/scripts/ci.py`):

```diff
diff --git a/tests/scripts/ci.py b/tests/scripts/ci.py
index c0ce085ff..4ee9725e7 100755
--- a/tests/scripts/ci.py
+++ b/tests/scripts/ci.py
@@ -370,10 +370,22 @@ def generate_command(
     if precheck is not None:
         precheck()
 
+    cmake_stuff = [
+        "sudo apt remove -y cmake",
+        "rm -rf CMake",
+        "git clone --branch v3.11.4 https://github.com/Kitware/CMake.git --depth=1",
+        "pushd CMake",
+        "./configure --parallel=24",
+        "make -j24",
+        "sudo ln -s $(pwd)/bin/cmake /usr/bin/cmake",
+        "popd",
+        "cmake --version",
+    ]
+
     if skip_build:
         scripts = []
     else:
-        scripts = [
+        scripts = cmake_stuff + [
             f"./tests/scripts/task_config_build_{name}.sh {get_build_dir(name)}",
             f"./tests/scripts/task_build.py --build-dir {get_build_dir(name)}",
             # This can be removed once https://github.com/apache/tvm/pull/10257
```

then re-running `tests/scripts/ci.py gpu` […]. So upgrading cmake to 3.11.4 in CI seems like the best way to go (we're on 3.10 right now), so @t-vi if […]
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the more advanced cmake use in libtorch surrounding the CUDA target. We switch to a self-built cmake 3.14 (already used by arm and i386 CI). The context for this is apache#10758 .
I've submitted the build with the new cmake. Thank you for patiently helping me.
So it seems that the libtorch linked into TVM and the libtorch that PyTorch wants are incompatible now. :/
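A possible way to confirm such a mismatch is to compare which libtorch the TVM library resolves against the one shipped inside the installed PyTorch wheel. A diagnostic sketch (not from this thread; the `build/libtvm.so` path is illustrative):

```shell
# Which libtorch does the TVM shared library actually resolve at load time?
ldd build/libtvm.so | grep -i torch

# Where does the installed PyTorch wheel keep its own libtorch?
python -c "import torch, os; print(os.path.join(os.path.dirname(torch.__file__), 'lib'))"
# If the two locations differ, the process can end up mixing two
# incompatible libtorch builds.
```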
We want the test to run in integration-gpu in the CI, but currently it is not picked up there, and on the CPU images we do not have PyTorch. I don't know why it is not tried on the gpu-enabled CI run.