Manually add libtorch op test #10758

Open · wants to merge 14 commits into main

Conversation

@t-vi (Contributor) commented Mar 24, 2022

We want the test to run in integration-gpu in the CI, but currently it is not picked up there, and on the CPU we do not have PyTorch. I don't know why it is not attempted on the GPU-enabled CI run.

t-vi and others added 2 commits March 25, 2022 17:24
Co-authored-by: driazati <9407960+driazati@users.noreply.github.com>
@t-vi (Contributor Author) commented Mar 25, 2022

So the PyTorch version we are using has the old, incompatible DLPack.
Quite likely this means we cannot enable the libtorch op before we move to PyTorch 1.11.
Would that be feasible?
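
For reference, here is a minimal sketch of the DLPack round trip the libtorch integration relies on, using only the public Python APIs (torch.utils.dlpack and tvm.nd); this only illustrates the interface involved, not the C++ path the op actually takes:

```python
# Minimal sketch: round-tripping a tensor between PyTorch and TVM via DLPack.
# With pre-1.11 PyTorch wheels the bundled DLPack headers are older than the
# ones TVM ships, which is what breaks the libtorch op at the C++ level; the
# Python-level exchange below only illustrates the interface involved.
import torch
import tvm
from torch.utils import dlpack as torch_dlpack

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# PyTorch tensor -> DLPack capsule -> TVM NDArray (zero copy)
tvm_arr = tvm.nd.from_dlpack(torch_dlpack.to_dlpack(t))

# TVM NDArray -> DLPack capsule -> PyTorch tensor
t_back = torch_dlpack.from_dlpack(tvm_arr.to_dlpack())
assert torch.equal(t, t_back)
```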

@t-vi (Contributor Author) commented Mar 25, 2022

For context: pytorch/pytorch#64995

@t-vi (Contributor Author) commented Mar 25, 2022

@masahi Do you think bumping to PyTorch 1.11 would be an option?

@t-vi (Contributor Author) commented Mar 26, 2022

I'm attempting to bump the version in #10794

@masahi (Member) commented Mar 31, 2022

I'm testing the new docker image in ci-docker-staging. I've hit an error from Keras twice (https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/242/pipeline/315); not sure if it is related or flaky. There is also an error from the ONNX tests that use PyTorch.

@t-vi (Contributor Author) commented Mar 31, 2022

The Keras failure looks like some compat-breaking change somewhere; I don't know if we accidentally (or intentionally) upgraded it...

I'll check the ONNX failure; maybe the input has been renamed (I'm not sure input names are terribly stable unless specified explicitly), but likely not before tomorrow.

@masahi (Member) commented Mar 31, 2022

No worries, I fixed both issues and the tests passed: https://github.com/apache/tvm/tree/ci-docker-staging. I will send a PR tomorrow.

@t-vi (Contributor Author) commented Mar 31, 2022 via email

@masahi (Member) commented Apr 1, 2022

#10849 was merged, this PR is finally unblocked now.

@t-vi (Contributor Author) commented Apr 1, 2022

[2022-04-01T19:22:29.075Z] /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libcuda.so: file not recognized: File truncated
[2022-04-01T19:22:29.075Z] collect2: error: ld returned 1 exit status
[2022-04-01T19:22:29.075Z] CMakeFiles/tvm_runtime.dir/build.make:253: recipe for target 'libtvm_runtime.so' failed
[2022-04-01T19:22:29.075Z] make[2]: Leaving directory '/workspace/build'
[2022-04-01T19:22:29.075Z] make[2]: *** [libtvm_runtime.so] Error 1

any ideas what that could be about?

@masahi (Member) commented Apr 1, 2022

No idea, this happens only when linked with libtorch.so, right?

https://discuss.pytorch.org/t/building-pytorch-from-source-failed-on-ubuntu-16-04/43026 suggests our cuda install is broken, but since CI has passed with PT 1.11, I don't think that's the case.

One thing that could be related is that our ci-gpu is using fairly old cuda - 11.0 according to https://github.com/apache/tvm/blob/main/docker/Dockerfile.ci_gpu#L20. Maybe this is incompatible with the latest libtorch.so? Even if that's the case, I don't get why we didn't get any errors on the CI test.

In the worst case, we can install the CPU-only version of PyTorch, since we don't need CUDA capability for CI purposes.

cc @driazati

@t-vi (Contributor Author) commented Apr 1, 2022

So I wonder whether libtorch and TVM try to link against different versions of libcuda...

@masahi (Member) commented Apr 1, 2022

When we install torch via pip install torch==1.11.0, how do we know which CUDA it was built against? Maybe we should explicitly specify the CUDA version as on https://pytorch.org/ (pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113). But a better way is still to install a CPU-only version.

Or we can try updating our NVIDIA container image to update cuda itself (we should do that anyway sooner or later). I don't know what process is required for that.

@t-vi (Contributor Author) commented Apr 2, 2022

How do we know which cuda it was built against?

I think mostly from looking at the CUDA version it advertises on https://pytorch.org/get-started/locally/ (for the current PyTorch, CUDA 10.2).
If the problem only occurs when linking libtvm against libtorch, a CPU-only PyTorch could solve this, or maybe we could just find a PyTorch-enabled CPU instance to build the fallback on and run this test there.
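
As a quick diagnostic (not part of this PR), the installed wheel itself can be queried from Python for the toolkit it was built against; a small sketch:

```python
# Diagnostic sketch: which CUDA toolkit (if any) the installed PyTorch wheel
# was built against, and whether a usable GPU/driver is visible at runtime.
import torch

print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)  # None for CPU-only wheels
print("cuDNN:", torch.backends.cudnn.version())   # None if built without cuDNN
print("GPU visible at runtime:", torch.cuda.is_available())
```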

@t-vi (Contributor Author) commented Apr 5, 2022

So some of the tests seem to look at whether CUDA is available to torch. We could go over these and avoid that. By the way, I'm wondering whether the measure_latency function is used at all:

def measure_latency(model, input_shapes, output_shapes, thresh, dryruns=40):

Another idea could be to use a CPU-only PyTorch in the build setup and then use a GPU-enabled one in the testing if the problem is mainly with the linkage of libtvm->libtorch->cuda...
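
To make that concrete, here is a sketch of the pattern in question and a CPU-pinned alternative (the helper and test names are hypothetical, not the actual TVM tests):

```python
# Sketch of the pattern in question: tests that choose their device based on
# whether CUDA is visible to torch at run time.  With a CPU-only libtorch in
# the build image this branch always falls back to "cpu", so the check could
# simply be dropped and the device pinned explicitly.
import pytest
import torch


def current_pattern_device():
    # existing style: depends on a GPU being visible to torch
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


@pytest.mark.parametrize("device", ["cpu"])  # proposed: pin to CPU explicitly
def test_torch_roundtrip(device):
    x = torch.randn(4, 4, device=device)
    assert torch.allclose(x, x.clone())
```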

@masahi (Member) commented Apr 5, 2022

I'm wondering whether the measure_latency function is used at all

That's a leftover from the initial PR from AWS. It is completely irrelevant and should be removed.

We never need a GPU-enabled PyTorch for testing purposes. To avoid driver-related problems in the future, we should switch to a CPU-only build.

@t-vi (Contributor Author) commented Apr 5, 2022 via email

@t-vi mentioned this pull request Apr 6, 2022
@masahi mentioned this pull request Apr 13, 2022
@masahi (Member) commented Apr 13, 2022

We can come back to this now. Recently, when I tried to test the PyTorchTVM stuff in #8777, I hit an undefined-symbol issue from libtorch. So even though we cleared the CUDA issues, I fear we might encounter other issues.

@t-vi (Contributor Author) commented Apr 20, 2022

I hit an undefined-symbol issue from libtorch. So even though we cleared the CUDA issues, I fear we might encounter other issues.

Was this with libtorch symbols?
I think the libtorch ABI is not terribly stable (to put it mildly), but if we use the exact same libtorch distribution (and my understanding is that we do in the CI), it should work.
I rarely hit this particular problem, because I use a global PyTorch installation for everything, but then I always get missing symbols if I update PyTorch without updating the auxiliary libraries that have extension modules...

@masahi (Member) commented Apr 20, 2022

It's the same problem as https://discuss.tvm.apache.org/t/can-someone-please-give-me-the-steps-to-use-pt-tvmdsoop/12525, something related to GLIBCXX_USE_CXX11_ABI issues apparently.

@t-vi (Contributor Author) commented Apr 20, 2022

Yeah. So I think the official PyTorch Python packages don't use the "new" C++ ABI due to Python module conventions buried somewhere...
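
For what it's worth, the ABI a given wheel was built with can be queried from Python; a small diagnostic sketch (the flag in the comments is the standard libstdc++ macro, and whether a particular build forwards it to the compiler is an assumption):

```python
# Diagnostic sketch: the official PyTorch pip wheels are (currently) built
# with the old libstdc++ ABI (_GLIBCXX_USE_CXX11_ABI=0).  If the code linking
# against libtorch is compiled with the new ABI (=1), std::string-carrying
# symbols won't match and you get undefined-symbol errors like the one above.
import torch

print("libtorch built with C++11 ABI:", torch.compiled_with_cxx11_abi())
# When compiling C++ code against this libtorch, the matching flag would be,
# e.g., -D_GLIBCXX_USE_CXX11_ABI=0 (assumption: your build forwards this to
# the C++ compiler).
```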

@t-vi (Contributor Author) commented Apr 22, 2022

So I'm having trouble reproducing the exact failure on my dev machine, but I use a newer cmake (not sure if that matters, but I don't have a good idea of what that error message is trying to tell me...).

@masahi (Member) commented Apr 23, 2022

but I don't have a good idea of what that error message is trying to tell me...

You mean the "$<COMPILE_LANGUAGE:...> Unknown language" error from https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-10758/9/pipeline/? If PT 1.11 requires a newer cmake, maybe you can reproduce the error by using an older cmake?

Note that updating the Dockerfile for cmake requires going through another CI image update process. If the cmake version is indeed the problem, I can take care of another update.

@t-vi (Contributor Author) commented Apr 25, 2022

I don't really know if the cmake version is the problem, as I have not been able to reproduce it locally, nor do I fully understand the error message. So I guess it's nontrivial to find out in CI whether updating cmake helps. :/
If you had any advice, I'd gladly follow it.

@masahi (Member) commented Apr 25, 2022

No idea either, my cmake-fu is very low unfortunately...

@t-vi (Contributor Author) commented Apr 27, 2022

After quite a bit of work, I have reproduced the failure locally and can confirm that bumping cmake to the version compiled from source does solve the error.

@t-vi (Contributor Author) commented Apr 27, 2022

Of course, if a bump of the Ubuntu base were imminent, that could be an alternative.

@masahi (Member) commented Apr 27, 2022

I see. I don't know if we are updating Ubuntu any time soon (cc @driazati), so can you send the docker/Dockerfile.ci_gpu change in a separate PR?

@driazati (Member) commented Apr 27, 2022

Sorry for my delays; I was able to repro the error using tests/scripts/ci.py:

gh pr checkout -f 10758
# errors out
python tests/scripts/ci.py gpu

Upgrading to cmake 3.11.4 (via this patch to ci.py) seems to fix it, and the build finishes successfully:

diff --git a/tests/scripts/ci.py b/tests/scripts/ci.py
index c0ce085ff..4ee9725e7 100755
--- a/tests/scripts/ci.py
+++ b/tests/scripts/ci.py
@@ -370,10 +370,22 @@ def generate_command(
         if precheck is not None:
             precheck()
 
+        cmake_stuff = [
+            "sudo apt remove -y cmake",
+            "rm -rf CMake",
+            "git clone --branch v3.11.4 https://github.com/Kitware/CMake.git --depth=1",
+            "pushd CMake",
+            "./configure --parallel=24",
+            "make -j24",
+            "sudo ln -s $(pwd)/bin/cmake /usr/bin/cmake",
+            "popd",
+            "cmake --version",
+        ]
+
         if skip_build:
             scripts = []
         else:
-            scripts = [
+            scripts = cmake_stuff + [
                 f"./tests/scripts/task_config_build_{name}.sh {get_build_dir(name)}",
                 f"./tests/scripts/task_build.py --build-dir {get_build_dir(name)}",
                 # This can be removed once https://github.com/apache/tvm/pull/10257

then re-running

tests/scripts/ci.py gpu

So upgrading cmake to 3.11.4 in CI seems like the best way to go (we're on 3.10 right now). @t-vi, if ubuntu_install_cmake_source.sh builds cmake 3.11.4, that'd be good, and we'll need another (😢) docker image update as well.

t-vi added a commit to t-vi/tvm that referenced this pull request Apr 28, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is apache#10758 .
@t-vi (Contributor Author) commented Apr 28, 2022

I've submitted the build with the new cmake. Thank you for patiently helping me.

Mousius pushed a commit that referenced this pull request Apr 28, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is #10758 .
shtinsa pushed a commit to Deelvin/tvm that referenced this pull request May 17, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is apache#10758 .
@t-vi (Contributor Author) commented May 23, 2022

So it seems that the libtorch linked into TVM and the libtorch that PyTorch wants are now incompatible. :/
I will check whether it is possible to keep the linked-in libtorch symbols locally visible, in the hope that they then don't interfere, but I'm not sure that will work.

juda pushed a commit to juda/tvm that referenced this pull request Jun 21, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is apache#10758 .
@areusch added and then removed the needs-triage label Oct 19, 2022