Manually add libtorch op test #10758

Open · wants to merge 14 commits into main

Conversation

@t-vi (Contributor) commented Mar 24, 2022

We want the test to run in integration-gpu in the CI, but currently it is not picked up there, and on the CPU we do not have PyTorch. I don't know why it is not attempted on the GPU-enabled CI run.

t-vi and others added 2 commits March 25, 2022 17:24
Co-authored-by: driazati <9407960+driazati@users.noreply.github.com>
@t-vi (Contributor Author) commented Mar 25, 2022

So the PyTorch version we are using has the old, incompatible DLPack.
Quite likely this means we cannot enable the libtorch op before we move to PyTorch 1.11.
Would that be feasible?
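
For reference, here is a minimal sketch of the DLPack round trip the libtorch integration relies on, using only the public Python APIs (torch.utils.dlpack and tvm.nd); this only illustrates the interface involved, not the C++ path the op actually takes:

```python
# Minimal sketch: round-tripping a tensor between PyTorch and TVM via DLPack.
# With pre-1.11 PyTorch wheels the bundled DLPack headers are older than the
# ones TVM ships, which is what breaks the libtorch op at the C++ level; the
# Python-level exchange below only illustrates the interface involved.
import torch
import tvm
from torch.utils import dlpack as torch_dlpack

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# PyTorch tensor -> DLPack capsule -> TVM NDArray (zero copy)
tvm_arr = tvm.nd.from_dlpack(torch_dlpack.to_dlpack(t))

# TVM NDArray -> DLPack capsule -> PyTorch tensor
t_back = torch_dlpack.from_dlpack(tvm_arr.to_dlpack())
assert torch.equal(t, t_back)
```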

@t-vi (Contributor Author) commented Mar 25, 2022

For context: pytorch/pytorch#64995

@t-vi (Contributor Author) commented Mar 25, 2022

@masahi Do you think bumping to PyTorch 1.11 would be an option?

@t-vi (Contributor Author) commented Mar 26, 2022

I'm attempting to bump the version in #10794

@masahi (Member) commented Mar 31, 2022

I'm testing the new docker image in ci-docker-staging. I've hit an error from Keras twice (https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/242/pipeline/315); not sure if it is related or flaky. There is also an error from the ONNX tests that use PyTorch.

@t-vi (Contributor Author) commented Mar 31, 2022

The Keras failure looks like some compat-breaking change somewhere; I don't know if we accidentally (or intentionally) upgraded it...

I'll check the ONNX failure; maybe the input has been renamed (I'm not sure input names are terribly stable unless specified explicitly), but likely not before tomorrow.

@masahi (Member) commented Mar 31, 2022

No worries, I fixed both issues and the tests passed: https://github.com/apache/tvm/tree/ci-docker-staging. I will send a PR tomorrow.

@t-vi (Contributor Author) commented Mar 31, 2022 via email

@masahi (Member) commented Apr 1, 2022

#10849 was merged, this PR is finally unblocked now.

@t-vi (Contributor Author) commented Apr 1, 2022

[2022-04-01T19:22:29.075Z] /usr/lib/gcc/x86_64-linux-gnu/7/../../../x86_64-linux-gnu/libcuda.so: file not recognized: File truncated
[2022-04-01T19:22:29.075Z] collect2: error: ld returned 1 exit status
[2022-04-01T19:22:29.075Z] CMakeFiles/tvm_runtime.dir/build.make:253: recipe for target 'libtvm_runtime.so' failed
[2022-04-01T19:22:29.075Z] make[2]: Leaving directory '/workspace/build'
[2022-04-01T19:22:29.075Z] make[2]: *** [libtvm_runtime.so] Error 1

any ideas what that could be about?

@masahi (Member) commented Apr 1, 2022

No idea, this happens only when linked with libtorch.so, right?

https://discuss.pytorch.org/t/building-pytorch-from-source-failed-on-ubuntu-16-04/43026 suggests our cuda install is broken, but since CI has passed with PT 1.11, I don't think that's the case.

One thing that could be related is that our ci-gpu is using fairly old cuda - 11.0 according to https://github.com/apache/tvm/blob/main/docker/Dockerfile.ci_gpu#L20. Maybe this is incompatible with the latest libtorch.so? Even if that's the case, I don't get why we didn't get any errors on the CI test.

In the worst case, we can install the CPU-only version of PyTorch, since we don't need CUDA capability for CI purposes.

cc @driazati

@t-vi (Contributor Author) commented Apr 1, 2022

So I wonder whether libtorch and TVM try to link against different versions of libcuda...

@masahi (Member) commented Apr 1, 2022

When we install torch via pip install torch==1.11.0, how do we know which CUDA it was built against? Maybe we should explicitly specify the CUDA version as on https://pytorch.org/ (pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu113). But a better way is still to install a CPU-only version.

Or we can try updating our NVIDIA container image to update cuda itself (we should do that anyway sooner or later). I don't know what process is required for that.

@t-vi (Contributor Author) commented Apr 2, 2022

How do we know which cuda it was built against?

I think mostly from looking at the CUDA version it advertises on https://pytorch.org/get-started/locally/ (for the current PyTorch, CUDA 10.2).
If the problem only occurs when linking libtvm against libtorch, a CPU-only PyTorch could solve this, or maybe we could just find a PyTorch-enabled CPU instance to build the fallback on and run this test there.
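
As a quick diagnostic (not part of this PR), the installed wheel itself can be queried from Python for the toolkit it was built against; a small sketch:

```python
# Diagnostic sketch: which CUDA toolkit (if any) the installed PyTorch wheel
# was built against, and whether a usable GPU/driver is visible at runtime.
import torch

print("torch version:", torch.__version__)
print("built against CUDA:", torch.version.cuda)  # None for CPU-only wheels
print("cuDNN:", torch.backends.cudnn.version())   # None if built without cuDNN
print("GPU visible at runtime:", torch.cuda.is_available())
```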

@t-vi (Contributor Author) commented Apr 5, 2022

So some of the tests seem to look at whether CUDA is available to torch. We could go over these and avoid that. By the way, I'm wondering whether the measure_latency function is used at all:

def measure_latency(model, input_shapes, output_shapes, thresh, dryruns=40):

Another idea could be to use a CPU-only PyTorch in the build setup and then use a GPU-enabled one in the testing if the problem is mainly with the linkage of libtvm->libtorch->cuda...
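
To make that concrete, here is a sketch of the pattern in question and a CPU-pinned alternative (the helper and test names are hypothetical, not the actual TVM tests):

```python
# Sketch of the pattern in question: tests that choose their device based on
# whether CUDA is visible to torch at run time.  With a CPU-only libtorch in
# the build image this branch always falls back to "cpu", so the check could
# simply be dropped and the device pinned explicitly.
import pytest
import torch


def current_pattern_device():
    # existing style: depends on a GPU being visible to torch
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")


@pytest.mark.parametrize("device", ["cpu"])  # proposed: pin to CPU explicitly
def test_torch_roundtrip(device):
    x = torch.randn(4, 4, device=device)
    assert torch.allclose(x, x.clone())
```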

@masahi (Member) commented Apr 5, 2022

I'm wondering whether the measure_latency function is used at all

That's a leftover from the initial PR from AWS. It is completely irrelevant and should be removed.

We never need a GPU-enabled PyTorch for testing purposes. To avoid driver-related problems in the future, we should switch to a CPU-only build.

@t-vi (Contributor Author) commented Apr 5, 2022 via email

@t-vi mentioned this pull request Apr 6, 2022
@masahi mentioned this pull request Apr 13, 2022
@masahi (Member) commented Apr 13, 2022

We can come back to this now. Recently, when I tried to test the PyTorchTVM stuff in #8777, I hit an undefined-symbol issue from libtorch. So even though we cleared the CUDA issues, I fear we might encounter other issues.

@t-vi (Contributor Author) commented Apr 20, 2022

I hit an undefined-symbol issue from libtorch. So even though we cleared the CUDA issues, I fear we might encounter other issues.

Was this with libtorch symbols?
I think the libtorch ABI is not terribly stable (to put it mildly), but if we use the exact same libtorch distribution (and my understanding is that we do in the CI), it should work.
I rarely hit this particular problem, because I use a global PyTorch installation for everything, but then I always get missing symbols if I update PyTorch without updating the auxiliary libraries that have extension modules...

@masahi (Member) commented Apr 20, 2022

It's the same problem as https://discuss.tvm.apache.org/t/can-someone-please-give-me-the-steps-to-use-pt-tvmdsoop/12525, something related to GLIBCXX_USE_CXX11_ABI issues apparently.

@t-vi (Contributor Author) commented Apr 20, 2022

Yeah. So I think the official PyTorch Python packages don't use the "new" C++ ABI due to Python module conventions buried somewhere...
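
For what it's worth, the ABI a given wheel was built with can be queried from Python; a small diagnostic sketch (the flag in the comments is the standard libstdc++ macro, and whether a particular build forwards it to the compiler is an assumption):

```python
# Diagnostic sketch: the official PyTorch pip wheels are (currently) built
# with the old libstdc++ ABI (_GLIBCXX_USE_CXX11_ABI=0).  If the code linking
# against libtorch is compiled with the new ABI (=1), std::string-carrying
# symbols won't match and you get undefined-symbol errors like the one above.
import torch

print("libtorch built with C++11 ABI:", torch.compiled_with_cxx11_abi())
# When compiling C++ code against this libtorch, the matching flag would be,
# e.g., -D_GLIBCXX_USE_CXX11_ABI=0 (assumption: your build forwards this to
# the C++ compiler).
```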

@t-vi (Contributor Author) commented Apr 22, 2022

So I'm having trouble reproducing the exact failure on my dev machine, but I use a newer cmake (not sure if that matters, but I don't have a good idea of what that error message is trying to tell me...).

@masahi (Member) commented Apr 23, 2022

but I don't have a good idea of what that error message is trying to tell me...

You mean the "$<COMPILE_LANGUAGE:...> Unknown language" error from https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-10758/9/pipeline/? If PT 1.11 requires a newer cmake, maybe you can reproduce the error by using an older cmake?

Note that updating the Dockerfile for cmake requires going through another CI image update process. If the cmake version is indeed the problem, I can take care of another update.

@t-vi (Contributor Author) commented Apr 25, 2022

I don't really know if the cmake version is the problem, as I have not been able to reproduce it locally, nor do I fully understand the error message. So I guess it's nontrivial to find out in CI whether updating cmake helps. :/
If you had any advice, I'd gladly follow it.

@masahi (Member) commented Apr 25, 2022

No idea either, my cmake-fu is very low unfortunately...

@t-vi (Contributor Author) commented Apr 27, 2022

After quite a bit of work, I have reproduced the failure locally and can confirm that bumping cmake to the version compiled from source does solve the error.

@t-vi (Contributor Author) commented Apr 27, 2022

Of course, if a bump of the Ubuntu base were imminent, that could be an alternative.

@masahi (Member) commented Apr 27, 2022

I see. I don't know if we are updating Ubuntu any time soon (cc @driazati), so can you send the docker/Dockerfile.ci_gpu change in a separate PR?

@driazati (Member) commented Apr 27, 2022

Sorry for my delays; I was able to repro the error using tests/scripts/ci.py:

gh pr checkout -f 10758
# errors out
python tests/scripts/ci.py gpu

Upgrading to cmake 3.11.4 (via this patch to ci.py) seems to fix it, and the build finishes successfully:

diff --git a/tests/scripts/ci.py b/tests/scripts/ci.py
index c0ce085ff..4ee9725e7 100755
--- a/tests/scripts/ci.py
+++ b/tests/scripts/ci.py
@@ -370,10 +370,22 @@ def generate_command(
         if precheck is not None:
             precheck()
 
+        cmake_stuff = [
+            "sudo apt remove -y cmake",
+            "rm -rf CMake",
+            "git clone --branch v3.11.4 https://github.com/Kitware/CMake.git --depth=1",
+            "pushd CMake",
+            "./configure --parallel=24",
+            "make -j24",
+            "sudo ln -s $(pwd)/bin/cmake /usr/bin/cmake",
+            "popd",
+            "cmake --version",
+        ]
+
         if skip_build:
             scripts = []
         else:
-            scripts = [
+            scripts = cmake_stuff + [
                 f"./tests/scripts/task_config_build_{name}.sh {get_build_dir(name)}",
                 f"./tests/scripts/task_build.py --build-dir {get_build_dir(name)}",
                 # This can be removed once https://github.com/apache/tvm/pull/10257

then re-running

tests/scripts/ci.py gpu

So upgrading cmake to 3.11.4 in CI seems like the best way to go (we're on 3.10 right now). @t-vi, if ubuntu_install_cmake_source.sh builds cmake 3.11.4, that'd be good, and we'll need another (😢) docker image update as well.

t-vi added a commit to t-vi/tvm that referenced this pull request Apr 28, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is apache#10758 .
@t-vi (Contributor Author) commented Apr 28, 2022

I've submitted the build with the new cmake. Thank you for patiently helping me.

Mousius pushed a commit that referenced this pull request Apr 28, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is #10758 .
shtinsa pushed a commit to Deelvin/tvm that referenced this pull request May 17, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is apache#10758 .
@t-vi (Contributor Author) commented May 23, 2022

So it seems that the libtorch linked into TVM and the libtorch that PyTorch wants are now incompatible. :/
I will check whether it is possible to keep the linked-in libtorch symbols locally visible, in the hope that they then don't interfere, but I'm not sure that will work.

juda pushed a commit to juda/tvm that referenced this pull request Jun 21, 2022
The cmake version (3.10) in Ubuntu 18.04 does not cope well with the
more advanced cmake use in libtorch surrounding the CUDA target.
We switch to a self-built cmake 3.14 (already used by arm and i386 CI).

The context for this is apache#10758 .
@areusch added and then removed the needs-triage label Oct 19, 2022