
bump PyTorch version to 1.11 #10794

Merged: 6 commits merged into apache:main on Mar 30, 2022

Conversation

t-vi (Contributor) commented Mar 26, 2022

This bumps PyTorch to 1.11 and fixes 3 test failures. The bump is required to enable the libtorch_ops fallback due to DLPack version incompatibilities.

QAT training has its own fuse_modules version (fuse_modules_qat) in PyTorch, so I changed the test.
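
For older PyTorch versions only the generic helper exists, so the test now picks whichever is available. A minimal sketch of that fallback, assuming fuse_modules_qat is importable from torch.ao.quantization in 1.11; the actual test change in this PR may differ in detail:

# Prefer the QAT-specific fuser introduced in PyTorch 1.11; fall back to the
# generic fuse_modules on older releases.
try:
    from torch.ao.quantization import fuse_modules_qat as fuse_for_qat
except ImportError:  # PyTorch < 1.11
    from torch.quantization import fuse_modules as fuse_for_qat

# Hypothetical usage on a model with a conv/bn/relu block:
# fused_model = fuse_for_qat(model, [["conv", "bn", "relu"]], inplace=False)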

Two amendments to the frontend:

  • searchsorted gains more (optional) parameters in its signature,
  • there is a sub variant with alpha (a - alpha * b). PyTorch rewrites rsub with alpha into this form, but we previously ignored the alpha. Now we handle sub with alpha (see the sketch below).
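
To make the second amendment concrete: aten::sub carries an optional alpha that scales the subtrahend, and PyTorch rewrites rsub with alpha into it, so silently dropping alpha gives wrong results. A small runnable illustration of the semantics the converter now has to honour; the commented handler below is only a sketch with hypothetical names, not the exact TVM code:

import torch

a = torch.tensor([10.0, 20.0])
b = torch.tensor([1.0, 2.0])

# aten::sub(a, b, alpha) computes a - alpha * b; alpha defaults to 1.
assert torch.equal(torch.sub(a, b, alpha=3.0), a - 3.0 * b)

# Sketch of what a frontend handler has to do with the third input
# (hypothetical signature; the real TVM handler may differ):
#
# def sub(self, inputs, input_types):
#     lhs, rhs = inputs[0], inputs[1]
#     alpha = inputs[2] if len(inputs) > 2 else 1
#     if alpha != 1:
#         rhs = rhs * alpha   # scale the subtrahend first
#     return lhs - rhs        # a - alpha * b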

Thank you, @masahi, for getting me started with the bump and for pointing out the test failures. Any errors are my own.

t-vi (Contributor, Author) commented Mar 26, 2022

Caffe2 has been dropped from PyTorch, which is what we are running into here: pytorch/pytorch#67151

t-vi (Contributor, Author) commented Mar 26, 2022

I could use a hint on how to proceed, given that this is likely a major issue for all caffe2 use in TVM.

masahi (Member) commented Mar 26, 2022

Unless there is a standalone way to install caffe2, I think removing caffe2 support entirely is the only way forward. Our caffe2 frontend hasn't been updated in years, so I don't think people would object...

masahi (Member) commented Mar 26, 2022

We can propose dropping caffe2 support to the community next week. In the meantime, we can remove the caffe2 bits from CI to unblock this PR.

masahi (Member) commented Mar 26, 2022

Also, ci-qemu failed to install PT 1.11 due to some version conflict: https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/PR-10794/1/pipeline/67. But I don't know why it needs to install PyTorch at all; it probably only needs ONNX. So I suggest decoupling the PyTorch install from ubuntu_install_onnx.sh and creating ubuntu_install_pytorch.sh.

@@ -2938,7 +2948,7 @@ def create_convert_map(self):
             "aten::pixel_shuffle": self.pixel_shuffle,
             "aten::device": self.none,
             "prim::device": self.none,
-            "aten::sub": self.make_elemwise("subtract"),
+            "aten::sub": self.sub,

masahi (Member) commented on this diff:

It seems this change breaks test_lstm.py:

FAILED test_lstm.py::test_custom_lstm - AttributeError: 'function' object has no attribute 'dtype'

t-vi (Contributor, Author) replied:

I'll see to fixing it. Thank you.

t-vi (Contributor, Author) commented Mar 27, 2022

> Also, ci-qemu failed to install PT 1.11 due to some version conflict

Ohoh. I think this is because Python 3.6 is EOL upstream and PyTorch doesn't support it anymore...
This is a larger can of worms than I had hoped for. 🙂

masahi (Member) commented Mar 27, 2022

I thought we were now running Python 3.7, but maybe the qemu image hasn't been updated yet.

leandron (Contributor) commented:

> I thought we were now running Python 3.7, but maybe the qemu image hasn't been updated yet.

On this, it looks like ci_qemu (https://github.com/apache/tvm/blob/main/docker/Dockerfile.ci_qemu) is not installing Python using the common script: https://github.com/apache/tvm/blob/main/docker/install/ubuntu1804_install_python.sh.

This probably needs to be fixed in a separate PR. Do you want to send that fix? (Asking just because I won’t be able to take this for the next ~two weeks).

masahi (Member) commented Mar 28, 2022

Note that we probably need to update the ci-qemu image version in https://github.com/apache/tvm/blob/main/Jenkinsfile#L54 as well. For that, we need to wait until a new nightly image containing the docker script update from your previous PR is pushed to https://hub.docker.com/r/tlcpackstaging/ci_qemu/tags.

So please wait another day before resuming the PT 1.11 update work, or remove the pytorch install from ubuntu_install_onnx.sh.

t-vi (Contributor, Author) commented Mar 28, 2022

@masahi Thank you for merging the qemu Python bump and the advice. I think waiting for it to show up might be the cleanest option, especially given that I still need to learn so much about how the CI works. :)

t-vi (Contributor, Author) commented Mar 29, 2022

@masahi OK, so now we're part of the nightly with the qemu update. How would I get a version tag that is useful for bumping the version in the Jenkinsfile?

masahi (Member) commented Mar 29, 2022

Yeah, so the way it works is as follows:

  • We update the ci-docker-staging branch https://github.com/apache/tvm/tree/ci-docker-staging to point the Jenkinsfile to the new image in tlcpackstaging.
  • After the CI passes, we retag the nightly image as v0.83 etc. and push it to the tlcpack dockerhub.
  • We update the Jenkinsfile in main to point to the new image on tlcpack, and send a PR to merge this change.

Note that the above process needs to happen for every CI image update. Right now we are in the middle of the ci-qemu update, but after we merge this PR to update PT, we need to go through the same exercise to update ci-gpu.

I've been trying to run a CI job on ci-docker-staging; I keep getting various errors, but hopefully I can get it to pass in a few hours.

t-vi (Contributor, Author) commented Mar 29, 2022

Thank you, @masahi. So if I understand this right, the next step is something you need to do? I'd appreciate a shout if I can proceed here or help other bits along.

masahi (Member) commented Mar 29, 2022

Yes, pushing changes to ci-docker-staging or pushing a new image to tlcpack dockerhub needs to be done by a committer. No worries, I've done this many times. But since we need to wait for at least two CI runs (one for ci-docker-staging to test the new image, another for main to actually update the image), things won't be done by today.

leandron (Contributor) commented:

> Yes, pushing changes to ci-docker-staging or pushing a new image to tlcpack dockerhub needs to be done by a committer. No worries, I've done this many times. But since we need to wait for at least two CI runs (one for ci-docker-staging to test the new image, another for main to actually update the image), things won't be done by today.

Just a heads-up that you're likely to see the issue with updated containers: #10696.

masahi (Member) commented Mar 29, 2022

Yeah, I hit that error once a couple of hours ago; fortunately the ongoing run https://ci.tlcpack.ai/blue/organizations/jenkins/tvm/detail/ci-docker-staging/238/pipeline/ didn't hit it.

masahi (Member) commented Mar 29, 2022

@t-vi the PR #10814 should unblock this work.

masahi (Member) commented Mar 29, 2022

#10815 was merged, so we can resume work on this PR. I restarted the CI job by closing/reopening (there should be a better way than this...).

Also, I posted the caffe2 deprecation announcement at https://discuss.tvm.apache.org/t/caffe2-frontend-support-is-being-dropped-to-unblock-pytorch-update/12442

t-vi (Contributor, Author) commented Mar 30, 2022

The docker build gives

[2022-03-29T23:00:39.045Z] unknown parent image ID sha256:f75815a47f249990da41ca0e349ede16f8710bd2d573de74a9afbd1a9b528055

and I would not even know what to look at here... 😕

masahi (Member) commented Mar 30, 2022

Hopefully it is just a flaky issue, since the error came from the unrelated ci-hexagon image.

t-vi (Contributor, Author) commented Mar 30, 2022

I'm not sure whether it was flakiness or merging in main, but it seems to be past that bit now. Hopefully, if there are more failures, they'll be something I can look into and fix. Thank you for all your help, @masahi!

masahi (Member) commented Mar 30, 2022

You got an error in one of the tests because the CI environment is still using PT 1.10. We only get to use 1.11 at the last step of #10794 (comment).

So can you revert that change, or use different code paths depending on the version?
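
For reference, one way to gate test code on the installed version until CI moves to 1.11 is a small helper like the sketch below (illustrative only; the packaging dependency and the helper name are assumptions, and simply reverting the premature change works too):

import torch
from packaging import version  # assumed to be available in the test environment


def torch_at_least(minimum: str) -> bool:
    """Return True when the installed torch is at least `minimum`, e.g. "1.11.0"."""
    # Strip local build suffixes such as "+cu102" before comparing.
    return version.parse(torch.__version__.split("+")[0]) >= version.parse(minimum)


if torch_at_least("1.11.0"):
    pass  # exercise the 1.11-only code path
else:
    pass  # keep the PT 1.10-compatible path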

t-vi (Contributor, Author) commented Mar 30, 2022

I think the error comes from a premature 1.12 compat change. I'm fixing it right now, but I want to test locally first.

masahi (Member) commented Mar 30, 2022

t-vi (Contributor, Author) commented Mar 30, 2022

All green. 🙂

masahi merged commit 6d42264 into apache:main on Mar 30, 2022

t-vi (Contributor, Author) commented Mar 30, 2022

Thank you, @masahi, for merging and helping me. So next we would need to update the GPU docker image used in CI before I can return to enabling the test that needs PyTorch 1.11?

masahi (Member) commented Mar 30, 2022

Yes, the next step is to wait until the nightly image appears at https://hub.docker.com/r/tlcpackstaging/ci_gpu/tags. That should happen in about 12 hours.

leandron (Contributor) commented Apr 5, 2022

To confirm here, since I did the update this time: the latest version of the images now contains PyTorch 1.11.

$ docker run -it --rm tlcpack/ci-gpu:v0.84 bash
root@6ba934f13b82:/# python3
Python 3.7.5 (default, Dec  9 2021, 17:04:37) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.11.0+cu102'
>>> 

masahi (Member) commented Apr 5, 2022

The update to PT 1.11 was already done in #10849

pfk-beta (pfk-beta/tvm) and mehrdadh (mehrdadh/tvm) each pushed a commit referencing this pull request on Apr 11, 2022:

* bump PyTorch version to 1.11
* disable some caffe2 ci
* Fix sub conversion in PyTorch frontend
* use fuse_modules_qat if available, fallback to fuse_modules for older PyTorch
* Re-Run CI