correct placement of +PTX flag + correcting cuda arch list #114

Closed · 1 task done · Tracked by #107
ngam opened this issue Jun 2, 2022 · 25 comments · Fixed by #116
Labels
bug Something isn't working · question Further information is requested

Comments

@ngam (Contributor) commented Jun 2, 2022

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

Following (now deleted) comments in #108:

maybe +PTX should be removed, or moved to the end of the arch list we have here.

https://pytorch.org/docs/stable/cpp_extension.html

The +PTX option causes extension kernel binaries to include PTX instructions for the specified CC. PTX is an intermediate representation that allows kernels to runtime-compile for any CC >= the specified CC (for example, 8.6+PTX generates PTX that can runtime-compile for any GPU with CC >= 8.6). This improves your binary’s forward compatibility. However, relying on older PTX to provide forward compat by runtime-compiling for newer CCs can modestly reduce performance on those newer CCs. If you know exact CC(s) of the GPUs you want to target, you’re always better off specifying them individually. For example, if you want your extension to run on 8.0 and 8.6, “8.0+PTX” would work functionally because it includes PTX that can runtime-compile for 8.6, but “8.0 8.6” would be better.
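
For concreteness, here is roughly how such list entries translate into nvcc flags (an illustrative sketch, not the exact flags the build emits):

    # "8.0 8.6"  ->  -gencode arch=compute_80,code=sm_80
    #                -gencode arch=compute_86,code=sm_86               (SASS only)
    # "8.0+PTX"  ->  -gencode arch=compute_80,code=[sm_80,compute_80]  (SASS + PTX)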

Also, currently, this feedstock uses:

    export TORCH_CUDA_ARCH_LIST="3.5;5.0+PTX"

as the base cuda_arch list, but it should actually be "3.7+PTX;5.0" to match upstream. It is debatable whether the PTX is useful here --- see above.

Installed packages

n/a

Environment info

n/a
@ngam ngam added the bug Something isn't working label Jun 2, 2022
@ngam ngam closed this as completed Jun 2, 2022
@ngam ngam reopened this Jun 3, 2022
@ngam ngam changed the title correct placement of +PTX flag correct placement of +PTX flag + correcting cuda arch list Jun 3, 2022
@hmaarrfk (Contributor) commented Jun 4, 2022

It seems that we are only missing 3.7, which is useful for the K80: pytorch/pytorch#24205 (comment)

Anyway, pytorch wants to drop it :/

@ngam (Contributor, Author) commented Jun 4, 2022

Anyway, pytorch wants to drop it :/

I am 100% fine with dropping it! 😜

@hmaarrfk (Contributor) commented Jun 4, 2022

Ok, let's not change the build script until we get a user that specifically asks for it.

@hmaarrfk (Contributor) commented Jun 4, 2022

No, not that I am aware of. I am also fine with waiting (or potentially dropping the older GPUs as pytorch upstream drops them).

I am still confused about the validity and usefulness of PTX here. I have a negative feeling about it tbh, because I am afraid it is stalling the compilation by pushing it to take into account things that are already taken into account (i.e. 5.0+PTX means 5.0 and everything after it, but we are already listing everything after it). I will try to run this separately to see what's going on, but for this PR, we can ignore it :)

I think PTX allows you to combine:

  • an OLD build of software (pytorch, for example)
  • NEW PTX JIT compilation
  • a NEW cuda driver
  • a NEW GPU
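
As an aside, one way to check which targets a given torch binary actually shipped is its public arch-list API (the exact output below is hypothetical):

    # print the compute capabilities baked into the installed binary;
    # 'sm_XX' entries are SASS, 'compute_XX' entries are PTX
    python -c "import torch; print(torch.cuda.get_arch_list())"
    # e.g. ['sm_35', 'sm_50', 'sm_60', ..., 'compute_50']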

@ngam (Contributor, Author) commented Jun 4, 2022

Okay, sounds good to me. Let's close this then for the time being!

@ngam ngam closed this as not planned Jun 4, 2022
@ngam (Contributor, Author) commented Jun 4, 2022

Thanks for engaging productively/constructively and patiently btw, not just with me but with others as well. 🌟

@h-vetinari (Member)

That's a seriously short architecture list, and in fact both of those have been deprecated as of cuda 11. I've had something like

    export TORCH_CUDA_ARCH_LIST="3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6"

in the faiss-feedstock for a long time already (actually dropping 3.5/5.0 for cuda 11+).

PTX is not really great because the runtime compilation can look like a mysterious hang, and it can take a while.

IMO we should pre-compile for the most common GPU architectures, even though that will unfortunately further increase compile times.

@h-vetinari (Member)

Not sure if the links are all up-to-date, but I tried to document (mostly for myself) what I was doing in the build script: https://github.com/conda-forge/faiss-split-feedstock/blob/main/recipe/build-lib.sh

@hmaarrfk (Contributor) commented Jun 5, 2022

This is appended to based on the cuda version:

https://github.com/conda-forge/pytorch-cpu-feedstock/blob/main/recipe/build_pytorch.sh#L86
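
A paraphrased sketch of that logic (the exact branches and arch values live in the linked build_pytorch.sh; the ones below are illustrative):

    if [[ ${cuda_compiler_version} == 11.2* ]]; then
        export TORCH_CUDA_ARCH_LIST="$TORCH_CUDA_ARCH_LIST;6.0;6.1;7.0;7.5;8.0;8.6"
    fi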

I don't think anybody is falling back on PTX. I'm not sure we even ship it by default.

@ngam (Contributor, Author) commented Jun 8, 2022

export TORCH_CUDA_ARCH_LIST="3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6"

Yes, that was the point of this issue: to move from

3.5;5.0+PTX;6.0;6.1;7.0;7.5;8.0;8.6

to:

3.7;5.0;6.0;6.1;7.0;7.5;8.0;8.6

i.e. dropping "+PTX" and changing 3.5 to 3.7. As a diff starting at L84 in build.sh:

 if [[ ${cuda_compiler_version} != "None" ]]; then
     export USE_CUDA=1
-    export TORCH_CUDA_ARCH_LIST="3.5;5.0+PTX"
+    export TORCH_CUDA_ARCH_LIST="3.7;5.0"
     if [[ ${cuda_compiler_version} == 9.0* ]]; then
         export TORCH_CUDA_ARCH_LIST="$TORCH_CUDA_ARCH_LIST;6.0;7.0"

Sorry for the miscommunication. I concur with the suggestion that we don't change things unless someone asks; for the time being, I am okay with that too. It is not an urgent issue.

@hmaarrfk (Contributor) commented Jun 8, 2022

I think we should keep the PTX. It ensures that those who buy the newest GPUs still get a working package.

It should not affect those with existing GPUs.

@ngam (Contributor, Author) commented Jun 8, 2022

My concern with PTX is the speed of compilation.

I understand what you're saying about newer GPUs, but I would be blunt and say those fancy new GPUs should probably get the newer pytorch releases.

My vague worry is something along these lines:

  • we currently have: 5.0+PTX
  • this means (according to the docs above) that it will make it work with anything above 5.0, right?
  • but we already list everything we want above 5.0
  • so if we want to keep +PTX, we should move it to the end of the list, e.g. 8.6+PTX (sketched after this list)
  • but even then, what's the +PTX doing anyway?
  • I am afraid it is consuming compilation time for no reason other than decorative/failsafe optionality...
  • (I am quite puzzled why they still have it upstream?!)
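
For reference, moving the flag to the end would make the full list look like this (an illustrative sketch of the proposal, not a committed value):

    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"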

@ngam (Contributor, Author) commented Jun 8, 2022

It's on my list of items to test next time I run/build with singularity on an HPC (I try to keep a low profile...). If I find the compilation time differs, I will come back and reopen this issue...

@isuruf (Member) commented Jun 8, 2022

so if we want to keep +PTX, we should move to the end of each list, e.g. 8.6+PTX

Agreed.

but even then, what's the +PTX doing anyway?

For compute capability greater than 8.6, it will generate PTX code and SASS will be generated at runtime.

My concern with PTX is the speed of compilation.

Why? Generating PTX is only one additional target, and removing it won't magically speed things up. For cuda 11.2, you'd be reducing the total time by only 1/9th.

@ngam (Contributor, Author) commented Jun 8, 2022

Why? Generating PTX is only one additional target, and removing it won't magically speed things up. For cuda 11.2, you'd be reducing the total time by only 1/9th.

Oh, so adding PTX is only like adding one additional arch? I obviously do not know enough. My worry/concern was that, e.g.,

3.5;5.0+PTX;6.0;6.1;7.0;7.5;8.0;8.6

was being compiled as

3.5;5.0+PTX;6.0+PTX;6.1+PTX;7.0+PTX;7.5+PTX;8.0+PTX;8.6+PTX

and so, in other words, each arch listed after the +PTX flag is being compiled twice...

I agree with 1/9th if we move it to the end, that makes sense to me.

@rgommers commented Jun 8, 2022

Oh, so adding PTX is only like adding one additional arch?

Exactly. There's no magic reshuffling of that list going on; every specifier separated by ; denotes a separate compile for the given architecture.
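
Counting the targets in the current list makes the 1/9th figure concrete (an illustrative breakdown, not taken from build logs):

    # "3.5;5.0+PTX;6.0;6.1;7.0;7.5;8.0;8.6" -> 9 compile targets:
    #   SASS: sm_35, sm_50, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86
    #   PTX:  compute_50 (from the single "+PTX" suffix)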

I agree with 1/9th if we move it to the end, that makes sense to me.

That really doesn't change anything, and seems like a gratuitous deviation from upstream, so I suggest to just not touch it.

(I am quite puzzled why they still have it upstream?!)

Removal has been discussed, in particular because one issue is JIT startup time the first time you do import torch on a GPU type which relies on PTX (~30 min). However, there are valid use cases for it.

One real-world example: say you are a visual effects studio and have a large software stack, e.g. a rendering pipeline for 3D animation, of which some PyTorch models are one small piece. Then you work on a game or a movie; at the start you do quality assurance (QA) for your whole software stack. Now you are two years in, and new GPU types have become the standard. What you want to do is plug those in, and have them "just work" (you are relying on there being no bugs in CUDA's PTX support, but nothing else - and that turns out to be acceptable). If you instead pull in a new release from the pytorch or conda-forge channel or rebuild PyTorch binaries by yourself, you will have to redo your QA process, which can be hugely expensive.

@isuruf (Member) commented Jun 8, 2022

That really doesn't change anything, and seems like a gratuitous deviation from upstream, so I suggest to just not touch it.

It does change something. The PTX can be optimized for compute capability 8.6+ devices when 8.6+PTX is given, whereas 5.0+PTX generates PTX valid for compute capability 5.0+ devices and may miss optimizations.

@rgommers commented Jun 8, 2022

Thanks for the correction - I did some digging through the CUDA docs, and it indeed seems to be recommended to include PTX only for the newest architecture.

Since this hasn't been tested upstream, a one-off test with export CUDA_FORCE_PTX_JIT=1 to make sure that the 8.6+ PTX is all good will be useful when making this change.
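
Such a one-off test could look like this (a minimal sketch; CUDA_FORCE_PTX_JIT is the CUDA driver's documented switch to ignore embedded SASS and JIT-compile the shipped PTX instead):

    # force the driver to JIT from PTX, then run a trivial kernel
    export CUDA_FORCE_PTX_JIT=1
    python -c "import torch; x = torch.ones(4, device='cuda'); print((x * 2).sum().item())"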

@hmaarrfk hmaarrfk reopened this Jun 8, 2022
@hmaarrfk hmaarrfk added the question Further information is requested label Jun 8, 2022
@hmaarrfk (Contributor) commented Jun 8, 2022

Since this hasn't been tested upstream, a one-off test with export CUDA_FORCE_PTX_JIT=1 to make sure that the 8.6+ PTX is all good will be useful when making this change.

Do you want to run this test at runtime on a machine with a real GPU?

@rgommers commented Jun 8, 2022

Yes, if only once - and it can even be post-release, because no user will run into it straight away. But it seems healthy to do. I have a GPU, so I can volunteer to run the tests against a new conda-forge package once it's built.

@hmaarrfk (Contributor) commented Jun 8, 2022

Can you review the changes in #116?
I'm going to start a rebuild now.

@ngam (Contributor, Author) commented Jun 8, 2022

The one last thing I would inquire about here is 3.5 vs 3.7. Should we match upstream, replacing 3.5 with 3.7, or should we just let it go for the time being? Do people have strong opinions?

@ngam (Contributor, Author) commented Jun 8, 2022

One real-world example: say you are a visual effects studio and have a large software stack [...] If you instead pull in a new release from the pytorch or conda-forge channel or rebuild PyTorch binaries by yourself, you will have to redo your QA process, which can be hugely expensive.

Wow, I never thought this would be an application. Makes sense. I guess it would also apply to people using pytorch in like a deployment setting (e.g. an app or inference model or something). My point of view is often limited to research settings and so discussing things with people like you helps a lot! Thanks!

@mikemhenry commented Jun 8, 2022

pytorch upstream just tacks PTX on the end of the arch list https://github.com/pytorch/pytorch/blob/88f4a1240275205f630dbaca97b1f5d7977f25f8/torch/utils/cpp_extension.py#L1751

@rgommers commented Jun 8, 2022

pytorch upstream just tacks PTX on the end of the arch list https://github.com/pytorch/pytorch/blob/88f4a1240275205f630dbaca97b1f5d7977f25f8/torch/utils/cpp_extension.py#L1751

That's for dynamic compilation IIRC; for package builds it's quite hard to figure out what the up-to-date scripts are, but I believe it's still these: https://github.com/pytorch/builder/search?q=TORCH_CUDA_ARCH_LIST&type=code

mikemhenry added a commit to regro-cf-autotick-bot/pytorch_sparse-feedstock that referenced this issue Jun 8, 2022