correct placement of +PTX flag + correcting cuda arch list #114

Closed · 1 task done · Tracked by #107
ngam opened this issue Jun 2, 2022 · 25 comments · Fixed by #116
Labels
bug Something isn't working · question Further information is requested

Comments

@ngam (Contributor) commented Jun 2, 2022

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

Following (now deleted) comments in #108:

maybe +PTX should be removed, or moved to the end of the arch list we have here.

https://pytorch.org/docs/stable/cpp_extension.html

The +PTX option causes extension kernel binaries to include PTX instructions for the specified CC. PTX is an intermediate representation that allows kernels to runtime-compile for any CC >= the specified CC (for example, 8.6+PTX generates PTX that can runtime-compile for any GPU with CC >= 8.6). This improves your binary’s forward compatibility. However, relying on older PTX to provide forward compat by runtime-compiling for newer CCs can modestly reduce performance on those newer CCs. If you know exact CC(s) of the GPUs you want to target, you’re always better off specifying them individually. For example, if you want your extension to run on 8.0 and 8.6, “8.0+PTX” would work functionally because it includes PTX that can runtime-compile for 8.6, but “8.0 8.6” would be better.
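
For concreteness, here is roughly how such list entries translate into nvcc flags (an illustrative sketch, not the exact flags the build emits):

    # "8.0 8.6"  ->  -gencode arch=compute_80,code=sm_80
    #                -gencode arch=compute_86,code=sm_86               (SASS only)
    # "8.0+PTX"  ->  -gencode arch=compute_80,code=[sm_80,compute_80]  (SASS + PTX)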

Also, currently, this feedstock uses:

    export TORCH_CUDA_ARCH_LIST="3.5;5.0+PTX"

as the base cuda_arch list, but it should actually be "3.7+PTX;5.0" to match upstream. It is debatable whether the PTX is useful here --- see above.

Installed packages

n/a

Environment info

n/a
@ngam ngam added the bug Something isn't working label Jun 2, 2022
@ngam ngam closed this as completed Jun 2, 2022
@ngam ngam reopened this Jun 3, 2022
@ngam ngam changed the title correct placement of +PTX flag correct placement of +PTX flag + correcting cuda arch list Jun 3, 2022
@hmaarrfk (Contributor) commented Jun 4, 2022

It seems that we are only missing 3.7, which is useful for the K80: pytorch/pytorch#24205 (comment)

Anyway, pytorch wants to drop it :/

@ngam (Contributor, Author) commented Jun 4, 2022

Anyway, pytorch wants to drop it :/

I am 100% fine with dropping it! 😜

@hmaarrfk (Contributor) commented Jun 4, 2022

Ok, let's not change the build script until we get a user that specifically asks for it.

@hmaarrfk (Contributor) commented Jun 4, 2022

No, not that I am aware of. I am also fine with waiting (or potentially dropping the older GPUs as pytorch upstream drops them).

I am still confused about the validity and usefulness of PTX here. I have a negative feeling about it tbh, because I am afraid it is stalling the compilation by pushing it to take into account things that are already taken into account (i.e. 5.0+PTX means 5.0 and everything after it, but we are already listing everything after it). I will try to run this separately to see what's going on, but for this PR, we can ignore it :)

I think PTX allows you to combine:

  • an OLD build of software (pytorch, for example)
  • NEW PTX JIT compilation
  • a NEW cuda driver
  • a NEW GPU
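
As an aside, one way to check which targets a given torch binary actually shipped is its public arch-list API (the exact output below is hypothetical):

    # print the compute capabilities baked into the installed binary;
    # 'sm_XX' entries are SASS, 'compute_XX' entries are PTX
    python -c "import torch; print(torch.cuda.get_arch_list())"
    # e.g. ['sm_35', 'sm_50', 'sm_60', ..., 'compute_50']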

@ngam (Contributor, Author) commented Jun 4, 2022

Okay, sounds good to me. Let's close this then for the time being!

@ngam ngam closed this as not planned Jun 4, 2022
@ngam (Contributor, Author) commented Jun 4, 2022

Thanks for engaging productively/constructively and patiently btw, not just with me but with others as well. 🌟

@h-vetinari (Member)

That's a seriously short architecture list, and in fact both of those have been deprecated as of cuda 11. I've had something like

    export TORCH_CUDA_ARCH_LIST="3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6"

in the faiss-feedstock for a long time already (actually dropping 3.5/5.0 for cuda 11+).

PTX is not really great because the runtime compilation can look like a mysterious hang, and it can take a while.

IMO we should pre-compile for the most common GPU architectures, even though that will unfortunately further increase compile times.

@h-vetinari (Member)

Not sure if the links are all up-to-date, but I tried to document (mostly for myself) what I was doing in the build script: https://github.com/conda-forge/faiss-split-feedstock/blob/main/recipe/build-lib.sh

@hmaarrfk (Contributor) commented Jun 5, 2022

This is appended to based on the cuda version:

https://github.com/conda-forge/pytorch-cpu-feedstock/blob/main/recipe/build_pytorch.sh#L86
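
A paraphrased sketch of that logic (the exact branches and arch values live in the linked build_pytorch.sh; the ones below are illustrative):

    if [[ ${cuda_compiler_version} == 11.2* ]]; then
        export TORCH_CUDA_ARCH_LIST="$TORCH_CUDA_ARCH_LIST;6.0;6.1;7.0;7.5;8.0;8.6"
    fi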

I don't think anybody is falling back on PTX. I'm not sure we even ship it by default.

@ngam (Contributor, Author) commented Jun 8, 2022

export TORCH_CUDA_ARCH_LIST="3.5;5.0;5.2;6.0;6.1;7.0;7.5;8.0;8.6"

Yes, that was the point of this issue: to move from

3.5;5.0+PTX;6.0;6.1;7.0;7.5;8.0;8.6

to:

3.7;5.0;6.0;6.1;7.0;7.5;8.0;8.6

i.e. dropping "+PTX" and changing 3.5 to 3.7. As a diff starting at L84 in build.sh:

 if [[ ${cuda_compiler_version} != "None" ]]; then
     export USE_CUDA=1
-    export TORCH_CUDA_ARCH_LIST="3.5;5.0+PTX"
+    export TORCH_CUDA_ARCH_LIST="3.7;5.0"
     if [[ ${cuda_compiler_version} == 9.0* ]]; then
         export TORCH_CUDA_ARCH_LIST="$TORCH_CUDA_ARCH_LIST;6.0;7.0"

Sorry for the miscommunication. I concur with the suggestion that we don't change things unless someone asks; for the time being, I am okay with that too. It is not an urgent issue.

@hmaarrfk (Contributor) commented Jun 8, 2022

I think we should keep the PTX. It ensures that those who buy the newest GPUs still get a working package.

It should not affect those with existing GPUs.

@ngam (Contributor, Author) commented Jun 8, 2022

My concern with PTX is the speed of compilation.

I understand what you're saying about newer GPUs, but I would be blunt and say those fancy new GPUs should probably get the newer pytorch releases.

My vague worry is something along these lines:

  • we currently have: 5.0+PTX
  • this means (according to the docs above) that it will make it work with anything above 5.0, right?
  • but we already list everything we want above 5.0
  • so if we want to keep +PTX, we should move it to the end of the list, e.g. 8.6+PTX (sketched after this list)
  • but even then, what's the +PTX doing anyway?
  • I am afraid it is consuming compilation time for no reason other than decorative/failsafe optionality...
  • (I am quite puzzled why they still have it upstream?!)
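
For reference, moving the flag to the end would make the full list look like this (an illustrative sketch of the proposal, not a committed value):

    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"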

@ngam (Contributor, Author) commented Jun 8, 2022

It's on my list of items to test next time I run/build with singularity on an HPC (I try to keep a low profile...). If I find the compilation time differs, I will come back and reopen this issue...

@isuruf (Member) commented Jun 8, 2022

so if we want to keep +PTX, we should move to the end of each list, e.g. 8.6+PTX

Agreed.

but even then, what's the +PTX doing anyway?

For compute capability greater than 8.6, it will generate PTX code and SASS will be generated at runtime.

My concern with PTX is the speed of compilation.

Why? Generating PTX is only one additional target, and removing it won't magically speed things up. For cuda 11.2, you'd be reducing the total time by only 1/9th.

@ngam (Contributor, Author) commented Jun 8, 2022

Why? Generating PTX is only one additional target, and removing it won't magically speed things up. For cuda 11.2, you'd be reducing the total time by only 1/9th.

Oh, so adding PTX is only like adding one additional arch? I obviously do not know enough. My worry/concern was that, e.g.,

3.5;5.0+PTX;6.0;6.1;7.0;7.5;8.0;8.6

was being compiled as

3.5;5.0+PTX;6.0+PTX;6.1+PTX;7.0+PTX;7.5+PTX;8.0+PTX;8.6+PTX

and so, in other words, each arch listed after the +PTX flag is being compiled twice...

I agree with 1/9th if we move it to the end, that makes sense to me.

@rgommers commented Jun 8, 2022

Oh, so adding PTX is only like adding one additional arch?

Exactly. There's no magic reshuffling of that list going on; every specifier separated by ; denotes a separate compile for the given architecture.
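
Counting the targets in the current list makes the 1/9th figure concrete (an illustrative breakdown, not taken from build logs):

    # "3.5;5.0+PTX;6.0;6.1;7.0;7.5;8.0;8.6" -> 9 compile targets:
    #   SASS: sm_35, sm_50, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86
    #   PTX:  compute_50 (from the single "+PTX" suffix)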

I agree with 1/9th if we move it to the end, that makes sense to me.

That really doesn't change anything, and seems like a gratuitous deviation from upstream, so I suggest to just not touch it.

(I am quite puzzled why they still have it upstream?!)

Removal has been discussed, in particular because one issue is JIT startup time the first time you do import torch on a GPU type which relies on PTX (~30 min). However, there are valid use cases for it.

One real-world example: say you are a visual effects studio and have a large software stack, e.g. a rendering pipeline for 3D animation, of which some PyTorch models are one small piece. Then you work on a game or a movie; at the start you do quality assurance (QA) for your whole software stack. Now you are two years in, and new GPU types have become the standard. What you want to do is plug those in, and have them "just work" (you are relying on there being no bugs in CUDA's PTX support, but nothing else - and that turns out to be acceptable). If you instead pull in a new release from the pytorch or conda-forge channel or rebuild PyTorch binaries by yourself, you will have to redo your QA process, which can be hugely expensive.

@isuruf (Member) commented Jun 8, 2022

That really doesn't change anything, and seems like a gratuitous deviation from upstream, so I suggest to just not touch it.

It does change something. The PTX can be optimized for compute capability 8.6+ devices when 8.6+PTX is given, whereas 5.0+PTX generates PTX valid for compute capability 5.0+ devices and may miss optimizations.

@rgommers commented Jun 8, 2022

Thanks for the correction - I did some digging through the CUDA docs, and it indeed seems to be recommended to include PTX only for the newest architecture.

Since this hasn't been tested upstream, a one-off test with export CUDA_FORCE_PTX_JIT=1 to make sure that the 8.6+ PTX is all good will be useful when making this change.
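
Such a one-off test could look like this (a minimal sketch; CUDA_FORCE_PTX_JIT is the CUDA driver's documented switch to ignore embedded SASS and JIT-compile the shipped PTX instead):

    # force the driver to JIT from PTX, then run a trivial kernel
    export CUDA_FORCE_PTX_JIT=1
    python -c "import torch; x = torch.ones(4, device='cuda'); print((x * 2).sum().item())"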

@hmaarrfk hmaarrfk reopened this Jun 8, 2022
@hmaarrfk hmaarrfk added the question Further information is requested label Jun 8, 2022
@hmaarrfk (Contributor) commented Jun 8, 2022

Since this hasn't been tested upstream, a one-off test with export CUDA_FORCE_PTX_JIT=1 to make sure that the 8.6+ PTX is all good will be useful when making this change.

Do you want to run this test at runtime on a machine with a real GPU?

@rgommers commented Jun 8, 2022

Yes, if only once - and it can even be post-release, because no user will run into it straight away. But it seems healthy to do. I have a GPU, so I can volunteer to run the tests against a new conda-forge package once it's built.

@hmaarrfk (Contributor) commented Jun 8, 2022

Can you review the changes in #116?
I'm going to start a rebuild now.

@ngam (Contributor, Author) commented Jun 8, 2022

The one last thing I would inquire about here is 3.5 vs 3.7. Should we match upstream, replacing 3.5 with 3.7, or should we just let it go for the time being? Do people have strong opinions?

@ngam (Contributor, Author) commented Jun 8, 2022

One real-world example: say you are a visual effects studio and have a large software stack [...] If you instead pull in a new release from the pytorch or conda-forge channel or rebuild PyTorch binaries by yourself, you will have to redo your QA process, which can be hugely expensive.

Wow, I never thought this would be an application. Makes sense. I guess it would also apply to people using pytorch in like a deployment setting (e.g. an app or inference model or something). My point of view is often limited to research settings and so discussing things with people like you helps a lot! Thanks!

@mikemhenry commented Jun 8, 2022

pytorch upstream just tacks PTX on the end of the arch list https://github.com/pytorch/pytorch/blob/88f4a1240275205f630dbaca97b1f5d7977f25f8/torch/utils/cpp_extension.py#L1751

@rgommers commented Jun 8, 2022

pytorch upstream just tacks PTX on the end of the arch list https://github.com/pytorch/pytorch/blob/88f4a1240275205f630dbaca97b1f5d7977f25f8/torch/utils/cpp_extension.py#L1751

That's for dynamic compilation IIRC; for package builds it's quite hard to figure out what the up-to-date scripts are, but I believe it's still these: https://github.com/pytorch/builder/search?q=TORCH_CUDA_ARCH_LIST&type=code

mikemhenry added a commit to regro-cf-autotick-bot/pytorch_sparse-feedstock that referenced this issue Jun 8, 2022