Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

register NNlibCUDA as a subpackage of NNlib #300

Open
CarloLucibello opened this issue Mar 16, 2021 · 30 comments
Open

register NNlibCUDA as a subpackage of NNlib #300

CarloLucibello opened this issue Mar 16, 2021 · 30 comments

Comments

@CarloLucibello
Copy link
Member

No description provided.

@CarloLucibello
Copy link
Member Author

@JuliaRegistrator register() subdir=lib/NNlibCUDA

@JuliaRegistrator
Copy link

Registration pull request created: JuliaRegistries/General/32119

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a NNlibCUDA-v0.1.0 -m "<description of version>" ca82fb23928c7ee7d08afb722718cf93be13f81c
git push origin NNlibCUDA-v0.1.0

@DilumAluthge
Copy link
Member

@JuliaRegistrator register() subdir=lib/NNlibCUDA

@JuliaRegistrator
Copy link

Registration pull request updated: JuliaRegistries/General/32119

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a NNlibCUDA-v0.1.0 -m "<description of version>" fb5377811b3bd37d87a504d5506cc676826b9648
git push origin NNlibCUDA-v0.1.0

@CarloLucibello
Copy link
Member Author

@DilumAluthge there is also failing import NNlibCUDA with CUDA 2.6, while CUDA 3.0.0 not being released yet. I just wanted to start the registration process to make testing against CUDA#master easier, but if that's a problem we can just wait for CUDA 3.0 release

@DilumAluthge
Copy link
Member

Yeah no worries at all! We can merge it manually in 3 days.

@DhairyaLGandhi
Copy link
Member

It might make sense to register https://github.com/FluxML/NNlibCUDA.jl instead. We wouldn't have to deal with registering it oddly and it keeps the projects separate in their own little silos. Currently NNlibCUDA doesn't get used by NNlib internally and developing it means getting unintended code alongside.

@CarloLucibello
Copy link
Member Author

We wouldn't have to deal with registering it oddly

don't see any oddity here, this subpackage thing is actually quite cool and already well supported by julia's toolings

it keeps the projects separate in their own little silos

That's exactly what we don't want since the two packages are tightly coupled. When we change NNlib we want to make sure everything is fine with NNlibCUDA by running its tests, and sometimes we have to make changes that involve both packages (adding some function requiring also a cuda kernel). Having NNlibCUDA as a subpackage is much safer since it allows combined CI very easily (I already put that in place), makes contributing easier, and relieves a lot of the maintenance burden.

Currently NNlibCUDA doesn't get used by NNlib internally and developing it means getting unintended code alongside.

I don't see what's the problem in also seeing the folder lib/NNlibCUDA when developing, if anyone gets confused by that he probably shouldn't be developing NNlib at all. Actually, as I said above, the possibility to make changes to both packages in a single PR is great.

@DhairyaLGandhi
Copy link
Member

Not really, since most of the work in NNlib proper is not focused on CuDNN at all, in fact, the code in the two achieve very different goals. You typically either work on the kernels or the plumbing (NNlibCUDA is a bridge to CUDNN), rarely both. Most new kernels shouldn't be written twice anyway. Waiting around for GPU ci when not changing any GPU code is also a thing here, since most nnlib code isn't meant for GPUs.

if anyone gets confused by that he probably shouldn't be developing NNlib at all.

I don't think that's the message we want to send to contributors...

@ToucheSir
Copy link
Member

ToucheSir commented Mar 16, 2021

I'm a little late to this, but can someone explain why NNlibCUDA exists as a subdir and a separate repo and why there wasn't a consensus made about which one it should be before now? Getting a whiff of some communications breakdown or personal enmity being involved and I don't like it.

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Mar 16, 2021

I needed something to get testing with CUDA#master, Julia 1.6 and share that work easily and quickly. See also https://github.com/FluxML/Flux.jl/tree/dg/cuda16 which gets flux+ cuda + 1.6 usable.

Either way is fine in the long run as long as we get our stack functional :)

@DhairyaLGandhi
Copy link
Member

I like to be able to do ] add NNlibCUDA#branch can we do that with subdirs as well?

@CarloLucibello
Copy link
Member Author

@ToucheSir Dhairya and I disagree on a lot of stuff, which is of course unfortunate for two people working on the same project. There is nothing about "personal" and "enmity" involved here though.

can someone explain why NNlibCUDA exists as a subdir and a separate repo and why there wasn't a consensus made about which one it should be before now?

The first comment pointing about the possibility of using multi-package repos for this stuff was
#224 (comment)
and then a follow-up at the end of that thread.
Then there was discussion in CUDA.jl's PRs/issues and zulip, then there was #286 and here we are.
So discussion was scattered but there were some entry points to it,
no one at any point before now expressed negative feedback against multi-package repo,
so I volunteered a few hours of my time and did this very boring work that needed to be done
in the best way I could think of. I did make some decisions and I could have communicated more or better along the way,
but I think the process was perfectly reasonable by community-driven project standards.

As someone doing a lot of maintenance here, I know very well that having a separate NNlibCUDA.jl would be an additional maintenance hassle (a finite and non-small fraction of which weighing on me), not a big deal but something I'd rather avoid if possible.

I see a few pros in having the multi-repo in one package approach.
The only potential con I see is

I like to be able to do ] add NNlibCUDA#branch can we do that with subdirs as well?

but probably that actually works.

Finally, as Dhairyia said, we can split out a repo at any point, nothing here is irreversible, but the fact that someone already put the things in a functional state and with a nicely integrated CI (so better situation than before) is something that should be exploited.

Most new kernels shouldn't be written twice anyway. Waiting around for GPU ci when not changing any GPU code is also a thing here, since most nnlib code isn't meant for GPUs.

We have a relevant fraction of code (e.g. activation functions) that works on both gpu and cpu and should be tested on both

@GunnarFarneback
Copy link

I like to be able to do ] add NNlibCUDA#branch can we do that with subdirs as well?

Yes.

@DhairyaLGandhi
Copy link
Member

I think setting up ci isn't something new. We also shouldn't have nnlib depending on so many heavy deps like cuda. That makes it very heavy for the rest of the ecosystem to use.

@GunnarFarneback
Copy link

For what it's worth, having a package in a subdir doesn't imply a dependency. Personally I wouldn't place a subdir package inside another package (rather place them as siblings in the repo) since it means that the subdir package code will be distributed with all copies of the main package, but the dependencies of the subdir package still don't affect the main package at all. Well, unless the subdir package is a dependency of the main package, but then you would get exactly the same effect in a separate repository.

@DhairyaLGandhi
Copy link
Member

I don't think the cuda version is a dep, so it's good that those are separate

@CarloLucibello
Copy link
Member Author

I think setting up ci isn't something new. We also shouldn't have nnlib depending on so many heavy deps like cuda. That makes it very heavy for the rest of the ecosystem to use.

As @GunnarFarneback said, NNlib's doesn't depend on anything CUDA-related, NNlibCUDA is its own package (depending on NNlib).

Here we just have the same scaffolding of KernelAbstractions with CUDAKernels and ROCKernels https://github.com/JuliaGPU/KernelAbstractions.jl/tree/master/lib
which seems nicely scalable, since we may soon have NNlibCUDA and NNlibAMDGPU.

@CarloLucibello
Copy link
Member Author

So I´d like to give this package-in-a-subdir a try, there is no problem in pulling out at any moment anyway. Maybe the people from KernelAbstractions @vchuravy @jpsamaroo can tell us whether their experience with this has been positive or not?

@DilumAluthge
Copy link
Member

there is no problem in pulling out at any moment anyway

As I've mentioned elsewhere, moving the location of a subdirectory package is nontrivial. You have to make sure that the new location of the package includes all of the Git tree SHAs that have already been registered in the General registry.

So I would recommend that you first make a definitive decision on which of the two options you want:

  1. Subdirectory package
  2. Package in its own repo

And then register based on that decision, with the knowledge that switching from option 1 to option 2 requires a nontrivial amount of work.

@CarloLucibello
Copy link
Member Author

@DilumAluthge there is something I don´t understand, from the answer to this question, the relocation process seemed quite easy...

@DilumAluthge
Copy link
Member

DilumAluthge commented Apr 10, 2021

Idk, I've never done it myself, but here's one data point: SnoopCompileBot was originally a subdirectory package. At one point, the maintainer wanted to move it to its own repo. The maintainer found the process difficult enough that instead of moving it to its own repo, they just made a new package called CompileBot.

You can probably find that conversation by searching for SnoopCompileBot and CompileBot in PRs to the General registry.

Again, I've never done this personally, but if it was painful enough that one package author gave up and just made a new package, I find that concerning.

I would recommend at least looking over the historical discussion for SnoopCompileBot/CompileBot.

@CarloLucibello
Copy link
Member Author

ok, that seems something to consider. Let´s wait for some feedback from @vchuravy and @jpsamaroo

@jpsamaroo
Copy link

I personally don't think we've had enough time to evaluate the merits of the subpackage approach, since KA 0.6 was only released 11 days ago. I'm personally in favor of the separate package approach, which you can always "promote" to a subpackage later without much difficulty (there are tools to combine git histories of different repos too).

@CarloLucibello
Copy link
Member Author

it seems I'm the only one hyped by this subdir approach, let's move everything to the separate repo https://github.com/FluxML/NNlibCUDA.jl then. I'll make a PR removing NNlibCUDA from here soon

@maxfreu
Copy link
Contributor

maxfreu commented Apr 21, 2021

Oh! Unfortunately I didn't see this discussion in time. Now it's probably too late (hopefully not?), but I am very much in favor of keeping things as a single package with sub-packages. I maybe don't overlook the whole picture, but here are two reasons / examples:

  1. I just ported a C cuda kernel to julia into the NNlibCUDA sub-package. Before this was 1 fork, 1 branch, 1 PR. Now it's two each, which is annoying.
  2. This gets even worse further down the road: Consider we have NNlib, NNlibCUDA, NNlibAMD & NNlibKA. Let's suppose I have written two special GPU kernels for CUDA & AMD and now want to merge them into a single one using kernel abstractions. Maybe I even wrote a KA kernel for CPU, so I can remove the core CPU code from NNlib. So I'll have to make 4 forks, 4 PRs etc and each of the packages has its own version. I fear this will be versioning and maintenance hell. You'd have to merge everything more or less at once and tag new versions & compat entries or the packages become incompatible.

Did I get sth completely wrong?

At the danger of riding a dead horse here, but maybe @timholy can share his experiences first hand? Maybe his reasons for moving it out don't apply here.

@DhairyaLGandhi
Copy link
Member

DhairyaLGandhi commented Apr 21, 2021

As the maintainer of this package and others, I understand these concerns, deal with the consequences and think this was the right move. You see:

  1. setting up the development environment is nontrivial esp over subpackages with different or conflicting needs (thus discouraging contributions) (see Add scatter for CUDA support #296)
  2. The disruptions to NNlib itself over something trivial (New package: NNlibCUDA v0.1.0 JuliaRegistries/General#32119 (comment))
  3. That recent NNlib releases contained GPU dependencies (thus saying the setup is prone to errors)(5c2690b#commitcomment-49262320)
    its fair to say that this kind of change requires some very strong gains of whether it is worth sacrificing the stability of over 250 packages in the ecosystem.

Remeber too that a design that expects an NNlibX necessitates some amount of code/ logic repetition. It's also fair to say that NNlib devs shouldn't be expected to maintain every single spin-off, thus also making the monorepo difficult to scale and QA.

The GPU backend really does achieve different end goals so it makes sense to be separate. In fact, when the time comes, it would be better to remove the CUDA moniker and have it depend on GPUCompiler only, thus obviating the need for an AMD and KA flavour.

The versioning and following proper semver can be handled like we do with so many dependencies in the ecosystem. Yes it would mean working over multiple repos, but the versioning standard isn't any different for sub packages, so it's the same regardless of where the code lives.

Recently, a faulty NNlib was also the reason behind the breakage of the tutorials that involve plotting, which also includes all the SciML ones, so I would want to ensure some amount of stability, which does require us to separate the concerns of different sections of code.

@maxfreu
Copy link
Contributor

maxfreu commented Apr 21, 2021

Ok, thank you for the write-up. I'm still not convinced, but let's see where it takes us :)

@CarloLucibello CarloLucibello changed the title register NNlibCUDA register NNlibCUDA as a subpackage of NNlib Jul 11, 2021
@CarloLucibello
Copy link
Member Author

reopening this after a few months of working with NNlibCUDA as a separate package. In the work for gather/scatter we add to ping-pong between the two libraries multiple times, it has been very annoying, so I encourage people to reconsider the option of having NNlibCUDA as a subpackage. Also see #332.

Moreover, according to https://github.com/FluxML/Flux.jl/pull/1566/files we are going to have FluxCUDA, FluxAMDGPU etc.. as subpackages of Flux, so NNlib should reflect that.

@DhairyaLGandhi
Copy link
Member

No decision has been made on that front. The issue seems to be that there isn't so much a need for the cuda package at all but that the code for gather should have been vetted to work with the Flux counterpart before merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants