
Switch to (Re)TestItems #262

Merged · 21 commits · Dec 23, 2023
Conversation

@ToucheSir (Member)

The impetus for this PR was twofold:

  1. Make use of the dedicated Julia test action, as mentioned in #260 (use julia-actions/cache in CI).
  2. Migrate off ReTest, which has not seen any activity in two years.

Along the way, I found some additional changes which could either be tackled here or in a follow-up PR:

  • We don't actually test any CUDA code on GPU CI! Buildkite merely runs the CPU test suite end-to-end, which feels wasteful
  • We don't have any tests which involve GPU arrays
  • We weren't testing WideResNet on GHA (fixed)
  • We aren't testing 1.6 outside of doctests

My feeling is that we'd want to set aside a subset of faster tests for 1.6/nightly/GPU CI. Maybe the smallest variant of each model. Then we can decrease our overall runtime while expanding our version matrix to cover everything we probably should've been covering.
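To illustrate the idea of a curated fast subset, ReTestItems supports tagging test items and filtering on those tags at run time. A minimal sketch (the model choice and tag name here are hypothetical, not what the PR actually uses):

```julia
using ReTestItems  # assumed to be in the test environment

# Tag the smallest variant of a model family as :fast so that
# constrained CI jobs (1.6 / nightly / GPU) can run only this subset.
@testitem "ResNet-18 smoke test" tags = [:fast] begin
    using Metalhead

    model = ResNet(18)
    # A single tiny forward pass keeps this testitem cheap.
    @test size(model(rand(Float32, 224, 224, 3, 1))) == (1000, 1)
end

# On the constrained runners, select only the tagged subset, e.g.:
# ReTestItems.runtests("test/"; tags = [:fast])
```

The full matrix would then run everything, while the expanded version/GPU matrix only pays for the `:fast` items.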

PR Checklist

  • Tests are added
  • Documentation, if applicable

Resolved review threads (outdated): test/Project.toml, test/convnet_tests.jl
@darsnack darsnack closed this Dec 12, 2023
@darsnack darsnack reopened this Dec 12, 2023
We can guarantee these test images will always be available, which is not the case for the current sample image.
@ToucheSir (Member, Author)

Still needs a bit of work for GPU CI on 1.6 and nightly (possibly disabling the latter for now), but this is mostly good to go.

Some timings:

| Group | Model | GHA time (PR) | Buildkite time (master) |
| --- | --- | --- | --- |
| AlexNet\|VGG | AlexNet | 47.5s | 47.3s |
| | VGG | 2m56.7s | 6m54.1s |
| GoogLeNet\|SqueezeNet\|MobileNet\|MNASNet | GoogLeNet | 5m03.4s | 9m56.8s |
| | SqueezeNet | 32.3s | 46.9s |
| | MobileNet | 32.4s + 52.1s + 2m32.0s | 5m29.0s + 57.7s + 13.2s + 18.1s |
| | MNASNet | 1m38.3s | |
| EfficientNet | EfficientNet | 8m00.6s + 6m46.4s | 6m51.6s + 4m42.0s |
| ResNet\|WideResNet | ResNet | 16m36.8s | 16m41.3s |
| | WideResNet | 1m49.0s | 2m41.8s |
| ResNeXt | ResNeXt | 8m37.2s | 4m40.0s |
| SEResNet\|SEResNeXt | SEResNet | 7m25.8s | 6m25.1s |
| | SEResNeXt | 2m20.6s | 2m35.5s |
| Res2Net\|Res2NeXt | Res2Net | 8m49.7s | 7m53.6s |
| | Res2NeXt | 34.3s | 23.7s |
| Inception | Inception | 7m31.9s = 2m12.7s + 1m34.7s + 2m26.1s + 1m10.6s | 6m52.9s |
| DenseNet | DenseNet | 8m56.4s | 8m48.7s |
| Unet | Unet | 3m44.7s | 3m21.8s |
| ConvNeXt\|ConvMixer | ConvNeXt | 4m40.1s | 6m06.1s |
| | ConvMixer | 3m43.5s | 4m16.3s |
| MLP-Mixer\|ResMLP\|gMLP | MLP-Mixer | 3m10.9s | 3m39.8s |
| | ResMLP | 1m58.4s | 2m31.7s |
| | gMLP | 1m58.9s | 2m19.5s |
| ViT | ViT | 1m34.4s | 1m02.0s |

It appears we spend a lot of time compiling, as evidenced by the large time savings when similar models run back-to-back. ViTs are an outlier: despite being relatively slow at runtime, they finish quickly here because they use the Vector-backed Chain, which is type-unstable under AD but avoids recompilation. I wonder if we should explore expanding that to more models, or take a serious look at other ideas for reducing compile times, but that's a discussion for another PR.
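For context, the "Vector Chain" presumably refers to Flux's option of backing a `Chain` with a `Vector` of layers rather than a `Tuple`: the compiler then no longer specializes on each exact layer combination, which cuts compile time at the cost of type stability under AD. A minimal sketch, assuming a recent Flux:

```julia
using Flux

# Tuple-backed Chain: fully typed, so Julia specializes (and compiles)
# for this exact combination of layers.
tuple_chain = Chain(Dense(4 => 8, relu), Dense(8 => 2))

# Vector-backed Chain: a single element type, less specialization and
# faster compilation, but type-unstable when differentiated.
vector_chain = Chain([Dense(4 => 8, relu), Dense(8 => 2)])

# Both compute the same kind of forward pass.
x = rand(Float32, 4, 16)
size(tuple_chain(x))   # (2, 16)
size(vector_chain(x))  # (2, 16)
```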

@theabhirath (Member)

> It appears we spend a lot of time compiling, as evidenced by the large time savings when similar models are run one after another. ViTs are an outlier despite their relative runtime slowness because they use the (type unstable under AD) Vector Chain. I wonder if we should explore expanding that to more models or having a serious look into other ideas for reducing compile times, but that's a discussion for another PR.

We explored this during my GSoC, and I noticed that when training, the Vector Chain gave me extremely bumpy loss curves – one of the reasons we removed it between 0.7 and 0.8. I think a lot of it could gradually come back if we do more training runs to isolate the exact problem.

@ToucheSir (Member, Author)

With the renewed interest in #198 (comment), now may be the time to revisit what's causing these mysterious instabilities during training. Shall we continue the discussion there?

Resolved review thread (outdated): test/model_tests.jl
ToucheSir and others added 2 commits December 14, 2023 19:33
Co-authored-by: Kyle Daruwalla <daruwalla.k.public@icloud.com>
@ToucheSir (Member, Author)

Ok, Buildkite is happy and so am I; this should be good to go. We should now have a pretty good picture of what works and what doesn't on GPU too!

@ToucheSir ToucheSir merged commit b2ec1d6 into master Dec 23, 2023
41 of 42 checks passed
@ToucheSir ToucheSir deleted the bc/testitems branch December 23, 2023 01:25