
[OpenCL] Add vectorization to cuda conv2d_nhwc schedule #8636

Merged

Conversation

@echuraev (Contributor) commented Aug 3, 2021

Adding vectorization significantly improved performance. About 6-7x
boost.

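At the TE level, the change boils down to splitting an innermost data-parallel axis by a vector length and marking the inner loop with `vectorize`, so the OpenCL/CUDA backend emits vloadn/float4-style accesses. Below is a minimal sketch of that pattern only: a toy elementwise stage, not the actual conv2d_nhwc diff, and the factor of 4 and all names are illustrative.

```python
import tvm
from tvm import te

# Toy 1-D stage standing in for the conv2d_nhwc output/cache stages.
n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.compute((n,), lambda i: A[i] + 1.0, name="B")

s = te.create_schedule(B.op)

vec = 4  # illustrative vector length
bx, tx = s[B].split(B.op.axis[0], factor=64)
tx, vi = s[B].split(tx, factor=vec)
s[B].bind(bx, te.thread_axis("blockIdx.x"))
s[B].bind(tx, te.thread_axis("threadIdx.x"))
s[B].vectorize(vi)  # the inner loop becomes a vector load/store

print(tvm.lower(s, [A, B], simple_mode=True))
```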

@echuraev force-pushed the echuraev/improve_opencl_conv2d_nhwc_schedule branch from dbbc678 to 5fa169a on August 3, 2021 11:04
@echuraev marked this pull request as ready for review on August 3, 2021 11:05
@echuraev changed the title from "Add vectorization to cuda conv2d_nhwc schedule" to "[OpenCL] Add vectorization to cuda conv2d_nhwc schedule" on Aug 3, 2021
@echuraev force-pushed the echuraev/improve_opencl_conv2d_nhwc_schedule branch from 5fa169a to 1a931a1 on August 4, 2021 09:50
@jcf94 (Contributor) previously requested changes on Aug 5, 2021

@echuraev Thanks!

Another thing: you're testing on an OpenCL device, while the schedule lives in the topi/cuda directory. It would be very nice if you could help move this schedule to the topi/gpu directory.

Inline review comments on python/tvm/topi/cuda/conv2d_nhwc.py (resolved)
@echuraev (Contributor, Author) commented Aug 6, 2021

@jcf94 thank you for your comments. I addressed them and moved conv2d_nhwc to topi/gpu in separate commits to make the review easier. Please take a look once again.

> Another thing: you're testing on an OpenCL device?

Yes, I ran the topology from #8146 on my Samsung Galaxy A71 and got about a 6-7x performance boost.

@echuraev force-pushed the echuraev/improve_opencl_conv2d_nhwc_schedule branch from a44c81e to ae90d88 on August 6, 2021 13:58
@echuraev (Contributor, Author) commented Aug 7, 2021

@jcf94 also, I suppose we should update the statistics in tophub and rename conv2d_nhwc.cuda to conv2d_nhwc.gpu in all files. Am I right?

@tmoreau89 (Contributor) commented:

CC-ing @masahi: would this also benefit Vulkan FP16 codegen? And in general, do we want to maintain TOPI definitions for both CUDA and GPU, or rename it all to GPU to show that it encompasses more than just Nvidia backends? That, I believe, would be the topic of a larger RFC.

@echuraev could you confirm that making those changes to the schedule templates doesn't lead to regressions on the CUDA codegen on Nvidia GPUs? Thank you!

@masahi (Member) commented Aug 24, 2021

Thanks @echuraev, this is really interesting. With the new commit, I was able to generate a packed float16x2 instruction for AMD Vega or newer GPUs, which is supposed to be 2x faster than normal fp32 or fp16 instructions: https://gist.github.com/masahi/2de1a7dc87e2068ffb50ba6135273f95#file-conv2d_nhwc_float16x2-s-L495-L496. I'm going to experiment further.
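
For readers wanting to reproduce this kind of check, one rough way is to build a small fp16 NHWC conv2d for the rocm target and dump the device assembly, then look for packed v_pk_* instructions. This is only a sketch: the shapes are arbitrary, it assumes a ROCm-enabled TVM build, the exact accessor names may vary slightly across TVM versions, and without a tuning log the fallback config may not pick a vectorized variant.

```python
import tvm
from tvm import relay

# Arbitrary fp16 NHWC conv2d workload (shapes are illustrative only).
data = relay.var("data", shape=(1, 56, 56, 64), dtype="float16")
weight = relay.var("weight", shape=(3, 3, 64, 64), dtype="float16")
conv = relay.nn.conv2d(
    data, weight,
    channels=64, kernel_size=(3, 3), padding=(1, 1),
    data_layout="NHWC", kernel_layout="HWIO", out_dtype="float16",
)
mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))

# Requires a ROCm-enabled TVM build; optionally wrap this call in
# autotvm.apply_history_best(<log>) to apply a tuned (vectorized) config.
lib = relay.build(mod, target="rocm")

# The device code lives in the imported ROCm module; "asm" dumps the GCN
# assembly, where packed fp16 math shows up as v_pk_* instructions.
print(lib.get_lib().imported_modules[0].get_source("asm"))
```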

@tmoreau89 (Contributor) commented:

CC @mei-ye on the above comment - investigating whether using vector types can lead to improved FP16 codegen on AMD GPUs.

@echuraev force-pushed the echuraev/improve_opencl_conv2d_nhwc_schedule branch from e19c67a to 256f524 on September 2, 2021 17:12
@echuraev (Contributor, Author) commented Sep 2, 2021

@mbrookhart I wasn't able to reproduce the regression on an NVidia 3070. As a possible solution, I can create a separate conv2d_nhwc schedule for OpenCL. But I saw that my fix for the OpenCL issue with global_work_size also works for the default CUDA schedule: https://github.com/apache/tvm/pull/8636/files#diff-05fdfdcbc0bdf86e1df35950ae34877c2f9dbddab6a99ca630582547d4e7e0faL88-L89

|  | NVidia 3090 (1024 trials per kernel) | NVidia 3070 Mobile (512 trials per kernel) |
| --- | --- | --- |
| Main (AutoTVM) | 0.17 ms | 0.4504 ms |
| Main (Ansor) | 0.18 ms | 0.4284 ms |
| Code from the PR (AutoTVM) | 0.26 ms | 0.4200 ms |
| Code from the PR (Ansor) | 0.19 ms | 0.4200 ms |

@echuraev force-pushed the echuraev/improve_opencl_conv2d_nhwc_schedule branch from 256f524 to 6d9d210 on September 3, 2021 08:32
@echuraev (Contributor, Author) commented Sep 3, 2021

@jcf94, @masahi, @tmoreau89 could you please take another look at this PR?

@tmoreau89 (Contributor) commented:

So it appears @mbrookhart has encountered a regression, and that regression wasn't reproduced by @echuraev on a slightly different GPU. A possible solution, to be extra safe, would be to introduce a new set of TOPI schedules for vectorized GPU operator implementations. WDYT @jwfromm and @mbrookhart?

@echuraev (Contributor, Author) commented Sep 4, 2021

> So it appears @mbrookhart has encountered a regression, and that regression wasn't reproduced by @echuraev on a slightly different GPU. A possible solution, to be extra safe, would be to introduce a new set of TOPI schedules for vectorized GPU operator implementations. WDYT @jwfromm and @mbrookhart?

Sorry, my mistake. I didn't post an update. @mbrookhart re-measured the AutoTVM tuning on the latest code from my branch and got the same numbers as on the main branch. The table can now be updated:

|  | NVidia 3090 (1024 trials per kernel) | NVidia 3070 Mobile (512 trials per kernel) |
| --- | --- | --- |
| Main (AutoTVM) | 0.17 ms | 0.4504 ms |
| Main (Ansor) | 0.18 ms | 0.4284 ms |
| Code from the PR (AutoTVM) | 0.173 ms | 0.4200 ms |
| Code from the PR (Ansor) | 0.19 ms | 0.4200 ms |

So we don't have a regression with the latest code.

@masahi (Member) commented Sep 9, 2021

I think it is good to go as long as there is no concern about a regression. Does the new schedule subsume the old schedule in its search space?

@echuraev (Contributor, Author) commented Sep 9, 2021

> I think it is good to go as long as there is no concern about a regression. Does the new schedule subsume the old schedule in its search space?

Yes, it does. When the vectorization factor is 1, the generated code should be the same as with the old schedule.
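
As a sketch of why a factor of 1 keeps the old schedule inside the new search space, here is a toy AutoTVM template, not the actual conv2d_nhwc code; the knob name and values are made up for illustration.

```python
import tvm
from tvm import te, autotvm

@autotvm.template("demo/vectorized_copy")
def vectorized_copy(n):
    A = te.placeholder((n,), name="A")
    B = te.compute((n,), lambda i: A[i], name="B")
    s = te.create_schedule(B.op)

    cfg = autotvm.get_config()
    # 1 is part of the knob, so the non-vectorized (old) schedule remains
    # one point of the tuning space.
    cfg.define_knob("vec", [1, 2, 4])

    bx, tx = s[B].split(B.op.axis[0], factor=64)
    tx, vi = s[B].split(tx, factor=cfg["vec"].val)
    s[B].bind(bx, te.thread_axis("blockIdx.x"))
    s[B].bind(tx, te.thread_axis("threadIdx.x"))
    # With vec == 1 the split and vectorize effectively become no-ops, so
    # the generated code matches the pre-vectorization schedule.
    s[B].vectorize(vi)
    return s, [A, B]
```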

@masahi (Member) left a comment

cc @jcf94

@areusch removed their request for review on September 14, 2021 15:05
@jwfromm (Contributor) commented Sep 16, 2021

@jcf94 can you please take another look at this? It's currently being blocked by your request for changes.

@echuraev (Contributor, Author) commented:

@jcf94 Could you please take another look?

@jwfromm (Contributor) commented Sep 23, 2021

@tqchen what's the protocol for a case like this, where a PR is blocked by requested changes but the requester is away for a long period of time? Is there a way to override that, or should @echuraev close and reopen the PR?

@tmoreau89 (Contributor) commented:

@jcf94 - please update your PR review based on the latest comments and updates to the PR. I will dismiss your review in 24 hours so we can unblock this PR, given that we haven't heard back from you.

@echuraev force-pushed the echuraev/improve_opencl_conv2d_nhwc_schedule branch from 6d9d210 to 828cdfa on September 23, 2021 16:35
@tmoreau89 dismissed jcf94's stale review on September 26, 2021 21:31

The change request was made on Aug 5th by @jcf94. On the 6th, Egor addressed the changes. We then pinged 23, 17, 10, 6, and 3 days ago, but got no response. As such, in order to unblock this PR, I will dismiss the review.

@tmoreau89 (Contributor) commented:

I've dismissed the review that was blocking progress. In my view, we've provided sufficient pings to notify the reviewer who requested changes. @jwfromm I'll leave it to you to merge the PR when convenient.

@jwfromm merged commit 719d2f6 into apache:main on Sep 30, 2021
@jwfromm (Contributor) commented Sep 30, 2021

Thanks @echuraev, this is now merged.

AndrewZhaoLuo added a commit to AndrewZhaoLuo/tvm that referenced this pull request Sep 30, 2021
* main: (80 commits)
  Introduce centralised name transformation functions (apache#9088)
  [OpenCL] Add vectorization to cuda conv2d_nhwc schedule (apache#8636)
  [6/6] Arm(R) Ethos(TM)-U NPU codegen integration with `tvmc` (apache#8854)
  [microTVM] Add wrapper for creating project using a MLF (apache#9090)
  Fix typo (apache#9156)
  [Hotfix][Testing] Wait for RPCServer to be established (apache#9150)
  Update find cublas so it search default path if needed. (apache#9149)
  [TIR][LowerMatchBuffer] Fix lowering strides when source region has higher dimension than the buffer (apache#9145)
  Fix flaky NMS test by making sure scores are unique (apache#9140)
  [Relay] Merge analysis/context_analysis.cc and transforms/device_annotation.cc (apache#9038)
  [LLVM] Make changes needed for opaque pointers (apache#9138)
  Arm(R) Ethos(TM)-U NPU codegen integration (apache#8849)
  [CI] Split Integration tests out of first phase of pipeline (apache#9128)
  [Meta Schedule][M3b] Runner (apache#9111)
  Fix Google Mock differences between Ubuntu 18.04 and 16.04 (apache#9141)
  [TIR] add loop partition hint pragma (apache#9121)
  fix things (apache#9146)
  [Meta Schedule][M3a] SearchStrategy (apache#9132)
  [Frontend][PyTorch] support for quantized conv_transpose2d op (apache#9133)
  [UnitTest] Parametrized test_conv2d_int8_intrinsics (apache#9143)
  ...
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 7, 2022
* Add vectorization to cuda conv2d_nhwc schedule

Adding vectorization significantly improved performance. About 6-7x
boost.

* Apply comment

* Move schedule to topi/gpu dir

* Add vectorization to inner loop

* Update values of vectorization factor
ylc pushed a commit to ylc/tvm that referenced this pull request Jan 13, 2022 (same commit message as above).