
[CUTLASS] Conv2d dgrad #10110

Merged: 7 commits merged into apache:main on Feb 1, 2022
Conversation

masahi (Member) commented on Jan 31, 2022

Adds dgrad support. Wgrad is more complicated and I'm having weird accuracy issues, so it will come later.

UPDATE: See the latest result in #10110 (comment)

@comaniac @Laurawly @junrushao1994 @vinx13 @YuchenJin @hwu36 @manishucsd

Old results below, not relevant anymore


Linked below is a benchmark result against cuDNN on resnet50 workloads, with batch sizes 8 and 256. All numbers are in milliseconds, generated on an RTX 3070 by this script.
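
For reference, here is a minimal sketch of how end-to-end numbers like these can be collected with TVM's time_evaluator. It is illustrative only, not the actual benchmark script; `lib` (a module built for CUDA), the input name "data", and the shapes are assumptions.

```python
# Minimal sketch of timing a compiled module with TVM's time_evaluator.
# `lib` (assumed to come from relay.build(..., target="cuda")), the input
# name "data", and the shapes below are assumptions, not this PR's setup.
import numpy as np
import tvm
from tvm.contrib import graph_executor

def measure_e2e(lib, shape=(8, 56, 56, 64), dtype="float16"):
    dev = tvm.device("cuda", 0)
    gmod = graph_executor.GraphModule(lib["default"](dev))
    gmod.set_input("data", np.random.randn(*shape).astype(dtype))
    # time_evaluator reports wall-clock time per run, i.e. kernel time plus
    # any host-side overhead around the kernel launches.
    timer = gmod.module.time_evaluator("run", dev, number=10, repeat=3)
    return timer().mean * 1e3  # milliseconds
```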

It's interesting to note that at batch size = 8, cutlass is mostly faster, while at batch size = 256, cuDNN is faster. Looking at the nvprof dump, it turns out that even when the e2e time, as reported by TVM's time_evaluator, shows cutlass being faster, cuDNN can be winning in the kernel-only time. For example, the first row of the batch-8 case shows cutlass vs cudnn = 54 vs 109 usec, but nvprof shows:

CUDA Kernel Statistics:                                                                        

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name                                                
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    19.5       35,226,252        703     50,108.5     43,199     52,287  _ZN7cutlass6KernelINS_4conv6kernel23ImplicitGemmConvolutionINS1_11threadblock21ImplicitGemmPipeline…
    16.1       28,981,319        603     48,061.9     45,440     52,416  void xmma_cudnn::gemm::kernel<xmma_cudnn::implicit_gemm::dgrad::Kernel_traits<xmma_cudnn::Ampere_hm…

This means more than half of cuDNN's e2e time is spent on overhead, either inside TVM during the cuDNN call or within cuDNN itself: the cuDNN kernel averages about 48 usec in the dump above, while the e2e time is 109 usec, so roughly 60 usec per call is overhead. Apparently, cutlass has much smaller overhead.

cuDNN is mostly faster in the batch-256 case. This could be because the overhead matters less at this size. In particular, the difference is large for stride = 2 cases. For example, on the 5th row, which shows cutlass vs cudnn = 4.18 vs 1.71 msec, nvprof shows

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum                                                    Name                                                
 -------  ---------------  ---------  -------------  -----------  -----------  ----------------------------------------------------------------------------------------------------
    20.1    2,813,627,405        703    4,002,314.9    2,886,772    4,418,645  _ZN7cutlass6KernelINS_4conv6kernel35ImplicitGemmConvolutionStridedDgradINS1_11threadblock22Implicit…
     6.9      964,399,377        603    1,599,335.6    1,575,004    1,722,893  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel<xmma_cudnn::implicit_gemm::strided_dg…
     0.3       37,115,603        603       61,551.6       57,113       69,561  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_3<xmma_cudnn::implicit_g…
     0.0        1,780,581        603        2,952.9        2,783        3,456  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_1<xmma_cudnn::implicit_g…
     0.0        1,489,770        603        2,470.6        2,303        9,631  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_2<xmma_cudnn::implicit_g…
     0.0        1,448,384        603        2,402.0        2,240        2,816  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_0<xmma_cudnn::implicit_g…

which suggests that cuDNN's strided dgrad is significantly better than cutlass (?) @manishucsd

However, even at the larger batch size, cutlass always wins on workloads with filter size 3. For example, here is the nvprof dump for the third row.

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum                                                    Name                                                           
 -------  ---------------  ---------  -------------  -----------  -----------  ----------------------------------------------------------------------------------------------------           
    13.4      760,521,450        703    1,081,822.8      866,120    1,135,883  _ZN7cutlass6KernelINS_4conv6kernel23ImplicitGemmConvolutionINS1_11threadblock21ImplicitGemmPipeline…           
    12.6      717,499,306        603    1,189,882.8    1,127,820    1,333,838  void xmma_cudnn::gemm::kernel<xmma_cudnn::implicit_gemm::dgrad::Kernel_traits<xmma_cudnn::Ampere_hm…           

comaniac (Contributor) left a comment

LGTM. Thanks for the insightful investigation :)
This also illustrates the kind of scenario that motivates Collage, which deals with backend placement (https://arxiv.org/pdf/2111.00655.pdf).

masahi (Member, Author) commented on Jan 31, 2022

I'm having a conversation about strided dgrad performance compared to cuDNN. Will give more updates before merging.

masahi (Member, Author) commented on Feb 1, 2022

HUGE UPDATE: Thanks to a tip from @manishucsd and @hwu36, it turns out that upgrading the CUDA version from 11.3 to 11.6 alone gives a 2x speedup on cutlass strided dgrad (unreal). Moreover, there was a critical bug in the initialization of the parameter beta, which was causing unnecessary memory traffic. That was hurting a lot in the batch-256 case. The result was still correct because the C tensor, which points to the output buffer, was initialized with zeros.
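
To illustrate why the beta bug costs memory traffic without changing the result, here is a minimal NumPy sketch of a GEMM-style epilogue (illustrative only, not the actual CUTLASS epilogue; all names are made up): the epilogue computes D = alpha * (A @ B) + beta * C, so a nonzero beta forces a read of C, while beta = 0 lets the kernel skip that load entirely.

```python
# Minimal NumPy sketch of a GEMM epilogue, illustrating the beta issue.
# Not CUTLASS code; the function and variable names are illustrative.
import numpy as np

def gemm_epilogue(A, B, C, alpha=1.0, beta=0.0):
    acc = A @ B                       # implicit-GEMM accumulator
    if beta == 0.0:
        return alpha * acc            # C is never read: no extra memory traffic
    return alpha * acc + beta * C     # beta != 0 forces a full read of C

A = np.random.randn(128, 64).astype("float32")
B = np.random.randn(64, 128).astype("float32")
C = np.zeros((128, 128), dtype="float32")  # zero-initialized output buffer

# With C zero-initialized the two paths agree, which is why the bug did not
# affect correctness, but the beta != 0 path still pays for reading C.
np.testing.assert_allclose(gemm_epilogue(A, B, C, beta=0.0),
                           gemm_epilogue(A, B, C, beta=1.0), rtol=1e-5)
```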

Here are the updated results after these two fixes:

Now, cutlass wins in ALL but one case at batch size 256, and even that case is only a 0.96 vs 0.94 difference. Note that activation fusion is not enabled for dgrad yet, so I expect the cutlass perf to be much better in practice for DL training use cases.

hwu36 commented on Feb 1, 2022

Real-world training would require fp32 accumulation. In that case, the kernel will be more compute-bound and the better kernel will have a bigger advantage.

masahi (Member, Author) commented on Feb 1, 2022

Merging, I'll follow up with wgrad + parallel split-k support.
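
For readers unfamiliar with the term, here is a minimal NumPy sketch of the split-K idea mentioned above (my own illustration, not CUTLASS or TVM code): the reduction (K) dimension of the GEMM is split across partitions that each compute a partial product, and the partials are summed in a separate reduction step, which parallel split-K performs over a workspace.

```python
# Minimal sketch of split-K: partition the K dimension, compute partial
# GEMMs, then reduce the partials. Illustrative only; not CUTLASS/TVM code.
import numpy as np

M, N, K, splits = 64, 64, 256, 4
A = np.random.randn(M, K).astype("float32")
B = np.random.randn(K, N).astype("float32")

# Each partition handles a contiguous slice of K, like separate thread blocks
# (or kernel launches) would.
chunks = np.array_split(np.arange(K), splits)
partials = [A[:, idx] @ B[idx, :] for idx in chunks]

# Parallel split-K stores the partials in a workspace and reduces them with a
# follow-up kernel; here a plain sum stands in for that reduction.
out = np.sum(partials, axis=0)
np.testing.assert_allclose(out, A @ B, rtol=1e-4)
```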

masahi merged commit a1f51aa into apache:main on Feb 1, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022
* add conv2d transpose nhwc cudnn test

* support conv2d transpose nhwc direct offload to cudnn

* add cutlass dgrad support

* remove unused arg

* allow target none

* fix beta initiaization condition

* disable dynamic dense fp16 test since it fails on cuda 11.6