
[CUTLASS] Conv2d dgrad #10110

Merged: 7 commits merged into apache:main on Feb 1, 2022
Conversation

masahi (Member) commented on Jan 31, 2022

Adds dgrad support. Wgrad is more complicated and I'm having weird accuracy issues, so it will come later.

UPDATE: See the latest result in #10110 (comment)

@comaniac @Laurawly @junrushao1994 @vinx13 @YuchenJin @hwu36 @manishucsd

Old results below, not relevant anymore


Linked below is a benchmark result against cuDNN on resnet50 workloads, with batch sizes 8 and 256. All numbers are in milliseconds, generated on an RTX 3070 by this script.
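
For reference, here is a minimal sketch of how end-to-end numbers like these can be collected with TVM's time_evaluator. It is illustrative only, not the actual benchmark script; `lib` (a module built for CUDA), the input name "data", and the shapes are assumptions.

```python
# Minimal sketch of timing a compiled module with TVM's time_evaluator.
# `lib` (assumed to come from relay.build(..., target="cuda")), the input
# name "data", and the shapes below are assumptions, not this PR's setup.
import numpy as np
import tvm
from tvm.contrib import graph_executor

def measure_e2e(lib, shape=(8, 56, 56, 64), dtype="float16"):
    dev = tvm.device("cuda", 0)
    gmod = graph_executor.GraphModule(lib["default"](dev))
    gmod.set_input("data", np.random.randn(*shape).astype(dtype))
    # time_evaluator reports wall-clock time per run, i.e. kernel time plus
    # any host-side overhead around the kernel launches.
    timer = gmod.module.time_evaluator("run", dev, number=10, repeat=3)
    return timer().mean * 1e3  # milliseconds
```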

It's interesting to note that at batch size = 8, cutlass is mostly faster, while at batch size = 256, cuDNN is faster. Looking at the nvprof dump, it turns out that even when the e2e time, as reported by TVM's time_evaluator, shows cutlass being faster, cuDNN can be winning in the kernel-only time. For example, the first row of the batch-8 case shows cutlass vs cudnn = 54 vs 109 usec, but nvprof shows:

CUDA Kernel Statistics:                                                                        

 Time(%)  Total Time (ns)  Instances    Average     Minimum    Maximum                                                   Name                                                
 -------  ---------------  ---------  -----------  ---------  ---------  ----------------------------------------------------------------------------------------------------
    19.5       35,226,252        703     50,108.5     43,199     52,287  _ZN7cutlass6KernelINS_4conv6kernel23ImplicitGemmConvolutionINS1_11threadblock21ImplicitGemmPipeline…
    16.1       28,981,319        603     48,061.9     45,440     52,416  void xmma_cudnn::gemm::kernel<xmma_cudnn::implicit_gemm::dgrad::Kernel_traits<xmma_cudnn::Ampere_hm…

This means more than half of cuDNN's e2e time is spent on overhead, either inside TVM during the cuDNN call or within cuDNN itself: the cuDNN kernel averages about 48 usec in the dump above, while the e2e time is 109 usec, so roughly 60 usec per call is overhead. Apparently, cutlass has much smaller overhead.

cuDNN is mostly faster in the batch-256 case. This could be because the overhead matters less at this size. In particular, the difference is large for stride = 2 cases. For example, on the 5th row, which shows cutlass vs cudnn = 4.18 vs 1.71 msec, nvprof shows

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum                                                    Name                                                
 -------  ---------------  ---------  -------------  -----------  -----------  ----------------------------------------------------------------------------------------------------
    20.1    2,813,627,405        703    4,002,314.9    2,886,772    4,418,645  _ZN7cutlass6KernelINS_4conv6kernel35ImplicitGemmConvolutionStridedDgradINS1_11threadblock22Implicit…
     6.9      964,399,377        603    1,599,335.6    1,575,004    1,722,893  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel<xmma_cudnn::implicit_gemm::strided_dg…
     0.3       37,115,603        603       61,551.6       57,113       69,561  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_3<xmma_cudnn::implicit_g…
     0.0        1,780,581        603        2,952.9        2,783        3,456  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_1<xmma_cudnn::implicit_g…
     0.0        1,489,770        603        2,470.6        2,303        9,631  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_2<xmma_cudnn::implicit_g…
     0.0        1,448,384        603        2,402.0        2,240        2,816  void xmma_cudnn::implicit_gemm::strided_dgrad_indexed::kernel_helper_stage_0<xmma_cudnn::implicit_g…

which suggests that cuDNN's strided dgrad is significantly better than cutlass (?) @manishucsd

However, even at the larger batch size, cutlass always wins on workloads with filter size 3. For example, here is the nvprof dump for the third row.

 Time(%)  Total Time (ns)  Instances     Average       Minimum      Maximum                                                    Name                                                           
 -------  ---------------  ---------  -------------  -----------  -----------  ----------------------------------------------------------------------------------------------------           
    13.4      760,521,450        703    1,081,822.8      866,120    1,135,883  _ZN7cutlass6KernelINS_4conv6kernel23ImplicitGemmConvolutionINS1_11threadblock21ImplicitGemmPipeline…           
    12.6      717,499,306        603    1,189,882.8    1,127,820    1,333,838  void xmma_cudnn::gemm::kernel<xmma_cudnn::implicit_gemm::dgrad::Kernel_traits<xmma_cudnn::Ampere_hm…           

comaniac (Contributor) left a comment

LGTM. Thanks for the insightful investigation :)
This also illustrates the kind of scenario that motivates Collage, which deals with backend placement (https://arxiv.org/pdf/2111.00655.pdf).

masahi (Member, Author) commented on Jan 31, 2022

I'm having a conversation about strided dgrad performance compared to cuDNN. Will give more updates before merging.

masahi (Member, Author) commented on Feb 1, 2022

HUGE UPDATE: Thanks to a tip from @manishucsd and @hwu36, it turns out that upgrading the CUDA version from 11.3 to 11.6 alone gives a 2x speedup on cutlass strided dgrad (unreal). Moreover, there was a critical bug in the initialization of the parameter beta, which was causing unnecessary memory traffic. That was hurting a lot in the batch-256 case. The result was still correct because the C tensor, which points to the output buffer, was initialized with zeros.
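
To illustrate why the beta bug costs memory traffic without changing the result, here is a minimal NumPy sketch of a GEMM-style epilogue (illustrative only, not the actual CUTLASS epilogue; all names are made up): the epilogue computes D = alpha * (A @ B) + beta * C, so a nonzero beta forces a read of C, while beta = 0 lets the kernel skip that load entirely.

```python
# Minimal NumPy sketch of a GEMM epilogue, illustrating the beta issue.
# Not CUTLASS code; the function and variable names are illustrative.
import numpy as np

def gemm_epilogue(A, B, C, alpha=1.0, beta=0.0):
    acc = A @ B                       # implicit-GEMM accumulator
    if beta == 0.0:
        return alpha * acc            # C is never read: no extra memory traffic
    return alpha * acc + beta * C     # beta != 0 forces a full read of C

A = np.random.randn(128, 64).astype("float32")
B = np.random.randn(64, 128).astype("float32")
C = np.zeros((128, 128), dtype="float32")  # zero-initialized output buffer

# With C zero-initialized the two paths agree, which is why the bug did not
# affect correctness, but the beta != 0 path still pays for reading C.
np.testing.assert_allclose(gemm_epilogue(A, B, C, beta=0.0),
                           gemm_epilogue(A, B, C, beta=1.0), rtol=1e-5)
```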

Here are the updated results after these two fixes:

Now, cutlass wins in ALL but one case at batch size 256, and even that case is only a 0.96 vs 0.94 difference. Note that activation fusion is not enabled for dgrad yet, so I expect the cutlass perf to be much better in practice for DL training use cases.

hwu36 commented on Feb 1, 2022

Real-world training would require fp32 accumulation. In that case, the kernel will be more compute-bound and the better kernel will have a bigger advantage.

masahi (Member, Author) commented on Feb 1, 2022

Merging, I'll follow up with wgrad + parallel split-k support.
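
For readers unfamiliar with the term, here is a minimal NumPy sketch of the split-K idea mentioned above (my own illustration, not CUTLASS or TVM code): the reduction (K) dimension of the GEMM is split across partitions that each compute a partial product, and the partials are summed in a separate reduction step, which parallel split-K performs over a workspace.

```python
# Minimal sketch of split-K: partition the K dimension, compute partial
# GEMMs, then reduce the partials. Illustrative only; not CUTLASS/TVM code.
import numpy as np

M, N, K, splits = 64, 64, 256, 4
A = np.random.randn(M, K).astype("float32")
B = np.random.randn(K, N).astype("float32")

# Each partition handles a contiguous slice of K, like separate thread blocks
# (or kernel launches) would.
chunks = np.array_split(np.arange(K), splits)
partials = [A[:, idx] @ B[idx, :] for idx in chunks]

# Parallel split-K stores the partials in a workspace and reduces them with a
# follow-up kernel; here a plain sum stands in for that reduction.
out = np.sum(partials, axis=0)
np.testing.assert_allclose(out, A @ B, rtol=1e-4)
```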

masahi merged commit a1f51aa into apache:main on Feb 1, 2022
ylc pushed a commit to ylc/tvm that referenced this pull request Feb 16, 2022
* add conv2d transpose nhwc cudnn test

* support conv2d transpose nhwc direct offload to cudnn

* add cutlass dgrad support

* remove unused arg

* allow target none

* fix beta initiaization condition

* disable dynamic dense fp16 test since it fails on cuda 11.6