[Runtime][Contrib] Support cudnn softmax #5214
Conversation
```cpp
// Set mode and shape descriptor
if (axis == ndim - 1) {
  int64_t N = 1;
```
I'm confused why we need `int64_t` here but later cast to `int`.
It's because DLTensor defines its shape as `int64_t`. There will be a cast anyway.
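For context, a minimal sketch of why the cast is unavoidable. The helper name `SetSoftmaxShape` and the omitted error handling are illustrative assumptions, not the PR's actual code; only `DLTensor::shape` being `int64_t*` (dlpack.h) and `cudnnSetTensor4dDescriptor` taking `int` dimensions come from the real headers:

```cpp
#include <cstdint>
#include <cudnn.h>
#include <dlpack/dlpack.h>

// DLTensor (dlpack.h) declares its shape as `int64_t*`, while
// cudnnSetTensor4dDescriptor takes plain `int` dimensions, so the
// product is accumulated in int64_t and cast down at the cuDNN boundary.
void SetSoftmaxShape(const DLTensor* x, int axis, cudnnTensorDescriptor_t desc) {
  int ndim = x->ndim;
  if (axis == ndim - 1) {
    // Flatten all leading axes into N; the softmax runs over the last axis.
    int64_t N = 1;
    for (int i = 0; i < ndim - 1; ++i) N *= x->shape[i];
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               static_cast<int>(N),                   // n
                               static_cast<int>(x->shape[ndim - 1]),  // c
                               /*h=*/1, /*w=*/1);
  }
}
```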
As part of the principle, it would be great if we could look into making the native op as fast.
@tqchen Yes, I understand that. But the latency difference between the TVM schedule and cuDNN can be as large as 10x for an input shape like [100, 1024] on V100. I guess achieving such performance requires fusion across multiple stages of the reduction, which does not seem easy to implement in TIR.
OK, I am not trying to block the PR, merely saying it would be great to have such an investigation :)
@wpan11nv @yongfeng-nv can you suggest possible optimizations that could be done?
We don't know the details, but will look into it.
The CUDA schedule emits 4 kernels, which causes a lot of I/O overhead. Ideally, we could emit a single kernel for small reduction sizes (e.g. reduction dim n <= 1024).
See #5600 for improving softmax with warp shuffle.
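For the record, a minimal CUDA sketch of that single-kernel idea: one warp per row, with the max, sum, and normalize passes all kept in one kernel via warp-shuffle reductions. The kernel name `softmax_warp`, the one-warp-per-row mapping, and the float-only path are illustrative assumptions, not what #5600 actually implements:

```cpp
#include <cuda_runtime.h>
#include <math.h>

// One block per row, one warp (32 threads) per reduction. For small
// reduction dims, all three passes stay in one kernel and in registers,
// avoiding the global-memory round trips of a 4-kernel schedule.
__global__ void softmax_warp(const float* x, float* y, int n) {
  int row = blockIdx.x;
  int lane = threadIdx.x;  // 0..31
  const float* in = x + (long long)row * n;
  float* out = y + (long long)row * n;

  // 1) Row max, reduced across the warp with shuffles.
  float m = -INFINITY;
  for (int i = lane; i < n; i += 32) m = fmaxf(m, in[i]);
  for (int off = 16; off > 0; off >>= 1)
    m = fmaxf(m, __shfl_down_sync(0xffffffff, m, off));
  m = __shfl_sync(0xffffffff, m, 0);  // broadcast lane 0's result

  // 2) Row sum of exp(x - max).
  float s = 0.f;
  for (int i = lane; i < n; i += 32) s += expf(in[i] - m);
  for (int off = 16; off > 0; off >>= 1)
    s += __shfl_down_sync(0xffffffff, s, off);
  s = __shfl_sync(0xffffffff, s, 0);

  // 3) Normalize.
  for (int i = lane; i < n; i += 32) out[i] = expf(in[i] - m) / s;
}

// Launch with one warp per row: softmax_warp<<<num_rows, 32>>>(x, y, n);
```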
Using cuDNN can improve softmax performance on NVIDIA GPUs.
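For reference, the cuDNN entry point being wrapped is `cudnnSoftmaxForward`. A rough sketch of the call, assuming a handle already exists, error checking is omitted, and `desc` was set up as in the descriptor snippet above:

```cpp
#include <cudnn.h>

// Sketch of the underlying cuDNN call (assuming desc describes both the
// input and the output, and x/y are device pointers of matching shape).
void SoftmaxForward(cudnnHandle_t handle, cudnnTensorDescriptor_t desc,
                    const float* x, float* y) {
  const float alpha = 1.0f, beta = 0.0f;
  // CUDNN_SOFTMAX_ACCURATE subtracts the row max before exponentiating,
  // matching the numerically stable softmax that TVM's topi op computes.
  // With an (N, C, 1, 1) descriptor, MODE_INSTANCE normalizes over C,
  // i.e. over the flattened last axis.
  cudnnSoftmaxForward(handle, CUDNN_SOFTMAX_ACCURATE,
                      CUDNN_SOFTMAX_MODE_INSTANCE,
                      &alpha, desc, x, &beta, desc, y);
}
```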
@yzhliu @Laurawly @ZihengJiang