
Add Flash Decoding #1151

Closed
zfang opened this issue Oct 14, 2023 · 5 comments

@zfang

zfang commented Oct 14, 2023

Feature request

See https://pytorch.org/blog/flash-decoding/#:~:text=Flash%2DDecoding%20works%20in%203,exp%20of%20the%20attention%20values.

Motivation

Flash-Decoding further improves the attention mechanism over FlashAttention v2 for long-context inference.
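
For reference, the linked blog post describes Flash-Decoding in three steps: split the KV cache along the sequence dimension, compute attention and the log-sum-exp for each split in parallel, and then rescale and sum the partial outputs. Below is a minimal plain-PyTorch sketch of that reduction; it is illustrative only, not TGI's or flash-attn's actual kernels, and the function name and tensor layout are assumptions.

```python
import torch

def split_kv_decode_attention(q, k, v, num_splits=4):
    """q: [heads, dim]; k, v: [seq_len, heads, dim]. Returns [heads, dim]."""
    scale = q.shape[-1] ** -0.5
    partial_outs, partial_lses = [], []
    # Step 1: split the KV cache along the sequence dimension.
    for k_chunk, v_chunk in zip(k.chunk(num_splits, dim=0), v.chunk(num_splits, dim=0)):
        # Step 2: per-chunk attention plus its log-sum-exp; chunks are independent.
        scores = torch.einsum("hd,shd->hs", q, k_chunk) * scale   # [heads, chunk]
        partial_lses.append(torch.logsumexp(scores, dim=-1))      # [heads]
        probs = torch.softmax(scores, dim=-1)
        partial_outs.append(torch.einsum("hs,shd->hd", probs, v_chunk))
    # Step 3: rescale each partial output by its share of the global softmax mass.
    lse = torch.stack(partial_lses)                               # [splits, heads]
    weights = torch.softmax(lse, dim=0).unsqueeze(-1)             # [splits, heads, 1]
    return (weights * torch.stack(partial_outs)).sum(dim=0)
```

The merged result matches computing softmax over the full sequence at once, which is why the per-split work can run in parallel even when the batch size is 1.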

Your contribution

None

@ssmi153
Contributor

ssmi153 commented Oct 16, 2023

+1!

It looks like this might be included in FlashAttention v2.2. It's not clear from the blog whether any inference code needs to be changed to see the benefits of this.

@taishan1994

+1

@OlivierDehaene
Member

It's not clear whether it is superior to paged attention: all the benchmarks I've seen are against native Transformers, which we know is not optimised.
The kernel fusion is nice though. I will run some tests and report back on whether we want this or not.

@dongs0104
Contributor

@OlivierDehaene PagedAttention V2 was announced to implement a similar idea, boosting performance when the batch size or the number of attention heads per GPU is small.

@OlivierDehaene
Member

See #1183 instead.
