[Speculative decoding] [Help wanted] [Performance] Optimize draft-model speculative decoding #4630

cadedaniel · 2024-05-06T17:16:50Z

Proposal to improve performance

Now that the end-to-end correctness tests have merged in #3951, we will optimize the implementation to get a ~50% speedup on a 70B model at temperature 1.0.

Work required:

P0/P1 -- priority
(Small/Medium/Large) -- relative size estimate

Optimizations for proposal time
- P0 (Large) Reduce draft model control-plane communication from O(num_steps) to O(1)
- P0 (Medium) Support draft model on different tensor-parallel-size than target model (#4632)
Optimizations for scoring time
- P0 (Medium) Re-enable bonus tokens to increase % accepted tokens (#4212)
- P1 (Large) Replace CPU-based batch expansion with multi-query attention kernel call
- P1 (Medium) Automate speculative decoding (#4565)
Optimizations for both proposal and scoring time
- P0 (Medium) Decouple sampling serialization from sampling (#5561, "Speed up autoregressive proposal methods by making sampler CPU serialization optional")
- P1 (Large) Amortize prepare_inputs over multiple forward passes
Optimizations for scheduling time
- P0 (Medium) Profile & optimize block manager V2 (#4536)
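
For readers unfamiliar with the "bonus token" item above, here is a minimal sketch of top-1 draft verification with the standard speculative-sampling accept/reject rule. It is a toy, not vLLM's implementation: `verify_proposal`, `sample`, and the dict-based distributions are hypothetical names chosen for illustration. The idea is that when every draft token is accepted, the target model has already scored the position after the last draft token, so one extra token can be emitted for free.

```python
import random


def sample(dist, rng):
    """Draw one token from a dict mapping token -> non-negative weight."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights)[0]


def verify_proposal(draft_tokens, draft_probs, target_probs, rng, enable_bonus=True):
    """Accept or reject k draft tokens against the target model's distributions.

    draft_probs[i] and target_probs[i] map token -> probability at position i.
    target_probs has k + 1 entries; the last one scores the position after
    the final draft token and is only used for the bonus token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i][tok]  # > 0, since the draft model sampled tok
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        out.append(sample(residual, rng))
        return out
    if enable_bonus:
        # All k draft tokens accepted: emit one more token from the target.
        out.append(sample(target_probs[len(draft_tokens)], rng))
    return out
```

With bonus tokens disabled, a fully accepted proposal of k tokens yields k tokens per target forward pass; with them enabled it yields k + 1, which is why re-enabling them raises the number of tokens emitted per step.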

FAQ

What should the target configuration be for 50% speedup?

In the Anyscale fork we saw a 50% speedup on bs=8 with a 68m-sized draft model on TP1/70B target model on TP8 and a 7B draft model on TP(1|8)/70B target model on TP8. This was with the optimizations listed above as "P0".

Note we can do much better than this, with multi-query scoring (P1), GQA for target model scoring, and a dynamic speculation policy. This is just the starting point!

Why not implement Medusa / tree-attention?

We should implement this! The work here will lay the foundation for future improvements in speculative decoding. For example, Eagle uses the Medusa approach (fine-tuned heads plus tree attention) and even claims to beat Medusa. But for Eagle to work well in vLLM we need to optimize the sampler as listed above.

The north star should be: configurable tree size (top-k .. top-1), which uses multi-query attention for scoring (no batch expansion). This issue is about optimizing vLLM in the top-1 speculation case to get 50% speedup with draft models.
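
To make the difference between batch expansion and multi-query scoring concrete, here is a toy sketch of the two input layouts. The function names and list-of-token-ids representation are assumptions for illustration only and do not mirror vLLM's internal data structures.

```python
def batch_expand(prefix, proposal):
    """Batch expansion: one single-query row per scored position.

    For k proposed tokens this yields k + 1 rows; the last token of each
    row is the query whose next-token distribution the target model scores.
    """
    return [prefix + proposal[:i] for i in range(len(proposal) + 1)]


def multi_query(prefix, proposal):
    """Multi-query scoring: one row, with the last k + 1 positions queried
    together in a single attention call."""
    tokens = prefix + proposal
    query_positions = list(range(len(prefix) - 1, len(tokens)))
    return tokens, query_positions


if __name__ == "__main__":
    prefix, proposal = [1, 2, 3], [7, 8]
    print(batch_expand(prefix, proposal))
    print(multi_query(prefix, proposal))
```

Both layouts score the same k + 1 positions. Batch expansion multiplies the number of rows the scheduler and input preparation must handle, whereas the multi-query layout keeps one row per sequence and relies on a kernel that can attend from several query positions at once.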

youkaichao · 2024-05-06T17:29:27Z

Support draft model on different tensor-parallel-size than target model

This should be doable. Just need to figure out the UX change of how users use it.

Do spec workers and non-spec workers share process/device? e.g. when we have tp=8 in current code, and want to add another tp=2 for spec decoding, do we want tp=2 to be another 2 processes, or from the subset of the tp=8 processes?

cadedaniel · 2024-05-06T18:16:31Z

See the code linked here @youkaichao : #4632. The spec worker and non-spec workers share the same process.

KexinFeng · 2024-05-07T16:01:50Z

Regarding tree-attention/Medusa/Eagle, one of the core pieces of the implementation will be a tree attention mask in flash attention, which is not ready yet. I'd like to draw your attention to Dao-AILab/flash-attention#924. It would be great if anyone would like to contribute to it.

sighingnow · 2024-05-09T09:07:17Z

In the Anyscale fork we saw a 50% speedup on bs=8 with a 68m-sized draft model on TP1/70B target model on TP8 and a 7B draft model on TP(1|8)/70B target model on TP8. This was with the optimizations listed above as "P0".

Hi @cadedaniel, I have tried current main branch to evaluate the acceleration of speculative decoding, but encountered the following assertion error:

vllm/vllm/executor/ray_gpu_executor.py

Lines 28 to 32 in 190bc83

    
    class RayGPUExecutor(DistributedGPUExecutor):

        def _init_executor(self) -> None:
            assert (not self.speculative_config
                    ), "Speculative decoding not yet supported for RayGPU backend."

I'm wondering how the 50% speedup is measured. Are there further pending PRs? Also, since the draft model is so small (68m-sized), may I know whether the 50% speedup was measured with greedy sampling or random sampling?

Thanks!

cadedaniel · 2024-05-09T18:42:40Z

Regarding tree-attention/Medusa/Eagle, one of the core pieces of the implementation will be a tree attention mask in flash attention, which is not ready yet. I'd like to draw your attention to Dao-AILab/flash-attention#924. It would be great if anyone would like to contribute to it.

@LiuXiaoxuanPKU has more on this

cadedaniel · 2024-05-09T18:43:19Z

@sighingnow this issue is for getting the 50% speedup. once the P0s are done we will get it with temperature 1.0.

ChuanhongLi · 2024-05-10T01:36:01Z

In the Anyscale fork we saw a 50% speedup on bs=8 with a 68m-sized draft model on TP1/70B target model on TP8 and a 7B draft model on TP(1|8)/70B target model on TP8. This was with the optimizations listed above as "P0".

Hi @cadedaniel, I have tried current main branch to evaluate the acceleration of speculative decoding, but encountered the following assertion error:

vllm/vllm/executor/ray_gpu_executor.py

Lines 28 to 32 in 190bc83

    class RayGPUExecutor(DistributedGPUExecutor):

        def _init_executor(self) -> None:
            assert (not self.speculative_config
                    ), "Speculative decoding not yet supported for RayGPU backend."

I'm wondering how the 50% speedup is measured. Are there further pending PRs? Also, since the draft model is so small (68m-sized), may I know whether the 50% speedup was measured with greedy sampling or random sampling?

Thanks!

I have met the same problem. Is there a solution? By the way, is there any documentation on how to evaluate the acceleration of speculative decoding? Thanks!

sighingnow · 2024-05-10T02:05:43Z

@sighingnow this issue is for getting the 50% speedup. once the P0s are done we will get it with temperature 1.0.

May I know more about the accept rate when we get the 50% speedup? Thanks!

cadedaniel · 2024-05-10T02:16:36Z

May I know more about the accept rate when we get the 50% speedup? Thanks!

On llama2 7b / llama2 70b, the acceptance rate was like 80% (no fine tuning). we trained a 68m draft model at anyscale that gets ~50% acceptance rate. btw you can run acceptance rate experiments today (I will push a PR tomorrow for TP>1 support)

I have met the same problem. Is there a solution? By the way, is there any documentation on how to evaluate the acceleration of speculative decoding? Thanks!

Thanks @ChuanhongLi -- FYI there is no acceleration yet. we'll share documentation once there is a useful speedup.

sighingnow · 2024-05-10T03:05:49Z

On llama2 7b / llama2 70b, the acceptance rate was like 80% (no fine tuning). we trained a 68m draft model at anyscale that gets ~50% acceptance rate. btw you can run acceptance rate experiments today (I will push a PR tomorrow for TP>1 support)

Thanks for the information! Looking forward to the complete speculative decoding support!

ChuanhongLi · 2024-05-10T03:12:11Z

Thanks for the information! Looking forward to the complete speculative decoding support!

Thanks for your reply!

caddfa31434 · 2024-05-11T06:21:51Z

I noticed there's a feature request related to Medusa/Eagle at #4669

Wanglongzhi2001 · 2024-07-03T07:38:02Z

On llama2 7b / llama2 70b, the acceptance rate was like 80% (no fine tuning). we trained a 68m draft model at anyscale that gets ~50% acceptance rate. btw you can run acceptance rate experiments today (I will push a PR tomorrow for TP>1 support)

@cadedaniel May I know how you calculated the acceptance rate? On llama2 7b / llama2 70b, this acceptance rate seems rather high for only a 50% speedup.
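
As a rough way to reason about this question, the sketch below uses the textbook idealized model of speculative decoding. It assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are proposed per step, and that one draft forward pass costs `cost_ratio` times one target forward pass. None of these values are given in this thread, and the acceptance-rate metric quoted above may be defined differently, so treat the numbers as illustrative only.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens emitted per target forward pass.

    Geometric sum 1 + alpha + ... + alpha**k, i.e. accepted draft tokens
    plus the one token the target model always contributes.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, cost_ratio):
    """Idealized speedup over plain decoding, ignoring all other overheads.

    One step costs k draft passes plus one target pass, measured in units
    of a single target forward pass.
    """
    return expected_tokens_per_step(alpha, k) / (k * cost_ratio + 1.0)


if __name__ == "__main__":
    for alpha in (0.5, 0.8):
        print(alpha, expected_tokens_per_step(alpha, k=3),
              estimated_speedup(alpha, k=3, cost_ratio=0.1))
```

The idealized figure is an upper bound: proposal, scoring, sampling, and scheduling overheads of the kind listed in the issue description all reduce the measured speedup, which is one plausible reason a high acceptance rate can coexist with a more modest end-to-end gain.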

sighingnow · 2024-07-07T08:47:51Z

P1 (Large) Replace CPU-based batch expansion with multi-query attention kernel call

Hi @cadedaniel @LiuXiaoxuanPKU, I have pushed a multi-query scorer implementation in #6185. Could you please take a look at it and let me know what you think?

Thanks!

cadedaniel · 2024-08-05T18:04:44Z

Thanks everyone for the help! We hit a 45% latency reduction. Big thanks to @sroy745 @alexm-neuralmagic @comaniac @wooyeonlee0 @zifeitong @LiuXiaoxuanPKU @rkooo567 @ruisearch42 and everyone else who has helped reduce vLLM overheads!

I expect more performance gains once we move the API server outside of the worker; we can re-run evals then.

alexm-neuralmagic · 2024-08-05T18:28:12Z

@cadedaniel thanks for leading this project!

sroy745 · 2024-08-05T19:44:12Z

@cadedaniel Thanks for leading this effort.

cadedaniel added performance Performance-related issues speculative-decoding labels May 6, 2024

richardliaw added the help wanted Extra attention is needed label May 6, 2024

cadedaniel mentioned this issue May 6, 2024

[Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model #4632

Closed

hmellor mentioned this issue May 31, 2024

[Performance]: What can we learn from OctoAI #5167

Closed

LiuXiaoxuanPKU mentioned this issue Jun 4, 2024

[Performance]: Speculative Performance almost same or lower #5239

Open

Dbxwz mentioned this issue Jun 7, 2024

[Feature] [Spec decode]: Combine chunked prefill with speculative decoding #5016

Open

LiuXiaoxuanPKU mentioned this issue Jun 30, 2024

[Feature]: Request for SmartSpec Method Support #5886

Closed

sighingnow mentioned this issue Jul 7, 2024

[Core][Speculative Decoding] Add multi-query verifier for speculative decoding without batch expansion #6185

Open

ShangmingCai mentioned this issue Jul 10, 2024

[Feature]: Multi-Proposers support for speculative decoding. #6300

Open

cadedaniel closed this as completed Aug 5, 2024

cadedaniel mentioned this issue Aug 5, 2024

[WIP] Speculative decoding using a draft model #2188

Closed

kerthcet mentioned this issue Aug 6, 2024

Support speculative decoding InftyAI/llmaz#59

Closed

josephrocca mentioned this issue Sep 11, 2024

[Feature] Speculative Decoding InternLM/lmdeploy#1738

Open
