
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed. #2729

Open
gty111 opened this issue Feb 2, 2024 · 20 comments

Comments

@gty111
Contributor

gty111 commented Feb 2, 2024

When executing the script examples/offline_inference_with_prefix.py, it calls context_attention_fwd from vllm.model_executor.layers.triton_kernel.prefix_prefill, which triggers the following error:

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Platform:

  • V100
  • CUDA 12.0
  • python 3.11.6
  • vllm 0.3.0+cu120
  • triton 2.1.0
  • torch 2.1.2

Related to #1669.
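
For context, here is a rough, hypothetical sketch of what the example does (paraphrased, not copied from the vLLM 0.3.0 script; the model name and the exact prefix_pos plumbing are assumptions):

```python
from vllm import LLM, SamplingParams

# Minimal, hypothetical version of examples/offline_inference_with_prefix.py (vLLM 0.3.0).
prefix = "You are an expert school principal. Write a short welcome letter."
prompts = [prefix + " Keep it friendly.", prefix + " Keep it formal."]

llm = LLM(model="facebook/opt-125m")  # model choice is an assumption
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# prefix_pos marks how many leading tokens the prompts share. Once the shared
# prefix spans at least one full KV-cache block, prefill goes through
# context_attention_fwd and hits the assertion above on V100.
prefix_len = len(llm.llm_engine.tokenizer.encode(prefix))
outputs = llm.generate(prompts, sampling_params, prefix_pos=[prefix_len] * len(prompts))
```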

@eric8607242

Same problem!
Have you addressed this issue?

@gty111
Contributor Author

gty111 commented Mar 8, 2024

Not yet.

@ikushare

Same problem!
Have you addressed this bug?

@ucasAliy

Same issue. Using the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?

@ucasAliy

But when setting prefix_pos <= 15, it runs.

@gty111
Contributor Author

gty111 commented Mar 14, 2024

> Same issue. Using the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?

Not yet.

> But when setting prefix_pos <= 15, it runs.

Since the block size is 16, vLLM won't cache the prefix if prefix_pos <= 15.
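
Put differently, only whole 16-token blocks get cached, so a prefix shorter than one block never reaches the Triton kernel. A tiny illustrative sketch (the block_size value comes from the comment above, the rest is hypothetical):

```python
block_size = 16  # vLLM KV-cache block size mentioned above
for prefix_pos in (15, 16, 31, 32):
    cached_blocks = prefix_pos // block_size  # only complete blocks are cached
    print(prefix_pos, "->", cached_blocks, "cached block(s)")
# 15 -> 0 (prefix not cached, kernel never runs); 16 and 31 -> 1; 32 -> 2
```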

@Qinyu-Xu

I am using an A10 GPU. Upgrading the Triton version from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.
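
A quick sanity check of which Triton build the environment actually uses (a minimal sketch, not part of the original report; the fix above assumes Triton >= 2.2.0):

```python
import torch
import triton

print("triton:", triton.__version__)          # expect >= 2.2.0 for the fix above
print("cuda (torch build):", torch.version.cuda)
```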

@gty111
Contributor Author

gty111 commented Mar 15, 2024

> I am using an A10 GPU. Upgrading the Triton version from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

Doesn't work for me.

@wenqf11

wenqf11 commented Apr 1, 2024

Same issue on V100. Any update on supporting V100?

@matthieu-zimmer

matthieu-zimmer commented Apr 3, 2024

Same problem with V100. Is there a way to rely on the paged attention kernel instead of context_attention_fwd, @caoshiyi?

Otherwise, triton-lang/triton#1420 (comment) might be a solution.

@matthieu-zimmer

As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.
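
For anyone trying the same change, here is a minimal, hypothetical Triton sketch of the idea: round-tripping the first dot's result through a global-memory scratch buffer instead of feeding one mma-layout value straight into the next dot. It is not vLLM's actual _fwd_kernel, just an illustration of the workaround:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def chained_dot_kernel(a_ptr, b_ptr, c_ptr, tmp_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]  # row-major BLOCK x BLOCK tile

    a = tl.load(a_ptr + idx)
    b = tl.load(b_ptr + idx)
    c = tl.load(c_ptr + idx)

    acc = tl.dot(a, b)  # first mma; result carries an mma layout

    # Workaround: spill to global memory and reload so the second tl.dot does not
    # consume an mma-layout operand directly (the conversion the assertion rejects
    # on pre-Ampere GPUs such as V100).
    tl.store(tmp_ptr + idx, acc)
    acc = tl.load(tmp_ptr + idx)

    out = tl.dot(acc.to(tl.float16), c)  # second mma
    tl.store(out_ptr + idx, out)


BLOCK = 32
a = torch.randn((BLOCK, BLOCK), device="cuda", dtype=torch.float16)
b = torch.randn_like(a)
c = torch.randn_like(a)
tmp = torch.empty((BLOCK, BLOCK), device="cuda", dtype=torch.float32)
out = torch.empty((BLOCK, BLOCK), device="cuda", dtype=torch.float32)
chained_dot_kernel[(1,)](a, b, c, tmp, out, BLOCK=BLOCK)
```

The extra store/load costs bandwidth, but it avoids the mma -> mma layout conversion that Triton cannot perform on Volta.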

@naturomics

Same error on V100.

  • vllm: 0.5.1
  • pytorch: 2.3.0+cu121

@wukonggeo wukonggeo mentioned this issue Jul 10, 2024
@wukonggeo

wukonggeo commented Jul 10, 2024

Same error on V100.

  • Ubuntu 22
  • GPU A100 32G
  • Python 3.10
  • CUDA 12.1
  • vllm 0.5.0.post1
  • Triton 2.3.0

Python code:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]  # "你好" = "Hello"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```
Log:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

@tricky61

> As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.

Which version will work?
vllm==0.4.3 and triton==2.3.0 give the same error.

@ZG2017

ZG2017 commented Jul 18, 2024

Similar problem with V100.

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

  • python==3.10
  • triton==2.3.0
  • vllm==0.5.1

@MazarineGlacier

MazarineGlacier commented Jul 26, 2024

This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements a fwd kernel in prefix_prefill.py, which triggers this issue. I think we should modify _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.

@geekchen007

Same problem with an NVIDIA V100.

  • vllm 0.5.3
  • nvidia-nccl-cu12 2.20.5
  • outlines 0.0.46
  • torch 2.3.1
  • triton 2.3.1
  • xformers 0.0.27

@grgkovac

grgkovac commented Aug 8, 2024

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False

@MazarineGlacier

MazarineGlacier commented Aug 15, 2024

There is a similar issue: #6723

On V100, --enable-chunked-prefill and --enable-prefix-caching might trigger this crash when serving, so these features might not be available on V100.
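
For offline inference, the Python-API equivalent is a sketch along these lines (keyword names mirror the engine arguments used earlier in the thread; the model is just an example):

```python
from vllm import LLM

# Keep both features off on V100 to avoid the mma-layout assertion.
llm = LLM(
    model="THUDM/glm-4-9b-chat",   # example model reused from the report above
    dtype="half",
    enforce_eager=True,
    enable_chunked_prefill=False,
    enable_prefix_caching=False,
)
```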

@K-Mistele
Contributor

Same issue here on a Tesla V100 32GB.
