
Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed. #2729

Open
gty111 opened this issue Feb 2, 2024 · 20 comments

Comments

@gty111
Contributor

gty111 commented Feb 2, 2024

When executing the script examples/offline_inference_with_prefix.py, it calls context_attention_fwd from vllm.model_executor.layers.triton_kernel.prefix_prefill, which triggers the following error:

python: /project/lib/Analysis/Allocation.cpp:40: std::pair<llvm::SmallVector<unsigned int>, llvm::SmallVector<unsigned int> > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed.

Platform:

  • V100
  • CUDA 12.0
  • python 3.11.6
  • vllm 0.3.0+cu120
  • triton 2.1.0
  • torch 2.1.2

Related to #1669.
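
For context, here is a rough, hypothetical sketch of what the example does (paraphrased, not copied from the vLLM 0.3.0 script; the model name and the exact prefix_pos plumbing are assumptions):

```python
from vllm import LLM, SamplingParams

# Minimal, hypothetical version of examples/offline_inference_with_prefix.py (vLLM 0.3.0).
prefix = "You are an expert school principal. Write a short welcome letter."
prompts = [prefix + " Keep it friendly.", prefix + " Keep it formal."]

llm = LLM(model="facebook/opt-125m")  # model choice is an assumption
sampling_params = SamplingParams(temperature=0.0, max_tokens=64)

# prefix_pos marks how many leading tokens the prompts share. Once the shared
# prefix spans at least one full KV-cache block, prefill goes through
# context_attention_fwd and hits the assertion above on V100.
prefix_len = len(llm.llm_engine.tokenizer.encode(prefix))
outputs = llm.generate(prompts, sampling_params, prefix_pos=[prefix_len] * len(prompts))
```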

@eric8607242

Same problem!
Have you addressed this issue?

@gty111
Contributor Author

gty111 commented Mar 8, 2024

Not yet.

@ikushare

Same problem!
Have you addressed this bug?

@ucasAliy

Same issue. Using the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?

@ucasAliy

But when setting prefix_pos <= 15, it runs.

@gty111
Contributor Author

gty111 commented Mar 14, 2024

> Same issue. Using the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?

Not yet.

> But when setting prefix_pos <= 15, it runs.

Since the block size is 16, vLLM won't cache the prefix if prefix_pos <= 15.
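
Put differently, only whole 16-token blocks get cached, so a prefix shorter than one block never reaches the Triton kernel. A tiny illustrative sketch (the block_size value comes from the comment above, the rest is hypothetical):

```python
block_size = 16  # vLLM KV-cache block size mentioned above
for prefix_pos in (15, 16, 31, 32):
    cached_blocks = prefix_pos // block_size  # only complete blocks are cached
    print(prefix_pos, "->", cached_blocks, "cached block(s)")
# 15 -> 0 (prefix not cached, kernel never runs); 16 and 31 -> 1; 32 -> 2
```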

@Qinyu-Xu

I am using an A10 GPU. Upgrading the Triton version from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.
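
A quick sanity check of which Triton build the environment actually uses (a minimal sketch, not part of the original report; the fix above assumes Triton >= 2.2.0):

```python
import torch
import triton

print("triton:", triton.__version__)          # expect >= 2.2.0 for the fix above
print("cuda (torch build):", torch.version.cuda)
```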

@gty111
Contributor Author

gty111 commented Mar 15, 2024

> I am using an A10 GPU. Upgrading the Triton version from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.

Doesn't work for me.

@wenqf11

wenqf11 commented Apr 1, 2024

Same issue on V100. Any update on supporting V100?

@matthieu-zimmer

matthieu-zimmer commented Apr 3, 2024

Same problem with V100. Is there a way to rely on the paged attention kernel instead of context_attention_fwd, @caoshiyi?

Otherwise, triton-lang/triton#1420 (comment) might be a solution.

@matthieu-zimmer

As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.
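
For anyone trying the same change, here is a minimal, hypothetical Triton sketch of the idea: round-tripping the first dot's result through a global-memory scratch buffer instead of feeding one mma-layout value straight into the next dot. It is not vLLM's actual _fwd_kernel, just an illustration of the workaround:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def chained_dot_kernel(a_ptr, b_ptr, c_ptr, tmp_ptr, out_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    idx = offs[:, None] * BLOCK + offs[None, :]  # row-major BLOCK x BLOCK tile

    a = tl.load(a_ptr + idx)
    b = tl.load(b_ptr + idx)
    c = tl.load(c_ptr + idx)

    acc = tl.dot(a, b)  # first mma; result carries an mma layout

    # Workaround: spill to global memory and reload so the second tl.dot does not
    # consume an mma-layout operand directly (the conversion the assertion rejects
    # on pre-Ampere GPUs such as V100).
    tl.store(tmp_ptr + idx, acc)
    acc = tl.load(tmp_ptr + idx)

    out = tl.dot(acc.to(tl.float16), c)  # second mma
    tl.store(out_ptr + idx, out)


BLOCK = 32
a = torch.randn((BLOCK, BLOCK), device="cuda", dtype=torch.float16)
b = torch.randn_like(a)
c = torch.randn_like(a)
tmp = torch.empty((BLOCK, BLOCK), device="cuda", dtype=torch.float32)
out = torch.empty((BLOCK, BLOCK), device="cuda", dtype=torch.float32)
chained_dot_kernel[(1,)](a, b, c, tmp, out, BLOCK=BLOCK)
```

The extra store/load costs bandwidth, but it avoids the mma -> mma layout conversion that Triton cannot perform on Volta.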

@naturomics

Same error on V100.

  • vllm: 0.5.1
  • pytorch: 2.3.0+cu121

@wukonggeo wukonggeo mentioned this issue Jul 10, 2024
@wukonggeo

wukonggeo commented Jul 10, 2024

Same error on V100.

  • Ubuntu 22
  • GPU A100 32G
  • Python 3.10
  • CUDA 12.1
  • vllm 0.5.0.post1
  • Triton 2.3.0

Python code:

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]  # "你好" = "Hello"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)
```
Log:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

@tricky61

> As suggested by Jokeren, storing the temporary values to global memory and then reloading them works on V100 with the latest Triton version.

Which version will work?
vllm==0.4.3 and triton==2.3.0 give the same error.

@ZG2017

ZG2017 commented Jul 18, 2024

Similar problem with V100.

Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

  • python==3.10
  • triton==2.3.0
  • vllm==0.5.1

@MazarineGlacier

MazarineGlacier commented Jul 26, 2024

This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements a fwd kernel in prefix_prefill.py, which triggers this issue. I think we should modify _fwd_kernel in vllm/vllm/attention/ops/prefix_prefill.py.

@geekchen007

Same problem with an NVIDIA V100.

  • vllm 0.5.3
  • nvidia-nccl-cu12 2.20.5
  • outlines 0.0.46
  • torch 2.3.1
  • triton 2.3.1
  • xformers 0.0.27

@grgkovac

grgkovac commented Aug 8, 2024

Adding the following flag worked for me:

vllm serve ...  --enable-chunked-prefill=False

@MazarineGlacier

MazarineGlacier commented Aug 15, 2024

There is a similar issue: #6723

On V100, --enable-chunked-prefill and --enable-prefix-caching might trigger this crash when serving, so these features might not be available on V100.
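
For offline inference, the Python-API equivalent is a sketch along these lines (keyword names mirror the engine arguments used earlier in the thread; the model is just an example):

```python
from vllm import LLM

# Keep both features off on V100 to avoid the mma-layout assertion.
llm = LLM(
    model="THUDM/glm-4-9b-chat",   # example model reused from the report above
    dtype="half",
    enforce_eager=True,
    enable_chunked_prefill=False,
    enable_prefix_caching=False,
)
```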

@K-Mistele
Contributor

Same issue here on a Tesla V100 32GB.
