Assertion `!(srcMmaLayout && dstMmaLayout) && "Unexpected mma -> mma layout conversion"' failed. #2729
Comments
Same problem!
Not yet.
Same problem!
Same issue here. Using the latest version of vLLM, it says V100 is not supported. Have you found a workaround for this problem?
But when setting `prefix_pos <= 15`, it runs.
Not yet.
Since the block size is 16, vLLM won't cache the prefix if `prefix_pos <= 15`.
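To make that concrete, here is a minimal sketch of the block-alignment rule described above (assumed logic for illustration, not the actual vLLM source):

```python
# Hypothetical illustration: the cached prefix length is rounded down to a
# multiple of the KV-cache block size, so prefix_pos <= 15 caches zero
# blocks and the failing Triton kernel is never reached.
BLOCK_SIZE = 16  # vLLM's default KV-cache block size

def cached_prefix_len(prefix_pos: int, block_size: int = BLOCK_SIZE) -> int:
    return (prefix_pos // block_size) * block_size

print(cached_prefix_len(15))  # 0  -> no prefix cached, kernel skipped
print(cached_prefix_len(16))  # 16 -> prefix cached, context_attention_fwd runs
```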
I am using an A10 GPU. Upgrading the Triton version from 2.1.0 to 2.2.0 solved my problem. After reading this issue (https://github.com/openai/triton/issues/1298), I found that Triton has already removed this assertion in the newest version.
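If it helps, a quick way to confirm which Triton version your vLLM environment actually picks up (the upgrade command in the comment is an assumption; adjust to your setup):

```python
# Check the installed Triton version; the fix above needs >= 2.2.0.
# Upgrade (assumed command): pip install --upgrade "triton>=2.2.0"
import triton

print(triton.__version__)
```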
Did not work for me.
Same issue on V100. Any update on supporting V100?
Same problem with V100. Is there a way to rely on the paged attention kernel instead of `context_attention_fwd`, @caoshiyi? That might be a solution; otherwise, see triton-lang/triton#1420 (comment).
As suggested by Jokeren, storing the temporary values to global memory and then reloading them, using the latest Triton version, works on V100.
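A minimal sketch of that store-and-reload pattern (illustrative only, not the actual vLLM patch; buffer names, shapes, and where the spill goes inside `context_attention_fwd` are assumptions):

```python
# Instead of converting an intermediate tile between two MMA layouts in
# registers (which trips the assertion on V100), write it to a global
# scratch buffer and read it back.
import torch
import triton
import triton.language as tl

@triton.jit
def roundtrip_kernel(x_ptr, scratch_ptr, out_ptr, n, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    # Spill the intermediate values to global memory ...
    tl.store(scratch_ptr + offs, x, mask=mask)
    # ... and reload them, breaking the register-to-register conversion.
    y = tl.load(scratch_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, y, mask=mask)

x = torch.randn(1024, device="cuda")
scratch = torch.empty_like(x)
out = torch.empty_like(x)
roundtrip_kernel[(triton.cdiv(1024, 256),)](x, scratch, out, 1024, BLOCK=256)
```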
Same error on V100.
Same error on V100.

```python
# Reconstructed from the flattened snippet in this comment; model_name and
# prompt are defined elsewhere by the commenter, and llm is assumed to be a
# vllm.LLM instance (the generate call is a reconstruction).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(inputs)
print(outputs[0].outputs[0].text)
```
Which version will work?
Similar problem with V100: Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed. (python==3.10)
This problem arises from https://github.com/triton-lang/triton/pull/2627/files. vLLM implements a forward kernel in `vllm.model_executor.layers.triton_kernel.prefix_prefill` that triggers this assertion.
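For intuition, here is a hedged sketch (not the actual `prefix_prefill` kernel; names, shapes, and addressing are invented) of the attention-style pattern that forces an MMA-to-MMA layout conversion: the output of one `tl.dot` feeds a second `tl.dot`. Whether this exact snippet reproduces the crash depends on GPU and Triton version.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def two_dot_kernel(q_ptr, k_ptr, v_ptr, o_ptr, BLOCK: tl.constexpr):
    # Load BLOCK x BLOCK tiles (row-major, illustrative addressing).
    idx = tl.arange(0, BLOCK)[:, None] * BLOCK + tl.arange(0, BLOCK)[None, :]
    q = tl.load(q_ptr + idx)
    k = tl.load(k_ptr + idx)
    v = tl.load(v_ptr + idx)
    qk = tl.dot(q, k)           # result is held in an MMA layout
    p = qk.to(tl.float16)
    out = tl.dot(p, v)          # consuming it in a second dot needs a layout
    tl.store(o_ptr + idx, out)  # conversion, which old Triton asserted on
                                # for pre-Ampere GPUs such as V100

B = 32
q = torch.randn((B, B), device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
o = torch.empty((B, B), device="cuda", dtype=torch.float32)
two_dot_kernel[(1,)](q, k, v, o, BLOCK=B)
```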
Same problem with NVIDIA V100.
Adding the following flag worked for me:
There is a similar issue: #6723, also on V100.
Same issue here on a Tesla V100 32GB.
When executing the script `examples/offline_inference_with_prefix.py`, it calls `context_attention_fwd` from `vllm.model_executor.layers.triton_kernel.prefix_prefill`, which triggers the error in the title. Platform: V100.
Related to #1669.
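For context, a minimal repro sketch in the spirit of that example script (the `prefix_pos` argument reflects the experimental prefix-caching API vLLM exposed at the time of this report; treat the exact signature and model choice as assumptions):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # illustrative model choice
prefix = "You are an expert school principal. Draft an answer: "
prompts = [prefix + "What makes a good teacher?"]

# Number of prompt tokens shared as the cached prefix (assumed API of the era).
prefix_pos = len(llm.get_tokenizer()(prefix).input_ids)

outputs = llm.generate(prompts, SamplingParams(temperature=0.0),
                       prefix_pos=[prefix_pos] * len(prompts))
print(outputs[0].outputs[0].text)
```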