vllm demo bug #314

Closed

wukonggeo opened this issue Jul 9, 2024 · 4 comments
@wukonggeo

System Info

Ubuntu 22, GPU A100 32G, Python 3.10, CUDA 12.1, vllm 0.5.0.post1

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Reproduction

Code:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)
# GLM-4 end-of-turn tokens
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)

print(outputs[0].outputs[0].text)

Log output:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.

Expected behavior

How can I solve this issue?

@wukonggeo (Author)

@zRzRzRzRzRzRzR

zRzRzRzRzRzRzR self-assigned this Jul 9, 2024
@zRzRzRzRzRzRzR (Member)

mma layout conversion is only supported on Ampere
Could this be a dependency installation problem? It is surfacing as a vLLM-level failure rather than a bug in the demo code, so it looks like an error in the dependencies.
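
To rule out a dependency mismatch, one quick check is to print the installed versions of the packages involved; a minimal sketch, where the package list below is only an assumption about which ones matter:

# Minimal sketch (not from this thread): dump the installed versions of the
# packages most likely to be involved; the list itself is an assumption.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("vllm", "torch", "transformers", "triton", "flash-attn"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")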

@wukonggeo (Author) commented Jul 10, 2024

When installing, I just ran pip install -r on
https://github.com/THUDM/GLM-4/blob/main/basic_demo/requirements.txt
with the vllm line uncommented.
Can that still go wrong? Are the versions in requirements.txt mismatched, or is this a V100 GPU problem? There has been quite a bit of similar feedback on the vLLM side as well.

@zRzRzRzRzRzRzR (Member)

We haven't tested on V100, but since flash-attn only supports the Ampere architecture and newer, that may be the problem.
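
Since V100 is a Volta card (compute capability 7.0), one quick check is to query the compute capability directly; a minimal sketch, assuming a single GPU at device index 0:

# Minimal sketch, assuming one CUDA GPU at index 0: V100 reports compute
# capability 7.0, Ampere (A100) reports 8.0, and the FlashAttention-2 backend
# mentioned in the log requires 8.0 or newer.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")
if (major, minor) < (8, 0):
    print("Pre-Ampere GPU: the FlashAttention-2 backend is unavailable, and "
          "Triton kernels needing 'mma -> mma layout conversion' can fail.")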
