System Info
Ubuntu 22, A100 32 GB GPU, Python 3.10, CUDA 12.1, vLLM 0.5.0.post1
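Note that the log below reports "Cannot use FlashAttention-2 backend for Volta and Turing GPUs", which would mean vLLM detected a pre-Ampere device. A quick sanity check of the compute capability may help confirm which GPU is actually in use:

import torch
# Ampere (e.g. A100) reports (8, 0); Volta (e.g. V100) reports (7, 0); Turing reports (7, 5).
print(torch.cuda.get_device_capability())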
Who can help?
No response
Information
Reproduction
Code:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)

# Stop generation at GLM-4's special tokens.
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

# Render the chat messages into a single prompt string.
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
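For reference, the hardcoded stop ids appear to correspond to GLM-4's <|endoftext|>, <|user|>, and <|observation|> special tokens; assuming that mapping holds, they could also be looked up from the tokenizer instead of hardcoded:

# Assumption: the GLM-4 tokenizer exposes these special tokens by name.
stop_token_ids = [
    tokenizer.convert_tokens_to_ids(t)
    for t in ("<|endoftext|>", "<|user|>", "<|observation|>")
]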
Log info:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
Expected behavior
Generation should complete normally. How can this issue be solved?
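A plausible workaround, assuming the assertion comes from the Triton attention kernel behind the experimental chunked-prefill path (the assertion text itself says the mma layout conversion is only supported on Ampere), is to construct the engine without chunked prefill on pre-Ampere hardware. A minimal sketch:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    # enable_chunked_prefill=True,   # disabled: the chunked-prefill Triton kernel
    # max_num_batched_tokens=8192,   # appears to require Ampere or newer
)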