System Info
Ubuntu 22, A100 32 GB GPU, Python 3.10, CUDA 12.1, vLLM 0.5.0.post1
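Note that the log below reports "Cannot use FlashAttention-2 backend for Volta and Turing GPUs", which would mean vLLM detected a pre-Ampere device. A quick sanity check of the compute capability may help confirm which GPU is actually in use:

import torch
# Ampere (e.g. A100) reports (8, 0); Volta (e.g. V100) reports (7, 0); Turing reports (7, 5).
print(torch.cuda.get_device_capability())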
Who can help?
No response
Information
Reproduction
Code:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 131072, 1
model_name = "THUDM/glm-4-9b-chat"
prompt = [{"role": "user", "content": "你好"}]

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    enable_chunked_prefill=True,
    max_num_batched_tokens=8192,
)

# Stop generation at GLM-4's special tokens.
stop_token_ids = [151329, 151336, 151338]
sampling_params = SamplingParams(temperature=0.95, max_tokens=1024, stop_token_ids=stop_token_ids)

# Render the chat messages into a single prompt string.
inputs = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)
outputs = llm.generate(prompts=inputs, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
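For reference, the hardcoded stop ids appear to correspond to GLM-4's <|endoftext|>, <|user|>, and <|observation|> special tokens; assuming that mapping holds, they could also be looked up from the tokenizer instead of hardcoded:

# Assumption: the GLM-4 tokenizer exposes these special tokens by name.
stop_token_ids = [
    tokenizer.convert_tokens_to_ids(t)
    for t in ("<|endoftext|>", "<|user|>", "<|observation|>")
]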
Log info:
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:14 config.py:1222] Casting torch.bfloat16 to torch.float16.
INFO 07-09 17:48:14 config.py:707] Chunked prefill is enabled (EXPERIMENTAL).
INFO 07-09 17:48:14 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='THUDM/glm-4-9b-chat', speculative_config=None, tokenizer='THUDM/glm-4-9b-chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model/glm/glm-4-9b-chat)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 07-09 17:48:15 tokenizer.py:126] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.
INFO 07-09 17:48:15 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:15 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:18 selector.py:131] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 07-09 17:48:18 selector.py:51] Using XFormers backend.
INFO 07-09 17:48:28 model_runner.py:160] Loading model weights took 17.5635 GB
INFO 07-09 17:48:30 gpu_executor.py:83] # GPU blocks: 13067, # CPU blocks: 6553
Processed prompts: 0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
python: /project/lib/Analysis/Allocation.cpp:43: std::pair<llvm::SmallVector, llvm::SmallVector > mlir::triton::getCvtOrder(mlir::Attribute, mlir::Attribute): Assertion `!(srcMmaLayout && dstMmaLayout && !srcMmaLayout.isAmpere()) && "mma -> mma layout conversion is only supported on Ampere"' failed.
Expected behavior
Generation should complete normally. How can this issue be solved?
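A plausible workaround, assuming the assertion comes from the Triton attention kernel behind the experimental chunked-prefill path (the assertion text itself says the mma layout conversion is only supported on Ampere), is to construct the engine without chunked prefill on pre-Ampere hardware. A minimal sketch:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=max_model_len,
    trust_remote_code=True,
    enforce_eager=True,
    dtype='half',
    # enable_chunked_prefill=True,   # disabled: the chunked-prefill Triton kernel
    # max_num_batched_tokens=8192,   # appears to require Ampere or newer
)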